link: https://arxiv.org/pdf/2007.00114
What have they written about?
They describe the initial exploration and experimentation performed to develop the FathomNet database, a large database of ocean imagery.
Primarily, the goal was to try out methods for building a localisation-annotated (bounding-box) dataset from the seed data they already had.
The main technical problem they worked on was how a dataset of single-label images could be used to produce box-annotated images suitable for object detection tasks.
The Dataset
Their seed dataset comprised more than 26,000 hours of high-resolution deep-sea video from ROVs. This was stored, annotated, and maintained as part of the Video Annotation and Reference System (VARS) at MBARI.
For initial exploration they selected the 18 most frequently occurring midwater genera and the 17 most frequently occurring benthic genera.
The data consisted of video frames, each given a single label, usually the species most dominantly observed in the frame, though sometimes a less dominant species was used instead. Frames often contained multiple instances, or multiple classes of interest, but still received only one label.
The data consisted of iconic and non-iconic views of organisms.
Iconic view: an image containing a single object of interest, usually zoomed in so that only the class of interest is visible.
Non-iconic view: a more natural image in which multiple objects of interest may be present and the class of interest is not particularly zoomed in on. Much more representative of inference and real deployment settings.
ML Experimentation
They experimented with ML algorithms to speed up their dataset generation and bounding-box annotation process. Three types of models were created: image classification, weakly supervised localisation, and object detection.
Image Classification
Trained a ResNet50 model pretrained on ImageNet, fine-tuned separately for benthic and midwater imagery.
Midwater model: 15 species, ~1,000 images per genus. To extend to multi-label classification, they took the top-3 (or top-N) predicted categories and used those classes to tag an image. Top-1 accuracy was 85.7% and top-3 was 92.9%. Applying this model to actual midwater transect data showed very poor performance, since the model was trained on iconic, zoomed-in shots of species while transect data is full of non-iconic imagery better suited to an object detection task.
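The top-N tagging step above can be sketched as follows; the genus names and scores are illustrative placeholders, not values from the paper:

```python
import numpy as np

def top_k_tags(scores, class_names, k=3):
    """Return the k class names with the highest classifier scores."""
    order = np.argsort(scores)[::-1][:k]  # indices sorted by descending score
    return [class_names[i] for i in order]

# Illustrative softmax outputs for four hypothetical midwater genera.
classes = ["Aegina", "Bathochordaeus", "Poeobius", "Solmissus"]
scores = np.array([0.05, 0.60, 0.25, 0.10])
print(top_k_tags(scores, classes, k=3))
# ['Bathochordaeus', 'Poeobius', 'Solmissus']
```

The single-label classifier is thus reused as a crude multi-label tagger: every class in the top-k becomes a tag, at the cost of some false positives when fewer than k classes are actually present.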
Benthic model: 15 classes, ~33,000 images; top-1 and top-3 accuracies were ~72% and ~93%. The large jump between the two reflects the noisier benthic labels: many images contain multiple classes/concepts but carry only a single label, so the correct class often appears in the top-3 even when it is not the top-1 prediction.
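The top-1 vs top-3 gap can be made concrete with a small sketch of how top-k accuracy is computed from classifier scores; the score matrix and labels here are toy values, not the paper's results:

```python
import numpy as np

def top_k_accuracy(scores, labels, k):
    """Fraction of rows whose true label is among the k highest-scoring classes."""
    topk = np.argsort(scores, axis=1)[:, ::-1][:, :k]
    return float(np.mean([labels[i] in topk[i] for i in range(len(labels))]))

# Toy 3-class scores for 3 images; row 1 ranks the true class second.
scores = np.array([
    [0.70, 0.20, 0.10],
    [0.35, 0.40, 0.25],
    [0.10, 0.20, 0.70],
])
labels = np.array([0, 0, 2])
print(top_k_accuracy(scores, labels, 1))  # 2 of 3 correct
print(top_k_accuracy(scores, labels, 2))  # 3 of 3 correct
```

When labels are noisy or images contain several classes, the true class frequently lands just below the top prediction, which is exactly the pattern seen in the benthic numbers.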
Weakly Supervised Localisation