TYPE Methods
PUBLISHED 02 June 2023
DOI 10.3389/fmars.2023.1171625

Generalised deep learning model for semi-automated length measurement of fish in stereo-BRUVS

Daniel Marrable 1*, Sawitchaya Tippaya 1, Kathryn Barker 1, Euan Harvey 2, Stacy L. Bierwagen 3, Mathew Wyatt 4, Scott Bainbridge 3 and Marcus Stowar 3

1 Curtin Institute for Computation, Curtin University, Perth, WA, Australia, 2 Curtin University, School of Molecular and Life Sciences, Perth, WA, Australia, 3 Australian Institute of Marine Science, Townsville, QLD, Australia, 4 Australian Institute of Marine Science, Indian Ocean Marine Research Centre, The University of Western Australia, Perth, WA, Australia

OPEN ACCESS
Xuemin Cheng, Tsinghua University, China
Jorge Paramo, University of Magdalena, Colombia
Kai Wang, Shanghai Ocean University, China

*CORRESPONDENCE
Daniel Marrable

RECEIVED 22 February 2023
ACCEPTED 16 May 2023

CITATION
Marrable D, Tippaya S, Barker K, Harvey E, Bierwagen SL, Wyatt M, Bainbridge S and Stowar M (2023) Generalised deep learning model for semi-automated length measurement of fish in stereo-BRUVS. Front. Mar. Sci. 10:1171625. doi: 10.3389/fmars.2023.1171625

COPYRIGHT
© 2023 Marrable, Tippaya, Barker, Harvey, Bierwagen, Wyatt, Bainbridge and Stowar. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Assessing the health of fish populations relies on determining the length of fish in sample species subsets, in conjunction with other key ecosystem markers; thereby, inferring overall health of communities. Despite attempts to use artificial intelligence (AI) to measure fish, most measurement remains a manual process, often necessitating fish being removed from the water. Overcoming this limitation and potentially harmful intervention by measuring fish without disturbance in their natural habitat would greatly enhance and expedite the process. Stereo baited remote underwater video systems (stereo-BRUVS) are widely used as a non-invasive, stressless method for manually counting and measuring fish in aquaculture, fisheries and conservation management. However, the application of deep learning (DL) to stereo-BRUVS image processing is showing encouraging progress towards replacing the manual and labour-intensive task of precisely locating the heads and tails of fish with computer-vision-based algorithms. Here, we present a generalised, semi-automated method for measuring the length of fish using DL with near-human accuracy for numerous species of fish. Additionally, we combine the DL method with a highly precise stereo-BRUVS calibration method, which uses calibration cubes to ensure precision within a few millimetres in calculated lengths. In a human versus DL comparison of accuracy, we show that, although DL commonly slightly over-estimates or under-estimates length, with enough repeated measurements, the two values average and converge to the same length, demonstrated by a Pearson correlation coefficient (r) of 0.99 for n=3954 measurements in ‘out-of-sample’ test data. We demonstrate, through the inclusion of visual examples of stereo-BRUVS scenes, the accuracy of this approach. The head-to-tail measurement method presented here builds on, and advances, previously published object detection for stereo-BRUVS. Furthermore, by replacing the manual process of four careful mouse clicks on the screen to precisely locate the head and tail of a fish in two images, with two fast clicks anywhere on that fish in those two images, a significant reduction in image processing and analysis time is expected. By reducing analysis times, more images can be processed; thereby, increasing the amount of data available for environmental reporting and decision making.

KEYWORDS
stereo-BRUVS, deep learning, automated fish length, photogrammetry, machine learning, cameras

1 Introduction

It is estimated that one third of global fish stocks are overfished (Duarte et al., 2020), which impacts the ecosystem services provided by fish (Steneck and Pauly, 2019). Numerous management actions at local, national and international scales will be required to rebuild fish stocks by improving governance, including lowering fishing pressure; implementing harvest controls which limit the types of gear used and the size and number of fish caught; and the use of closed-area management or sanctuaries (MacNeil et al., 2020; Melnychuk et al., 2021). Fishery-dependent information from traps, hook and line, trawls and nets has provided much of the data for monitoring the status of fish populations. With the implementation of closed areas and sanctuaries, there has been an increase in interest in fishery-independent sampling techniques, as many of the conventional sampling techniques are not permissible. Fishery-independent techniques have largely been based on underwater visual census (UVC) (Brock, 1954). Baited remote underwater video systems (BRUVS) (Ellis and DeMartini, 1995; Cappo et al., 2001; Cappo et al., 2003) can collect a relative abundance of data on a range of fish species from numerous habitats and depths (Harvey et al., 2021). While estimates of abundance are an important metric, accurate and reliable information on the length and size of fish within wild populations is more useful (Jennings and Polunin, 1997; Jennings and Kaiser, 1998). This is because it has been shown that fishing and other impacts decrease the mean length, length frequency and biomass of fish populations (Roberts, 1995; McClanahan et al., 1999). For UVC, biomass is calculated from fish length based on visual estimates by SCUBA divers (Wilson et al., 2018), with the standing biomass of fish thought to be a good metric for expressing the health of fish populations (Friedlander and DeMartini, 2002; Seguin et al., 2022). But these estimates have been demonstrated to be neither accurate nor precise, which can affect biomass estimates (Harvey et al., 2002). Stereo video systems are a more accurate and precise technique for non-destructively estimating the lengths of fish (Harvey and Shortis, 1995; Harvey et al., 2001a; Harvey et al., 2010) and have been modified for use by SCUBA divers (Goetze et al., 2019), remotely operated vehicles (ROVs) (Schramm et al., 2020; Jessop et al., 2022; Hellmrich et al., 2023) and BRUVS (Harvey et al., 2007; Langlois et al., 2020; Harvey et al., 2021).

Determining the size and quantity of fish populations in a specific area is crucial to understanding and assessing the health of fish stocks so that informed decisions can be made about sustainable fishing and management practices (Pauly et al., 2002). Fish measurement provides important information in the context of stock assessment by monitoring changes in the size of fish, which gives insight into the impacts of fishing and other factors on the overall health of fish communities and ecosystems.

Automation has the potential to improve the accuracy, efficiency and consistency of fish measurement (e.g. Shortis, 2015; Marrable et al., 2022) to reliably increase the accuracy of stock assessment information that can then be used to support and design improvements to sustainable fishing practices which protect fish populations and ecosystems. Some benefits of using automation include: 1) improved accuracy – automated systems can measure fish more precisely than manual methods, reducing the potential for human error; 2) increased efficiency – automated systems can process large numbers of fish much more quickly than manual methods, reducing the time and effort required for stock assessments; 3) consistent data – automated systems can provide consistent and standardised measurements, reducing the potential for variation due to differences in the way measurements are taken; 4) reduced labour – automated systems can reduce the need for manual labour, freeing up resources for other tasks and potentially reducing costs.

1.1 Traditional approaches to measuring fish

Existing methods that enhance manual measurement by using automation and computer vision have the potential to support fishing operations and ecosystem monitoring; however, these remain inaccessible to most small-scale fisheries due to their associated high cost (Andrialovanirina et al., 2020). Even systems that use remote surveillance monitoring to measure, process and count discarded fish via video record once the vessel has returned to port have shown that the analytical processing time required is equally labour intensive (Needle et al., 2015; French et al., 2019). Such examples provide further justification for the need for computer vision tools to increase the efficiency of monitoring for managing vessel operations. Similar challenges are faced by those conducting research in aquaculture and fish ecology. There is a seemingly exponential trend in the availability of automated fish detection tools for researchers, yet their documented use is still
minimal, with researchers also requiring ways to measure and track fish (Bradley et al., 2019; Lopez-Marcano et al., 2021).

Assessing the health of fish populations depends on determining the average length of fish in sample population subsets and inferring health in conjunction with other key ecosystem markers. Methods applying the length-based measurement of fish for assessing the health of fisheries have been around for decades (Pauly and Morgan, 1987) with few technological advancements until recently. Manual measurement remains the principal tool in collecting essential management information on board fishing vessels. However, this method is documented as highly time consuming and involves considerable, and potentially harmful, handling of fish to gain accurate measurements (Upton and Riley, 2013). Traditionally, evaluating stock levels has relied on manually measuring fish length, as it is frequently the only possibility where monitoring is limited and collecting length measurements is easier than quantifying a total catch (Rudd and Thorson, 2018). However, this method does not consider the fluctuations in fish recruitment and death rates over time, which is crucial for comprehending the indirect impacts of fishing on predator–prey dynamics and for identifying the factors that influence the structure of fish communities on a larger scale (Jennings and Polunin, 1997). Average length is also considered an operational indicator of fishing impact; whereas indices on the composition of species assemblage are difficult to interpret, average length is well understood and reference points can be set (Rochet and Trenkel, 2003). As well as causing impacts on targeted species, commercial fishing affects bycatch, including by-product and discarded/released species; and sometimes habitats, when fishing gear (e.g. demersal trawling) interacts with the sea floor or benthic zones (Little and Hill, 2021). An increasing range of mechanisms and technical tools is being used to reduce interactions with seabirds, marine mammals, reptiles and other vulnerable species. Such bycatch-reduction measures include tori lines, sprayers, and seal and turtle excluder devices (Cresswell et al., 2022). In Australia, as around the world, guidelines and rules on fish measurement methodology and length quotas are enacted and overseen by governments1.

1 recreational-fishing-rules/measuring.

1.2 The move toward automation

Monitoring devices and advances in data processing and analysis techniques can, and should, form part of an effective monitoring approach. However, data or capacity limitation is widespread in global fisheries, resulting in ineffective or non-existent management as a result of this lack of data and/or an inability to generate statistical estimates of stock status. Significant improvements in management outcomes, leading to conservation and livelihood benefits, could be achieved through cost-effective analytical approaches; these exist, but are hampered by a range of challenges, including data availability and requirements; resources for processing and analysis; and a lack of understanding of costs and advantages (Dowling et al., 2016; Cresswell et al., 2022). Deep learning (DL) can address these challenges by replacing the manual, labour-intensive task of precisely locating the heads and tails of fish with computer-vision-based algorithms (e.g. Marrable et al., 2022). White et al. (2006) were the first to test this method with computer vision on a fishing vessel. Measurement using digital imagery is a growing field and has been successfully implemented with both single image (e.g. Lezama-Cervantes et al., 2017; Monkman et al., 2019; Andrialovanirina et al., 2020; Wibisono et al., 2022), and stereo image (e.g. Johansson et al., 2008; Shafait et al., 2017; Suo et al., 2020; Connolly et al., 2021; Lopez-Marcano et al., 2021; Marrable et al., 2022). Datasets now also exist to explicitly support the development of DL algorithms; for instance, segmentation, classification and size estimation (e.g. DeepFish, Garcia-d’Urso et al., 2022).

Automated fish detection has been demonstrated using a range of computer vision methods of measurement targeting single species for aquaculture (Atienza-Vanacloig et al., 2016; Shi et al., 2020; Yang et al., 2021). Some invasive methods of measurement involve channelling fish past stationary cameras (Miranda and Romero, 2017; Shafait et al., 2017), or methods which use active sources of light, such as sonar (Uranga et al., 2017), which are potentially stressful to the fish. Furthermore, removing fish from the water (White et al., 2006) or measurement on board trawlers (Monkman et al., 2019) adds to fish mortality. These challenges highlight the importance of developing automated methods for non-invasive means of measurement, such as BRUVS.

Although there have been advances in using DL for image analysis, video imagery presents additional complexities and requirements, particularly with regard to curated and structured data (e.g. Marrable et al., 2022).

Recent reviews of machine learning in aquaculture found that there is a need for DL and neural networks to optimise current approaches but have also identified certain pitfalls in the process, including noise, occlusions and dynamic viewing spaces (Yang et al., 2021; Zhao et al., 2021).

Stereo baited remote underwater video systems (stereo-BRUVS) are widely, and increasingly, used as a non-invasive, stressless method for counting and measuring fish in aquaculture, fisheries and conservation management (Harvey and Shortis, 1995; Harvey et al., 2021). Recently, Marrable et al. (2022) demonstrated the application of DL to stereo-BRUVS imagery for the semi-automation of fish identification and reported early success with species identification. Extending the application of DL to automate fish length measurement would greatly enhance and advance marine environment monitoring, speeding up data collation on localised fish populations and increasing the amount of data that can be processed and used for environmental reporting and decision making. The current limitation of BRUVS is that the data processing is a highly time-consuming manual exercise that is prone to human error and costly, delaying the production of length data and limiting how much BRUVS imagery can be processed (Connolly et al., 2021; Marrable et al., 2022). However, as with species identification, mean length data is highly valuable for determining frequency distributions of fish populations and the
spatial and temporal changes required for environmental assessment and reporting. In addition to cost and processing time, BRUVS is limited by the MaxN ecological abundance metric (Whitmarsh et al., 2017), creating an opportunity for a much larger use of the data held within a video, such as including fishery-independent assessments of fishing pressure. Recent use of open-source image processing software to measure fisheries catch has also been successful for a wide range of fish sizes (Andrialovanirina et al., 2020).

1.3 A semi-automated and generalised method of length measurement

Here we present a semi-automated and generalised method of measuring the length of fish using DL with near-human accuracy for numerous species of fish across a wide range of habitats. The speed of analysis is therefore much increased, demonstrating progress towards the use of stereo-BRUVS for length measurement in fisheries, aquaculture and marine ecology research applications.

2 Method

In this section, we describe the DL method used for locating the heads and tails of fish, combined with a highly precise stereo-BRUVS calibration method (Shortis, 2015), which makes use of calibration cubes to ensure precision in calculated lengths to the nearest millimetre. Once trained and deployed, this semi-automated approach solves the problem of finding the same fish in both images; that is, the ‘fish correspondence challenge’, with ecologists only having to select the same fish in the left and right images by clicking anywhere on the body, eliminating the need for four very precise clicks on the head and tail in both images. The method is illustrated in Figure 1 and examples of the results in Figure 2.

2.1 Datasets

The fish length measurement data made available for this study (Australian Institute Of Marine Science, 2020) was taken from OzFish stereo-BRUVS imagery along with annotations conducted by fish ecologists using EventMeasure. In order to develop a training dataset for the DL model, the head and tail annotations, which were initially made manually by the ecologists, were extracted by exporting the frame number and pixel location of each annotation in the frame from the data files.

The original OzFish dataset has 37695 measurements inside unique bounding boxes which indicate the location and extent of a fish and include markers which identify its head and tail. Crops from pairs of stereo images were taken from the full images to create head and tail stereo pairs. Small fish, or ones far away in the background, were excluded by filtering out any fish objects smaller than 200 pixels in either height or length. Another filter was applied to exclude fish that had been measured with a root mean square (RMS) value >20 mm. The RMS value is calculated by the photogrammetry library in EventMeasure and is an indicator of how close two corresponding points in each image are to the epipolar line calculated by the opposite point. An RMS value greater than 20 mm is considered by SeaGIS (outlined in the EventMeasure software manual) as an imprecise measurement or error in calibration and, therefore, such measurements were discarded in this study. This reduced the number of images for training to 15558 stereo pairs of cropped fish images.

2.2 Data preparation

The annotated data in OzFish do not include head or tail labels and do not store the annotations in any particular order. There was no consistent order in which the heads and tails were labelled. Head and tail labels are required to train the DL model to classify them. Therefore, a systematic review of the images was conducted to reorder many of the annotations, resulting in a dataset in which two labels, ‘head’ and ‘tail’ in consistent order, were reliably applied to all of the points for training the DL model.

The final step, before training and testing the system, was to split the data between ‘in-sample’ and ‘out-of-sample’ datasets. The videos in OzFish have had the metadata removed before publishing, although the data were given prefix letters in their filenames to indicate they were taken from different deployments and at different locations. Calibration files required for photogrammetry were only published for the images with the prefix A and E. As these calibration files are needed to do a human versus machine comparison, these images were withheld from any training or validation and made up the out-of-sample data used for testing algorithm performance. Images with prefix B and G were not published with calibration files; however, these files were not needed for training the head and tail detection model and these images made up the in-sample training data.

After filtering the data, a total of 13555 stereo pairs of cropped fish images remained with correct head and tail labels. The available data for training and testing amounted to 59 unique families, 153 unique genera and 319 different species. The in-sample data were split 70% (5348 stereo pairs) for training, and 30% (2292 stereo pairs) for validation and hyperparameter tuning. In this study, the calibration file verification process, undertaken to ensure that the ground-truth length in the OzFish dataset and the length calculated using photogrammetry were consistent, resulted in approximately 30% of the out-of-sample data (1761 stereo pairs) being removed. The remaining out-of-sample data comprised 4154 stereo pairs.

2.3 Model training

This study used You Only Look Once (YOLO; Redmon et al., 2016), a type of DL model used in object-detection algorithms. Specifically, the YOLOv5 model, which has been pre-trained on the Common Objects in Context2 (COCO) dataset, was chosen for its
ability to handle various sizes, numbers of classes, and computational requirements. The variant used in this study was the ‘YOLOv5 small’ model. To adapt the model for head and tail detection, transfer learning was employed, which built on knowledge gained from the pre-trained model while reducing the amount of training data and time needed. A subset of the in-sample dataset was used to retrain the model according to the standard procedure outlined on the YOLOv5 website3.

3 Access Date (Nov 22, 2022).

FIGURE 1 Illustrates the workflow for data preparation, model training and model evaluation.

Training the YOLOv5 model requires the extent of each object of interest (heads and tails in this case) to be defined by a bounding box. Therefore, the head and tail points in the training data were converted to bounding boxes by defining a box of 25 × 25 pixels around the head and tail points, respectively. Finally, the in-sample training and validation fish crop images with head and tail labels were used to train the YOLOv5 small model. The early-stopping method was also implemented in this study to avoid overfitting the model.

2.4 Model prediction

The head and tail predictions from the object-detection model were converted to overall fish length by first taking the bounding box predictions from the trained DL model and converting them to points by using the centre location of the box in stereo image pairs. On occasions when the DL model failed to find one or two of either a head or a tail in both images, the location of the missing feature
was estimated by taking the reflection of one of the classifier feature locations in the bounding box of the fish. On occasions when the model returned more than one candidate for a head or tail, the one with the highest confidence score was chosen. On the occasions when predicted head and tail points were inconsistent in both left and right cropped fish images (for example, if head or tail points were swapped), the predicted result was discarded as an incorrect measurement. Once the four required points were returned by the model, the camera calibration files were used along with EventMeasure’s photogrammetry library to calculate the length of the fish.

FIGURE 2 Presents four out-of-sample examples of automated fish length measurements using the method described in this study. The examples present fish of different sizes, habitat and distance from the camera.

2.5 Model evaluation

The out-of-sample dataset was used for evaluating the performance of the model and gives an indication of model generalisability and performance in different domains. Inference for both heads and tails was performed on the 4154 out-of-sample stereo pairs of cropped fish images, and head and tail pixel coordinates were converted to the original scale of stereo-BRUVS imagery. EventMeasure’s stereophotogrammetry tool was used to calculate the length of a fish from the four predicted points of head and tail pairs. Two hundred predictions were removed by the post-processing steps described in the previous section, and the remaining 3954 automated measurements were then compared to the manual measurements made by the fish ecologists. Results are presented in Figures 3, 4.

FIGURE 3 Human versus DL comparison showing how DL and photogrammetry-derived length compares with human and photogrammetry-derived length for the same fish. The Pearson correlation coefficient is 0.99, indicating that, even though DL sometimes overestimates or underestimates the length compared with a manual measurement by an ecologist, with repeat measurements the total length estimates average to be very similar.

2.5.1 Recall, Precision and F1 Score

Summarising model performance for fish head and tail detection in a single metric can be beneficial. One such metric is the F1 score, which is a combination of recall and precision. Recall is the likelihood of detecting all actual positive instances, while precision is the proportion of true positive (TP) predictions out of all positive predictions. False negatives (FN) are the instances the model missed and false positives (FP) are incorrectly predicted results. The F1 score is calculated by taking the harmonic mean of recall and precision.

The recall, precision and F1 score for fish head and tail detection are presented in Table 1.

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{1}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{2}$$

$$F1 = \frac{2 \times \mathrm{Recall} \times \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}} \tag{3}$$

2.5.2 Human–machine comparison

The Pearson correlation coefficient used for the human–machine comparison was calculated by:

$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} \tag{4}$$

Where:

$x_i$ are the individual DL inference length results
$\bar{x}$ is the average DL length
$y_i$ are the individual human annotated length results
$\bar{y}$ is the average human annotated length

3 Results

The following section presents the results of the human–machine comparison by comparing the machine learning and photogrammetry-derived length measurements with the ecologists’ manual measurements (Figure 3); the density of those measurements is compared in Figure 4. The results presented here are calculated from the out-of-sample data. Table 1 shows the DL precision (P), recall (R) and F1 for heads, tails and the combination of both.

FIGURE 4 Histogram of the human versus DL comparison demonstrating the density of the number of length measurements. A higher density of points indicates that the measurements aggregate to close agreement.

4 Discussion

The semi-automated method presented in this paper demonstrates the potential to rapidly increase analysis speed and decrease reporting time for assessing fish biomass. Challenges remain for a completely autonomous solution, some of which are discussed below.

4.1 Semi-automation of length measurement

The challenge of applying this model in real-world scenarios is that the model cannot currently match the fish in the corresponding left and right images. This was not a problem when building and testing the model, as the data were already analysed by experienced ecologists who had matched the stereo image pairs. To address this challenge, the DL model was adapted to communicate with EventMeasure; the DL model requires an ecologist to click anywhere on the body of the fish in both images. Inference on the length is conducted after the ecologist has solved the image correspondence problem by identifying the same fish in each of the left and right images. The fish is then precisely cropped from the stereo-BRUVS image using the DL method described in Marrable et al. (2022), which places a bounding box over the fish, and the crop is then parsed by the head and tail DL model. Without isolating the fish first, the model returns all of the heads and tails of all the fish it finds with no correspondence data to match them. The head and tail locations are returned to EventMeasure, which automatically calculates the length of the fish using its photogrammetry library. This reduces the number of mouse clicks on the screen from four precise clicks (i.e. left head, left tail, right head, right tail) to two. Additionally, placing clicks anywhere on the body is significantly faster and requires much less precision. This semi-automated method of length measurement has the potential to significantly increase analysis speed.

Furthermore, by requiring ecologists to choose the corresponding fish individuals, users can draw on their contextual knowledge to wait for a moment when a fish is in the best pose for measurement and not occluded by other fish, seagrass, the BRUVS bait bag or other objects. This reduces false positive detection. Using context in this way is not currently possible with computer vision alone.
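As a rough illustration of the click-based selection described in Section 4.1, the sketch below maps one click per image to a detected fish bounding box. The paper does not state how a click is resolved to a box, so the `Box` type, the function names and the rule of preferring the smallest box that contains the click are all assumptions made for this example; the fish detector and the head and tail model are not shown.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Box:
    """Axis-aligned fish bounding box in pixel coordinates (hypothetical type)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, x: float, y: float) -> bool:
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max

    def area(self) -> float:
        return (self.x_max - self.x_min) * (self.y_max - self.y_min)


def box_for_click(click: tuple[float, float], boxes: Sequence[Box]) -> Optional[Box]:
    """Return the smallest detected box containing the click, or None.

    Preferring the smallest box is an assumption of this sketch, used to
    break ties when overlapping detections both contain the click.
    """
    x, y = click
    hits = [b for b in boxes if b.contains(x, y)]
    return min(hits, key=Box.area) if hits else None


def select_stereo_pair(left_click, right_click, left_boxes, right_boxes):
    """Pair the boxes chosen by one click in each image of a stereo pair."""
    left = box_for_click(left_click, left_boxes)
    right = box_for_click(right_click, right_boxes)
    if left is None or right is None:
        return None  # a click missed every detection; the user would click again
    return left, right
```

In this sketch the ecologist still decides which fish correspond across the two images; the code only turns each imprecise click into a crop region that could then be passed to the head and tail model.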

TABLE 1 Deep learning precision (P), recall (R) and F1 for classification.

Feature   Images   Labels   P        R        F1
Head      8308     8308     77.50%   70.50%   73.83%
Tail      8308     8308     77.20%   69.50%   73.15%
Both      8308     16616    77.40%   70.00%   73.51%
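As a quick consistency check on Table 1, the F1 column can be recomputed from the reported precision and recall using the harmonic mean described in Section 2.5.1. The short sketch below does this; the function name `f1_score` and the dictionary layout are choices made for this example, not part of the published method.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when nothing was detected
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in Table 1, written as fractions.
table_1 = {
    "Head": (0.775, 0.705),
    "Tail": (0.772, 0.695),
    "Both": (0.774, 0.700),
}

for feature, (p, r) in table_1.items():
    print(f"{feature}: F1 = {100 * f1_score(p, r):.2f}%")
```

The recomputed values (73.83%, 73.15% and 73.51%) agree with the F1 column as published. The image count of 8308 also appears consistent with two images (left and right) for each of the 4154 out-of-sample stereo pairs.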

4.2 Sources of error

The DL model cannot establish correspondence between the head and tail of a given fish and, therefore, the largest source of error is incorrect correspondence; that is, when a head and tail pair are matched to two different fish. This is because the model searches within the bounding box for features that look like heads and tails and returns the match with the highest confidence. This works well when there is only one matching pair; however, there are occasions when there are heads and tails belonging to many fish. The model has no knowledge of correspondence and so matches them based on the highest confidence level, and sometimes pairs them incorrectly. An example of this is seen in Figure 5. This results in either the incorrect length being calculated from the photogrammetry, or the RMS value returned from EventMeasure being >20 mm, so no length is reported. Figure 5A shows an example where two fish tails fall within the bounding box and the model identifies the wrong tail. This false positive is seen most commonly where fish are schooling and swimming between 30° and 45° to the plane of the camera. Angles within this span produce a large bounding box with a greater likelihood that tails from other fish will be captured. One way to reduce this effect is to automate a rotation of the bounding box (Figure 5B), or the image, in sympathy with the orientation of the fish to reduce the empty space in the bounding box. Automating this process remains a challenge, as even establishing that a false positive detection has occurred would require logic and processing beyond the capability of the current model. There are published detection models that use rotated labels (Li et al., 2018) for ship identification in satellite images; but, as yet, YOLOv5 does not have the ability to train using rotated bounding boxes. Addressing these false positive cases remains the subject of ongoing research.

4.3 Stereo calibration

Harvey and Shortis (1998) highlight the importance of precise measurement systems for accurate length. This was also the objective of this approach by using the OzFish dataset for model training and validation. The OzFish data were calibrated using the calibration cube method (Shortis, 2015) which is more accurate and

4.4 Model generalisability

Previously published models capable of automating the length measurement of fish have either used a single camera out of water (Monkman et al., 2019); been limited to a single species (White et al., 2006); or used less accurate stereophotogrammetry calibration methods (Tonachella et al., 2022). The model presented in this study was trained and tested on 319 unique species of fish, making it much more generalisable than any other previously published model. The data used to train this model were restricted to the species in the OzFish dataset, which includes those mostly found along the coast of Western Australia. However, the species richness and diversity provide evidence that the model generalises across different species with varying colour, texture and morphometrics. An effort to separate in-sample and out-of-sample data was made to give some indication of model generalisability by training and testing on data collected at different dates and locations. How well the model works with species outside the OzFish data will be the focus of future work. For applications in marine environments with species not included in the OzFish data, the method described in this study should be repeated with a new training corpus that includes the species that users wish to measure.

4.5 Challenges with data quality

One reason for choosing the OzFish dataset for DL training was that the data were annotated by expert fish ecologists. However, when auditing the DL data there were still errors in the labelling. Some errors included head and tail points that seemed to be systematically shifted a few pixels away from the head and tail of the fish, which may have been caused by incorrect synchronisation of the stereo-BRUVS. There were also some instances where labels were randomly out of place, such as labels placed on a rock.

One issue that continues to be a challenge for computer-vision-based DL is that it is so far incapable of using context in the way fish ecologists do to help them label fish. For example, in the OzFish dataset, where a fish was partially occluded by an object, labels were placed where heads or tails would logically be expected, estimated by ecologists from experience and numerous previous observations of similar fish. When such an example is viewed by a computer-vision
precise than using 2D calibration patterns as reported by Boutros algorithm which, unlike an ecologist, cannot extrapolate from the
et al. (2015) in their comparison study. context, the algorithm may see a label on a rock and interpret that


Example of a false positive detection of a tail leading to an incorrect length measurement; (A) two fish tails fall within the bounding box and the model identifies the wrong tail. (B) the yellow box demonstrates that rotating the object-detection bounding box would eliminate the second tail from the area and correct the false tail label.
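
The effect of rotating the bounding box can be illustrated with a small geometric sketch. It is not part of the published method: it assumes the fish can be approximated by a rectangle of body length `length` and body depth `depth`, and that `angle_deg` is the orientation of the body axis within the image (a simplification of the swimming angle discussed in the text). The function names are hypothetical. Under these assumptions, the fraction of an axis-aligned box that is empty, and therefore free to capture the tail of a neighbouring fish, grows quickly as the fish turns away from horizontal, whereas a box rotated to follow the fish would stay tight.

```python
import math


def axis_aligned_box(length, depth, angle_deg):
    """Return (width, height) of the axis-aligned box that encloses a
    length x depth rectangle rotated by angle_deg in the image plane."""
    t = math.radians(angle_deg)
    c, s = abs(math.cos(t)), abs(math.sin(t))
    return length * c + depth * s, length * s + depth * c


def empty_fraction(length, depth, angle_deg):
    """Fraction of the axis-aligned box area not covered by the rectangle.

    A box rotated to match the rectangle would have an empty fraction of 0
    under this idealised model.
    """
    width, height = axis_aligned_box(length, depth, angle_deg)
    return 1.0 - (length * depth) / (width * height)


if __name__ == "__main__":
    # Illustrative fish four times longer than it is deep (assumed values).
    for angle in (0, 30, 45):
        print(angle, round(empty_fraction(4.0, 1.0, angle), 3))
```

For the assumed 4:1 rectangle, roughly two thirds of the axis-aligned box is empty at 30° to 45°, which is consistent with the observation that such orientations leave more room for tails from other fish.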

rock as a fish head or tail. In such cases, those data must be removed as they would incorrectly train the DL model to detect some rocks as fish heads and tails. Additionally, there were many instances of seemingly very small fish labelled with heads and tails which were very hard to tell apart in static images. However, upon viewing the moving video, swimming behaviours clearly indicated the direction fish were swimming in, which made head and tail identification easy for the human eye. Although there are published DL tracking algorithms (Bertinetto et al., 2016; Hu et al., 2022), YOLO-based methods only consider static images for training or inference. Combining tracking with head and tail detection will be the focus of future work so that numerous length measurements of the same fish can be made to calculate the average size, a method that has been shown to be more statistically robust and less prone to measurement error (Harvey et al., 2001b). Validation experiments of measurements from stereo-BRUVS (Harvey and Shortis, 1995; Harvey et al., 2003; Harvey et al., 2010) have been conducted using three or more repeat measurements of fish. However, this is seldom done when conducting field surveys due to the extra labour required.

4.6 Combining optical and acoustic sampling methods

In recent years, size-spectrum models derived from acoustic surveys have emerged as essential tools for fish stock assessment and ecosystem-based fisheries management. Acoustic surveys possess the advantage of rapidly and efficiently covering vast spatial scales. However, stationary video platforms, such as stereo-BRUVS, are constrained by a limited field of view and can only monitor a small area around the camera. Acoustic surveys also face challenges, including difficulties in discriminating between fish species and detecting fish close to the seabed or within dense schools.

Size and shape information of fish targets is extracted from echo data by adjusting model parameters, such as growth rates, mortality rates, and species-specific traits, to match observed data (Edwards et al., 2017; Froese et al., 2019). Calibration and validation of these models often necessitate biological samples, which are invasive due to the physical capture and potential harm to fish during the process. Assessing fishery resources in reef ecosystems, where obtaining biological samples is sometimes prohibited, remains challenging. To address these limitations, optic-acoustic methods combine video footage and acoustic measurements (Ryan and Kloser, 2016; Demer et al., 2020). Underwater cameras or video systems, mounted on a research vessel, towed platform, or remotely operated vehicle (ROV), capture images or footage of fish, providing high-resolution information on size, shape, colour, and behaviour, which aids in species identification and refining size distribution estimates without the need for biological samples.

The automated length measurement of fish in stereo-videos using the method described in this study could be integrated with the optic-acoustic approach to capitalise on the strengths of both methods. Combining acoustic surveys with stereo-BRUVS, such as the preliminary work by Landero-Figueroa et al. (2016), or other sampling techniques can help overcome the limitations of each method and provide more accurate and comprehensive information on fish populations for stock assessment and ecosystem-based fisheries management. This non-invasive approach enables continuous monitoring of fish populations without harming the organisms or their habitats, offering a promising alternative for sustainable fishery management.

5 Conclusion

The semi-automated length measurement method presented here builds on and advances previously published DL-based fish detection from stereo-BRUVS imagery (Marrable et al., 2022). This new method combines that fish detection approach to isolate and crop individual fish from a busy scene with a new DL model for detecting the head and tail and applying photogrammetry to determine fish length measurements.

Although not completely autonomous, the machine-assisted, semi-automated labelling approach both solves the object correspondence challenge and allows for expert contextual knowledge to choose which fish (and in which pose) are sent for analysis using DL. This is expected to significantly reduce labour and analysis time by reducing the manual process of precisely locating the head and tail of the fish in both images, which requires four carefully placed mouse clicks on the screen, to two fast clicks anywhere on a fish, while still using expert knowledge to truth and validate the result. By accelerating stereo-BRUVS analysis, more imagery can be processed, thereby increasing the amount of data available for environmental reporting and decision making.

Data availability statement

Publicly available datasets were analyzed in this study. This data can be found here:

Author contributions

DM, MW, ST, and SB contributed to the development of the study design. DM, KB, ST, EH, MW, MS, and SLB contributed to the writing of the manuscript. All authors contributed to the article and approved the submitted version.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

