
Article
Animal Detection and Classification from Camera Trap Images
Using Different Mainstream Object Detection Architectures
Mengyu Tan 1,† , Wentao Chao 2,† , Jo-Ku Cheng 3 , Mo Zhou 3 , Yiwen Ma 1 , Xinyi Jiang 3 , Jianping Ge 1 , Lian Yu 3, *
and Limin Feng 1, *

1 Ministry of Education Key Laboratory for Biodiversity Science and Engineering, National Forestry and
Grassland Administration Key Laboratory for Conservation Ecology of Northeast Tiger and Leopard National
Park, Northeast Tiger and Leopard Biodiversity National Observation and Research Station, National Forestry
and Grassland Administration Amur Tiger and Amur Leopard Monitoring and Research Center,
College of Life Sciences, Beijing Normal University, Beijing 100875, China
2 School of Artificial Intelligence, Beijing Normal University, Beijing 100875, China
3 School of Mathematical Sciences, Beijing Normal University, Beijing 100875, China
* Correspondence: yulian@bnu.edu.cn (L.Y.); fenglimin@bnu.edu.cn (L.F.);
Tel.: +86-10-6220-7746 (L.Y.); +86-186-0039-9715 (L.F.)
† These authors contributed equally to this work.

Simple Summary: The imagery captured by cameras provides important information for wildlife research and conservation. Deep learning technology can assist ecologists in automatically identifying and processing imagery captured from camera traps, improving research capabilities and efficiency. Currently, many general deep learning architectures have been proposed, but few have been evaluated for their applicability in real camera trap scenarios. Our study constructed the Northeast Tiger and Leopard National Park wildlife dataset (NTLNP dataset) for the first time and compared the real-world application performance of three currently mainstream object detection models. We hope this study provides a reference on the applicability of AI techniques in wild real-life scenarios and truly helps ecologists to conduct wildlife conservation, management, and research more effectively.

Abstract: Camera traps are widely used in wildlife surveys and biodiversity monitoring. Depending on their triggering mechanisms, a large number of images or videos are sometimes accumulated. Some literature has proposed the application of deep learning techniques to automatically identify wildlife in camera trap imagery, which can significantly reduce manual work and speed up analysis processes. However, there are few studies validating and comparing the applicability of different models for object detection in real field monitoring scenarios. In this study, we first constructed a wildlife image dataset of the Northeast Tiger and Leopard National Park (NTLNP dataset). Furthermore, we evaluated the recognition performance of three currently mainstream object detection architectures and compared the performance of training models on day and night data separately versus together. In this experiment, we selected YOLOv5 series models (anchor-based one-stage), Cascade R-CNN under the feature extractor HRNet32 (anchor-based two-stage), and FCOS under the feature extractors ResNet50 and ResNet101 (anchor-free one-stage). The experimental results showed that the performance of the object detection models under day-night joint training is satisfying. Specifically, the average result of our models was 0.98 mAP (mean average precision) in the animal image detection and 88% accuracy in the animal video classification. One-stage YOLOv5m achieved the best recognition accuracy. With the help of AI technology, ecologists can potentially extract information from masses of imagery quickly and efficiently, saving much time.

Keywords: animal identification; camera trap; object detection; deep learning

Citation: Tan, M.; Chao, W.; Cheng, J.-K.; Zhou, M.; Ma, Y.; Jiang, X.; Ge, J.; Yu, L.; Feng, L. Animal Detection and Classification from Camera Trap Images Using Different Mainstream Object Detection Architectures. Animals 2022, 12, 1976. https://doi.org/10.3390/ani12151976

Academic Editor: Mirko Di Febbraro

Received: 15 June 2022; Accepted: 2 August 2022; Published: 4 August 2022

Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).


1. Introduction
Nature is degenerating globally at unprecedented rates, and various human-driven
changes have accelerated biodiversity loss [1–3]. The Living Planet Report 2020 reveals
that populations of mammals, birds, fish, amphibians, and reptiles have fallen by 68% over
the past 50 years [4]. There is an urgent need to understand the mechanisms of biodiversity
loss in the context of increasing anthropogenic disturbance [5,6]. Therefore, we have to
obtain timely and exact information on the species’ distribution, richness, abundance, and
community structure.
Camera trap surveys can provide valuable information for ecologists and wildlife
conservation scientists on the species richness distribution [7,8], animal behavior [9], pop-
ulation density [10], community dynamics [11], and so forth [12,13]. As a non-invasive
approach with good concealment, small interference, and 24 h of continuous work, camera
traps prompt wide usage in wildlife management and biodiversity monitoring [14,15].
A camera trap will be automatically triggered to take photos or videos when animals
pass by [16]. However, camera traps are also susceptible to complex environments (e.g.,
vegetation drifting with the wind, sunlight exposure, etc.), resulting in false triggers and
sometimes producing many images or videos with no wildlife [17,18]. The collected images
and videos have to be cleaned and sorted, which are enormously labor-intensive and
time-consuming manual tasks. In addition, with the wide application of camera trap sur-
veys, the size of datasets increases rapidly, and the data preprocessing obstacle brought by
images with no wildlife in them becomes more and more prominent [19,20]. Cost-effective
technologies are urgently needed to aid in ecological monitoring [21,22].
Deep learning, which can process big data automatically and build relational models
in massive datasets, may be a crucial tool to help ecologists organize, process, and ana-
lyze ecological data more efficiently [19,23,24]. Many researchers have tried to use deep
learning to automatically identify species and remove camera trap images without animals,
which greatly saves time and labor costs [17,25,26]. Norouzzadeh used multitask models
to automatically identify, count, and describe wildlife images with a high classification
accuracy of 93.8% [27]. Schneider successfully solved the problem of outputting only one
label for multi-species images by training object detectors with Faster R-CNN [28]. Object
detection can identify the location and class of interest objects in an image and return all
results, so it will further improve the ability of camera data processing [29]. Afterwards,
some studies suggested that in complex natural environments, the detection of the location
of animals first may be the basis for improving the classification ability [15]. Vecvanags
evaluated the performance of RetinaNet and Faster R-CNN, which can provide technical
support for effective monitoring and management of ungulates [30]. Nowadays, many
object detection models have been proposed in the field of deep learning and more and
more articles have focused on these applications in ecology. However, object detection is
still a challenging task in camera trap surveys and few studies have compared the currently
mainstream object detection models in real camera trap monitoring projects.
Meanwhile, the long-term development of deep learning in the ecological field requires
large, diverse, accurately labeled, and publicly available datasets [31]. Many previous stud-
ies trained models using large datasets from open-source databases or citizen science
platforms (e.g., the Snapshot Serengeti dataset, iNaturalist), which were almost always
collected from specific regions [21,27,31]. There are few wildlife datasets for deep learning
training in China. We need to be aware that geographic bias in ecological datasets may
have implications on the practical application of the model [31]. Additionally, the com-
position of different species also shows a noticeable imbalance in some datasets [32]. It is
challenging and costly in time and effort to label masses of imagery from some camera trap
monitoring projects. Therefore, we should consider the actual situations when we apply
automatic identification technologies to actual ecological protections. Additionally, ecology
researchers in China urgently need high-quality wildlife datasets for deep learning to fill
the gap.

The goals of our study were to build a wildlife dataset for deep learning and evaluate the applicability of object detection in real infrared camera working scenarios. We can summarize the main contents and contributions of our work as follows: (1) We constructed the first Northeast Tiger and Leopard National Park wildlife image dataset (NTLNP dataset). (2) We verified the performance of the object detection networks in recognizing wild animals against a complex natural background and compared the efficiency of three mainstream detection networks in wildlife recognition: YOLOv5 (anchor-based one-stage), FCOS (anchor-free one-stage), and Cascade R-CNN (anchor-based two-stage). (3) We applied the trained models to videos recorded by the camera traps and evaluated their performance.

The remainder of the paper is organized as follows: Section 2 presents the materials and methods used in this study; Section 3 presents the experimental results; Section 4 discusses the experimental findings, shortcomings, and future work; and Section 5 presents the conclusions.

2. Materials and Methods

2.1. Dataset Construction

The data used in this study were video clips taken by infrared cameras in the Northeast Tiger and Leopard National Park from 2014 to 2020. We selected 17 main species (15 wild animals and 2 major domestic animals) as research objects, including Amur tiger (Panthera tigris altaica), Amur leopard (Panthera pardus orientalis), wild boar (Sus scrofa), roe deer (Capreolus pygargus), sika deer (Cervus nippon), Asian black bear (Ursus thibetanus), red fox (Vulpes vulpes), Asian badger (Meles meles), raccoon dog (Nyctereutes procyonoides), musk deer (Moschus moschiferus), Siberian weasel (Mustela sibirica), sable (Martes zibellina), yellow-throated marten (Martes flavigula), leopard cat (Prionailurus bengalensis), Manchurian hare (Lepus mandshuricus), cow, and dog. Figure 1 shows some sample images.

Figure 1. Examples of some species of the NTLNP dataset.

We used a Python script to extract images from the videos (the frame rate was 50). Limited by the number of individuals and their living habits, the number of images for some species was relatively small. Except for hibernating species, the images of each category covered all four seasons. We carried out uniform, standardized manual annotation of the images. All images were labeled in Pascal VOC format using the software labelImg.
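The extraction script itself is not included in the paper; the following is a minimal sketch of how such frame extraction could be done with OpenCV, assuming that "the frame rate was 50" means one frame is kept out of every 50. The file paths and naming scheme are placeholders, not the authors' original code.

```python
# Minimal sketch of frame extraction from camera trap videos with OpenCV.
# Paths, the sampling interval, and the output naming scheme are assumptions.
import cv2
from pathlib import Path

def extract_frames(video_path: str, out_dir: str, every_n: int = 50) -> int:
    """Save every `every_n`-th frame of a video as a JPEG; return the count saved."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    saved, idx = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            name = f"{Path(video_path).stem}_{idx:06d}.jpg"
            cv2.imwrite(str(out / name), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

if __name__ == "__main__":
    n = extract_frames("clip_0001.mp4", "frames/clip_0001", every_n=50)
    print(f"saved {n} frames")
```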

2.2. Object Detection Network
In the deep learning era, object detection has two main technological development routes: anchor-based and anchor-free methods, while the anchor-based method includes one-stage and two-stage detection algorithms [29,33]. In the anchor-based algorithms, one-stage detection directly generates the class probability and position coordinate value of the object from the predefined anchor box; two-stage detection includes generating a region proposal from the image and then generating the final target boundary from the region proposal [34]. The anchor-free method, the keypoint-based detection type such as FCOS, mainly detects target key points to produce the bounding box [35]. Therefore, the one-stage object detection algorithms may be faster, but the two-stage object detection algorithms are generally more accurate.

In this study, we applied three state-of-the-art models to identify, localize, and classify animals in a complex forest environment, namely YOLOv5, FCOS, and Cascade R-CNN [35,36]. We set up two experiment groups: one was training on day and night images jointly, and the other was training on day and night images separately.
2.2.1. YOLOv5
YOLO is an acronym for ‘You only look once’. YOLOv5 is the latest generation in the YOLO series [37]. It is an anchor-based one-stage detector with a fast inference speed [38].

1. Architecture Overview

We chose three architectures: YOLOv5s, YOLOv5m, and YOLOv5l. The backbone adopts the Cross Stage Partial Network (CSPNet) [39]. Before entering the backbone network, the YOLOv5 algorithm adds the Focus module and performs downsampling by slicing the picture. The neck is in the form of a Feature Pyramid Network (FPN) plus a Path Aggregation Network (PAN) and combines three different scales of feature information [40,41]. Then, it uses the Non-Maximum Suppression (NMS) method to remove redundant prediction bounding boxes (Figure 2).

Figure 2. YOLOv5 structure diagram. Conv is convolution; C3 is improved from the Cross Stage Partial Network (CSPNet); Conv2d is two-dimensional convolution.
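As a brief illustration of the NMS step mentioned above, the following is a generic, minimal sketch of IoU-based non-maximum suppression; it is not the YOLOv5 implementation, and the IoU threshold value is only an example.

```python
# Minimal sketch of IoU-based non-maximum suppression (NMS): keep the highest-scoring
# box and discard remaining boxes that overlap it above a threshold.
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes: List[Box], scores: List[float], iou_thr: float = 0.45) -> List[int]:
    """Return indices of boxes kept after suppressing overlaps above iou_thr."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thr]
    return keep
```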

2. Implementation Details

We used the YOLOv5 framework for model training based on PyTorch [42]. The optimizer was Stochastic Gradient Descent (SGD), the momentum was set to 0.937, and the weight decay was set to 0.0005. The initial learning rate was set to 1 × 10−2, which would decrease linearly; the warm-up epoch was 3, and the initial warm-up momentum was 0.8. Due to the different sizes of the models, the total number of epochs and the batch size were different. The detailed settings of each model are shown in Table 1. Experiments were run on an RTX A4000 GPU.

Table 1. YOLOv5 parameter settings.

Model                Epoch    Batch Size
YOLOv5s_day           80        32
YOLOv5m_day           80        32
YOLOv5l_day           80        16
YOLOv5s_night         65        32
YOLOv5m_night         65        32
YOLOv5l_night         65        16
YOLOv5s_togather      60        32
YOLOv5m_togather      60        32
YOLOv5l_togather      45        16
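For illustration, the hyperparameters above can be sketched in plain PyTorch as follows; this is not the YOLOv5 training code (which configures these values through its own hyperparameter files), and the placeholder model and the exact decay endpoint are assumptions.

```python
# Sketch of the stated optimizer settings: SGD with momentum 0.937, weight decay 5e-4,
# initial lr 1e-2, 3-epoch linear warm-up, then linear decay. Illustrative only.
import torch

model = torch.nn.Conv2d(3, 16, 3)          # placeholder network
epochs, warmup_epochs, lr0 = 60, 3, 1e-2

optimizer = torch.optim.SGD(model.parameters(), lr=lr0,
                            momentum=0.937, weight_decay=5e-4)

def lr_factor(epoch: int) -> float:
    """Linear warm-up for the first epochs, then linear decay towards zero."""
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    return max(0.0, 1.0 - (epoch - warmup_epochs) / (epochs - warmup_epochs))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)

for epoch in range(epochs):
    # ... one training pass over the day/night (or joint) split would go here ...
    optimizer.step()       # placeholder for the actual batch loop
    scheduler.step()
```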
2.2.2. FCOS

FCOS is a one-stage, fully convolutional object detection network that is anchor free [35]. It uses center points to replace anchor boxes for bounding box regression, which is more straightforward and flexible.

1. Architecture Overview

The network structure consists of three main parts: backbone, FPN, and output network. The backbone network used in this experiment was ResNet50 and ResNet101 [43], which could be divided into 5 parts. It adds FPN for multi-scale feature extraction. The output network consists of Heads, each of which contains a shared part and 3 branches. Classification predicts the confidence of the existence of the target at each sampling point on the feature map, center-ness predicts the distance between the sampling point and the center of the target, and regression predicts the distance between the sampling point and the real box of the original image (Figure 3).

Figure 3. FCOS structure diagram. H × W is the height and width of feature maps. ‘/s’ (s = 8, 16, . . . , 128) is the downsampling ratio of the feature maps at the level to the input image [35].

2. Implementation Details

We used the FCOS framework for model training based on PyTorch [35,42]. We trained 35 epochs under the different backbone networks with the batch size set to 12 and 8, respectively. In the early stage of training, the warm-up strategy was used to increase the learning rate from 0 to 2 × 10−3 gradually. When the number of training iterations reached 20,000, the learning rate was reduced to 2 × 10−4, and after 27,000 iterations, it was reduced to 2 × 10−5. Experiments were run on an RTX A5000 GPU.
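The iteration-based schedule described above can be summarized as a simple function; the warm-up length used here is an assumption for illustration, not a value reported in the paper.

```python
# Minimal sketch of the described schedule: linear warm-up to 2e-3, then step decays
# at 20,000 and 27,000 iterations. The warm-up length (500 iterations) is assumed.
def fcos_learning_rate(iteration: int, warmup_iters: int = 500) -> float:
    base_lr = 2e-3
    if iteration < warmup_iters:                 # linear warm-up from 0
        return base_lr * iteration / warmup_iters
    if iteration < 20_000:
        return base_lr                           # 2e-3
    if iteration < 27_000:
        return base_lr * 0.1                     # 2e-4
    return base_lr * 0.01                        # 2e-5

# Example: query the schedule at a few points.
for it in (100, 10_000, 25_000, 30_000):
    print(it, fcos_learning_rate(it))
```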

2.2.3. Cascade R-CNN

Cascade R-CNN stacks several cascade modules in the detector and uses different Intersection over Union (IoU) thresholds to train [36]. It dramatically improves the accuracy of the anchor-based two-stage object detection algorithm.

1. Architecture Overview

We chose HRNet32 as the backbone network to perform the task of wildlife object detection in the manner of Cascade R-CNN [36,44]. HRNet achieves the purpose of strong semantic information and precise location information through parallel branches of multiple resolutions and continuous information interaction between different branches [44]. Overall, Cascade R-CNN has four stages, one Region Proposal Network (RPN) and three for detection with IoU = {0.5, 0.6, 0.7}. Sampling in the first detection stage follows Faster R-CNN [45]. In the next stage, resampling is achieved by simply using the regression output from the previous stage. The model structure is shown in Figure 4.

Figure 4. Cascade R-CNN structure diagram. ROI pooling is region-wise feature extraction [36].

2. Implementation Details

We used the MMDetection framework for model training based on PyTorch [42,46]. The optimizer was Stochastic Gradient Descent (SGD), the momentum was set to 0.9, and the weight decay was set to 0.0001. The total number of epochs was 30. The learning rate was 1 × 10−2 and the batch size was 2. For joint training, the learning rate was 1 × 10−2 and the batch size was 4. In total, 500 steps were used for the warm-up. The learning rate would decrease according to the epoch, with a decrease ratio of 10 at epoch 16 and epoch 19, respectively. Experiments were run on an RTX 3090 GPU.
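As a sketch, the training settings above are the kind of values that would appear in an MMDetection 2.x configuration file (which is plain Python); this fragment is illustrative only and is not the authors' actual config, and the warm-up ratio shown is an assumed default.

```python
# Illustrative MMDetection 2.x-style config fragment mirroring the text:
# SGD (momentum 0.9, weight decay 1e-4), lr 1e-2, 500 warm-up steps,
# decay by 10x at epochs 16 and 19, 30 epochs in total, batch size 2 per GPU.
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=0.001,   # assumed default warm-up ratio
    step=[16, 19])        # multiply lr by 0.1 at these epochs
runner = dict(type='EpochBasedRunner', max_epochs=30)
data = dict(samples_per_gpu=2, workers_per_gpu=2)  # batch size 2 per GPU
```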

2.3. Evaluation Metrics

This paper used the precision, recall, and mean average precision (mAP) as evaluation metrics:

Precision = TP / (TP + FP)    (1)

Recall = TP / (TP + FN)    (2)
where true positive (TP) is the number of correct detections of the ground-truth bounding box, that is, the number of IoU that exceeds the threshold and is correctly classified; false positive (FP) is the number of incorrect detections of a nonexistent object or misplaced detections of an existing object, that is, the number of IoU not exceeding the threshold or the number of misclassification errors; and false negative (FN) is the number of missed detections, that is, the number of boxes that are not predicted [47]:
AP = ∫_0^1 P(R) dR    (3)

mAP = (1/C) · Σ_{i=1}^{C} AP(i)    (4)

AP (average precision) is obtained by calculating the P–R integral, where P is the precision and R is the recall. AP is averaged over classes to obtain mAP (mean average precision), where C is the number of categories; in this paper, C = 17.
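A minimal NumPy sketch of Equations (3) and (4) is shown below; it uses the common all-point interpolation of the precision–recall curve and is not the exact evaluation code of the detection frameworks used here.

```python
# Sketch: AP as the area under the precision-recall curve (Eq. (3)) and mAP as the
# class-wise mean (Eq. (4)), using all-point interpolation.
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """All-point interpolated AP: integrate precision over recall."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Make precision monotonically decreasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]          # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(per_class_ap: dict) -> float:
    """mAP over C categories (C = 17 in this paper)."""
    return float(np.mean(list(per_class_ap.values())))

# Example with toy precision/recall values for two classes.
ap_tiger = average_precision(np.array([0.2, 0.6, 1.0]), np.array([1.0, 0.9, 0.7]))
print(mean_average_precision({"amur_tiger": ap_tiger, "red_fox": 0.95}))
```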
When detecting videos, we used accuracy as the evaluation metric. For a video clip, the final label was determined by the most frequently occurring detection result across all the frames of the target video, where detections were counted only if their confidence exceeded the score threshold:

Accuracy = N / T    (5)

where N is the number of correctly classified videos and T is the total number of videos.
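A minimal sketch of this video-level rule follows; the per-frame detection format and the example threshold are assumptions for illustration, not the authors' implementation.

```python
# Sketch of the clip-level rule: keep per-frame detections whose confidence exceeds
# a score threshold, then take the most frequent class as the clip label.
from collections import Counter
from typing import List, Optional, Tuple

Detection = Tuple[str, float]  # (predicted class, confidence)

def classify_clip(frame_detections: List[List[Detection]],
                  score_thr: float = 0.7) -> Optional[str]:
    """Majority vote over confident detections from all frames of one video."""
    votes = Counter()
    for dets in frame_detections:
        for cls, conf in dets:
            if conf >= score_thr:
                votes[cls] += 1
    return votes.most_common(1)[0][0] if votes else None

# Example: three frames of one clip.
clip = [[("sika_deer", 0.91)], [("sika_deer", 0.75), ("roe_deer", 0.62)], []]
print(classify_clip(clip, score_thr=0.7))   # -> "sika_deer"
```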

3. Results
3.1. NTLNP Dataset
After checking and cleaning, a total of 25,657 images were selected from 17 species categories to build the NTLNP dataset, including 15,313 images taken during the day and 10,344 images taken at night. The image resolution was 1280 × 720 or 1600 × 1200 pixels
(Table 2). According to the ratio of 8:2, the NTLNP dataset was divided into the training set
and test set. The various types of data are shown in Table 3.

Table 2. The main properties of the NTLNP dataset.

Species Category    No. of Total Images    No. of Daytime Images    No. of Nighttime Images    Image Resolution
17                  25,657                 15,313                   10,344                     1280 × 720 / 1600 × 1200

Table 3. NTLNP dataset and per-class training set and test set assignments.

Species                    Day and Night              Day                        Night
                           Training Set   Test Set    Training Set   Test Set    Training Set   Test Set
Training Set Test Set Training Set Test Set Training Set Test Set
Amur tiger 1123 246 676 145 447 101
Amur leopard 1260 314 872 219 388 95
Wild boar 1801 423 1159 291 642 132
Sika deer 1726 466 1216 328 510 138
Red fox 1504 358 802 188 702 170
Raccoon dog 1169 324 248 81 921 243
Asian badger 1052 257 735 176 317 81
Asian black bear 1084 285 772 188 312 97
Leopard cat 1589 385 841 196 748 189
Roe deer 1749 374 1317 293 432 81
Siberian weasel 985 284 554 175 431 109
Yellow-throated marten 779 205 681 178 98 27
Sable 483 129 152 40 331 89
Musk deer 1045 248 216 47 829 201
Manchurian hare 1010 270 17 3 993 267
Cow 1016 284 936 263 80 21
Dog 1150 280 1056 252 94 28
Total 20,525 5132 12,250 3063 8275 2069

3.2. Experimental Results


3.2.1. Model Performance
Considering that the NTLNP dataset contained color images (day) and gray images
(night), we investigated whether it was better when day and night images were trained
separately or together. The results of each model are shown in Table 4. It was eventually
discovered that the day models’ accuracy outperformed that of the night models, and when
day and night images were trained jointly, all models were more accurate. Both YOLOv5
and FCOS achieved good precision and recall and Cascade_R-CNN_HRNet32 had high
recall but low precision, which was 81.5%, 73.8%, and 80.9% in day, night, and day-night
joint. When using mAP with a threshold of 0.5 IoU as the model evaluation, the average
accuracy of almost all models was above 98%, and YOLOv5 had a higher value compared
to the other two models. The accuracy of FCOS_Resnet50 and FCOS_Resnet101 was
relatively low at night: 94.7% and 96.5%, respectively. Cascade_R-CNN_HRNet32 achieved
a 97.3% accuracy in the daytime images, 97% accuracy in the nighttime images, and 98%
accuracy in the day-night joint training. When using mAP_0.5:0.95 as the metric, the models’
accuracy was between 82.4% and 88.9%.

Table 4. Overall recognition accuracy of different object detection models.

Metric
Experiment Model
Precision Recall mAP_0.5 mAP_0.5:0.95
YOLOv5s 0.981 0.972 0.987 0.858
YOLOv5m 0.987 0.975 0.989 0.880
YOLOv5l 0.984 0.975 0.989 0.878
Day&Night
FCOS_Resnet50 0.969 0.892 0.979 0.812
FCOS_Resnet101 0.963 0.882 0.978 0.820
Cascade_R-CNN_HRNet32 0.809 0.986 0.980 0.840
YOLOv5s 0.981 0.968 0.984 0.867
YOLOv5m 0.981 0.974 0.984 0.880
YOLOv5l 0.982 0.969 0.983 0.889
Day
FCOS_Resnet50 0.909 0.904 0.981 0.825
FCOS_Resnet101 0.928 0.920 0.983 0.832
Cascade_R-CNN_HRNet32 0.815 0.980 0.973 0.845
YOLOv5s 0.956 0.972 0.984 0.850
YOLOv5m 0.976 0.982 0.989 0.867
YOLOv5l 0.971 0.986 0.989 0.874
Night
FCOS_Resnet50 0.940 0.859 0.947 0.678
FCOS_Resnet101 0.970 0.867 0.965 0.796
Cascade_R-CNN_HRNet32 0.738 0.981 0.970 0.824
Note: mAP_0.5 is the average precision calculated when IoU is 0.5, mAP_0.5:0.95 is the average precision calculated
when IoU is 0.5 to 0.95 with steps of 0.05.

3.2.2. Species Detection and Classification


We selected YOLOv5m, FCOS_Resnet101, and Cascade_R-CNN_HRNet32, which had
a better performance, to further evaluate the recognition accuracy of each species.
Since there were only 20 images of hares in the daytime, they were not considered
in the model. The recognition accuracy of the 3 models trained on the daytime dataset
for the 16 species is shown in Figure 5. Cascade_R-CNN_HRNet32, YOLOv5m, and
FCOS_Resnet101 had a 91.6–100%, 94.2–99.5%, and 94–100% accuracy for the 16 species.
Cascade_R-CNN_HRNet32 achieved a 100% recognition accuracy for Amur leopard and
musk deer, and FCOS_Resnet101 for Amur tiger and red fox. The accuracy of YOLOv5m
and FCOS_Resnet101 for raccoon dog reached 96% and 96.4%, respectively, which was
4.4–4.8% higher than Cascade_R-CNN_HRNet32. Sable showed the worst performance,
and YOLOv5m had the relatively best accuracy of 94.2%.
Figure 5. Recognition accuracy of each species of the three object detection models based on the daytime dataset. The y-axis is the AP value when IOU = 0.5, ranging from 0.85–1; the x-axis is the species.

Figure 6 demonstrates the recognition accuracy of the night models. We found that the three models exhibited performance differences at night. YOLOv5m had the best accuracy in recognizing animals at night, reaching 97.7–99.5%. The accuracy of Cascade_R-CNN_HRNet32 was above 95% for most species but lower for roe deer and dogs, at 92.8% and 88.2%. In contrast, FCOS_Resnet101 performed the worst at night, with significant differences among species. Amur tiger, Amur leopard, and musk deer achieved a 100% accuracy, while dog and badger were only 87.4% and 91.7% accurate.

Figure 6. Recognition accuracy of each species of the three object detection models based on the nighttime dataset. The y-axis is the AP value when IOU = 0.5, ranging from 0.85–1; the x-axis is the species.

Compared with separate training, the day-night jointly trained models achieved a better accuracy for all species (Figure 7). YOLOv5m was the best model, with an accuracy of 97–99.5%; roe deer, badger, raccoon dog, yellow-throated marten, and dog all achieved a higher recognition accuracy than with the other two models. The accuracy of FCOS_Resnet50 and Cascade_R-CNN_HRNet32 ranged from 94.2–100% and 95.3–99.9%, respectively.

Figure 7. Recognition accuracy of each species of the three object detection models based on the day-night dataset. The y-axis is the AP value when IOU = 0.5, ranging from 0.85–1; the x-axis is the species.

All models had the ability to detect each object in a single image. Because different species rarely appeared in front of one camera trap at the same time, there were only images of one object or of multiple objects of the same species in our dataset. Some identified images are shown in Figure 8 and more results of the different models are reported in the Supplementary Materials (Figures S1–S3).

Figure 8. Examples of correct detection and classification.

3.2.3. Video Automatic Recognition


We applied the day-night joint YOLOv5m, Cascade_R-CNN_HRNet32, and FCOS_Resnet101
to automatically recognize the videos captured by infrared cameras in the Northeast Tiger and
Leopard National Park. The accuracy of the three models was tested when the score thresholds
were 0.6, 0.7, and 0.8, respectively. The result is shown in Table 5. YOLOv5m showed the most
robust performance among all models. When the threshold was 0.7, the accuracy was 89.6%.
Cascade_R-CNN_HRNet32 was slightly inferior, obtaining the highest accuracy of 86.5% at the
threshold of 0.8. The accuracy of FCOS_Resnet101 showed significant differences at different
thresholds. When the threshold was 0.6, the video classification accuracy reached 91.6%. Never-
theless, when the threshold was 0.8, the recognition rate of the videos dropped sharply, eventually
only reaching 64.7%.

Table 5. Video classification accuracy of the three models.

Videos    Model                       Acc_0.6    Acc_0.7    Acc_0.8


725       YOLOv5m                     88.8%      89.6%      89.5%
          Cascade_R-CNN_HRNet32       86.3%      86.4%      86.5%
FCOS_Resnet101 91.6% 86.6% 64.7%
Note: Acc represents Accuracy; Acc_0.6, 0.7, 0.8 represent the accuracy of video classification where the score
threshold = {0.6, 0.7, 0.8}.

4. Discussion
Open-source datasets on citizen science platforms boost interdisciplinary research,
where scientists are able to train various models based on these datasets and propose
optimization schemes [26,27]. However, we have to consider the geographic biases of
most ecological datasets in practical applications [31]. In this study, for the first time, we
constructed an image dataset of 17 species in the Northeast Tiger and Leopard National
Park with standard bounding box and annotation (Table 3, NTLNP dataset). This dataset
provides a great resource for exploring and evaluating the application of deep learning in
the Northeast Tiger and Leopard National Park. Our dataset was small compared to large
image recognition projects, but the results were relatively good and could provide a fairly
effective aid in the subsequent data processing process. At the same time, the construction
of the NTLNP dataset also complemented the diversity of ecological data for deep learning.
By comparison, we found that day-night joint training had a better performance
(Table 4), breaking our assumption that separate training would be more effective. YOLOv5,
FCOS, and Cascade R-CNN all achieved high average precision: >97.9% at mAP_0.5 and
>81.2% at mAP_0.5:0.95, which could meet the needs of automatic wildlife recognition
(Table 4). Moreover, all models exhibited similar characteristics, i.e., good performance for
large targets such as Amur tiger and Amur leopard. For small targets such as badger and
yellow-throated marten, the accuracy of predicting borders was reduced due to their fast
movement, which would easily cause blurring in images at night (Figure 9a). Additionally,
the models sometimes misidentified the background as an animal (Figure 9b). We believe
that static backgrounds that closely resembled animal forms might interfere with the
recognition. Additionally, when animals were too close/far or hidden/occluded, the
models might have failed to detect the targets (Figure 9c,d). Some similar morphological
species were prone to misidentification (Figure 9e). Overall, the recognition results were
seriously affected when the image quality was poor.
In this experiment, the accuracy of the anchor-based one-stage YOLOv5 series models
exceeded that of the anchor-free one-stage FCOS series models and anchor-based two-
stage Cascade_R-CNN_HRNet32. Especially, YOLOv5m achieved the highest accuracy,
with 98.9% for mAP_0.5 and 88% for mAP_0.5:0.95 (Table 4). This was inconsistent with
the usual results mentioned in previous literature, where two-stage models were usually
more accurate than one-stage models, and the deeper the network, the better the model
performance [34]. Therefore, when applying artificial intelligence (AI), ecologists should
consider the actual situation of each protected area and choose the appropriate model as a tool to help wildlife monitoring and research.

Figure 9. Examples of the typical failure cases of the models. (a) False negative or low recognition ratio due to poor image quality (blur, etc.); (b) Misrecognition of the background (stump, stone, fallen leaves, etc.); (c) Inability to detect the target when animals are too close/far; (d) Inability to detect the target when animals are hidden or occluded; (e) Similar species are prone to misidentification. Red dotted boxes are added manually to show the missing targets.

Moreover, we suggest testing the threshold setting of the model along a suitable gradient in practical applications. When we applied the trained models to the infrared camera videos, we found that at different thresholds, the accuracy of FCOS_Resnet101 showed more significant variation while that of YOLOv5m and Cascade_R-CNN_HRNet32 was almost constant (Table 5). As can be seen, sometimes setting the threshold too high does not improve the accuracy, while a problem with a low threshold is that it can lead to an increase in false positives of images without wildlife.

Finally, due to the limitations of the experimental environments, this study only compared the accuracy but failed to compare other parameters such as the running speed of the models. In follow-up studies, it is necessary to perform a comprehensive comparison before choosing the model that suits the application scenario best. In addition, we found that the background information strongly influenced the models' performance. It should be noted that static infrared cameras are usually fixed on trees in the field, capturing large numbers of photos or videos with the same background. Beery proposed the Context R-CNN architecture, which can aggregate contextual features from other frames and leverage the long-term temporal context to improve object detection in passive monitoring [48]. The

seasonal, temporal, and locational variations made the background information vary widely,
so the models were prone to misjudgment for unlearned backgrounds. In the future, the
selection of images of species at different times and in different geographical environments
can enhance the model’s ability to learn the context. Moreover, affected by the light and
geographical environments, the quality of the images and videos captured by the cameras
was different, and the uncertainty of triggering, animals that were too large/small or hidden,
and fast movement increases the difficulty of identification [49,50]. Attempts can be made to
further improve the species recognition accuracy by combining ecological information such
as the sound, activity patterns, and geographical distribution of the animals with image-
based identification systems [51,52]. Furthermore, for ecological studies, distinguishing
individual differences within species is also crucial, and the future incorporation of re-
identification into detection systems will enable the tracking of individuals and counting of
the number of species in a region [53–55].

5. Conclusions
Camera traps provide a critical aid in multifaceted surveys of wildlife worldwide while
they often produce large volumes of images and videos [56]. A growing number of studies
have tried to use deep learning techniques to extract effective information from massive
images or videos. Our paper constructed the NTLNP dataset, which could increase the
diversity of wildlife datasets, and verified the feasibility and effectiveness of object detection
models for identifying wild animals in the complex forest backgrounds in the Northeast
Tiger and Leopard National Park. On the NTLNP dataset, we conducted experiments on
three mainstream object detection models and all models showed a satisfying performance.
Moreover, we proposed that dynamically selecting the model according to the deployment scenario would achieve better results. Overall, this technology is of great practical value
in helping researchers conduct more effective biodiversity monitoring, conservation, and
scientific research in the Northeast Tiger and Leopard National Park.
As ecology enters the field of big data, deep learning brings a lot of hope to ecolo-
gists [19]. Although it is impossible for the model to achieve 100% accuracy, the technology
will reduce the manual identification work and help ecologists quickly and efficiently
extract information from massive data. In the future, in-depth interdisciplinary cooperation
will further promote technological innovation in ecological research and conservation.

Supplementary Materials: The following supporting information can be downloaded at: https:
//www.mdpi.com/article/10.3390/ani12151976/s1, Figure S1 Examples of correct animal detection
and classification using the YOLOv5m network; Figure S2 Examples of correct animal detection and
classification using FCOS_Resnet101; Figure S3 Examples of correct animal detection and classification
using Cascade_R-CNN_HRNet32.
Author Contributions: Conceptualization, M.T., L.Y. and L.F.; Investigation, M.T. and Y.M.; Data
curation, M.T., Y.M., J.-K.C., M.Z. and X.J.; Methodology, W.C., J.-K.C., M.Z. and M.T.; Visualization,
M.T. and Y.M.; Writing—original draft, M.T.; Writing—review and editing, W.C., L.F. and L.Y.;
Resources, J.G., L.F. and L.Y.; Funding acquisition, J.G. and L.F. All authors have read and agreed to
the published version of the manuscript.
Funding: This research was funded by National Natural Science Foundation of China, grant number
31670537; National Scientific and Technical Foundation Project of China, grant number 2019FY101700;
National Forestry and Grassland Administration, grant number 2017; Cyrus Tang Foundation, grant
number 2016; BNU Interdisciplinary Research Foundation for the First-Year Doctoral Candidates,
grant number BNUXKJC2118.
Institutional Review Board Statement: Not applicable. No ethical approval was required as only
camera trapping is a noninvasive method.
Informed Consent Statement: Not applicable.
Data Availability Statement: NTLNP_dataset link: https://pan.bnu.edu.cn/l/s1JHuO (accessed on
1 May 2022). The key is available on request from the corresponding author.

Acknowledgments: This research was partially supported by China’s Northeast Tiger and Leopard
National Park. We are thankful to Fu Yanwen for his comments and help.
Conflicts of Interest: The authors declare no conflict of interest.

References
1. Hooper, D.U.; Adair, E.C.; Cardinale, B.J.; Byrnes, J.E.; Hungate, B.A.; Matulich, K.L.; Gonzalez, A.; Duffy, J.E.; Gamfeldt, L.;
O’Connor, M.I. A global synthesis reveals biodiversity loss as a major driver of ecosystem change. Nature 2012, 486, 105–108.
[CrossRef] [PubMed]
2. Dirzo, R.; Young, H.S.; Galetti, M.; Ceballos, G.; Isaac, N.J.; Collen, B. Defaunation in the Anthropocene. Science 2014, 345, 401–406.
[CrossRef] [PubMed]
3. Díaz, S.M.; Settele, J.; Brondízio, E.; Ngo, H.; Guèze, M.; Agard, J.; Arneth, A.; Balvanera, P.; Brauman, K.; Butchart, S. The Global
Assessment Report on Biodiversity and Ecosystem Services: Summary for Policy Makers; Intergovernmental Science-Policy Platform on
Biodiversity and Ecosystem Services: Bonn, Germany, 2019; ISBN 978-3-947851-13-3.
4. Almond, R.E.; Grooten, M.; Peterson, T. Living Planet Report 2020-Bending the Curve of Biodiversity Loss; World Wildlife Fund:
Washington, DC, USA, 2020.
5. Anderson, C.B. Biodiversity monitoring, earth observations and the ecology of scale. Ecol. Lett. 2018, 21, 1572–1585. [CrossRef]
[PubMed]
6. Adam, M.; Tomášek, P.; Lehejček, J.; Trojan, J.; Jůnek, T. The Role of Citizen Science and Deep Learning in Camera Trapping.
Sustainability 2021, 13, 10287. [CrossRef]
7. Ordeñana, M.A.; Crooks, K.R.; Boydston, E.E.; Fisher, R.N.; Lyren, L.M.; Siudyla, S.; Haas, C.D.; Harris, S.; Hathaway, S.A.;
Turschak, G.M. Effects of urbanization on carnivore species distribution and richness. J. Mammal. 2010, 91, 1322–1331. [CrossRef]
8. Gilbert, N.A.; Pease, B.S.; Anhalt-Depies, C.M.; Clare, J.D.; Stenglein, J.L.; Townsend, P.A.; Van Deelen, T.R.; Zuckerberg, B.
Integrating harvest and camera trap data in species distribution models. Biol. Conserv. 2021, 258, 109147. [CrossRef]
9. Palencia, P.; Fernández-López, J.; Vicente, J.; Acevedo, P. Innovations in movement and behavioural ecology from camera traps:
Day range as model parameter. Methods Ecol. Evol. 2021, 12, 1201–1212. [CrossRef]
10. Luo, G.; Wei, W.; Dai, Q.; Ran, J. Density estimation of unmarked populations using camera traps in heterogeneous space. Wildl.
Soc. Bull. 2020, 44, 173–181. [CrossRef]
11. Mölle, J.P.; Kleiven, E.F.; Ims, R.A.; Soininen, E.M. Using subnivean camera traps to study Arctic small mammal community
dynamics during winter. Arct. Sci. 2021, 8, 183–199. [CrossRef]
12. O’Connell, A.F.; Nichols, J.D.; Karanth, K.U. Camera Traps in Animal Ecology: Methods and Analyses; Springer: Berlin/Heidelberg,
Germany, 2011; Volume 271.
13. Jia, L.; Tian, Y.; Zhang, J. Domain-Aware Neural Architecture Search for Classifying Animals in Camera Trap Images. Animals
2022, 12, 437. [CrossRef]
14. Newey, S.; Davidson, P.; Nazir, S.; Fairhurst, G.; Verdicchio, F.; Irvine, R.J.; van der Wal, R. Limitations of recreational camera
traps for wildlife management and conservation research: A practitioner’s perspective. Ambio 2015, 44, 624–635. [CrossRef]
15. Carl, C.; Schönfeld, F.; Profft, I.; Klamm, A.; Landgraf, D. Automated detection of European wild mammal species in camera trap
images with an existing and pre-trained computer vision model. Eur. J. Wildl. Res. 2020, 66, 62. [CrossRef]
16. Rovero, F.; Zimmermann, F.; Berzi, D.; Meek, P. Which camera trap type and how many do I need? A review of camera features
and study designs for a range of wildlife research applications. Hystrix 2013, 24, 148–156.
17. Yousif, H.; Yuan, J.; Kays, R.; He, Z. Animal Scanner: Software for classifying humans, animals, and empty frames in camera trap
images. Ecol. Evol. 2019, 9, 1578–1589. [CrossRef] [PubMed]
18. Yang, D.-Q.; Li, T.; Liu, M.-T.; Li, X.-W.; Chen, B.-H. A systematic study of the class imbalance problem: Automatically identifying
empty camera trap images using convolutional neural networks. Ecol. Inform. 2021, 64, 101350. [CrossRef]
19. Christin, S.; Hervet, É.; Lecomte, N. Applications for deep learning in ecology. Methods Ecol. Evol. 2019, 10, 1632–1644. [CrossRef]
20. Browning, E.; Gibb, R.; Glover-Kapfer, P.; Jones, K.E. Passive Acoustic Monitoring in Ecology and Conservation. WWF Conserv.
Technol. Ser. 1 2017, 2, 10–12.
21. Shepley, A.; Falzon, G.; Meek, P.; Kwan, P. Automated location invariant animal detection in camera trap images using publicly
available data sources. Ecol. Evol. 2021, 11, 4494–4506. [CrossRef] [PubMed]
22. Culina, A.; Baglioni, M.; Crowther, T.W.; Visser, M.E.; Woutersen-Windhouwer, S.; Manghi, P. Navigating the unfolding open data
landscape in ecology and evolution. Nat. Ecol. Evol. 2018, 2, 420–426. [CrossRef]
23. Olden, J.D.; Lawler, J.J.; Poff, N.L. Machine learning methods without tears: A primer for ecologists. Q. Rev. Biol. 2008, 83, 171–193.
[CrossRef]
24. Najafabadi, M.M.; Villanustre, F.; Khoshgoftaar, T.M.; Seliya, N.; Wald, R.; Muharemagic, E. Deep learning applications and
challenges in big data analytics. J. Big Data 2015, 2, 1. [CrossRef]
25. Villa, A.G.; Salazar, A.; Vargas, F. Towards automatic wild animal monitoring: Identification of animal species in camera-trap
images using very deep convolutional neural networks. Ecol. Inform. 2017, 41, 24–32. [CrossRef]

26. Chen, G.; Han, T.X.; He, Z.; Kays, R.; Forrester, T. Deep convolutional neural network based species recognition for wild animal
monitoring. In Proceedings of the 2014 IEEE International Conference on Image Processing (ICIP), Paris, France, 27–30 October
2014; pp. 858–862.
27. Norouzzadeh, M.S.; Nguyen, A.; Kosmala, M.; Swanson, A.; Palmer, M.S.; Packer, C.; Clune, J. Automatically identifying, counting,
and describing wild animals in camera-trap images with deep learning. Proc. Natl. Acad. Sci. USA 2018, 115, E5716–E5725.
[CrossRef]
28. Schneider, S.; Taylor, G.W.; Kremer, S. Deep learning object detection methods for ecological camera trap data. In Proceedings of
the 2018 15th Conference on Computer and Robot Vision (CRV), Toronto, ON, Canada, 8–10 May 2018; pp. 321–328.
29. Zhao, Z.-Q.; Zheng, P.; Xu, S.-t.; Wu, X. Object detection with deep learning: A review. IEEE Trans. Neural Netw. Learn. Syst. 2019,
30, 3212–3232. [CrossRef]
30. Vecvanags, A.; Aktas, K.; Pavlovs, I.; Avots, E.; Filipovs, J.; Brauns, A.; Done, G.; Jakovels, D.; Anbarjafari, G. Ungulate Detection
and Species Classification from Camera Trap Images Using RetinaNet and Faster R-CNN. Entropy 2022, 24, 353. [CrossRef]
31. Tuia, D.; Kellenberger, B.; Beery, S.; Costelloe, B.R.; Zuffi, S.; Risse, B.; Mathis, A.; Mathis, M.W.; van Langevelde, F.; Burghardt, T.
Perspectives in machine learning for wildlife conservation. Nat. Commun. 2022, 13, 792. [CrossRef]
32. Feng, J.; Xiao, X. Multiobject Tracking of Wildlife in Videos Using Few-Shot Learning. Animals 2022, 12, 1223. [CrossRef] [PubMed]
33. Zou, Z.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. arXiv 2019, arXiv:1905.05055.
34. Carranza-García, M.; Torres-Mateo, J.; Lara-Benítez, P.; García-Gutiérrez, J. On the performance of one-stage and two-stage object
detectors in autonomous vehicles using camera data. Remote Sens. 2020, 13, 89. [CrossRef]
35. Tian, Z.; Shen, C.; Chen, H.; He, T. Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 9626–9635.
36. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162.
37. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
38. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
39. Wang, C.-Y.; Liao, H.-Y.M.; Wu, Y.-H.; Chen, P.-Y.; Hsieh, J.-W.; Yeh, I.-H. CSPNet: A new backbone that can enhance learning
capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle,
WA, USA, 14–19 June 2020; pp. 1571–1580.
40. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768.
41. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 936–944.
42. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch:
An imperative style, high-performance deep learning library. In Proceedings of the 33rd International Conference on Neural
Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Curran Associates Inc.: Red Hook, NY, USA, 2019;
p. 721.
43. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
44. Sun, K.; Zhao, Y.; Jiang, B.; Cheng, T.; Xiao, B.; Liu, D.; Mu, Y.; Wang, X.; Liu, W.; Wang, J. High-resolution representations for
labeling pixels and regions. arXiv 2019, arXiv:1904.04514.
45. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceed-
ings of the 28th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015;
MIT Press: Montreal, QC, Canada, 2015; Volume 1, pp. 91–99.
46. Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J. MMDetection: Open mmlab detection toolbox
and benchmark. arXiv 2019, arXiv:1906.07155.
47. Padilla, R.; Netto, S.L.; Da Silva, E.A. A survey on performance metrics for object-detection algorithms. In Proceedings of the
2020 International Conference on Systems, Signals and Image Processing (IWSSIP), Niterói, Brazil, 1–3 July 2020; pp. 237–242.
48. Beery, S.; Wu, G.; Rathod, V.; Votel, R.; Huang, J. Context r-cnn: Long term temporal context for per-camera object detection. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020;
pp. 13072–13082.
49. Yousif, H.; Yuan, J.; Kays, R.; He, Z. Fast human-animal detection from highly cluttered camera-trap images using joint background
modeling and deep learning classification. In Proceedings of the 2017 IEEE International Symposium on Circuits and Systems
(ISCAS), Baltimore, MD, USA, 28–31 May 2017; pp. 1–4.
50. Miao, Z.; Gaynor, K.M.; Wang, J.; Liu, Z.; Muellerklein, O.; Norouzzadeh, M.S.; McInturff, A.; Bowie, R.C.; Nathan, R.; Yu, S.X.
Insights and approaches using deep learning to classify wildlife. Sci. Rep. 2019, 9, 8137. [CrossRef]
51. Yang, B.; Zhang, Z.; Yang, C.-Q.; Wang, Y.; Orr, M.C.; Wang, H.; Zhang, A.-B. Identification of species by combining molecular
and morphological data using convolutional neural networks. Syst. Biol. 2022, 71, 690–705. [CrossRef]
52. Lin, C.; Huang, X.; Wang, J.; Xi, T.; Ji, L. Learning niche features to improve image-based species identification. Ecol. Inform. 2021,
61, 101217. [CrossRef]

53. Shi, C.; Liu, D.; Cui, Y.; Xie, J.; Roberts, N.J.; Jiang, G. Amur tiger stripes: Individual identification based on deep convolutional
neural network. Integr. Zool. 2020, 15, 461–470. [CrossRef] [PubMed]
54. Hou, J.; He, Y.; Yang, H.; Connor, T.; Gao, J.; Wang, Y.; Zeng, Y.; Zhang, J.; Huang, J.; Zheng, B. Identification of animal individuals
using deep learning: A case study of giant panda. Biol. Conserv. 2020, 242, 108414. [CrossRef]
55. Guo, S.; Xu, P.; Miao, Q.; Shao, G.; Chapman, C.A.; Chen, X.; He, G.; Fang, D.; Zhang, H.; Sun, Y. Automatic identification of
individual primates with deep learning techniques. Iscience 2020, 23, 101412. [CrossRef]
56. Fennell, M.; Beirne, C.; Burton, A.C. Use of object detection in camera trap image identification: Assessing a method to rapidly
and accurately classify human and animal detections for research and application in recreation ecology. Glob. Ecol. Conserv. 2022,
35, e02104. [CrossRef]
