common detectors. In our previous work, we have already considered this problem. In the first approach, we designed features useful for this case [16]. The second method used CNNs to learn visual features from similar images [17], and in the third method we associated a CNN with a tracker to verify consistency between consecutive frames [18]. Despite achieving interesting results on the SEAGULL dataset [14], there are some conditions where these methods fail. Consequently, in this work we introduce a method to robustly detect objects in airborne maritime surveillance images, which are affected by glare, wakes and wave crests.

A. Contributions

The main contributions of our work can be summarized as follows:
• Characterize the applicability of convolutional LSTM to learn visual and temporal features relevant for the detection of maritime objects in airborne video sequences.
• Incorporate Domain Specific Knowledge about maritime objects' size and the number of visible objects at a given time to improve training and detection performance. In particular, we penalize predictions with large areas labeled as containing a boat.
• Analyze the effect of the time scale considered by the neural network on detection robustness.
• Compare the proposed methodology, in multiple maritime monitoring scenarios, with a mainstream detection network, YOLO [19], and also with one of our previously published methods, which uses visual features and a Multiple Hypothesis Tracker (MHT) to associate detections [18].

B. Outline of the paper

The paper is organized as follows. The next section reviews the literature on object detection, and Section III presents the network's architecture and details the ConvLSTM layer and the loss function. In Section IV, we describe the dataset that was used to train and test our method, and present the evaluation metrics and the results. In the last section, we conclude this work.

II. RELATED WORK

Some of the first applications of vehicle detection from aerial images assumed moving vehicles over land, with some authors using techniques like background subtraction [20]. Background subtraction is effective when there are static objects with features that allow the registration and alignment of consecutive images. In scenarios over the ocean, usually the most distinctive visible features (other than the objects of interest) are glare, waves and wakes, all of which are dynamic phenomena that change rapidly and hinder image alignment. Other authors assume that an object's position is limited to areas like roads [21], which is not applicable in maritime surveillance scenarios as vessels can move virtually anywhere.

Another traditional approach to vehicle detection from aerial images was to extract relevant features from image regions with potential targets, and then classify them with some machine learning technique. The features could be either general purpose, like Haar features [22] and features based on the Discrete Cosine Transform [23], or specialized ones like color and textures [24] [25]. These approaches are useful for applications with well-defined conditions; however, airborne images captured by small aircraft in maritime scenarios have objects with a large range of sizes, orientations and shapes. To face this challenge, some works have used saliency methods, i.e. algorithms that try to emulate the human visual attention mechanism [26] [27]. The mentioned approaches highlight areas that are distinct from the background but that may not correspond to the object of interest. These undesired detections are usually suppressed by using either a heuristic, like checking if a given detection persists over a given number of frames [16], or a more formal framework like a Hidden Markov Model [28].

Following the advances in computer vision and pattern recognition, maritime detection has also adopted deep learning. Maire et al. [29] have proposed the application of CNNs in a sliding window fashion for the detection of marine mammals. Their relatively shallow network architecture led to limited results. Bousetouane and Morris [30] have also used CNNs for detection in a maritime scenario. Their pipeline includes several weak detectors to compute candidate regions, extracts features learned by a neural network and then classifies them with a support vector machine. The main downside of this approach is that the first set of weak detectors relies on hand-engineered features. While this approach performs well in their case, in our scenario the targets have significant appearance variability, which makes the manual selection of features intractable.

There are other approaches, like detection on wide area imagery, which rely on networks like Fast R-CNN [31]. Wide area imagery is especially suited to this kind of network because it is usually orthorectified and the ground distance corresponding to a pixel is well characterized. This makes it easier to define anchors, which are very relevant for this family of methods. In addition to adaptations of canonical networks, there have been innovative approaches for the case of satellite images. For instance, Wang et al. have designed a network for change detection that exploits spectral and subpixel information with different weights [32]. Like the previous example, this approach also needs image pre-processing steps. In the present work, the aircraft's movement and the perspective variations hinder the applicability of these pre-processing steps.

There have been some advances that exploit the information contained in video sequences. By consecutively observing a given object in several frames, phenomena like glare and waves are discarded because their persistence is limited in time. Consequently, the correct object can be detected even if, in some frames, it gets occluded or its appearance changes dramatically. Historically, approaches like Markov Chain Monte Carlo data association [33] and MHT [34] have been used to successfully associate detections in consecutive time instants, achieving highly robust detection results. In [18], we have also used MHT to improve the results obtained with a CNN. The downside of this approach is that the movement
model is tuned by the designer and not learned from data. Additionally, the visual features and temporal features are handled separately, which means that if a given object is persistent in the image but its visual cues are not recognized by the neural network, it is not detected.

More recently, some approaches have explored data-driven learning methods for sequential data, like recurrent neural networks and, in particular, the LSTM layer [35]. The core of the LSTM layer is a memory cell that encodes knowledge about the features seen up to a given moment. This cell is able to keep or discard this knowledge thanks to three gates (input, output and forget) that control the amount of information that enters, exits and is kept in the layer. One interesting LSTM usage was suggested by Wang et al. [36], in which the LSTM, associated with a CNN, implements an attention mechanism for scene classification. The recurrent module obtains high-level features from the CNN, sequentially generates attention masks and incrementally classifies the images. Despite being a different task, this indicates that recurrent processing might improve the performance of the considered task.

An interesting approach for image-like data, presented in [37], is the use of LSTM with convolutional structure. Convolutional LSTM (ConvLSTM), as named by its authors, removes the spatial redundancy present in the application of traditional LSTMs to image-like data, much in the same way convolutional neural networks remove the spatial redundancy present in fully connected neural networks. Nonetheless, ConvLSTM retains the internal structure (with memory cells, and forget, input and output gates) that allows the common LSTM to keep memory during long periods and use it when relevant. The main difference is that the gates in LSTM are applied multiplicatively and in ConvLSTM are applied in a convolutional fashion.

In our work, the objective is to use ConvLSTM to learn temporal and visual features that improve the detection of boats in video sequences captured by an RGB camera installed on a small aircraft, in a maritime environment. The videos are part of the SEAGULL dataset [14] and illustrate realistic maritime monitoring missions. The observed boats have variable sizes and shapes, ranging from small life rafts to high-speed boats to cargo ships. Additionally, the variation in observation perspective also changes the appearance of these boats, as shown in Fig. 1. Due to the aircraft's characteristics and its flight pattern, the video contains significant apparent movement. The sequences include large amplitude and low-frequency movements as well as small amplitude and high-frequency components. Additionally, the movement is caused by both linear motion and rotations. Furthermore, the sequences were captured during sunny days, over the Atlantic Ocean. This means that, as exhibited in Fig. 1, the detector has to deal with glare and waves' white caps.

III. CONVOLUTIONAL LSTM NETWORK

A. Problem Description

The present section formally describes the problem considered in this work. We have used videos from the SEAGULL dataset (which are detailed in Subsection IV-A) to both train and test our method, as they are representative of maritime monitoring missions. In either stage, we extract short video sequences from the full-length videos and use them as our samples. Since the observed scene does not change significantly between two frames, we pick one image every $r$th frame, as displayed in Fig. 2. In the rest of the work, we will designate this separation of $r$ frames as time stride. Each of these samples $X_t$ is a 4D tensor, composed of $K$ images, i.e. $X_t = x_{t-(K \times r)}, \ldots, x_t$, where $x_t$ is the image captured at instant $t$. In Fig. 2(a), we show an example where the number of processed frames is $K = 3$ and the separation between them (time stride) is $r = 3$. In this case, only images $x_t$, $x_{t-3}$ and $x_{t-6}$, represented in a darker tone, are considered and the lighter images are discarded.

Fig. 2. Example of (a) input $X_t$, (b) ground truth $Y_t$ and (c) prediction $\hat{Y}_t$ in case the sequence has a time span corresponding to 7 frames but only 3 frames are processed. Images are represented in a lighter tone to indicate that they are discarded.

All videos were manually labeled, marking the location (bounding box) of all objects of interest, and therefore each sequence $X_t$ has a corresponding ground truth $Y_t$. Analogously to the video sequences, each ground truth $Y_t$ incorporates $K$ ground truth maps, separated by $r$ frames, $Y_t = y_{t-(K \times r)}, \ldots, y_t$. As shown in Fig. 2(b), a ground truth map $y_t$ consists of a binary image where pixels with ones indicate locations with objects of interest. Similarly to the video sequences, in the presented example we only consider the ground truths $y_t$, $y_{t-3}$ and $y_{t-6}$. All images $x$ and labels $y$ were resized to a resolution of 720 × 720 and 300 × 300, respectively. The neural network's goal is to obtain an estimate $\hat{Y}_t$ similar to the ground truth, as displayed in Fig. 2(c).

B. Convolutional LSTM

The neural network used in the present work has different types of layers, but we will focus on the impact of one particular type on the detection task. This type of layer is the Convolutional Long Short-Term Memory (ConvLSTM). We have followed an approach similar to Wingjian et al. [37], using a modified version of the traditional LSTM, where the input-to-state and state-to-state multiplications are replaced by convolutions. Similarly to the traditional LSTM, its convolutional alternative contains mechanisms that control the amount of information that is received and output by the network. These mechanisms are the input gate $i_t$ and the output gate $o_t$, and their equations are

$$i_t = \sigma(W_{ZI} * z_t + W_{HI} * \hat{y}_{t-1} + W_{CI} \circ c_{t-1} + b_I) \qquad (1)$$
$$o_t = \sigma(W_{ZO} * z_t + W_{HO} * \hat{y}_{t-1} + W_{CO} \circ c_{t-1} + b_O). \qquad (2)$$

There are another two important tools in this layer: the cell state and the forget gate. The former keeps track of the information processed in previous time instants and the latter controls how the cell state is updated. The cell state is computed as

$$c_t = f_t \circ c_{t-1} + i_t \circ \tanh(W_{ZC} * z_t + W_{HC} * \hat{y}_{t-1} + b_C) \qquad (3)$$

and the forget gate as

$$f_t = \sigma(W_{ZF} * z_t + W_{HF} * \hat{y}_{t-1} + W_{CF} \circ c_{t-1} + b_F). \qquad (4)$$

In these equations, $\sigma$ symbolizes the sigmoid function and $b_I$, $b_O$ and $b_F$ are the bias terms of the respective gates. The $W$ terms represent the weight matrices, e.g. $W_{ZI}$ is the weight matrix that connects the input feature map $z_t$ to the input gate.

Finally, the output of this layer $\hat{y}_t$ is computed as

$$\hat{y}_t = o_t \circ \tanh(c_{t-1}). \qquad (5)$$

In these expressions, $*$ represents the convolution operator and $\circ$ is the element-wise multiplication. Fig. 3 illustrates the operations that are described in equations 1 to 5.

[Fig. 3 diagram labels: input $z_t$, previous output $\hat{y}_{t-r}$, cell states $c_{t-1}$ and $c_t$, three $\sigma$ blocks and one tanh block, output $\hat{y}_t$.]
Fig. 3. Diagram of the ConvLSTM layer. Matrices W and bias b are omitted to improve clarity.

Both the input and output of this layer are 4D tensors, where two dimensions represent space, a third corresponds to time and the fourth represents the number of channels. In our proposed network, we use the output of the ConvLSTM directly as our estimate $\hat{y}_t$. As depicted in Fig. 4, the input $z_t$ of the ConvLSTM layer at instant $t$ has the same resolution and number of frames as the output $\hat{y}_t$, differing only in the number of channels. This input of the ConvLSTM layer is composed of features computed by purely convolutional layers at different depths, as presented in the next subsection, resulting in a sequence of images with four channels. The output is a sequence of single-channel images, one for each time instant.

C. Overall Architecture

The network's architecture¹ was chosen to easily assess the impact of learning temporal features on detection. With this goal in mind, we decided to use a popular neural network (VGG16 [38]) as a feature extractor. This choice is intended to make performance comparison easier, since the feature extractor is pre-trained on ImageNet, which also shortens training times. We have used features computed at different levels of the network, in a similar fashion to works like Ronneberger et al. [39]. In order to use features from the different levels, we need to adjust their resolution, which is done using an upsampling layer that performs nearest neighbour interpolation. We also reduce their number of channels before concatenation, using 2D convolutional layers. The details of these convolutional layers as well as the ConvLSTM are presented in Table I. In our case, the ConvLSTM is the last layer in the pipeline. Due to this fact, the output of the ConvLSTM, which in many works is denoted as the hidden state, happens to be the output of the network and is therefore the predicted map $\hat{Y}_t$.

As already mentioned, this network's purpose was to evaluate the influence of temporal features, therefore it is intentionally simple. While we use only one ConvLSTM layer, more could be added. In particular, the feature extraction section, which is now carried out by the pre-trained VGG16, might be replaced by recurrent layers. In [37], the authors obtained better results using deeper neural networks composed only of ConvLSTM layers but, in practice, we found those configurations harder to train due to memory limitations and training speed.

D. Introducing Domain-Specific Knowledge

In maritime monitoring missions, we know the flight altitude and the camera parameters, therefore we have estimates of the area being observed. Additionally, we also have estimates of the maximum size of boats and the number of boats in the image at a given time. In our method, we use this information to guide the training process by incorporating this domain-specific knowledge into the loss function. The loss function is composed of two parts; we name the domain-specific part area loss, and it depends on the average area labeled as containing a boat. The average area is written as

$$\bar{A}_{boat}(\hat{Y}_t) = \frac{1}{K} \sum_{k=1}^{K} \sum_{m=1}^{M} \hat{y}^{m}_{t-(k \times r)} \qquad (6)$$

where $\hat{y}^m_t$ is the $m$th pixel of prediction map $\hat{y}_t$, $M$ is the number of pixels in each image and $K$ is the number of images

¹ The implementation of the architecture described in this section is included in the following repository: https://bitbucket.org/gccruz/convlstm detection.git
[Fig. 4: network architecture diagram. Labels recoverable from the figure: 600 × 600 × 3, 300 × 300 × 128, MaxPooling, UpSampling, $z_t$ 300 × 300 × 4, 300 × 300 × 1, $\hat{y}_{t-r}$, $\hat{y}_{t+r}$.]

[Fig. 5 plot; axis label: $\bar{A}_{boat}(\hat{Y}_t)$.]
Fig. 5. Loss function $L_{area}$.

[Fig. 6 diagram labels: h = 100 (m), surface.]
Fig. 6. Diagram of one unfavorable condition for the area loss, where a given boat might be closer to the sensor mounted on the aircraft.

in the map $y_t$. If each of those pixels is correctly labeled with value one and assuming the boat is visible throughout the sequence, then $\bar{A}_{boat}(\hat{Y}_t) = 4000$ and there is no penalization.
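The average area of Eq. (6) and the example above can be sketched as follows. Only `average_boat_area` mirrors an expression given in the text. The exact shape of $L_{area}$ (Fig. 5) is not recoverable here, so `area_penalty` is a hypothetical hinge-style stand-in, and its `max_area` parameter is an assumption introduced for illustration, not a value or formula taken from this work.

```python
def average_boat_area(pred_maps):
    """Eq. (6): sum of all pixel values over the K prediction maps, divided by K."""
    k = len(pred_maps)
    total = sum(px for m in pred_maps for row in m for px in row)
    return total / k

def area_penalty(avg_area, max_area):
    """Hypothetical penalty (assumption): zero up to max_area, then linear growth."""
    return max(0.0, avg_area - max_area) / max_area

# Two 100 x 100 maps, each with a 40 x 100 block (4000 pixels) labeled as boat.
maps = [[[1.0 if r < 40 else 0.0 for _ in range(100)] for r in range(100)]
        for _ in range(2)]
avg = average_boat_area(maps)
print(avg, area_penalty(avg, max_area=4000.0))  # 4000.0 0.0
```

Because the penalty depends only on the summed prediction values, it adds no labeling requirements beyond an estimate of the largest expected boat area in pixels.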
become smaller. When comparing GRU with ConvLSTM, the latter performed better in both tests. One of the main causes is the relatively reduced number of parameters allowed by the convolutional structure, which makes training easier.

Our method, the network with a ConvLSTM layer, trained on sequences with a time span of 40 frames and a time stride of 5 frames, uses learned visual and temporal features, which have allowed it to obtain the best performance in both tests. It is important to note that the network that was trained using domain-specific knowledge achieved a slightly better result in the case of the video with a strong wake. Despite being a small gain, it is valuable to verify that using a loss that penalizes the prediction of large areas led to better performance in discarding persistent phenomena like wakes. The network configuration using ConvLSTM and trained using domain-specific knowledge is used in the next subsection for comparison with other already published detectors, to demonstrate that learning temporal features produces benefits over those approaches.

D. Comparison with other detectors in SEAGULL dataset

The benchmarks that we used were a standard detection neural network (YOLO 9000 [19]) and a neural network associated with a Multiple Hypothesis Tracker, detectnet+MHT [18]. Both alternatives were retrained on the same dataset as our method.

YOLO, the standard neural network, is not the top-scoring detector in the literature but presents one of the best compromises between speed and performance, and is therefore an adequate candidate to process a stream in real time. This network is composed of convolutional layers that predict a grid, where each cell may have multiple bounding boxes, each with its coordinates, confidence score and class probability.

The second method uses a detection neural network simpler than YOLO but explores time coherence among detections in consecutive instants. The network creates a grid that indicates whether an object of interest is present in each cell and computes a regression for the location of a bounding box. The bounding boxes created at each time instant are then used to create a level of a graph, where each node corresponds to a detection, weighted by the detection score. Nodes of consecutive levels are connected by edges, which are weighted based on the Euclidean distance between bounding boxes. MHT then computes the probability that a given tracklet corresponds to a correct detection by searching combinations of nodes and corresponding edges with high scores.

The output of both compared methods is bounding boxes, therefore we adapted our method to produce the same type of output. Starting with the prediction map $\hat{Y}_t$ that was already mentioned, we obtain binary maps by thresholding $\hat{Y}_t$. Afterwards, we compute the bounding box for each of the blobs present in the binary image. This technique is not as advanced as the regression layers used in the compared methods, but it allows us to evaluate the performance without adding layers that would need to be trained with the rest of the network and might conceal the behavior of the ConvLSTM layer.

Unlike the previous evaluation, to compare the method presented in this work with other detectors, we did not use binary cross-entropy but an evaluation more common for detectors. We have used the evaluation metric that was adopted in [18], to keep the same scoring process. This metric itself was adapted from Dollár et al. [41]. To validate a detection, the mentioned authors required an intersection over union (IoU) bigger than 50% and defined it as

$$IOU = \frac{\text{area}\{\bar{D}_t \cap G_t\}}{\text{area}\{\bar{D}_t \cup G_t\}}, \qquad (11)$$

where $\bar{D}_t$ and $G_t$ denote the detections and the ground truth. While this is adequate for many applications, just as in [18], we believe that in the present scenario 10% is enough. Given this matching criterion, two quantities are computed: Precision and Recall. These are respectively defined as Precision = #TP / (#TP + #FP) and Recall = #TP / (#TP + #FN)⁶, where TP, FP and FN denote true positives, false positives and false negatives. With Precision and Recall, we have obtained the results presented in Fig. 7. These plots show the behavior of the detectors through their operating range, but it is useful to have a quantity to summarize and compare the complete range. We selected the Area Under the Curve (AUC), which is computed as the sum of Precision $p(n)$, at every possible threshold with index $n$, times Recall's variation $\Delta r(n)$ between these points, i.e.,

$$AUC = \sum_{n} p(n)\,\Delta r(n). \qquad (12)$$

The AUC obtained with each method in each video is presented in the corresponding legend.

When inspecting the results, it is worth noting that, independently of the method, there is a substantial difference in performance between NEAR, WIDE and the rest of the sequences. In the mentioned sequences there are some challenges, but the appearance of the boats is more or less constant. The three other sequences have much more challenging conditions, like two boats close to each other that are detected as one, a very small life raft and a wake that cloaks the boat.

From the first four plots, we can verify that the behavior of the three methods is similar. Typically detectnet+MHT has a high precision but achieves a smaller recall. This is caused by the MHT, which discards many false positives but, when more demanding conditions occur (like abrupt movement), also discards true positives. YOLO has a smoother decrease in Precision as Recall increases. Our proposed method shows a comparable performance and the AUC is similar in the first three videos. In the fourth video (NEAR), ConvLSTM falls short of the other two approaches. This inferior performance of the proposed method on this video is not caused by lack of sensitivity of the detector but rather by the inadequate size of the bounding boxes. In several instants, the difference in size between the ConvLSTM detection box, shown in red in Fig. 8(d), and the object leads to false negatives, i.e. the IoU

⁶ False positives and false negatives in this paragraph refer to incorrect bounding boxes and to missing bounding boxes, respectively. This differs from the false positives and false negatives that were used to compute the Error Rate mentioned before.
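The scoring quantities of Eqs. (11) and (12) can be sketched in a few lines of Python. This is an illustrative sketch, not the evaluation code used for Fig. 7: the box format `(x1, y1, x2, y2)`, the function names and the handling of empty denominators are assumptions made for this example.

```python
def iou(box_a, box_b):
    """Eq. (11) for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def is_match(detection, ground_truth, threshold=0.10):
    """A detection is valid when its IoU exceeds the threshold (10% in this work)."""
    return iou(detection, ground_truth) > threshold

def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 1.0  # assumption: no detections
    recall = tp / (tp + fn) if tp + fn else 0.0     # assumption: no ground truth
    return precision, recall

def auc(points):
    """Eq. (12): sum of p(n) * delta_r(n) over (precision, recall) pairs,
    ordered by increasing recall, with recall starting at zero."""
    area, prev_recall = 0.0, 0.0
    for precision, recall in points:
        area += precision * (recall - prev_recall)
        prev_recall = recall
    return area

overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))  # 50 / 150
print(round(overlap, 3), is_match((0, 0, 10, 10), (5, 0, 15, 10)))
print(auc([(1.0, 0.5), (0.5, 1.0)]))  # 0.75
```

In this example the two boxes overlap by one third, so the detection is accepted under the 10% criterion but would be rejected under the 50% criterion of [41], which illustrates how the lower threshold tolerates loosely fitted bounding boxes.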
typically correspond to noise. Thus, when the boat enters the field of view, our method does not immediately mark it as a detection; it requires some time instants to do so. The decrease in Recall, caused by the boat leaving and entering the image, also affects detectnet+MHT, which also needs several time instants with the object of interest in the field of view to consider a detection valid.

On the last video sequence, the boat is moving at high speed, causing a wake several times bigger than the vessel. This condition affects the performance dramatically with both

[Fig. 7: five Precision-Recall plots, panels (a) to (e), with Recall on the horizontal axis. Legend entries recoverable from the extraction: panel (d) YOLO 98%, detectnet+MHT 88%, ConvLSTM 76%; panel (e) ConvLSTM 18%, YOLO 3%, detectnet+MHT 3%.]
Fig. 7. Results of evaluation using a traditional detection metric [41], with an overlap threshold of 10%. Results were obtained for sequence (a) SUSP, (b) SAR, (c) WIDE, (d) NEAR and (e) WAKE.

1) Testing on MARDCT dataset: The dataset that was chosen for these additional tests was MARDCT [11]. This dataset was gathered by the ARGOS system that monitors the Venice Grand Canal. ARGOS' cameras are installed in buildings; consequently, most objects of interest are very close to the camera and some urban elements, like walls, are present. Due to this fact, we had to carefully select videos with some similarity to our scenario. The selected videos were wake-1, wake-2 and wake-3. The properties of the three videos differ from the SEAGULL dataset: the resolution is smaller, the line of sight from the camera to the objects of interest is almost parallel to the water surface and the type of boats is also different. The apparent movement of the image is also distinct, with long periods with movement caused only by the boats and short periods with pan movement, causing severe blur.

It is important to note that, in this case, there is no information about the distance from the camera to the boats, hence there is no guarantee that the Domain-Specific Knowledge included at training time is beneficial. Despite this shortcoming, we applied the same three detectors that were already used in the previous subsection without retraining. For brevity, we condensed the results as AUC in Table VI.

                     Video sequences
Method           wakes-1   wakes-2   wakes-3
detectnet+MHT        0        58        34
YOLO                 8        97         4
ConvLSTM            11        77        62
TABLE VII
ASSESSMENT OF COMPUTATIONAL PERFORMANCE OF THE DIFFERENT

Method           Average execution time (s)   Number of parameters (approx.)
detectnet+MHT              0.25                       6 × 10^6
YOLO                       0.03                      50 × 10^6
ConvLSTM                   0.19                      14 × 10^6

tract features from several images before feeding the information into the recurrent layer (since we have used a time span of 40 and a time stride of 5, we process 8 images). When considering the number of parameters, YOLO is the most demanding option and detectnet+MHT is the lightest alternative. Our method obtains an intermediate value, with about three times fewer parameters than YOLO. Since nowadays there are several embedded applications of YOLO, even in FPGAs [43], there are also good prospects of successfully implementing our method on the same type of platforms.

V. CONCLUSIONS

This paper presents a method to learn not only spatial features but also temporal features present in video sequences. The usage of temporal features attempts to improve the detection of maritime objects in video sequences, which contain strong distractors like glare and wakes. This method is composed of two main parts: one spatial feature extractor based on the VGG network and one recurrent layer, the Convolutional LSTM. The latter is the key component to learn temporal features, since it has a memory cell that keeps or forgets information according to the situation. Unlike traditional LSTMs, some operations in this layer are applied convolutionally, which removes significant spatial redundancy.

This method is evaluated with two kinds of tests. The first test investigates which configuration (number of frames, length and time stride) produces the best binary maps representing the position of boats, and then compares the proposed approach with a traditional LSTM and with the purely convolutional network (ConvNet). The comparison shows that there is a performance gain of the proposed approach over the other two. The second test evaluates the quality of the generated bounding boxes against two detectors. The performance of the three methods is comparable in four videos out of five. The fifth video, however, is very challenging and our proposed method achieves a score several times higher.

Given the obtained results (especially with the second test), we conclude that learning temporal features is useful for maritime detection in videos captured by small aircraft. In the future, we would like to explore more configurations, in particular stacking more recurrent layers and also extending the time span considered by ConvLSTM. As shown in the SEAGULL dataset, some videos contain objects of interest with faint visual features, and the temporal features learned by ConvLSTM can improve the knowledge about a given scenario. However, for conditions where the object of interest is clearly visible, and especially when its size is larger, other detectors can generate better detections.

With these considerations in mind, a real-world application could benefit from using a combination of detectors that might be selected according to the mission or scenario. Another path that we would like to investigate in the future is the use of contextual information available on board, like altitude, the aircraft's attitude and the sensor's parameters, to improve detection. One possibility is the use of these parameters to create an additional input channel, where each pixel contains the slant range from the sensor to the observed area. The range information would prevent detections with large areas in regions that are very far or detections with very small areas in regions that are very close.

ACKNOWLEDGMENTS

This work was partially supported by FCT project VOAMAIS (02/SAICT/2017/31172). The authors would like to thank all the VisLab and Portuguese Air Force Research Center team that allowed the collection and labeling of the video sequences.

REFERENCES

[1] I. M. Association et al., "International shipping facts and figures–information resources on trade, safety, security, and the environment," London: International Maritime Association, 2011.
pp. 4511–4523, 2014.