Spatial As Deep: Spatial CNN For Traffic Scene Understanding

Uploaded by

nwuxv

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

127 views8 pages

Spatial As Deep: Spatial CNN For Traffic Scene Understanding

Uploaded by

nwuxv

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 8

Spatial As Deep: Spatial CNN for Traffic Scene Understanding

Xingang Pan1 , Jianping Shi2 , Ping Luo1 , Xiaogang Wang1 , and Xiaoou Tang1
1
The Chinese University of Hong Kong 2 SenseTime Group Limited
{px117, pluo, xtang}@ie.cuhk.edu.hk, shijianping@sensetime.com, xgwang@ee.cuhk.edu.hk
arXiv:1712.06080v1 [cs.CV] 17 Dec 2017

Abstract
Convolutional neural networks (CNNs) are usually built by
stacking convolutional operations layer-by-layer. Although
CNN has shown strong capability to extract semantics from
raw pixels, its capacity to capture spatial relationships of pix-
els across rows and columns of an image is not fully ex-
plored. These relationships are important to learn semantic
objects with strong shape priors but weak appearance coher-
ences, such as traffic lanes, which are often occluded or not
even painted on the road surface as shown in Fig. 1 (a). In
this paper, we propose Spatial CNN (SCNN), which general-
Figure 1: Comparison between CNN and SCNN in (a) lane
izes traditional deep layer-by-layer convolutions to slice-by-
slice convolutions within feature maps, thus enabling mes- detection and (b) semantic segmentation. For each example,
sage passings between pixels across rows and columns in a from left to right are: input image, output of CNN, output
layer. Such SCNN is particular suitable for long continuous of SCNN. It can be seen that SCNN could better capture the
shape structure or large objects, with strong spatial relation- long continuous shape prior of lane markings and poles and
ship but less appearance clues, such as traffic lanes, poles, and fix the disconnected parts in CNN.
wall. We apply SCNN on a newly released very challenging
traffic lane detection dataset and Cityscapse dataset1 . The re-
sults show that SCNN could learn the spatial relationship for appearance clues like lane markings and poles, which have
structure output and significantly improves the performance.
We show that SCNN outperforms the recurrent neural net-
long continuous shape and might be occluded. For instance,
work (RNN) based ReNet and MRF+CNN (MRFNet) in the in the first example in Fig. 1 (a), the car at the right side fully
lane detection dataset by 8.7% and 4.6% respectively. More- occludes the rightmost lane marking.
over, our SCNN won the 1st place on the TuSimple Bench- Although CNN based methods (Krizhevsky, Sutskever,
mark Lane Detection Challenge, with an accuracy of 96.53%. and Hinton 2012; Long, Shelhamer, and Darrell 2015) have
pushed scene understanding to a new level thanks to the
strong representation learning ability. It is still not perform-
Introduction ing well for objects having long structure region and could
In recent years, autonomous driving has received much at- be occluded, such as the lane markings and poles shown in
tention in both academy and industry. One of the most chal- the red bounding boxes in Fig. 1. However, humans can eas-
lenging task of autonomous driving is traffic scene under- ily infer their positions and fill in the occluded part from the
standing, which comprises computer vision tasks like lane context, i.e., the viewable part.
detection and semantic segmentation. Lane detection helps To address this issue, we propose Spatial CNN (SCNN),
to guide vehicles and could be used in driving assistance a generalization of deep convolutional neural networks to
system (Urmson et al. 2008), while semantic segmentation a rich spatial level. In a layer-by-layer CNN, a convolution
provides more detailed positions about surrounding objects layer receives input from the former layer, applies convolu-
like vehicles or pedestrians. In real applications, however, tion operation and nonlinear activation, and sends result to
these tasks could be very challenging considering the many the next layer. This process is done sequentially. Similarly,
harsh scenarios, including bad weather conditions, dim or SCNN views rows or columns of feature maps as layers and
dazzle light, etc. Another challenge of traffic scene under- applies convolution, nonlinear activation, and sum opera-
standing is that in many cases, especially in lane detection, tions sequentially, which forms a deep neural network. In
we need to tackle objects with strong structure prior but less this way information could be propagated between neurons
Copyright c 2018, Association for the Advancement of Artificial in the same layer. It is particularly useful for structured ob-
Intelligence (www.aaai.org). All rights reserved. ject such as lanes, poles, or truck with occlusions, since the
1
Code is available at https://github.com/XingangPan/SCNN spatial information can be reinforced via inter layer propa-
Figure 2: (a) Dataset examples for different scenarios. (b) Proportion of each scenario.

gation. As shown in Fig. 1, in cases where CNN is discon- tations for lane/lane markings, but have merely hundreds
tinuous or is messy, SCNN could well preserve the smooth- of images, too small for deep learning methods. Caltech
ness and continuity of lane markings and poles. In our ex- Lanes Dataset (Aly 2008) and the recently released TuSim-
periment, SCNN significantly outperforms other RNN or ple Benchmark Dataset (TuSimple 2017) consists of 1224
MRF/CRF based methods, and also gives better results than and 6408 images with annotated lane markings respectively,
the much deeper ResNet-101 (He et al. 2016). while the traffic is in a constrained scenario, which has
Related Work. For lane detection, most existing algo- light traffic and clear lane markings. Besides, none of these
rithms are based on hand-crafted low-level features (Aly datasets annotates the lane markings that are occluded or are
2008; Son et al. 2015; Jung, Youn, and Sull 2016), limiting unseen because of abrasion, while such lane markings can be
there capability to deal with harsh conditions. Only Huval inferred by human and is of high value in real applications.
et al. (2015) gave a primacy attempt adopting deep learn- To collect data, we mounted cameras on six different ve-
ing in lane detection but without a large and general dataset. hicles driven by different drivers and recorded videos during
While for semantic segmentation, CNN based methods have driving in Beijing on different days. More than 55 hours of
become mainstream and achieved great success (Long, Shel- videos were collected and 133,235 frames were extracted,
hamer, and Darrell 2015; Chen et al. 2017). which is more than 20 times of TuSimple Dataset. We have
There have been some other attempts to utilize spatial in- divided the dataset into 88880 for training set, 9675 for vali-
formation in neural networks. Visin et al. (2015) and Bell dation set, and 34680 for test set. These images were undis-
et al. (2016) used recurrent neural networks to pass infor- torted using tools in (Scaramuzza, Martinelli, and Siegwart
mation along each row or column, thus in one RNN layer 2006) and have a resolution of 1640 × 590. Fig. 2 (a) shows
each pixel position could only receive information from the some examples, which comprises urban, rural, and highway
same row or column. Liang et al. (2016a; 2016b) proposed scenes. As one of the largest and most crowded cities in the
variants of LSTM to exploit contextual information in se- world, Beijing provides many challenging traffic scenarios
mantic object parsing, but such models are computation- for lane detection. We divided the test set into normal and
ally expensive. Researchers also attempted to combine CNN 8 challenging categories, which correspond to the 9 exam-
with graphical models like MRF or CRF, in which message ples in Fig. 2 (a). Fig. 2 (b) shows the proportion of each
pass is realized by convolution with large kernels (Liu et scenario. It can be seen that the 8 challenging scenarios ac-
al. 2015; Tompson et al. 2014; Chu et al. 2016). There are count for most (72.3%) of the dataset.
three advantages of SCNN over these aforementioned meth- For each frame, we manually annotate the traffic lanes
ods: in SCNN, (1) the sequential message pass scheme is with cubic splines. As mentioned earlier, in many cases lane
much more computational efficiency than traditional dense markings are occluded by vehicles or are unseen. In real ap-
MRF/CRF, (2) the messages are propagated as residual, plications it is important that lane detection algorithms could
making SCNN easy to train, and (3) SCNN is flexible and estimate lane positions from the context even in these chal-
could be applied to any level of a deep neural network. lenging scenarios that occur frequently. Therefore, for these
cases we still annotate the lanes according to the context, as
Spatial Convolutional Neural Network shown in Fig. 2 (a) (2)(4). We also hope that our algorithm
could distinguish barriers on the road, like the one in Fig. 2
Lane Detection Dataset (a) (1). Thus the lanes on the other side of the barrier are not
In this paper, we present a large scale challenging dataset annotated. In this paper we focus our attention on the detec-
for traffic lane detection. Despite the importance and dif- tion of four lane markings, which are paid most attention to
ficulty of traffic lane detection, existing datasets are either in real applications. Other lane markings are not annotated.
too small or too simple, and a large public annotated bench-
mark is needed to compare different methods (Bar Hillel Spatial CNN
et al. 2014). KITTI (Fritsch, Kuhnl, and Geiger 2013) and Traditional methods to model spatial relationship are based
CamVid (Brostow et al. 2008) contains pixel level anno- on Markov Random Fields (MRF) or Conditional Ran-
Iter N times
C n class Add
W
H

CNN Softmax Softmax

Input Top hidden layer Unary Message Compatibility Final

Output
Potentials Passing Transform Prediction
(a)
C w n class
W
H

CNN

Final
Input Top hidden layer SCNN_D SCNN_U SCNN_R SCNN_L Output
Prediction
(b)

Figure 3: (a) MRF/CRF based method. (b) Our implementation of Spatial CNN. MRF/CRF are theoretically applied to unary
potentials whose channel number equals to the number of classes to be classified, while SCNN could be applied to the top
hidden layers with richer information.

dom Fields (CRF) (Krähenbühl and Koltun 2011). Recent with C kernels of size C ×w, where w is the kernel width. In
works (Zheng et al. 2015; Liu et al. 2015; Chen et al. 2017) a traditional CNN the output of a convolution layer is then
to combine them with CNN all follow the pipeline of Fig. 3 fed into the next layer, while here the output is added to the
(a), where the mean field algorithm can be implemented with next slice to provide a new slice. The new slice is then sent to
neural networks. Specifically, the procedure is (1) Normal- the next convolution layer and this process would continue
ize: the output of CNN is viewed as unary potentials and is until the last slice is updated.
normalized by the Softmax operation, (2) Message Passing, Specifically, assume we have a 3-D kernel tensor K with
which could be realized by channel wise convolution with element Ki,j,k denoting the weight between an element in
large kernels (for dense CRF, the kernel size would cover channel i of the last slice and an element in channel j of
the whole image and the kernel weights are dependent on the the current slice, with an offset of k columes between two
input image), (3) Compatibility Transform, which could be elements. Also denote the element of input 3-D tensor X as
implemented with a 1 × 1 convolution layer, and (4) Adding Xi,j,k , where i, j, and k indicate indexes of channel, row,
unary potentials. This process is iterated for N times to give and column respectively. Then the forward computation of
the final output. SCNN is:
It can be seen that in the message passing process of 
traditional methods, each pixel receives information from Xi,j,k ,
 PP 0
j=1
all other pixels, which is very computational expensive and Xi,j,k = Xi,j,k + f
0 Xm, j − 1, k + n − 1 (1)
m n
hard to be used in real time tasks as in autonomous driving.

×Km,i,n , j = 2, 3, ..., H

For MRF, the large convolution kernel is hard to learn and
usually requires careful initialization (Tompson et al. 2014; where f is a nonlinear activation function as ReLU. The X
Liu et al. 2015). Moreover, these methods are applied to the with superscript 0 denotes the element that has been updated.
output of CNN, while the top hidden layer, which comprises Note that the convolution kernel weights are shared across
richer information, might be a better place to model spatial all slices, thus SCNN is a kind of recurrent neural network.
relationship. Also note that SCNN has directions. In Fig. 3 (b), the four
To address these issues, and to more efficiently learn the ’SCNN’ module with suffix ’D’, ’U’, ’R’, ’L’ denotes SCNN
spatial relationship and the smooth, continuous prior of lane that is downward, upward, rightward, and leftward respec-
markings, or other structured object in the driving scenario, tively.
we propose Spatial CNN. Note that the ’spatial’ here is
not the same with that in ’spatial convolution’, but denotes Analysis
propagating spatial information via specially designed CNN There are three main advantages of Spatial CNN over tradi-
structure. tional methods, which are concluded as follows.
As shown in the ’SCNN D’ module of Fig. 3 (b), consid- (1) Computational efficiency. As show in Fig. 4, in dense
ering a SCNN applied on a 3-D tensor of size C × H × W , MRF/CRF each pixel receives messages from all other pix-
where C, H, and W denote the number of channel, rows, and els directly, which could have much redundancy, while in
columns respectively. The tensor would be splited into H SCNN message passing is realized in a sequential propaga-
slices, and the first slice is then sent into a convolution layer tion scheme. Specifically, assume a tensor with H rows and
conv1_1 Conv w3 c64

HConv

LargeFOV
fc6
h4 w3 c1024
Conv
(a) (b) fc7 w1 c128
Conv CNN
fc8 w1 c5
Figure 4: Message passing directions in (a) dense MRF/CRF
Softmax
and (b) Spatial CNN (rightward). For (a), only message 0.02 0.99 0.99 0.99
passing to the inner 4 pixels are shown for clearance.
Interp ×8 AvgPool 2×2

FC c128
FC c4
W columns, then in dense MRF/CRF, there is message pass Sigmoid
between every two of the W H pixels. For niter iterations, Spatial cross entropy loss 0111
the number of message passing is niter W 2 H 2 . In SCNN, Cross entropy loss
(a) (b)
each pixel only receive information from w pixels, thus the
number of message passing is ndir W Hw, where ndir and w
denotes the number of propagation directions in SCNN and Figure 5: (a) Training model, (b) Lane prediction process.
the kernel width of SCNN respectively. niter could range ’Conv’,’HConv’, and ’FC’ denotes convolution layer, atrous
from 10 to 100, while in this paper ndir is set to 4, cor- convolution layer (Chen et al. 2017), and fully connected
responding to 4 directions, and w is usually no larger than layer respectively. ’c’, ’w’, and ’h’ denotes number of output
10 (in the example in Fig. 4 (b) w = 3). It can be seen channels, kernel width, and ’rate’ for atrous convolution.
that for images with hundreds of rows and columns, SCNN
could save much computations, while each pixel still could
tial weights of the first 13 convolution layers are copied from
receive messages from all other pixels with message propa-
VGG16 (Simonyan and Zisserman 2015) trained on Ima-
gation along 4 directions.
geNet (Deng et al. 2009). All experiments are implemented
(2) Message as residual. In MRF/CRF, message passing
on the Torch7 (Collobert, Kavukcuoglu, and Farabet 2011)
is achieved via weighted sum of all pixels, which, according
framework.
to the former paragraph, is computational expensive. And
recurrent neural network based methods might suffer from Lane Detection
gradient descent (Pascanu, Mikolov, and Bengio 2013), con-
sidering so many rows or columns. However, deep residual Lane detection model Unlike common object detection
learning (He et al. 2016) has shown its capability to easy task that only requires bounding boxes, lane detection re-
the training of very deep neural networks. Similarly, in our quires precise prediction of curves. A natural idea is that the
deep SCNN messages are propagated as residual, which is model should output probability maps (probmaps) of these
the output of ReLU in Eq.(1). Such residual could also be curves, thus we generate pixel level targets to train the net-
viewed as a kind of modification to the original neuron. works, like in semantic segmentation tasks. Instead of view-
As our experiments will show, such message pass scheme ing different lane markings as one class and do clustering
achieves better results than LSTM based methods. afterwards, we want the neural network to distinguish dif-
(3) Flexibility Thanks to the computational efficiency of ferent lane markings on itself, which could be more robust.
SCNN, it could be easily incorporated into any part of a Thus these four lanes are viewed as different classes. More-
CNN, rather than just output. Usually, the top hidden layer over, the probmaps are then sent to a small network to give
contains information that is both rich and of high semantics, prediction on the existence of lane markings.
thus is an ideal place to apply SCNN. Typically, Fig. 3 shows During testing, we still need to go from probmaps to
our implementation of SCNN on the LargeFOV (Chen et al. curves. As shown in Fig.5 (b), for each lane marking whose
2017) model. SCNNs on four spatial directions are added existence value is larger than 0.5, we search the correspond-
sequentially right after the top hidden layer (’fc7’ layer) to ing probmap every 20 rows for the position with the high-
introduce spatial message propagation. est response. These positions are then connected by cubic
splines, which are the final predictions.
As shown in Fig.5 (a), the detailed differences between
Experiment our baseline model and LargeFOV are: (1) the output chan-
We evaluate SCNN on our lane detection dataset and nel number of the ’fc7’ layer is set to 128, (2) the ’rate’ for
Cityscapes (Cordts et al. 2016). In both tasks, we train the the atrous convolution layer of ’fc6’ is set to 4, (3) batch nor-
models using standard SGD with batch size 12, base learn- malization (Ioffe and Szegedy 2015) is added before each
ing rate 0.01, momentum 0.9, and weight decay 0.0001. The ReLU layer, (4) a small network is added to predict the ex-
learning rate policy is ”poly” with power and iteration num- istence of lane markings. During training, the line width of
ber set to 0.9 and 60K respectively. Our models are modified the targets is set to 16 pixels, and the input and target images
based on the LargeFOV model in (Chen et al. 2017). The ini- are rescaled to 800 × 288. Considering the imbalanced label
and w = 9 gives a satisfactory result, which surpasses the
baseline by a significant margin 8.4% and 3.2% correspond-
ing to different IoU threshold.

Figure 6: Evaluation based on IoU. Green lines denote Table 2: Experimental results on SCNN with different kernel
ground truth, while blue and red lines denote TP and FP re- widths.
spectively. Kernel width w 1 3 5 7 9 11
F1 (0.3) 78.5 79.5 80.2 80.5 80.9 80.6
F1 (0.5) 66.3 68.9 70.4 71.2 71.6 71.7
between background and lane markings, the loss of back-
ground is multiplied by 0.4.
(3) Spatial CNN on different positions. As mentioned ear-
Evaluation In order to judge whether a lane marking is lier, SCNN could be added to any place of a neural network.
successfully detected, we view lane markings as lines with Here we consider the SCNN DURL model applied on (1)
widths equal to 30 pixel and calculate the intersection- output and (2) the top hidden layer, which correspond to
over-union (IoU) between the ground truth and the predic- Fig. 3. The results in Table. 3 indicate that the top hidden
tion. Predictions whose IoUs are larger than certain thresh- layer, which comprises richer information than the output,
old are viewed as true positives (TP), as shown in Fig. 6. turns out to be a better position to apply SCNN.
Here we consider 0.3 and 0.5 thresholds corresponding to
loose and strict evaluations. Then we employ F-measure =
(1 + β 2 ) β 2Precision Recall
Precision+Recall as the final evaluation index, where Table 3: Experimental results on spatial CNN at different
Precision = T P +F P and Recall = T PT+F
TP P positions, with w = 9.
N . Here β is set
to 1, corresponding to harmonic mean (F1-measure). Position Output Top hidden layer

Ablation Study In section 2.2 we propose Spatial CNN to F1 (0.3) 79.9 80.9
F1 (0.5) 68.8 71.6
enable spatial message propagation. To verify our method,
we will make detailed ablation studies in this subsection.
Our implementation of SCNN follows that shown in Fig. 3. (4) Effectiveness of sequential propagation. In our SCNN,
(1) Effectiveness of multidirectional SCNN. Firstly, we in- information is propagated in a sequential way, i.e., a slice
vestigate the effects of directions in SCNN. We try SCNN does not pass information to the next slice until it has re-
that has different direction implementations, the results are ceived information from former slices. To verify the effec-
shown in Table. 1. Here the kernel width w of SCNN is set to tiveness of this scheme, we compare it with parallel prop-
5. It can be seen that the performance increases as more di- agation, i.e., each slice passes information to the next slice
rections are added. To prove that the improvement does not simultaneously before being updated. For this parallel case,
result from more parameters but from the message passing the 0 in the right part of Eq.(1) is removed. As Table. 4
scheme brought about by SCNN, we add an extra convolu- shows, the sequential message passing scheme outperforms
tion layer with 5×5 kernel width after the top hidden layer of the parallel scheme significantly. This result indicates that
the baseline model and compare with our method. From the in SCNN, a pixel does not merely affected by nearby pixels,
results we can see that extra convolution layer could merely but do receive information from further positions.
bring about little improvement, which verifies the effective-
ness of SCNN.
Table 4: Comparison between sequential and parallel mes-
sage passing scheme, for SCNN DULR with w = 9.
Table 1: Experimental results on SCNN with different di- Message passing scheme Parallel Sequential
rectional settings. F1 denotes F1-measure, and the value in
the bracket denotes the IoU threshold. The suffix ’D’, ’U’, F1 (0.3) 78.4 80.9
’R’, ’L’ denote downward, upward, rightward, and leftward F1 (0.5) 65.2 71.6
respectively.
Models Baseline ExtraConv SCNN D SCNN DU SCNN DURL (5) Comparison with state-of-the-art methods. To fur-
ther verify the effectiveness of SCNN in lane detec-
F1 (0.3) 77.7 77.6 79.5 79.9 80.2
F1 (0.5) 63.2 64.0 68.6 69.4 70.4 tion, we compare it with several methods: the rnn based
ReNet (Visin et al. 2015), the MRF based MRFNet, the
DenseCRF (Krähenbühl and Koltun 2011), and the very
(2) Effects of kernel width w. We further try SCNN with deep residual network (He et al. 2016). For ReNet based
different kernel width based on the ”SCNN DURL” model, on LSTM, we replace the ”SCNN” layers in Fig. 3 with
as shown in Table. 2. Here the kernel width denotes the num- two ReNet layers: one layer to pass horizontal information
ber of pixels that a pixel could receive messages from, and and the other to pass vertical information. For DenseCRF,
the w = 1 case is similar to the methods in (Visin et al. 2015; we use dense CRF as post-processing and employ 10 mean
Bell et al. 2016). The results show that larger w is beneficial, field iterations as in (Chen et al. 2017). For MRFNet, we use
Table 5: Comparison with other methods, with IoU threshold=0.5. For crossroad, only FP are shown.
Category Baseline ReNet DenseCRF MRFNet ResNet-50 ResNet-101 Baseline+SCNN
Normal 83.1 83.3 81.3 86.3 87.4 90.2 90.6
Crowded 61.0 60.5 58.8 65.2 64.1 68.2 69.7
Night 56.9 56.3 54.2 61.3 60.6 65.9 66.1
No line 34.0 34.5 31.9 37.2 38.1 41.7 43.4
Shadow 54.7 55.0 56.3 59.3 60.7 64.6 66.9
Arrow 74.0 74.1 71.2 76.9 79.0 84.0 84.1
Dazzle light 49.9 48.2 46.2 53.7 54.1 59.8 58.5
Curve 61.0 59.9 57.8 62.3 59.8 65.5 64.4
Crossroad 2060 2296 2253 1837 2505 2183 1990
Total 63.2 62.9 61.0 67.0 66.7 70.8 71.6

Figure 7: Comparison between probmaps of baseline, ReNet, MRFNet, ResNet-101, and SCNN.

the implementation in Fig. 3 (a), with iteration times and this, we compare their runtime experimentally. The results
message passing kernel size set to 10 and 20 respectively. are shown in Table. 6, where the runtime of the LSTM in
The main difference of the MRF here with CRF is that the ReNet is also given. Here the runtime does not include run-
weights of message passing kernels are learned during train- time of the backbone network. For SCNN, we test both the
ing rather than depending on the image. For ResNet, our practical case and the case with the same setting as dense
implementation is the same with (Chen et al. 2017) except CRF. In the practical case, SCNN is applied on top hidden
that we do not use the ASPP module. For SCNN, we add layer, thus the input has more channels but less hight and
SCNN DULR module to the baseline, and the kernel width width. In the fair comparison case, the input size is modified
w is 9. The test results on different scenarios are shown in to be the same with that in dense CRF, and both methods are
Table 5, and visualizations are given in Fig. 7. tested on CPU. The results show that even in fair comparison
From the results, we can see that the performance of case, SCNN is over 4 times faster than dense CRF, despite
ReNet is not even comparable with SCNN DULR with w = the efficient implementation of dense CRF in (Krähenbühl
1, indicating the effectiveness of our residual message pass- and Koltun 2011). This is because SCNN significantly re-
ing scheme. Interestingly, DenseCRF leads to worse result duces redundancy in message passing, as in Fig. 4. Also,
here, because lane markings usually have less appearance SCNN is more efficient than LSTM, whose gate mechanism
clues so that dense CRF cannot distinguish lane markings requires more computation.
and background. In contrast, with kernel weights learned
from data, MRFNet could to some extent smooth the results
and improve performance, as Fig. 7 shows, but are still not Table 6: Runtime of dense CRF, LSTM, and SCNN. The two
very satisfactory. Furthermore, our method even outperform SCNNs correspond to the one used in practice and the one
the much deeper ResNet-50 and ResNet-101. Despite the whose input size is modified for fair comparison with dense
over a hundred layers and the very large receptive field of CRF respectively. The kernel width w of SCNN is 9.
ResNet-101, it still gives messy or discontinuous outputs in SCNN DULR SCNN DULR
Method dense CRF LSTM
(in practice) (fair comparison)
challenging cases, while our method, with only 16 convolu-
tion layers plus 4 SCNN layers, could preserve the smooth- Input size 5×288×800 128×36×100 128×36×100 5×288×800
(C × H × W )
ness and continuity of lane lines better. This demonstrates Device CPU2 GPU3 GPU CPU
the much stronger capability of SCNN to capture structure Runtime (ms) 737 115 42 176
prior of objects over traditional CNN.
(6) Computational efficiency over other methods. In the
Analysis section we give theoretical analysis on the com- 2
Intel Core i7-4790K CPU
3
putational efficiency of SCNN over dense CRF. To verify GeForce GTX TITAN Black
Table 7: Results on Cityscapes validation set.
traffic traffic
Method road terrain building wall car pole fence sidewalk sky rider person vegetation truck bus train motor bicycle mIoU
light sign
LargeFOV 97.0 59.2 89.9 42.2 92.3 52.9 62.3 71.1 52.2 78.8 92.2 52.1 75.9 91.0 48.8 70.2 37.6 54.6 72.3 68.0
LargeFOV+SCNN 97.0 59.8 90.3 45.7 92.5 55.2 62.3 71.7 52.5 78.1 92.6 53.2 76.4 91.1 55.6 71.2 41.7 56.2 72.3 69.2
ResNet-101 98.3 64.2 92.4 44.5 94.9 66.0 74.5 82.1 59.9 86.0 94.7 65.5 84.1 92.7 57.3 81.1 54.0 64.5 80.0 75.6
ResNet-101+SCNN 98.3 65.4 92.6 46.7 94.8 66.1 74.3 81.5 61.2 86.1 94.7 65.5 84.0 92.7 57.7 82.0 59.9 67.0 80.1 76.4

Figure 8: Visual improvements on Cityscapes validation set. For each example, from left to right are: input image, ground truth,
result of LargeFOV, result of LargeFOV+SCNN.

Semantic Segmentation on Cityscapes

Table 8: Comparison between our SCNN and other
To demonstrate the generality of our method, we also MRF/CRF based methods on Cityscapes test set.
evaluate Spatial CNN on Cityscapes (Cordts et al. 2016). LargeFOV DPN
Cityscapes is a standard benchmark dataset for semantic Method Ours
(Chen et al. 2017) (Liu et al. 2015)
segmentation on urban traffic scenes. It contains 5000 fine
mIoU 63.1 66.8 68.2
annotated images, including 2975 for training, 500 for val-
idation and 1525 for testing. 19 categories are defined in-
cluding both stuff and objects. We use two classic mod-
methods, we evaluate LargeFOV+SCNN on Cityscapes test
els, the LargeFOV and ResNet-101 in DeepLab (Chen et al.
set, and compare with methods that also use VGG16 (Si-
2017) as the baselines. Batch normalization layers (Ioffe and
monyan and Zisserman 2015) as the backbone network. The
Szegedy 2015) are added to LargeFOV to enable faster con-
results are shown in Table 8. Here LargeFOV, DPN, and our
vergence. For both models, the channel numbers of the top
method use dense CRF, dense MRF, and SCNN respectively,
hidden layers are modified to 128 to make them compacter.
and share nearly the same base CNN part. The results show
We add SCNN to the baseline models in the same way as
that our method achieves significant better performance.
in lane detection. The comparisons between baselines and
those combined with the SCNN DURL models with kernel
width w = 9 are shown in Table 7. It can be seen that SCNN
Conclusion
could also improve semantic segmentation results. With SC- In this paper, we propose Spatial CNN, a CNN-like scheme
NNs added, the IoUs for all classes are at least comparable to achieve effective information propagation in the spatial
to the baselines, while the ”wall”, ”pole”, ”truck”, ”bus”, level. SCNN could be easily incorporated into deep neu-
”train”, and ”motor” categories achieve significant improve. ral networks and trained end-to-end. It is evaluated at two
This is because for long shaped objects like train and pole, tasks in traffic scene understanding: lane detection and se-
SCNN could capture its continuous structure and connect mantic segmentation. The results show that SCNN could ef-
the disconnected part, as shown in Fig. 8. And for wall, fectively preserve the continuity of long thin structure, while
truck, and bus which could occupy large image area, the dif- in semantic segmentation its diffusion effects is also proved
fusion effect of SCNN could correct the part that are mis- to be beneficial for large objects. Specifically, by introduc-
classified according to the context. This shows that SCNN ing SCNN into the LargeFOV model, our 20-layer network
is useful not only for long thin structure, but also for large outperforms ReNet, MRF, and the very deep ResNet-101 in
objects which require global information to be classified lane detection. Last but not least, we believe that the large
correctly. There is another interesting phenomenon that the challenging lane detection dataset we presented would push
head of the vehicle at the bottom of the images, whose label forward researches on autonomous driving.
is ignored during training, is in a mess in LargeFOV while
with SCNN added it is classified as road. This is also due to Acknowledgments
the diffusion effects of SCNN, which passes the information This work is supported by SenseTime Group Limited. We
of road to the vehicle head area. would like to thank Xiaohang Zhan, Jun Li, and Xudong
To compare our method with other MRF/CRF based Cao for helpful work in building the lane detection dataset.
References Scaramuzza, D.; Martinelli, A.; and Siegwart, R. 2006. A flexi-
Aly, M. 2008. Real time detection of lane markers in urban streets. ble technique for accurate omnidirectional camera calibration and
In Intelligent Vehicles Symposium, 2008 IEEE, 7–12. IEEE. structure from motion. In Computer Vision Systems, 2006 ICVS’06.
IEEE International Conference on, 45–45. IEEE.
Bar Hillel, A.; Lerner, R.; Levi, D.; and Raz, G. 2014. Recent
progress in road and lane detection: a survey. Machine vision and Simonyan, K., and Zisserman, A. 2015. Very deep convolutional
applications 1–19. networks for large-scale image recognition. In ICLR.
Bell, S.; Lawrence Zitnick, C.; Bala, K.; and Girshick, R. 2016. Son, J.; Yoo, H.; Kim, S.; and Sohn, K. 2015. Real-time illumi-
Inside-outside net: Detecting objects in context with skip pooling nation invariant lane detection for lane departure warning system.
and recurrent neural networks. In CVPR. Expert Systems with Applications 42(4):1816–1824.
Brostow, G. J.; Shotton, J.; Fauqueur, J.; and Cipolla, R. 2008. Tompson, J. J.; Jain, A.; LeCun, Y.; and Bregler, C. 2014. Joint
Segmentation and recognition using structure from motion point training of a convolutional network and a graphical model for hu-
clouds. In ECCV. man pose estimation. In NIPS.
Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; and Yuille, TuSimple. 2017. Tusimple benchmark. http://benchmark.
A. L. 2017. Deeplab: Semantic image segmentation with deep tusimple.ai/#/.
convolutional nets, atrous convolution, and fully connected crfs. Urmson, C.; Anhalt, J.; Bagnell, D.; Baker, C.; Bittner, R.; Clark,
TPAMI. M.; Dolan, J.; Duggins, D.; Galatali, T.; Geyer, C.; et al. 2008.
Chu, X.; Ouyang, W.; Wang, X.; et al. 2016. Crf-cnn: Modeling Autonomous driving in urban environments: Boss and the urban
structured information in human pose estimation. In NIPS. challenge. Journal of Field Robotics 25(8):425–466.
Collobert, R.; Kavukcuoglu, K.; and Farabet, C. 2011. Torch7: A Visin, F.; Kastner, K.; Cho, K.; Matteucci, M.; Courville, A.; and
matlab-like environment for machine learning. In BigLearn, NIPS Bengio, Y. 2015. Renet: A recurrent neural network based alterna-
Workshop, number EPFL-CONF-192376. tive to convolutional networks. arXiv preprint arXiv:1505.00393.
Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Zheng, S.; Jayasumana, S.; Romera-Paredes, B.; Vineet, V.; Su, Z.;
Benenson, R.; Franke, U.; Roth, S.; and Schiele, B. 2016. The Du, D.; Huang, C.; and Torr, P. H. 2015. Conditional random fields
cityscapes dataset for semantic urban scene understanding. In as recurrent neural networks. In ICCV.
CVPR.
Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L.
2009. Imagenet: A large-scale hierarchical image database. In
CVPR.
Fritsch, J.; Kuhnl, T.; and Geiger, A. 2013. A new performance
measure and evaluation benchmark for road detection algorithms.
In Intelligent Transportation Systems-(ITSC), 2013 16th Interna-
tional IEEE Conference on, 1693–1700. IEEE.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learn-
ing for image recognition. In CVPR.
Huval, B.; Wang, T.; Tandon, S.; Kiske, J.; Song, W.; Pazhayam-
pallil, J.; Andriluka, M.; Rajpurkar, P.; Migimatsu, T.; Cheng-Yue,
R.; et al. 2015. An empirical evaluation of deep learning on high-
way driving. arXiv preprint arXiv:1504.01716.
Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerat-
ing deep network training by reducing internal covariate shift. In
ICML.
Jung, S.; Youn, J.; and Sull, S. 2016. Efficient lane detection based
on spatiotemporal images. IEEE Transactions on Intelligent Trans-
portation Systems 17(1):289–295.
Krähenbühl, P., and Koltun, V. 2011. Efficient inference in fully
connected crfs with gaussian edge potentials. In NIPS.
Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet
classification with deep convolutional neural networks. In NIPS.
Liang, X.; Shen, X.; Feng, J.; Lin, L.; and Yan, S. 2016a. Semantic
object parsing with graph lstm. In ECCV.
Liang, X.; Shen, X.; Xiang, D.; Feng, J.; Lin, L.; and Yan, S. 2016b.
Semantic object parsing with local-global long short-term memory.
In CVPR.
Liu, Z.; Li, X.; Luo, P.; Loy, C.-C.; and Tang, X. 2015. Semantic
image segmentation via deep parsing network. In ICCV.
Long, J.; Shelhamer, E.; and Darrell, T. 2015. Fully convolutional
networks for semantic segmentation. In CVPR.
Pascanu, R.; Mikolov, T.; and Bengio, Y. 2013. On the difficulty
of training recurrent neural networks. In ICML.