Object Detection: Current and Future Directions: Rodrigo Verschae and Javier Ruiz-del-Solar


PERSPECTIVE

published: 19 November 2015


doi: 10.3389/frobt.2015.00029

Object Detection: Current and Future Directions

Rodrigo Verschae 1*† and Javier Ruiz-del-Solar 1,2
1 Advanced Mining Technology Center, Universidad de Chile, Santiago, Chile, 2 Department of Electrical Engineering, Universidad de Chile, Santiago, Chile

Object detection is a key ability required by most computer and robot vision systems.
The latest research in this area has been making great progress in many directions. In
the current manuscript, we give an overview of past research on object detection, outline
the current main research directions, and discuss open problems and possible future
directions.
Keywords: object detection, perspective, mini review, current directions, open problems
Edited by: Venkatesh Babu Radhakrishnan, Indian Institute of Science Bangalore, India
Reviewed by: Juxi Leitner, Queensland University of Technology, Australia; George Azzopardi, University of Groningen, Netherlands; Soma Biswas, Indian Institute of Science Bangalore, India
*Correspondence: Rodrigo Verschae, rodrigo@verschae.org
†Present address: Rodrigo Verschae, Graduate School of Informatics, Kyoto University, Kyoto, Japan
Specialty section: This article was submitted to Vision Systems Theory, Tools and Applications, a section of the journal Frontiers in Robotics and AI
Received: 20 July 2015
Accepted: 04 November 2015
Published: 19 November 2015
Citation: Verschae R and Ruiz-del-Solar J (2015) Object Detection: Current and Future Directions. Front. Robot. AI 2:29. doi: 10.3389/frobt.2015.00029

1. INTRODUCTION

In recent years, computer vision research has expanded rapidly and successfully. Part of this success has come from adopting and adapting machine learning methods, and part from the development of new representations and models for specific computer vision problems or from the development of efficient solutions. One area that has attained great progress is object detection. The present work gives a perspective on object detection research.

Given a set of object classes, object detection consists of determining the location and scale of all object instances, if any, that are present in an image. Thus, the objective of an object detector is to find all object instances of one or more given object classes regardless of scale, location, pose, view with respect to the camera, partial occlusions, and illumination conditions.

In many computer vision systems, object detection is the first task performed, as it allows further information to be obtained about the detected object and about the scene. Once an object instance has been detected (e.g., a face), it is possible to obtain further information, including: (i) to recognize the specific instance (e.g., to identify the subject’s face), (ii) to track the object over an image sequence (e.g., to track the face in a video), and (iii) to extract further information about the object (e.g., to determine the subject’s gender), while it is also possible to (a) infer the presence or location of other objects in the scene (e.g., a hand may be near a face and at a similar scale) and (b) better estimate further information about the scene (e.g., the type of scene, indoor versus outdoor, etc.), among other contextual information.

Object detection has been used in many applications, with the most popular ones being: (i) human-computer interaction (HCI), (ii) robotics (e.g., service robots), (iii) consumer electronics (e.g., smart-phones), (iv) security (e.g., recognition, tracking), (v) retrieval (e.g., search engines, photo management), and (vi) transportation (e.g., autonomous and assisted driving). Each of these applications has different requirements, including: processing time (off-line, on-line, or real-time), robustness to occlusions, invariance to rotations (e.g., in-plane rotations), and detection under pose changes. While many applications consider the detection of a single object class (e.g., faces) and from a single view (e.g., frontal faces), others require the detection of multiple object classes (humans, vehicles, etc.), or of a single class from multiple views (e.g., side and frontal view of vehicles). In general, most systems can detect only a single object class from a restricted set of views and poses.

Frontiers in Robotics and AI | www.frontiersin.org 1 November 2015 | Volume 2 | Article 29


Verschae and Ruiz-del-Solar Object Detection: Current and Future Directions

Several surveys on detection and recognition have been published during the last years [see Hjelmås and Low (2001), Yang et al. (2002), Sun et al. (2006), Li and Allinson (2008), Enzweiler and Gavrila (2009), Dollar et al. (2012), Andreopoulos and Tsotsos (2013), Li et al. (2015), and Zafeiriou et al. (2015)], and there are four main problems related to object detection. The first one is object localization, which consists of determining the location and scale of a single object instance known to be present in the image; the second one is object presence classification, which corresponds to determining whether at least one object of a given class is present in an image (without giving any information about the location, scale, or the number of objects), while the third problem is object recognition, which consists of determining if a specific object instance is present in the image. The fourth related problem is view and pose estimation, which consists of determining the view of the object and the pose of the object.

The problem of object presence classification can be solved using object detection techniques, but in general, other methods are used, as determining the location and scale of the objects is not required, and determining only the presence can be done more efficiently. In some cases, object recognition can be solved using methods that do not require detecting the object in advance [e.g., using methods based on Local Interest Points such as Tuytelaars and Mikolajczyk (2008) and Ramanan and Niranjan (2012)]. Nevertheless, solving the object detection problem would solve (or help simplify) these related problems. An additional, recently addressed problem corresponds to determining the “objectness” of an image patch, i.e., measuring the likelihood that an image window contains an object of any class [e.g., Alexe et al. (2010), Endres and Hoiem (2010), and Huval et al. (2013)].

In the following, we give a summary of past research on object detection, present an overview of current research directions, and discuss open problems and possible future directions, all this with a focus on the classifiers and architectures of the detector, rather than on the features used.

2. A BRIEF REVIEW OF OBJECT DETECTION RESEARCH

Early works on object detection were based on template matching techniques and simple part-based models [e.g., Fischler and Elschlager (1973)]. Later, methods based on statistical classifiers (e.g., Neural Networks, SVM, Adaboost, Bayes, etc.) were introduced [e.g., Osuna et al. (1997), Rowley et al. (1998), Sung and Poggio (1998), Schneiderman and Kanade (2000), Yang et al. (2000a,b), Fleuret and Geman (2001), Romdhani et al. (2001), and Viola and Jones (2001)]. This initial successful family of object detectors, all of them based on statistical classifiers, set the ground for most of the following research in terms of training and evaluation procedures and classification techniques.

Because face detection is a critical ability for any system that interacts with humans, it is the most common application of object detection. However, many additional detection problems have been studied [e.g., Papageorgiou and Poggio (2000), Agarwal et al. (2004), Alexe et al. (2010), Everingham et al. (2010), and Andreopoulos and Tsotsos (2013)]. Most cases correspond to objects that people often interact with, such as other humans [e.g., pedestrians (Papageorgiou and Poggio, 2000; Viola and Jones, 2002; Dalal and Triggs, 2005; Bourdev et al., 2010; Paisitkriangkrai et al., 2015)] and body parts [e.g., faces, hands, and eyes (Kölsch and Turk, 2004; Ong and Bowden, 2004; Wu and Nevatia, 2005; Verschae et al., 2008; Bourdev and Malik, 2009)], as well as vehicles [e.g., cars and airplanes (Papageorgiou and Poggio, 2000; Felzenszwalb et al., 2010b)], and animals [e.g., Fleuret and Geman (2008)].

Most object detection systems consider the same basic scheme, commonly known as sliding window: in order to detect the objects appearing in the image at different scales and locations, an exhaustive search is applied. This search makes use of a classifier, the core part of the detector, which indicates whether a given image patch corresponds to the object or not. Given that the classifier basically works at a given scale and patch size, several versions of the input image are generated at different scales, and the classifier is used to classify all possible patches of the given size, for each of the downscaled versions of the image.

Basically, three alternatives exist to the sliding window scheme. The first one is based on the use of bag-of-words (Weinland et al., 2011; Tsai, 2012), a method sometimes used for verifying the presence of the object, and that in some cases can be efficiently applied by iteratively refining the image region that contains the object [e.g., Lampert et al. (2009)]. The second one samples patches and iteratively searches for regions of the image where it is likely that the object is present [e.g., Prati et al. (2012)]. These two schemes reduce the number of image patches on which classification is performed, seeking to avoid an exhaustive search over all image patches. The third scheme finds key-points and then matches them to perform the detection [e.g., Azzopardi and Petkov (2013)]. These schemes cannot always guarantee that all object instances will be detected.

3. OBJECT DETECTION APPROACHES

Object detection methods can be grouped into five categories, each with merits and demerits: while some are more robust, others can be used in real-time systems, and others can handle more classes, etc. Table 1 gives a qualitative comparison.

3.1. Coarse-to-Fine and Boosted Classifiers

The most popular work in this category is the boosted cascade classifier of Viola and Jones (2004). It works by efficiently rejecting, in a cascade of tests/filters, image patches that do not correspond to the object. Cascade methods are commonly used with boosted classifiers for two main reasons: (i) boosting generates an additive classifier, thus it is easy to control the complexity of each stage of the cascade and (ii) during training, boosting can also be used for feature selection, allowing the use of large (parametrized) families of features. A coarse-to-fine cascade classifier is usually the first kind of classifier to consider when efficiency is a key requirement. Recent methods based on boosted classifiers include Li and Zhang (2004), Gangaputra and Geman (2006), Huang et al. (2007), Wu and Nevatia (2007), Verschae et al. (2008), and Verschae and Ruiz-del-Solar (2012).
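As an illustration of the sliding window scheme of Section 2 combined with the cascade idea of Section 3.1, the following minimal sketch may help. It is not the method of Viola and Jones: the weak classifiers, weights, thresholds, and all function names are hypothetical and hand-written for the example (a real detector would learn them), and the image pyramid uses naive subsampling. It only shows an exhaustive search over positions and scales in which each patch passes through additive stages and is dropped at the first stage that rejects it.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[List[float]]            # grayscale image as rows of pixel values
Stage = Callable[[Image], bool]      # one cascade stage: True means "not rejected"
Weak = Tuple[float, Stage]           # (weight, weak classifier)


def downscale(image: Image, factor: int) -> Image:
    """Naive subsampling (no smoothing): keep every `factor`-th pixel."""
    return [row[::factor] for row in image[::factor]]


def crop(image: Image, top: int, left: int, size: int) -> Image:
    """Extract the size x size patch whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in image[top:top + size]]


def boosted_stage(weak: Sequence[Weak], threshold: float) -> Stage:
    """Additive stage: sum the weights of the weak classifiers that fire."""
    def stage(patch: Image) -> bool:
        return sum(alpha for alpha, h in weak if h(patch)) >= threshold
    return stage


def cascade_accepts(patch: Image, stages: Sequence[Stage]) -> bool:
    """Run stages in order; all() stops at the first stage that rejects."""
    return all(stage(patch) for stage in stages)


def sliding_window_detect(image, stages, patch_size, factors=(1, 2, 4), stride=1):
    """Exhaustive search over positions and scales.

    Returns (top, left, size) boxes in original-image coordinates.
    """
    detections = []
    for factor in factors:
        scaled = downscale(image, factor)
        rows = len(scaled)
        cols = len(scaled[0]) if scaled else 0
        for top in range(0, rows - patch_size + 1, stride):
            for left in range(0, cols - patch_size + 1, stride):
                patch = crop(scaled, top, left, patch_size)
                if cascade_accepts(patch, stages):
                    detections.append(
                        (top * factor, left * factor, patch_size * factor))
    return detections


# Hypothetical hand-written weak classifiers (a real detector would learn these).
def top_left_bright(patch: Image) -> bool:
    return patch[0][0] > 0.5


def mostly_bright(patch: Image) -> bool:
    return sum(map(sum, patch)) / (len(patch) * len(patch[0])) > 0.9


# Cheap first stage (one weak classifier), more demanding second stage (two).
stages = [
    boosted_stage([(1.0, top_left_bright)], threshold=1.0),
    boosted_stage([(1.0, top_left_bright), (1.0, mostly_bright)], threshold=2.0),
]

# 8x8 dark image with one bright 2x2 "object" at rows 2-3, columns 4-5.
image = [[0.0] * 8 for _ in range(8)]
for r in (2, 3):
    for c in (4, 5):
        image[r][c] = 1.0

print(sliding_window_detect(image, stages, patch_size=2, factors=(1, 2)))
# prints [(2, 4, 2)]
```

In this toy setting most windows are discarded by the single-feature first stage, which is the reason a cascade is attractive when efficiency matters; the number of weak classifiers per stage is the knob that controls stage complexity.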


3.2. Dictionary Based

The best example in this category is the Bag of Words method [e.g., Serre et al. (2005) and Mutch and Lowe (2008)]. This approach is basically designed to detect a single object per image, but after removing a detected object, the remaining objects can be detected [e.g., Lampert et al. (2009)]. Two problems with this approach are that it cannot robustly handle the case of two instances of the object appearing near each other, and that the localization of the object may not be accurate.

3.3. Deformable Part-Based Model

This approach considers object and part models and their relative positions. In general, it is more robust than other approaches, but it is rather time consuming and cannot detect objects appearing at small scales. It can be traced back to the deformable models (Fischler and Elschlager, 1973), but successful methods are recent (Felzenszwalb et al., 2010b). Relevant works include Felzenszwalb et al. (2010a) and Yan et al. (2014), where efficient evaluation of deformable part-based models is implemented using a coarse-to-fine cascade model, and Divvala et al. (2012), where the relevance of the part-models is analyzed, among others [e.g., Azizpour and Laptev (2012), Zhu and Ramanan (2012), and Girshick et al. (2014)].

3.4. Deep Learning

One of the first successful methods in this family is based on convolutional neural networks (Delakis and Garcia, 2004). The key difference between this and the above approaches is that in this approach the feature representation is learned instead of being designed by the user, but with the drawback that a large number of training samples is required for training the classifier. Recent methods include Dean et al. (2013), Huval et al. (2013), Ouyang and Wang (2013), Sermanet et al. (2013), Szegedy et al. (2013), Zeng et al. (2013), Erhan et al. (2014), Zhou et al. (2014), and Ouyang et al. (2015).

3.5. Trainable Image Processing Architectures

In such architectures, the parameters of predefined operators and the combination of the operators are learned, sometimes considering an abstract notion of fitness. These are general-purpose architectures, and thus they can be used to build several modules of a larger system (e.g., object recognition, key point detectors and object detection modules of a robot vision system). Examples include trainable COSFIRE filters (Azzopardi and Petkov, 2013, 2014), and Cartesian Genetic Programming (CGP) (Harding et al., 2013; Leitner et al., 2013).

4. CURRENT RESEARCH PROBLEMS

Table 2 presents a summary of solved, current, and open problems. In the present section we discuss current research directions.

4.1. Multi-Class

Many applications require detecting more than one object class. If a large number of classes is being detected, the processing speed becomes an important issue, as well as the kind of classes that the system can handle without accuracy loss. Works that have addressed the multi-class detection problem include Torralba et al. (2007), Razavi et al. (2011), Benbouzid et al. (2012),

TABLE 1 | Qualitative comparison of object detection approaches.

Method | Coarse-to-fine and boosted classifiers | Dictionary based | Deformable part-based models | Deep learning | Trainable image processing architectures
--- | --- | --- | --- | --- | ---
Accuracy | ++ | += | ++ | ++ | +=
Generality | == | ++ | += | ++ | +=
Speed | ++ | += | == | += | +=
Advantages | Real-time; it can work at small resolutions | Representation can be shared across classes | It can handle deformations and occlusions | Representation can be transferred to other classes | General-purpose architecture that can be used in several modules of a system
Drawbacks/requirements | Features are predefined | It may not detect all object instances | It cannot detect small objects | Large training sets; specialized hardware (GPU) for efficiency | The obtained system may be too specialized for a particular setting
Typical applications | Robotics, security | Retrieval, search | Transportation, pedestrian detection | Retrieval, search | HCI, health, robotics

Accuracy: ++, High; +=, Good; ==, Low.
Speed: ++, real-time (15 fps or more); +=, online (10–5 fps); ==, offline (5 fps or less).
Generality: ++ (+=), applicable to many (some) object classes; ==, depends on features designed for specific classes.

TABLE 2 | Summary of current directions and open problems.

Solved problems | Single-class | Single-view | Small deformations | Multi-scale
Current directions | Multi-class (scalability and efficiency) | Multi-view/pose; multi-resolution | Occlusions, deformable; interlaced object and background | Contextual information; temporal features
Open problems | Incremental learning | Object-part relation | Pixel-level detection; background objects | Multi-modal


Song et al. (2012), Verschae and Ruiz-del-Solar (2012), and Erhan et al. (2014). Efficiency has been addressed, e.g., by using the same representation for several object classes, as well as by developing multi-class classifiers designed specifically to detect multiple classes. Dean et al. (2013) present one of the few existing works for very large-scale multi-class object detection, where 100,000 object classes were considered.

4.2. Multi-View, Multi-Pose, Multi-Resolution

Most methods used in practice have been designed to detect a single object class under a single view, thus these methods cannot handle multiple views or large pose variations, with the exception of deformable part-based models, which can deal with some pose variations. Some works have tried to detect objects by learning subclasses (Wu and Nevatia, 2007) or by considering views/poses as different classes (Verschae and Ruiz-del-Solar, 2012), in both cases improving the efficiency and robustness. Also, multi-pose models [e.g., Erol et al. (2007)] and multi-resolution models [e.g., Park et al. (2010)] have been developed.

4.3. Efficiency and Computational Power

Efficiency is an issue to be taken into account in any object detection system. As mentioned, a coarse-to-fine classifier is usually the first kind of classifier to consider when efficiency is a key requirement [e.g., Viola et al. (2005)], while reducing the number of image patches on which classification is performed [e.g., Lampert et al. (2009)] and efficiently detecting multiple classes [e.g., Verschae and Ruiz-del-Solar (2012)] have also been used. Efficiency does not imply real-time performance, and works such as Felzenszwalb et al. (2010b) are robust and efficient, but not fast enough for real-time problems. However, using specialized hardware (e.g., GPU) some methods can run in real-time (e.g., deep learning).

4.4. Occlusions, Deformable Objects, and Interlaced Object and Background

Dealing with partial occlusions is also an important problem, and no compelling solution exists, although relevant research has been done [e.g., Wu and Nevatia (2005)]. Similarly, detecting objects that are not “closed,” i.e., where object and background pixels are interlaced, is still a difficult problem. Two examples are hand detection [e.g., Kölsch and Turk (2004)] and pedestrian detection [see Dollar et al. (2012)]. Deformable part-based models [e.g., Felzenszwalb et al. (2010b)] have been to some extent successful under this kind of problem, but further improvement is still required.

4.5. Contextual Information and Temporal Features

Integrating contextual information (e.g., about the type of scene, or the presence of other objects) can increase speed and robustness, but “when and how” to do this (before, during or after the detection) is still an open problem. Some proposed solutions include the use of (i) spatio-temporal context [e.g., Palma-Amestoy et al. (2010)], (ii) spatial structure among visual words [e.g., Wu et al. (2009)], and (iii) semantic information aiming to map semantically related features to visual words [e.g., Wu et al. (2010)], among many others [e.g., Torralba and Sinha (2001), Divvala et al. (2009), Sun et al. (2012), Mottaghi et al. (2014), and Cadena et al. (2015)]. While most methods consider the detection of objects in a single frame, temporal features can be beneficial [e.g., Viola et al. (2005) and Dalal et al. (2006)].

5. OPEN PROBLEMS AND FUTURE DIRECTIONS

In the following, we outline problems that we believe have not been addressed, or addressed only partially, and may be interesting and relevant research directions.

5.1. Open-World Learning and Active Vision

An important problem is to incrementally learn to detect new classes, or to incrementally learn to distinguish among subclasses after the “main” class has been learned. If this can be done in an unsupervised way, we will be able to build new classifiers based on existing ones, greatly reducing the effort required to learn new object classes. Note that humans are continuously inventing new objects, fashion changes, etc., and therefore detection systems will need to be continuously updated, adding new classes, or updating existing ones. Some recent works have addressed these issues, mostly based on deep learning and transfer learning methods [e.g., Bengio (2012), Mesnil et al. (2012), and Kotzias et al. (2014)]. This open-world learning is of particular importance in robot applications, where active vision mechanisms can aid in the detection and learning [e.g., Paletta and Pinz (2000) and Correa et al. (2012)].

5.2. Object-Part Relation

During the detection process, should we detect the object first or the parts first? This is a basic dilemma, and no clear solution exists. Probably, the search for the object and for the parts must be done concurrently, with both processes giving feedback to each other. How to do this is still an open problem and is likely related to how to use context information. Moreover, in cases where the object part can also be decomposed into subparts, an interaction among several hierarchies emerges, and in general it is not clear what should be done first.

5.3. Multi-Modal Detection

The use of new sensing modalities, in particular depth and thermal cameras, has seen some development in the last years [e.g., Fehr and Burkhardt (2008) and Correa et al. (2012)]. However, the methods used for processing visual images are also used for thermal images, and to a lesser degree for depth images. While using thermal images makes it easier to discriminate the foreground from the background, it can only be applied to objects that radiate infrared light (e.g., mammals, heating, etc.). With depth images it is easy to segment the objects, but general methods for detecting specific classes have not been proposed, and probably higher resolution depth images are required. It seems that depth and thermal cameras alone are not enough for object detection, at least with their current resolution, but further advances can be expected as the sensing technology improves.
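To make the remark about thermal images in Section 5.3 concrete, the sketch below shows one way a thermal image could be used to separate warm foreground from background and thereby limit where a visual detector has to look. This is an assumption-laden illustration and not a method from the cited works: the function names, the temperature threshold, and the warm-pixel fraction are hypothetical, and the thermal and visual images are assumed to be already registered pixel to pixel.

```python
from typing import List, Tuple

Image = List[List[float]]   # thermal image as rows of temperatures (assumed deg C)


def foreground_mask(thermal: Image, min_temp: float) -> List[List[bool]]:
    """Mark pixels at or above `min_temp` as warm foreground."""
    return [[value >= min_temp for value in row] for row in thermal]


def candidate_windows(mask: List[List[bool]], patch_size: int,
                      min_warm_fraction: float = 0.5,
                      stride: int = 1) -> List[Tuple[int, int]]:
    """Return (top, left) of windows whose warm-pixel fraction is high enough.

    Only these windows would be passed on to a (visual) patch classifier.
    """
    rows = len(mask)
    cols = len(mask[0]) if mask else 0
    area = patch_size * patch_size
    windows = []
    for top in range(0, rows - patch_size + 1, stride):
        for left in range(0, cols - patch_size + 1, stride):
            warm = sum(mask[r][c]
                       for r in range(top, top + patch_size)
                       for c in range(left, left + patch_size))
            if warm / area >= min_warm_fraction:
                windows.append((top, left))
    return windows


# 4x4 scene at 20 degrees with a warm 2x2 region at rows 1-2, columns 1-2.
thermal = [[20.0] * 4 for _ in range(4)]
for r in (1, 2):
    for c in (1, 2):
        thermal[r][c] = 36.0

mask = foreground_mask(thermal, min_temp=30.0)
windows = candidate_windows(mask, patch_size=2)
print(len(windows), "of 9 windows kept:", windows)
```

The same limitation stated in the text applies to the sketch: it can only propose windows for objects that are warmer than their surroundings, so on its own it is a pre-filter rather than a detector.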


5.4. Pixel-Level Detection (Segmentation) and Background Objects

In many applications, we may be interested in detecting objects that are usually considered as background. The detection of such “background objects,” such as rivers, walls, mountains, has not been addressed by most of the approaches mentioned here. In general, this kind of problem has been addressed by first segmenting the image and later labeling each segment of the image [e.g., Peng et al. (2013)]. Of course, for successfully detecting all objects in a scene, and to completely understand the scene, we will need to have a pixel-level detection of the objects, and furthermore, a 3D model of such a scene. Therefore, at some point object detection and image segmentation methods may need to be integrated. We are still far from attaining such automatic understanding of the world, and to achieve this, active vision mechanisms might be required [e.g., Aloimonos et al. (1988) and Cadena et al. (2015)].

6. CONCLUSION

Object detection is a key ability for most computer and robot vision systems. Although great progress has been observed in the last years, and some existing techniques are now part of many consumer electronics (e.g., face detection for auto-focus in smartphones) or have been integrated in assisted driving technologies, we are still far from achieving human-level performance, in particular in terms of open-world learning. It should be noted that object detection has not been used much in many areas where it could be of great help. As mobile robots, and in general autonomous machines, are starting to be more widely deployed (e.g., quad-copters, drones and soon service robots), the need for object detection systems is gaining more importance. Finally, we need to consider that we will need object detection systems for nano-robots or for robots that will explore areas that have not been seen by humans, such as deep parts of the sea or other planets, and the detection systems will have to learn new object classes as they are encountered. In such cases, a real-time open-world learning ability will be critical.

ACKNOWLEDGMENTS

This research was partially funded by the FONDECYT Projects 3120218 and 1130153 (CONICYT, Chile).

REFERENCES

Agarwal, S., Awan, A., and Roth, D. (2004). Learning to detect objects in images via a sparse, part-based representation. IEEE Trans. Pattern Anal. Mach. Intell. 26, 1475–1490. doi:10.1109/TPAMI.2004.108
Alexe, B., Deselaers, T., and Ferrari, V. (2010). “What is an object?,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (San Francisco, CA: IEEE), 73–80. doi:10.1109/CVPR.2010.5540226
Aloimonos, J., Weiss, I., and Bandyopadhyay, A. (1988). Active vision. Int. J. Comput. Vis. 1, 333–356. doi:10.1007/BF00133571
Andreopoulos, A., and Tsotsos, J. K. (2013). 50 years of object recognition: directions forward. Comput. Vis. Image Underst. 117, 827–891. doi:10.1016/j.cviu.2013.04.005
Azizpour, H., and Laptev, I. (2012). “Object detection using strongly-supervised deformable part models,” in Computer Vision-ECCV 2012 (Florence: Springer), 836–849.
Azzopardi, G., and Petkov, N. (2013). Trainable cosfire filters for keypoint detection and pattern recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35, 490–503. doi:10.1109/TPAMI.2012.106
Azzopardi, G., and Petkov, N. (2014). Ventral-stream-like shape representation: from pixel intensity values to trainable object-selective cosfire models. Front. Comput. Neurosci. 8:80. doi:10.3389/fncom.2014.00080
Benbouzid, D., Busa-Fekete, R., and Kegl, B. (2012). “Fast classification using sparse decision dags,” in Proceedings of the 29th International Conference on Machine Learning (ICML-12), ICML ‘12, eds J. Langford and J. Pineau (New York, NY: Omnipress), 951–958.
Bengio, Y. (2012). “Deep learning of representations for unsupervised and transfer learning,” in ICML Unsupervised and Transfer Learning, Volume 27 of JMLR Proceedings, eds I. Guyon, G. Dror, V. Lemaire, G. W. Taylor, and D. L. Silver (Bellevue: JMLR.Org), 17–36.
Bourdev, L. D., Maji, S., Brox, T., and Malik, J. (2010). “Detecting people using mutually consistent poselet activations,” in Computer Vision – ECCV 2010 – 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part VI, Volume 6316 of Lecture Notes in Computer Science, eds K. Daniilidis, P. Maragos, and N. Paragios (Heraklion: Springer), 168–181.
Bourdev, L. D., and Malik, J. (2009). “Poselets: body part detectors trained using 3d human pose annotations,” in IEEE 12th International Conference on Computer Vision, ICCV 2009, Kyoto, Japan, September 27 – October 4, 2009 (Kyoto: IEEE), 1365–1372.
Cadena, C., Dick, A., and Reid, I. (2015). “A fast, modular scene understanding system using context-aware object detection,” in Robotics and Automation (ICRA), 2015 IEEE International Conference on (Seattle, WA).
Correa, M., Hermosilla, G., Verschae, R., and Ruiz-del-Solar, J. (2012). Human detection and identification by robots using thermal and visual information in domestic environments. J. Intell. Robot Syst. 66, 223–243. doi:10.1007/s10846-011-9612-2
Dalal, N., and Triggs, B. (2005). “Histograms of oriented gradients for human detection,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, Vol. 1 (San Diego, CA: IEEE), 886–893. doi:10.1109/CVPR.2005.177
Dalal, N., Triggs, B., and Schmid, C. (2006). “Human detection using oriented histograms of flow and appearance,” in Computer Vision ECCV 2006, Volume 3952 of Lecture Notes in Computer Science, eds A. Leonardis, H. Bischof, and A. Pinz (Berlin: Springer), 428–441.
Dean, T., Ruzon, M., Segal, M., Shlens, J., Vijayanarasimhan, S., Yagnik, J., et al. (2013). “Fast, accurate detection of 100,000 object classes on a single machine,” in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on (Washington, DC: IEEE), 1814–1821.
Delakis, M., and Garcia, C. (2004). Convolutional face finder: a neural architecture for fast and robust face detection. IEEE Trans. Pattern Anal. Mach. Intell. 26, 1408–1423. doi:10.1109/TPAMI.2004.97
Divvala, S., Hoiem, D., Hays, J., Efros, A., and Hebert, M. (2009). “An empirical study of context in object detection,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on (Miami, FL: IEEE), 1271–1278. doi:10.1109/CVPR.2009.5206532
Divvala, S. K., Efros, A. A., and Hebert, M. (2012). “How important are deformable parts in the deformable parts model?,” in Computer Vision-ECCV 2012. Workshops and Demonstrations (Florence: Springer), 31–40.
Dollar, P., Wojek, C., Schiele, B., and Perona, P. (2012). Pedestrian detection: an evaluation of the state of the art. IEEE Trans. Pattern Anal. Mach. Intell. 34, 743–761. doi:10.1109/TPAMI.2011.155
Endres, I., and Hoiem, D. (2010). “Category independent object proposals,” in Proceedings of the 11th European Conference on Computer Vision: Part V, ECCV’10 (Berlin: Springer-Verlag), 575–588.
Enzweiler, M., and Gavrila, D. (2009). Monocular pedestrian detection: survey and experiments. IEEE Trans. Pattern Anal. Mach. Intell. 31, 2179–2195. doi:10.1109/TPAMI.2008.260
Erhan, D., Szegedy, C., Toshev, A., and Anguelov, D. (2014). “Scalable object detection using deep neural networks,” in Computer Vision and Pattern Recognition


(CVPR), 2014 IEEE Conference on (Columbus, OH: IEEE), 2155–2162. doi:10.1109/CVPR.2014.276
Erol, A., Bebis, G., Nicolescu, M., Boyle, R. D., and Twombly, X. (2007). Vision-based hand pose estimation: a review. Comput. Vis. Image Underst. 108, 52–73; Special Issue on Vision for Human-Computer Interaction. doi:10.1016/j.cviu.2006.10.012
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A. (2010). The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 88, 303–338. doi:10.1007/s11263-009-0275-4
Fehr, J., and Burkhardt, H. (2008). “3d rotation invariant local binary patterns,” in Pattern Recognition, 2008. ICPR 2008. 19th International Conference on (Tampa, FL: IEEE), 1–4. doi:10.1109/ICPR.2008.4761098
Felzenszwalb, P. F., Girshick, R. B., and McAllester, D. (2010a). “Cascade object detection with deformable part models,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (San Francisco, CA: IEEE), 2241–2248.
Felzenszwalb, P., Girshick, R., McAllester, D., and Ramanan, D. (2010b). Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 32, 1627–1645. doi:10.1109/TPAMI.2009.167
Fischler, M. A., and Elschlager, R. (1973). The representation and matching of pictorial structures. IEEE Trans. Comput. C-22, 67–92. doi:10.1109/T-C.1973.223602
Fleuret, F., and Geman, D. (2001). Coarse-to-fine face detection. Int. J. Comput. Vis. 41, 85–107. doi:10.1023/A:1011113216584
Fleuret, F., and Geman, D. (2008). Stationary features and cat detection. Journal of Machine Learning Research (JMLR) 9, 2549–2578.
Gangaputra, S., and Geman, D. (2006). “A design principle for coarse-to-fine classification,” in Proc. of the IEEE Conference of Computer Vision and Pattern Recognition, Vol. 2 (New York, NY: IEEE), 1877–1884. doi:10.1109/CVPR.2006.21
Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014). “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (Columbus, OH: IEEE), 580–587.
Harding, S., Leitner, J., and Schmidhuber, J. (2013). “Cartesian genetic programming for image processing,” in Genetic Programming Theory and Practice X, Genetic and Evolutionary Computation, eds R. Riolo, E. Vladislavleva, M. D. Ritchie, and J. H. Moore (New York, NY: Springer), 31–44.
Hjelmås, E., and Low, B. K. (2001). Face detection: a survey. Comput. Vis. Image Underst. 83, 236–274. doi:10.1006/cviu.2001.0921
Huang, C., Ai, H., Li, Y., and Lao, S. (2007). High-performance rotation invariant multiview face detection. IEEE Trans. Pattern Anal. Mach. Intell. 29, 671–686. doi:10.1109/TPAMI.2007.1011
Huval, B., Coates, A., and Ng, A. (2013). Deep Learning for Class-Generic Object Detection. arXiv preprint arXiv:1312.6885.
Kölsch, M., and Turk, M. (2004). “Robust hand detection,” in Proceedings of the Sixth International Conference on Automatic Face and Gesture Recognition (Seoul: IEEE), 614–619.
Kotzias, D., Denil, M., Blunsom, P., and de Freitas, N. (2014). Deep Multi-Instance Transfer Learning. CoRR, abs/1411.3128.
Lampert, C. H., Blaschko, M., and Hofmann, T. (2009). Efficient subwindow search: a branch and bound framework for object localization. IEEE Trans. Pattern Anal. Mach. Intell. 31, 2129–2142. doi:10.1109/TPAMI.2009.144
Leitner, J., Harding, S., Chandrashekhariah, P., Frank, M., Förster, A., Triesch, J., et al. (2013). Learning visual object detection and localisation using icvision. Biol. Inspired Cogn. Archit. 5, 29–41; Extended versions of selected papers from the Third Annual Meeting of the BICA Society (BICA 2012). doi:10.1016/j.bica.2013.05.009
Li, J., and Allinson, N. M. (2008). A comprehensive review of current local features for computer vision. Neurocomputing 71, 1771–1787; Neurocomputing for Vision Research Advances in Blind Signal Processing. doi:10.1016/j.neucom.2007.11.032
Li, S. Z., and Zhang, Z. (2004). Floatboost learning and statistical face detection. IEEE Trans. Pattern Anal. Mach. Intell. 26, 1112–1123. doi:10.1109/TPAMI.2004.68
Li, Y., Wang, S., Tian, Q., and Ding, X. (2015). Feature representation for statistical-learning-based object detection: a review. Pattern Recognit. 48, 3542–3559. doi:10.1016/j.patcog.2015.04.018
Mesnil, G., Dauphin, Y., Glorot, X., Rifai, S., Bengio, Y., Goodfellow, I. J., et al. (2012). “Unsupervised and transfer learning challenge: a deep learning approach,” in JMLR W&CP: Proceedings of the Unsupervised and Transfer Learning Challenge and Workshop, Vol. 27, eds I. Guyon, G. Dror, V. Lemaire, G. Taylor, and D. Silver (Bellevue: JMLR.org), 97–110.
Mottaghi, R., Chen, X., Liu, X., Cho, N.-G., Lee, S.-W., Fidler, S., et al. (2014). “The role of context for object detection and semantic segmentation in the wild,” in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (Columbus, OH: IEEE), 891–898. doi:10.1109/CVPR.2014.119
Mutch, J., and Lowe, D. G. (2008). Object class recognition and localization using sparse features with limited receptive fields. Int. J. Comput. Vis. 80, 45–57. doi:10.1007/s11263-007-0118-0
Ong, E.-J., and Bowden, R. (2004). “A boosted classifier tree for hand shape detection,” in Proceedings of the Sixth International Conference on Automatic Face and Gesture Recognition (Seoul: IEEE), 889–894. doi:10.1109/AFGR.2004.1301646
Osuna, E., Freund, R., and Girosi, F. (1997). “Training support vector machines: an application to face detection,” in Proc. of the IEEE Conference of Computer Vision and Pattern Recognition (San Juan: IEEE), 130–136. doi:10.1109/CVPR.1997.609310
Ouyang, W., and Wang, X. (2013). “Joint deep learning for pedestrian detection,” in Computer Vision (ICCV), 2013 IEEE International Conference on (Sydney, VIC: IEEE), 2056–2063. doi:10.1109/ICCV.2013.257
Ouyang, W., Wang, X., Zeng, X., Qiu, S., Luo, P., Tian, Y., et al. (2015). “Deepid-net: deformable deep convolutional neural networks for object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (Boston, MA: IEEE), 2403–2412.
Paisitkriangkrai, S., Shen, C., and van den Hengel, A. (2015). Pedestrian detection with spatially pooled features and structured ensemble learning. IEEE Trans. Pattern Anal. Mach. Intell. PP, 1. doi:10.1109/TPAMI.2015.2474388
Paletta, L., and Pinz, A. (2000). Active object recognition by view integration and reinforcement learning. Rob. Auton. Syst. 31, 71–86. doi:10.1016/S0921-8890(99)00079-2
Palma-Amestoy, R., Ruiz-del-Solar, J., Yanez, J. M., and Guerrero, P. (2010). Spatiotemporal context integration in robot vision. Int. J. Human. Robot. 07, 357–377. doi:10.1142/S0219843610002192
Papageorgiou, C., and Poggio, T. (2000). A trainable system for object detection. Int. J. Comput. Vis. 38, 15–33. doi:10.1023/A:1008162616689
Park, D., Ramanan, D., and Fowlkes, C. (2010). “Multiresolution models for object detection,” in Computer Vision – ECCV 2010, Volume 6314 of Lecture Notes in Computer Science, eds K. Daniilidis, P. Maragos, and N. Paragios (Berlin: Springer), 241–254.
Peng, B., Zhang, L., and Zhang, D. (2013). A survey of graph theoretical approaches to image segmentation. Pattern Recognit. 46, 1020–1038. doi:10.1016/j.patcog.2012.09.015
Prati, A., Gualdi, G., and Cucchiara, R. (2012). Multistage particle windows for fast and accurate object detection. IEEE Trans. Pattern Anal. Mach. Intell. 34, 1589–1604. doi:10.1109/TPAMI.2011.247
Ramanan, A., and Niranjan, M. (2012). A review of codebook models in patch-based visual object recognition. J. Signal Process. Syst. 68, 333–352. doi:10.1007/s11265-011-0622-x
Razavi, N., Gall, J., and Van Gool, L. (2011). “Scalable multi-class object detection,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on (Providence, RI: IEEE), 1505–1512. doi:10.1109/CVPR.2011.5995441
Romdhani, S., Torr, P., Scholkopf, B., and Blake, A. (2001). “Computationally efficient face detection,” in Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, Vol. 2 (Vancouver, BC: IEEE), 695–700. doi:10.1109/ICCV.2001.937694
Rowley, H. A., Baluja, S., and Kanade, T. (1998). Neural network-based detection. IEEE Trans. Pattern Anal. Mach. Intell. 20, 23–28. doi:10.1109/34.655647
Schneiderman, H., and Kanade, T. (2000). “A statistical model for 3D object detection applied to faces and cars,” in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (Hilton Head, SC: IEEE), 746–751.
Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., and LeCun, Y. (2013). Overfeat: Integrated Recognition, Localization and Detection Using Convolutional Networks. arXiv preprint arXiv:1312.6229.
Serre, T., Wolf, L., and Poggio, T. (2005). “Object recognition with features inspired by visual cortex,” in CVPR (2) (San Diego, CA: IEEE Computer Society), 994–1000.
Song, H. O., Zickler, S., Althoff, T., Girshick, R., Fritz, M., Geyer, C., et al. (2012). “Sparselet models for efficient multiclass object detection,” in Computer Vision – ECCV 2012 (Florence: Springer), 802–815.

Sun, M., Bao, S., and Savarese, S. (2012). Object detection using geometrical context feedback. Int. J. Comput. Vis. 100, 154–169. doi:10.1007/s11263-012-0547-2
Sun, Z., Bebis, G., and Miller, R. (2006). On-road vehicle detection: a review. IEEE Trans. Pattern Anal. Mach. Intell. 28, 694–711. doi:10.1109/TPAMI.2006.104
Sung, K.-K., and Poggio, T. (1998). Example-based learning for viewed-based human face detection. IEEE Trans. Pattern Anal. Mach. Intell. 20, 39–51. doi:10.1109/34.655648
Szegedy, C., Toshev, A., and Erhan, D. (2013). “Deep neural networks for object detection,” in Advances in Neural Information Processing Systems 26, eds C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger (Harrahs and Harveys: Curran Associates, Inc), 2553–2561.
Torralba, A., Murphy, K. P., and Freeman, W. T. (2007). Sharing visual features for multiclass and multiview object detection. IEEE Trans. Pattern Anal. Mach. Intell. 29, 854–869. doi:10.1109/TPAMI.2007.1055
Torralba, A., and Sinha, P. (2001). “Statistical context priming for object detection,” in Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, Vol. 1 (Vancouver, BC: IEEE), 763–770. doi:10.1109/ICCV.2001.937604
Tsai, C.-F. (2012). Bag-of-words representation in image annotation: a review. ISRN Artif. Intell. 2012, 19. doi:10.5402/2012/376804
Tuytelaars, T., and Mikolajczyk, K. (2008). Local invariant feature detectors: a survey. Found. Trends Comput. Graph. Vis. 3, 177–280. doi:10.1561/0600000017
Verschae, R., and Ruiz-del-Solar, J. (2012). “Tcas: a multiclass object detector for robot and computer vision applications,” in Advances in Visual Computing, Volume 7431 of Lecture Notes in Computer Science, eds G. Bebis, R. Boyle, B. Parvin, D. Koracin, C. Fowlkes, S. Wang, et al. (Berlin: Springer), 632–641.
Verschae, R., Ruiz-del-Solar, J., and Correa, M. (2008). A unified learning framework for object detection and classification using nested cascades of boosted classifiers. Mach. Vis. Appl. 19, 85–103. doi:10.1007/s00138-007-0084-0
Viola, P., and Jones, M. (2001). “Rapid object detection using a boosted cascade of simple features,” in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (Kauai: IEEE), 511–518. doi:10.1109/CVPR.2001.990517
Viola, P., and Jones, M. (2002). “Fast and robust classification using asymmetric adaboost and a detector cascade,” in Advances in Neural Information Processing System 14 (Vancouver: MIT Press), 1311–1318.
Viola, P., Jones, M., and Snow, D. (2005). Detecting pedestrians using patterns of motion and appearance. Int. J. Comput. Vis. 63, 153–161. doi:10.1007/s11263-005-6644-8
Viola, P., and Jones, M. J. (2004). Robust real-time face detection. Int. J. Comput. Vis. 57, 137–154. doi:10.1023/B:VISI.0000013087.49260.fb
Weinland, D., Ronfard, R., and Boyer, E. (2011). A survey of vision-based methods for action representation, segmentation and recognition. Comput. Vis. Image Underst. 115, 224–241. doi:10.1016/j.cviu.2010.10.002
Wu, B., and Nevatia, R. (2005). “Detection of multiple, partially occluded humans in a single image by bayesian combination of edgelet part detectors,” in ICCV ’05: Proceedings of the 10th IEEE Int. Conf. on Computer Vision (ICCV’05) Vol 1 (Washington, DC: IEEE Computer Society), 90–97.
Wu, B., and Nevatia, R. (2007). “Cluster boosted tree classifier for multi-view, multi-pose object detection,” in ICCV (Rio de Janeiro: IEEE), 1–8.
Wu, L., Hoi, S., and Yu, N. (2010). Semantics-preserving bag-of-words models and applications. IEEE Trans. Image Process. 19, 1908–1920. doi:10.1109/TIP.2010.2045169
Wu, L., Hu, Y., Li, M., Yu, N., and Hua, X.-S. (2009). Scale-invariant visual language modeling for object categorization. IEEE Trans. Multimedia 11, 286–294. doi:10.1109/TMM.2008.2009692
Yan, J., Lei, Z., Wen, L., and Li, S. Z. (2014). “The fastest deformable part model for object detection,” in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (Columbus, OH: IEEE), 2497–2504.
Yang, M.-H., Ahuja, N., and Kriegman, D. (2000a). “Mixtures of linear subspaces for face detection,” in Proc. Fourth IEEE Int. Conf. on Automatic Face and Gesture Recognition (Grenoble: IEEE), 70–76.
Yang, M.-H., Roth, D., and Ahuja, N. (2000b). “A SNoW-based face detector,” in Advances in Neural Information Processing Systems 12 (Denver: MIT press), 855–861.
Yang, M.-H., Kriegman, D., and Ahuja, N. (2002). Detecting faces in images: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 24, 34–58. doi:10.1109/34.982883
Zafeiriou, S., Zhang, C., and Zhang, Z. (2015). A survey on face detection in the wild: past, present and future. Comput. Vis. Image Underst. 138, 1–24. doi:10.1016/j.cviu.2015.03.015
Zeng, X., Ouyang, W., and Wang, X. (2013). “Multi-stage contextual deep learning for pedestrian detection,” in Computer Vision (ICCV), 2013 IEEE International Conference on (Washington, DC: IEEE), 121–128.
Zhou, B., Khosla, A., Lapedriza, À., Oliva, A., and Torralba, A. (2014). Object Detectors Emerge in Deep Scene Cnns. CoRR, abs/1412.6856.
Zhu, X., and Ramanan, D. (2012). “Face detection, pose estimation, and landmark localization in the wild,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on (Providence: IEEE), 2879–2886.

Conflict of Interest Statement: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2015 Verschae and Ruiz-del-Solar. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

