VGG Object Category Detection
The goal of object category detection is to identify and localize objects of a given type in an image. Example
applications include detecting pedestrians, cars, or traffic signs in street scenes, objects of interest such as tools
or animals in web images, or particular features in medical images. Given a target class, such as people, a
detector receives as input an image and produces as output zero, one, or more bounding boxes around each
occurrence of the object class in the image. The key challenge is that the detector must find objects
regardless of their location and scale in the image, as well as of their pose and other factors of variation, such as
clothing, illumination, and occlusion.
This practical explores basic techniques in visual object detection, focusing on image based models. The
appearance of image patches containing objects is learned using statistical analysis. Then, in order to detect
objects in an image, the statistical model is applied to image windows extracted at all possible scales and
locations, in order to identify which ones, if any, contain the object.
In more detail, the practical explores the following topics: (i) using HOG features to describe image regions; (ii)
building a HOG-based sliding-window detector to localize objects in images; (iii) working with multiple scales and
multiple object occurrences; (iv) using a linear support vector machine to learn the appearance of objects; (v)
evaluating an object detector in terms of average precision; (vi) learning an object detector using hard negative
mining.
Getting started
Read and understand the requirements and installation instructions. The download links for this practical are:
After the installation is complete, open and edit the script exercise1.m in the MATLAB editor. The script
contains commented code and a description for all steps of this exercise, relative to Part I of this document. You
can cut and paste this code into the MATLAB window to run it, and will need to modify it as you go through the
session. Other files exercise2.m, exercise3.m, and exercise4.m are given for Parts II, III, and IV.
Each part contains several Questions and Tasks to be answered/completed before proceeding further in the
practical.
In this part we will build a basic sliding-window object detector based on HOG features. Follow the steps below:
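The data-loading code is not reproduced on this page; a minimal sketch, assuming a setup script and a loadData function that populate variables such as trainImages, trainBoxes, and trainBoxPatches (and their test counterparts), might look like this:

setup ;                          % add VLFeat and the practical code to the MATLAB path
targetClass = 'mandatory' ;      % an example class (the same one used in Part II below)
loadData(targetClass) ;          % populates trainImages, trainBoxes, trainBoxPatches, ...

% Visualize the average of the positive training patches
figure(1) ; clf ;
imagesc(mean(trainBoxPatches, 4)) ;
axis equal ; axis off ;
title('average of positive patches') ;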
An analogous set of variables testImages, testBoxes, and so on is provided for the test data. Familiarise
yourself with the contents of these variables.
Question: what can you deduce about the object variability from the average image?
Question: most boxes extend slightly around the object extent. Why do you think this may be valuable in
learning a detector?
hogCellSize = 8 ;
trainHog = {} ;
for i = 1:size(trainBoxPatches,4)
trainHog{i} = vl_hog(trainBoxPatches(:,:,:,i), hogCellSize) ;
end
trainHog = cat(4, trainHog{:}) ;
HOG is computed by the VLFeat function vl_hog (doc). This function takes as a parameter the size in pixels of
each HOG cell hogCellSize. It also takes an RGB image, represented in MATLAB as a w × h × 3 array
(extracted as a slice of trainBoxPatches). The output is a w/hogCellSize × h/hogCellSize × 31
dimensional array. One such array is extracted for each example image, and eventually these are concatenated
into a 4D array along the fourth dimension.
A basic (naive) model of the object can be obtained by simply averaging the HOG arrays of the positive examples:

w = mean(trainHog, 4) ;

The model can be visualized by rendering w as if it were a HOG feature array. This can be done using the
render option of vl_hog:
figure(2) ; clf ;
imagesc(vl_hog('render', w)) ;
Spend some time to study this plot and make sure you understand what is visualized.
im = imread('data/signs-sample-image.jpg') ;
im = im2single(im) ;
hog = vl_hog(im, hogCellSize) ;
scores = vl_nnconv(hog, w, []) ;
The first two lines read a sample image and convert it to single format. The third line computes the HOG features
of the image using the vl_hog function seen above. The fourth line convolves the HOG map hog with the model w.
It uses the function vl_nnconv [1] and returns a map of scores.
Task: Work out the dimensions of the scores array. Then check your result against the dimensions of the
array computed by MATLAB.
Question: Visualize the image im and the scores array using the provided example code. Does the
result match your expectations?
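The example code itself is not shown on this page; one possible visualization, together with the computation of the bestIndex referred to next (an assumed sketch), is:

% Show the input image and the map of detector scores side by side
figure(3) ; clf ;
subplot(1,2,1) ; imagesc(im) ; axis equal ; axis off ; title('image') ;
subplot(1,2,2) ; imagesc(scores) ; axis equal ; axis off ; title('detector scores') ;

% Find the highest-scoring location in the score map
[best, bestIndex] = max(scores(:)) ;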
Note that bestIndex is a linear index in the range [1, M], where M is the number of possible filter locations.
We convert this into a subscript (hx, hy) using MATLAB's ind2sub function:
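The conversion is a single line (a sketch; recall that the first output of ind2sub indexes rows, i.e. the vertical direction):

[hy, hx] = ind2sub(size(scores), bestIndex) ;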
(hx , hy ) are in units of HOG cells. We convert this into pixel coordinates as follows:
x = (hx - 1) * hogCellSize + 1 ;
y = (hy - 1) * hogCellSize + 1 ;
Question: Why do we subtract 1 and then add 1? Which pixel (x, y) of the HOG cell (hx, hy) is
found?
The size of the model template in HOG cells can be computed in several ways; one is simply:
modelWidth = size(trainHog, 2) ;
modelHeight = size(trainHog, 1) ;
detection = [
x - 0.5 ;
y - 0.5 ;
x + hogCellSize * modelWidth - 0.5 ;
y + hogCellSize * modelHeight - 0.5 ;] ;
Note: the bounding box encloses exactly all the pixels of the HOG template. In MATLAB, pixel centers have
integer coordinates and pixel borders are at a distance of ±1/2.
Question: Use the example code to plot the image and overlay the bounding box of the detected object. Did
it work as expected?
setup ;
targetClass = 'mandatory' ;
loadData(targetClass) ;
The mandatory target class is simply the union of all mandatory traffic signs.
Given the model w , as determined in Part I, we use the function detectAtMultipleScales in order to
search for the object at multiple scales:
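The exact call is given in the example script; an assumed sketch of its form (the signature of detectAtMultipleScales may differ):

% scales is a vector of image rescaling factors defined in the example script
figure(1) ; clf ;
detection = detectAtMultipleScales(im, w, hogCellSize, scales) ;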
Note that the function generates a figure as it runs, so prepare a new figure before running it using the figure
command if you do not want your current figure to be deleted.
Question: Open and study the detectAtMultipleScales function. Convince yourself that it is the same
code as before, but applied after rescaling the image a number of times.
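As a rough illustration only (not the actual implementation in detectAtMultipleScales.m), the multi-scale search amounts to something like:

bestScore = -inf ;
for s = scales
  ims = imresize(im, 1/s) ;           % rescale the image by the factor 1/s
  hog = vl_hog(ims, hogCellSize) ;    % recompute HOG features at this scale
  scores = vl_nnconv(hog, w, []) ;    % evaluate the template as in Part I
  [score, index] = max(scores(:)) ;
  if score > bestScore                % keep the best response across scales;
    bestScore = score ;               % its box coordinates must be multiplied
    bestScale = s ;                   % by s to map them back to the
    bestIndex = index ;               % original image
  end
end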
Question: Visualize the resulting detection using the supplied example code. Did it work? If not, can you
make sense of the errors?
Question: Look at the array of score maps generated by detectAtMultipleScales using the
example code. Do they make sense? Is there anything wrong?
In order to collect negative examples (features extracted from non-object patches), we loop through a number
of training images and sample patches uniformly:
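The sampling code is in the example script (see the task below); a rough sketch of the idea, assuming trainImages holds the training image file names and sampling ten windows per image, is:

neg = {} ;
modelWidth = size(trainHog, 2) ;
modelHeight = size(trainHog, 1) ;
for t = 1:numel(trainImages)
  t_im = im2single(imread(trainImages{t})) ;   % may require prepending the data directory
  t_hog = vl_hog(t_im, hogCellSize) ;
  % sample a few windows of the same size as the model at random positions
  for j = 1:10
    hx = randi(size(t_hog,2) - modelWidth + 1) ;
    hy = randi(size(t_hog,1) - modelHeight + 1) ;
    neg{end+1} = t_hog(hy:hy+modelHeight-1, hx:hx+modelWidth-1, :) ;
  end
end
neg = cat(4, neg{:}) ;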
Task: Identify the code that extracts these patches in example2.m and make sure you understand it.
% Pack the data into a matrix with one datum per column
x = cat(4, pos, neg) ;
x = reshape(x, [], numPos + numNeg) ;
We also need a vector of binary labels, +1 for positive points and -1 for negative ones:
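A minimal sketch of this label vector:

y = [ones(1, size(pos,4)) -ones(1, size(neg,4))] ;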
Finally, we need to set the parameter λ of the SVM solver. For reasons that will become clearer later, we use
instead the equivalent C parameter:
numPos = size(pos,4) ;
numNeg = size(neg,4) ;
C = 10 ;
lambda = 1 / (C * (numPos + numNeg)) ;
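The training call is in the example script; a sketch using VLFeat's vl_svmtrain, with the packed data x and labels y defined above, might be:

% Learn the SVM and reshape the weight vector back into a HOG template
[w, bias] = vl_svmtrain(x, y, lambda) ;
w = single(reshape(w, modelHeight, modelWidth, [])) ;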
Question: Visualize the learned model w using the supplied code. Does it differ from the naive model
learned before? How?
Question: Does the learned model perform better than the naive average?
Task: Try different images. Does this detector work all the times? If not, what types of mistakes do you see?
Are these mistakes reasonable?
% Compute detections
[detections, scores] = detect(im, w, hogCellSize, scales) ;
Task: Open and study detect.m . Make sure that you understand how it works.
Question: Why do we want to return so many responses? In practice, it is unlikely that any given image
contains more than a handful of object occurrences...
A single object occurrence generates multiple detector responses at nearby image locations and scales. In order
to eliminate these redundant detections, we use a non-maximum suppression algorithm. This is implemented by
the boxsuppress.m MATLAB m-file. The algorithm is simple: start from the highest-scoring detection, then
remove any other detection whose overlap with it (as defined below) is greater than a threshold. The function
returns a boolean vector keep of detections to preserve:
% Non-maximum suppression
keep = boxsuppress(detections, scores, 0.25) ;
For efficiency, after non-maximum suppression we keep just ten responses (as we do not expect more than a
few objects in any image):
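The corresponding code is in the example script; a sketch of this step:

% Discard suppressed detections, then retain at most the ten best
detections = detections(:, keep) ;
scores = scores(keep) ;
[~, order] = sort(scores, 'descend') ;
order = order(1:min(10, numel(order))) ;
detections = detections(:, order) ;
scores = scores(order) ;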
Detections are assessed in a manner similar to the PASCAL VOC challenge, by matching candidate detections to ground truth objects as follows:

1. Assign each candidate detection (bᵢ, sᵢ) a true or false label yᵢ ∈ {+1, −1}. To do so:
   1. The candidate detections (bᵢ, sᵢ) are sorted by decreasing score sᵢ.
   2. For each candidate detection in order:
      a. If there is a matching ground truth detection gⱼ (overlap(bᵢ, gⱼ) larger than 50%), the candidate detection is considered positive (yᵢ = +1). Furthermore, the ground truth detection is removed from the list and not considered further.
      b. Otherwise, the candidate detection is negative (yᵢ = −1).
2. Add each ground truth object gⱼ that is still unassigned to the list of candidates as the pair (gⱼ, −∞) with label yⱼ = +1.
The overlap metric used to compare a candidate detection to a ground truth bounding box is defined as the ratio
of the area of the intersection to the area of the union of the two bounding boxes:

overlap(A, B) = |A ∩ B| / |A ∪ B|.
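As an illustration, the overlap of two boxes A and B stored in the [xmin ymin xmax ymax] format used above can be computed as:

% Intersection-over-union of two axis-aligned boxes A and B
iw = max(0, min(A(3), B(3)) - max(A(1), B(1))) ;   % width of the intersection
ih = max(0, min(A(4), B(4)) - max(A(2), B(2))) ;   % height of the intersection
interArea = iw * ih ;
unionArea = (A(3)-A(1)) * (A(4)-A(2)) + (B(3)-B(1)) * (B(4)-B(2)) - interArea ;
overlap = interArea / unionArea ;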
Questions:
In order to apply this algorithm, we first need to find the ground truth bounding boxes in the test image:
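A sketch of this step, assuming the first test image is being evaluated and that testBoxImages stores, for each box, the image it belongs to:

% Collect the ground truth boxes belonging to this test image
s = find(strcmp(testImages{1}, testBoxImages)) ;
gtBoxes = testBoxes(:, s) ;
gtDifficult = false(1, numel(s)) ;   % no occurrence is marked as difficult here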
% PASCAL-like evaluation
matches = evalDetections(...
gtBoxes, gtDifficult, ...
detections, scores) ;
The gtDifficult flags can be used to mark some ground truth object occurrences as difficult and hence
ignored in the evaluation. This is used in the PASCAL VOC challenge, but not here (i.e. no object occurrence is
considered difficult).
% Visualization
figure(1) ; clf ;
imagesc(im) ; axis equal ; hold on ;
vl_plotbox(detections(:, matches.detBoxFlags==+1), 'g', 'linewidth', 2) ;
vl_plotbox(detections(:, matches.detBoxFlags==-1), 'r', 'linewidth', 2) ;
vl_plotbox(gtBoxes, 'b', 'linewidth', 1) ;
axis off ;
Task: Use the supplied example code to evaluate the detector on one image. Look carefully at the output and
convince yourself that it makes sense.
figure(2) ; clf ;
vl_pr(matches.labels, matches.scores) ;
Question: There are a large number of errors in each image. Should you worry? In what manner is the PR
curve affected? How would you eliminate the vast majority of these errors in practice?
Task: Open evaluateModel.m and make sure you understand the main steps of the evaluation procedure.
Use the supplied example code to run the evaluation on the entire test set:
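The call has the following assumed form (a sketch; the exact arguments are in the example script):

evaluateModel(...
  testImages, testBoxes, testBoxImages, ...
  w, hogCellSize, scales) ;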
Note: The function processes one image at a time, visualizing the results as it progresses. The PR curve reflects
the accumulation of the detections obtained so far.
Task: Open the evaluateModel.m file in MATLAB and add a breakpoint right at the end of the for loop.
Now run the evaluation code again and look at each image individually (use dbcont to go to the next
image). Examine the correct and incorrect matches in each image and their ranking, and the effect of this on
the cumulative precision-recall curve.
Hard negative mining is a simple technique for finding a small set of key negative examples. The idea is
simple: we start by training a model without any negatives at all (in this case the solver learns a one-class SVM),
and we then alternate between evaluating the model on the training data to find erroneous responses and adding
the corresponding examples to the training set.
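The mining call is in the example script; a sketch of its assumed form, where the second output moreNeg returning the hard negative features is an assumption about evaluateModel (the actual code may also subsample the training images):

% Evaluate the current model on the training images and collect the
% highest-scoring false positives as additional negative examples
[matches, moreNeg] = evaluateModel(...
  trainImages, trainBoxes, trainBoxImages, ...
  w, hogCellSize, scales) ;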
Here moreNeg contains the HOG features of the top (highest scoring and hence most confusing) image
patches in the supplied training images.
Task: Examine evaluateModel.m again to understand how hard negatives are extracted.
The next step is to fuse the new negative set with the old one:
% Add negatives
neg = cat(4, neg, moreNeg) ;
Note that hard negative mining could select the same negatives at different iterations; the following code
squashes these duplicates:
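One possible way to do this (a sketch) is to compare the flattened HOG vectors and keep only unique columns:

% Remove duplicated negative examples by comparing their flattened HOG vectors
z = reshape(neg, [], size(neg, 4)) ;
[~, keep] = unique(z', 'rows', 'stable') ;
neg = neg(:, :, :, keep) ;

The SVM can then be retrained on the expanded negative set and the updated model evaluated again on the test data: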
evaluateModel(...
testImages, testBoxes, testBoxImages, ...
w, hogCellSize, scales) ;
In this last part, you will learn your own object detector. To this end, open and look at exercise5.m . You will
need to prepare the following data:
Run the code in example5.m to check that your training data looks right.
Task: Understand the limitations of this simple detector and choose a target object that has a good chance of
being learnable.
Hint: Note in particular that object instances must be similar and roughly aligned. If your object is not symmetric,
consider choosing instances that face a particular direction (e.g. left-facing horse head).
Task: Make sure you get sensible results. Go back to step 5.1 if needed and adjust your data.
Hint: For debugging purposes, try using one of your training images as test. Does it work at least in this case?
In particular, many objects in nature are symmetric and, as such, their images appear flipped when the objects
are seen from the left or from the right (consider for example a face). This can be handled by a pair of
symmetric HOG templates. In this part we will explore this option.
Task: Using the procedure above, train a HOG template w for a symmetric object facing in one specific
direction. For example, train a left-facing horse head detector.
Task: Collect test images containing the object facing in both directions. Run your detector and convince
yourself that it works well only for the direction it was trained for.
HOG features have a well-defined structure that makes it possible to predict how the features transform when the
underlying image is flipped. The transformation is in fact a simple permutation of the HOG elements. For a given
spatial cell, HOG has 31 dimensions. The following code permutes the dimensions to flip the cell around the
vertical axis:
perm = vl_hog('permutation') ;
hog_flipped = hog(perm) ;
Note that this permutation applies to a single HOG cell. However, the template is an H × W × 31 dimensional
array of HOG cells.
Task: Given a hog array of dimension H × W × 31 , write MATLAB code to obtain the flipped feature
array hog_flipped .
Hint: Recall that the first dimension spans the vertical axis, the second dimension the horizontal axis, and the
third dimension feature channels. perm should be applied to the last dimension. Do you need to permute
anything else?
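One possible solution (a sketch): reverse the horizontal (second) dimension of the array and apply perm to the feature channels of every cell.

perm = vl_hog('permutation') ;
hog_flipped = hog(:, end:-1:1, perm) ;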
Task: Let w be the model you trained before. Use the procedure to flip HOG to generate w_flipped .
Then visualize both w and w_flipped as done in Sect. 1.3. Convince yourself that flipping was
successful.
We now have two models, w and w_flipped, one for each view of the object.
Task: Run both models in turn on the same image, obtaining two lists of bounding boxes. Find a way to merge
the two lists and visualise the top detections. Convince yourself that you can now detect objects facing either
way.
Hint: Recall how redundant detections can be removed using non-maximum suppression.
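A sketch of one way to do this, reusing the detect and boxsuppress functions from above (assuming boxes are returned as columns and scores as row vectors):

% Run both templates, pool their detections, and suppress redundant ones
[detsL, scoresL] = detect(im, w, hogCellSize, scales) ;
[detsR, scoresR] = detect(im, w_flipped, hogCellSize, scales) ;
detections = [detsL, detsR] ;
scores = [scoresL, scoresR] ;
keep = boxsuppress(detections, scores, 0.25) ;
detections = detections(:, keep) ;
scores = scores(keep) ;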
History
Used in the Oxford AIMS CDT, 2014-18
1. This is part of the MatConvNet toolbox for convolutional neural networks. Nevertheless, there is no neural
network discussed here. ↩