0% found this document useful (0 votes)
38 views82 pages

Image Processing and Computer Vision Notes

The document discusses various computer vision topics like image acquisition, filtering, transforms, edge detection, shape detection, segmentation, object detection, motion modelling, optical flow, stereo vision. It provides details on techniques like convolution, Fourier transforms, Hough transform, thresholding and clustering for image segmentation.

Uploaded by

joe bloggs
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
38 views82 pages

Image Processing and Computer Vision Notes

The document discusses various computer vision topics like image acquisition, filtering, transforms, edge detection, shape detection, segmentation, object detection, motion modelling, optical flow, stereo vision. It provides details on techniques like convolution, Fourier transforms, Hough transform, thresholding and clustering for image segmentation.

Uploaded by

joe bloggs
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 82

Contents

Acquisition and Representation...........................................................................................................4


Components of acquiring an image...................................................................................................4
Challenges faced by a computer vision system.................................................................................4
Pinhole Camera (The Camera Obscura).............................................................................................4
Digital Image Acquisition...................................................................................................................4
Modelling a spatial brightness pulse.................................................................................................4
The Point Spread Function.................................................................................................................5
Quantization......................................................................................................................................5
Spatial Sampling................................................................................................................................6
Shannon’s Sampling Theorem...........................................................................................................6
Sampling and Quantization Summary................................................................................................7
Filtering Images.....................................................................................................................................8
Image Transforms..............................................................................................................................8
Convolution.......................................................................................................................................8
Spatial Low/High Pass Filtering..........................................................................................................8
Normalising Convolution Results.......................................................................................................9
Gaussian Filter Kernel........................................................................................................................9
Noise Removal: The Median Filter...................................................................................................10
Frequency Domain and Transforms.....................................................................................................12
Signals as functions..........................................................................................................................12
Fourier’s Theorem...........................................................................................................................12
Frequency in images........................................................................................................................13
Conjugate Symmetry.......................................................................................................................15
Spatial and Frequency Domain........................................................................................................16
Relating frequencies to images........................................................................................................16
Convolution in the Spatial/Frequency Domain................................................................................18
Filters...............................................................................................................................................19
Interpretation of the Power Spectrum................................................................................................19
Edge Detection....................................................................................................................................20
Gradient...........................................................................................................................................20
Shape Detection..................................................................................................................................23
Hough Space....................................................................................................................................23
General Hough Transform...............................................................................................................24
Image Segmentation............................................................................................................................26
Thresholding....................................................................................................................................26
Split and Merge...............................................................................................................................27
Clustering.........................................................................................................................................27
Object Detection.................................................................................................................................27
Viola & Jones Method......................................................................................................................28
Features...........................................................................................................................................28
Integral Images................................................................................................................................28
Feature Selection.............................................................................................................................29
Adaboost.........................................................................................................................................29
Viola Jones Cascade/Early-Stopping................................................................................................30
Motion Modelling................................................................................................................................30
Perspective Projection Equations....................................................................................................31
3-D Rigid Motion..............................................................................................................................32
Fundamental Rotation Matrices..........................................................................................................33
3-D Motion Field..............................................................................................................................34
Optical Flow.........................................................................................................................................37
Motion Estimation...........................................................................................................................37
Motion Estimation- Constraining Optical Flow....................................................................................41
Constant Velocity Model.................................................................................................................41
Image example of spatial, temporal gradients & optical flow vector..............................................44
Frame Difference.............................................................................................................................44
Multiresolution :..............................................................................................................................46
Affine Motion Model.......................................................................................................................47
Horn-Schunk Algorithm...................................................................................................................48
Motion Segmentation..........................................................................................................................49
2.5D Layered Representation..........................................................................................................50
Affine Motion..................................................................................................................................51
Segmentation..................................................................................................................................52
Parametric Motion Regions.............................................................................................................53
Region Linking and Alignment.........................................................................................................53
Stereo..................................................................................................................................................53
Three Problems of Stereo................................................................................................................53
Stereo- Epipolar Geometry..................................................................................................................54
Simple Two-View Stereo..................................................................................................................54
More General Two-View Stereo......................................................................................................55
Epipolar Lines..................................................................................................................................57
Epipolar Geometry..........................................................................................................................60
Stereo- Reconstruction........................................................................................................................63
Image Points and Pixels...................................................................................................................64
Estimating the fundamental matrix from correspondences............................................................65
3-D reconstruction...........................................................................................................................66
Solving 3-D reconstruction..............................................................................................................67
Stereo- Correspondence Matching......................................................................................................69
Region Based Methods....................................................................................................................69
Feature-Based Methods..................................................................................................................71
Scale-Invariant Feature Transform (SIFT)........................................................................................71
Where to look for corresponding points in a calibrated setup........................................................73
Where to look for corresponding in an UNcalibrated setup............................................................73
Acquisition and Representation
Components of acquiring an image
 Detection: are there cars?
 Verification: is that a bus?
 Identification: is that a picture of a certain person?
 Object categorization: sky, building, flag, banner, bus, car, lamp
 Scene and context categorization: outdoor, city, traffic, etc.
 Rough 3D layout, depth ordering

Challenges faced by a computer vision system


 View-point variation- viewing same face from multiple POVs
 Illumination- even humans can struggle to recognize same object under different lighting
conditions
 Occlusion- partially obscured subject
 Scale- recognizing same object at different sizes
 Deformation- many objects, especially living ones, can change shape (e.g. a horse running vs
standing still)
 Background clutter- difficult to pick object from background
 Object intra-class variation- there are a lot of different types of chairs
 Local ambiguity- objects with similar local appearances
 World behind the image- extrapolating full 3D model of object from 2D image

Pinhole Camera (The Camera Obscura)


 Light is reflected off an object and travels through a pinhole into a very dark box
 This forms an inverted image
 The distance from the pinhole to the image plane the image is projected on is the focal
length
 The size of the pinhole and the focal length determine the image quality
 A small pinhole gathers less light, leading to a darker image
 A larger pinhole reduces sharpness (a blurry image), but will produce a brighter image

Digital Image Acquisition


 Rays of light reflected from an object pass through a lens and are projected (inverted) onto a
discrete sensor array
 Sensors have a limited resolution- the lower res, the coarser the image
 Sensors can only store the light as a series of discrete values- effectively, sensors register
average colour
 These discrete values might be say, a representation of colour as RGB values
 This produces an approximation of the true image

Modelling a spatial brightness pulse


 The Dirac Delta-Function is used to model images digitally
 The delta-t function exists between plus and minus infinity and is zero everywhere except
at zero itself, where its value is 1
 Thus, the area under the curve is always 1
 We use the Dirac Delta-Function to sample a signal by stepping through the signal and
recording a measure
 The closer the steps are together, the better the representation of the signal (sampling rate)


 The Dirac Delta-Function is useful because of the sifting property
 This shows that when we sample a function with the Dirac Delta-function (by taking the
integral of the function times the Dirac Delta-function), we are sampling it at 0
 If we instead shift the DDF by an amount alpha, and then do the same thing, we instead
sample our function at f(alpha); we have sifted out the value of function f at alpha
 This lets us step through a signal and sample it
 This sifting property can be used to express a 2D ‘image function’ (i.e., an image) as a linear
combination of 2D Dirac pulses located at points (a, b) that cover the whole image plane


 This is just the DDF in two dimensions

The Point Spread Function


 An optical system takes a true scene and represents it is an image
 Ideally an optical system would map point information to points
 However, optical systems are never ideal


 The point spread function is defined as the response of an imaging system to a point source
 The superposition principle states that an image is the sum of the PSF of all its points
 Thus all images are a ‘blurry’ representation of the real world, no matter how sharp they
seem to our eyes


 If the PSF, h(x, y) was more like a point, we would get a less blurry outcome image

Quantization
 Quantization is the process of mapping infinite continuous values with a smaller set of
discrete values
 Quantization level is how many bits are used to represent a given point

Spatial Sampling
 The sparser the sampling, the more aliased an image is
 Anti-aliasing can be achieved by removing all spatial frequencies above a critical limit (the
Shannon-Nyquist limit)

Shannon’s Sampling Theorem


 Sampling must be performed be above twice the highest (spatial) frequency of the signal to
be lossless
 Too few points, and you cannot recover the signal
Sampling and Quantization Summary


Filtering Images
Image Transforms
 Image transforms are a derivation of a new representation of the input data (e.g. by re-
encoding it in another parameter space)
 If all or some of the pixels in the image are simply changed to a new set of pixels, then the
transformation has occurred in the space of the image, and is said to have occurred in the
spatial domain
 If the transform causes the pixel to have new representation (e.g., by Fourier transform),
the image is said to be transformed into the frequency or Fourier domain
 Image transforms are classic processing techniques, used in filtering, compression, feature
extraction, etc.

Convolution
 Convolution is a mathematical operation on two functions (f and g) that produces a third
function (h = f * g) representing how the shape of one is modified by the other


 Here f is our signal and g is called our kernel
 Intuitively, we reverse the function g, then step the function g through time, and at each
time point look at the overlap between f(x) and g(x)- this gives the area under the curve
h(x)
 Thus the curve h(x) is given by integrating this area


 Auto-correlation is when a function is convolved with itself
 Discrete 2D convolution, as would be used in image processing, is defined as:


 2D convolution has a flipping effect, from left-to-right and top-to-bottom
 Replacing the minus signs in the 2D convolution definition gives us the discrete definition of
2D correlation; this would have no flipping at all
 This effectively superimposes the kernel directly on the signal
 Correlation and convolution are equivalent when the kernel is symmetric under 180-degree
rotation (e.g. 1, 2, 1)

Spatial Low/High Pass Filtering



 The low pass filter smooths an image, while the high pass filter weights the central pixel
more heavily, stretching it and leading to a sharpened version of the original image

Normalising Convolution Results


 The weights used during convolution will affect pixel values, which could cause results to fall
outside 0-255
 If all weights are positive, normalise such that they sum to 1
 If all there are positive and negative weights, normalise such that they should sum to 0

Gaussian Filter Kernel


 Very commonly used in image processing


 The value of sigma determines the size of the Gaussian kernel- the higher sigma is, the more
smoothing occurs

 Filtering with a 2D Gaussian can be computationally expensive
 Same result can be achieved quicker by using two 1D Gaussian filters
 This works because the linear Gaussian kernel is separable
 That is:


 It can be decomposed into row and column vectors


 Separability means that a 2D convolution can be reduced to the product of two 1D
convolutions- one on the rows, one on the columns

Noise Removal: The Median Filter


 When smoothing with an averaging filter, we don’t actually need to perform convolution
 Same effect can be achieved by simply averaging pixels in a pre-defined neighbourhood
 The median filter removes noise
o It returns the median value of pixels in a neighbourhood
o It’s non-linear
o It’s similar to a uniform blurring filter which returns the mean value of the pixels in
a neighbourhood of a pixel
o Best at removing salt and pepper noise- salt and pepper noise can be caused by
sudden changes in the image signal, which might set a pixel to 0 or 255, since the
original value falls out of that range


 As we can see, the median filter does a much better job removing salt and pepper noise
than a standard averaging filter, as the median filter eliminates extremes

 We can obtain the detail of an image by subtracting a smoothed image from the original-
then adding this detail to the original gives us a sharpened image

Frequency Domain and Transforms


Signals as functions
 Frequency allows us to characterise signals:
o Repeats over regular intervals with frequency u = 1/T cycles/sec
o Amplitude alpha (peak value)
o The phase theta (shift in degrees)

Fourier’s Theorem
 States that a signal can be broken into sines and cosines, each of which is weighted
differently and exists at a different frequency

 The sines and cosines are the basis functions of the representation and An and Bn are the
Fourier coefficients
 The sinusoids are harmonically related: each one has a frequency which is an integer
multiple of the fundamental frequency of the input signal


 A Fourier transform then essentially breaks down a complex signal into its constituent sines
and cosines

Frequency in images
 Frequency in an image is defined as the rate of change of intensity
 If similar pixels are grouped together, rate of change is low -> low frequency; if pixels change
drastically rapidly, like in an image of black and white stripes, rate of change is high -> high
frequency
 Images in general include a range of frequencies- an image might have high and low
frequency regions



 The Fourier transform is essentially a projection onto that exponential function
 We take a given u, v, sum the projections of all pixels in the image, and then repeat this for
all u, v in the image
 u and v have the same domain as the image


 When u and v both equal zero, cos(0) – sin(0) = 1, which gives us the slowest varying
frequency component, which is the average image graylevel

 The power spectrum (left) shows the magnitude of the frequencies

Conjugate Symmetry
 When we compute the Fourier space for an image, if we were to plot this we would have u,
v = 0 at the top left
 We want it in the middle, for intuitive reading
 We thus shift the space by replacing each quadrant with its diagonal opposite- this diagonal
symmetry is known as conjugate symmetry
 That is:

Spatial and Frequency Domain


 We often look at the log of the Fourier transform + 1 when doing visual analysis, as this
prevents lower value points from being ‘washed out’ and lost by very high magnitude points

Relating frequencies to images


 Lower frequencies in an image represent the contrast of an image, while higher frequencies
represent the detail



 Here the vertical line in the magnitude/power shows that the most frequency change occurs
moving vertically through the image (i.e., it represents moving between the horizontal rows
of train cars) while the horizontal line in the magnitude shows a weaker but still substantial
frequency change moving horizontally through the image, (i.e., it represents moving
between the vertically defined individual train cars)


 Here we remove the peaks in the Fourier space before performing an inverse Fourier
transform- this cleans up the lines of interference

 As we can see, phase encodes structure and image positions, while magnitude is simply a
measure of the intensity of points in the structure, that is, it encodes contrast

Convolution in the Spatial/Frequency Domain


 This can be useful when doing things like convolving large matrices- we can simply transform
them into the Fourier space, multiply them, and then do an inverse Fourier transform, which
is much cheaper computationally

Filters
 High pass filters allow high frequencies to pass through, removing contrast and preserving
detail, while low pass filter allow low freqs. Through, preserving contrast and removing
detail

Interpretation of the Power Spectrum


 Orientation of magnitude points in power spectrum corresponds to orientation of
frequencies in the image; for lines, the orientation in the power spectrum is perpendicular
to the orientation of the line (example seen in 18-19 past paper)
 Translation of the image would not change position of dots in power spectrum
 Rotation of the image would rotate the Fourier space and thus the dots by the same amount

 Scaling an image has an inverse result in the Fourier space- intuitively, increasing image size
leads to decrease in frequency as pixels are duplicated, and vice versa

Edge Detection
 Edges highlight the contours of objects
 Edges are sharp changes of image brightness
 Nuisance edges are edges which are not helpful in recognizing or identifying a shape
 We can detect edges by looking for a measure of change in a pixel’s neighbourhood
 We can look at derivative of the image to identify edges by looking for where the derivate
(and thus rate of change) is greatest

Gradient
 The gradient is a vector in the direction of the maximum growth of the function (i.e.,
greatest change)

 The edge vector is perpendicular to the gradient vector, and is in the direction of the edge
itself
 We calculate the derivative of the image as a discrete approximation of the actual
derivative (as our image is itself only a discrete approximation of a continuous function)
 We simply look at the differences of the pixels from left-to-right and top-to-bottom; we
subtract the pixel to the left from the one to the right, and so on
 We can do this quickly with convolution, by applying certain kernels to the image


 Calculating the derivative simply with central difference (that is, considering only pixels
immediately to the left and right or top and bottom of a pixel) is very sensitive to noise, and
so edges can be lost; the above convolution method considers a larger neighbourhood and
is less sensitive to noise

 The Prewitt Operator kernel is the 3x3 kernel discussed above
 By contrast, the Sobel operator uses a similar idea but gives greater weight to the central
pixels; approximately, this is the derivate of an image that has been Gaussian smoothed



 Image processing is not image detection- this is just highlighting edges, not recognizing
shapes

Shape Detection
Hough Space
 A straight line in 2D space can be described the parametric equation:

 As such, all straight lines in 2D space can be described by rho and theta- together, they form
the Hough Space of lines
 If we plot all the lines passing through a given point in the image space in the Hough space,
we get a sinusoidal curve; each point (theta, rho) on the curve represents a straight line
passing through the given point in the image space

 To represent an image in the Hough space, we look at every pixel in the image and consider
all the lines passing through it in the image space; then, for each line in the image space, we
sum the intensities of all pixels intersecting that line in the image, and plot this summed
intensity in the Hough space
 Peaks in the Hough space correspond to lines in the image space


 Extending this algorithm to say, a circle, would require a 3D Hough space- you would need x,
y coordinates for the centre of the circle and a radius. This approach becomes intractable for
more complex shapes, so we instead use the Generalized Hough Transform

General Hough Transform


 We pick an origin point for our shape, and store X, Y co-ords, then for each point on the
shape’s edge we store the vector from the point to the origin (that is, an angle and a radius)

Interpreting the Hough Space


 X direction represents theta, the angle of a line, and Y direction represent rho, the distance
of the line from the origin
 Each point in Hough Space represents a line in the image space; lines passing through the
same point in the image space lie on the same curve in the Hough space
 So each curve in Hough space represents a point; each point on that curve represents a
different possible line passing through said point in the image space- note that these lines
in image space might be mostly or entirely empty- the content of these lines is represented
in the Hough space by the intensity of the point
 Peaks in the Hough space represent lines in the image space
 If you have many lines in the same direction, but separated, you will see extended peaks (i.e.
peaks that look like vertical bars)- see problem sheet 3 skyscraper image for an example



Image Segmentation
 Process of spatially splitting an image into regions of pixels
 Perfect segmentation is difficult- pixels may straddle boundaries between objects, noise,
illumination, etc.
 Over-segmentation- pixels belong to the same region erroneously classified into multiple
regions
 Under-segmentation- pixels belong to different regions classified as same region
Thresholding
 All pixels over a certain brightness threshold are classified as a region of interest, below is
background

Split and Merge


 Homogeneity function- takes a region and returns a measure of how homogenous, 1 is
homogenous, 0 is inhomogeneous


 Split and merge tends to give blocky results, due to just splitting image into fours
 Can use adaptive homogeneity functions that vary based on region size

Clustering
 Cluster pixels by colour and group similar coloured pixels as a region
 Can use K-Means Clustering

Object Detection
 Bridges the semantic gap between given pixel values and meaningful objects
 Image regions need to be found and assigned semantic labels from a space of object classes
 Problems with classical shape detection (e.g. edge detection, colour histograms):
o High intra-class variance
o Low inter-class variance
o See first lecture for more

Viola & Jones Method


 Image is tested for object presence window by window
 The window is ‘slided’ and ‘scaled’ through the image
 Each window is judged w.r.t. to an object model and gives a response indicating object
presence or absence

Features
 Haar-like features: define regions and some operation, such that an object is likely to be
present given certain outcomes of the operation

Integral Images
 To avoid unnecessary duplicate calculations when considering features, we use integral
images
 Each pixel’s value is set to the value of all the pixels above and to the left of it
 This makes window proposals and feature calculations much quicker

 Can calculate pixel values for specific regions and subregions by subtracting the ‘corners’ of
the region:


 To find the corners for an arbitrary region shape, we give all pixels in the region a value of 1,
and 0 to those outside the region; we then differentiate with respect to x, and then
differentiate that result with respect to y; this gives us the corners of the image- we simply
add the values in spots where the 1s appear and subtract those where -1s appear

Feature Selection
 Could hand-select, but difficult and time-consuming
 We want to use training data to allow system to choose good features
Adaboost


 Adaboost trains a classifier on every single feature, then selects the classifier with the lowest
training error
 It then increases the weight of misclassified samples, and repeats the above step
 The final classifier is a linear combination of all the individual classifiers

Viola Jones Cascade/Early-Stopping


 Can use a simple classifier for rejecting most of the image, like background stuff, and then
only run the more complicated classifiers on a subset of the image
 Allows it to terminate early, makes it quicker for running real-time

Motion Modelling
 When an object or camera in a scene moves, the distance points on the object move in the
image between frames is known as the incremental 2-D motion, or the 2-D velocity in the
image plane

 When an object moves from point P to point Q, the point where it crosses the image plane
changes as well, from p to q. The distance pq is the 2-D motion, which can also be thought
of as a projection of the 3-D motion PQ into 2-D
 3-D motion does not uniquely map to 2-D motion, as can be seen below; p’ will the be the
point on the image plane for any 3-D point along the red ray from p’ to P’

Perspective Projection Equations


 Thus as Z, or distance from the camera increases, objects appears smaller, due to the
inverse relationship defined above
 Meanwhile focal length f has a positive relationship with image size

3-D Rigid Motion


 For simplicity, we assume that objects are rigid (don’t change shape), and can move by 3-D
translation and rotation
 All motion can be explained by a translation and rotation, i.e. for any point on an object, P’ =
RP + T
 Rotation matrix can be written as a product of three individual rotation matrices about
each axis
 The angles defining how much the object has rotated around each axis are known as the
Euler angles
 Matrix multiplication is not generally commutative, so order of multiplication matters
 However, here we make the assumption of small angle approximations: this states that for
small theta (which we assume is always the case due to dealing with incremental motion),
cos theta is approximately 1 and sin theta is approximately theta.
 Under this assumption, order of multiplication does not matter

Fundamental Rotation Matrices


 Above are the fundamental rotation matrices
 Not necessary for above rigid body motion through a small theta, but useful later
3-D Motion Field

 This equation relates an object’s position P’ to its old position P


 R – identity matrix gives 0s where there are 1s in R, thus giving the V component vectors
shown in the slide
 TX, TY, etc. are the X, Y components of the translation matrix

 The only thing that changes between points on an object is the values of small x, y and the
value of big Z, the distance from the camera


 When there is only rotation, the depth of the scene is irrelevant, as the translational term is
the only part of the equation that uses Z and is 0 when there is no translation
 When there is only translation, the angular velocity is obviously irrelevant


 The above shows that for translation, the same 3-D motion field does not necessarily
correspond to the same 2-D motion field, as depth is a factor
 Conversely, when there is only rotation, depth is not a factor and so the same 3-D motion
field produces the same 2-D motion field
 IMPORTANT NOTE: this derivation is for the theoretical 2-D motion field produced by
projection 3-D motion vectors into the image plane; it is NOT necessarily what we perceive
in a video sequence produced by the changing colours of pixels

Optical Flow
Motion Estimation
 Estimation of 2-D motion field from frames in an image sequence
 Relationship between variation in pixel values- apparent motion or optical flow- and true
motion is not straightforward

 Optical flow- perceived motion in video sequence caused by changes in pixel values
 Extreme examples of cases where optical flow is very different to true 2-D motion field:
o Non-zero apparent motion for zero motion field, e.g. static scene, moving light
source
o Zero apparent motion for non-zero motion field, e.g. a sphere of constant colour
rotating in diffuse lighting
 Sometimes not possible to determine 2-D motion flow without additional assumptions
 Assume optical flow results from brightness constancy constraint- this says that a moving
pixel retains its value between frames
 Alternatively phrased: two points in adjacent frames corresponding to the same 3-D point
have the same value (i.e. colour)

 Higher order terms in Taylor expansion tend to 0 for tiny change in x, y, t, so in order for the

brightness constancy constraint equation to hold


must sum to 0

 So the optical flow equation is:


o Rate of change of I in X direction times the optical flow in X + rate of change of I in Y
direction times optical flow in Y + rate of change of I with time

 If we replace ux and uy with the vector u = (ux, uy) and similarly

 Then the OFE becomes


 However, this is equation with two unknowns
 We split the vector u into two components up and un. We define un as the component of u
in the direction of ∇I (the spatial gradient), and un as the component perpendicular to it
 Then ∇I . up = 0, by definition (as up is defined perpendicular to ∇I), so we get:


 From which we can derive the equations for the magnitude and direction of un:


 Thus at any given point the OFE only allows us to estimate the normal flow- without further
information, we cannot calculate the magnitude of the motion parallel to the edge feature,
and so we do not know the full flow vector

 As shown above, we can find vn, the normal flow, which is the component of optical flow
perpendicular to the edge feature, but not the component parallel to it- thus the point could
now lie anywhere along the green dotted line

Motion Estimation- Constraining Optical Flow


Constant Velocity Model
 We constrain our model by assuming that the optical flow field is constant within small
regions- i.e. we assume adjacent flow vectors are the same
 Then:


 This works because if our assumption holds (that adjacent flow vectors are the same) then
then the OFE should tend to 0- thus the sum of squares of the OFEs throughout the region
also tends to 0
 Therefore, the best estimate of OF will be the one that minimizes the value of the OFE
throughout the region (as a perfect estimate would sum to 0 throughout the region)
 We can take the derivate w.r.t. to ux and uy, set them to zero and then solve for ux and uy+

Lucas and Kanade Algorithm


 Can fail, e.g. if A has no inverse
 Estimates the full flow (i.e. not just normal) assuming there is sufficient variation in the
spatial gradient
 It minimises the deviation from the optical flow equation within the region
 It assumes there is only one motion in the region
 It uses gradients to minimise the OFE
 So we can simply solve for the flow vector u by finding the inverse of A and multiplying it by
b
 This method can fail if A has no inverse
 Need a way to estimate Ix, Iy and It (the spatial and temporal gradients respectively) using
differences
 Recall that Ix and Iy are the rate of change of pixel values in the x and y directions


 If we assume the distance between pixels and frames is 1, then we can just find the
difference between pixel values separate by 1 in the relevant direction
 We average the difference between multiple values to reduce noise
 So, for example, we find the difference between the value of a pixel and its immediate right
neighbour at the same time/in the same frame, repeat multiple times and average the diff.-
this gives an approximation of Ix
 Can do the same thing for y or time
 In the slide above, and in Lucas & Kanade alg., we’re taking values from different frames
(two from each in the above example)
 Note however that in this case the position of the estimated gradient is between the pixels
along the x direction and between the frames along the t direction- this is fine, as long as the
other gradients in y and t are also computed in a similar way to be consistent

 So, in essence, we take two adjacent frames, then for each pixel in our region we compute
the temporal and spatial gradients between it and its immediate neighbours, and use that to
iteratively add to A and b; finally, we compute the flow vector u with the inverse of our final
A and our final b

Image example of spatial, temporal gradients & optical flow vector


 Recall edge detection- Ix shows where biggest changes in pixel value in the x direction,
similar for y
 It shows where biggest changes in pixel value lie in the time direction. Therefore, we see the
biggest values where motion brings a very different coloured point to the space, like with
the moving ball, train and calendar
 ux and uy, the optical flow, have greatest magnitude (the darker sections) where there is
most motion in the relevant direction- the ball and train move most horizontally, as shown
by ux, while the calendar moves most vertically, as shown by uy

Frame Difference
 Obviously, we don’t know our ground truth (the true motion field), so we can’t compare it
to our estimated optical flow
 We can however use our estimated flow vectors to move one frame towards the other and
look at the difference- in other words, apply the motion to one frame and check whether
we get the other frame
 This is motion compensated differencing

 The left image shows the difference in pixel values between frames, the right shows the
difference when the earlier frame is motion compensated and then compared with the latter
one
 If the optical flow approximation was exact, the right image would be all black, as the
difference would be 0 everywhere- this obviously isn’t the case
 However, we can see that the difference is significantly smaller in the right image, so our
approximation is doing a pretty good job
Multiresolution :

Affine Motion Model

Horn-Schunk Algorithm

Motion Segmentation

 Challenges:
o Need to know what a ‘homogenous motion’ region is, requires defining a 2-D motion
field model
o Difficult to determine ‘correct’ boundaries
o Generalised aperture problem- large regions needed for good motion estimate, but
large regions morel likely to contain motion boundaries
o Usual segmentation challenges, occlusions, temporal variation, objects moving in
and out of view, etc.
2.5D Layered Representation


 Video is represented as superpositioned layer of different depths

 We first estimate the motion of each pixel, such as by Lucas and Kanade
 The image is then split into grid blocks and each block is fitted to an Affine motion model
 K-means clustering is then used to find the dominant motions (each dominant motion will be
a centroid in K-means)
 Each pixel is then assigned to a dominant motion region based on which region most closely
matches the motion estimated generated for it in step 1
 Each dominant motion region becomes a separate layer, and we finally fit a final affine
motion field for each layer

Affine Motion
 When an object is sufficiently far from the camera, we can approximate it as a planar region
 This lets us define a quadratic motion field

 We can then ignore the 2nd order terms, so we got our affine motion model
Segmentation

 The black arrow is the L&K estimate for a specific pixel from the first step of the alg., and the
yellow green and white are the motion fields generated by using the affine parameters from
each motion region- we pick the closest one to the L&K estimate (in this case yellow) and
assign the pixel to that region

Parametric Motion Regions


 When we find the initial affine parameters for each block, they will be ‘corrupted’ around
boundaries where pixels within the same block belong to different motion regions
 That shouldn’t be the case now, so we refit an affine model to each motion region using all
the pixels assigned to each region in the previous step
Region Linking and Alignment

 We propagate motion forward based on the motion estimates calculated for each region,
which let us ‘link’ motion regions between frames
 We can then ‘overlap’ regions across multiple frames, which lets us extract colour, velocity
and alpha maps

Stereo
 Stereo vision- 3-D from two images taken from different viewpoints
 Objects appear in different positions in each viewpoint- this is called parallax
 Position of object in each image is dependent on depth- magnitude of position difference
(disparity) is in inversely proportional to depth
 That is, disparity is greater between views the closer an object is to the cameras
 Thus is if we know disparity, we can estimate scene structure from two images

Three Problems of Stereo


 To get a 3-D estimate from a stereo pair, we need must do the following
 Calibration- determine relative position and orientation of the cameras
 Correspondence- determine matching points in the stereo views (i.e. map the same 3-D
point to corresponding 2-D point in each image)
 Reconstruction- determine 3-D location in scene of matched points via triangulation
Stereo- Epipolar Geometry
Simple Two-View Stereo
 Assume coplanar image planes, and thus parallel principal axes
 Assume the same focal length for each camera

 Where XL – XR = the disparity, d


 Note that T, the distance between cameras, is proportional to d, as we can arrange to get d
=f*T/Z
 This means that as the distance between cameras gets bigger, there will be a bigger
disparity- however, this comes with a tradeoff, as the further apart the two cameras are the
less of the scene they will be able to see in common (i.e. the less overlap in their views),
which reduces the amount of scene structure that can be derived from our two images
 Note that for this to work XR and XL must correspond- that is, the are both the image plane
representations of the same 3-D point in the scene
 If this is not the case, and we get a mismatch, then we reconstruct an incorrect 3-D point
from our 2-D points

 Above, if we try to reconstruct point P with pl and incorrectly matched point qr, we end up
with incorrect point P’ instead- this will give an incorrect depth for our reconstructed points

More General Two-View Stereo


 Less constraints- only require overlapping field of view
 This model also allows us to model multiple pictures taken by a moving single camera as
multiple stereo images

 We have two coordinate systems, one with the left camera as origin and the other with the
right
 We describe a point P by PL, a vector in the left coordinate system, and the distance
between the cameras is given by the vector T in the left. coord. System
 PR in the right coord system gives the vector from the right camera to point P
 In the left coordinate system PR = -T + PL = PL – T due to how vectors work, but to define
w.r.t. to the right coordinate system we need to account for the different 3-D orientation of
the right camera relative to the left coordinate system
 The transpose of a rotation matrix is equivalent its inverse
 To use the left coord system to reach P from the right camera, we would need to perform
the inverse of the rotation that turns the left camera to match the right, and then follow the
vector
 Thus if we define a rotation matrix RT for the transposition of R as the rotation necessary to
transform orientation of left camera to match right, we can apply the inverse by simply
multiplying by R (as that’s the inverse of RT)
 Therefore we get PR in the right coordinate system equalling R(PL – T)
Epipolar Lines

 Obviously, we cannot know the depth of a 3-D point just from one image; from where a ray
crosses the image plane we get PL, and from that we know that the corresponding point in
3-D space must lie somewhere along that line PL
 Looking now at PR, we know that the 3-D point we’re looking for must correspond to one of
the image points such that the vector PR for that point intersects PL- this intersection is the
3-D point
 The points on the right image plane that give a PR that intersects PL lie along what is called
an epipolar line- this line is given by projecting PL onto the right image plane

 Note we can then pick one of these prospective vectors PR, and find points in the left image
plane that intersect with that line, as shown above- this then gives an epipolar line in the
left coordinate system; this epipolar line would be the same regardless of which PR along
the right epipolar line we chose
 All of these rays and lines are clearly coplanar- that is, they lie on the same plane, which is
called the epipolar plane
 Epipolar lines are only parallel when principal axes are parallel (Z axis) and image planes lie
in the same plane; IT IS TRUE THAT EPIPOLAR LINES ARE PARALLEL WHEN THE IMAGE
PLANES OF THE CAMERAS ARE IN THE SAME PLANE
 All epipolar lines in one image meet at a single point which may be at infinity
 The line joining the two centres of projection is in all epipolar planes
 An epipolar line is the projection of a 3-d ray emanating from the COP of the other camera
 Fundamental matrix depends on focal length
 Essential matrix depends on relative rotation
 There are many epipolar planes- in general, one for each pair of corresponding points
 An epipolar plane does cut each image along an epipolar line

Epipolar Geometry


 The cross product of two vectors gives a vector which is perpendicular to both
 Since T and PL both lie on the epipolar plane, their cross product must be perpendicular to
the epipolar plane
 PL – T, which is the value of PR in the left coordinate system, lies on the epipolar plane
 Therefore (PL – T).(T cross product PL) gives a dot product of 0, as the two vectors are
perpendicular/orthogonal

 All pairs PLN and PRN must satisfy that constraint equation
 The points where each line PRN crosses the right image plane gives the epipolar line
 To simplify the notation, we define the essential matrix as E = RS

 Because the right hand side of the equation is 0, we can simply eliminate the Z/f terms, to

get our final equation


 Note that when PL and E are known, this gives a linear relationship between components of
PR- i.e., it gives a line in the right image, that is the epipolar line
Stereo- Reconstruction

 Proof that the equation gives the equation epipolar line in right image
 PR has components x, y and f because it’s the vector from the camera to the image plane
the camera is focal length f away from plane for all points
Image Points and Pixels

 Converting between image plane coordinates and pixel coordinates needed (as obviously
pixel coordinates start with (0, 0) in the top left while image plane coordinates do not)
 Principal axis of the camera (z-axis) cuts image plane at (0, 0, f) in image plane coordinates-
let us define the point where the axis cuts the plane as the principal point (written above as
o hat x and o hat y)
 We can then convert a pixel p from pixel coordinates to image plane coordinates by
subtracting the pixel coordinates of the principal point from the pixel coordinates of p and
then multiplying by the dimensions of a pixel (given here by sx * sy)

 This gives us the same equation as before but with x and y in pixel coordinate form, where
the matrix ML is just implementing the conversion equations shown on the left

 This is the same equation defined in terms of pixel coordinates, which lets us apply it directly
to real images (assuming we know the characteristics needed to define the fundamental
matrix)

Estimating the fundamental matrix from correspondences


3-D reconstruction

 As we do not have perfect accuracy for the mapping of an image point to a 3-D point (due to
noise and calibration errors), we have a problem- the projected rays PR and PL will be
slightly off and not cross
 We want to find the point where the difference between them is closest, and estimate our
3-D point as being there
 This is the point where the distance vector between is perpendicular to both of them-
ideally, we’d estimate our 3-D point at the middle of this difference vector

 The first part of the final equation, is PL – PR (both in


left coordinate system; we simply do the inverse of the transformation to convert left
camera coordinates to right)
 The second part is the cross product of PL and PR (again both in left coordinate system; we
do not have the +T term in PR this time because we want the vector to be parallel to the
right ray, i.e. in the direction of PR just defined in the left coordinate system, and adding T
would change the direction)
 As the cross product is perpendicular to PL and PR, it must be in the same direction as the
difference vector d from the first part of the equation
 We need to solve for a and b, the scaling factor of PL and PR respectively, and c, a scalar
 If the difference vector is not perpendicular to PL and PR, then no c exists that will make the
equation = 0; therefore, solving this equation ensures we have a difference vector parallel to
both rays
 Solving this gives us the scaling factors of PL and PR, a and b respectively
Solving 3-D reconstruction

 Bringing T over to the RHS, we have four 3x1 matrices equated- essentially, we have 3
unknowns and 3 different equations in terms of those unknowns, so we can solve
 Converting to matrix form, we get:

 Where H is simply a 3x3 matrix combining the three LHS 3x1 matrices- can think of matrix x
vector multiplication as a weighted sum of the columns, where the vector gives the weights

 So we then can simply evaluate with our values of a and b to find the endpoints of the
difference vector, and then take the midpoint between them as our 3-D point estimate
Stereo- Correspondence Matching
 If we have a 2-D point in one image plane and we know the camera geometry (translation
and rotation between cameras) we can find the epipolar line the corresponding point in the
other image plane must lie on
 If we have a corresponding pair in the image planes and know the camera geometry we can
estimate the 3-D point

Region Based Methods


 Compare pixel values within regions in two views; for region in left image, compute similarity
with same size regions in right image
 Corresponding point is the centre of the most similar region in the right image

 Regions need to be large enough to be distinguishable from other regions (e.g., there will be
likely be many cases of, at the extreme low end, one pixel of a given colour throughout the
image)
 However if the region is too large, its likely to contain a depth discontinuity
 Also, camera perspective can lead to the same 3-D area projecting to a larger 2-D region in
one view than the other (e.g. if one camera is significantly closer to the object), violating our
model assumptions
 Region based matching is quick and simple to implement, but obviously flawed as described
above

 Disparity d is essentially the offset between a point in the left image and the corresponding
point in the right
 For all pixels in the chosen size region, we compute a similarity measure between the
corresponding pixel in the left region and that pixel + disparity measure in the right- these
similarity measures are summed to give cost c
 We seek to choose d such that c is minimized- that is, choose the corresponding pixel in the
right image such the region surrounding is most similar to the region surrounding the pixel in
the left image

 Remember, the value of a pixel u or v is its colour value


 Note that with normalised cross correlation, a perfect match has a similarity value of 1, and
a perfect mismatch has value -1

Feature-Based Methods
 Restrict search to sparse set of features
 Find salient (distinct) points in each view, like corners
 Match points by comparing pixels or image descriptors in local regions about each point
 This leads to a sparse 3-D reconstruction, i.e. a sparse point cloud, which needs further
processing if we want to fill in the gaps
 An example of the SIFT feature based method follows

Scale-Invariant Feature Transform (SIFT)


 Feature-based method


 Repeatedly Gaussian blur the image, and difference the adjacent images to build a
Difference of Gaussian tree
 This leads to highlighted edges and distinct regions showing in the DoG tree
 Key points are extrema (points which have the maximum or minimum value in a
neighbourhood both on the same and adjacent levels of the tree [as shown in the image]),
and are usually present in both views, possibly at different levels- this leads to the scale
invariance property of the method
 If the images are imaged at a different scale, the same point will appear on different levels in
the two image’s DoG trees
 SIFT only computes descriptors for salient points
 SIFT minimises descriptor change with viewing angle (though there will probably be some)
 The descriptor is based on histograms of gradient angles in 4x4 blocks

Where to look for corresponding points in a calibrated setup


 As previously shown, for a calibrated stereo set up, corresponding points lie on epipolar
lines
 As such, given in a point in the left image, we can limit our search for the corresponding
point in the right image to a band along the epipolar line (we use a band because the
epipolar line is an approximation due to noise and such, we assume we’re slightly off)
 This increases speed of matching and reduces mismatches

Where to look for corresponding in an UNcalibrated setup


 When geometry is unknown, we have to match points using only pixel values, with a region
or feature based approach
 Often leads to mismatches among true matches- outliers are mismatches, inliers are true
matches

 We know that inliers will be related by epipolar constraint


 We don’t know F, the fundamental matrix, as we don’t know camera geometry, but we can
estimate it given enough corresponding points
 We essentially match points using pixel values, use those corresponding pairs to compute an
estimate for F, then assess support for F amongst other correspondence (by checking how
many matches obey the constraint for that given F), then repeat until best F is found

 So, we first find possible matches with pixel values (region/feature based methods)
 Then we select a random subset of these matches, and use them to compute F
 For each matching pair, we then use our approximated F to find the epipolar line the point in
the right image should lie along; if it is within a certain threshold of the line, we classify it as
an inlier, otherwise it is an outlier
 The ratio of inliers vs. outliers gives us the support for this value of F
 We then repeat the process many times, selecting a new subset of possible matches to use
to compute F, and choose the F value with the most support at the end
 With the best F selected, we discard our outliers and recompute a final F using only the
inliers

You might also like