Image Processing and Computer Vision Notes
Image Processing and Computer Vision Notes
The Dirac Delta-Function is useful because of the sifting property
This shows that when we sample a function with the Dirac Delta-function (by taking the
integral of the function times the Dirac Delta-function), we are sampling it at 0
If we instead shift the DDF by an amount alpha, and then do the same thing, we instead
sample our function at f(alpha); we have sifted out the value of function f at alpha
This lets us step through a signal and sample it
This sifting property can be used to express a 2D ‘image function’ (i.e., an image) as a linear
combination of 2D Dirac pulses located at points (a, b) that cover the whole image plane
This is just the DDF in two dimensions
The point spread function is defined as the response of an imaging system to a point source
The superposition principle states that an image is the sum of the PSF of all its points
Thus all images are a ‘blurry’ representation of the real world, no matter how sharp they
seem to our eyes
If the PSF, h(x, y) was more like a point, we would get a less blurry outcome image
Quantization
Quantization is the process of mapping infinite continuous values with a smaller set of
discrete values
Quantization level is how many bits are used to represent a given point
Spatial Sampling
The sparser the sampling, the more aliased an image is
Anti-aliasing can be achieved by removing all spatial frequencies above a critical limit (the
Shannon-Nyquist limit)
Filtering Images
Image Transforms
Image transforms are a derivation of a new representation of the input data (e.g. by re-
encoding it in another parameter space)
If all or some of the pixels in the image are simply changed to a new set of pixels, then the
transformation has occurred in the space of the image, and is said to have occurred in the
spatial domain
If the transform causes the pixel to have new representation (e.g., by Fourier transform),
the image is said to be transformed into the frequency or Fourier domain
Image transforms are classic processing techniques, used in filtering, compression, feature
extraction, etc.
Convolution
Convolution is a mathematical operation on two functions (f and g) that produces a third
function (h = f * g) representing how the shape of one is modified by the other
Here f is our signal and g is called our kernel
Intuitively, we reverse the function g, then step the function g through time, and at each
time point look at the overlap between f(x) and g(x)- this gives the area under the curve
h(x)
Thus the curve h(x) is given by integrating this area
Auto-correlation is when a function is convolved with itself
Discrete 2D convolution, as would be used in image processing, is defined as:
2D convolution has a flipping effect, from left-to-right and top-to-bottom
Replacing the minus signs in the 2D convolution definition gives us the discrete definition of
2D correlation; this would have no flipping at all
This effectively superimposes the kernel directly on the signal
Correlation and convolution are equivalent when the kernel is symmetric under 180-degree
rotation (e.g. 1, 2, 1)
The value of sigma determines the size of the Gaussian kernel- the higher sigma is, the more
smoothing occurs
Filtering with a 2D Gaussian can be computationally expensive
Same result can be achieved quicker by using two 1D Gaussian filters
This works because the linear Gaussian kernel is separable
That is:
It can be decomposed into row and column vectors
Separability means that a 2D convolution can be reduced to the product of two 1D
convolutions- one on the rows, one on the columns
As we can see, the median filter does a much better job removing salt and pepper noise
than a standard averaging filter, as the median filter eliminates extremes
We can obtain the detail of an image by subtracting a smoothed image from the original-
then adding this detail to the original gives us a sharpened image
Fourier’s Theorem
States that a signal can be broken into sines and cosines, each of which is weighted
differently and exists at a different frequency
The sines and cosines are the basis functions of the representation and An and Bn are the
Fourier coefficients
The sinusoids are harmonically related: each one has a frequency which is an integer
multiple of the fundamental frequency of the input signal
A Fourier transform then essentially breaks down a complex signal into its constituent sines
and cosines
Frequency in images
Frequency in an image is defined as the rate of change of intensity
If similar pixels are grouped together, rate of change is low -> low frequency; if pixels change
drastically rapidly, like in an image of black and white stripes, rate of change is high -> high
frequency
Images in general include a range of frequencies- an image might have high and low
frequency regions
The Fourier transform is essentially a projection onto that exponential function
We take a given u, v, sum the projections of all pixels in the image, and then repeat this for
all u, v in the image
u and v have the same domain as the image
When u and v both equal zero, cos(0) – sin(0) = 1, which gives us the slowest varying
frequency component, which is the average image graylevel
Conjugate Symmetry
When we compute the Fourier space for an image, if we were to plot this we would have u,
v = 0 at the top left
We want it in the middle, for intuitive reading
We thus shift the space by replacing each quadrant with its diagonal opposite- this diagonal
symmetry is known as conjugate symmetry
That is:
Here the vertical line in the magnitude/power shows that the most frequency change occurs
moving vertically through the image (i.e., it represents moving between the horizontal rows
of train cars) while the horizontal line in the magnitude shows a weaker but still substantial
frequency change moving horizontally through the image, (i.e., it represents moving
between the vertically defined individual train cars)
Here we remove the peaks in the Fourier space before performing an inverse Fourier
transform- this cleans up the lines of interference
As we can see, phase encodes structure and image positions, while magnitude is simply a
measure of the intensity of points in the structure, that is, it encodes contrast
This can be useful when doing things like convolving large matrices- we can simply transform
them into the Fourier space, multiply them, and then do an inverse Fourier transform, which
is much cheaper computationally
Filters
High pass filters allow high frequencies to pass through, removing contrast and preserving
detail, while low pass filter allow low freqs. Through, preserving contrast and removing
detail
Edge Detection
Edges highlight the contours of objects
Edges are sharp changes of image brightness
Nuisance edges are edges which are not helpful in recognizing or identifying a shape
We can detect edges by looking for a measure of change in a pixel’s neighbourhood
We can look at derivative of the image to identify edges by looking for where the derivate
(and thus rate of change) is greatest
Gradient
The gradient is a vector in the direction of the maximum growth of the function (i.e.,
greatest change)
The edge vector is perpendicular to the gradient vector, and is in the direction of the edge
itself
We calculate the derivative of the image as a discrete approximation of the actual
derivative (as our image is itself only a discrete approximation of a continuous function)
We simply look at the differences of the pixels from left-to-right and top-to-bottom; we
subtract the pixel to the left from the one to the right, and so on
We can do this quickly with convolution, by applying certain kernels to the image
Calculating the derivative simply with central difference (that is, considering only pixels
immediately to the left and right or top and bottom of a pixel) is very sensitive to noise, and
so edges can be lost; the above convolution method considers a larger neighbourhood and
is less sensitive to noise
The Prewitt Operator kernel is the 3x3 kernel discussed above
By contrast, the Sobel operator uses a similar idea but gives greater weight to the central
pixels; approximately, this is the derivate of an image that has been Gaussian smoothed
Image processing is not image detection- this is just highlighting edges, not recognizing
shapes
Shape Detection
Hough Space
A straight line in 2D space can be described the parametric equation:
As such, all straight lines in 2D space can be described by rho and theta- together, they form
the Hough Space of lines
If we plot all the lines passing through a given point in the image space in the Hough space,
we get a sinusoidal curve; each point (theta, rho) on the curve represents a straight line
passing through the given point in the image space
To represent an image in the Hough space, we look at every pixel in the image and consider
all the lines passing through it in the image space; then, for each line in the image space, we
sum the intensities of all pixels intersecting that line in the image, and plot this summed
intensity in the Hough space
Peaks in the Hough space correspond to lines in the image space
Extending this algorithm to say, a circle, would require a 3D Hough space- you would need x,
y coordinates for the centre of the circle and a radius. This approach becomes intractable for
more complex shapes, so we instead use the Generalized Hough Transform
Image Segmentation
Process of spatially splitting an image into regions of pixels
Perfect segmentation is difficult- pixels may straddle boundaries between objects, noise,
illumination, etc.
Over-segmentation- pixels belong to the same region erroneously classified into multiple
regions
Under-segmentation- pixels belong to different regions classified as same region
Thresholding
All pixels over a certain brightness threshold are classified as a region of interest, below is
background
Split and merge tends to give blocky results, due to just splitting image into fours
Can use adaptive homogeneity functions that vary based on region size
Clustering
Cluster pixels by colour and group similar coloured pixels as a region
Can use K-Means Clustering
Object Detection
Bridges the semantic gap between given pixel values and meaningful objects
Image regions need to be found and assigned semantic labels from a space of object classes
Problems with classical shape detection (e.g. edge detection, colour histograms):
o High intra-class variance
o Low inter-class variance
o See first lecture for more
Features
Haar-like features: define regions and some operation, such that an object is likely to be
present given certain outcomes of the operation
Integral Images
To avoid unnecessary duplicate calculations when considering features, we use integral
images
Each pixel’s value is set to the value of all the pixels above and to the left of it
This makes window proposals and feature calculations much quicker
Can calculate pixel values for specific regions and subregions by subtracting the ‘corners’ of
the region:
To find the corners for an arbitrary region shape, we give all pixels in the region a value of 1,
and 0 to those outside the region; we then differentiate with respect to x, and then
differentiate that result with respect to y; this gives us the corners of the image- we simply
add the values in spots where the 1s appear and subtract those where -1s appear
Feature Selection
Could hand-select, but difficult and time-consuming
We want to use training data to allow system to choose good features
Adaboost
Adaboost trains a classifier on every single feature, then selects the classifier with the lowest
training error
It then increases the weight of misclassified samples, and repeats the above step
The final classifier is a linear combination of all the individual classifiers
Motion Modelling
When an object or camera in a scene moves, the distance points on the object move in the
image between frames is known as the incremental 2-D motion, or the 2-D velocity in the
image plane
When an object moves from point P to point Q, the point where it crosses the image plane
changes as well, from p to q. The distance pq is the 2-D motion, which can also be thought
of as a projection of the 3-D motion PQ into 2-D
3-D motion does not uniquely map to 2-D motion, as can be seen below; p’ will the be the
point on the image plane for any 3-D point along the red ray from p’ to P’
Perspective Projection Equations
Thus as Z, or distance from the camera increases, objects appears smaller, due to the
inverse relationship defined above
Meanwhile focal length f has a positive relationship with image size
Above are the fundamental rotation matrices
Not necessary for above rigid body motion through a small theta, but useful later
3-D Motion Field
The only thing that changes between points on an object is the values of small x, y and the
value of big Z, the distance from the camera
When there is only rotation, the depth of the scene is irrelevant, as the translational term is
the only part of the equation that uses Z and is 0 when there is no translation
When there is only translation, the angular velocity is obviously irrelevant
The above shows that for translation, the same 3-D motion field does not necessarily
correspond to the same 2-D motion field, as depth is a factor
Conversely, when there is only rotation, depth is not a factor and so the same 3-D motion
field produces the same 2-D motion field
IMPORTANT NOTE: this derivation is for the theoretical 2-D motion field produced by
projection 3-D motion vectors into the image plane; it is NOT necessarily what we perceive
in a video sequence produced by the changing colours of pixels
Optical Flow
Motion Estimation
Estimation of 2-D motion field from frames in an image sequence
Relationship between variation in pixel values- apparent motion or optical flow- and true
motion is not straightforward
Optical flow- perceived motion in video sequence caused by changes in pixel values
Extreme examples of cases where optical flow is very different to true 2-D motion field:
o Non-zero apparent motion for zero motion field, e.g. static scene, moving light
source
o Zero apparent motion for non-zero motion field, e.g. a sphere of constant colour
rotating in diffuse lighting
Sometimes not possible to determine 2-D motion flow without additional assumptions
Assume optical flow results from brightness constancy constraint- this says that a moving
pixel retains its value between frames
Alternatively phrased: two points in adjacent frames corresponding to the same 3-D point
have the same value (i.e. colour)
Higher order terms in Taylor expansion tend to 0 for tiny change in x, y, t, so in order for the
From which we can derive the equations for the magnitude and direction of un:
Thus at any given point the OFE only allows us to estimate the normal flow- without further
information, we cannot calculate the magnitude of the motion parallel to the edge feature,
and so we do not know the full flow vector
As shown above, we can find vn, the normal flow, which is the component of optical flow
perpendicular to the edge feature, but not the component parallel to it- thus the point could
now lie anywhere along the green dotted line
This works because if our assumption holds (that adjacent flow vectors are the same) then
then the OFE should tend to 0- thus the sum of squares of the OFEs throughout the region
also tends to 0
Therefore, the best estimate of OF will be the one that minimizes the value of the OFE
throughout the region (as a perfect estimate would sum to 0 throughout the region)
We can take the derivate w.r.t. to ux and uy, set them to zero and then solve for ux and uy+
Can fail, e.g. if A has no inverse
Estimates the full flow (i.e. not just normal) assuming there is sufficient variation in the
spatial gradient
It minimises the deviation from the optical flow equation within the region
It assumes there is only one motion in the region
It uses gradients to minimise the OFE
So we can simply solve for the flow vector u by finding the inverse of A and multiplying it by
b
This method can fail if A has no inverse
Need a way to estimate Ix, Iy and It (the spatial and temporal gradients respectively) using
differences
Recall that Ix and Iy are the rate of change of pixel values in the x and y directions
If we assume the distance between pixels and frames is 1, then we can just find the
difference between pixel values separate by 1 in the relevant direction
We average the difference between multiple values to reduce noise
So, for example, we find the difference between the value of a pixel and its immediate right
neighbour at the same time/in the same frame, repeat multiple times and average the diff.-
this gives an approximation of Ix
Can do the same thing for y or time
In the slide above, and in Lucas & Kanade alg., we’re taking values from different frames
(two from each in the above example)
Note however that in this case the position of the estimated gradient is between the pixels
along the x direction and between the frames along the t direction- this is fine, as long as the
other gradients in y and t are also computed in a similar way to be consistent
So, in essence, we take two adjacent frames, then for each pixel in our region we compute
the temporal and spatial gradients between it and its immediate neighbours, and use that to
iteratively add to A and b; finally, we compute the flow vector u with the inverse of our final
A and our final b
Recall edge detection- Ix shows where biggest changes in pixel value in the x direction,
similar for y
It shows where biggest changes in pixel value lie in the time direction. Therefore, we see the
biggest values where motion brings a very different coloured point to the space, like with
the moving ball, train and calendar
ux and uy, the optical flow, have greatest magnitude (the darker sections) where there is
most motion in the relevant direction- the ball and train move most horizontally, as shown
by ux, while the calendar moves most vertically, as shown by uy
Frame Difference
Obviously, we don’t know our ground truth (the true motion field), so we can’t compare it
to our estimated optical flow
We can however use our estimated flow vectors to move one frame towards the other and
look at the difference- in other words, apply the motion to one frame and check whether
we get the other frame
This is motion compensated differencing
The left image shows the difference in pixel values between frames, the right shows the
difference when the earlier frame is motion compensated and then compared with the latter
one
If the optical flow approximation was exact, the right image would be all black, as the
difference would be 0 everywhere- this obviously isn’t the case
However, we can see that the difference is significantly smaller in the right image, so our
approximation is doing a pretty good job
Multiresolution :
Affine Motion Model
Horn-Schunk Algorithm
Motion Segmentation
Challenges:
o Need to know what a ‘homogenous motion’ region is, requires defining a 2-D motion
field model
o Difficult to determine ‘correct’ boundaries
o Generalised aperture problem- large regions needed for good motion estimate, but
large regions morel likely to contain motion boundaries
o Usual segmentation challenges, occlusions, temporal variation, objects moving in
and out of view, etc.
2.5D Layered Representation
Video is represented as superpositioned layer of different depths
We first estimate the motion of each pixel, such as by Lucas and Kanade
The image is then split into grid blocks and each block is fitted to an Affine motion model
K-means clustering is then used to find the dominant motions (each dominant motion will be
a centroid in K-means)
Each pixel is then assigned to a dominant motion region based on which region most closely
matches the motion estimated generated for it in step 1
Each dominant motion region becomes a separate layer, and we finally fit a final affine
motion field for each layer
Affine Motion
When an object is sufficiently far from the camera, we can approximate it as a planar region
This lets us define a quadratic motion field
We can then ignore the 2nd order terms, so we got our affine motion model
Segmentation
The black arrow is the L&K estimate for a specific pixel from the first step of the alg., and the
yellow green and white are the motion fields generated by using the affine parameters from
each motion region- we pick the closest one to the L&K estimate (in this case yellow) and
assign the pixel to that region
We propagate motion forward based on the motion estimates calculated for each region,
which let us ‘link’ motion regions between frames
We can then ‘overlap’ regions across multiple frames, which lets us extract colour, velocity
and alpha maps
Stereo
Stereo vision- 3-D from two images taken from different viewpoints
Objects appear in different positions in each viewpoint- this is called parallax
Position of object in each image is dependent on depth- magnitude of position difference
(disparity) is in inversely proportional to depth
That is, disparity is greater between views the closer an object is to the cameras
Thus is if we know disparity, we can estimate scene structure from two images
We have two coordinate systems, one with the left camera as origin and the other with the
right
We describe a point P by PL, a vector in the left coordinate system, and the distance
between the cameras is given by the vector T in the left. coord. System
PR in the right coord system gives the vector from the right camera to point P
In the left coordinate system PR = -T + PL = PL – T due to how vectors work, but to define
w.r.t. to the right coordinate system we need to account for the different 3-D orientation of
the right camera relative to the left coordinate system
The transpose of a rotation matrix is equivalent its inverse
To use the left coord system to reach P from the right camera, we would need to perform
the inverse of the rotation that turns the left camera to match the right, and then follow the
vector
Thus if we define a rotation matrix RT for the transposition of R as the rotation necessary to
transform orientation of left camera to match right, we can apply the inverse by simply
multiplying by R (as that’s the inverse of RT)
Therefore we get PR in the right coordinate system equalling R(PL – T)
Epipolar Lines
Obviously, we cannot know the depth of a 3-D point just from one image; from where a ray
crosses the image plane we get PL, and from that we know that the corresponding point in
3-D space must lie somewhere along that line PL
Looking now at PR, we know that the 3-D point we’re looking for must correspond to one of
the image points such that the vector PR for that point intersects PL- this intersection is the
3-D point
The points on the right image plane that give a PR that intersects PL lie along what is called
an epipolar line- this line is given by projecting PL onto the right image plane
Note we can then pick one of these prospective vectors PR, and find points in the left image
plane that intersect with that line, as shown above- this then gives an epipolar line in the
left coordinate system; this epipolar line would be the same regardless of which PR along
the right epipolar line we chose
All of these rays and lines are clearly coplanar- that is, they lie on the same plane, which is
called the epipolar plane
Epipolar lines are only parallel when principal axes are parallel (Z axis) and image planes lie
in the same plane; IT IS TRUE THAT EPIPOLAR LINES ARE PARALLEL WHEN THE IMAGE
PLANES OF THE CAMERAS ARE IN THE SAME PLANE
All epipolar lines in one image meet at a single point which may be at infinity
The line joining the two centres of projection is in all epipolar planes
An epipolar line is the projection of a 3-d ray emanating from the COP of the other camera
Fundamental matrix depends on focal length
Essential matrix depends on relative rotation
There are many epipolar planes- in general, one for each pair of corresponding points
An epipolar plane does cut each image along an epipolar line
Epipolar Geometry
The cross product of two vectors gives a vector which is perpendicular to both
Since T and PL both lie on the epipolar plane, their cross product must be perpendicular to
the epipolar plane
PL – T, which is the value of PR in the left coordinate system, lies on the epipolar plane
Therefore (PL – T).(T cross product PL) gives a dot product of 0, as the two vectors are
perpendicular/orthogonal
All pairs PLN and PRN must satisfy that constraint equation
The points where each line PRN crosses the right image plane gives the epipolar line
To simplify the notation, we define the essential matrix as E = RS
Because the right hand side of the equation is 0, we can simply eliminate the Z/f terms, to
Proof that the equation gives the equation epipolar line in right image
PR has components x, y and f because it’s the vector from the camera to the image plane
the camera is focal length f away from plane for all points
Image Points and Pixels
Converting between image plane coordinates and pixel coordinates needed (as obviously
pixel coordinates start with (0, 0) in the top left while image plane coordinates do not)
Principal axis of the camera (z-axis) cuts image plane at (0, 0, f) in image plane coordinates-
let us define the point where the axis cuts the plane as the principal point (written above as
o hat x and o hat y)
We can then convert a pixel p from pixel coordinates to image plane coordinates by
subtracting the pixel coordinates of the principal point from the pixel coordinates of p and
then multiplying by the dimensions of a pixel (given here by sx * sy)
This gives us the same equation as before but with x and y in pixel coordinate form, where
the matrix ML is just implementing the conversion equations shown on the left
This is the same equation defined in terms of pixel coordinates, which lets us apply it directly
to real images (assuming we know the characteristics needed to define the fundamental
matrix)
3-D reconstruction
As we do not have perfect accuracy for the mapping of an image point to a 3-D point (due to
noise and calibration errors), we have a problem- the projected rays PR and PL will be
slightly off and not cross
We want to find the point where the difference between them is closest, and estimate our
3-D point as being there
This is the point where the distance vector between is perpendicular to both of them-
ideally, we’d estimate our 3-D point at the middle of this difference vector
Bringing T over to the RHS, we have four 3x1 matrices equated- essentially, we have 3
unknowns and 3 different equations in terms of those unknowns, so we can solve
Converting to matrix form, we get:
Where H is simply a 3x3 matrix combining the three LHS 3x1 matrices- can think of matrix x
vector multiplication as a weighted sum of the columns, where the vector gives the weights
So we then can simply evaluate with our values of a and b to find the endpoints of the
difference vector, and then take the midpoint between them as our 3-D point estimate
Stereo- Correspondence Matching
If we have a 2-D point in one image plane and we know the camera geometry (translation
and rotation between cameras) we can find the epipolar line the corresponding point in the
other image plane must lie on
If we have a corresponding pair in the image planes and know the camera geometry we can
estimate the 3-D point
Regions need to be large enough to be distinguishable from other regions (e.g., there will be
likely be many cases of, at the extreme low end, one pixel of a given colour throughout the
image)
However if the region is too large, its likely to contain a depth discontinuity
Also, camera perspective can lead to the same 3-D area projecting to a larger 2-D region in
one view than the other (e.g. if one camera is significantly closer to the object), violating our
model assumptions
Region based matching is quick and simple to implement, but obviously flawed as described
above
Disparity d is essentially the offset between a point in the left image and the corresponding
point in the right
For all pixels in the chosen size region, we compute a similarity measure between the
corresponding pixel in the left region and that pixel + disparity measure in the right- these
similarity measures are summed to give cost c
We seek to choose d such that c is minimized- that is, choose the corresponding pixel in the
right image such the region surrounding is most similar to the region surrounding the pixel in
the left image
Feature-Based Methods
Restrict search to sparse set of features
Find salient (distinct) points in each view, like corners
Match points by comparing pixels or image descriptors in local regions about each point
This leads to a sparse 3-D reconstruction, i.e. a sparse point cloud, which needs further
processing if we want to fill in the gaps
An example of the SIFT feature based method follows
Repeatedly Gaussian blur the image, and difference the adjacent images to build a
Difference of Gaussian tree
This leads to highlighted edges and distinct regions showing in the DoG tree
Key points are extrema (points which have the maximum or minimum value in a
neighbourhood both on the same and adjacent levels of the tree [as shown in the image]),
and are usually present in both views, possibly at different levels- this leads to the scale
invariance property of the method
If the images are imaged at a different scale, the same point will appear on different levels in
the two image’s DoG trees
SIFT only computes descriptors for salient points
SIFT minimises descriptor change with viewing angle (though there will probably be some)
The descriptor is based on histograms of gradient angles in 4x4 blocks
So, we first find possible matches with pixel values (region/feature based methods)
Then we select a random subset of these matches, and use them to compute F
For each matching pair, we then use our approximated F to find the epipolar line the point in
the right image should lie along; if it is within a certain threshold of the line, we classify it as
an inlier, otherwise it is an outlier
The ratio of inliers vs. outliers gives us the support for this value of F
We then repeat the process many times, selecting a new subset of possible matches to use
to compute F, and choose the F value with the most support at the end
With the best F selected, we discard our outliers and recompute a final F using only the
inliers