Pothole Detection Using FPGA
Indian Roads
by
Shonil Vijay
Roll No: 05329001
July 2006
Indian Institute of Technology, Bombay
Abstract
The objective of this project is to develop a low-cost, vision-based driver assistance system on an FPGA that provides a solution for the detection and avoidance of potholes in the path of a vehicle on the road. The project uses an FPGA platform to deploy image processing algorithms efficiently so that the output can be produced in real time.
Pothole avoidance may be considered similar to other obstacle avoidance, except that potholes are depressions rather than extrusions from a surface. A vision approach is used since potholes are visually distinct from the background surface.
The detection of defects in road surfaces is necessary for locating a pothole. The system as a whole will be far more efficient than common methods such as manual inspection, and almost as accurate.
Contents
1. Introduction
   1.1 Motivation
   1.2 Problem Statement
2. Design Choices
   2.1 PC Digital Signal Processing Programs
   2.2 Application Specific Integrated Circuits
   2.3 Dedicated Digital Signal Processors
   2.4 Field Programmable Gate Arrays
6. References
1. Introduction
1.1 Motivation
Interest in Collision Avoidance Systems comes from the problems caused by traffic
congestion worldwide and a synergy of new information technologies for simulation,
real-time control and communications networks. Traffic congestion has been increasing
world-wide as a result of increased motorization, urbanization, population growth and
changes in population density. Congestion reduces utilization of the transportation
infrastructure and increases travel time, air pollution, fuel consumption and most
importantly traffic accidents.
Worldwide, around 1.2 million people died as a result of road traffic accidents in 2002, an average of 3,242 deaths each day. In addition to these deaths, around 50 million people globally are estimated to be injured or disabled every year. Projections indicate that these figures will increase by about 65% over the next 20 years. Road accidents are currently the world's eleventh leading cause of death, but by 2020 they are projected to become the third, behind deaths linked to heart disease and mental illness. In the United States alone, around 6.2 million traffic accidents occurred in 2004, accounting for 42,636 deaths and 2,788,000 nonfatal injuries. A total of 31,693 occupants of passenger cars were killed in traffic crashes and an additional 2,543,000 were injured, accounting for 85% of all fatalities and 95% of all injuries. In 2000, the economic cost of motor vehicle crashes was $230.6 billion.
The images also have to be processed further to obtain useful data from them. A number of ongoing research projects aim to extract objective traffic-condition indicators from video images, but the huge volume of data involved means that such systems require substantial computation power and, consequently, will not be low-cost devices for a considerable period of time. Laser-beam or radar-based systems add $1,500 to $3,000 to the cost of a car, and are therefore used in luxury cars only. Finally, GPS-based systems have very high operational costs, as they need to obtain the precise positions of the vehicles every few milliseconds.
Fig 1: Example defects. From left: cracking, pop-outs, wear and polishing, and potholes.
The different types of defects can be briefly described as:
• Cracking – areas where the road surface has split apart, classified as longitudinal, transverse, crocodile, or combination cracking.
• Pop-outs – areas where pressure from below has forced the surface of the road to be raised.
• Wear and polishing – areas of smooth black tar occurring on asphalt surfaces.
• Potholes – areas where there is an absence of asphalt, i.e. a hole. Potholes usually emerge from areas where there has been severe cracking.
Of the above defect types, the one of particular interest for the problem at hand is the pothole. Early detection of cracks in road surfaces allows maintenance to be performed before the cracks develop into more serious problems, such as potholes and pop-outs.
This research discusses a solution for the automatic detection and avoidance of such
potholes in the path of an autonomous vehicle operating in an unstructured environment.
Pothole avoidance may be considered similar to other obstacle avoidance except that the
potholes are depressions rather than extrusions from a surface.
A vision approach is used since potholes are visually quite different from the background surface. Vision-based research can be categorized into three classes according to the range of operating wavelength: the visible spectrum region (0.38–0.8 µm), the Near Infra-Red (NIR) region (0.8–1.2 µm), and the Far Infra-Red (FIR) region (7–14 µm).
However, humans cannot perceive wavelengths outside the visible spectrum; NIR and FIR images can only be seen through sensors capable of detecting energy radiated in those bands of the electromagnetic spectrum. For cost and performance reasons, a CCD camera operating in the visible spectrum band is taken as the sensing device for this system.
Recently, Field Programmable Gate Array (FPGA) technology has become a viable
target for the implementation of algorithms suited to video image processing applications.
The unique architecture of the FPGA has allowed the technology to be used in many such
applications encompassing all aspects of video image processing.
As image sizes and bit depths grow larger, software has become less useful in the
video processing realm. Real-time systems such as those that are the target of this project
are required for the high speeds needed in processing video. DSP systems are being
employed to selectively reduce the amount of data to process, ensuring that only relevant
data is passed on to a human user. Eventually, it is expected that most video processing
can and will take place in DSP systems, with little human interaction. This is obviously
advantageous, since humans are perhaps not entirely accurate.
2. Design Choices
There are several different choices a designer has when implementing a DSP system
of any sort. Hardware, of course, offers much greater speed than a software
implementation, but one must consider the increase in development time inherent in
creating a hardware design. Most software designers are familiar with C, but in order to
develop a hardware system, one must either learn a hardware design language such as
VHDL or Verilog, or use a software-to-hardware conversion scheme, such as Streams-C,
which converts C code to VHDL, or MATCH, which converts MATLAB code to VHDL.
While the goals of such conversion schemes are admirable, they are still under development and not yet suited to high-speed applications such as video processing.
Ptolemy is a system that allows modeling, design, and simulation of embedded systems.
Ptolemy provides software synthesis from models. While this type of system may be a
dominant design platform in the future, it is still under much development, meaning that
it may not be a viable design choice for some time. A discussion on the various viable
options for DSP system design is found below.
MATLAB is one such environment. Although it was created for manipulating matrices in general, it is well suited to some image processing applications. MATLAB
treats an image as a matrix, allowing a designer to develop optimized matrix operations
implementing an algorithm. However, if the eventual goal is a hardware device, the
algorithms are instead often written to operate similarly to the proposed hardware system,
which results in an even slower algorithm.
One area where DSPs are particularly useful is the design of floating point systems.
On ASICs and FPGAs, floating-point operations are rather difficult to implement. For the
scope of this project, this is not an issue because all images consist of only integer data.
2.4 Field Programmable Gate Arrays
Field Programmable Gate Arrays (FPGAs) represent reconfigurable computing
technology, which is in some ways ideally suited for video processing. Reconfigurable
computers are processors which can be programmed with a design, and then
reprogrammed (or reconfigured) with virtually limitless designs as the designer’s needs
change. FPGAs generally consist of a system of logic blocks (usually look up tables and
flip-flops) and some amount of Random Access Memory (RAM), all wired together
using a vast array of interconnects. All of the logic in an FPGA can be rewired, or
reconfigured, with a different design as often as the designer likes. This type of
architecture allows a large variety of logic designs (dependent on the processor's resources), which can be interchanged for a new design as soon as the device can be reprogrammed.
The goal of this project is real-time (30 frames per second) processing of grayscale image data, a goal that an FPGA system using parallel algorithms should have little difficulty achieving.
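As a rough check on this figure, the required pixel throughput can be estimated as in the short sketch below. The 640x480 frame size and 8-bit depth are illustrative assumptions only; the camera format has not been fixed at this point in the report.

```python
# Back-of-the-envelope throughput estimate for the real-time goal.
# The 640x480 resolution and 8-bit depth are assumed values for illustration.

width, height = 640, 480      # assumed frame size in pixels
fps = 30                      # target frame rate from the project goal
bits_per_pixel = 8            # grayscale

pixels_per_second = width * height * fps
print(f"{pixels_per_second:,} pixels/s")                          # 9,216,000
print(f"{pixels_per_second * bits_per_pixel / 1e6:.1f} Mbit/s")   # ~73.7

# An FPGA pipeline that accepts one pixel per clock cycle therefore only needs
# a ~10 MHz pixel clock to keep up, whereas a software loop that touches each
# pixel several times approaches the limits of a 2006-era PC.
```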
Basic system analysis has been neglected in most obstacle detection papers, but it is necessary to examine both the detection requirements and the system requirements as a whole so that adequate trade-offs can be made to ensure good system performance. This chapter explores several basic methods used for on-road obstacle detection, including optical flow and stereo, and also briefly considers the possibility of other detection modalities.
Despite considerable effort, there has so far been no satisfactory solution to the problem of small static obstacle detection at highway speeds. Although there have been
good results published for vehicle detection, there have been few results reported on the
detection of small static obstacles. Moreover, these few results have generally been
reported in vague terms which give the reader little ability to compare methods. Papers
on cross-country obstacle detection systems are typically no better at reporting results in
a standardized way, although this is probably more excusable since rough terrain is
difficult to describe in an analytical fashion and an obstacle may be less well-defined in
cross-country applications.
These difficulties might make the on-road problem seem almost intractable given
that the off-road obstacle detection problem is still far from being solved. Fortunately,
however, roads are reasonably planar. Given the previous analysis, it should be quite clear that we need to exploit this fact. Assuming planarity of the road greatly improves our signal-to-noise ratio and simplifies a number of things.
We don’t need to build maps, estimate heights of objects, and check for rollover
conditions, etc. We only need to find areas which violate the planarity constraint and
avoid them.
For many of the computer vision methods I describe, there are both dense and
discrete methods. Dense methods provide some estimate (range, optical flow, etc.) at
every image pixel. Discrete methods first use some feature detection method to pick out
candidate potential obstacles and then attempt to calculate range, optical flow, or some
other measure in order to determine whether a feature (or group of features) actually
belongs to an obstacle.
In the near future, it is likely that only discrete methods will be able to achieve the
low processing latencies necessary for highway obstacle detection without special
purpose hardware. For this reason, this chapter focuses on discrete methods, although I
do not ignore the dense methods since these will become feasible as computers become
faster.
I call the process of deciding, given a set of image features or patches, which of them correspond to obstacles, obstacle determination; this chapter focuses on that problem. Because an image is a mapping of 3-D space to 2-D space, a single image cannot determine whether an image patch belongs to the ground plane or not (Fig 2). Additional
information, in the form of more images or certain assumptions or a priori knowledge
about potential obstacles, must be used to perform obstacle determination. If an image
feature or patch belongs to the ground plane, then it is not a threat. If it does not,
however, it should be considered an obstacle and avoided. I refer to the obstacle
determination problem throughout the chapter.
Fig 2: A single image cannot distinguish whether image patch C is caused by A (an obstacle), or B (a
ground plane feature, i.e. no obstacle). We call the problem of using additional information to disambiguate
the two obstacle determination.
If the point in the second image lies close to yobs, then it is likely that it corresponds to an obstacle; if it is closer to ygnd, it most likely corresponds to a ground-plane point. By examining the flow (in the y-direction in the illustrated case), i.e. the movement of y0 at t0 to a new point in the second image, the algorithm can in principle detect whether the point belongs to an obstacle or to the ground plane.
Fig 3: A point y0 on the image plane at time t0 can be mapped to either point P or Q in the world (or any
point in-between). Another image taken at time t1 after the camera has moved a distance d can be used to
disambiguate between P and Q depending on whether it is mapped to yobs or ygnd.
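The following is a minimal numeric sketch of this disambiguation using a 1-D pinhole model. Every number in it (focal length, camera height, obstacle size, distance, and the motion d) is an assumed value chosen only to illustrate how P and Q, which coincide in the first image, separate slightly in the second.

```python
# 1-D pinhole sketch of the disambiguation in Fig 3.
# All numeric values are illustrative assumptions, not values from the report.

f = 800.0     # focal length in pixels (assumed)
H = 1.5       # camera height above the road in metres (assumed)
h_p = 0.20    # height of the obstacle point P above the road (assumed)
Z_p = 40.0    # distance of P from the camera at time t0 (assumed)
d = 2.0       # distance the camera moves between t0 and t1 (assumed)

def image_row(point_height, distance):
    """Vertical image coordinate (pixels below the horizon) of a world point."""
    return f * (H - point_height) / distance

# Ground point Q that projects to the same pixel as P at time t0.
Z_q = H * Z_p / (H - h_p)

y0    = image_row(h_p, Z_p)        # identical for P and Q at t0
y_obs = image_row(h_p, Z_p - d)    # where P would appear at t1
y_gnd = image_row(0.0, Z_q - d)    # where Q would appear at t1

print(f"y0 = {y0:.2f}, y_obs = {y_obs:.2f}, y_gnd = {y_gnd:.2f}")
# With these numbers the two hypotheses differ by only ~0.2 pixel, which is
# why the text later calls optical flow too insensitive for small, distant
# obstacles.
```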
A number of optical flow and similar techniques have been proposed in the obstacle detection literature. Although some of these techniques have had success in
detecting large static obstacles, optical flow methods are best at detecting moving objects
which have motion fields significantly different from the background. Another vehicle,
for example, can be detected and tracked using optical flow techniques, even if it is
moving with low relative velocity with respect to our own vehicle because it is still
moving with respect to the background. This may be sensed by comparing the direction
of the calculated flow vector with the direction of the model flow vector. I briefly
describe several methods for calculating optical flow, and then illustrate why optical flow
is too insensitive to be effectively used in the basic static obstacle determination problem
for small obstacles at large distances.
Edge-based methods have been used with success for detecting vehicles and tracking
them over time. Typically, these methods involve an edge extraction step followed by a
dynamic programming method to match edges between images. The algorithms tend to
run faster than analytical methods since they compute one flow vector only for every
edge rather than every pixel. Reliable edges can be matched fairly robustly in these
schemes because of the length and orientation constraints that can be used in matching.
One problem with these methods is that the flow component along the edge is generally
unreliable insofar as the edge-detectors themselves often find broken lines, etc. Broken
lines can cause mismatches that make the flow component normal to the edge wrong too,
but this happens less frequently.
Pixel-based discrete methods, on the other hand, only compute the optical flow at
points where the flow is fully constrained without resorting to smoothness constraints.
Typically, the algorithm will choose local intensity minima or maxima or use an interest
operator, e.g. Moravec interest operator, to select points in an image. Then the algorithm
matches points using correlation similar to that in stereo matching. Using the observation
that under a planar ground assumption, an obstacle only changes the magnitude and not
the direction of the flow vector, it is possible to constrain the search to a single dimension
along the direction of the model flow vector (with the drawback that it may not allow the
detection of moving objects). The flow vector for the pixel is then just the vector from the
original pixel position to the location of the maximum correlation in the new image.
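A small software sketch of this kind of pixel-based matching is given below. It assumes the interest points are already available (for example from an interest operator such as Moravec's) and performs a one-dimensional sum-of-absolute-differences search along an assumed model flow direction; it is an illustration of the idea, not the implementation of any cited system.

```python
import numpy as np

def discrete_flow(img0, img1, points, direction, max_disp=8, win=3):
    """Sketch of pixel-based discrete flow: match each interest point along one
    assumed search direction (the model flow vector) using SAD correlation.
    img0, img1: 2-D grayscale arrays; points: (row, col) pairs away from the
    image border; direction: (drow, dcol) integer step of the model flow."""
    half = win // 2
    flows = []
    for r, c in points:
        ref = img0[r - half:r + half + 1, c - half:c + half + 1].astype(np.int32)
        best_d, best_cost = 0, np.inf
        for d in range(max_disp + 1):            # 1-D search, as the text allows
            rr, cc = r + d * direction[0], c + d * direction[1]
            if not (half <= rr < img1.shape[0] - half and
                    half <= cc < img1.shape[1] - half):
                break
            cand = img1[rr - half:rr + half + 1,
                        cc - half:cc + half + 1].astype(np.int32)
            cost = np.abs(ref - cand).sum()      # sum of absolute differences
            if cost < best_cost:
                best_cost, best_d = cost, d
        flows.append((r, c, best_d))             # flow magnitude along `direction`
    return flows
```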
As in optical flow, the search space for a corresponding point in the second image
can be reduced to one dimension. In the case of stereo, the match must fall along the epi-
polar line. Given a single image, the same features are found and essentially the same
matching procedure is run to match features to the second image, although the search is
made along different directions. Stereo has the advantage, however, that if the cameras
are situated with the optical axes perpendicular to the line connecting the optical centers,
the search space is constrained to the same image row. This makes stereo simpler to
implement and faster than optical flow. The conceptual difference is that one set of
images is taken with a spatial baseline, and one with a temporal baseline. I show that in these highway scenarios, the spatial baseline provides better detectability.
Fig 4: Stereo camera setup. The distance of the point from the cameras is proportional to the baseline
and inversely proportional to the disparity
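The proportionality in the caption is the standard triangulation relation Z = f·B/d for a rectified stereo pair. A tiny worked example with assumed camera parameters shows how small the disparity of a distant road point becomes:

```python
# Depth from disparity for a rectified stereo pair: Z = f * B / d.
# The focal length and baseline are assumed values for illustration only.

f = 800.0   # focal length in pixels (assumed)
B = 0.30    # stereo baseline in metres (assumed)

def depth_from_disparity(d_pixels):
    return f * B / d_pixels

def disparity_from_depth(Z_metres):
    return f * B / Z_metres

print(disparity_from_depth(60.0))   # a road point 60 m away -> 4.0 px disparity
print(depth_from_disparity(4.0))    # and back again -> 60.0 m
# Range differences of several metres at these distances amount to only
# fractions of a pixel of disparity, so sub-pixel matching and calibration matter.
```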
3.4 RADAR (Radio Detection and Ranging)
Radar is an excellent means of detecting other vehicles because radar works at long
ranges and is relatively unaffected by rain or snow. The radar was capable of detecting
vehicles at distances of up to 200 meters with a range resolution of approximately 0.1 meters. The sensor had a 3° vertical field of view and a 12° horizontal field of view.
Bearing to a target could be estimated via wavefront reconstruction, and when combined
with geometric information about the road, potential obstacles could be mapped to an
individual lane. Since radars provide direct range and may also provide a doppler velocity
measurement, they will most likely be a standard sensor for automated vehicles.
Unfortunately, current radars are not able to reliably detect small objects at ample
distances. Metal surfaces are good radar reflectors, and hence make vehicle detection
fairly easy. The ease with which an object may be detected at a given range is related to
its radar cross section. Vehicles have a much larger radar cross section (10 m²) than people (0.2 to 2 m²), and most road debris will have an even smaller radar cross section,
making them undetectable.
Laser striping is an indirect ranging method that has been used on a number of
robots (especially indoor robots). Laser striping uses a fixed CCD camera and a laser
stripe (visible to the CCD) that is swept across a scene. Triangulating between the known
direction of the laser and the direction provided by the laser stripe position on the CCD
provides a range estimate at each image row. Laser striping can provide 3-D geometry
fairly quickly since the laser only needs to be scanned in one direction, and computation
is simple. There are a couple of problems with laser striping. First, the laser must be easily
detectable within the image (significantly stronger than ambient light). As with any
triangulation system, range accuracy improves as the distance between the laser and CCD
is increased. However, as this distance is increased, the problem of “shadowing” worsens.
Shadowing occurs when points visible to the laser may not be visible to the camera and
vice versa. For the simplest systems (a beam diverged by a cylindrical lens), laser striping
requires more laser power than direct methods and hence is most useful indoors or at
short distances.
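A minimal sketch of the triangulation geometry follows, using an assumed baseline, focal length, and laser angle; it only illustrates how the known laser direction and the stripe's pixel position combine into a range estimate.

```python
import math

# 2-D triangulation sketch for laser striping: a fixed camera and a laser
# whose pointing angle is known.  Baseline, focal length and the example
# numbers are illustrative assumptions only.

f = 800.0   # camera focal length in pixels (assumed)
b = 0.5     # camera-to-laser baseline in metres (assumed)

def range_from_stripe(x_pixel, laser_angle_deg):
    """Distance along the optical axis to the surface point lit by the stripe.

    x_pixel:          horizontal image coordinate of the detected stripe
    laser_angle_deg:  known angle of the laser ray from the optical axis
    """
    tan_cam = x_pixel / f                            # camera ray direction
    tan_las = math.tan(math.radians(laser_angle_deg))
    return b / (tan_cam - tan_las)                   # ray intersection

# Example: stripe detected at pixel 120 while the laser is steered to 5 degrees.
print(range_from_stripe(120.0, 5.0))   # -> ~8.0 m
```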
Laser rangefinders avoid the shadowing problem by keeping the transmitted and
received beams approximately co-axial, and measuring range directly. A 3-D laser
scanner operates by sweeping a laser across the scene in two dimensions. At each pixel,
the instrument measures the time that it takes for a laser beam to leave the sensor, strike a
surface, and return. There are several methods for measuring the time. Many sensors also
provide an intensity measurement at each pixel by measuring the energy of the returned
laser signal. Thus, a full sweep in two-dimensions can provide both a depth map and an
intensity image.
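The range measurement itself is simple time-of-flight arithmetic, as in the hedged sketch below; the timing numbers are illustrative, not specifications of any particular sensor.

```python
# Time-of-flight ranging: the pulse travels out and back, so range = c * t / 2.
# The timing values below are examples only.

C = 299_792_458.0            # speed of light in m/s

def tof_range(round_trip_seconds):
    return C * round_trip_seconds / 2.0

print(tof_range(400e-9))     # a 400 ns round trip corresponds to ~60 m
# Resolving 0.1 m in range therefore requires resolving about 0.67 ns in time,
# which hints at why these instruments are expensive.
```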
Once the laser provides a depth map for the scene, the data is typically transformed
into an elevation map based on the known position of the sensor. Terrain-typing
algorithms may then be run on the resulting elevation map, and discrete obstacles may be
detected by looking for discontinuities in the elevation map. However, there are a number
of problems with this method. First, it is computationally expensive. Second, the discrete
grid size cannot represent vertical surfaces, and hence may miss some potentially
dangerous obstacles.
Most ranging software systems have been image-based, meaning that the system
waited for an entire laser image to be acquired and processed before obstacles were
reported to the path generation module. Hebert proposed a system that processes range
data pixel by pixel to reduce the latencies involved and improve system efficiency. A
pixel by pixel method also reduces the dependency on a particular range sensor, since
methods which use entire images are tuned to the specific field-of-view and geometry of
the sensor. Reported latencies were reduced to 100 ms or under. With the addition
of a planning module, the system was demonstrated in multi-kilometer traverses through
unknown terrain.
Despite problems with previous laser range methods, laser ranging would likely provide adequate obstacle detection capability for highway environments given adequate laser power. The range difference between an obstacle and the road surface behind it is the same
as in the stereo vision system. A laser sensor mounted at a height of 1.5 meters looking at
the road at 60 meters would see an obstacle of 20 cm at a range of 52 meters. Given
adequate power, this difference should be easily detectable by a laser rangefinder. Our
current laser scanner is not powerful enough to provide reliable range estimates from the
road at these distances. The real problem with laser rangefinders currently is that they are
too expensive as a solution for obstacle detection for consumer vehicles.
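The 52-metre figure quoted above follows from similar triangles, as the short check below shows, using only the numbers already given in the text:

```python
# Similar-triangles check of the figure quoted in the text: a sensor at 1.5 m
# height whose ray meets the road at 60 m grazes the top of a 20 cm obstacle
# at roughly 52 m.

sensor_height = 1.5        # metres
road_hit_distance = 60.0   # metres, where the ray meets the flat road
obstacle_height = 0.20     # metres

obstacle_distance = road_hit_distance * (1.0 - obstacle_height / sensor_height)
print(obstacle_distance)   # -> 52.0 m, matching the value in the text
```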
3.6 Other Vision-Based Detection Modalities
Although the most promising obstacle detection methods have been discussed, there
is the potential for new methods based on numerous visual cues. At least 12 depth cues
are available to humans, only two of which are binocular. And there are other cues
unrelated to depth that are used for obstacle detection (shape, color, luminance, etc.).
Unfortunately, many of the depth cues operate at distances shorter or longer than the
distances with which we are concerned (on the order of 50 meters). Most are also difficult
to encode in a computer program because they require a priori knowledge: typical size of
the obstacle, etc. It would be impractical to teach a system what every potential obstacle
might look like. Nevertheless, some of these other cues may be useful to future obstacle
detection systems. One depth cue that can easily be used on highways to estimate the distance to a suspected obstacle, however, is linear perspective. Linear perspective states that a fixed distance between two points subtends a smaller angle at larger depths.
Since we know approximately the width of the road, we can use the linear perspective
effects to estimate the distance to a given image row. Relative height may be used as a
depth cue because given level ground, objects which are closer to the horizon are also
farther away. Once calibrated using linear perspective effects, the vertical location of an
object in the image could be used to estimate its distance.
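A minimal sketch of this calibrated ground-plane cue is shown below; the camera height, focal length, and horizon row are assumed values used only to make the relation Z = f·H / (row − horizon) concrete.

```python
# Distance to a road point from its image row, assuming a flat road and a
# calibrated camera.  Camera height, focal length and horizon row are
# illustrative assumptions, not values from the report.

f = 800.0          # focal length in pixels (assumed)
H = 1.5            # camera height above the road in metres (assumed)
horizon_row = 240  # image row of the horizon after calibration (assumed)

def distance_from_row(row):
    """Range along the road to the ground point imaged at `row` (row > horizon)."""
    return f * H / (row - horizon_row)

print(distance_from_row(260))   # 20 px below the horizon -> 60 m
print(distance_from_row(264))   # 24 px below the horizon -> 50 m
```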
Several examples of vision tasks belonging to this three-level hierarchy are shown in
Table 1.
Task level      Computational characteristics                    Examples

Low             Small-neighborhood data access, simple           Edge detection, filtering,
                operations, large amount of data                 image morphology

Intermediate    Small neighborhood, more complex operations      Hough transform, connected
                                                                 component labeling, relaxation

High            Nonlocal data access, nonpolynomial and          Point matching, tree matching,
                complex operations                               graph matching

Table 1: Examples of the three-level vision task hierarchy
The work in this project is based on image processing algorithms that use such pixel windows to calculate their output. Although a pixel window may be of any size and shape, a square 3x3 window was chosen for this application because it is large enough to work properly and small enough to implement efficiently in hardware.
4.2.1 Algorithm
This filter works by analyzing a neighborhood of pixels around an origin pixel, for
every valid pixel in an image. Often, a 3x3 area, or window, of pixels is used to calculate
its output. For every pixel in an image, the window of neighboring pixels is found. Then
the pixel values are sorted in ascending, or rank, order. Next, the pixel in the output
image corresponding to the origin pixel in the input image is replaced with the value
specified by the filter order. Fig. 6 shows an example of this algorithm for a median filter
(order 5), a filter that is quite useful in noise filtering.
As is evident in the above figure, it is possible to use any order up to the number of
pixels in the window. Therefore a rank order filter using a 3x3 window has 9 possible
orders and a rank order filter using a 5x5 window has 25 possible orders. No matter what
the window size used in a particular rank order filter, using the middle value in the sorted
list will always result in a median filter. Similarly, using the maximum and minimum
values in the sorted list always results in the flat dilation and erosion of the image,
respectively. These two operations are considered part of the morphological operations,
and are discussed in the next sub-section.
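A simple software model of the rank order filter described above (a reference sketch only, not the FPGA implementation) makes these relationships explicit: order 0 gives the minimum (flat erosion), the middle order gives the median, and the highest order gives the maximum (flat dilation).

```python
import numpy as np

def rank_order_filter(image, order, win=3):
    """Software model of a 3x3 rank order filter (reference sketch only).

    order = 0            -> minimum  (flat erosion)
    order = win*win // 2 -> median   (the classic noise filter)
    order = win*win - 1  -> maximum  (flat dilation)
    Border pixels are left unchanged for simplicity.
    """
    half = win // 2
    out = image.copy()
    rows, cols = image.shape
    for r in range(half, rows - half):
        for c in range(half, cols - half):
            window = image[r - half:r + half + 1, c - half:c + half + 1]
            out[r, c] = np.sort(window, axis=None)[order]   # rank-ordered pick
    return out

# Example: median filtering (order 4 of 0..8) removes an isolated noisy pixel.
img = np.array([[10, 10, 10, 10],
                [10, 255, 10, 10],
                [10, 10, 10, 10],
                [10, 10, 10, 10]], dtype=np.uint8)
print(rank_order_filter(img, order=4))
```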
Fig. 7: Concept of Structuring Element Fitting and Not Fitting
Algorithm
There are two fundamental operations in morphology: erosion and dilation. It is common
to think of erosion as shrinking (eroding) an object in an image. Dilation does the
opposite; it grows the image object. Both of these concepts depend on the structuring
element and how it fits within the object. For example, if a binary image is eroded, the resultant image has a foreground pixel at every origin pixel whose surrounding structuring-element-sized neighborhood fits entirely within the object. The output of a dilation operation has a foreground pixel at every point covered by the structuring element when its origin is placed on a foreground pixel of the image object.
One of the strongest features of morphological image processing stems from the fact that the basic operators, performed in different orders, can yield many different, useful
results. For example, if the output of an erosion operation is dilated, the resulting
operation is called an opening. The dual of opening, called closing, is a dilation followed
by erosion. These two secondary morphological operations can be useful in image
restoration, and their iterative use can yield further interesting results, such as
skeletonization and granulometries of an input image.
Grayscale erosion and dilation can be achieved by using a rank order filter as well.
Erosion corresponds to a rank order filter of minimum order, and dilation corresponds to
a rank order filter of maximum order. The reason for this is that the result of a minimum-order rank order filter is the minimum value in the pixel neighborhood, which is exactly
what an erosion operation is doing. This also holds true for a maximum order rank order
filter and a dilation operation. However, the rank order filter only works as a
morphological operation with a flat structuring element. This is because the rank order
filter window works as a sort of structuring element consisting of all ones. Still, this is a
powerful feature, since grayscale morphology using flat structuring elements accounts for
the most common usage of morphology.
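Given flat grayscale erosion and dilation, for instance the minimum- and maximum-order cases of the rank order filter sketched earlier, opening and closing are then just two-line compositions; the snippet below is a sketch of those compositions, not the project's FPGA pipeline.

```python
# Opening and closing expressed with the rank order filter sketched earlier,
# which provides flat grayscale erosion (order 0) and dilation (order 8) for a
# 3x3 window.

def opening(image):
    # erosion followed by dilation: suppresses small bright specks
    return rank_order_filter(rank_order_filter(image, order=0), order=8)

def closing(image):
    # dilation followed by erosion: fills small dark gaps and holes
    return rank_order_filter(rank_order_filter(image, order=8), order=0)
```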
In a grayscale image, erosion tends to grow darker areas, and dilation tends to grow
lighter areas. Opening and closing each tend to emphasize certain features in the image, while de-emphasizing others. Applied iteratively, the morphological operations can be used to
pick out specific features in an image, such as horizontal or vertical lines.
4.4 Convolution
Convolution is another commonly used algorithm in DSP systems. It is from a class
of algorithms called spatial filters. Spatial filters use a wide variety of masks, also known
as kernels, to calculate different results, depending on the function desired. For example,
certain masks yield smoothing, while others yield low pass filtering or edge detection.
4.4.1 Algorithm
The convolution algorithm can be calculated in the following manner. For each input
pixel window, the values in that window are multiplied by the convolution mask. Next,
those results are added together and divided by the number of pixels in the window. This
value is the output for the origin pixel of the output image for that position.
The input pixel window is always the same size as the convolution mask. The output
pixel is rounded to the nearest integer. As an example, Fig. 9 shows an input pixel
window, the convolution mask, and the resulting output. The convolution mask in this example is often used as a noise-cleaning filter.
Convolution Output
= (50*1 + 10*1 + 20*1 + 30*1 + 70*2 + 90*1 + 40*1 + 60*1 + 80*1)/9 = 57.7778
=> 58
Applying this algorithm over an entire input image yields an output image with reduced salt-and-pepper noise. An important aspect of the convolution algorithm is that it supports a virtually infinite variety of masks, each with its own effect. This flexibility allows for many powerful uses.
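A direct software model of the algorithm as described (multiply the window by the mask, sum, divide by the number of pixels, and round) reproduces the worked example above; the loop and helper below are a reference sketch, not an FPGA design.

```python
import numpy as np

def convolve3x3(image, mask):
    """Direct model of the convolution step described in the text: multiply the
    3x3 pixel window by the mask, sum, divide by the window size (9) and round.
    Border pixels are skipped for brevity."""
    out = image.copy().astype(np.int32)
    rows, cols = image.shape
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            window = image[r - 1:r + 2, c - 1:c + 2].astype(np.int32)
            out[r, c] = int(round((window * mask).sum() / 9.0))
    return out

# The worked example from the text: a noise-cleaning mask with centre weight 2.
mask = np.array([[1, 1, 1],
                 [1, 2, 1],
                 [1, 1, 1]])
window = np.array([[50, 10, 20],
                   [30, 70, 90],
                   [40, 60, 80]])
print(int(round((window * mask).sum() / 9.0)))   # -> 58, as in the example
```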
5. Summary and Future Work
Many of the key imaging functions break down into highly repetitive tasks that are
well suited to modern FPGAs, especially those with built-in hardware multipliers and on-
chip RAM. The remaining tasks require hardware more suited to control flow and decision making, such as DSPs.
To gain the full benefit of both approaches, systems are required that can effectively
combine both FPGAs and DSPs. With the addition of standard imaging functions written in either VHDL or C, all of the key building blocks are available for building an image processing system. Using the DSP functions provided with the camera, we can implement the hardware for the project.
This research has been an excellent learning experience in various engineering
disciplines. The digital signal processing algorithms are now to be implemented in MATLAB so that the results can be compared and simulations can be run to tune for optimum performance.
6. References
[1] Hussain, Z.: “Digital Image Processing – Practical Applications of Parallel Processing Techniques,” Ellis Horwood, West Sussex, UK, 1991.
[2] Pratt, W.: “Digital Image Processing,” Wiley, New York, NY, 1978.
[3] Ratha, N. K. & Jain, A. K.: “Computer Vision Algorithms on Reconfigurable Logic Arrays,” IEEE, 1999.
[6] Kruger, W. & Enkelman, W.: “Real-Time Estimation and Tracking of Optical Flow Vectors for Obstacle Detection,” Germany.
[7] Nakai, H. & Takeda, N.: “A Practical Stereo Scheme for Obstacle Detection in Automotive Use,” IEEE, 2004.
[8] Wang, C. & Huang, S.: “Driver Assistance System for Lane Detection and Vehicle Recognition with Night Vision,” National Taiwan University, Taipei, Taiwan.
[9] Hancock, J.: “Laser Intensity-Based Obstacle Detection and Tracking,” Thesis, Carnegie Mellon University, Pittsburgh, Pennsylvania.