OCRdroid: A Framework to Digitize Text Using Mobile Phones
Anand Joshi†, Mi Zhang§, Ritesh Kadmawala†, Karthik Dantu†, Sameera Poduri†, and Gaurav S. Sukhatme†§
†Computer Science Department, §Electrical Engineering Department
University of Southern California, Los Angeles, CA 90089, USA
{ananddjo,mizhang,kadmawal,dantu,sameera,gaurav}@usc.edu
1 Introduction
Optical character recognition (OCR) is a powerful tool for bringing information from
our analog lives into the increasingly digital world. This technology has long seen
use in building digital libraries, recognizing text from natural scenes, understanding
hand-written office forms, among other applications. By applying OCR technologies, scanned or camera-
captured documents are converted into machine editable soft copies that can be edited,
searched, reproduced and transported with ease [15]. Our interest is in enabling OCR
on mobile phones.
Mobile phones are one of the most commonly used electronic devices today.
Commodity mobile phones with powerful microprocessors (above 500 MHz), high-resolution cameras (above 2 megapixels), and a variety of embedded sensors (accelerometers, compasses, GPS) are now widely deployed and becoming ubiquitous. By fully exploiting these capabilities, mobile phones are becoming powerful portable computing platforms that can run computation-intensive programs in real time.
In this paper, we explore the possibility of building a generic framework for developing OCR-based applications on mobile phones. We believe such a mobile solution for extracting information from the physical world is a good match for future trends [17]. However, camera-captured documents have drawbacks: they suffer from out-of-focus blur, uneven document lighting, and geometrical distortions such as text skew, bad orientation, and text misalignment [16]. Moreover, since the system runs on a mobile phone, real-time response is also a critical challenge. We have developed a framework, OCRdroid, to address these challenges.
2 Related Work
Several OCR systems are commercially available today, such as ABBYY FineReader, OmniPage, and Microsoft Office Document Imaging. In addition, the research and open-source communities offer comparable systems such as GOCR [3], OCRAD [6], Tesseract [10], and OCROPUS [8]. ABBYY also provides a Mobile OCR Engine [1] for mobile phones, which is claimed to provide real-time processing with very high accuracy.
Several academic projects and commercial products have tried to use mobile phone
cameras to build interesting applications. In [25], the authors presented a mobile sensing
system that analyzes images of air sensors taken on cell phones and extracts indoor air
pollution information by comparing the sensor to a calibrated color chart using image
processing and computer vision techniques. However, all the processing in this system
is performed at the backend server and not in real time. In [15], the authors designed an image preprocessing suite on top of an OCR engine to improve the accuracy of text recognition in natural scene images. Again, all the processing is implemented at the backend server, with no implementation on the mobile device. Some mobile phones now ship with a business card reader application that lets users store contact information from business cards directly on the phone [26].
Also, in [14], the authors discuss a mobile application that captures the barcode of any item and retrieves detailed information about its ratings, price, and reviews.
4 Design Considerations
OCR systems have been under development in both academia and industry since the 1950s. Such systems use knowledge-based and statistical pattern recognition techniques to transform scanned or photographed text images into machine-editable text files. Normally, a complete OCR process includes five main steps [13]: (1) noise attenuation, (2) image binarization (to black and white), (3) text segmentation, (4) character recognition, and (5) post-processing, such as spell checking. These steps are very effective when applied to document text which, when collected by a scanner, is generally aligned and has clear contrast between the text and its uniform background. However, taking pictures with a portable camera, especially one embedded in a mobile device, may lead to various artifacts in the images and, as a result, cause even the best available OCR engine to fail. Problems include uneven document lighting, perspective distortion, text skew, and misalignment. Some of these issues are illustrated in Figure 1. In addition, since the system is installed on mobile devices, real-time response is another critical issue that needs to be considered. Some of these issues exhibit inherent tradeoffs and must be addressed in a manner that suits our applications.
This section presents a pertinent range of design issues and tradeoffs, and discusses
proposed approaches applicable to our OCRdroid framework.
Fig. 1: Different issues arising in camera-captured documents: (a) shading (b) flooding (c) blur
(d) skew (e) tilted (f) misalignment
Further complications occur when trying to use artificial light, i.e., the flash, which results in light flooding.
Proposed Approach: Binarization has long been recognized as a standard method for addressing the lighting issue. The goal of the binarization process is to classify the pixels of a given grayscale or color document image as either foreground (text) or background and, as a result, reduce the candidate text region to be processed by later steps. In general, binarization methods for grayscale documents fall into two broad categories: global binarization and local binarization [23].
Global binarization methods like Otsu’s algorithm [19] try to find a single threshold
value for the whole document. Each pixel is then assigned to either foreground or back-
ground based on its grey value. Global binarization methods are very fast and they give
good results for typical scanned documents. However, if the illumination over the doc-
ument is not uniform, for instance, in the case of camera-captured documents, global
binarization methods tend to produce marginal noise along the page borders. Local bi-
narization methods, such as Niblack’s algorithm [18], and Sauvola’s algorithm [21],
compute thresholds individually for each pixel using information from the local neigh-
borhood of that pixel. Local algorithms are usually able to achieve good results even on
severely degraded documents with uneven lighting conditions. However, they are often slow, since image features must be computed from the local neighborhood of every pixel. In this work, in order to handle uneven lighting conditions in camera-captured documents, we adopt a local binarization method.
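As an illustration, the following is a minimal sketch of Sauvola's local thresholding [21] accelerated with integral images, in the spirit of [12]. It assumes the input is a NumPy grayscale array; the window size w and parameters k and R are conventional illustrative choices, not values prescribed by our implementation.

```python
import numpy as np

def sauvola_binarize(gray, w=15, k=0.2, R=128.0):
    """Sauvola local thresholding with integral images, so the per-pixel
    window mean and standard deviation cost O(1) (cf. [12]).
    gray: 2-D uint8 array; returns 1 for foreground (text), 0 otherwise."""
    g = gray.astype(np.float64)
    pad = w // 2
    gp = np.pad(g, pad, mode="edge")
    # Integral images of the pixel values and their squares, with a
    # leading zero row/column so window sums need no special cases.
    s1 = np.pad(np.cumsum(np.cumsum(gp, 0), 1), ((1, 0), (1, 0)))
    s2 = np.pad(np.cumsum(np.cumsum(gp ** 2, 0), 1), ((1, 0), (1, 0)))
    def window_sum(ii):
        # Inclusion-exclusion over the four corners of each w x w window.
        return ii[w:, w:] - ii[:-w, w:] - ii[w:, :-w] + ii[:-w, :-w]
    n = float(w * w)
    mean = window_sum(s1) / n
    std = np.sqrt(np.maximum(window_sum(s2) / n - mean ** 2, 0.0))
    # Sauvola's rule: T = m * (1 + k * (s / R - 1)).
    threshold = mean * (1.0 + k * (std / R - 1.0))
    return (g < threshold).astype(np.uint8)
```

The integral images make the cost independent of the window size, which is what makes a local method tolerable on large camera images.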
4.4 Misalignment
Issue: Text misalignment happens when the camera screen covers only a partial text region, so that irregular, truncated character shapes are captured and fed as input to
the OCR engine. Figure 1(f) shows an example of a misaligned image. The misalignment issue generally arises when people take pictures casually. Our experimental results indicate that OCR output is significantly affected by misalignment. Moreover, misaligned images may lead to loss of important data.
Proposed Approach: We define a camera-captured image to be misaligned if any of the four screen borders of the phone cuts through a line of text, either horizontally or vertically. Based on this definition, we set a 10-pixel-wide margin along each border, as depicted in Figure 2.

If any foreground pixel is detected within any of these four margins, there is a high probability that the text is being cut, and we conclude that the image is misaligned.
We faced two major challenges while designing an algorithm for this task:
1. Our algorithm should provide real time response.
2. Our algorithm must be able to deal with random noise dots and should not classify
them as part of any text.
Considering imperfect lighting conditions, we decided to perform a binarization preprocessing step before inferring whether a foreground pixel has been detected. Although global binarization algorithms tend to run faster, we adopted a local binarization algorithm as the basis for our alignment detection algorithm, for two reasons:
1. Images captured with a camera suffer from illumination variation and blur. Global binarization algorithms designed for flatbed scanners fail to handle the local subtleties in such images and thus produce poor results.
2. Our alignment-checking algorithm is local by nature, since only pixels within the four margins need to be checked (see the sketch below).
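To make the procedure concrete, here is a minimal sketch of the margin check under the definitions above: 10-pixel margins, and a W×W local box to reject isolated noise dots (cf. Figure 4). The binary input convention and the density threshold are illustrative assumptions, not the exact values used in OCRdroid.

```python
import numpy as np

MARGIN = 10  # margin width in pixels, from the definition above
W = 5        # side of the local box used to reject isolated noise dots

def is_misaligned(binary):
    """binary: 2-D array with 1 = foreground (text), 0 = background,
    e.g. the output of the local binarization sketch above.
    Returns True if text appears to be cut by any screen border."""
    margins = [binary[:MARGIN, :], binary[-MARGIN:, :],   # top, bottom
               binary[:, :MARGIN], binary[:, -MARGIN:]]   # left, right
    for strip in margins:
        ys, xs = np.nonzero(strip)
        for y, x in zip(ys, xs):
            # A real text stroke has several foreground neighbors in a
            # W x W box; an isolated noise dot does not.
            y0, x0 = max(y - W // 2, 0), max(x - W // 2, 0)
            box = strip[y0:y0 + W, x0:x0 + W]
            if box.sum() > W:  # illustrative density threshold
                return True
    return False
```

Since only four thin strips are scanned, the check runs in time proportional to the margin area rather than the full image, which is what makes a real-time response feasible on the phone.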
Issue: Since many digital cameras are designed to operate over a variety of distances,
focus becomes a significant factor. Sharp edge response is required for the best character
segmentation and recognition [16]. At short distances and large apertures, even slight
perspective changes can cause uneven focus.
Proposed Approach: We adopted the AutoFocus API provided by the Android SDK to avoid the blurring seen in out-of-focus images. Whenever we start the application, a
camera autofocus handler object is initiated so that the camera itself can focus on the
text sources automatically.
Fig. 4: A perfectly aligned receipt with noise dots located within the top margin and circled by a
red ellipse. A W×W local box is defined to help filter out the noise dots.
We developed the OCRdroid phone client program using Google’s Android software
stack (version 1.1) in the Windows environment. Android is the first truly open and
comprehensive platform designed specifically for mobile devices. Compared to other
mobile platforms such as Nokia S60 and Windows Mobile, Android is more efficient and easier to program. Our client software requires access to the phone camera, embedded sensors, background services, network services, a relational database, and the file system inside the phone. All the necessary APIs are provided by the Android SDK (version 1.1).
phone is tilted more than the allowed threshold, users are prevented from capturing any
images. Figure 7 presents a screenshot showing how the tilt detection module works.
The image on the right shows a case where the user has tilted the phone. As a result, a red bounding box is displayed, indicating that the camera is not properly oriented. As soon as the orientation is correct, the box turns green and the user is allowed to capture an image.
Fig. 7: Screenshot demonstrating tilt detection capability of OCRDroid using embedded orienta-
tion sensor. A green (red) rectangle indicates correct (incorrect) tilt.
Once the user takes a picture, the misalignment detection module is initiated to verify that the text source is properly aligned. In case of a misaligned image, the client pops up a notification telling the user where misalignment was detected. As soon as the text source is properly aligned, the new image is transported to the server over an HTTP connection. An OCR data receiver module listens on the connection and passes the OCR result sent from the server to the information extraction module, where the corresponding information is extracted and parsed into different categories. Finally, the OCR result is displayed on the screen and automatically stored in the local database for future reference.
The OCRdroid backend server integrates an OCR engine with a web server. Figure 8 presents the architectural diagram of the backend server. Once images are received at the web server, shell scripts are called through PHP (5.0) in sequence to perform binarization and image rotation (if any skew is detected). At each step, the image conversion is done using the popular open-source tool ImageMagick [4]. Once the image preprocessing steps are completed, the intermediate result is fed to the OCR engine for text segmentation and character recognition.
There are many open source as well as commercial OCR engines available, each
with its own unique strengths and weaknesses. A detailed list of OCR engines is avail-
able at [11]. We tested some of the open source OCR engines such as OCRAD [6],
Tesseract [10], GOCR [3] and Simple OCR [9]. Based on our experiments, we found
that Tesseract gave the best results. We also tested a complete document analysis tool
called OCRopus [5]. It first performed document layout analysis and then used Tesser-
act as its OCR engine for character recognition. However, OCRopus introduced an additional complication in our PocketPal application because of its inherent multi-column support: the document layout analyzer identified receipts as being in a two-column format. As a result, it displayed the names of all shopping items followed by their corresponding prices, which required extra operations to match the name and the price of each item. Therefore, we chose Tesseract as our OCR engine.
Once the whole process finishes successfully, the OCRdroid server responds to the client with an OK message. On receiving this message, the client sends back an HTTP request asking for the OCR result. Finally, the OCRdroid server sends the text result back as soon as the request is received.
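From the client's point of view, this exchange is a plain upload-then-fetch pattern. The sketch below uses the Python requests library; the server URL and endpoint paths are hypothetical placeholders, not the actual OCRdroid interface.

```python
import requests

SERVER = "http://example-server/ocrdroid"  # placeholder URL

def submit_and_fetch(image_path):
    """Sketch of the client-server exchange: upload the image, then,
    after the server's OK, request the OCR text."""
    with open(image_path, "rb") as f:
        reply = requests.post(f"{SERVER}/upload", files={"image": f})
    reply.raise_for_status()  # the server answers OK once processing is done
    result = requests.get(f"{SERVER}/result")
    return result.text
```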
The system was tested on 10 distinct black and white images without illustrations. Tests
were performed under three distinct lighting conditions. All the test images were taken
by the HTC G1's embedded 3.2-megapixel camera. The images and corresponding results can be found at our project website [7].
In order to measure the accuracy of OCR, we adopt two metrics proposed by the Infor-
mation Science Research Institute at UNLV for the Fifth Annual Test of OCR Accuracy
[20]:
Character Accuracy This metric measures the effort required by a human editor to
correct the OCR-generated text. Specifically, we compute the minimum number of edit
operations (character insertions, deletions, and substitutions) needed to fully correct the
text. We refer to this quantity as the number of errors made by the OCR system. The
character accuracy is defined as:

\[ \text{character accuracy} = \frac{\#\text{characters} - \#\text{errors}}{\#\text{characters}} \tag{1} \]
Word Accuracy In a text retrieval application, the correct recognition of words is much more important than the correct recognition of numbers or punctuation. We define a word to be any sequence of one or more letters. If m out of n words are recognized correctly, the word accuracy is m/n. Since full-text searching is almost always performed on a case-insensitive basis, we consider a word to be correctly recognized even if one or more letters of the generated word are in the wrong case (e.g., "transPortatIon").
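Both metrics are straightforward to compute; below is a small sketch. The edit-distance routine implements the insertion/deletion/substitution count behind Eq. (1); the word matcher uses a simple bag-of-words membership test, which is a simplification of a full word alignment.

```python
import re

def character_accuracy(reference, ocr_output):
    """Eq. (1): (#characters - #errors) / #characters, where #errors is
    the minimum number of insertions, deletions, and substitutions."""
    m, n = len(reference), len(ocr_output)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == ocr_output[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return (m - d[m][n]) / m

def word_accuracy(reference, ocr_output):
    """m/n over words (sequences of letters), compared case-insensitively."""
    words = lambda s: re.findall(r"[a-z]+", s.lower())
    ref, out = words(reference), words(ocr_output)
    matched = sum(1 for w in ref if w in out)  # simplified bag matching
    return matched / len(ref) if ref else 1.0
```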
In addition, real time response is another very important metric for mobile applica-
tions. We evaluate timing performance in terms of processing time taken by each of the
preprocessing steps.
In this section, we give both qualitative and quantitative analysis of the performance
improvement brought by each of our preprocessing steps.
Fig.: (a) An image of a text source taken under normal lighting conditions, (b) OCR output without binarization, (c) OCR output with binarization

Binarization OCR performance is highly sensitive to the lighting conditions when the image is captured. To measure the performance, we carried out our
experiments under three distinct lighting conditions:
– Normal lighting condition: This refers to situations when images are captured out-
doors in the presence of sunlight or indoors in an adequately lit room.
– Poor lighting condition: This describes situations when users take images outdoors
during night or capture images in rooms which have very dim lighting.
– Flooding condition: This describes situations when the source of light is very much
focused on a particular portion of the image, whereas the remaining portion is dark.
Lighting condition | Character accuracy without binarization (%) | Character accuracy with binarization (%)
Normal   | 46.09 | 96.94
Poor     |  9.09 | 59.59
Flooding | 27.74 | 59.33

Table 1: Comparison of OCR character accuracy with and without binarization under three lighting conditions
Table 1 lists the character accuracy of the OCR output with and without binarization under the three lighting conditions. As expected, OCR with binarization achieves much higher performance than its counterpart without binarization. Under the normal lighting condition, the OCR output with binarization achieves 96.94% character accuracy. However, the performance degrades significantly if the lighting is poor or flooded. The main cause is that text in shaded regions tends to have very dark and broad boundaries. This confuses the OCR engine and, as a result, leads to relatively low character accuracy.
Skew Detection and Auto-Rotation Our Tesseract OCR engine failed to produce any meaningful results if the input image was skewed by more than 5°. In order to examine the performance of our skew detection and auto-rotation algorithm, we rotated all the images in our text corpus by 5°, 10°, 15°, 25°, and 35° in both the clockwise and counter-clockwise directions using the GIMP image processing toolbox [2]. These rotated images were then photographed under the three lighting conditions and passed to the OCR engine. Figure 10 and Figure 11 show the performance in terms of average character accuracy and average word accuracy.

As presented, our skew detection and auto-rotation algorithm works quite well for image rotations of up to 30° in both the clockwise and counter-clockwise directions. On the other hand, even under the normal lighting condition, the performance drops sharply at a rotation of 35°. However, in real-world applications, it is reasonable to believe that users would not take images at such extreme angles of rotation.
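The skew detection algorithm itself is described earlier in the paper; as a generic illustration of how such a detector can work, a projection-profile search rotates the binarized image through candidate angles and picks the one that makes text lines horizontal. This sketch is a common textbook technique, not necessarily the algorithm OCRdroid uses; the angle range and step are illustrative.

```python
import numpy as np
from scipy.ndimage import rotate

def estimate_skew(binary, max_angle=35.0, step=0.5):
    """Projection-profile skew estimation: search for the rotation that
    maximizes the variance of row-wise ink counts, since text lines are
    horizontal at that angle. Returns the correction angle in degrees."""
    best_angle, best_score = 0.0, -1.0
    for angle in np.arange(-max_angle, max_angle + step, step):
        candidate = rotate(binary, angle, reshape=False, order=0)
        score = candidate.sum(axis=1).var()  # variance of the row profile
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle  # rotating the image by this angle deskews it
```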
Fig. 10: Average Character Accuracy of OCR output at different rotation angles under three
different lighting conditions
Fig. 11: Average Word Accuracy of OCR output at different rotation angles under three different
lighting conditions
Lighting condition | Type of images   | No. of images | Detected misaligned | Detected properly aligned
Normal   | Misaligned       | 15 | 14 |  1
Normal   | Properly aligned | 15 |  2 | 13
Poor     | Misaligned       | 15 | 14 |  1
Poor     | Properly aligned | 15 |  7 |  8
Flooding | Misaligned       | 15 | 13 |  2
Flooding | Properly aligned | 15 |  6 |  9

Table 2: Performance of the misalignment detection algorithm under three lighting conditions
Perception Distortion When taking pictures, if the camera is not parallel to the text source being captured, the resulting image suffers from perspective distortion. It is very hard to measure the tilt angles directly, so we tuned the thresholds in the program and then checked the performance. Based on our experiments, we found that OCR accuracy is highly susceptible to camera orientation: the character accuracy drops sharply if the camera is tilted more than 12°. Therefore, we set the threshold for our orientation sensor to [-10°, 10°]. The upper part of Figure 13 shows a document image captured when the mobile phone was tilted 20°, together with the corresponding OCR text output. The poor OCR accuracy indicates that OCR results are strongly correlated with the orientation of the camera. By contrast, the lower part shows an image taken at a 5° orientation and its OCR result with 100% accuracy. This contrast clearly demonstrates that the orientation sensor plays an important role in ensuring good performance.
Fig. 13: Comparison of OCR results with and without perception distortion
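The gating logic behind the green/red bounding box reduces to a threshold test on the orientation-sensor readings. A minimal sketch follows, assuming pitch and roll angles in degrees; the parameter names are illustrative rather than taken from the Android sensor API.

```python
TILT_LIMIT_DEG = 10.0  # from our experiments: allowed range [-10, 10] degrees

def capture_allowed(pitch_deg, roll_deg):
    """True (green box, shutter enabled) only when both tilt angles are
    within the threshold; otherwise the red box is shown and capture
    is blocked."""
    return abs(pitch_deg) <= TILT_LIMIT_DEG and abs(roll_deg) <= TILT_LIMIT_DEG
```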
The preprocessing time measured on the phone is relatively large, since we cannot shut down some of the native applications running on the phone. In addition, since we used Wi-Fi as the link between the mobile device and the backend server, the network latency is negligible compared to the other processing steps. The results are summarized in Table 3.

Table 3: Experimental results indicating the maximum time taken by each of the preprocessing steps
Our OCRdroid framework only works in the case where the background of the text
source is simple and uniform. However, in some cases, the source may contain a very
complex background, such as pages of magazines with illustrations, or icons of compa-
nies printed on receipts. We are working on applying text detection algorithms to detect
regions that most likely contain text and then separate text regions from the complex
background.
In some cases, documents or receipts are quite long and cannot be captured within a single frame due to the limited screen size of mobile phones. We are currently investigating algorithms that merge the OCR results from images capturing different portions of the same text source, ensuring they are merged in the proper sequence without data loss or repetition.
We are also working on a new application on top of the OCRdroid framework: PocketScan. This application allows users to quickly scan the ingredient lists of medicines and foods and check whether they contain a particular chemical to which the user is allergic.
8 Conclusions
In this paper, we present the design and implementation of a generic framework called
OCRdroid for developing OCR-based applications on mobile phones. We focus on using the orientation sensor, the embedded high-resolution camera, and digital image processing techniques to solve OCR issues related to camera-captured images. One of the key technical challenges addressed by this work is a mobile solution for real-time text misalignment detection. In addition, we have developed two applications called PocketPal and
References
1. Abbyy Mobile OCR Engine. http://www.abbyy.com/mobileocr/.
2. Gimp - the GNU Image Manipulation Program. http://www.gimp.org/.
3. GOCR - A Free Optical Character Recognition Program. http://jocr.sourceforge.net/.
4. ImageMagick: Convert, Edit, and Compose Images. http://www.imagemagick.org.
5. OCR resources (OCRopus). http://sites.google.com/site/ocropus/ocr-resources.
6. OCRAD - The GNU OCR. http://www.gnu.org/software/ocrad/.
7. OCRdroid - Project Website. http://www-scf.usc.edu/~ananddjo/ocrdroid/index.php.
8. OCRopus - Open Source Document Analysis and OCR System.
http://sites.google.com/site/ocropus/Home.
9. Simple OCR - Optical Character Recognition. http://www.simpleocr.com/.
10. Tesseract OCR Engine. http://code.google.com/p/tesseract-ocr/.
11. Wikipedia: Optical Character Recognition. http://en.wikipedia.org/wiki/Optical_character_recognition.
12. F. Shafait, D. Keysers, and T. M. Breuel. Efficient implementation of local adaptive thresholding techniques using integral images. 2008.
13. W. Bieniecki, S. Grabowski, and W. Rozenberg. Image preprocessing for improving OCR accuracy. In Perspective Technologies and Methods in MEMS Design (MEMSTECH 2007), 2007.
14. E. Ohbuchi, H. Hanaizumi, and Lim Ah Hock. Barcode readers using the camera device in mobile phones. In CW '04: Proceedings of the 2004 International Conference on Cyberworlds. IEEE Computer Society, 2004.
15. M. Elmore and M. Martonosi. A morphological image preprocessing suite for OCR on natural scene images. 2008.
16. J. Liang, D. Doermann, and H. P. Li. Camera-based analysis of text and documents: a survey.
International Journal on Document Analysis and Recognition, 7(2-3):84–104, July 2005.
17. Pranav Mistry and P Maes. Quickies: Intelligent sticky notes. In International Conference
on Intelligent Environments, 2008.
18. W. Niblack. An Introduction to Digital Image Processing. Prentice Hall, 1986.
19. N. Otsu. A threshold selection method from gray-level histograms. IEEE Transactions on
Systems, Man and Cybernetics, 1979.
20. S. V. Rice, F. R. Jenkins, and T. A. Nartker. OCR accuracy: UNLV’s fifth annual test. IN-
FORM, September 1996.
21. J. Sauvola and M. Pietikainen. Adaptive document image binarization. Pattern Recognition,
2000.
22. M. Seeger and C. Dance. Binarising camera images for OCR. In ICDAR, 2001.
23. M. Sezgin and B. Sankur. Survey over image thresholding techniques and quantitative per-
formance evaluation, 2004.
24. A. Ulges, C. H. Lampert, and T. M. Breuel. Document image dewarping using robust esti-
mation of curled text lines. In ICDAR, 2005.
25. K. Whitesell, B. Kutler, N. Ramanathan, and D. Estrin. A system for determining indoor air quality from images of an air sensor captured on cell phones. 2008.
26. X.-P. Luo, J. Li, and L.-X. Zhen. Design and implementation of a card reader based on build-in camera. In ICPR '04: Proceedings of the 17th International Conference on Pattern Recognition, Volume 1. IEEE Computer Society, 2004.