Systems, Jagannath Gupta Institute of Engineering and Technology, Jaipur, March 4-5, 2011
Abstract: Odia is one of the oldest and most popular languages of India, spoken by more than 44 million people, especially in Odisha, India. Some characters in Odia are made up of more than one connected symbol. Compound characters are written by associating modifiers with consonants, resulting in a huge number of possible combinations, running into hundreds of thousands. Therefore, systems developed for recognition of other scripts, like Roman, cannot be used directly for the Odia language. In the present work, we propose a robust structural solution for Odia character recognition in which a given text is segmented into lines, each line is segmented into individual words, and each word is segmented into individual characters or basic symbols. Basic symbols are identified as the fundamental units of segmentation used for recognition. By using the unique structure of some characters, we have obtained better results compared to other methods.

I. INTRODUCTION
During the past thirty years, substantial research effort has been devoted to character recognition, which translates human-readable characters into machine-readable codes. This effort is justified because character recognition provides a solution for processing large volumes of data automatically in a large variety of scientific and business applications. Handwriting is converted to digital form either by scanning the written paper or by writing with a special pen on an electronic surface such as a digitizer combined with a liquid crystal display [1]. The two approaches are distinguished as offline and online handwriting, respectively. In the online case, the two-dimensional coordinates of successive points of the handwriting are stored in order as a function of time [2]. For offline handwriting, only the completed writing is available as an image [3]. Because the temporal information is lost, offline systems are generally less accurate than online systems.
Template matching, or matrix matching, is one of the most common classification methods [4], [5]. In template matching, individual image pixels are used as features. Classification is performed by comparing an input character image with a set of templates (or prototypes) from each character class. Each comparison results in a similarity measure between the input character and the template. One measure increases the amount of similarity when a pixel in the observed character is identical to the same pixel in the template image. If the pixels differ, the measure of similarity may be decreased. After all templates have been compared with the observed character image, the character’s identity is assigned as the identity of the most similar template, that is, the template for which the correlation coefficient is maximum.
Template matching can be used for a small set of postures, requires a small amount of calibration and no advance learning of patterns, and is quite accurate. But this technique does have limitations. The main limitation is the small number of possible postures that can be recognized. If the application requires a large posture set, then template matching will not work well.
The main challenge in handwritten character recognition is the development of a method that can generate the description of the handwritten objects in a short period of time. In this study we propose a simple yet robust structural solution for performing Odia (the official language of Odisha) character recognition.
The rest of the paper is organized as follows. Section II describes character modelling. In Section III, character recognition using the sub-structure based method is explained. The experiments and results are discussed in Section IV. Finally, the conclusion of the paper is given in Section V.

II. CHARACTER MODELLING
A. Odia literature
The number of characters in Odia is large. Two or more characters may combine to form a compound character; as a result, the total number of characters to be recognized is more than 200.
Figure 1. Set of Odia Vowels and Consonants
1) Properties of Odia Script: The properties of the Odia script that are useful for building the character recognition system are:
• The Odia basic characters consist of vowels and consonants, which are shown in Fig. 1. As in other Indian scripts, the concept of upper and lower case is absent here.
• The first vowel is never printed after a consonant in a word and can occur only at the beginning of a word.
• A vowel (other than the first one) following a consonant takes a modified shape, as shown in Fig. 2. Depending on the vowel, this modified shape is placed to the left, right (or on both sides), top or bottom of the consonant. The modified shapes are called modifiers or allographs. The vowel allographs do not disturb the shape of the basic characters (in the middle zone) to which they are attached.
• If the shape in the middle zone is altered by combining two or more consonants, the resultant shape is termed a compound character. In some cases, a consonant preceding or following another consonant is represented by a modifier called a consonant modifier.
Figure 2. Modified vowels attached to the first consonant and some commonly occurring compound characters

B. Preprocessing
It is necessary to perform several document analysis operations prior to recognition of text in a scanned document. The common operations are:
1) Thresholding: The task of thresholding is the extraction of the foreground from the background [6]. The histogram of grayscale values of a document image typically consists of two peaks: one corresponding to the foreground and the other corresponding to the white background. Hence, the task of determining the threshold grayscale value is that of determining an ‘optimal’ value in the valley between the two peaks. Two categories of thresholding are:
• Global: picks one threshold value for the entire document image, which is often based on an estimation of the background level from the intensity histogram of the image.
[...] and salt and pepper noise. Smoothing operations are often used to eliminate the artifacts introduced during image capture.
Two main approaches to noise reduction are:
a) Filtering by masking.
b) Morphological operations, i.e., erosion and dilation.
3) Image Segmentation: Character segmentation is a two-stage segmentation process in which the subscripts of the word are removed first and then the individual characters are segmented. Image segmentation plays a crucial role in character recognition [7]. If one views an image as depicting a scene composed of different objects and regions, then segmentation is the decomposition of the image into these objects and regions by associating or ‘labeling’ each pixel with the object that it corresponds to.
There are two types of segmentation:
• Implicit Segmentation: The words are recognized as a whole, without segmenting them into letters. This is most effective and viable only when the set of possible words is small and known in advance, such as in the recognition of bank checks and postal addresses.
• Explicit Segmentation: In explicit approaches one tries to identify the smallest possible word segments (primitive segments), which may be smaller than letters but cannot be segmented further. Later in the recognition process these primitive segments are assembled into letters based on input from the character recognizer. The advantage of this strategy is that it is robust and quite straightforward, but it is not very flexible.
a) Line Segmentation: The handwritten text must first be divided into lines.
Figure 3. (a) Odia handwritten text (b) Binary filter image of handwritten text (c) First line (d) Second line and (e) Third line of handwritten text after line segmentation.
Figure 4. (a) First word and (b) Second word of first line
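The line segmentation step can be illustrated with a horizontal projection profile. This is only a minimal sketch under stated assumptions: the text here does not specify how lines are located, so the projection-profile technique, the function name segment_lines and the parameter min_height are illustrative choices, and the input is assumed to be an already binarized page in which ink pixels are non-zero.

```python
import numpy as np

def segment_lines(binary, min_height=1):
    """Return (top, bottom) row ranges of text lines in a binary image.

    Assumption: ink pixels are non-zero and lines are separated by at
    least one completely empty row.
    """
    # Horizontal projection profile: number of ink pixels in each row.
    profile = (np.asarray(binary) != 0).sum(axis=1)
    lines, start = [], None
    for row, count in enumerate(profile):
        if count > 0 and start is None:
            start = row                      # a text line begins here
        elif count == 0 and start is not None:
            if row - start >= min_height:
                lines.append((start, row))   # half-open range [start, row)
            start = None
    if start is not None and len(profile) - start >= min_height:
        lines.append((start, len(profile)))  # line touching the bottom edge
    return lines

# Toy page with three blocks of ink standing in for three text lines.
page = np.zeros((12, 8), dtype=np.uint8)
page[1:3, 2:6] = 1
page[5:8, 1:7] = 1
page[10:12, 3:5] = 1
print(segment_lines(page))  # [(1, 3), (5, 8), (10, 12)]
```

Applying the same routine to the transpose of one extracted line yields column ranges separated by empty columns, which is one plausible starting point for word segmentation; real handwriting with touching or skewed lines would need a more tolerant rule than a strictly empty row.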
Figure 6(b)
In Odia handwritten characters, the vertical line occupies the rightmost 20% of the total width. So in our work we have cropped only that portion for vertical-line detection. Figure 9 shows the flow chart for finding group-I and group-II characters. If every row of the cropped image contains at least one non-zero element, then we say that a ‘vertical line is present’. But most of the time the vertical line is not exactly straight, so in this case we have to calculate the number of connecting rows.
If the number of connecting rows is greater than 65% of the total number of rows of the handwritten character, then we also say that a ‘vertical line is present’. The pixel connectedness is checked from top to bottom of the cropped image.
Figure 8. (a) Odia handwritten ‘Kho’ character of size 100×100 (b) Cropped image of size 100×20 (c) Cropped image of only non-zero elements.
A. Recognition of Characters
Every Odia character that belongs to either group-I or group-II has at least one unique shape in some portion. We have to extract that portion and create a sub-image template database separately for group-I and group-II characters by using many handwritten samples of a particular character.

IV. EXPERIMENTS AND RESULTS
When we take a test image of a character, first we have to find whether it is a group-I or group-II character, as shown in Fig. 7. Then we extract the unique shape as a sub-image and match it with either the group-I or the group-II sub-image template database.
Experiments were performed on different handwritten Odia characters. Instead of describing all of them in detail, we describe here only one character whose recognition is a difficult task.
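The group test that recognition starts from, the vertical-line check described in Section III, can be sketched as follows. This is a hedged illustration and not the exact procedure of the flow chart: the function name has_vertical_line and its parameters are hypothetical, the 20% and 65% values are taken from the text, and the ‘number of connecting rows’ is assumed here to mean the longest run of consecutive rows, scanned from top to bottom, that contain ink in the cropped strip.

```python
import numpy as np

def has_vertical_line(char_img, width_frac=0.20, row_frac=0.65):
    """Decide whether a binary character image has a right-hand vertical line.

    Assumption: ink pixels are non-zero and the image is already cropped
    to the character (e.g. 100x100 as in the 'Kho' example).
    """
    img = np.asarray(char_img) != 0
    rows, cols = img.shape
    # Crop the rightmost 20% of the width (100x20 for a 100x100 image).
    strip_w = max(1, int(round(cols * width_frac)))
    strip = img[:, cols - strip_w:]
    row_has_ink = strip.any(axis=1)
    # Case 1: every row of the cropped image has a non-zero element.
    if row_has_ink.all():
        return True
    # Case 2: the line is not perfectly straight or complete, so count
    # the longest run of consecutive inked rows from top to bottom.
    longest = run = 0
    for filled in row_has_ink:
        run = run + 1 if filled else 0
        longest = max(longest, run)
    return longest > row_frac * rows

# Toy 100x100 images: a stroke of 80 rows, a stroke of 50 rows, no stroke.
long_line = np.zeros((100, 100), dtype=np.uint8)
long_line[10:90, 90] = 1
short_line = np.zeros((100, 100), dtype=np.uint8)
short_line[20:70, 90] = 1
no_line = np.zeros((100, 100), dtype=np.uint8)
no_line[30:60, 10:50] = 1
print(has_vertical_line(long_line), has_vertical_line(short_line), has_vertical_line(no_line))
```

The function only reports whether a vertical line is present; which of the two outcomes corresponds to group-I and which to group-II is left to the caller, since that mapping is not stated in this part of the text.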
Recognition of Odia ‘Kho’ character: Recognition of the handwritten Odia ‘Kho’ character is a difficult task, because it is very similar to the Odia ‘Gaa’ character, as shown in Fig. 9. The marked portion shown in Fig. 9(a) and 9(b) is the only difference between the Odia ‘Kho’ and ‘Gaa’ characters.
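Matching an extracted sub-image against sub-image template databases, with the correlation coefficient as the similarity measure and the maximum coefficient deciding the identity, can be sketched as below. The sketch is illustrative only: the function names correlation and classify are hypothetical, the 4×4 arrays are toy patterns rather than real ‘Kho’ or ‘Gaa’ shapes, and all templates are assumed to have the same size as the input sub-image.

```python
import numpy as np

def correlation(a, b):
    """Pearson correlation coefficient between two equally sized images."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    # A constant image has zero variance; treat it as uncorrelated.
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def classify(sub_image, databases):
    """Return the label whose best template correlates most with sub_image."""
    scores = {label: max(correlation(sub_image, t) for t in templates)
              for label, templates in databases.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Toy stand-ins for the two sub-image template databases (hypothetical data).
kho_template = np.array([[1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0]])
gaa_template = np.array([[0, 0, 1, 1], [0, 0, 0, 1], [0, 0, 1, 1], [0, 0, 0, 0]])
databases = {"Kho": [kho_template], "Gaa": [gaa_template]}

# Input sub-image: the first pattern with one extra ink pixel.
test_sub = kho_template.copy()
test_sub[3, 0] = 1
label, scores = classify(test_sub, databases)
print(label, scores)
```

In practice each database would hold many handwritten samples per character, and the sub-images would have to be normalized to a common size before the coefficients are comparable.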
Figure 10(b). Database of the sub-image of ‘Gaa’ character
Figure 11(a) and (b)
Figure 11(a) shows the input handwritten Odia ‘Kho’ character. Before matching with the template databases, we have to extract the unique portion of the input handwritten character, as shown in Fig. 11(b), that distinguishes the Odia ‘Kho’ and ‘Gaa’ characters. Then matching is performed with the databases. Here we have calculated the correlation coefficient as the similarity measure.
Figure 12(a) shows the correlation coefficients of the input sub-image with all the templates of the ‘Kho’ character database, where we have observed the maximum value of correlation (>0.5), as compared to Fig. 12(b), which corresponds to the ‘Gaa’ character.
Figure 12(b)

REFERENCES
[1] U. K. Roy, T. Pal and F. Kimura, “Oriya handwritten numeral recognition systems,” Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata-108, India.
[2] V. S. C. H. Swethalakshmi, Anitha Jayaraman and C. C. Sekhar, “Online handwritten character recognition of Devanagari and Telugu characters using support vector machines,” Department of Computer Science and Engineering, Department of Biotechnology, Indian Institute of Technology Madras, Chennai - 600 036, India.
[3] T. W. U. Pal and F. Kimura, “A system for off-line Oriya handwritten character recognition using curvature feature,” Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata-108, India.
[4] L. Song and Y. Lin, “Study on the vision reading algorithm based on template matching and neural network,” in Proceedings of International Joint Conference on Neural Networks, Orlando, Florida, USA, August 2007, pp. 12-17.
[5] R. S. P. Jayashree R. Prasad, Dr. U. V. Kulkarni, “Template matching algorithm for Gujrati character recognition,” Second International Conference on Emerging Trends in Engineering and Technology, ICETET-09.
[6] R. C. Gonzalez and R. E. Woods, Digital Image Processing, 3rd ed. Pearson.
[7] B. D. Mohammad Isbat Sakib Chowdhury and M. S. Rahman, “Segmentation of printed Bangla characters using structural properties of Bangla script,” in 5th International Conference on Electrical and Computer Engineering, ICECE, 20-22 December 2008.