
Presented in the National Conference on Recent Advances in Microwave Tubes, Devices and Communication Systems, Jagannath Gupta Institute of Engineering and Technology, Jaipur, March 4-5, 2011

HANDWRITTEN ODIA CHARACTER RECOGNITION
Debasish Basa¹, Sukadev Meher²
Department of Electronics and Communication Engineering
National Institute of Technology, Rourkela, India
¹debasishbasa@gmail.com
²sukadevmeher@gmail.com

Abstract—Odia is one of the oldest and most popular languages of India, spoken by more than 44 million people, especially in Odisha, India. Some characters in Odia are made up of more than one connected symbol. Compound characters are written by associating modifiers with consonants, resulting in a huge number of possible combinations, running into hundreds of thousands. Therefore, systems developed for the recognition of other scripts, such as Roman, cannot be used directly for the Odia language. In the present work, we propose a robust structural solution for Odia character recognition in which a given text is segmented into lines, each line into individual words, and each word into individual characters or basic symbols. Basic symbols are identified as the fundamental units of segmentation used for recognition. By using the unique structure of some characters, we have obtained better results in comparison to other methods.

I. INTRODUCTION
During the past thirty years, substantial research effort has been devoted to character recognition, which translates human-readable characters into machine-readable codes. This effort has been spent because character recognition provides a solution for processing large volumes of data automatically in a large variety of scientific and business applications. Handwriting is converted to digital form either by scanning the written paper or by writing with a special pen on an electronic surface such as a digitizer combined with a liquid crystal display [1]. The two approaches are distinguished as offline and online handwriting, respectively. In the online case, the two-dimensional coordinates of successive points of the handwriting are stored in order as a function of time [2]. For offline handwriting, only the completed writing is available as an image [3]. Offline systems are therefore less accurate than online systems.

Template matching, or matrix matching, is one of the most common classification methods [4],[5]. In template matching, individual image pixels are used as features. Classification is performed by comparing an input character image with a set of templates (or prototypes) from each character class. Each comparison results in a similarity measure between the input character and the template. One measure increases the amount of similarity when a pixel in the observed character is identical to the same pixel in the template image. If the pixels differ, the measure of similarity may be decreased. After all templates have been compared with the observed character image, the character's identity is assigned as the identity of the most similar template, that is, the template for which the correlation coefficient is maximum.

Template matching can be used for a small set of postures, requires only a small amount of calibration and no advance learning of patterns, and is quite accurate. But this technique does have a limitation: the number of postures that can be recognized is small. If the application requires a large posture set, template matching will not work well.

The main challenge in handwritten character recognition is the development of a method that can generate the description of the handwritten objects in a short period of time. In this study we propose a simple yet robust structural solution for recognizing the characters of Odia, the official language of Odisha.

The rest of the paper is organized as follows. Section II describes character modelling. In Section III, character recognition using the sub-structure based method is explained. The experiments and results are discussed in Section IV. Finally, the conclusion of the paper is given in Section V.

II. CHARACTER MODELLING

A. Odia literature
The number of characters in Odia is large. Two or more characters may combine to form a compound character; as a result, the total number of characters to be recognized is more than 200.

Figure 1. Set of Odia Vowels and Consonants

1) Properties of Odia Script: The properties of the Odia script that are useful for building the character recognition system are:
• The Odia basic characters consist of vowels and consonants, which are shown in Fig. 1. As in other Indian scripts, the concept of upper and lower case is absent.
• The first vowel is never printed after a consonant in a word and can occur only at the beginning of a word.
• A vowel (other than the first one) following a consonant takes a modified shape, as shown in Fig. 2. Depending on the vowel, this modified shape is placed to the left, right (or on both sides), top or bottom of the consonant. The modified shapes are called modifiers or allographs. The vowel allographs do not disturb the shape of the basic characters (in the middle zone) to which they are attached.
• If the shape in the middle zone is altered by combining two or more consonants, the resultant shape is termed a compound character. In some cases, a consonant preceding or following another consonant is represented by a modifier called a consonant modifier.

Figure 2. Modified vowels attached to the first consonant and some commonly occurring compound characters

B. Preprocessing
It is necessary to perform several document analysis operations prior to the recognition of text in a scanned document. The common operations are:
1) Thresholding: The task of thresholding is the extraction of the foreground from the background [6]. The histogram of grayscale values of a document image typically consists of two peaks: one corresponding to the foreground and the other corresponding to the white background. Hence, the task of determining the threshold grayscale value is that of finding an 'optimal' value in the valley between the two peaks. Two categories of thresholding are:
• Global: picks one threshold value for the entire document image, often based on an estimate of the background level from the intensity histogram of the image.
• Local (adaptive): uses a different value for each pixel according to the local area information.
2) Noise Reduction: A digital image can contain noise introduced by the scanning device and/or the transmission medium. In order to achieve an accurate result, all non-word data must be removed. There are three common types of noise in handwriting: background noise, shadow noise, and salt-and-pepper noise. Smoothing operations are often used to eliminate the artifacts introduced during image capture. Two main approaches to noise reduction are:
a) Filtering by masking.
b) Morphological operations, i.e. erosion and dilation.
3) Image Segmentation: Character segmentation is a two-stage process in which the subscripts of the word are removed first and then the individual characters are segmented. Image segmentation plays a crucial role in character recognition [7]. If one views an image as depicting a scene composed of different objects and regions, then segmentation is the decomposition of the image into these objects and regions by associating, or 'labeling', each pixel with the object to which it corresponds.
There are two types of segmentation:
• Implicit Segmentation: The words are recognized as a whole, without segmenting them into letters. This is effective and viable only when the set of possible words is small and known in advance, as in the recognition of bank checks and postal addresses.
• Explicit Segmentation: In explicit approaches one tries to identify the smallest possible word segments (primitive segments), which may be smaller than letters but cannot be segmented further. Later in the recognition process these primitive segments are assembled into letters based on input from the character recognizer. The advantage of this strategy is that it is robust and quite straightforward, but it is not very flexible.
a) Line Segmentation: The handwritten text must first be divided into lines.

Figure 3. (a) Odia handwritten text, (b) binary filtered image of the handwritten text, (c) first line, (d) second line and (e) third line of the handwritten text after line segmentation.

b) Word Segmentation: For Odia script, the spacing between words is greater than the spacing between the characters within a word. This spacing is used for word segmentation. It is found by taking the Vertical Connecting Pixel (VCP) profile of an input text line. VCP is the sum of ON pixels along every column of the image. In the VCP, the zero-valued valleys between the words of a line are wider than the zero-valued valleys between the characters of a word. This information is used to separate words from the input text lines, as shown in Fig. 4.
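The VCP-based splitting described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: it assumes the text line is already a binarized NumPy array with non-zero ON pixels, and the function names and the `min_gap` parameter (the valley width, in columns, above which a gap is treated as a word boundary) are hypothetical, since the paper does not state how that width is chosen.

```python
import numpy as np

def vcp_profile(line_img):
    """VCP profile: number of ON (non-zero) pixels in each column."""
    return (np.asarray(line_img) > 0).sum(axis=0)

def zero_valleys(profile):
    """Return (start, end) column ranges, end exclusive, where the profile is zero."""
    valleys, start = [], None
    for col, value in enumerate(profile):
        if value == 0 and start is None:
            start = col                      # a valley begins
        elif value != 0 and start is not None:
            valleys.append((start, col))     # the valley ends before this column
            start = None
    if start is not None:                    # valley running to the right edge
        valleys.append((start, len(profile)))
    return valleys

def split_on_valleys(line_img, min_gap):
    """Cut a binary text-line image at zero valleys at least min_gap columns wide.

    min_gap is an assumed tuning parameter, not a value given in the paper.
    """
    line_img = np.asarray(line_img)
    profile = vcp_profile(line_img)
    pieces, left = [], 0
    for start, end in zero_valleys(profile):
        if end - start >= min_gap:
            if start > left:
                pieces.append(line_img[:, left:start])
            left = end
    if left < len(profile):
        pieces.append(line_img[:, left:])
    return pieces
```

A call such as `split_on_valleys(line, min_gap=3)` cuts only at the wide valleys, while `min_gap=1` cuts at every zero valley. Valleys narrower than `min_gap` stay inside the returned pieces, so a piece may carry a few blank columns at its edges.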

Figure 4. (a) First word and (b) second word of the first line

c) Character Segmentation: Odia is a non-cursive script, so the spacing between the characters in a word is used for character segmentation, as shown in Fig. 5. As in word segmentation, VCP is used to locate this spacing.

Figure 5. (a) First character, (b) second character and (c) third character of the first word.

III. CHARACTER RECOGNITION USING SUB-STRUCTURE BASED METHOD
In Odia, some characters contain a vertical line at the rightmost part. According to this property, all the characters are divided into two groups.
• Group I: A vertical line is not present at the rightmost side, as shown in Fig. 6(a).
• Group II: A vertical line is present at the rightmost side, as shown in Fig. 6(b).

Figure 6(a)
Figure 6(b)

In Odia handwritten characters, the vertical line takes up 20% of the total width at the rightmost part. So in our work we have cropped only that portion for detection of the vertical line. Figure 7 shows the flow chart for finding group-I and group-II characters. If every row of the cropped image contains at least one non-zero element, then 'vertical line is present'. But most of the time the vertical line is not exactly straight. For this case we have to calculate the number of connecting rows. If the number of connecting rows is greater than 65% of the total number of rows of the handwritten character, then it is also called 'vertical line is present'. The pixel connectedness is checked from top to bottom of the cropped image.

Figure 7. Flow chart for finding group-I and group-II characters.

Sometimes the connectedness is not greater than 65%. In these cases we rotate the image by 180 degrees and calculate the connectedness again. If the connectedness is still not greater than 65%, then 'vertical line is not present', as shown in Fig. 8.

Figure 8. (a) Odia handwritten 'Kho' character of size 100×100, (b) cropped image of size 100×20, (c) cropped image of only non-zero elements.

A. Recognition of Characters
Every Odia character, whether it belongs to group-I or group-II, has at least one unique shape in some portion. We extract that portion and create a sub-image template database separately for group-I and group-II characters, using many handwritten samples of each character.

IV. EXPERIMENTS AND RESULTS
When we take a test image of a character, we first have to find whether it belongs to group-I or group-II, as shown in Fig. 7. Then we extract the unique shape as a sub-image and match it with either the group-I or the group-II sub-image template database.
Experiments are performed on different handwritten Odia characters. Instead of describing all of them in detail, we describe here only one character whose recognition is a difficult task.
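The group test that is applied first to every test image (the rightmost 20% strip and the 65% connectedness rule of Section III) might be sketched as below. This is one possible reading of the description and not the authors' code. It assumes a binarized character image, cropped to the character, as a NumPy array; it takes 'connecting rows' to mean the first unbroken top-to-bottom run of rows that contain an ON pixel; and it applies the 180 degree rotation to the cropped strip. The function names and default parameters are hypothetical.

```python
import numpy as np

def connected_fraction(strip):
    """Length of the first unbroken run of inked rows, scanning top to bottom,
    as a fraction of the total number of rows (an assumed reading of
    'number of connecting rows')."""
    inked = np.asarray(strip).astype(bool).any(axis=1)
    if not inked.any():
        return 0.0
    start = int(np.argmax(inked))        # index of the first inked row
    run = 0
    for row_has_ink in inked[start:]:
        if not row_has_ink:
            break
        run += 1
    return run / inked.shape[0]

def has_right_vertical_line(char_img, width_frac=0.20, min_frac=0.65):
    """Return True for group-II (vertical line present), False for group-I."""
    char_img = np.asarray(char_img)
    height, width = char_img.shape
    strip_width = max(1, int(round(width * width_frac)))
    strip = char_img[:, width - strip_width:] > 0   # rightmost 20% of the width
    if connected_fraction(strip) > min_frac:
        return True
    # Second attempt on the strip rotated by 180 degrees, so that the
    # scan effectively starts from the other end of the stroke.
    return connected_fraction(np.rot90(strip, 2)) > min_frac
```

The case in which every row of the strip contains a non-zero element needs no separate branch here, because it gives a fraction of 1.0, which already exceeds the 65% threshold.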
Recognition of Odia 'Kho' character: Recognition of the handwritten Odia 'Kho' character is a difficult task because it is almost identical to the Odia 'Gaa' character, as shown in Fig. 9. The portion marked in Fig. 9(a) and 9(b) is the only difference between the Odia 'Kho' and 'Gaa' characters.

Figure 9(a) and (b)
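The template matching outlined in the Introduction, in which the identity is taken from the template with the maximum correlation coefficient, can be illustrated for a two-class decision such as 'Kho' versus 'Gaa' as follows. This is a sketch under stated assumptions rather than the authors' implementation: the sub-image and all templates are assumed to be NumPy arrays already resized to the same shape, the Pearson correlation is computed over all pixels, and the function names and the dictionary layout of the databases are hypothetical.

```python
import numpy as np

def correlation_coefficient(a, b):
    """Pearson correlation between two equally sized images, pixel by pixel."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    # A constant image has zero variance; treat it as uncorrelated.
    return 0.0 if denom == 0 else float((a * b).sum() / denom)

def best_match(sub_image, databases):
    """Compare sub_image with every template of every database and return
    (label, score) for the database holding the most similar template."""
    best_label, best_score = None, -np.inf
    for label, templates in databases.items():
        for template in templates:
            score = correlation_coefficient(sub_image, template)
            if score > best_score:
                best_label, best_score = label, score
    return best_label, best_score
```

A call would look like `best_match(sub_image, {"kho": kho_templates, "gaa": gaa_templates})`, where each value is a list of template arrays. Keeping the per-template scores instead of only the maximum would give the kind of data that is plotted per database.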


To distinguish between the Odia 'Kho' and 'Gaa' characters, two databases of the sub-image marked in Fig. 9 are created. Figure 10 shows the two template databases, for which 35 samples of different handwritten characters are taken.

Figure 10(a). Database of the sub-image of 'Kho' character
Figure 10(b). Database of the sub-image of 'Gaa' character
Figure 11(a) and (b)

Figure 11(a) shows the input handwritten Odia 'Kho' character. Before matching with the template databases, we have to extract the unique portion of the input handwritten character, shown in Fig. 11(b), that distinguishes the Odia 'Kho' and 'Gaa' characters. Then matching is performed with the databases. Here we have calculated the correlation coefficient as the similarity measure.
Figure 12(a) shows the correlation coefficients of the input sub-image with all the templates of the 'Kho' character database, where we have observed a higher maximum value of correlation (>0.5) than in Fig. 12(b), which corresponds to the 'Gaa' character database.

Figure 12(a)
Figure 12(b)

From Fig. 12, we conclude that the recognition rate of the 'Kho' character is higher than that of the 'Gaa' character, and hence the test character is a 'Kho' character.

V. CONCLUSION
The recognition rate is highly affected by the similarity of various characters. Many characters are similar to one another, which in turn degrades the recognition rate. We have treated individual image pixels as features, where each comparison results in a similarity measure between the input character and the database. The comparison is performed on a pixel-by-pixel basis.

REFERENCES
[1] U. K. Roy, T. Pal and F. Kimura, "Oriya handwritten numeral recognition systems," Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata-108, India.
[2] V. S. C. H. Swethalakshmi, Anitha Jayaraman and C. C. Sekhar, "Online handwritten character recognition of Devanagari and Telugu characters using support vector machines," Department of Computer Science and Engineering, Department of Biotechnology, Indian Institute of Technology Madras, Chennai - 600 036, India.
[3] T. W. U. Pal and F. Kimura, "A system for off-line Oriya handwritten character recognition using curvature feature," Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata-108, India.
[4] L. Song and Y. Lin, "Study on the vision reading algorithm based on template matching and neural network," in Proceedings of International Joint Conference on Neural Networks, Orlando, Florida, USA, August 2007, pp. 12–17.
[5] R. S. P. Jayashree R. Prasad, Dr. U. V. Kulkarni, "Template matching algorithm for Gujrati character recognition," Second International Conference on Emerging in Engineering and Technology, ICETET-09.
[6] R. C. Gonzalez and R. E. Woods, Digital Image Processing, 3rd ed. Pearson.
[7] B. D. Mohammad Isbat Sakib Chowdhury and M. S. Rahman, "Segmentation of printed Bangla characters using structural properties of Bangla script," in 5th International Conference on Electrical and Computer Engineering, ICECE, 20-22 December 2008.
