Handwritten Tamil Character Recognition Using SVM
Prof. Dr. J. Venkatesh, C. Sureshkumar
(IJCNS) International Journal of Computer and Network Security, Vol. 1, No. 3, December 2009

Abstract: Handwritten Tamil character recognition refers to the process of converting
handwritten Tamil characters into Unicode Tamil characters. The scanned image is segmented
into paragraphs using a spatial space detection technique, paragraphs into lines using a
vertical histogram, lines into words using a horizontal histogram, and words into character
image glyphs using a horizontal histogram. Each image glyph is subjected to a feature
extraction procedure, which extracts features such as character height, character width,
number of horizontal lines (long and short), number of vertical lines (long and short),
horizontally oriented curves, vertically oriented curves, number of circles, number of slope
lines, image centroid and special dots. The extracted features considered for recognition are
given to a Support Vector Machine (SVM), where the characters are classified using a
supervised learning algorithm. These classes are mapped onto Unicode for recognition. Then
the text is reconstructed using Unicode fonts. This character recognition finds applications
in document analysis, where a handwritten document can be converted to an editable printed
document. This approach can be extended to the recognition and reproduction of handwritten
documents in South Indian languages.

Keywords: Character recognition, Unicode, Support Vector Machines (SVM).

1. Introduction

Tamil is an ancient language with a rich literary tradition, and ancient India was renowned
in several fields such as medicine, astronomy and business. Ancient people recorded their
knowledge in various fields on palm leaves. The handwritten text on palm leaves decayed over
a period of time, and it is very difficult to preserve it in the same form. This paper
proposes a new approach for converting handwritten Tamil script into Unicode. The style of
writing and the font were different compared to present day scripts. Many software tools are
available, but only to read present day printed Tamil text with better recognition and
accuracy.

1.1 Tamil Language

Tamil is a South Indian language spoken widely in Tamilnadu in India. Handwritten character
recognition is a difficult problem due to the great variations in writing styles and the
different sizes and orientation angles of the characters. Among the different branches of
handwritten character recognition, it is easier to recognize English alphabets and numerals
than Tamil characters. Tamil has the longest unbroken literary tradition amongst the
Dravidian languages. The Tamil script is derived from the Brahmi script. The earliest
available text is the Tolkaappiyam, a work describing the language of the classical period.
There are several other famous works in Tamil like Kambar Ramayana and Silapathigaram but few
supports in Tamil which speak about the greatness of the language. For example, Thirukural
is translated into other languages due to its richness in content. It is a collection of
two-sentence poems efficiently conveying things in a hidden language called Slaydai in Tamil.
Tamil has 12 vowels and 18 consonants. These are combined with each other to yield 216
composite characters and 1 special character (aayutha ezhuthu), giving a total of
(12+18+216+1) 247 characters.

1.2 Vowels

Tamil vowels are called uyireluttu (uyir – life, eluttu – letter). The vowels are classified
into short (kuril) and long (five of each type) and two diphthongs, /ai/ and /au/, and three
"shortened" (kuril) vowels. The long (nedil) vowels are about twice as long as the short
vowels. The diphthongs are usually pronounced about 1.5 times as long as the short vowels,
though most grammatical texts place them with the long vowels.

1.3 Consonants

Tamil consonants are known as meyyeluttu (mey - body, eluttu - letters). The consonants are
classified into three categories with six in each category: vallinam - hard, mellinam - soft
or nasal, and itayinam - medium. Unlike most Indian languages, Tamil does not distinguish
aspirated and unaspirated consonants. In addition, the voicing of plosives is governed by
strict rules in centamiḻ. Plosives are unvoiced if they occur word-initially or doubled.
Elsewhere they are voiced, with a few becoming fricatives intervocalically. Nasals and
approximants are always voiced. As is commonplace in the languages of India, Tamil is
characterised by its use of more than one type of coronal consonant. Retroflex consonants
include the retroflex approximant /ɻ/ (ழ), which among the Dravidian languages is also found
in Malayalam (example Kozhikode), disappeared from Kannada in pronunciation at around 1000 AD
(the
dedicated letter is still found in Unicode), and was never present in Telugu. Dental and
alveolar consonants also contrast with each other, a typically Dravidian trait not found in
the neighboring Indo-Aryan languages.

1.4 Tamil Unicode

The Unicode Standard is the universal character encoding scheme for written characters and
text. It defines a uniform way of encoding multilingual text that enables the exchange of
text data internationally and creates the foundation of global software. The Tamil Unicode
range is U+0B80 to U+0BFF [3]. Each of these Unicode code points can be represented in 2
bytes. For example, the Unicode for the character அ is 0B85; the Unicode for the character
மீ is 0BAE+0BC0. Unicode code points are likewise defined for the various other Tamil
characters.

size of the input image is as specified by the user and can be of any length but is
inherently restricted by the scope of the vision and by the scanner software length.

2.3 Preprocessing

This is the first step in the processing of the scanned image. The scanned image is
preprocessed for noise removal. The resultant image is checked for skew, since the image may
be skewed with either a left or a right orientation. Here the image is first brightened and
binarized. The skew detection function checks for an angle of orientation within ±15 degrees
and, if skew is detected, a simple image rotation is carried out until the lines match the
true horizontal axis, which produces a skew-corrected image.
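
The binarization and skew check described under Preprocessing can be illustrated with a small sketch. The paper does not state how its skew detection function estimates the angle, so the projection-profile search used here is an assumption, as are the function names (binarize, profile_score, estimate_correction), the fixed threshold of 128, the 0.5 degree step, and the representation of the image as a list of rows of grey values.

```python
import math

def binarize(gray, threshold=128):
    """Map grey values (0-255) to 1 for ink (dark) and 0 for background."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

def profile_score(ink, angle_deg):
    """Sum of squared row counts after rotating the ink pixels by angle_deg.

    The score is largest when the ink collapses onto few rows, that is,
    when the text lines are horizontal.
    """
    t = math.radians(angle_deg)
    sin_t, cos_t = math.sin(t), math.cos(t)
    rows = {}
    for x, y in ink:
        r = round(x * sin_t + y * cos_t)  # row index of the pixel after rotation
        rows[r] = rows.get(r, 0) + 1
    return sum(c * c for c in rows.values())

def estimate_correction(binary, max_angle=15.0, step=0.5):
    """Return the candidate angle in [-max_angle, +max_angle] degrees whose
    rotation (as applied in profile_score) best levels the text lines."""
    ink = [(x, y) for y, row in enumerate(binary)
           for x, v in enumerate(row) if v]
    n = int(round(max_angle / step))
    candidates = [i * step for i in range(-n, n + 1)]
    return max(candidates, key=lambda a: profile_score(ink, a))

# Usage: a synthetic 200x80 page with one stroke slanted by 5 degrees.
page = [[255] * 200 for _ in range(80)]
for x in range(200):
    page[round(50 + x * math.tan(math.radians(5)))][x] = 0
correction = estimate_correction(binarize(page))
print(correction)  # angle that would level the stroke
```

Restricting the candidates to ±15 degrees mirrors the range stated in the text; a real implementation would then rotate the image by the returned angle with an image library.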
5. Character recognition
Authors Profile
Prof. Dr. J. Venkatesh received an MBA
