OCR Sanskrit CNN
Abstract—Ancient Sanskrit manuscripts are a rich source of knowledge about science, mathematics, Hindu mythology, Indian civilization, and culture. It therefore becomes critical that access to these manuscripts is made easy, to share this knowledge with the world and to facilitate further research on this ancient literature. In this paper, we propose a Convolutional Neural Network (CNN) based Optical Character Recognition (OCR) system which accurately digitizes ancient Sanskrit manuscripts (Devanagari script) that are not necessarily in good condition. We use an image segmentation algorithm based on pixel intensities to identify letters in the image. The OCR treats typical compound characters (half letter combinations) as separate classes in order to improve the segmentation accuracy. The novelty of the OCR is its robustness to image quality, image contrast, font style and font size, which makes it an ideal choice for digitizing soiled and poorly maintained Sanskrit manuscripts.

Index Terms—Devanagari script, Sanskrit, Hindi, deep learning, OCR, digitization, optical character recognition, CNN

I. INTRODUCTION

Sanskrit is gaining importance in various academic communities due to the presence of ancient scientific and mathematical research work written in this language. Scientists all over the world are spending an increasing amount of time trying to understand these ancient research manuscripts. However, the lack of accurately digitized and tagged versions of Sanskrit manuscripts is a major bottleneck. In addition, poor maintenance and text quality add to the problem. Hence, it becomes essential to digitize such ancient manuscripts, which are not only important for research but are also an important part of the culture and heritage of India. In order to facilitate digitization of ancient Sanskrit material, we build an Indic Optical Character Recognition (OCR) system, specifically for Sanskrit.

In recent years, several OCRs have been developed for various Indian languages such as Hindi, Bangla, Telugu etc. [10,11,12,13]. However, very little work has been done to develop good OCRs for Sanskrit. Even though both Hindi and Sanskrit are written in the Devanagari script, it is important to use a Sanskrit OCR instead of a Hindi OCR to digitize Sanskrit text due to the significant difference in complexity between the two languages. Sanskrit text consists of several compound characters which are formed by different combinations of half letter and full letter consonants. Some examples of compound characters are shown in Fig 3 and Fig 4. Since such compound characters are either less frequent or completely absent in Hindi text, Hindi OCRs would not be trained to segment and classify such characters correctly. Consequently, Hindi OCRs display poor results on Sanskrit text.

Most of the recent Indic OCR systems make use of machine learning algorithms such as support vector machines (SVMs) [12] and artificial neural networks (ANNs) [11,16] to classify letters in the image. The classifier models used in these OCRs are trained with input images that are often downsampled by applying PCA [15,16], Gabor filters [15,27], geometric feature graphs [27] etc., in order to reduce the complexity of the data. However, this results in a loss of important information necessary to make the classifier robust. For example, the SVM classifier [12] displays different classification accuracy for different font styles, showing that it does not generalize across font styles. In addition, existing Indic OCRs display poor results on degraded or poorly maintained documents, and their digitizing capability is limited to good quality text documents [27].

In order to develop a robust OCR system which can digitize soiled and noisy documents with high accuracy, we propose the use of convolutional neural networks (convnets) as opposed to the popular use of SVMs and ANNs, as convnets possess very high learning capacity and the capability to handle high dimensional data such as images [2]. Convnets have displayed these characteristics consistently in various large-scale image classification and video recognition tasks. Popular convnet architectures such as GoogLeNet [5], ResNet [24] and VGG Net [6] have achieved state of the art results in image classification challenges like the ILSVRC (ImageNet) challenge. In addition, researchers make use of convnets for various other tasks such as human pose estimation, dense semantic segmentation [26] etc.

The main contributions of the paper are 1) developing an OCR framework for Sanskrit which can digitize soiled and poorly maintained documents, 2) the use of CNNs as classifiers for Sanskrit OCRs, and 3) a Sanskrit letter dataset consisting of 11,230 images belonging to 602 classes.¹

The rest of the paper is organized as follows. We first review the related work in Section 2 and describe the features of the Devanagari script in Section 3. In Section 4, we discuss the approach used to segment letters in the image and the procedure

¹ Link for dataset: https://github.com/avadesh02/Sanskrit-letter-dataset
Figure 3: Examples of compound characters that are considered as unique classes.
Table I: Convnet architectures trained on the data. A convolution layer is depicted as Conv<filter size>-<number of filters>; dropout is represented as dp <probability>. Each network takes the input letter image and ends with a softmax layer over the letter classes.

A (6 weight layers): Conv3-32, Conv3-32, Maxpool, Conv3-64, Conv3-64, Maxpool, Fc-2048, Fc-1024 (dp 0.2), Softmax
B (8 weight layers): Conv3-32, Conv3-32, Conv3-32, Maxpool, Conv3-64, Conv3-64, Conv3-64, Maxpool, Fc-2048, Fc-1024 (dp 0.2), Softmax
C (8 weight layers): Conv3-64, Conv3-64, Conv3-64, Maxpool, Conv3-64, Conv3-64, Conv3-64, Maxpool, Fc-4096, Fc-2048 (dp 0.2), Softmax
D (8 weight layers): Conv3-64, Conv3-64, Conv3-64, Maxpool, Conv3-64, Conv3-64, Conv3-64, Maxpool, Fc-4096 (dp 0.2), Fc-2048 (dp 0.2), Softmax

Figure 5: Pictorial representation of the proposed convnet architecture for the Sanskrit OCR.
Each convolution layer is followed by a ReLU function [2,4]. The ReLU functions do not saturate while training the convnet and thus help avoid the vanishing gradient problem [2,24]. A max-pooling operation is used after every 2 or 3 convolution operations, depending on the architecture design, to reduce the computational intensity of the architecture. A kernel/filter size of 2x2 with a stride of 2 pixels is used in each max-pooling operation. The final convolution layer is followed by 2 fully connected layers. Subsequently, the convnet terminates with a softmax layer. The number of channels in each fully connected layer and the number of filters in each convolution operation varies for different convnets.
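For concreteness, the proposed architecture (convnet D in Table I, Fig 5) can be written down as a short Keras model. This is a minimal sketch, not the released implementation: the 32x32 grayscale input size and the "same" padding are illustrative assumptions (they are not fixed by the text), while the layer sequence, the 2x2/stride-2 max pooling, the dropout of 0.2 and the 602-class softmax follow Table I and the description above.

```python
# Minimal Keras sketch of convnet D from Table I.
# Assumptions (not fixed by the text): 32x32 grayscale letter crops
# and "same" padding; 602 classes matches the Sanskrit letter dataset.
from tensorflow.keras import layers, models

def build_convnet_d(input_shape=(32, 32, 1), num_classes=602):
    model = models.Sequential([
        # Block 1: three 3x3 convolutions with 64 filters, ReLU after each
        layers.Conv2D(64, 3, padding="same", activation="relu",
                      input_shape=input_shape),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=2, strides=2),
        # Block 2: three 3x3 convolutions with 64 filters
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=2, strides=2),
        # Two fully connected layers, each followed by dropout (dp 0.2)
        layers.Flatten(),
        layers.Dense(4096, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(2048, activation="relu"),
        layers.Dropout(0.2),
        # Softmax over the Sanskrit letter classes
        layers.Dense(num_classes, activation="softmax"),
    ])
    return model
```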
C. Training

The convnets are trained by optimizing a cross-entropy loss function using mini-batch gradient descent and backpropagation [3]. The batch size is set to 32, with a Nesterov momentum of 0.9. The learning rate is set to a constant value of 0.001, i.e. this value is not altered during training. A constant dropout of 0.2 [1] is used to prevent over-fitting. The weights are randomly initialized with a standard deviation of 0.1. Each convnet is trained for a variable number of epochs (between 120 and 140). While training, the validation and train accuracies were closely observed, and training is terminated when the validation accuracy starts to drop while the training accuracy continues to improve. In other words, each convnet is trained until just before it starts to overfit on the data.
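This setup maps directly onto the Keras/TensorFlow stack used for the implementation (Section V-E). The sketch below is an approximation under stated assumptions: build_convnet_d is the helper from the previous listing, x_train/y_train/x_val/y_val are hypothetical in-memory arrays with one-hot labels, the patience value stands in for the manual "stop when validation accuracy drops" rule, and the std-0.1 random weight initialization would be supplied to each layer as kernel_initializer=RandomNormal(stddev=0.1).

```python
# Sketch of the training procedure: SGD with Nesterov momentum 0.9,
# constant learning rate 0.001, batch size 32, cross-entropy loss,
# and early stopping approximating the manual stopping rule.
from tensorflow.keras import callbacks, optimizers

def train_convnet(x_train, y_train, x_val, y_val):
    # build_convnet_d is the helper sketched in the previous listing
    model = build_convnet_d()
    sgd = optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)
    model.compile(optimizer=sgd,
                  loss="categorical_crossentropy",  # assumes one-hot labels
                  metrics=["accuracy"])
    # patience=5 is an assumed stand-in for "stop once validation
    # accuracy starts to drop while training accuracy keeps improving"
    early_stop = callbacks.EarlyStopping(monitor="val_accuracy",
                                         patience=5,
                                         restore_best_weights=True)
    history = model.fit(x_train, y_train,
                        batch_size=32,
                        epochs=140,  # upper bound of the reported 120-140
                        validation_data=(x_val, y_val),
                        callbacks=[early_stop])
    return model, history
```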
D. Rationale for the proposed architecture

The various modifications made to the baseline architecture follow the philosophy of Simonyan and Zisserman [6], i.e. the depth of the convnet is increased while the filter size is kept the same. In addition, the number of channels in the fully connected layers is also altered. All convnet architectures trained on the data are shown in Table I.

Initially, convnet A (Table I) is trained on the data using the procedure described in Section V-C. The training error of convnet A reached a constant value, showing that a more complex model is required to attain better results. Subsequently, convnet B is designed by adding an additional conv3-32 and conv3-64 layer (Table I). The training error for convnet B did not stagnate; rather, convnet B started to overfit on the data. This showed that slight changes in the architecture would be sufficient to improve the results. As a result, convnet C is designed by doubling the number of neurons in each fully connected layer of convnet B. However, convnet C showed only a slight improvement in results before it too started to overfit on the data. In order to prevent overfitting, a dropout layer is added after the first fully connected layer. The resulting architecture, convnet D, achieves the best results on the dataset. Hence, convnet D is the proposed classifier for the OCR (Fig 5). The model accuracies of the convnets on the train data are shown in Table II.

In all the convnets, a series of convolution operations with 3x3 filters is used to generate an effective receptive field of 5x5 or 7x7, instead of directly using 5x5 or 7x7 filters. Two consecutive 3x3 filters produce the effective receptive field of a 5x5 filter while reducing the number of parameters: for C input and output channels, two stacked 3x3 layers use 2 x 9C^2 = 18C^2 weights against 25C^2 for a single 5x5 layer. Similarly, three consecutive 3x3 filters produce the effective receptive field of a 7x7 filter with 27C^2 weights against 49C^2 [6]. In addition, increasing the number of convolution operations also increases the non-linearity, which increases the learning capacity of the network.

We avoided top-performing convnets such as GoogLeNet [5], Microsoft ResNet [24] and VGG Net [6] because they are designed for more complex datasets like ImageNet (ILSVRC), which contain larger images and many more classes. Such deep and complex convnet architectures would overfit on a relatively simple dataset such as ours. In addition, these deep and complex convnets demand computing power that is not available in the ordinary computers for which the proposed OCR has been designed, to enable prevalent use. Finally, these complex nets display slow run times while classifying images, which makes them unsuitable for an OCR where hundreds of letters must be classified on each page.
Table II: Train and validation errors of the convnet architectures.

Convnet architecture   Train Error (%)   Validation Error (%)
A                      10                11.97
B                      7.9               12.46
C                      6.5               11.94
D                      4.93              4

E. Implementation Details

The entire software is implemented in Python. Image segmentation and letter localization are carried out with the help of open source libraries such as OpenCV and PIL. The convnets are implemented in Keras [7] using TensorFlow [14] as the backend.
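As an illustration of the pixel-intensity approach mentioned in the abstract, the following OpenCV sketch locates text lines and letter candidates from projection profiles. It is a minimal sketch of the general idea, not the paper's algorithm: the Otsu binarization and the min_gap heuristic are assumptions, and a real Devanagari segmenter additionally has to handle the header line (shirorekha) and compound characters, which this sketch ignores.

```python
# Illustrative pixel-intensity (projection profile) segmentation:
# dark rows mark text lines, dark columns within a line mark letter
# candidates. Thresholds are illustrative, not the paper's tuned values.
import cv2
import numpy as np

def segment_lines_and_letters(image_path, min_gap=2):
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Binarize with Otsu so ink pixels become 1 and background 0
    _, binary = cv2.threshold(gray, 0, 1,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    def runs(profile):
        # Return (start, end) pairs where the intensity profile is non-zero
        nonzero = profile > 0
        boundaries = np.flatnonzero(np.diff(nonzero.astype(int)))
        edges = np.concatenate(([0], boundaries + 1, [len(profile)]))
        return [(s, e) for s, e in zip(edges[:-1], edges[1:])
                if nonzero[s] and e - s >= min_gap]

    letters = []
    # Horizontal projection: ink pixels per row -> text lines
    for top, bottom in runs(binary.sum(axis=1)):
        line = binary[top:bottom]
        # Vertical projection within the line -> letter candidates
        for left, right in runs(line.sum(axis=0)):
            letters.append((top, bottom, left, right))
    return letters
```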
Sanskrit manuscripts and documents.

VII. CONCLUSION AND FUTURE SCOPE

We present an OCR for Sanskrit (Devanagari script). We introduce a novel approach of using convnets as classifiers for Indic OCRs. We show that convnets are more suitable than SVMs and ANNs for multi-class image classification problems. In addition, we show that our OCR is ideal for digitizing old and poorly maintained material, as it is robust to font size and style, image quality and contrast.

To improve the OCR system further, learning can be introduced for letter segmentation and identification. This could be achieved with the help of a selective search algorithm followed by an R-CNN [23].

VIII. REFERENCES

[1] Hinton, Geoffrey E., Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. "Improving neural networks by preventing co-adaptation of feature detectors." arXiv preprint arXiv:1207.0580 (2012).
[2] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.
[3] LeCun, Yann, Bernhard Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne Hubbard, and Lawrence D. Jackel. "Backpropagation applied to handwritten zip code recognition." Neural Computation 1, no. 4 (1989): 541-551.
[4] Nair, Vinod, and Geoffrey E. Hinton. "Rectified linear units improve restricted Boltzmann machines." In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807-814. 2010.
[5] Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going deeper with convolutions." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9. 2015.
[6] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).
[7] Keras GitHub repository: https://github.com/fchollet/keras.
[8] Smith, Ray (Google Inc.). "An overview of the Tesseract OCR engine." ICDAR 2007. IEEE, 2007.
[9] Dalal, Navneet, and Bill Triggs. "Histograms of oriented gradients for human detection." In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1, pp. 886-893. IEEE, 2005.
[10] Sankaran, Naveen, and C. V. Jawahar. "Recognition of printed Devanagari text using BLSTM neural network." In Pattern Recognition (ICPR), 2012 21st International Conference on, pp. 322-325. IEEE, 2012.
[11] Rahiman, M. Abdul, and M. S. Rajasree. "A detailed study and analysis of OCR research in south Indian scripts." In Advances in Recent Technologies in Communication and Computing, 2009. ARTCom'09. International Conference on, pp. 31-38. IEEE, 2009.
[12] Jawahar, C. V., M. N. S. S. K. Pavan Kumar, and S. S. Ravi Kiran. "A bilingual OCR for Hindi-Telugu documents and its applications." ICDAR 2003. IEEE, 2003.
[13] Chaudhuri, B. B., and U. Pal. "An OCR system to read two Indian language scripts: Bangla and Devanagari (Hindi)." IEEE, 1997.
[14] Abadi, Martín, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado et al. "TensorFlow: Large-scale machine learning on heterogeneous distributed systems." arXiv preprint arXiv:1603.04467 (2016).
[15] Pal, U., and B. B. Chaudhuri. "Indian script character recognition: a survey." Pattern Recognition 37, no. 9 (2004): 1887-1899.
[16] Dineshkumar, R., and J. Suganthi. "A research survey on Sanskrit offline handwritten character recognition." KTVR Knowledge Park for Engineering and Technology, Hindusthan College of Engineering and Technology, Tamilnadu (2013).
[17] Dineshkumar, R., and J. Suganthi. "Sanskrit character recognition system using neural network." Indian Journal of Science and Technology 8, no. 1 (2015): 65.
[18] "Google OCR," https://support.google.com/drive/answer/176692?hl=en.
[19] Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." European Conference on Computer Vision, 2014.
[20] ITRANS converter and code: http://www.aczoom.com/itrans
[21] Sanskrit Chandamama source: https://archive.org/details/Chandamama
[22] Yadav, Divakar, Sonia Sánchez-Cuadrado, and Jorge Morato. "Optical character recognition for Hindi language using a neural-network approach." JIPS 9, no. 1 (2013): 117-140.
[23] Ren, Shaoqing, Kaiming He, Ross Girshick, and Jian Sun. "Faster R-CNN: Towards real-time object detection with region proposal networks." In Advances in Neural Information Processing Systems, pp. 91-99. 2015.
[24] He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep residual learning for image recognition." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778. 2016.
[25] Bansal, Veena, and R. M. K. Sinha. "Integrating knowledge sources in Devanagari text recognition system." IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans 30, no. 4 (2000): 500-505.
[26] Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431-3440. 2015.
[27] Jayadevan, R., Satish R. Kolhe, Pradeep M. Patil, and Umapada Pal. "Offline recognition of Devanagari script: A survey." IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 41, no. 6 (2011): 782-796.