Convolutional Multi-Directional Recurrent Network for Offline Handwritten Text Recognition


2016 15th International Conference on Frontiers in Handwriting Recognition

Convolutional Multi-directional Recurrent Network for
Offline Handwritten Text Recognition

Zenghui Sun, Lianwen Jin∗ , Zecheng Xie, Ziyong Feng, Shuye Zhang
College of Electronic and Information Engineering
South China University of Technology
Guangzhou, China
sunfreding@gmail.com, ∗ lianwen.jin@gmail.com, xiezcheng@foxmail.com,
feng.ziyong@foxmail.com, shuye.cheung@gmail.com

Abstract—In this paper, we propose a new network architecture called the Convolutional Multi-directional Recurrent Network (CDRN) for offline handwritten text recognition. The conventional recurrent neural network model obtains the local context from limited directions, whereas we build up the multi-directional long short-term memory (MDirLSTM) module to abstract contextual information in various directions. Moreover, we develop a shortcut connection strategy in our proposed architecture for faster yet better convergence. In cooperation with the aforementioned methods, the proposed architecture also benefits from the following properties: (1) it obtains informative features of the input directly, without involving hand-crafted features and segmentation; and (2) it is an end-to-end trainable model whose components are trained conjointly. We evaluate the performance of the proposed method on two databases: IAM words and IRONOFF. Our experimental results demonstrate a significant increase in recognition performance using MDirLSTM and shortcut connections, which suggests the effectiveness of these two proposed methods.

Keywords—Offline handwritten text recognition, multi-directional LSTM, shortcut connection, end-to-end trainable

I. INTRODUCTION

Unconstrained offline handwritten text recognition (OHTR), which is the problem of transforming an image into readable text, remains a challenging recognition task. Some related attempts have been made to address this problem. For example, the systems in [1] [2] first segment the images into vertical frames and extract hand-crafted features from each of them, after which a hidden Markov model (HMM) module is applied. Hand-crafted features are essential to the aforementioned systems; they demand careful design but still may not be sufficiently representative for the classifiers. Instead of using such features, Bluche et al. [3] adopted a convolutional neural network for feature extraction, which outperformed the HMM system using explicit hand-crafted features.

Besides, because of the unconstrained patterns and cursive nature of handwritten texts, it is difficult to segment them correctly. Segment-free strategies have proved to be useful in text recognition [4]. Shi et al. [5] applied convolutional networks to extract a high-dimensional feature sequence from the input image and then developed a deep recurrent model to recognize the feature sequence. Recently, contextual and multiscale representations have been studied. Visin et al. [6] proposed a network architecture named ReNet that used four recurrent neural networks (RNNs) to sweep horizontally and vertically in both directions across the image. Liu et al. [7] and Bell et al. [8] presented techniques that combine local and global context features.

Considering the aforementioned work, the combination of convolutional neural networks (CNNs) and RNNs is a feasible solution to the OHTR problem. In this paper, we propose a convolutional multi-directional recurrent network (CDRN), whose detailed architecture is shown in Fig. 1. The advantages of the proposed model are as follows. (1) CDRN is end-to-end trainable: all its components can be jointly trained to fit each other, and with the use of convolution layers, explicit segmentation and hand-crafted features are no longer required. (2) A traditional RNN extracts contextual features in a certain direction; in our model, we advocate the use of multi-directional long short-term memory (MDirLSTM) to extract the local context from different directions. With the combination of information from different directions, the model can provide a high-level abstract representation of the input. (3) Multi-stage features are applied in our proposed model using shortcut connections [9] [10].

The remainder of the paper is organized as follows. In Sec. II, we introduce related work. In Sec. III, we illustrate the framework of the proposed model, in particular the architecture of MDirLSTM and the advantages of using shortcut connections. In Sec. IV, we present the experimental results. Finally, we conclude the paper.

II. RELATED WORK

A. Convolutional neural networks

Generally, a convolutional network [11] consists of stacked convolutional layers and pooling layers, and it serves

2167-6445/16 $31.00 © 2016 IEEE

DOI 10.1109/ICFHR.2016.50
Figure 1: Architecture of the proposed convolutional multi-directional recurrent network. The architecture consists of three parts. Fully convolutional layers are first used to extract feature sequences from the input images; the output of the last convolution layer is shared by two connections. Multi-directional LSTM (MDirLSTM) modules are then developed to extract contextual information in different directions (their detailed implementation is shown). The output of the first MDirLSTM module is then combined with the source input from the shortcut connection, and a transcription layer is applied to derive a label sequence from the per-frame predictions.
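The paper does not spell out how the transcription layer collapses per-frame predictions into a label sequence; a common choice in comparable pipelines (e.g., the CRNN of [5]) is CTC-style best-path decoding. The following is a hedged sketch of that decoder, not the authors' implementation: take the most likely label at each frame, merge consecutive repeats, and drop the blank symbol (the blank index is an assumed convention).

```python
def best_path_decode(frame_probs, blank=0):
    """Collapse per-frame label probabilities into a label sequence.

    frame_probs: one probability vector per time step (frame).
    blank: index of the blank symbol (assumed convention, not from the paper).
    """
    # 1. Pick the most likely label at every frame.
    path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    # 2. Merge consecutive repeats, then remove blanks.
    decoded, prev = [], None
    for label in path:
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded

# Per-frame path argmaxes: [2, blank, 3, 3, blank, 3, 4]
probs = [
    [0.1, 0.1, 0.7, 0.05, 0.05],
    [0.8, 0.05, 0.05, 0.05, 0.05],
    [0.1, 0.1, 0.1, 0.6, 0.1],
    [0.1, 0.1, 0.1, 0.6, 0.1],    # repeat of 3, merged away
    [0.8, 0.05, 0.05, 0.05, 0.05],
    [0.1, 0.1, 0.1, 0.6, 0.1],    # new 3 after a blank, kept
    [0.1, 0.1, 0.1, 0.1, 0.6],
]
print(best_path_decode(probs))  # → [2, 3, 3, 4]
```

Note how the blank between the two runs of label 3 keeps them from being merged, which is what lets the decoder emit doubled characters.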

as a powerful feature extractor. A convolutional layer generates its feature maps by convolving the input with a group of kernel filters, after which nonlinear activation functions are applied. In the task of unconstrained off-line text recognition, the size of the input images varies. Instead of conducting sliding-window processing or segmentation on the input images, a fully convolutional network (FCN) [12] is adopted as the feature extractor. As illustrated in [5], an FCN takes arbitrary-sized images as input and generates correspondingly sized feature maps for the upcoming layers.

B. Traditional RNNs

A recurrent neural network (RNN) [13], which generally consists of an input, a hidden state, and an output, is specially designed to model time series. Because of its structure and learning mechanism, an RNN can capture contextual information from the input features, which is helpful for sequence recognition. However, the RNN model is difficult to train, mainly due to the gradient vanishing and exploding problem [14]. To address this problem, the long short-term memory (LSTM) architecture [15] was proposed. The basic unit of an LSTM network is the memory block, which contains one or more memory cells and three gating units. Specifically, these three nonlinear gates, namely the input gate, output gate, and forget gate, regulate the write, read, and reset operations of the cells. One drawback of conventional LSTM is that it can only make use of the previous context. To access contextual knowledge in both directions, bidirectional LSTM (BLSTM) [16] was proposed.

C. Multi-dimensional RNNs

Conventional LSTM (or BLSTM) was designed to process one-dimensional (1-D) sequence data. In the OHTR problem, the input is multi-dimensional and therefore needs to be sliced or convolved into 1-D sequences. To some extent, such methods cannot thoroughly explore the local context of the different dimensions. Multidimensional LSTM (MDLSTM) [17] is a feasible solution to this problem. The basic idea of MDLSTM is to scan the input in each dimension and then generate an integrated representation using the surrounding context. Another practical solution is to use a spatial RNN [6] [8]. A spatial RNN extends the uni-dimensional RNN by applying different RNNs to different directions: (1) bottom to top; (2) top to bottom; (3) left to right; and (4) right to left. The outputs of the RNNs are then combined to generate a high-level feature expression.

Compared with MDLSTM, the spatial RNN has the significant advantage of lower complexity. In each direction, the spatial RNN uses a conventional RNN model rather than the multi-dimensional version. Moreover, each RNN is only dependent along horizontal or vertical sequences, so the internal computations of the different directions can be processed simultaneously. Furthermore, in each direction, all independent rows or columns can be computed in parallel. Such an implementation effectively reduces computation time and resources.

III. ARCHITECTURE

A. Our proposed architecture

The network architecture of the proposed model, as shown in Fig. 1, consists of three components. An image is processed by a deep fully convolutional network, which extracts

feature sequences for the following network modules. On top of the last two convolution layers, the MDirLSTM modules are applied to extract the local context of different directions. After the MDirLSTM module, a 1×1 convolution layer is then used to generate a higher-level representation of the local context. At the last convolution layer, feature maps are convolved to a fixed height of one, and thus the second MDirLSTM module can be regarded as two stacked BLSTM layers. Fully connected layers are then incorporated to enhance performance, and a transcription layer derives a label sequence from the per-frame predictions. Furthermore, we adopt shortcut connections in our proposed model. The remainder of this section introduces the MDirLSTM module and the shortcut connection architecture, and the benefits of adopting them in our model.

Figure 2: Relationship between the output z_{i,j} and its input in diagonal directions. For conciseness, only the processes in vertical directions are shown.

B. MDirLSTM

As illustrated in Sec. II.C, for the OHTR problem, we need a feasible solution other than traditional RNNs to obtain contextual information in different dimensions. Compared with MDLSTM, the spatial RNN is a better choice because it is more efficient and easily parallelizable. To avoid the problems of gradient exploding and vanishing, spatial LSTM is used. However, spatial LSTM only summarizes the local context of four directions to generate feature maps, leaving the diagonal directions (i.e., top left, top right, bottom left, and bottom right) out of consideration. Therefore, we propose an MDirLSTM module to obtain contextual information in all directions.

The MDirLSTM module consists of two spatial LSTMs, between which a 1×1 convolution layer is applied for feature extraction. The detailed architecture of MDirLSTM is shown in Fig. 1. In our model, the MDirLSTM module is on top of the convolutional layers. As suggested by [8] [18], the input-hidden transition is a 1×1 convolution for better abstraction. The output feature maps of the 1×1 convolution are then shared by two independent BLSTMs, each of which computes the recurrence vertically and horizontally. The outputs of the BLSTMs are then concatenated to combine the local context in different directions. After that, another 1×1 convolution is applied to mix this information together as a dimension reduction.

With the architecture described, an MDirLSTM module is developed to exploit context in various directions. We use {x_{i,j}} to denote the input of MDirLSTM, where x ∈ R^{H×W×C} (H, W, and C denote the height, width, and number of channels, respectively). Note that i denotes the vertical index and j the horizontal index. In a spatial LSTM layer, there are four LSTMs (two BLSTMs) that move in the cardinal directions: right, left, down, and up. For the vertical directions, the layer needs to scan the input in a top-down direction in addition to a bottom-up direction. The calculation is as follows:

    y^{vf}_{i,j} = L(y^{vf}_{i-1,j}, x_{i,j}),   i = 1, ..., H,    (1)
    y^{vb}_{i,j} = L(y^{vb}_{i+1,j}, x_{i,j}),   i = H, ..., 1,    (2)

where L represents the recurrent operation, y^{vf}_{i,j} denotes the output of the vertical-forward process (i.e., the scan in the top-down direction) and y^{vb}_{i,j} indicates the output of the vertical-backward process (i.e., the scan in the bottom-up direction). For the horizontal directions, the procedure is similar:

    y^{hf}_{i,j} = L(y^{hf}_{i,j-1}, x_{i,j}),   j = 1, ..., W,    (3)
    y^{hb}_{i,j} = L(y^{hb}_{i,j+1}, x_{i,j}),   j = W, ..., 1,    (4)

where y^{hf}_{i,j} and y^{hb}_{i,j} denote the output of the horizontal-forward and horizontal-backward process, respectively. After scanning in the four directions, the outputs are concatenated to obtain a composite output y_{i,j} = (y^{vf}_{i,j}, y^{vb}_{i,j}, y^{hf}_{i,j}, y^{hb}_{i,j}) = (φ_{i,j}, y^{hb}_{i,j}), where y_{i,j} ∈ R^C with C the number of channels of this spatial LSTM and φ_{i,j} = (y^{vf}_{i,j}, y^{vb}_{i,j}, y^{hf}_{i,j}).

In the MDirLSTM module, the output of the first spatial LSTM is then used as the input of the second spatial LSTM. We now demonstrate how the second spatial LSTM obtains contextual information from the diagonal directions. As stated above, y_{i,j} is the input of the second spatial LSTM. For a certain index i, j, the relationship between z^{vf}_{i,j} and x_{i-1,j+1} is described as follows:

    z^{vf}_{i,j} = L(z^{vf}_{i-1,j}, y_{i,j}),                          (5)
    z^{vf}_{i-1,j} = L(z^{vf}_{i-2,j}, y_{i-1,j})
                   = L(z^{vf}_{i-2,j}, (φ_{i-1,j}, y^{hb}_{i-1,j})),    (6)
    y^{hb}_{i-1,j} = L(y^{hb}_{i-1,j+1}, x_{i-1,j}),                    (7)
    y^{hb}_{i-1,j+1} = L(y^{hb}_{i-1,j+2}, x_{i-1,j+1}),                (8)

where z^{vf}_{i,j} denotes the output of the vertical-forward process of the second spatial LSTM.
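As a concrete illustration of Eqs. (1)–(8), the sketch below implements directional sweeps with a simplified tanh recurrence standing in for the LSTM cell L (the weight names and sizes are illustrative, not from the paper), stacks a vertical-forward sweep on top of a horizontal-backward sweep, and checks numerically that the stacked output at (i, j) reacts to a perturbation at its top-right neighbour (i−1, j+1):

```python
import numpy as np

def scan(x, axis, reverse, W_in, W_rec):
    """One directional sweep over an H x W x C map (cf. Eqs. 1-4).

    Every row (axis=1) or column (axis=0) is an independent 1-D
    recurrence, which is why the sweeps parallelize well. A tanh
    recurrence stands in for the LSTM cell L of the paper.
    """
    x = np.moveaxis(x, axis, 0)                 # put the scan axis first
    y = np.zeros(x.shape[:-1] + (W_rec.shape[0],))
    h = np.zeros(x.shape[1:-1] + (W_rec.shape[0],))
    steps = range(x.shape[0])
    for t in (reversed(steps) if reverse else steps):
        h = np.tanh(x[t] @ W_in.T + h @ W_rec.T)
        y[t] = h
    return np.moveaxis(y, 0, axis)

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 5, 2))
Wi1, Wr1 = 0.5 * rng.standard_normal((4, 2)), 0.5 * rng.standard_normal((4, 4))
Wi2, Wr2 = 0.5 * rng.standard_normal((4, 4)), 0.5 * rng.standard_normal((4, 4))

def stacked(x):
    # First stage: horizontal-backward sweep (Eq. 4), then a
    # vertical-forward sweep over its output (Eq. 5).
    return scan(scan(x, 1, True, Wi1, Wr1), 0, False, Wi2, Wr2)

z = stacked(x)
x_pert = x.copy()
x_pert[0, 3] += 1.0                              # top-right neighbour of (1, 2)
z_pert = stacked(x_pert)
print(np.abs(z_pert[1, 2] - z[1, 2]).max() > 0)   # True: (1,2) sees (0,3)
print(np.abs(z_pert[0, 4] - z[0, 4]).max() == 0)  # True: (0,4) never sees (0,3)
```

The first check confirms the diagonal dependence derived in Eqs. (5)–(8); the second confirms that a unit outside the chained receptive field is untouched, since its horizontal-backward input depends only on columns to its right.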

Figure 3: Structure of a shortcut connection used in CDRN.
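A hedged sketch of the combination shown in Fig. 3: L2-normalize the feature maps from the two layers so that neither dominates by magnitude, merge them, then rescale with a learnable parameter. The epsilon, the gamma value, and the function names are illustrative assumptions, not details from the paper.

```python
import numpy as np

def l2_normalize(f, eps=1e-12):
    """Normalize each feature vector to unit L2 norm across channels."""
    return f / (np.linalg.norm(f, axis=-1, keepdims=True) + eps)

def shortcut_combine(low, high, gamma, mode="sum"):
    """Merge a lower-layer map with a higher-layer map of the same H x W x C.

    Both inputs are L2-normalized so that features with larger values
    cannot dominate; the scale gamma (a trainable parameter in the
    model) then restores a useful dynamic range.
    """
    a, b = l2_normalize(low), l2_normalize(high)
    merged = a + b if mode == "sum" else np.concatenate([a, b], axis=-1)
    return gamma * merged

low = np.random.default_rng(2).standard_normal((4, 8, 16)) * 10.0   # large scale
high = np.random.default_rng(3).standard_normal((4, 8, 16)) * 0.1   # small scale
out = shortcut_combine(low, high, gamma=20.0, mode="sum")
print(out.shape)  # (4, 8, 16)
```

Both the element-wise sum and the concatenation variant are covered here because the experiments in Sec. IV compare exactly these two combination methods.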
From Eqs. (5) to (8), we can conclude that the output of the second spatial LSTM extracts contextual information from the top-right direction. The relationship with the other diagonal directions can be demonstrated using a similar procedure; a detailed explanation is provided in Fig. 2. After the input is processed by the second spatial LSTM, every unit of the output feature maps combines the contextual information of multiple directions. In this way, our model can extract a high-level feature representation that investigates the local context in all directions thoroughly.

C. Multi-stage features and shortcut connections

The shortcut connection has been studied in recent years, in both practice and theory, and has proved to be an effective approach [9] [10]. The basic idea of the shortcut connection is to connect output features from a lower layer to a higher layer while bypassing intermediate layers. With shortcut connections, information from multiple stages can be merged. The multi-stage features combine global and local contexts, and they can be helpful for clarifying local confusion. Furthermore, shortcut connections provide auxiliary gradient information, which accelerates the training of a deep network model and helps it to converge better.

As indicated in Fig. 1, we apply shortcut connections from the convolution layer at the bottom of the first MDirLSTM module to the output of the corresponding MDirLSTM module. Several issues need to be addressed. First, as noted in [7], the features from different layers have various scales, which makes it difficult to combine them directly because the features with larger values can be dominant. Therefore, L2 normalization is applied to the features from the different layers, and after the combination procedure, we use a scale layer with learnable scaling parameters; more details are shown in Fig. 3. Second, the specifics of the combination method differ between applications. In [7] [8], features are combined by concatenating them along a certain dimension, whereas in [10] an element-wise adding operation is applied. We conducted comparative experiments to determine the proper combination method for the shortcut connections in our model, which we present in the next section. Third, to apply shortcut connections to layers with different spatial sizes, we propose a fixed-shape pooling strategy. Similar to the RoI pooling proposed in [19], fixed-shape pooling uses max pooling to convert the input feature map into a smaller map with a fixed spatial shape H × W. Specifically, if the spatial size of the input map is denoted as h × w, we first divide the input feature map into H × W sub-cells of identical size h/H × w/W, and then apply max pooling in each sub-cell to obtain a feature map of the required shape.

Figure 4: Detailed architecture of CDRN. Different colors represent corresponding layer types, as illustrated in Fig. 1. 'k', 's' and 'p' stand for kernel size, stride and padding size, respectively. For different databases, the padding size of some convolution layers differs.
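The fixed-shape pooling strategy described above can be sketched as follows. How cell boundaries are rounded when h and w do not divide evenly is our assumption; the paper leaves that detail unspecified.

```python
import numpy as np

def fixed_shape_pool(x, H, W):
    """Max-pool an h x w x C map into a fixed H x W x C map.

    Divides the input into H x W sub-cells of roughly h/H x w/W each
    (boundaries rounded via linspace) and takes the max in every cell,
    in the spirit of RoI pooling [19].
    """
    h, w, C = x.shape
    rows = np.linspace(0, h, H + 1).astype(int)
    cols = np.linspace(0, w, W + 1).astype(int)
    out = np.empty((H, W, C))
    for i in range(H):
        for j in range(W):
            cell = x[rows[i]:rows[i + 1], cols[j]:cols[j + 1]]
            out[i, j] = cell.max(axis=(0, 1))
    return out

x = np.random.default_rng(4).standard_normal((10, 23, 8))
print(fixed_shape_pool(x, 2, 4).shape)  # (2, 4, 8)
```

Because the output shape depends only on (H, W), maps of different spatial sizes can be brought to a common shape before the shortcut combination.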

243
IV. EXPERIMENTS

To evaluate the effectiveness of the proposed model, we conducted experiments on two off-line handwriting databases: the IAM database [20] and the IRONOFF handwriting database [21]. Note that the images in these databases vary in size; typically, all images would be scaled to a fixed size to accelerate the training process. However, a pre-defined scale may not be suitable and can result in geometric distortion; we therefore adopt a proportional scaling strategy in practice. Data augmentation strategies, including optical distortion and image blurring, were used to avoid over-fitting. The detailed architecture of the proposed model is shown in Fig. 4. For both databases, the experiments were conducted using closed lexicons, which consist of all words occurring in the corresponding database. We used the word error rate (WER) as the evaluation metric for our experiments.

Table I: WER on the validation set of IAM words with different numbers of spatial LSTM layers in an MDirLSTM module

  Number of spatial LSTM layers
  in each MDirLSTM module          WER%
  0                                43.27
  1                                22.91
  2                                20.93

Table II: Word error rates on the IAM words database

  Methods                          WER%
  BHMM [22]                        25.8
  KCSR [23]                        20.01
  Espana-Boquera et al. [24]       15.50
  CDRN (proposed)                  11.51

Figure 5: Effect of shortcut connections and the influence of different combination methods. Word error rates on the IAM words dataset are shown.
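For isolated word recognition, where each sample is a single word image as in the experiments here, the WER reduces to the fraction of misrecognized words. A minimal sketch (the function name is ours, not from the paper):

```python
def word_error_rate(predictions, references):
    """WER for isolated word recognition: the fraction of word images
    whose predicted transcription differs from the ground truth."""
    assert len(predictions) == len(references)
    errors = sum(p != r for p, r in zip(predictions, references))
    return errors / len(references)

preds = ["the", "quick", "brwn", "fox"]
refs = ["the", "quick", "brown", "fox"]
print(word_error_rate(preds, refs))  # 0.25
```

The relative reductions quoted below are then ratios of two such WER values, e.g. (25.8 − 11.51) / 25.8 ≈ 55.4%.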
A. IAM Database

The IAM database consists of unconstrained English handwritten text written by 657 writers. Using the segmentation annotations, the dataset can be built in two versions: IAM words and IAM lines. We adopted the official partition of the database for the large writer-independent text line recognition task but performed our experiments at the word level. The training set contained 53,841 words, the validation set contained 7,899 words, and the test set contained 17,616 words. We conducted several comparative experiments to evaluate the effect of the MDirLSTM modules and shortcut connections.

1) Different numbers of spatial LSTM layers in an MDirLSTM module: We first conducted comparative experiments to prove the effectiveness of the MDirLSTM module. These experiments used the architecture shown in Fig. 1 (without shortcut connections) but with different numbers of spatial LSTM layers. The experimental results are presented in Table I. Compared with the model without an MDirLSTM module (containing zero spatial LSTM layers), introducing MDirLSTM modules between convolution layers reduced the error rate by more than 20 percentage points. This shows that MDirLSTM modules can capture more informative features from the sequential input and improve performance significantly. Furthermore, by stacking spatial LSTM layers, the model achieved a WER of 20.93%, a relative 8.6% error rate reduction compared with the model with a single spatial LSTM. These results indicate that the proposed stacked spatial LSTMs obtained contextual information in the diagonal directions.

2) Different combination methods in shortcut connections: In these experiments, we evaluated the effect of shortcut connections and the influence of the different combination methods for shortcut connections, that is, the concatenation operation and the element-wise adding operation. As shown in Fig. 5, the model without shortcut connections required 40,000 training iterations to achieve an error rate of 25%. In comparison, only 20,000 iterations were required by the model with the concatenation operation (Shortcut-concat). Moreover, the model with shortcut connections using the element-wise adding operation (Shortcut-sum) achieved an error rate of 17.86%, a relative 14.7% error rate reduction compared with the model without shortcut connections. From these results, we can conclude that shortcut connections accelerate the training procedure and help the model to converge better. Furthermore, compared with the concatenation operation, shortcut connections with the element-wise adding operation achieved better performance.

Our comparative experiments showed the effect of stacked MDirLSTM and shortcut connections; therefore, we adopted these two architectures in our model for the following experiments. We further conducted several experiments on the IAM words database with more data augmentation strategies, and the results are presented in Table II. Compared with the result reported in [22], our proposed model achieved a significant improvement, with a relative 55.4% WER reduction. Additionally, a relative 42.5% WER reduction was achieved compared with the result reported in [23].

B. IRONOFF database

The IRONOFF handwriting database contains 36,396 word images from a 196-word lexicon (English and French). The database contains both on-line and off-line samples of certain handwriting signals; only the latter are used in our experiments. The database is officially divided into a training set of 20,898 words and a testing set of 10,448 words. We compared the performance of different methods using the

IRONOFF database. The results in Table III indicate that our proposed model outperformed all other models. It achieved a WER of 1.6%, which is, to the best of our knowledge, the best result for the IRONOFF database so far.

Table III: Word error rates on the IRONOFF database

  Methods                  WER%
  Kessentini et al. [2]    10.2
  Tay et al. [1]            3.9
  CDRN (proposed)           1.6

V. CONCLUSION

In this paper, we presented a novel network architecture called CDRN for offline handwritten text recognition. The proposed architecture is an end-to-end trainable model that takes images of various sizes as input and outputs the predictions. Using MDirLSTM, CDRN can capture contextual information in different directions, which significantly improves the performance of the network. Moreover, adopting shortcut connections in our proposed model accelerates the training procedure and helps the model to converge better. Our experiments on the IAM words database and the IRONOFF offline database show that the proposed architecture achieves highly competitive performance. In our experiments, CDRN applies the MDirLSTM module only on top of the last two convolution layers; since the MDirLSTM module can process inputs of various shapes, more MDirLSTM modules could be adopted in CDRN. Besides, the usage of shortcut connections in our model can be further investigated.

ACKNOWLEDGMENT

This research is supported in part by NSFC (Grant No. 61472144), the National Key Research & Development Plan of China (No. 2016YFB1001405), GDSTP (Grant Nos. 2014A010103012, 2015B010101004, 2015B010130003, 2015B010131004), GDUPS (2011), and the Fundamental Research Funds for the Central Universities (No. D2157060).

REFERENCES

[1] Y. H. Tay, P.-M. Lallican, M. Khalid, C. Viard-Gaudin, and S. Knerr, "An offline cursive handwritten word recognition system," in TENCON 2001: Proceedings of IEEE Region 10 International Conference on Electrical and Electronic Technology, vol. 2. IEEE, 2001, pp. 519–524.

[2] Y. Kessentini, T. Paquet, and A. B. Hamadou, "Off-line handwritten word recognition using multi-stream hidden Markov models," Pattern Recognition Letters, vol. 31, no. 1, pp. 60–70, 2010.

[3] T. Bluche, H. Ney, and C. Kermorvant, "Feature extraction with convolutional neural networks for handwritten word recognition," in Document Analysis and Recognition (ICDAR), 2013 12th International Conference on. IEEE, 2013, pp. 285–289.

[4] T.-H. Su, T.-W. Zhang, D.-J. Guan, and H.-J. Huang, "Off-line recognition of realistic Chinese handwriting using segmentation-free strategy," Pattern Recognition, vol. 42, no. 1, pp. 167–182, 2009.

[5] B. Shi, X. Bai, and C. Yao, "An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition," CoRR, vol. abs/1507.05717, 2015.

[6] F. Visin, K. Kastner, K. Cho, M. Matteucci, A. Courville, and Y. Bengio, "ReNet: A recurrent neural network based alternative to convolutional networks," CoRR, vol. abs/1505.00393, 2015.

[7] W. Liu, A. Rabinovich, and A. C. Berg, "ParseNet: Looking wider to see better," CoRR, vol. abs/1506.04579, 2015.

[8] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick, "Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks," CoRR, vol. abs/1512.04143, 2015.

[9] P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun, "Pedestrian detection with unsupervised multi-stage feature learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3626–3633.

[10] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," CoRR, vol. abs/1512.03385, 2015.

[11] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[12] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.

[13] R. J. Williams and D. Zipser, "Gradient-based learning algorithms for recurrent networks and their computational complexity," Back-propagation: Theory, Architectures and Applications, pp. 433–486, 1995.

[14] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.

[15] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[16] A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Networks, vol. 18, no. 5, pp. 602–610, 2005.

[17] A. Graves and J. Schmidhuber, "Offline handwriting recognition with multidimensional recurrent neural networks," in Advances in Neural Information Processing Systems, 2009, pp. 545–552.

[18] R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio, "How to construct deep recurrent neural networks," CoRR, vol. abs/1312.6026, 2013.

[19] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.

[20] U.-V. Marti and H. Bunke, "The IAM-database: an English sentence database for offline handwriting recognition," International Journal on Document Analysis and Recognition, vol. 5, no. 1, pp. 39–46, 2002.

[21] C. Viard-Gaudin, P. M. Lallican, S. Knerr, and P. Binter, "The IRESTE on/off (IRONOFF) dual handwriting database," in Document Analysis and Recognition, 1999. ICDAR '99. Proceedings of the Fifth International Conference on. IEEE, 1999, pp. 455–458.

[22] A. Giménez, I. Khoury, J. Andrés-Ferrer, and A. Juan, "Handwriting word recognition using windowed Bernoulli HMMs," Pattern Recognition Letters, vol. 35, pp. 149–156, 2014.

[23] J. Almazán, A. Gordo, A. Fornés, and E. Valveny, "Word spotting and recognition with embedded attributes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 12, pp. 2552–2566, 2014.

[24] S. Espana-Boquera, M. J. Castro-Bleda, J. Gorbe-Moya, and F. Zamora-Martinez, "Improving offline handwritten text recognition with hybrid HMM/ANN models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 4, pp. 767–779, 2011.
