Convolutional Multi-Directional Recurrent Network for Offline Handwritten Text Recognition
Zenghui Sun, Lianwen Jin∗ , Zecheng Xie, Ziyong Feng, Shuye Zhang
College of Electronic and Information Engineering
South China University of Technology
Guangzhou, China
sunfreding@gmail.com, ∗ lianwen.jin@gmail.com, xiezcheng@foxmail.com,
feng.ziyong@foxmail.com, shuye.cheung@gmail.com
Abstract—In this paper, we propose a new network architecture called Convolutional Multi-directional Recurrent Network (CDRN) for offline handwritten text recognition. The conventional recurrent neural network model obtains the local context from limited directions, whereas we build up the multi-directional long short-term memory (MDirLSTM) module to abstract contextual information in various directions. Moreover, we develop a shortcut connection strategy in our proposed architecture for faster yet better convergence. In cooperation with the aforementioned methods, the proposed architecture also benefits from the following properties: (1) it obtains informative features from the input directly, without involving hand-crafted features or segmentation; and (2) it is an end-to-end trainable model whose components are trained jointly. We evaluate the performance of the proposed method on two databases: IAM words and IRONOFF. Our experimental results demonstrate a significant increase in recognition performance from MDirLSTM and shortcut connections, which suggests the effectiveness of these two proposed methods.

Keywords—Offline handwritten text recognition; multi-directional LSTM; shortcut connection; end-to-end trainable

I. INTRODUCTION

Unconstrained offline handwritten text recognition (OHTR), the problem of transforming an image into readable text, remains a challenging recognition task. Several attempts have been made to address this problem. For example, the systems in [1] [2] first segment the images into vertical frames and extract hand-crafted features from each of them, after which a hidden Markov model (HMM) module is applied. Hand-crafted features are essential to such systems; they demand careful design, yet may still not be sufficiently representative for the classifiers. Instead of using such features, Bluche et al. [3] adopted a convolutional neural network for feature extraction, which outperformed the HMM system using explicit hand-crafted features.

Moreover, because of the unconstrained patterns and cursive nature of handwritten text, it is difficult to segment it correctly. Segmentation-free strategies have proved to be useful in text recognition [4]. Shi et al. [5] applied convolutional networks to extract a high-dimensional feature sequence from the input image and then developed a deep recurrent model to recognize the feature sequence. Recently, contextual and multiscale representations have also been studied: Visin et al. [6] proposed a network architecture named ReNet that uses four recurrent neural networks (RNNs) to sweep horizontally and vertically in both directions across the image, and Liu et al. [7] and Bell et al. [8] presented techniques that combine local and global context features.

Considering the aforementioned work, the combination of convolutional neural networks (CNNs) and RNNs is a feasible solution to the OHTR problem. In this paper, we propose a convolutional multi-directional recurrent network (CDRN), whose detailed architecture is shown in Fig. 1. The advantages of the proposed model are as follows. (1) CDRN is end-to-end trainable: all its components can be jointly trained to fit each other, and with the use of convolution layers, explicit segmentation and hand-crafted features are no longer required. (2) A traditional RNN extracts contextual features in a certain direction; in our model, we advocate the use of multi-directional long short-term memory (MDirLSTM) to extract the local context from different directions. By combining information from different directions, the model can provide a high-level abstract representation of the input. (3) Multi-stage features are applied in our proposed model using shortcut connections [9] [10].

The remainder of the paper is organized as follows. In Sec. II, we introduce related work. In Sec. III, we illustrate the framework of the proposed model, in particular the architecture of MDirLSTM and the advantages of using shortcut connections. In Sec. IV, we present the experimental results. Finally, we conclude the paper.

II. RELATED WORK

A. Convolutional neural networks

Generally, a convolutional network [11] consists of stacked convolutional layers and pooling layers, and it serves as a feature extractor.
Figure 1: Architecture of the proposed convolutional multi-directional recurrent network. The architecture consists of three parts. Fully convolutional layers are first used to extract feature sequences from the input images; the output of the last convolution layer is shared by two connections. Multi-directional LSTM (MDirLSTM) modules are then developed to extract contextual information in different directions; the detailed implementation of the MDirLSTM modules is also shown. The output of the first MDirLSTM module is combined with the source input from the shortcut connection, and the transcription layer derives a label sequence from the per-frame predictions.
Figure 2: Relationship between the output $z_{i,j}$ and its input in the diagonal directions. For conciseness, only the processes in the vertical directions are shown.

III. THE PROPOSED MODEL

In the proposed model, the convolutional layers extract feature sequences for the following network modules. On top of the last two convolution layers, MDirLSTM modules are applied to extract the local context in different directions. After the MDirLSTM module, a 1×1 convolution layer is used to generate a higher-level representation of the local context. At the last convolution layer, the feature maps are convolved to a fixed height of one, and thus the second MDirLSTM module can be regarded as two stacked BLSTM layers. Fully connected layers are then incorporated to enhance performance, and a transcription layer derives a label sequence from the per-frame predictions (a decoding sketch follows below). Furthermore, we adopt shortcut connections in our proposed model. The remainder of this section introduces the MDirLSTM module and the shortcut connection architecture, and the benefits of adopting them in our model.
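The pages available here do not specify how the transcription layer is implemented. In CRNN-style models such as [5], it is commonly realized with CTC, whose best-path decoding collapses the per-frame predictions into a label sequence. The following is a minimal sketch of that decoding step under the CTC assumption; it is not necessarily the authors' implementation, and the function name and blank index are illustrative.

```python
# Illustrative sketch, assuming a CTC-style transcription layer (as in
# CRNN-type models [5]); not necessarily the authors' implementation.
import torch

def best_path_decode(logits, blank=0):
    """Collapse per-frame predictions (T, num_classes) into a label
    sequence: argmax per frame, merge repeated labels, drop blanks."""
    path = logits.argmax(dim=-1).tolist()
    labels, prev = [], blank
    for p in path:
        if p != blank and p != prev:   # skip blanks and merged repeats
            labels.append(p)
        prev = p
    return labels

# Toy example: 6 frames over 4 classes (class 0 is the CTC blank).
frames = torch.tensor([[9., 0, 0, 0], [0, 9, 0, 0], [0, 9, 0, 0],
                       [9., 0, 0, 0], [0, 0, 0, 9], [0, 0, 0, 9]])
print(best_path_decode(frames))        # -> [1, 3]
```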
B. MDirLSTM

As illustrated in Section II.C, for the OHTR problem we need a feasible solution instead of traditional RNNs to obtain contextual information in different dimensions. Compared with MDLSTM, the spatial RNN is a better choice because it is more efficient and easily parallelizable. To avoid the problems of exploding and vanishing gradients, the spatial LSTM is used. However, the spatial LSTM only summarizes the local context of four directions to generate feature maps, leaving the diagonal directions (i.e., top left, top right, bottom left, and bottom right) out of consideration. Therefore, we propose the MDirLSTM module to obtain contextual information in all directions.

The MDirLSTM module consists of two spatial LSTMs, between which a 1×1 convolution layer is applied for feature extraction. The detailed architecture of MDirLSTM is shown in Fig. 1. In our model, the MDirLSTM module is placed on top of the convolutional layers. As suggested by [8] [18], the input-hidden transition is a 1×1 convolution for better abstraction. The output feature maps of the 1×1 convolution are then shared by two independent BLSTMs, one of which computes the recurrence vertically and the other horizontally. The outputs of the BLSTMs are then concatenated to combine the local context in different directions. After that, another 1×1 convolution is applied to mix this information together as a dimension reduction.

With the architecture described, an MDirLSTM module is developed to exploit context in various directions. We use $\{x_{i,j}\}$ to denote the input of MDirLSTM, where $x \in \mathbb{R}^{H \times W \times C}$ ($H$, $W$, and $C$ denote the height, width, and number of channels, respectively). Note that $i$ denotes the vertical index and $j$ the horizontal index. In a spatial LSTM layer, there are four LSTMs (two BLSTMs) that move in the cardinal directions: right, left, down, and up. For the vertical directions, the layer scans the input in a top-down direction in addition to a bottom-up direction. The calculation is as follows:

$$y^{vf}_{i,j} = L(y^{vf}_{i-1,j}, x_{i,j}), \quad i = 1, \cdots, H, \qquad (1)$$

$$y^{vb}_{i,j} = L(y^{vb}_{i+1,j}, x_{i,j}), \quad i = H, \cdots, 1, \qquad (2)$$

where $L$ represents the recurrent operation, $y^{vf}_{i,j}$ denotes the output of the vertical-forward process (i.e., the scan in the top-down direction), and $y^{vb}_{i,j}$ indicates the output of the vertical-backward process (i.e., the scan in the bottom-up direction). For the horizontal directions the procedure is similar:

$$y^{hf}_{i,j} = L(y^{hf}_{i,j-1}, x_{i,j}), \quad j = 1, \cdots, W, \qquad (3)$$

$$y^{hb}_{i,j} = L(y^{hb}_{i,j+1}, x_{i,j}), \quad j = W, \cdots, 1, \qquad (4)$$

where $y^{hf}_{i,j}$ and $y^{hb}_{i,j}$ denote the outputs of the horizontal-forward and horizontal-backward processes, respectively. After scanning in the four directions, the outputs are concatenated to obtain a composite output $y_{i,j} = (y^{vf}_{i,j}, y^{vb}_{i,j}, y^{hf}_{i,j}, y^{hb}_{i,j}) = (\phi_{i,j}, y^{hb}_{i,j})$, where $y_{i,j} \in \mathbb{R}^{C}$, with $C$ the number of output channels of this spatial LSTM, and $\phi_{i,j} = (y^{vf}_{i,j}, y^{vb}_{i,j}, y^{hf}_{i,j})$.
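The four scans in Eqs. (1)-(4) map naturally onto two bidirectional LSTMs, one sweeping the columns and one sweeping the rows. The following is a minimal PyTorch sketch of such a spatial LSTM layer; the class name, hidden size, and tensor shapes are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of the four-directional scan in Eqs. (1)-(4) using
# PyTorch; names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class SpatialLSTM(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        # One bidirectional LSTM sweeps every column top-down/bottom-up
        # (Eqs. (1)-(2)); the other sweeps every row left-right and
        # right-left (Eqs. (3)-(4)).
        self.vertical = nn.LSTM(c_in, c_out, bidirectional=True)
        self.horizontal = nn.LSTM(c_in, c_out, bidirectional=True)

    def forward(self, x):                          # x: (H, W, C_in)
        # Vertical scans: each of the W columns is a length-H sequence.
        v, _ = self.vertical(x)                    # (H, W, 2*c_out)
        # Horizontal scans: each of the H rows is a length-W sequence.
        h, _ = self.horizontal(x.transpose(0, 1))  # (W, H, 2*c_out)
        # Concatenate the four directional outputs at each position
        # (i, j), i.e., the composite output y_{i,j}.
        return torch.cat([v, h.transpose(0, 1)], dim=-1)  # (H, W, 4*c_out)

feats = torch.randn(8, 32, 16)                     # H=8, W=32, C=16
print(SpatialLSTM(16, 24)(feats).shape)            # torch.Size([8, 32, 96])
```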
In the MDirLSTM module, the output of the first spatial LSTM is then used as the input of the second spatial LSTM. We now demonstrate how the second spatial LSTM obtains contextual information from the diagonal directions. As stated above, $y_{i,j}$ is the input of the second spatial LSTM. For a given index $(i, j)$, the relationship between $z^{vf}_{i,j}$ and $x_{i-1,j+1}$ is described as follows:

$$z^{vf}_{i,j} = L(z^{vf}_{i-1,j}, y_{i,j}), \qquad (5)$$

$$z^{vf}_{i-1,j} = L(z^{vf}_{i-2,j}, y_{i-1,j}) = L(z^{vf}_{i-2,j}, (\phi_{i-1,j}, y^{hb}_{i-1,j})), \qquad (6)$$

$$y^{hb}_{i-1,j} = L(y^{hb}_{i-1,j+1}, x_{i-1,j}), \qquad (7)$$

$$y^{hb}_{i-1,j+1} = L(y^{hb}_{i-1,j+2}, x_{i-1,j+1}), \qquad (8)$$

where $z^{vf}_{i,j}$ denotes the output of the vertical-forward process. From Eqs. (5) to (8) we can conclude that the output
of the second spatial LSTM extracts contextual information from the top-right direction. The relationship with the other diagonal directions can be demonstrated using a similar procedure; a detailed illustration is provided in Fig. 2.

After the input is processed by the second spatial LSTM, every unit of the output feature maps combines the contextual information from multiple directions. In this way, our model can extract a high-level feature representation that thoroughly exploits the local context in all directions.
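To make the composition concrete, the sketch below stacks two of the spatial LSTM layers from the previous sketch with a 1×1 convolution in between, as the module description above suggests. It is a simplification under stated assumptions: the channel sizes are illustrative, and the input-hidden 1×1 convolution mentioned in the text is omitted for brevity.

```python
# Hedged sketch of the MDirLSTM module: spatial LSTM -> 1x1 convolution
# (mixing / dimension reduction) -> spatial LSTM, reusing the
# SpatialLSTM class from the previous sketch.
import torch.nn as nn

class MDirLSTM(nn.Module):
    def __init__(self, c_in, c_hidden):
        super().__init__()
        self.lstm1 = SpatialLSTM(c_in, c_hidden)      # cardinal directions
        self.mix = nn.Conv2d(4 * c_hidden, c_hidden, kernel_size=1)
        self.lstm2 = SpatialLSTM(c_hidden, c_hidden)  # adds diagonal context

    def forward(self, x):                             # x: (H, W, C_in)
        y = self.lstm1(x)                             # (H, W, 4*c_hidden)
        # nn.Conv2d expects (N, C, H, W); add and remove a batch axis.
        y = self.mix(y.permute(2, 0, 1).unsqueeze(0)) # (1, c_hidden, H, W)
        y = y.squeeze(0).permute(1, 2, 0)             # (H, W, c_hidden)
        return self.lstm2(y)                          # (H, W, 4*c_hidden)
```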
Figure 3: Structure of a shortcut connection used in CDRN.

Figure 4: Detailed architecture of CDRN. Different colors represent the corresponding layer types, as illustrated in Fig. 1; 'k', 's' and 'p' stand for kernel size, stride, and padding size, respectively. For different databases, the padding size of some convolution layers differs.

C. Multi-stage features and shortcut connections

Shortcut connections have been studied in theory and practice in recent years and have proved to be an effective approach [9] [10]. The basic idea of the shortcut connection is to connect output features from a lower layer to a higher layer while bypassing the intermediate layers. With shortcut connections, information from multiple stages can be merged. The multi-stage features combine global and local contexts, and they can be helpful for clarifying local confusion. Furthermore, shortcut connections provide auxiliary gradient information, which accelerates the training of a deep network model and helps it converge better.

As indicated in Fig. 1, we apply a shortcut connection between the convolution layer at the bottom of the first MDirLSTM module and the output of the corresponding MDirLSTM module. Several issues need to be addressed. First, as noted in [7], features from different layers have different scales, which makes it difficult to combine them directly, because features with larger values can dominate. Therefore, L2 normalization is applied to the features from the different layers, and after the combination procedure we use a scale layer with learnable scaling parameters; more details are shown in Fig. 3. Second, the specifics of the combination method differ between applications: in [7] [8], features are combined by concatenation along a certain dimension, whereas in [10] an element-wise addition is applied. We conducted comparative experiments to determine the proper combination method for the shortcut connections in our model, which we present in the next section. Third, to apply shortcut connections to layers with different spatial sizes, we propose a fixed-shape pooling strategy. Similar to the RoI pooling proposed in [19], fixed-shape pooling uses max pooling to convert the input feature map into a smaller map with a fixed spatial shape $H \times W$: if the spatial size of the input map is denoted as $h \times w$, we first divide the input feature map into $H \times W$ sub-cells of identical size $h/H \times w/W$, and then apply max pooling in each sub-cell to obtain a feature map of the required shape, as sketched below.
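As a concrete reading of the three points above, the following sketch pools the shortcut branch to a fixed spatial shape (AdaptiveMaxPool2d performs the per-sub-cell max pooling just described), L2-normalizes it along the channel dimension, rescales it with learnable parameters, and concatenates it with the main branch. This is an assumed arrangement for illustration, not the authors' exact layers; element-wise summation is the alternative combination compared in the experiments.

```python
# Minimal sketch (assumptions, not the authors' exact layers) of the
# shortcut-connection ingredients: fixed-shape pooling, L2
# normalization, a learnable scale, and concatenation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShortcutCombine(nn.Module):
    def __init__(self, channels, out_h, out_w):
        super().__init__()
        # Max-pools an h x w map into a fixed out_h x out_w grid of
        # sub-cells, matching the fixed-shape pooling strategy.
        self.fixed_pool = nn.AdaptiveMaxPool2d((out_h, out_w))
        # Learnable scaling parameters applied after L2 normalization.
        self.scale = nn.Parameter(torch.ones(channels))

    def forward(self, shortcut, main):
        # shortcut: (N, C, h, w); main: (N, C', out_h, out_w)
        s = self.fixed_pool(shortcut)           # (N, C, out_h, out_w)
        s = F.normalize(s, p=2, dim=1)          # L2-normalize channels
        s = s * self.scale.view(1, -1, 1, 1)    # learned rescaling
        return torch.cat([s, main], dim=1)      # concatenate channels
```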
IV. EXPERIMENTS

To evaluate the effectiveness of the proposed model, we conducted experiments on two offline handwriting databases: the IAM database [20] and the IRONOFF handwriting database [21]. Note that the images in these databases vary in size. Typically, all images would be scaled to a fixed size to accelerate training, but a pre-defined scale may be unsuitable and cause geometric distortion; we therefore adopt a proportional scaling strategy in practice. Data augmentation strategies, including optical distortion and image blurring, were used to avoid over-fitting. The detailed architecture of the proposed model is shown in Fig. 4. For both databases, the experiments were conducted using closed lexicons, which consist of all words occurring in the corresponding database. We used the word error rate (WER) as the evaluation metric.
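The proportional scaling mentioned above can be read as resizing every image to a common height while preserving its aspect ratio. A minimal sketch follows; the target height of 64 is an illustrative assumption, as the value is not stated on these pages.

```python
# Hedged sketch of proportional (aspect-ratio-preserving) rescaling;
# the target height is an illustrative assumption.
from PIL import Image

def proportional_rescale(img: Image.Image, target_h: int = 64) -> Image.Image:
    w, h = img.size
    new_w = max(1, round(w * target_h / h))   # keep the aspect ratio
    return img.resize((new_w, target_h), Image.BILINEAR)
```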
Table I: WER on the validation set of IAM words with different spatial LSTM layers in an MDirLSTM module.

Table II: Word error rates on the IAM words database.

[Plot comparing the convergence of the 'Without shortcut', 'Shortcut-concat', and 'Shortcut-sum' configurations.]
On the IRONOFF database, the results in Table III indicate that our proposed model outperformed all the other models: it achieved a WER of 1.6%, which is, to the best of our knowledge, the best result reported for the IRONOFF database so far.

Table III: Word error rates on the IRONOFF database

Methods                  WER%
Kessentini et al. [2]    10.2
Tay et al. [1]            3.9
CDRN (proposed)           1.6

V. CONCLUSION

In this paper, we presented a novel network architecture called CDRN for offline handwritten text recognition. The proposed architecture is an end-to-end trainable model that takes images of various sizes as input and outputs the predictions. Using MDirLSTM, CDRN can capture contextual information in different directions, which significantly improves the performance of the network. Moreover, adopting shortcut connections in our proposed model accelerates the training procedure and helps the model converge to a better solution. Our experiments on the IAM words database and the IRONOFF offline database show that the proposed architecture achieves highly competitive performance. In our experiments, CDRN applies the MDirLSTM module only on top of the last two convolution layers; since the MDirLSTM module can process inputs of various shapes, more MDirLSTM modules could be adopted in CDRN. In addition, the use of shortcut connections in our model can be further investigated.

ACKNOWLEDGMENT

This research is supported in part by NSFC (Grant No. 61472144), the National Key Research & Development Plan of China (No. 2016YFB1001405), GDSTP (Grant Nos. 2014A010103012, 2015B010101004, 2015B010130003, 2015B010131004), GDUPS (2011), and the Fundamental Research Funds for the Central Universities (No. D2157060).

REFERENCES

[1] Y. H. Tay, P.-M. Lallican, M. Khalid, C. Viard-Gaudin, and S. Knerr, "An offline cursive handwritten word recognition system," in Proceedings of IEEE Region 10 International Conference on Electrical and Electronic Technology (TENCON 2001), vol. 2. IEEE, 2001, pp. 519–524.
[2] Y. Kessentini, T. Paquet, and A. B. Hamadou, "Off-line handwritten word recognition using multi-stream hidden Markov models," Pattern Recognition Letters, vol. 31, no. 1, pp. 60–70, 2010.
[3] T. Bluche, H. Ney, and C. Kermorvant, "Feature extraction with convolutional neural networks for handwritten word recognition," in 2013 12th International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2013, pp. 285–289.
[4] T.-H. Su, T.-W. Zhang, D.-J. Guan, and H.-J. Huang, "Off-line recognition of realistic Chinese handwriting using segmentation-free strategy," Pattern Recognition, vol. 42, no. 1, pp. 167–182, 2009.
[5] B. Shi, X. Bai, and C. Yao, "An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition," CoRR, vol. abs/1507.05717, 2015.
[6] F. Visin, K. Kastner, K. Cho, M. Matteucci, A. Courville, and Y. Bengio, "ReNet: A recurrent neural network based alternative to convolutional networks," CoRR, vol. abs/1505.00393, 2015.
[7] W. Liu, A. Rabinovich, and A. C. Berg, "ParseNet: Looking wider to see better," CoRR, vol. abs/1506.04579, 2015.
[8] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick, "Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks," CoRR, vol. abs/1512.04143, 2015.
[9] P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun, "Pedestrian detection with unsupervised multi-stage feature learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3626–3633.
[10] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," CoRR, vol. abs/1512.03385, 2015.
[11] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[12] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[13] R. J. Williams and D. Zipser, "Gradient-based learning algorithms for recurrent networks and their computational complexity," in Back-propagation: Theory, Architectures and Applications, 1995, pp. 433–486.
[14] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
[15] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[16] A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Networks, vol. 18, no. 5, pp. 602–610, 2005.
[17] A. Graves and J. Schmidhuber, "Offline handwriting recognition with multidimensional recurrent neural networks," in Advances in Neural Information Processing Systems, 2009, pp. 545–552.
[18] R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio, "How to construct deep recurrent neural networks," CoRR, vol. abs/1312.6026, 2013.
[19] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
[20] U.-V. Marti and H. Bunke, "The IAM-database: An English sentence database for offline handwriting recognition," International Journal on Document Analysis and Recognition, vol. 5, no. 1, pp. 39–46, 2002.
[21] C. Viard-Gaudin, P. M. Lallican, S. Knerr, and P. Binter, "The IRESTE on/off (IRONOFF) dual handwriting database," in Proceedings of the Fifth International Conference on Document Analysis and Recognition (ICDAR '99). IEEE, 1999, pp. 455–458.
[22] A. Giménez, I. Khoury, J. Andrés-Ferrer, and A. Juan, "Handwriting word recognition using windowed Bernoulli HMMs," Pattern Recognition Letters, vol. 35, pp. 149–156, 2014.
[23] J. Almazán, A. Gordo, A. Fornés, and E. Valveny, "Word spotting and recognition with embedded attributes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 12, pp. 2552–2566, 2014.
[24] S. España-Boquera, M. J. Castro-Bleda, J. Gorbe-Moya, and F. Zamora-Martinez, "Improving offline handwritten text recognition with hybrid HMM/ANN models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 4, pp. 767–779, 2011.