Online Handwritten Mathematical Expressions Recognition by Merging Multiple 1D Interpretations
Abstract—In this work, we propose to recognize handwritten mathematical expressions by merging multiple 1D sequences of labels produced by a sequence labeler. The proposed solution aims at rebuilding a 2D expression from several 1D labeled paths. An online math expression is a sequence of strokes which is later used to build a graph considering both temporal and spatial orders among these strokes. In this graph, a node corresponds to a stroke and an edge denotes the relationship between a pair of strokes. Next, we select 1D paths from the built graph with the expectation that these paths catch all the strokes and the relationships between pairs of strokes. As an advanced and strong sequence classifier, BLSTM networks are adopted to label the selected 1D paths. We set different weights to these 1D labeled paths and then merge them to rebuild a label graph. After that, an additional post-process is performed to complete the edges automatically. We test the proposed solution and compare the results to the state of the art in the online math expression recognition domain.

Keywords—Handwritten mathematical expression; online handwriting; 1D paths; label graph; BLSTM.

I. INTRODUCTION

Handwritten Mathematical Expressions (MEs) recognition is a challenging problem in the field of pattern recognition because it involves both the recognition of a large set of symbol classes and of symbol arrangements in a two-dimensional (2D) layout. According to the different formats of sources of handwritten MEs, the task can be divided into offline (images) recognition and online (inputs written on tablets) recognition. We focus on online ME recognition in this paper. An online ME consists of one or more strokes, which are sequences of points sampled from the trajectory of the writing tool between a pen-down and a pen-up.

This research domain has been boosted by the Competition on Recognition of Handwritten Mathematical Expressions (CROHME) [6], which began as part of the International Conference on Document Analysis and Recognition (ICDAR) in 2011. It provides a platform for researchers to test their methods and compare them, and thus facilitates progress in this field. It attracts increasing participation of research groups from all over the world. In this work, the provided data and tools are used and our results are compared to those of the participants.

Research on the recognition of math notation began in the 1960s, and detailed surveys are provided in [2], [3]. Generally, ME recognition involves three interdependent tasks [3]: (1) symbol segmentation, which consists in grouping the strokes that belong to the same symbol; (2) symbol recognition, the task of labeling the symbols to assign each of them a symbol class; (3) structural analysis, whose goal is to identify spatial relations between symbols and, with the help of a grammar, to produce a mathematical interpretation. These three problems can be solved sequentially or jointly. In the first case, the tasks are processed sequentially, which means that errors from a previous stage are propagated to the next stage. The alternative solutions run these three tasks concurrently and aim at making a final decision on the formula using global information [11], [12], [13]. These approaches seem reasonable and exhibit better performance, as the three problems are highly interdependent by nature: human beings recognize symbols with the help of structure, and vice versa.

Different from the above-mentioned solutions, we propose to rebuild a 2D expression from several 1D paths, from a new perspective. It is possible to describe a ME using a primitive Label Graph (LG) in which nodes represent strokes and labels on the edges encode either segmentation information or layout information. A detailed description can be found in Section II. Given a handwritten expression, which is a sequence of strokes, we first build a graph which catches as many as possible of the edges existing in the ground truth LG, while at the same time limiting the quantity of unnecessary edges. After that, 1D paths are selected from this graph. We expect the set of 1D paths to include all the strokes and the relationships between stroke pairs existing in both the graph and the ground truth. The advanced recurrent neural network BLSTM has been proved to outperform other classifiers in tasks like handwritten text recognition [7] and handwritten math symbol recognition [4], [5]. This strong sequence classifier is adopted to label the selected 1D paths. We set different weights to these 1D labeled paths and then merge them to rebuild a label graph. After that, an additional post-process is performed to complete the edges automatically. Finally, a 2D expression is rebuilt from the merged paths. Grammar constraints are not yet included in our solution.
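As a rough illustration of the label-graph idea (not code from the paper), a primitive LG can be held in a small structure whose node labels carry symbol classes and whose edge labels carry segmentation or layout information. All names below, including the 'Sup' layout label, are illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class LabelGraph:
    # node labels: stroke index -> symbol class (e.g. 'x', '2')
    nodes: dict = field(default_factory=dict)
    # edge labels: (stroke_i, stroke_j) -> segmentation or layout label,
    # e.g. same-symbol, 'Right', 'Sup', 'Sub', or NoRelation
    edges: dict = field(default_factory=dict)

    def add_stroke(self, i, symbol):
        self.nodes[i] = symbol

    def add_edge(self, i, j, label):
        self.edges[(i, j)] = label

# toy example: a one-stroke 'x' followed by a superscript '2'
lg = LabelGraph()
lg.add_stroke(0, 'x')
lg.add_stroke(1, '2')
lg.add_edge(0, 1, 'Sup')
```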
Figure 3. Five directions for a stroke si. Point (0, 0) is the center of the bounding box of si. The angle of each region is π/4.

Definition 2.4. Let G be a directed graph in which each node corresponds to a stroke and edges are added according to the following criteria, in succession. We define for each stroke si (i from 1 to n−1):
• the set of crossing strokes Scro(i) = {scro1, scro2, ...} from {si+1, ..., sn};
• the set of closest strokes Sclo(i) = {sclo} from {si+1, ..., sn} − Scro(i).
For each stroke si (i from 1 to n):
• the set Svis(i) of the closest visible strokes in the five directions, respectively, if they exist in {s1, ..., sn} − ({si} ∪ Scro(i) ∪ Sclo(i)). Here, the closeness of two strokes is decided by the distance between the centers of their bounding boxes, differently from Definition 2.2.

Edges from si to the strokes in Scro(i) ∪ Sclo(i) ∪ Svis(i) are added to G. Finally, we check whether the edge from si to si+1 (i from 1 to n−1) exists in G. If not, this edge is added to G to ensure that the path covering the sequence of strokes in time order is included in G.

An example is presented in Figure 4. The mathematical expression (d/dx)x^a is written with 8 strokes. From this sequence of 8 strokes, the graph shown in Figure 4c is generated following the above-mentioned criteria. Comparing to the ground truth, we can see that all the edges in Figure 4b are included in the regenerated graph except the edges from stroke 4 to 3 and from stroke 7 to 6 (dashed edges). This flaw can be overcome as long as strokes 3 and 4 and the edge from stroke 3 to 4 are correctly recognized: if strokes 3 and 4 are recognized as the same symbol, the edge from stroke 4 to 3 can be completed automatically. Likewise, the edge from stroke 7 to 6 can be added. At the same time, Figure 4c includes some unnecessary edges (dotted edges) when matched to the ground truth. The graph we build should include as many of the ground truth edges as possible while simultaneously reducing the number of unnecessary edges.

C. Select paths from G

The final aim of our work is to rebuild the LG of a 2D expression. We propose to perform it by merging several paths from G. These paths should cover all the nodes and the edges of the ground truth LG (at least the edges of the ground truth SRT). With correctly recognized node and edge labels, we have the possibility to rebuild a correct 2D expression. A path in G can be defined as Φi = (n0, n1, n2, ..., ne), where n0 is the starting node and ne is the end node. The node set of Φi is n(Φi) = {n0, n1, n2, ..., ne} and the edge set of Φi is e(Φi) = {n0 → n1, n1 → n2, ..., ne−1 → ne}.

Two types of paths are selected in this work: the time path and random paths. As described in [14], the time path covers all the nodes but could miss some edges; random paths are used to catch more edges. Although this combination is not the best solution, it is the simplest and most intuitive one; a better solution will be explored in the future. The time path starts from the first input stroke and ends with the last input stroke, following the time order. For example, in Figure 4c, the time path is (0, 1, 2, 3, 4, 5, 6, 7). Then, we consider several additional random paths. To ensure a good coverage of the graph, we guide the random selection by choosing the less visited nodes and edges. One random path can be (1, 5, 6, 7).

III. BLSTM INPUTS

Next, a strong sequence labeler should be chosen to label the time and random paths. The advanced recurrent neural network BLSTM combines the advantages of BRNNs and LSTM, and thus offers access to long-range context in two directions. As a consequence, this advanced architecture outperforms other models in several tasks [7], [4], [5].

In order to visualise the RNN and understand how it operates, we unfold it along the input sequence and illustrate a part of the unfolded network in Figure 5. To ease the illustration, only one forward layer is presented. Here, each node represents a layer of network units at a single time-step. The input at a single time-step is a feature vector; for the whole process, the input is a time sequence of feature vectors. The output at a single time-step is a probability vector in which each element is the probability of belonging to some class; the overall output is a sequence of probability vectors. The same weights (w1, w2, w3) are reused at every time-step.

To feed the inputs of the BLSTM, it is important to scan both the points belonging to the strokes themselves (on-paper points) and the points separating one stroke from the next one (in-air points). We expect that the on-paper points will be labeled with the corresponding symbol labels and that the in-air points will be assigned one of the possible edge labels. Thus, besides re-sampling points from strokes, we also re-sample points from the straight line which links two strokes, as can be seen in Figure 6. In the rest of this paper, strokeD and strokeU are used to denote a re-sampled pen-down stroke and a re-sampled pen-up stroke, for convenience.

Given each path, we first re-sample points both from the visible strokes and between two successive visible strokes in the path. The 1D unlabeled sequence can be described as {strokeD1, strokeU2, strokeD3, strokeU4, ..., strokeDK}, with K the number of re-sampled strokes. Note that if no
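The construction of the 1D input sequence described in Section III (on-paper strokeD points plus in-air strokeU points along the straight segment between successive strokes) might look as follows. The uniform index-based re-sampling and the point counts are simplifying assumptions, not the paper's exact spatial re-sampling:

```python
def resample(points, n):
    # naive index-based re-sampling (stand-in for the paper's
    # spatial re-sampling described in Section V)
    if n <= 1 or len(points) == 1:
        return [points[0]]
    step = (len(points) - 1) / (n - 1)
    return [points[round(i * step)] for i in range(n)]

def in_air(p, q, n):
    # n in-air points on the straight segment from pen-up point p
    # to the next pen-down point q
    return [(p[0] + (q[0] - p[0]) * t / (n - 1),
             p[1] + (q[1] - p[1]) * t / (n - 1)) for t in range(n)]

def path_sequence(strokes, n_on=8, n_air=4):
    # build the alternating strokeD / strokeU point sequence for one path
    seq = []
    for k, s in enumerate(strokes):
        seq += [(pt, 'strokeD') for pt in resample(s, n_on)]  # on-paper
        if k + 1 < len(strokes):                               # in-air
            seq += [(pt, 'strokeU')
                    for pt in in_air(s[-1], strokes[k + 1][0], n_air)]
    return seq
```

Each point would then be turned into the feature vector of Section V before being fed to the BLSTM.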
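Returning to the path selection of Section II, the time path and the coverage-guided random paths could be sketched as below. Tracking traversal counts per edge is one plausible reading of "choosing the less visited nodes and edges", not necessarily the authors' exact rule:

```python
def time_path(n_strokes):
    # the time path simply follows the writing order of the strokes
    return list(range(n_strokes))

def guided_random_path(graph, start, edge_visits):
    # greedy walk preferring the least-traversed outgoing edge, so that
    # successive paths cover different parts of the graph
    path, node = [start], start
    while graph.get(node):
        nxt = min(graph[node], key=lambda m: edge_visits.get((node, m), 0))
        edge_visits[(node, nxt)] = edge_visits.get((node, nxt), 0) + 1
        path.append(nxt)
        node = nxt
    return path
```

On a graph shaped like the Figure 4 example, a first walk from stroke 1 follows the dense chain, and a second walk then prefers the untried edge towards stroke 5, yielding a path such as (1, 5, 6, 7).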
Figure 4. (a) (d/dx)x^a is written with 8 strokes; (b) the LG from the ground truth; (c) the graph built using our method; dotted edges are unnecessary and dashed edges are missed compared to the ground truth.
corresponding edge exists, the label NoRelation, defined as ' ', is used.

In our case, at a time-step, the input to the BLSTM is the feature vector extracted from one point; the output is the probabilities of this point belonging to the different classes.

IV. RECOGNITION

As Figure 5 shows, since the RNN outputs local classifications for each point, some form of post-processing needs to be done. The connectionist temporal classification (CTC) [9] technique is a good choice for sequence transcription tasks. It outputs the probabilities of complete sequences directly but does not provide the alignment between the inputs and the target labels. In our case, we need the labels of each point.

Let S be the set of nodes and edges of path Φi, and 1S the corresponding indicator function:

S = n(Φi) ∪ e(Φi)    (3)

1S(i) = 1 if i ∈ S, 0 otherwise    (4)

WΦi is the weight for path Φi and label(s, Φi) is the set of candidate labels for stroke s from path Φi, with 1 ≤ |label(s, Φi)| ≤ 3. If stroke s exists in path Φi but l ∉ label(s, Φi), then PΦi,s(l) is 0: the classifier answers that there is no possibility of outputting label l for stroke s from path Φi. We still add WΦi into the normalized factor of Ps(l). If stroke s does not exist in path Φi, the classifier's answer for stroke s is unknown and we should not take this path Φi into account; thus, WΦi is not added into the
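The merging rule described here, where a path's weight WΦi enters the normalization only when the path contains stroke s, and a candidate label the path does not propose gets probability 0, can be sketched as below. The data layout and names are illustrative assumptions:

```python
def merge_stroke_probs(stroke, labeled_paths):
    # labeled_paths: list of (weight, node_set, probs) where probs maps
    # stroke -> {label: probability} for the strokes seen on that path
    scores, total_w = {}, 0.0
    for w, nodes, probs in labeled_paths:
        if stroke not in nodes:
            continue   # path says nothing about this stroke: skip weight too
        total_w += w   # a path containing the stroke always contributes
                       # to the normalization, even for missing labels
        for label, p in probs.get(stroke, {}).items():
            scores[label] = scores.get(label, 0.0) + w * p
    return {l: v / total_w for l, v in scores.items()} if total_w else {}

def best_label(stroke, labeled_paths):
    # after normalization, keep the label with the maximum probability
    merged = merge_stroke_probs(stroke, labeled_paths)
    return max(merged, key=merged.get) if merged else None
```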
normalized factor of Ps(l). After normalization, the label with the maximum probability is selected for each stroke.

Finally, post-processing should be done in order to rebuild a valid LG, i.e. to add edges. We first look for the segments (symbols) using connected component analysis: a connected component in which nodes and edges have the same label is a symbol. With regard to the relationship between two symbols, we choose the label having the maximum accumulated probability among the edges between the two symbols. Then, according to the rules that all strokes in a symbol have the same input and output edges and that double-direction edges represent segments, some missing edges can be completed automatically.

Table I
THE STRATEGIES TO LABEL DIFFERENT TYPES OF PATHS.

exp.  Time     Random
1     CLS_T    -
2     CLS_T    CLS_T
3     CLS_T    CLS_R
4     CLS_T+R  CLS_T+R

Table II
THE SYMBOL LEVEL EVALUATION RESULTS ON CROHME 2014 TEST SET, INCLUDING THE EXPERIMENT RESULTS IN THIS WORK AND CROHME 2014 PARTICIPANT RESULTS (TOP 4 BY RECALL OF SEGMENTS).

       Segments (%)    Seg + Class (%)  Tree Rels. (%)
exp.   Rec.   Prec.    Rec.   Prec.     Rec.   Prec.
1      92.30  85.15    83.70  77.22     58.40  74.27
2      92.32  85.28    84.37  77.94     67.79  58.34
3      92.77  85.99    85.17  78.95     67.79  67.33
4      91.51  83.48    82.96  75.67     66.95  62.06
system CROHME 2014 participant results (Top 4)
III    98.42  98.13    93.91  93.63     94.26  94.01
I      93.31  90.72    86.59  84.18     84.23  81.96
VII    89.43  86.13    76.53  73.71     71.77  71.65
V      88.23  84.20    78.45  74.87     61.38  72.70

Table III
THE EXPRESSION LEVEL EVALUATION RESULTS ON CROHME 2014 TEST SET, INCLUDING THE EXPERIMENT RESULTS IN THIS WORK AND CROHME 2014 PARTICIPANT RESULTS (TOP 2).

exp.   correct (%)  ≤ 1 error  ≤ 2 errors  ≤ 3 errors
1      11.80        19.33      26.55       31.43
2      8.55         16.89      23.40       29.91
3      13.02        22.48      30.21       35.71
4      11.19        19.13      26.04       31.13
system CROHME 2014 participant results (Top 2)
III    62.68        72.31      75.15       76.88
I      37.22        44.22      47.26       50.20

V. EXPERIMENTS

A stroke is a sequence of points sampled from the trajectory of a writing tool between a pen-down and a pen-up at a fixed time interval. An additional re-sampling is then performed with a fixed spatial step to get rid of the writing speed. The number of re-sampled points depends on the size of the expression. For each path, we re-sample with 10 × (length/avrdiagonal) points, where length refers to the length of all the strokes in the path (including the distance between successive strokes) and avrdiagonal refers to the average diagonal of the bounding boxes of all the strokes in the expression.

Subsequently, we compute five features per point, quite close to the state of the art [4], [11]. For every point p(x, y) we obtain the 5 features [sin θ, cos θ, sin φ, cos φ, PenUD]. A detailed description can be found in [14].

We use the RNNLIB library¹ for training a BLSTM model. For each training process, the network having the best classification error on the validation data set is saved. Then, we test this network on the test data set. The Label Graph Evaluation library (LgEval) [1] is adopted to evaluate the recognition output. The detailed description of the network architecture and configuration can be found in [14]. This configuration has obtained good results in both handwritten text recognition [7] and handwritten math symbol classification [4], [5].

¹ Graves A. RNNLIB: A recurrent neural network library for sequence learning problems. http://sourceforge.net/projects/rnnl/.

With regard to the data sets, we use the CROHME 2014 train set (8834 expressions) for training, the CROHME 2013 test set (670 expressions) for validation and the CROHME 2014 test set (983 expressions) for testing. The output layer size in this experiment is 108 (101 symbol classes + 6 relationships + NoRelation). For each expression, we extract the time path and 6 random paths. In total, 3 classifiers are trained, with only the time path, with only the 6 random paths, and with the time path plus the 6 random paths, respectively, in order to evaluate the effect of the training set content. These classifiers are denoted CLS_T, CLS_R and CLS_T+R. We use these different classifiers to label the different types of paths extracted from the test set, as presented in Table I. In exp. 1, only the labeled time path is used to rebuild a 2D expression. For the other experiments, the time path and random paths are merged to obtain a final result. The weight of the time path is set to 0.4 and that of each random path to 0.1.

The evaluation results at the symbol level are provided in Table II, including recall (Rec.) and precision (Prec.) rates for symbol segmentation (Segments), symbol segmentation and recognition (Seg + Class), and spatial relationship classification (Tree Rels.). A correct spatial relationship between two symbols requires that both symbols are correctly segmented and carry the right relationship label. As presented, the results for Segments and Seg + Class do not show a big difference among the 4 experiments. This indicates that the time path is enough to give good results and that the random paths contribute little. With regard to Tree Rels., the Rec. of exps. 2, 3 and 4 is improved compared to exp. 1 because random paths catch some ground truth edges which are missed by the time path; but the Prec. rate declines, which means that random paths also cover
some edges which are not in the ground truth LG. Unexpectedly, these excess edges are not labeled as NoRelation. Among the 4 experiments, exp. 3 outperforms the others on all items. Thus, it is a better strategy to use CLS_T for labeling the time path and CLS_R for the random paths. Our results are comparable to the results of CROHME 2014 because the same training and testing data sets are used. The second part of Table II gives the symbol level evaluation results of the top 4 participants in CROHME 2014, sorted by the recall rate for correct symbol segmentation. The best Rec. of Segments and Seg + Class reported by CROHME 2014 are 98.42% and 93.91% respectively. Ours are 92.77% and 85.17%, both ranked 3rd out of 8 systems (7 participants in CROHME 2014). Our solution presents competitive results on the symbol recognition and segmentation tasks, but not on the relationship detection and recognition task. However, our proposal still shows promising results, because the Tree Rels. recognition rate will improve as long as a high quality graph is built and a good path selection method is adopted. This is also a part of our future work.

Table III shows the recognition rates at the global expression level with no error, and with at most one to three errors in the labels of the LG. This metric is very strict: for example, one label error can happen only on a one-stroke symbol or in the relationship between two one-stroke symbols; a labeling error on a 2-stroke symbol leads to 4 errors (2 node labels and 2 edge labels). Our recognition rate with no error on the CROHME 2014 test set is 13.02%. The best and worst ones reported by CROHME 2014 are 62.68% and 15.01%. With regard to the recognition rate with ≤ 3 errors, 4 participants are between 27% and 37%, and our result is 35.71%. Considering that no grammar constraint is included at this moment, the results are understandable and will improve once this defect is overcome.

VI. CONCLUSION

We recognize 2D handwritten mathematical expressions by merging multiple 1D labeled paths in this paper. Given an expression, we propose an algorithm to generate a graph using both temporal and spatial orders among strokes. Then, different types of paths are selected from the graph. As a strong sequence labeller, a BLSTM is used to recognize these paths. Finally, we merge the results from the different paths to rebuild a math expression. Currently, we do not use any grammar knowledge (even local grammar) to recognize expressions. Our proposal presents competitive results on the symbol recognition and segmentation tasks, and promising results on the relationship recognition task. Undeniably, the recognition rate is low at the global expression level at this moment. However, it will certainly improve as the graph and paths catch more ground truth edges, i.e. more variations of stroke order.

In the future, we will evaluate the built graph by comparing it to the ground truth and then modify it in order to cover more and more ground truth edges while reducing unnecessary edges. A better strategy for selecting paths will also be explored. Later, a grammar model will be integrated into the recognition system.

ACKNOWLEDGMENT

We would like to thank the China Scholarship Council for supporting a PhD studentship at Nantes University.

REFERENCES

[1] Mouchère H., Viard-Gaudin C., Zanibbi R., et al. ICFHR 2014 Competition on Recognition of On-line Handwritten Mathematical Expressions (CROHME 2014). ICFHR, 2014.
[2] Chan K. F., Yeung D. Y. Mathematical expression recognition: a survey. IJDAR, 2000, 3(1): 3-15.
[3] Zanibbi R., Blostein D. Recognition and retrieval of mathematical expressions. IJDAR, 2012, 15(4): 331-357.
[4] Álvaro F., Sánchez J. A., Benedí J. M. Classification of on-line mathematical symbols with hybrid features and recurrent neural networks. 12th ICDAR. IEEE, 2013: 1012-1016.
[5] Álvaro F., Sánchez J. A., Benedí J. M. Offline features for classifying handwritten math symbols with recurrent neural networks. 22nd ICPR. IEEE, 2014: 2944-2949.
[6] Mouchère H., Zanibbi R., Garain U., Viard-Gaudin C. Advancing the state of the art for handwritten math recognition: the CROHME competitions, 2011-2014. IJDAR, 2016, 19(2): 173-189.
[7] Graves A., Liwicki M., Fernández S., et al. A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on PAMI, 2009, 31(5): 855-868.
[8] Graves A. Supervised sequence labelling with recurrent neural networks. Heidelberg: Springer, 2012.
[9] Graves A., Fernández S., Gomez F., et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. Proceedings of the 23rd ICML. ACM, 2006: 369-376.
[10] Zanibbi R., Mouchère H., Viard-Gaudin C. Evaluating structural pattern recognition for handwritten math via primitive label graphs. IS&T/SPIE Electronic Imaging, International Society for Optics and Photonics, 2013: 865817.
[11] Awal A. M., Mouchère H., Viard-Gaudin C. A global learning approach for an online handwritten mathematical expression recognition system. PRL, 2014, 35: 68-77.
[12] Álvaro F., Sánchez J. A., Benedí J. M. Recognition of on-line handwritten mathematical expressions using 2D stochastic context-free grammars and hidden Markov models. PRL, 2014, 35: 58-67.
[13] Álvaro F., Sánchez J. A., Benedí J. M. An integrated grammar-based approach for mathematical expression recognition. PR, 2016, 51: 135-147.
[14] Zhang T., Mouchère H., Viard-Gaudin C. Using BLSTM for interpretation of 2D languages: case of handwritten mathematical expressions. CORIA-CIFED, 2016: 217-232.