Deep LSTM Networks For Online Chinese Handwriting Recognition
Abstract—Currently, two heavy burdens are borne in online Chinese handwriting recognition: large-scale training data needs to be annotated with the boundaries of each character, and effective features have to be handcrafted by domain experts. To relieve these issues, this paper presents a novel end-to-end recognition method based on recurrent neural networks. A mixed architecture of deep bidirectional Long Short-Term Memory (LSTM) layers and feedforward subsampling layers is used to encode the long contextual history of the trajectories. The Connectionist Temporal Classification (CTC) objective function makes it possible to train the model without providing alignment information between input trajectories and output strings. During decoding, a modified CTC beam search algorithm is devised to integrate linguistic constraints wisely. Our method is evaluated on both the test set and the competition set of CASIA-OLHWDB 2.x. Compared with state-of-the-art methods, over 30% relative error reductions are observed on the test set in terms of both correct rate and accurate rate. Even on the more challenging competition set, better results can be achieved by our method if the out-of-vocabulary problem is ignored.

Keywords—Chinese Handwriting Recognition; Deep Learning; Long Short-Term Memory; Beam Search

I. INTRODUCTION

Online handwriting recognition plays a significant role in entering Chinese text. The pioneering works on online Chinese character recognition were independently studied in 1966 by two groups from MIT and the University of Pittsburgh [1, 2]. Since then, the problem has received extensive attention, owing to the popularity of personal computers and smart devices. Compared with other input methods, handwriting is more natural to human beings, especially to those Chinese people who are not good at Pinyin. Successful applications have been found in pen-based text input [3], overlapped input methods [4, 5], and notes digitalization and retrieval [6].

Most approaches for online Chinese handwriting recognition fall into the segmentation-recognition integrated strategy [7]. First, the textline is over-segmented into primitive segments. Next, consecutive segments are merged and assigned a list of candidate classes by a character classifier [8], which forms a segmentation-recognition lattice. Last, path searching is executed to identify an optimal path by wisely integrating character classification scores, geometric context and linguistic context [9]. Many works attempt to optimize one of these key factors, such as character recognition [10], confidence transformation [11], and parameter learning methods [12, 13]. In [14], the three modules are termed by Vision Objects Ltd as segmentation, recognition, and interpretation experts; they are separately trained and their weighting for the task is scored through a global discriminant training process.

Although the past few years have witnessed great success of the segmentation-recognition integrated strategy, constructing such a system is nontrivial. As a requisite element, large-scale training data needs to be annotated at the character level. The widespread Chinese handwriting databases, such as HIT-MW [15] and CASIA-OLHWDB [16], took years to be ground-truthed. Moreover, features for classification are derived through experts' handcraft. In addition, different components of the system are developed separately, even though they require different data.

As an alternative, the segmentation-free strategy is promising for solving the above issues. There is no need to explicitly segment textlines into characters, so labeling each character's boundaries in the training data becomes unnecessary. Previously, hidden Markov models (HMMs) were successfully used to recognize Chinese handwriting [17] in a segmentation-free way. Recently, the combination of Long Short-Term Memory (LSTM) and Connectionist Temporal Classification (CTC) has become a more powerful sequence labeling tool than HMMs [18]. Messina et al. utilize this tool to transcribe offline Chinese handwriting [19], and the preliminary results are comparable to those previously reported. More recently, convolutional LSTM has been applied to online Chinese handwriting recognition [20] with pleasing results. That system includes four kinds of layers: the input trajectory is first converted into textline images using a path signature layer; convolutional layers then learn features; after that, the features are fed to multiple layers of bidirectional LSTM; and finally, a CTC beam search transforms the network output into labels in the transcription layer.

Inspired by these works on the segmentation-free strategy, this paper presents a novel end-to-end recognition method based on deep recurrent neural networks. Instead of rendering the trajectories in an offline mode [20, 21], we explore the potential of the original pen trajectory using a deep architecture. The architecture mixes bidirectional LSTM layers and feedforward subsampling layers, which are used to encode the long contextual history of the trajectories. We also adopt the CTC objective function, making it possible to train the model without alignment information between input trajectories and output strings. As suggested in [21], explicit linguistic constraints are helpful to end-to-end systems even when trained with …
…where $w$ is the window size and $\mathbf{w}_k^{(n)}$ is the weight vector from the $(n-1)$-th layer to the $k$-th unit in the $n$-th layer.
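The equation this sentence refers to is not shown above; one plausible reading, assumed here purely for illustration, is that each unit of the $n$-th (subsampling) layer projects a window of $w$ consecutive frames of the $(n-1)$-th layer, so that the output sequence becomes about $w$ times shorter. The following NumPy sketch implements that reading; the tanh nonlinearity and all names are assumptions, not the paper's exact formulation.

import numpy as np

def subsample_layer(x, W, b, w):
    """Feedforward subsampling layer (assumed formulation): every w
    consecutive frames of the previous layer are concatenated and projected,
    shortening the sequence by a factor of w.

    x : (T, d_in) activations of the (n-1)-th layer
    W : (d_out, w * d_in) matrix whose rows are the weight vectors w_k
    b : (d_out,) bias vector
    """
    T, d_in = x.shape
    T_out = T // w
    windows = x[:T_out * w].reshape(T_out, w * d_in)   # group w frames per output step
    return np.tanh(windows @ W.T + b)                  # project and squash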
To further exploit the geometric context both from left to right and vice versa, we also adopt a bidirectional recurrent neural network (BRNN) [23], in which each recurrent layer is split into two sub-layers that process the input sequence in the forward and the backward direction, and their outputs are combined before being passed to the next layer.
The standard RNN suffers from the vanishing gradient problem [24]. In our network, LSTM memory blocks [25] are used to replace the nodes of the RNN in the hidden layers. Fig. 3 illustrates a single LSTM memory block. Each memory block includes three gates, one cell, and three peephole connections. The input gate controls how much of the input can flow into the cell, the output gate controls how much of the output can be sent out of the cell, and the forget gate controls the cell's previous state. All of them are nonlinear summation units. In general, the input activation function is the tanh function, and the gate activation function is the sigmoid function, which squeezes the gate activations between 0 and 1.

(Figure 3. A single LSTM memory block, with input, forget and output gates, a memory cell, and peephole connections.)
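To make the data flow of such a block concrete, the following NumPy sketch computes one time step of an LSTM memory block with peephole connections, using the tanh input activation and sigmoid gates described above. The variable names and the exact peephole wiring follow the common formulation and are assumptions rather than code from the paper.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, p, b):
    """One step of an LSTM memory block with peephole connections (sketch).

    W, U : input and recurrent weight matrices for the 'i', 'f', 'o' gates
           and the cell input 'g'
    p    : peephole weight vectors for 'i', 'f', 'o'
    b    : bias vectors
    """
    # input and forget gates peek at the previous cell state
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + p['i'] * c_prev + b['i'])
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + p['f'] * c_prev + b['f'])
    # cell input uses the tanh activation
    g = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])
    # keep part of the old state (forget gate), write part of the input (input gate)
    c = f * c_prev + i * g
    # output gate peeks at the updated cell state
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + p['o'] * c + b['o'])
    h = o * np.tanh(c)
    return h, c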
Substituting into (1), we obtain the empirical loss function. Defining the sensitivity signal at the output layer as $\delta_t^k \triangleq \partial\mathcal{L}/\partial a_t^k$, we can get

$$\delta_t^k = y_t^k - \frac{1}{\Pr(\mathbf{z}\mid\mathbf{X})}\sum_{u\in B(\mathbf{z},k)} \alpha(t,u)\,\beta(t,u), \qquad (6)$$

where $B(\mathbf{z},k)=\{u : \mathbf{z}'_u = k\}$. Based on (6), the sensitivity signals at the other layers can easily be backpropagated through the network.

C. Decoding

The CTC beam search algorithm is modified to integrate linguistic constraints wisely. To reduce the bias of noisy estimation, the extension probability Pr(k,y,t) is rescaled, while the pseudocode in Alg. 1 follows a framework similar to the one proposed in [26]. Define Pr⁻(y,t), Pr⁺(y,t) and Pr(y,t) respectively as the null, non-null and total probabilities assigned to a partial output transcription y at time t. The extension probability Pr(k,y,t) of y by label k at time t is then

$$\Pr(k,\mathbf{y},t) = \Pr(k,t\mid\mathbf{X})\,\bigl[\Pr(k\mid\mathbf{y})\bigr]^{\gamma}\begin{cases}\Pr^{-}(\mathbf{y},t-1), & \mathbf{y}^{e}=k,\\[2pt] \Pr(\mathbf{y},t-1), & \text{otherwise,}\end{cases}$$

where Pr(k,t|X) is the CTC emission probability of k at time t, as defined in (3), Pr(k|y) is the linguistic transition probability from y to y+k, y^e is the final label in y, and γ is the language weight.

Algorithm I. CTC Beam Search
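As a rough illustration of this language-model-weighted prefix beam search, the following Python sketch applies the rescaled extension probability Pr(k,t|X)·Pr(k|y)^γ when extending each prefix. The beam data structure, the lm_prob interface and the pruning policy are illustrative assumptions, not the paper's exact Alg. 1.

from collections import defaultdict

def ctc_beam_search(probs, blank, lm_prob, gamma=1.0, beam_width=10):
    """Prefix beam search over CTC outputs with language-model rescoring.

    probs      : T x K array, probs[t][k] = Pr(k, t | X) from the softmax layer
    blank      : index of the CTC blank label
    lm_prob    : function (prefix_tuple, k) -> Pr(k | prefix), e.g. from a
                 character n-gram model
    gamma      : language weight
    beam_width : number of prefixes kept at every time step
    """
    # each prefix keeps (p_blank, p_nonblank): probability mass whose last
    # emission was a blank / non-blank label
    beams = {(): (1.0, 0.0)}
    for t in range(len(probs)):
        new_beams = defaultdict(lambda: (0.0, 0.0))
        for y, (p_b, p_nb) in beams.items():
            total = p_b + p_nb
            # emit blank: the prefix stays the same
            b0, nb0 = new_beams[y]
            new_beams[y] = (b0 + total * probs[t][blank], nb0)
            for k in range(len(probs[t])):
                if k == blank:
                    continue
                ext = probs[t][k] * (lm_prob(y, k) ** gamma)
                yk = y + (k,)
                if y and y[-1] == k:
                    # repeated label: only the blank-ending mass extends the
                    # prefix; the rest stays on y (CTC label collapsing)
                    b0, nb0 = new_beams[y]
                    new_beams[y] = (b0, nb0 + p_nb * probs[t][k])
                    b1, nb1 = new_beams[yk]
                    new_beams[yk] = (b1, nb1 + p_b * ext)
                else:
                    b1, nb1 = new_beams[yk]
                    new_beams[yk] = (b1, nb1 + total * ext)
        # prune to the beam_width most probable prefixes
        beams = dict(sorted(new_beams.items(),
                            key=lambda kv: kv[1][0] + kv[1][1],
                            reverse=True)[:beam_width])
    best = max(beams.items(), key=lambda kv: kv[1][0] + kv[1][1])[0]
    return list(best)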
… divided into a train set and a test set by the authors of [16]. The train set includes 4,072 pages from 815 people, while the test set includes 1,020 pages from 204 people. There is also a held-out subset used for the ICDAR 2013 Chinese handwriting recognition competition (denoted as comp set) [14]. Characteristics of these datasets are summarized in Table 1. Our systems are evaluated on both the test set and the comp set. Note that there are 5 characters in the test set that are not covered by the train set, while around 2.02% of the characters in the comp set are not covered by the train set.

TABLE I. PROPERTIES OF DATASETS

            Subsets of OLHWDB 2.0~2.2
Items       train       test        comp
#pages      4,072       1,020       300
…

… of the x, y coordinates, along with a state to indicate whether the pen is lifted. As for the start of the trajectory, the first two dimensions are set to zero.
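Only part of the sentence above survives; assuming the first two dimensions are the offsets Δx, Δy between consecutive sampled pen positions (consistent with their being zero at the start of the trajectory) and the third is a pen-up/pen-down state, a minimal feature-extraction sketch would look as follows. The function name and the encoding of the pen state are illustrative assumptions.

import numpy as np

def trajectory_features(points):
    """Convert a pen trajectory into per-step input vectors.

    points : list of (x, y, pen_up) tuples, pen_up = 1 when the pen is lifted
             (assumed encoding)
    Returns an array of shape (T, 3): [dx, dy, pen_state].
    """
    feats = np.zeros((len(points), 3), dtype=np.float32)
    for t, (x, y, pen_up) in enumerate(points):
        if t == 0:
            dx, dy = 0.0, 0.0            # first two dimensions are zero at the start
        else:
            px, py, _ = points[t - 1]
            dx, dy = x - px, y - py      # offsets w.r.t. the previous point
        feats[t] = (dx, dy, float(pen_up))
    return feats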
Three different networks are investigated, as provided in Table 2. The type column reflects the number of hidden layers, and the setting column describes the size of each layer. The suffixes have specific meanings: I for input layer, S for subsampling layer, L for LSTM layer and O for output layer. The last column summarizes the overall number of network parameters (in millions). The three networks are deliberately tuned to be comparable in the number of parameters.

TABLE II. NETWORK SETUPS

Type        Setting                        #para
3-layer     3I-64L-128S-192L-2765O         1.63M
…
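The 3-layer setting 3I-64L-128S-192L-2765O can thus be read as a 3-dimensional input, a 64-unit bidirectional LSTM layer, a 128-unit feedforward subsampling layer, a 192-unit bidirectional LSTM layer and a 2,765-class output layer trained with the CTC loss. The PyTorch-style sketch below assembles such a stack; whether the unit counts are per direction, the subsampling window, and other training details are assumptions made only for illustration.

import torch
import torch.nn as nn

class Subsample(nn.Module):
    """Feedforward subsampling layer: concatenates `window` consecutive
    frames and projects them, shortening the sequence (window size assumed)."""
    def __init__(self, in_dim, out_dim, window=2):
        super().__init__()
        self.window = window
        self.proj = nn.Linear(in_dim * window, out_dim)

    def forward(self, x):                       # x: (batch, T, in_dim)
        b, t, d = x.shape
        t = (t // self.window) * self.window
        x = x[:, :t].reshape(b, t // self.window, d * self.window)
        return torch.tanh(self.proj(x))

class DeepLSTMRecognizer(nn.Module):
    """Sketch of the assumed 3-layer stack 3I-64L-128S-192L-2765O."""
    def __init__(self, n_classes=2765):
        super().__init__()
        self.lstm1 = nn.LSTM(3, 64, bidirectional=True, batch_first=True)
        self.sub = Subsample(2 * 64, 128)
        self.lstm2 = nn.LSTM(128, 192, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * 192, n_classes)

    def forward(self, x):                       # x: (batch, T, 3) trajectory features
        x, _ = self.lstm1(x)
        x = self.sub(x)
        x, _ = self.lstm2(x)
        return self.out(x)                      # per-frame logits for nn.CTCLoss

With nn.CTCLoss, no frame-level alignment between the trajectory and the target character string is required during training.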
… imposed. Comparing the two sets, the improvement brought by language models on the comp set is more remarkable.

Figure 4. The 4th training stage is evaluated on validation set.

Figure 5. Selection of language model (trigram) weight on validation set (curves for the 3-layer, 5-layer and 6-layer networks over LM weights from 0 to 2.3).

TABLE III. EFFECTS OF LANGUAGE MODELS ON BOTH TEST SET AND COMP SET (%)

             No LM            bigram           trigram
Type         AR      CR       AR      CR       AR      CR
test set
3-layer      94.81   95.52    96.13   96.84    96.83   97.45
5-layer      95.08   95.63    96.34   96.92    97.04   97.58
…

Finally, we compare our results with previous works, as shown in Table 4. The systems in the first four rows use the segmentation-recognition integrated strategy. The fifth row employs a hybrid CNN and LSTM architecture with a specific feature-extraction layer. The best results on the test set are achieved by our system. Compared with [20], 1.15% and 1.71% absolute error reductions are observed in CR and AR respectively. Compared with the best results from the segmentation-recognition integrated strategy [13], around 2.3% absolute error reductions are made in both CR and AR. Our system achieves a slightly lower result than the segmentation-recognition integrated strategy on the comp set. It is severely challenging for the segmentation-free strategy to perform on …

TABLE IV. COMPARISON WITH PREVIOUS WORKS (%)

                     test set             comp set
                     AR        CR         AR        CR
Wang2012 [3]         91.97     92.76      --        --
…                    93.40     94.43
Our Method           97.05     97.55      94.65*    95.65*
* removing all characters in the comp set which are not covered by the train set.
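Here CR and AR denote the character-level correct rate and accurate rate. Assuming the definitions conventionally used for Chinese handwriting evaluation, CR = (N − Se − De)/N and AR = (N − Se − De − Ie)/N, where N is the number of reference characters and Se, De, Ie are the substitution, deletion and insertion counts of a minimum edit alignment, the two measures can be computed as in the following sketch.

def error_counts(ref, hyp):
    """Substitution, deletion and insertion counts of a minimum edit
    alignment between reference and hypothesis character sequences."""
    m, n = len(ref), len(hyp)
    # dp[i][j] = (total_errors, subs, dels, ins) for ref[:i] vs hyp[:j]
    dp = [[None] * (n + 1) for _ in range(m + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, m + 1):
        dp[i][0] = (i, 0, i, 0)          # delete all reference characters
    for j in range(1, n + 1):
        dp[0][j] = (j, 0, 0, j)          # insert all hypothesis characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref[i - 1] == hyp[j - 1]:
                cand = [dp[i - 1][j - 1]]                     # match
            else:
                e, s, d, ins = dp[i - 1][j - 1]
                cand = [(e + 1, s + 1, d, ins)]               # substitution
            e, s, d, ins = dp[i - 1][j]
            cand.append((e + 1, s, d + 1, ins))               # deletion
            e, s, d, ins = dp[i][j - 1]
            cand.append((e + 1, s, d, ins + 1))               # insertion
            dp[i][j] = min(cand)
    _, se, de, ie = dp[m][n]
    return se, de, ie

def cr_ar(ref, hyp):
    """Correct rate and accurate rate under the assumed definitions."""
    se, de, ie = error_counts(ref, hyp)
    n = len(ref)
    return (n - se - de) / n, (n - se - de - ie) / n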
F. Error Analysis

The recognition result is analyzed per text line. We partition the ARs of all textlines into 11 intervals and count the number of textlines that fall into each interval, as shown in Fig. 6. The width of an interval is 5% when the AR is larger than 50%. There are 113 text lines in the comp set whose ARs are lower than 75%, and we inspect all of them to get some insights.

(Figure 6. Frequency of textlines falling into each AR interval (…, 70-75, 65-70, 60-65, 55-60, 50-55, <50).)

Generally, the errors are caused by three challenging issues, as shown in Fig. 7. First of all, there are cursively written samples, in which two or more characters may even be joined as a ligature. Moreover, severely skewed or slanted textlines are not easy to deal with, since there are limited training samples covering such phenomena. In addition, there are inadequate samples for
English letters/words, and the punctuations are noisily written. It would be helpful to further reduce or alleviate such issues.

Figure 7. Typical errors extracted from comp set: (a) cursive handwriting and ligature; (b) skew and slant textline; (c) English words and punctuation (e.g., "der Cup WorldPoints)" recognized as "aer crp uiree720i )"). Each example shows the system output above the ground-truth label.

V. CONCLUSIONS

The paper presents a novel end-to-end recognition method for online Chinese handwriting recognition based on deep recurrent neural networks. Unlike previous practices, we directly feed the original pen trajectory into the network. The long contextual dependencies and complex dynamics are intended to be encoded by a hybrid architecture. A modified CTC beam search decoding algorithm is devised to integrate the linguistic constraints wisely. Experiments on a public handwriting database show that our method remarkably outperforms state-of-the-art approaches on the test set. Further evaluating on the more challenging competition set, our results are at least comparable to those already published. In particular, a new performance record can be achieved by our method in situations without the out-of-vocabulary problem.

ACKNOWLEDGMENT

This work was supported by the National Natural Science Foundation of China (Grant No. 61203260), the Natural Science Foundation of Guangdong Province, China (Grant Nos. ZD2015017 and F2016017), and the Fundamental Research Funds for the Central Universities (Grant No. HIT.NSRIF.2015083). The work was also supported by the NVIDIA GPU Education Center program.

REFERENCES

[1] J.-H. Liu, "Real Time Chinese Handwriting Recognition," B.S. thesis, Department of Electrical Engineering, Massachusetts Institute of Technology, Cambridge, MA, 1966.
[2] M. J. Zobrak, "A Method for Rapid Recognition of Hand Drawn Line Patterns," M.S. thesis, University of Pittsburgh, Pittsburgh, PA, 1966.
[3] D.-H. Wang, C.-L. Liu, and X.-D. Zhou, "An approach for real-time recognition of online Chinese handwritten sentences," Pattern Recognition, vol. 45, pp. 3661-3675, 2012.
[4] Y. Zou, Y. Liu, Y. Liu, and K. Wang, "Overlapped Handwriting Input on Mobile Phones," in ICDAR, 2011.
[5] Y. Lv, L. Huang, D. Wang, and C. Liu, "Learning-Based Candidate Segmentation Scoring for Real-Time Recognition of Online Overlaid Chinese Handwriting," in ICDAR, 2013, pp. 74-78.
[6] H. Zhang, D.-H. Wang, and C.-L. Liu, "Character confidence based on N-best list for keyword spotting in online Chinese handwritten documents," Pattern Recognition, vol. 47, pp. 1880-1890, 2014.
[7] M. Cheriet, N. Kharma, C. Liu, and C. Suen, Character Recognition Systems: A Guide for Students and Practitioners. John Wiley & Sons, 2007.
[8] C.-L. Liu, F. Yin, D.-H. Wang, and Q.-F. Wang, "Online and offline handwritten Chinese character recognition: Benchmarking on new databases," Pattern Recognition, vol. 46, pp. 155-162, 2013.
[9] Q.-F. Wang, F. Yin, and C.-L. Liu, "Handwritten Chinese Text Recognition by Integrating Multiple Contexts," IEEE Trans. on PAMI, vol. 34, pp. 1469-1481, 2012.
[10] M.-K. Zhou, X.-Y. Zhang, F. Yin, and C.-L. Liu, "Discriminative quadratic feature learning for handwritten Chinese character recognition," Pattern Recognition, vol. 49, pp. 7-18, 2016.
[11] D.-H. Wang and C.-L. Liu, "Learning confidence transformation for handwritten Chinese text recognition," International Journal on Document Analysis and Recognition, vol. 17, pp. 205-219, 2013.
[12] X.-D. Zhou, D.-H. Wang, F. Tian, C.-L. Liu, and M. Nakagawa, "Handwritten Chinese/Japanese Text Recognition Using Semi-Markov Conditional Random Fields," IEEE Trans. on PAMI, vol. 35, pp. 2413-2426, 2013.
[13] X.-D. Zhou, Y.-M. Zhang, F. Tian, H.-A. Wang, and C.-L. Liu, "Minimum-risk training for semi-Markov conditional random fields with application to handwritten Chinese/Japanese text recognition," Pattern Recognition, vol. 47, pp. 1904-1916, 2014.
[14] F. Yin, Q.-F. Wang, X.-Y. Zhang, and C.-L. Liu, "ICDAR 2013 Chinese Handwriting Recognition Competition," in ICDAR, 2013.
[15] T. Su, T. Zhang, and D. Guan, "Corpus-based HIT-MW Database for Offline Recognition of General-purpose Chinese Handwritten Text," International Journal on Document Analysis and Recognition, vol. 10, pp. 27-38, 2007.
[16] C.-L. Liu, F. Yin, D.-H. Wang, and Q.-F. Wang, "CASIA Online and Offline Chinese Handwriting Databases," in ICDAR, 2011, pp. 37-41.
[17] T. Su, T. Zhang, and D. Guan, "Off-line Recognition of Realistic Chinese Handwriting Using Segmentation-free Strategy," Pattern Recognition, vol. 42, pp. 167-182, 2009.
[18] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber, "A Novel Connectionist System for Unconstrained Handwriting Recognition," IEEE Trans. on PAMI, vol. 31, pp. 855-868, 2009.
[19] R. Messina and J. Louradour, "Segmentation-free handwritten Chinese text recognition with LSTM-RNN," in ICDAR, 2015, pp. 171-175.
[20] Z. Xie, Z. Sun, L. Jin, Z. Feng, and S. Zhang, "Fully Convolutional Recurrent Network for Handwritten Chinese Text Recognition," arXiv preprint, 2016.
[21] Q. Liu, L. Wang, and Q. Huo, "A study on effects of implicit and explicit language model information for DBLSTM-CTC based handwriting recognition," in ICDAR, 2015, pp. 461-465.
[22] A. Graves, Supervised Sequence Labelling with Recurrent Neural Networks. Springer, 2012.
[23] M. Schuster and K. Paliwal, "Bidirectional Recurrent Neural Networks," IEEE Transactions on Signal Processing, vol. 45, pp. 2673-2681, 1997.
[24] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[25] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, pp. 1735-1780, 1997.
[26] A. Graves and N. Jaitly, "Towards End-To-End Speech Recognition with Recurrent Neural Networks," in ICML, 2014, pp. 1764-1772.
[27] A. Stolcke, "SRILM - an extensible language modeling toolkit," in INTERSPEECH, 2002.
[28] M. D. Zeiler, "ADADELTA: An adaptive learning rate method," arXiv preprint arXiv:1212.5701, 2012.