
2016 15th International Conference on Frontiers in Handwriting Recognition

Deep LSTM Networks for Online Chinese Handwriting Recognition

Li Sun, Tonghua Su, Ce Liu, Ruigang Wang


School of Software
Harbin Institute of Technology
Harbin, China
[email protected]

Abstract—Online Chinese handwriting recognition currently bears two heavy burdens: large-scale training data must be annotated with the boundaries of each character, and effective features must be handcrafted by domain experts. To relieve these issues, this paper presents a novel end-to-end recognition method based on recurrent neural networks. A mixed architecture of deep bidirectional Long Short-Term Memory (LSTM) layers and feedforward subsampling layers is used to encode the long contextual history of the trajectories. The Connectionist Temporal Classification (CTC) objective function makes it possible to train the model without providing alignment information between input trajectories and output strings. During decoding, a modified CTC beam search algorithm is devised to integrate the linguistic constraints wisely. Our method is evaluated on both the test set and the competition set of CASIA-OLHWDB 2.x. Compared with state-of-the-art methods, over 30% relative error reduction is observed on the test set in terms of both correct rate and accurate rate. Even on the more challenging competition set, better results can be achieved by our method if the out-of-vocabulary problem is ignored.

Keywords- Chinese Handwriting Recognition; Deep Learning; Long Short-Term Memory; Beam Search

I. INTRODUCTION

Online handwriting recognition plays a significant role in entering Chinese text. The pioneering works on online Chinese character recognition were independently studied in 1966 by two groups from MIT and the University of Pittsburgh [1, 2]. Since then, the topic has received extensive attention, owing to the popularity of personal computers and smart devices. Compared with other input methods, handwriting is more natural to human beings, especially to those Chinese people who are not good at Pinyin. Successful applications have been found in pen-based text input [3], overlapped input methods [4, 5], and notes digitalization and retrieval [6].

Most approaches for online Chinese handwriting recognition fall into the segmentation-recognition integrated strategy [7]. First, the textline is over-segmented into primitive segments. Next, consecutive segments are merged and assigned a list of candidate classes by a character classifier [8], which forms a segmentation-recognition lattice. Last, path searching is executed to identify an optimal path by wisely integrating character classification scores, geometric context, and linguistic context [9]. Many works attempt to optimize one of the key factors, such as character recognition [10], confidence transformation [11], and parameter learning methods [12, 13]. In [14], these three modules are termed by Vision Objects Ltd as segmentation, recognition, and interpretation experts; they are separately trained, and their weighting in the task is scored through a global discriminant training process.

Although the past few years have witnessed great success of the segmentation-recognition integrated strategy, constructing such a system is nontrivial. As a requisite element, large-scale training data needs to be annotated at the character level. The widespread Chinese handwriting databases, such as HIT-MW [15] and CASIA-OLHWDB [16], took years to be ground-truthed. Moreover, features for classification are derived through experts' handcraft. In addition, different components of the system are developed separately, even though they require different data.

As an alternative, the segmentation-free strategy is promising for solving the above issues. There is no need to explicitly segment textlines into characters, so labeling character boundaries in the training data is unnecessary. Previously, hidden Markov models (HMMs) were successfully used to recognize Chinese handwriting [17] in a segmentation-free way. Recently, the combination of Long Short-Term Memory (LSTM) and Connectionist Temporal Classification (CTC) has become a more powerful sequence labeling tool than HMMs [18]. Messina et al. utilize this tool to transcribe offline Chinese handwriting [19], and the preliminary results are comparable to those previously reported. More recently, convolutional LSTM has been applied to online Chinese handwriting recognition [20] with pleasing results. That system includes four kinds of layers: the input trajectory is first converted into textline images using a path signature layer; convolutional layers are then concatenated to learn features; after that, the features are fed to multiple layers of bidirectional LSTM; and finally, CTC beam search is used in a transcription layer to transform the network output into labels.

Inspired by the works in the segmentation-free strategy, this paper presents a novel end-to-end recognition method based on deep recurrent neural networks. Instead of rendering the trajectories in an offline mode [20, 21], we explore the capacity of the original pen trajectory using a deep architecture. The architecture mixes bidirectional LSTM layers and feedforward subsampling layers, which are used to encode the long contextual history of the trajectories. We also adopt the CTC objective function, making it possible to train the model without alignment information between input trajectories and output strings. As suggested in [21], explicit linguistic constraints are helpful to end-to-end systems even when trained with large-scale training data.
To integrate the linguistic constraints wisely, a modified CTC beam search decoding algorithm is devised. Experiments show that our method reduces relative errors by over 30% on the test set. Even on the more challenging competition set, state-of-the-art performance can be achieved by our method.

The rest of the paper is organized as follows. Section 2 provides a bird's-eye view of the whole system. Section 3 details the proposed models and algorithms. Section 4 presents the experimental results, and finally conclusions are given in Section 5.
II. SYSTEM OVERVIEW

The proposed system aims to map a fed pen trajectory X (= x1 x2 ... xT) to a textual string l following an end-to-end paradigm. This process is denoted as prediction in Fig. 1. Supposing an LSTM model has been learned, we can feed it an unseen trajectory textline. The LSTM network is propagated from the input layer to the output layer through the hidden layers. As the final step, CTC beam search is performed to output the most probable strings. The performance on each trajectory is measured based on the alignment between l and the ground-truthed string z.

Figure 1. System diagram of the proposed paradigm. (Learning: the OLHWDB train set and a textual corpus are used to train the LSTM character model and the n-gram language model. Prediction: a trajectory is propagated forward through the LSTM and decoded into a string by CTC beam search.)
The training process in Fig. 1 derives LSTM models minimizing the empirical loss function:

    L(S) = \sum_{(X, z) \in S} L(X, z) ,    (1)

with S denoting the training set. Because the problem is a multi-class classification, we use L(X, z) = -\ln \Pr(z|X) as the example loss function, which can be effectively computed by CTC and the forward-backward algorithm. Training data in the CASIA-OLHWDB database [16] is used to derive the classification model. We also estimate n-gram language models from a textual corpus in advance, which are used to facilitate the search process during decoding.

III. MODELLING

A. Architecture

We choose a deep recurrent neural network (RNN) to derive the mapping function, as it is specialized for processing a sequence of vectors. The history information is encoded through the recurrent hidden layers. The hidden variable vector at time step t in the n-th hidden layer, h_t^n, is iteratively computed from the (n-1)-th hidden variable vector h_t^{n-1} (with h_t^0 = x_t) and the previous hidden variable vector h_{t-1}^n:

    h_t^n = \tanh(W_I^n h_t^{n-1} + U_H^n h_{t-1}^n) ,    (2)

where W_I^n is the weight matrix from the (n-1)-th hidden layer to the n-th hidden layer and U_H^n is the self-connected recurrent weight matrix. Since h_t^n is defined recursively, dependencies on input vectors ranging from the 1st to the t-th time step can be accessed. The network output vector at time t is computed as:

    y^t = \mathrm{softmax}(a^t) = \mathrm{softmax}(W_O h_t^L) ,    (3)

where W_O is the weight matrix from the last hidden layer to the output layer. Note that one extra output element is reserved for {null}, which will be used in the CTC algorithm.

Subsampling layers are worth exploring considering the large length of the trajectories and the variability in writing styles. Herein we devise feedforward-like neuron units that allow inputs from a window of consecutive timesteps to be collapsed; this is a simplified version of [22]. The k-th hidden variable at time step w×t in the n-th subsampling layer is computed as:

    h_{n(k)}^{w \times t} = \tanh\Big( \sum_{i=1}^{w} w_{n(k)}^{i} \cdot h_{n-1}^{w \times t + i} \Big) ,    (4)

where w is the window size and w_{n(k)}^{i} is the weight vector from the (n-1)-th layer to the k-th unit in the n-th layer.

To further benefit from the geometric context both from left to right and vice versa, we also consider a bidirectional recurrent neural network (BRNN) [23], where each recurrent layer is replaced with a forward layer and a backward layer. The forward pass is the same as usual, while the backward pass processes data from t = T to 1. The recurrent formulation results in the sharing of parameters through all time steps [22]. Fig. 2 shows a typical structure of the unfolded recurrent neural network.

Figure 2. A typical structure of the unfolded BRNN (with 5 hidden layers).
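To make Eqs. (2)-(4) concrete, the following is a minimal NumPy sketch of one plain recurrent layer, one subsampling layer, and the softmax output of Eq. (3). It is only an illustration under assumed toy sizes: the actual model replaces the plain recurrence with LSTM memory blocks (described below) and uses bidirectional layers, and none of the dimensions, initializations, or function names here are taken from the authors' implementation.

import numpy as np

def recurrent_layer(X, W_I, U_H):
    """Plain recurrent layer of Eq. (2): h_t^n = tanh(W_I h_t^{n-1} + U_H h_{t-1}^n).
    X: (T, d_in) sequence from the layer below; returns (T, d_out)."""
    T, d_out = X.shape[0], U_H.shape[0]
    H = np.zeros((T, d_out))
    h_prev = np.zeros(d_out)
    for t in range(T):
        h_prev = np.tanh(W_I @ X[t] + U_H @ h_prev)
        H[t] = h_prev
    return H

def subsampling_layer(X, W, w=4):
    """Feedforward subsampling of Eq. (4): collapse each window of w timesteps
    (indices counted from 0 here). W has shape (w, d_out, d_in); returns T // w frames."""
    T = (X.shape[0] // w) * w
    out = [np.tanh(sum(W[i] @ X[t + i] for i in range(w))) for t in range(0, T, w)]
    return np.array(out)

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy forward pass with made-up sizes: 3-dim input, one recurrent layer,
# one subsampling layer, then the per-frame softmax of Eq. (3).
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 3))                      # T = 40 timesteps, 3 features
H1 = recurrent_layer(X, rng.standard_normal((16, 3)), rng.standard_normal((16, 16)) * 0.1)
H2 = subsampling_layer(H1, rng.standard_normal((4, 8, 16)) * 0.1, w=4)
Y = softmax(H2 @ rng.standard_normal((8, 5)))         # per-frame class posteriors
print(Y.shape)                                        # (10, 5)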

Standard RNNs suffer from the vanishing gradient problem [24]. In our network, LSTM memory blocks [25] are used to replace the nodes of the RNN in the hidden layers. Fig. 3 illustrates a single LSTM memory block. Each memory block includes three gates, one cell, and three peephole connections. The input gate controls how much of the input can flow into the cell, the output gate controls how much of the output can be sent out of the cell, and the forget gate controls the cell's previous state. All of them are nonlinear summation units. In general, the input activation function is the tanh function and the gate activation function is the sigmoid function, which squeezes the gate data between 0 and 1.

Figure 3. Memory block of the LSTM network.
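For reference, the gating just described corresponds to the standard peephole LSTM formulation (see, e.g., [22, 25]); one block at time step t can be written as follows, where \sigma is the logistic sigmoid and \odot denotes element-wise multiplication. The exact bias and peephole conventions of the authors' implementation are not spelled out in the paper, so this is the textbook form rather than a verified transcription:

    i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + w_{ci} \odot c_{t-1} + b_i)            (input gate)
    f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + w_{cf} \odot c_{t-1} + b_f)            (forget gate)
    c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)      (cell state)
    o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + w_{co} \odot c_t + b_o)                (output gate)
    h_t = o_t \odot \tanh(c_t)                                                        (block output)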
B. Learning

Connectionist Temporal Classification (CTC) is an objective function that allows an RNN to be trained without requiring any prior alignment between input and target sequences [26]. Assuming all labels are drawn from an alphabet A, define the extended alphabet A' = A ∪ {null}. If the output is null, the network emits no label at that timestep. A CTC mapping function F is defined to remove repeated labels and then delete the nulls from each output sequence. Providing that the null label is injected into the label sequence z, we get a new sequence z' of length 2|z|+1.

CTC then uses a forward-backward algorithm [22] to sum over all possible alignments and determine the conditional probability Pr(z|X) of the target sequence given the input sequence. For a labelling z, the forward variable α(t, u) is defined as the summed probability of all length-t paths that are mapped by F onto the length-⌊u/2⌋ prefix of z. Conversely, the backward variable β(t, u) is defined as the summed probability of all paths starting at t+1 that complete z when appended to any path contributing to α(t, u). The conditional probability Pr(z|X) can thus be expressed as:

    \Pr(z|X) = \sum_{u=1}^{|z'|} \alpha(t, u) \beta(t, u) .    (5)

Substituting into (1), we obtain the empirical loss function. Defining the sensitivity signal at the output layer as \delta_k^t \triangleq \partial L / \partial a_k^t, we get:

    \delta_k^t = y_k^t - \frac{1}{\Pr(z|X)} \sum_{u \in B(z, k)} \alpha(t, u) \beta(t, u) ,    (6)

where B(z, k) = {u : z'_u = k}. Based on (6), the sensitivity signals at the other layers can be easily backpropagated through the network.
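The probability Pr(z|X) in Eq. (5) can also be obtained from the forward recursion alone; the backward pass needed for the sensitivity signal of Eq. (6) is analogous. Below is a minimal NumPy sketch in the log domain, assuming class 0 plays the role of {null}; the sizes and the random toy posteriors are illustrative only:

import numpy as np

def ctc_forward(log_probs, z, blank=0):
    """Minimal CTC forward pass: log Pr(z|X) from per-frame posteriors.
    log_probs: (T, K) log-softmax network outputs (Eq. 3); z: labels without nulls."""
    T = log_probs.shape[0]
    zp = [blank]
    for c in z:                      # null-augmented sequence z' of length 2|z|+1
        zp += [c, blank]
    U = len(zp)
    alpha = np.full((T, U), -np.inf)
    alpha[0, 0] = log_probs[0, blank]
    if U > 1:
        alpha[0, 1] = log_probs[0, zp[1]]
    for t in range(1, T):
        for u in range(U):
            cands = [alpha[t - 1, u]]
            if u > 0:
                cands.append(alpha[t - 1, u - 1])
            if u > 1 and zp[u] != blank and zp[u] != zp[u - 2]:
                cands.append(alpha[t - 1, u - 2])
            alpha[t, u] = np.logaddexp.reduce(cands) + log_probs[t, zp[u]]
    return np.logaddexp(alpha[T - 1, U - 1], alpha[T - 1, U - 2])

# Toy usage with random posteriors over 5 classes (class 0 stands in for {null}).
rng = np.random.default_rng(0)
logits = rng.standard_normal((20, 5))
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
print(ctc_forward(log_probs, [3, 1, 1, 2]))   # log Pr(z|X)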
C. Decoding

The CTC beam search algorithm is modified to integrate linguistic constraints wisely. To reduce the bias of noisy estimation, the extension probability Pr(k, y, t) is rescaled, while the pseudocode in Alg. I follows the framework proposed in [26]. Define Pr^-(y, t), Pr^+(y, t), and Pr(y, t) respectively as the null, non-null, and total probabilities assigned to a partial output transcription y at time t. The probability Pr(k, y, t) of extending y by label k at time t is defined as follows:

    \Pr(k, y, t) = \Pr(k, t|X) \, [\gamma \Pr(k|y)] \times \begin{cases} \Pr^-(y, t-1), & y^e = k \\ \Pr(y, t-1), & \text{otherwise} \end{cases}

where Pr(k, t|X) is the CTC emission probability of k at t, as defined in (3), Pr(k|y) is the linguistic transition probability from y to y+k, y^e is the final label in y, ŷ denotes y with its final label removed, and γ is the language weight.

Algorithm I. CTC Beam Search

 1: initialize: B = {∅}; Pr^-(∅, 0) = 1
 2: for t = 1 ... T do
 3:     B̂ = the N-best prefixes in B
 4:     B = {}
 5:     for y in B̂ do
 6:         if y ≠ ∅ then
 7:             Pr^+(y, t) ← Pr^+(y, t-1) Pr(y^e, t|X)
 8:             if ŷ ∈ B̂ then
 9:                 Pr^+(y, t) ← Pr^+(y, t) + Pr(y^e, ŷ, t)
10:         Pr^-(y, t) ← Pr(y, t-1) Pr(-, t|X)
11:         add y to B
12:         prune emission probabilities at time t (retain K_t classes)
13:         for k = 1 ... K_t do
14:             Pr^-(y + k, t) ← 0
15:             Pr^+(y + k, t) ← Pr(k, y, t)
16:             add (y + k) to B
17: Return: argmax_{y ∈ B} Pr^{1/|y|}(y, T)
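A runnable sketch of this decoding scheme is given below. It works in the log domain, represents prefixes as tuples, keeps a fixed number of emission classes per frame, and folds the weighted language-model score into the extension step. The per-frame retention count, the uniform toy language model, and all sizes are assumptions made for illustration; the paper's implementation instead prunes emissions with the reciprocal of the number of output units as threshold, uses a beam of 64, and scores with character n-grams:

import numpy as np
from collections import defaultdict

NEG_INF = -np.inf

def ctc_beam_search(log_probs, lm_logprob, beam_size=64, lm_weight=1.0, blank=0):
    """Sketch of the modified CTC beam search (Alg. I) with an n-gram LM folded
    into the extension probability Pr(k, y, t).
    log_probs: (T, K) log posteriors; lm_logprob(prefix, k): log Pr(k | prefix)."""
    T, K = log_probs.shape
    # For each prefix y keep (log Pr^-(y, t), log Pr^+(y, t)).
    beams = {(): (0.0, NEG_INF)}                       # empty prefix: null prob = 1
    for t in range(T):
        # keep the N best prefixes from the previous frame
        best = sorted(beams.items(), key=lambda kv: -np.logaddexp(*kv[1]))[:beam_size]
        new_beams = defaultdict(lambda: (NEG_INF, NEG_INF))
        # prune emissions at frame t (retention count of 20 is an illustrative choice)
        topk = np.argsort(log_probs[t])[-20:]
        for y, (p_blank, p_nonblank) in best:
            p_total = np.logaddexp(p_blank, p_nonblank)
            # emit null, or repeat the last label: the transcription stays the same
            nb, nn = new_beams[y]
            nb = np.logaddexp(nb, p_total + log_probs[t, blank])
            if y:
                nn = np.logaddexp(nn, p_nonblank + log_probs[t, y[-1]])
            new_beams[y] = (nb, nn)
            # extend y with a new non-null label k, rescaled by the language model
            for k in topk:
                if k == blank:
                    continue
                lm = lm_weight * lm_logprob(y, k)
                prev = p_blank if (y and y[-1] == k) else p_total
                nb2, nn2 = new_beams[y + (k,)]
                new_beams[y + (k,)] = (nb2, np.logaddexp(nn2, prev + log_probs[t, k] + lm))
        beams = new_beams
    # return the prefix with the best length-normalised probability
    def score(item):
        y, (pb, pnb) = item
        return np.logaddexp(pb, pnb) / max(len(y), 1)
    return max(beams.items(), key=score)[0]

# Toy usage: a uniform "language model" over 5 classes and random network outputs.
rng = np.random.default_rng(0)
logits = rng.standard_normal((30, 5))
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
print(ctc_beam_search(log_probs, lambda y, k: np.log(1.0 / 4), beam_size=8))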
IV. EXPERIMENTS

A. Trajectory Datasets

We use an online Chinese handwriting database, CASIA-OLHWDB [16], to train and evaluate our models. This database comprises trajectories of both isolated characters and lines of text. The part comprising only isolated characters has three sub-sets, named OLHWDB 1.0~1.2, and the other part also has three sub-sets: OLHWDB 2.0~2.2. OLHWDB 2.0~2.2 are divided into a train set and a test set by the authors of [16].

The train set includes 4,072 pages from 815 writers, while the test set includes 1,020 pages from 204 writers. There is also a held-out subset used for the ICDAR 2013 Chinese handwriting recognition competition (denoted as the comp set) [14]. Characteristics of these datasets are summarized in Table 1. Our systems are evaluated on both the test set and the comp set. Note that there are 5 characters in the test set which are not covered by the train set, while around 2.02% of the characters in the comp set are not covered by the train set.

TABLE I. PROPERTIES OF DATASETS (SUBSETS OF OLHWDB 2.0~2.2)

Items      train      test     comp
#pages     4,072      1,020    300
#lines     41,710     10,510   3,432
#chars     1,082,220  269,674  91,576
#classes   2,650      2,631    1,375


B. Textual Corpus

To model the linguistic constraints, we collect a large amount of textual data from the electronic news of People's Daily (in HTML format). We only filter out the HTML markup; no human effort intervenes, so the data contain some noisy lines. The texts range from 1999 to 2005 and include 193,842,123 characters.

We estimate our n-gram language models using the SRILM toolkit [27]. All characters outside of the 7,356 classes are removed. Bigrams and trigrams are considered respectively, in base-10 logarithm scale. Both of them are pruned with a threshold of 7e-7.
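As a toy illustration of how such a character-level model supplies the transition probability Pr(k|y) used later in decoding, the sketch below estimates an add-alpha smoothed character bigram from a couple of sentences. It is not the SRILM-estimated, pruned model described above; the smoothing, the sample sentences, and the function names are assumptions:

import math
from collections import Counter, defaultdict

def train_char_bigram(lines, alpha=0.01):
    """Toy character-bigram LM with add-alpha smoothing (illustrative only)."""
    unigrams, bigrams = Counter(), defaultdict(Counter)
    for line in lines:
        chars = ['<s>'] + list(line)
        for prev, cur in zip(chars, chars[1:]):
            unigrams[prev] += 1
            bigrams[prev][cur] += 1
    vocab = {c for line in lines for c in line} | {'<s>'}
    V = len(vocab)
    def logprob(prefix, k):
        prev = prefix[-1] if prefix else '<s>'
        return math.log((bigrams[prev][k] + alpha) / (unigrams[prev] + alpha * V))
    return logprob

lm = train_char_bigram(["今天天气很好", "今天天气不好"])
print(lm("今天", "天"))   # character-level log Pr(k | y), usable as Pr(k|y) in decoding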
C. Performance Metrics

The output of a recognizer is compared with the reference transcription, and two metrics, the correct rate (CR) and accurate rate (AR), are calculated to evaluate the results. Supposing the numbers of substitution errors (S_e), deletion errors (D_e), and insertion errors (I_e) are known, CR and AR are defined respectively as:

    CR = (N_t - S_e - D_e) / N_t ,
    AR = (N_t - S_e - D_e - I_e) / N_t ,    (7)

where N_t is the total number of characters in the reference transcription.
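A small sketch of Eq. (7): the substitution, deletion, and insertion counts are taken here from a plain Levenshtein alignment between the reference and the recognizer output, and the toy strings are illustrative:

def cr_ar(reference, hypothesis):
    """Correct rate (CR) and accurate rate (AR) of Eq. (7)."""
    n, m = len(reference), len(hypothesis)
    # dp[i][j] = (total_edits, substitutions, deletions, insertions) for ref[:i] vs hyp[:j]
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        dp[i][0] = (i, 0, i, 0)                       # all deletions
    for j in range(1, m + 1):
        dp[0][j] = (j, 0, 0, j)                       # all insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if reference[i - 1] == hypothesis[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                t, s, d, ins = dp[i - 1][j - 1]
                sub = (t + 1, s + 1, d, ins)
                t, s, d, ins = dp[i - 1][j]
                dele = (t + 1, s, d + 1, ins)
                t, s, d, ins = dp[i][j - 1]
                insr = (t + 1, s, d, ins + 1)
                dp[i][j] = min(sub, dele, insr)
    _, Se, De, Ie = dp[n][m]
    Nt = n
    return (Nt - Se - De) / Nt, (Nt - Se - De - Ie) / Nt

print(cr_ar("ABCDEFG", "ABXDEFGH"))   # 1 substitution, 1 insertion -> (0.857..., 0.714...)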
D. Setup

We model the output layer with 2,765 units. One is reserved as the null indicator; the others cover all characters occurring in the train set, test set, and comp set of OLHWDB 2.0~2.2. We also select isolated samples from OLHWDB 1.x, with no more than 700 samples per class, hoping to alleviate the out-of-vocabulary problem of the comp set.

The input trajectory fed to the network is represented in raw form. At each timestep, it consists of the first-order differences of the x, y coordinates, along with a state indicating whether the pen is lifted. For the start of the trajectory, the first two dimensions are set to zero.
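This input representation can be sketched as follows; the 0/1 encoding of the pen state and the sample coordinates are assumptions made for illustration:

import numpy as np

def trajectory_features(points):
    """Raw 3-dim input described above: (dx, dy, pen_state) per timestep.
    points: list of (x, y, pen_down) tuples, pen_down in {0, 1} (assumed encoding).
    The first timestep gets (0, 0) for the difference, as in the paper."""
    pts = np.asarray(points, dtype=float)
    feats = np.zeros((len(pts), 3))
    feats[1:, 0:2] = pts[1:, 0:2] - pts[:-1, 0:2]     # first-order differences of x, y
    feats[:, 2] = pts[:, 2]                           # pen-lift / pen-down state
    return feats

# Toy usage: a short stroke followed by a pen lift.
stroke = [(100, 200, 1), (103, 201, 1), (107, 203, 1), (107, 203, 0), (120, 210, 1)]
print(trajectory_features(stroke))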

Three different networks are investigated, as provided in Table 2. The type column reflects the number of hidden layers. The setting column describes the size of each layer, where the suffix has a specific meaning: I for input layer, S for subsampling layer, L for LSTM layer, and O for output layer. The last column summarizes the overall number of network parameters (in millions). The three networks are deliberately tuned to be comparable in number of parameters.

TABLE II. NETWORK SETUPS

Type     Setting                              #para
3-layer  3I-64L-128S-192L-2765O               1.63M
5-layer  3I-64L-80S-96L-128S-160L-2765O       1.50M
6-layer  3I-48L-64S-96L-128L-144S-160L-2765O  1.84M

The system is developed from scratch on our own. All probabilities except the language model are expressed in natural logarithm scale. The learning rate is driven by the AdaDelta algorithm [28] with a momentum of 0.95. During CTC beam search decoding, the emission probabilities are pruned with the reciprocal of the number of output units as the threshold, and the beam size is fixed at 64 in our evaluation. Both CPU and GPU versions are implemented. Using a mini-batch size of 32 textlines, a speedup of more than 200X can be achieved by our GPU version on a Tesla K40.

The training process is divided into four stages. In the first stage, isolated samples are used to train the networks with a mini-batch size of 128 characters; this is cycled for 5 epochs. The next stage is run on all training data from OLHWDB 2.x with a mini-batch size of 32 textlines and is repeated for 10 epochs. The third stage is executed on the same data as the first stage for 10 epochs. In the final stage, 95% of the training data from OLHWDB 2.x is used to fine-tune the networks for no more than 20 epochs; the rest of the training data is reserved for validation of the best model.

E. Results

The 4th training stage is illustrated in Fig. 4, where the y axis shows the AR on the validation set. The best model in this phase is selected and used to evaluate on the test set or comp set in the following text.

The rescaling intensity of the language models is determined based on their performance on the validation set. In Fig. 5, the role of the character-level trigram is given. We can see that performance is stable when the language weight ranges from 1.0 to 1.8, while the standard decoding algorithm uses a value around 2.3. A similar observation can be made for the bigram. In the following experiments, we simply fix it as 1.0.

We investigate the role of different language models as well as network depth on both the test set and the comp set, as shown in Table 3. Though the three networks have almost the same number of parameters, the deeper networks generally work better on both sets. On the other hand, we can see a steady increase in performance when stronger linguistic constraints are imposed. Comparing the two sets, the improvement from using language models on the comp set is more remarkable.
Figure 4. The 4th training stage evaluated on the validation set (AR (%) versus training epochs for the 3-layer, 5-layer, and 6-layer networks).

Figure 5. Selection of the language model (trigram) weight on the validation set (AR (%) versus LM weight for the 3-layer, 5-layer, and 6-layer networks).
TABLE III. EFFECTS OF LANGUAGE MODELS ON BOTH TEST SET AND COMP SET (%)

            No LM           bigram          trigram
Type        AR      CR      AR      CR      AR      CR
test set
3-layer     94.81   95.52   96.13   96.84   96.83   97.45
5-layer     95.08   95.63   96.34   96.92   97.04   97.58
6-layer     95.30   95.82   96.45   97.00   97.05   97.55
comp set
3-layer     88.06   89.40   91.42   92.87   93.00   94.28
5-layer     88.91   90.00   92.05   93.20   93.37   94.44
6-layer     89.12   90.18   92.12   93.25   93.40   94.43

Finally, we compare our results with previous works, as shown in Table 4. The systems in the first four rows use a segmentation-recognition integrated strategy. The fifth row employs a hybrid CNN and LSTM architecture with a specific feature extraction layer. The best results on the test set are achieved by our system. Compared with [20], 1.15% and 1.71% absolute error reductions are observed in CR and AR, respectively. Compared with the best results from the segmentation-recognition integrated strategy [13], around 2.3% absolute error reductions are made in both CR and AR. Our system achieves a slightly lower result than the segmentation-recognition integrated strategy on the comp set. It is severely challenging for the segmentation-free strategy to perform on the comp set since there is a noticeable out-of-vocabulary problem. If we remove all out-of-vocabulary characters from the comp set, our system achieves 94.65% and 95.65% in AR and CR, respectively, even outperforming the best system of Vision Objects [14].

TABLE IV. RESULTS COMPARISONS (%)

                  test set          comp set
Methods           AR       CR       AR        CR
Wang2012 [3]      91.97    92.76    --        --
Zhou2013 [12]     93.75    94.34    94.06     94.62
Zhou2014 [13]     94.69    95.32    94.22     94.76
VO-3 [14]         --       --       94.49     95.03
Xie2016 [20]      95.34    96.40    92.88*    95.00*
Our Method        97.05    97.55    93.40     94.43
                                    94.65*    95.65*
* removing all characters in the comp set which are not covered by the train set.

F. Error Analysis

The recognition results are analyzed per textline. We partition the ARs of all textlines into 11 intervals and count the number of textlines that fall into each interval, as shown in Fig. 6. The width of an interval is 5% if the AR is above 50%. There are 113 textlines in the comp set whose ARs are lower than 75%, and we inspect all of them to get some insights.

Figure 6. Summary of the ARs using an interval histogram (frequency of textlines per AR range).

Generally, errors are caused by three challenging issues, as shown in Fig. 7. First, there are cursively written samples, where two or more characters may even be joined as a ligature. Moreover, severely skewed or slanted textlines are not easy to deal with, since there are limited training samples covering such phenomena. In addition, there are inadequate samples for English letters/words, and the punctuation marks are noisily written. It may be helpful to further reduce or alleviate such issues.

Figure 7. Typical errors extracted from the comp set: (a) cursive handwriting and ligature; (b) skew and slant textline; (c) English words and punctuation. (Each example contrasts the recognizer output with the ground-truth label.)
V. CONCLUSIONS

The paper presents a novel end-to-end recognition method for online Chinese handwriting recognition based on deep recurrent neural networks. Unlike previous practices, we directly feed the original pen trajectory into the network. The long contextual dependencies and complex dynamics are intended to be encoded by a hybrid architecture. A modified CTC beam search decoding algorithm is devised to integrate the linguistic constraints wisely. Experiments on a public handwriting database show that our method remarkably outperforms state-of-the-art approaches on the test set. Further evaluating on the more challenging competition set, our results are at least comparable to those already published. In particular, a new performance record can be achieved by our method in situations without the out-of-vocabulary problem.

ACKNOWLEDGMENT

This work was supported by the National Natural Science Foundation of China (Grant No. 61203260), the Natural Science Foundation of Guangdong Province, China (Grant Nos. ZD2015017 and F2016017), and the Fundamental Research Funds for the Central Universities (Grant No. HIT.NSRIF.2015083). The work was also supported by the NVIDIA GPU Education Center program.

REFERENCES

[1] J.-H. Liu, "Real Time Chinese Handwriting Recognition," B.S. thesis, Department of Electrical Engineering, Massachusetts Institute of Technology, Cambridge, 1966.
[2] M. J. Zobrak, "A method for rapid recognition of hand drawn line patterns," M.S. thesis, University of Pittsburgh, Pennsylvania, 1966.
[3] D.-H. Wang, C.-L. Liu, and X.-D. Zhou, "An approach for real-time recognition of online Chinese handwritten sentences," Pattern Recognition, vol. 45, pp. 3661-3675, 2012.
[4] Y. Zou, Y. Liu, Y. Liu, and K. Wang, "Overlapped Handwriting Input on Mobile Phones," in ICDAR, 2011.
[5] Y. Lv, L. Huang, D. Wang, and C. Liu, "Learning-Based Candidate Segmentation Scoring for Real-Time Recognition of Online Overlaid Chinese Handwriting," in ICDAR, 2013, pp. 74-78.
[6] H. Zhang, D.-H. Wang, and C.-L. Liu, "Character confidence based on N-best list for keyword spotting in online Chinese handwritten documents," Pattern Recognition, vol. 47, pp. 1880-1890, 2014.
[7] M. Cheriet, N. Kharma, C. Liu, and C. Suen, Character Recognition Systems: A Guide for Students and Practitioners. John Wiley & Sons, 2007.
[8] C.-L. Liu, F. Yin, D.-H. Wang, and Q.-F. Wang, "Online and offline handwritten Chinese character recognition: Benchmarking on new databases," Pattern Recognition, vol. 46, pp. 155-162, 2013.
[9] Q.-F. Wang, F. Yin, and C.-L. Liu, "Handwritten Chinese Text Recognition by Integrating Multiple Contexts," IEEE Trans. on PAMI, vol. 34, pp. 1469-1481, 2012.
[10] M.-K. Zhou, X.-Y. Zhang, F. Yin, and C.-L. Liu, "Discriminative quadratic feature learning for handwritten Chinese character recognition," Pattern Recognition, vol. 49, pp. 7-18, 2016.
[11] D.-H. Wang and C.-L. Liu, "Learning confidence transformation for handwritten Chinese text recognition," International Journal on Document Analysis and Recognition, vol. 17, pp. 205-219, 2013.
[12] X.-D. Zhou, D.-H. Wang, F. Tian, C.-L. Liu, and M. Nakagawa, "Handwritten Chinese/Japanese Text Recognition Using Semi-Markov Conditional Random Fields," IEEE Trans. on PAMI, vol. 35, pp. 2413-2426, 2013.
[13] X.-D. Zhou, Y.-M. Zhang, F. Tian, H.-A. Wang, and C.-L. Liu, "Minimum-risk training for semi-Markov conditional random fields with application to handwritten Chinese/Japanese text recognition," Pattern Recognition, vol. 47, pp. 1904-1916, 2014.
[14] F. Yin, Q.-F. Wang, X.-Y. Zhang, and C.-L. Liu, "ICDAR 2013 Chinese Handwriting Recognition Competition," in ICDAR, 2013.
[15] T. Su, T. Zhang, and D. Guan, "Corpus-based HIT-MW Database for Offline Recognition of General-purpose Chinese Handwritten Text," International Journal on Document Analysis and Recognition, vol. 10, pp. 27-38, 2007.
[16] C.-L. Liu, F. Yin, D.-H. Wang, and Q.-F. Wang, "CASIA Online and Offline Chinese Handwriting Databases," in ICDAR, 2011, pp. 37-41.
[17] T. Su, T. Zhang, and D. Guan, "Off-line Recognition of Realistic Chinese Handwriting Using Segmentation-free Strategy," Pattern Recognition, vol. 42, pp. 167-182, 2009.
[18] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber, "A Novel Connectionist System for Unconstrained Handwriting Recognition," IEEE Trans. on PAMI, vol. 31, pp. 855-868, 2009.
[19] R. Messina and J. Louradour, "Segmentation-free handwritten Chinese text recognition with LSTM-RNN," in ICDAR, 2015, pp. 171-175.
[20] Z. Xie, Z. Sun, L. Jin, Z. Feng, and S. Zhang, "Fully Convolutional Recurrent Network for Handwritten Chinese Text Recognition," arXiv preprint, 2016.
[21] Q. Liu, L. Wang, and Q. Huo, "A study on effects of implicit and explicit language model information for DBLSTM-CTC based handwriting recognition," in ICDAR, 2015, pp. 461-465.
[22] A. Graves, Supervised Sequence Labelling with Recurrent Neural Networks. Springer, 2012.
[23] M. Schuster and K. Paliwal, "Bidirectional Recurrent Neural Networks," IEEE Transactions on Signal Processing, vol. 45, pp. 2673-2681, 1997.
[24] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[25] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, pp. 1735-1780, 1997.
[26] A. Graves and N. Jaitly, "Towards End-To-End Speech Recognition with Recurrent Neural Networks," in ICML, 2014, pp. 1764-1772.
[27] A. Stolcke, "SRILM--an extensible language modeling toolkit," in INTERSPEECH, 2002.
[28] M. D. Zeiler, "Adadelta: An adaptive learning rate method," arXiv preprint arXiv:1212.5701, 2012.

