Posted by 李健 on November 23, 2018

An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition


Engineering research

Problem pattern

Less studied problem / Well studied problems

Idea pattern



What is the research problem? Why is it an important problem? Why can existing solutions not address the problem satisfactorily?

Image-based sequence recognition has been a long-standing research topic in computer vision.

Literature Review

What are the existing solutions? What is their methodology?

1. Unlike general object recognition, recognizing such sequence-like objects often requires the system to predict a series of object labels instead of a single label. Recognition of such objects can therefore be naturally cast as a sequence recognition problem. Another unique property of sequence-like objects is that their lengths may vary drastically. For instance, English words can consist of 2 characters, such as “OK”, or 15 characters, such as “congratulations”. Consequently, the most popular deep models like DCNN cannot be directly applied to sequence prediction, since DCNN models often operate on inputs and outputs with fixed dimensions and are thus incapable of producing a variable-length label sequence.

2. Some attempts have been made to address this problem for a specific sequence-like object (e.g. scene text). For example, some algorithms first detect individual characters and then recognize the detected characters with DCNN models trained on labeled character images.

3. Some other approaches treat scene text recognition as an image classification problem, and assign a class label to each English word (90K words in total).

Research Niche

What motivates this work? e.g. motivated by the limitations of existing work or the lack of study for a specific issue.



Research Objectives

What do the authors want to achieve in this work? What is the scope of this work? What questions does this paper want to address?

Current DCNN-based systems cannot be directly used for image-based sequence recognition.


Why is achieving the research objectives difficult?



What is the main inspiration that led to the new solution?


Research summary

What is the proposed approach/framework/technique(s)?





Evaluation summary

How do the authors compare the new solution with existing ones? What are the comparison metrics? What is the scale of the experiment? What interesting observations do the authors make from their experimental results?


What are the implications of the evaluation results? What do they mean for practitioners?





What are the contributions of the work?


1) It can be learned directly from sequence labels (for instance, words), requiring no detailed annotations (for instance, characters); 2) It has the same property as DCNN of learning informative representations directly from image data, requiring neither handcrafted features nor preprocessing steps such as binarization/segmentation, component localization, etc.; 3) It has the same property as RNN of being able to produce a sequence of labels; 4) It is unconstrained by the lengths of sequence-like objects, requiring only height normalization in both the training and testing phases; 5) It achieves better or highly competitive performance on scene text (word recognition) compared with prior art; 6) It contains far fewer parameters than a standard DCNN model, consuming less storage space.


Are there any unrealistic assumptions in the approach? Are there any case where the approach does not work?



Key concepts

What are the important concepts introduced in this work?

1. Edit distance




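1. d(x,y) = 0 if and only if x = y  (the Levenshtein distance is 0 <==> the strings are equal)
2. d(x,y) = d(y,x)     (the minimum number of steps from x to y equals the minimum number of steps from y to x)
3. d(x,y) + d(y,z) >= d(x,z)  (the number of steps from x to z never exceeds the number of steps from x to y plus from y to z)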
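
The three properties above can be checked numerically on small inputs. The following is a minimal sketch of the standard dynamic-programming computation of Levenshtein distance; the function name `levenshtein` and the sample words are illustrative assumptions, not something taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # prev[j] is the distance between the prefix of a processed so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or keep a match
        prev = curr
    return prev[-1]


# Illustrative words (hypothetical sample, chosen only for the check).
x, y, z = "GAIE", "GAME", "FAME"
identity = levenshtein(x, x) == 0
symmetry = levenshtein(x, y) == levenshtein(y, x)
triangle = levenshtein(x, y) + levenshtein(y, z) >= levenshtein(x, z)
print(identity, symmetry, triangle)
```

For these words the distances are d(GAIE, GAME) = 1, d(GAME, FAME) = 1, and d(GAIE, FAME) = 2, so the triangle inequality holds with equality.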




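For example, suppose we enter GAIE and the program finds that it is not in the dictionary. We now want to return every dictionary word whose distance from GAIE is 1. We first compare GAIE with the root of the tree, GAME, and get a distance of d = 1. Because Levenshtein distance satisfies the triangle inequality, every word whose distance from GAME exceeds 2 can now be ruled out. For instance, every word in the subtree rooted at AIM is at distance 3 from GAME, while the distance between GAME and GAIE is 1, so AIM and its subtree are at distance at least 2 from GAIE. The program therefore only needs to follow the edges whose labels lie in the range 1-1 to 1+1 (that is, 0 to 2). Next we compute the distance between GAIE and FAME, which turns out to be 2, so we continue along the edges labeled between 1 and 3. Once that traversal finishes, we return to the second child of GAME, find that the distance between GAIE and GAIN is 1, output GAIN, and recurse along the edges labeled 1 or 2 (the subtree attached to the edge labeled 4 is again ruled out)...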
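
The walk described above matches the structure commonly known as a BK-tree: each child hangs off an edge labeled with its distance to the parent, and a query with tolerance n only follows edges labeled between d-n and d+n. Below is a minimal sketch under stated assumptions: the class name `BKTree`, its methods, and the word list are hypothetical, and the resulting tree is not the exact tree from the example above.

```python
def levenshtein(a, b):
    # Standard dynamic-programming edit distance (row-by-row).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


class BKTree:
    """Hypothetical minimal BK-tree; a node is (word, {edge_label: child})."""

    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})  # the first word becomes the root
        for w in it:
            self.add(w)

    def add(self, word):
        node = self.root
        while True:
            d = levenshtein(word, node[0])
            if d == 0:
                return  # word already stored
            if d not in node[1]:
                node[1][d] = (word, {})  # new edge labeled with the distance
                return
            node = node[1][d]  # descend along the existing edge

    def query(self, word, n):
        """Return all stored words within distance n of word."""
        results, stack = [], [self.root]
        while stack:
            node_word, children = stack.pop()
            d = levenshtein(word, node_word)
            if d <= n:
                results.append(node_word)
            # Triangle inequality: only edges labeled in [d - n, d + n]
            # can lead to words within distance n.
            for label, child in children.items():
                if d - n <= label <= d + n:
                    stack.append(child)
        return sorted(results)


# Hypothetical dictionary; GAME is inserted first and becomes the root.
words = ["GAME", "FAME", "GAIN", "AIM", "SAME", "GATE"]
tree = BKTree(words)
matches = tree.query("GAIE", 1)
print(matches)
```

With this word list, AIM sits on the edge labeled 3 from GAME, so a query for GAIE with tolerance 1 never visits it, which mirrors the pruning step in the example.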


Technical content

What are the important concepts introduced in this work?

CNN, bidirectional LSTM, transcription.
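
To make the transcription step concrete, here is a minimal sketch assuming it refers to CTC-style greedy decoding: take the best label per frame, merge consecutive repeats, then drop the blank symbol. The function name `greedy_transcribe`, the `labels` string, and the toy one-hot scores are hypothetical illustrations, not the authors' implementation.

```python
def greedy_transcribe(frame_scores, labels, blank=0):
    """Greedy CTC-style decoding (illustrative sketch).

    frame_scores: one list of per-label scores for each frame.
    labels: string whose character at index `blank` is a placeholder for blank.
    """
    # Best label index for every frame.
    best = [max(range(len(scores)), key=scores.__getitem__)
            for scores in frame_scores]
    out = []
    prev = None
    for k in best:
        # Keep a label only when it starts a new run and is not the blank.
        if k != prev and k != blank:
            out.append(labels[k])
        prev = k
    return "".join(out)


# Toy input: one-hot scores built from a hand-written frame-level path.
labels = "-helo"            # index 0 is the blank placeholder
path = "--hh-e-l-ll-oo--"   # hypothetical per-frame predictions
frames = [[1.0 if c == ch else 0.0 for c in labels] for ch in path]
decoded = greedy_transcribe(frames, labels)
print(decoded)
```

The blank between the two runs of `l` is what lets the decoder keep a double letter, while repeated frames of the same label collapse into one character.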



Future work

What are the further research topics that can be extended from this work?





