An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition

Posted by 李健 on November 23, 2018

Papers

  1. An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition

Introduction

Category

Engineering research

Problem pattern

Less studied problem / Well studied problem

Idea pattern

Motivation

 Background

What is the research problem? Why is it an important problem? Why can existing solutions not address the problem satisfactorily?

Image-based sequence recognition has been a long-standing research topic in computer vision.

Literature Review

What are the existing solutions? What is their methodology?

1. Unlike general object recognition, recognizing such sequence-like objects often requires the system to predict a series of object labels, instead of a single label. Therefore, recognition of such objects can be naturally cast as a sequence recognition problem. Another unique property of sequence-like objects is that their lengths may vary drastically. For instance, English words can either consist of 2 characters such as “OK” or 15 characters such as “congratulations”. Consequently, the most popular deep models like DCNN cannot be directly applied to sequence prediction, since DCNN models often operate on inputs and outputs with fixed dimensions, and thus are incapable of producing a variable-length label sequence.

2. Some attempts have been made to address this problem for specific sequence-like objects (e.g. scene text). For example, some algorithms first detect individual characters and then recognize the detected characters with DCNN models trained on labeled character images.

3. Some other approaches treat scene text recognition as an image classification problem, and assign a class label to each English word (90K words in total).

Research Niche

What motivates this work? E.g., motivated by the limitations of existing work or by the lack of study on a specific issue.

In the real world, stable visual objects such as scene text, handwriting, and musical scores tend to appear in the form of sequences rather than in isolation. Unlike general object recognition, recognizing such sequence-like objects usually requires the system to predict a series of object labels instead of a single label. Recognition of such objects can therefore be naturally cast as a sequence recognition problem.

Work

Research Objectives

What do the authors want to achieve in this work? What is the scope of this work? What questions does this paper want to address?

Current systems based on DCNNs cannot be directly used for image-based sequence recognition.

Challenge

Why is achieving the research objectives difficult?

Another unique property of sequence-like objects is that their lengths may vary drastically. For instance, English words can consist of 2 characters (such as “OK”) or 15 characters (such as “congratulations”). Consequently, the most popular deep models, such as DCNNs, cannot be directly applied to sequence prediction.

Insight

What is the main inspiration that led to the new solution?

 

Research summary

What is the proposed approach/framework/technique(s)?

In this paper, we propose a novel neural network architecture, called the Convolutional Recurrent Neural Network (CRNN), which combines the advantages of convolutional neural networks (CNNs) and recurrent neural networks (RNNs).

The network architecture of CRNN, shown in Figure 1, consists of three components: the convolutional layers, the recurrent layers, and the transcription layer, from bottom to top.
At the bottom of CRNN, the convolutional layers automatically extract a feature sequence from each input image.
On top of the convolutional network, a recurrent network is built to make a prediction for each frame of the feature sequence output by the convolutional layers.
The transcription layer at the top of CRNN translates the per-frame predictions of the recurrent layers into a label sequence.
Although CRNN is composed of different kinds of network architectures (CNN and RNN), it can be jointly trained with one loss function.
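
To make the pipeline concrete, here is a minimal PyTorch sketch of the three stages. The layer sizes and pooling scheme below are illustrative assumptions, not the paper's exact configuration; the paper trains the whole network end-to-end with CTC loss in the transcription layer.

```python
# A minimal sketch of the CRNN pipeline: CNN -> bidirectional LSTM -> per-frame
# class scores for CTC transcription. Layer sizes are illustrative only.
import torch
import torch.nn as nn

class CRNNSketch(nn.Module):
    def __init__(self, num_classes, img_height=32, hidden_size=256):
        super().__init__()
        # 1) Convolutional layers: extract a feature sequence from the image.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1), (2, 1)),   # pool height only, keep width resolution
        )
        feat_h = img_height // 8            # height after the three pooling stages
        # 2) Recurrent layers: a bidirectional LSTM predicts each frame.
        self.rnn = nn.LSTM(256 * feat_h, hidden_size,
                           bidirectional=True, batch_first=True)
        # Project to num_classes (including the CTC blank symbol).
        self.fc = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, x):                   # x: (batch, 1, H, W)
        f = self.cnn(x)                     # (batch, C, H', W')
        b, c, h, w = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)  # frames along image width
        out, _ = self.rnn(f)                # (batch, W', 2*hidden)
        return self.fc(out)                 # per-frame class scores

# 3) Transcription: per-frame scores are trained jointly with CTC loss.
model = CRNNSketch(num_classes=37)          # e.g. 26 letters + 10 digits + blank
logits = model(torch.randn(2, 1, 32, 100))  # -> (2, 25, 37)
log_probs = logits.log_softmax(2).permute(1, 0, 2)  # (T, batch, classes) for nn.CTCLoss
```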

 

Evaluation

Evaluation summary

How do the authors compare the new solution with existing ones? What are the comparison metrics? What is the scale of the experiment? What interesting observations do the authors make from their experimental results?

Implications

What are the implications of the evaluation results? What does it mean to the practitioners?

1. In the constrained-lexicon cases, our method consistently outperforms most state-of-the-art approaches, and in general beats the best text reader.

2. In the unconstrained-lexicon cases, our method achieves the best performance on the SVT dataset, but still falls behind some approaches on the IC03 and IC13 datasets.

Novelty

Contributions

What are the contributions of the work?

CRNN is a general framework, so it can be applied to other domains and problems that involve sequence prediction in images (such as Chinese text recognition). For sequence-like objects, CRNN possesses several distinctive advantages over conventional neural network models:

1) It can be directly learned from sequence labels (for instance, words), requiring no detailed annotations (for instance, characters);
2) It has the same property as DCNN of learning informative representations directly from image data, requiring neither hand-crafted features nor preprocessing steps such as binarization/segmentation and component localization;
3) It has the same property as RNN of being able to produce a sequence of labels;
4) It is unconstrained by the lengths of sequence-like objects, requiring only height normalization in both the training and testing phases;
5) It achieves better or highly competitive performance on scene text (word recognition) compared with the prior art;
6) It contains far fewer parameters than a standard DCNN model, consuming less storage space.

Limitations

Are there any unrealistic assumptions in the approach? Are there any cases where the approach does not work?

Recognition speed.

 

Key concepts

What are the important concepts introduced in this work?

1. Edit distance

Edit distance quantifies how dissimilar two strings are by counting the minimum number of operations required to transform one string into the other. It is also known as the Levenshtein distance.
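
A standard dynamic-programming implementation, as a minimal Python sketch (the two asserts mirror the GAME/FAME and GAME/GAIN examples used below):

```python
# Classic DP computation of the Levenshtein (edit) distance: the minimum number
# of insertions, deletions, and substitutions that turn string a into string b.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))          # distances from a[:0] to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]                          # deleting i characters reaches the empty string
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                # deletion
                curr[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),   # substitution (free if chars match)
            ))
        prev = curr
    return prev[-1]

assert levenshtein("GAME", "FAME") == 1
assert levenshtein("GAME", "GAIN") == 2
```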

2. The BK-tree algorithm (fuzzy matching)

First, observe the properties of the Levenshtein distance. Let d(x, y) denote the Levenshtein distance from string x to string y. Then, clearly:

1. d(x,y) = 0 if and only if x = y  (the Levenshtein distance is 0 ⟺ the strings are equal)
2. d(x,y) = d(y,x)      (the minimum number of steps from x to y equals the minimum number from y to x)
3. d(x,y) + d(y,z) >= d(x,z)  (transforming x into z never takes more steps than first transforming x into y and then y into z)

The last property is called the triangle inequality: just as in a triangle, the sum of two sides is never less than the third. If we define a binary "distance function" on the elements of some set and it satisfies all three properties above, we call the result a "metric space". Our three-dimensional space is a typical metric space, with the straight-line distance between points as its distance function. There are many other metric spaces, such as the Manhattan distance, shortest-path distances in graph theory, and of course the Levenshtein distance discussed here. Just as the union-find structure works for any equivalence relation, a BK-tree can be built over any metric space.

Building the tree is somewhat similar to building a trie. First, pick an arbitrary word as the root (say, GAME). To insert a word, first compute its Levenshtein distance to the root: if this distance value appears at that node for the first time, create a new child node; otherwise, recurse down the corresponding edge. For example, to insert FAME, whose distance to GAME is 1, we create a new child and connect it with an edge labeled 1. Next we insert GAIN, whose distance to GAME is 2, so it is placed under the edge labeled 2. Next we insert GATE, whose distance to GAME is 1, so we follow the edge labeled 1 and recursively insert it into the subtree rooted at FAME; the distance between GATE and FAME is 2, so GATE is placed under the FAME node, on an edge labeled 2.

Diagram: http://www.matrix67.com/blogimage/200710223.gif

Queries are remarkably convenient. Suppose we need to return all words within distance n of a misspelled word, and the distance between this word and the root is d. Then we only need to recursively consider the subtrees hanging off edges labeled from d-n to d+n. Since n is usually small, many subtrees can be pruned at each node comparison.

For example, suppose we input GAIE and the program finds that it is not in the dictionary. Now we want to return all dictionary words within distance 1 of GAIE. We first compare GAIE with the root and get distance d = 1. Since the Levenshtein distance satisfies the triangle inequality, all words at distance greater than 2 from GAME can now be ruled out. For instance, every word in the subtree rooted at AIM is at distance 3 from GAME, and the distance between GAME and GAIE is 1, so the distance from AIM and its subtree to GAIE is at least 2. The program therefore only needs to follow edges labeled between 1-1 and 1+1. We then compute the distance between GAIE and FAME, which is 2, and proceed along edges labeled between 1 and 3. After traversing that subtree we return to GAME's second child, find that the distance between GAIE and GAIN is 1, output GAIN, and continue recursing down the edges labeled 1 or 2 (the subtree hanging off the edge labeled 4 is pruned as well), and so on.
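
Putting the construction and query procedures together, here is a compact Python sketch reusing the levenshtein() function defined above. The tiny dictionary below reuses the example words; note that in this small tree GATE also lies within distance 1 of GAIE, so it is returned alongside GAME and GAIN.

```python
# A compact BK-tree over the Levenshtein distance, following the construction
# and query procedure described above (levenshtein() is defined earlier).
class BKTree:
    def __init__(self, word):
        self.word = word
        self.children = {}                  # edge label (distance) -> subtree

    def insert(self, word):
        d = levenshtein(word, self.word)
        if d in self.children:
            self.children[d].insert(word)   # recurse along the existing edge
        else:
            self.children[d] = BKTree(word) # new child under an edge labeled d

    def search(self, word, n):
        """Return all stored words within distance n of `word`."""
        d = levenshtein(word, self.word)
        results = [self.word] if d <= n else []
        # Triangle inequality: only edges labeled d-n .. d+n can hold matches.
        for label in range(d - n, d + n + 1):
            if label in self.children:
                results += self.children[label].search(word, n)
        return results

tree = BKTree("GAME")
for w in ["FAME", "GAIN", "GATE", "AIM"]:
    tree.insert(w)
print(tree.search("GAIE", 1))   # ['GAME', 'GATE', 'GAIN'] -- edit distance <= 1
```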

In practice, a single query visits no more than 5% to 8% of all nodes, and two queries together generally visit no more than 17% to 25%; the efficiency far exceeds brute-force enumeration. With appropriate caching, and by keeping the Levenshtein distance bound n small, the algorithm can be made even more efficient.

Technical content

What are the important technical components of this work?

CNN, bidirectional LSTM, and transcription.
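
CRNN's transcription layer is based on CTC. In lexicon-free mode, the simplest decoding is greedy best-path decoding: take the most probable label at each frame, collapse consecutive repeats, and remove blanks. A minimal sketch; the toy charset and probability table below are illustrative assumptions, not the paper's setup:

```python
# Greedy (best-path) CTC decoding: argmax per frame, collapse repeated labels,
# then drop blanks. A simplified version of lexicon-free transcription.
import numpy as np

def ctc_greedy_decode(frame_probs, charset, blank=0):
    # frame_probs: (T, num_classes) per-frame class distributions
    path = frame_probs.argmax(axis=1)
    decoded = []
    prev = blank
    for p in path:
        if p != blank and p != prev:        # collapse repeats, skip blanks
            decoded.append(charset[p])
        prev = p
    return "".join(decoded)

charset = "-ok"                             # index 0 ('-') is the blank symbol
probs = np.array([[0.1, 0.8, 0.1],          # frame 1: 'o'
                  [0.1, 0.7, 0.2],          # frame 2: 'o' (repeat, collapsed)
                  [0.8, 0.1, 0.1],          # frame 3: blank
                  [0.1, 0.1, 0.8]])         # frame 4: 'k'
print(ctc_greedy_decode(probs, charset))    # -> "ok"
```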

 

 

Future work

What are the further research topics that can be extended from this work?

Further speeding up CRNN to make it more practical in real-world applications is another direction worth exploring in the future.

 

 

Discussion

What are the further research topics that can be extended from this work?