A Summary of OCR/Text Correction Methods

1、pycorrector

https://github.com/shibing624/pycorrector

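A Chinese text correction tool. It corrects wrong characters (or variant characters) that sound alike or look alike, and can be used to fix errors from Chinese pinyin and stroke-based input methods. Developed with Python 3.6.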

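pycorrector uses a language model to locate wrongly written characters, then corrects them using pinyin sound-similarity features, stroke and Wubi edit-distance features, and language-model perplexity features.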
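The idea described above (propose sound-alike candidates, then keep the one the language model finds least surprising) can be sketched roughly as follows. This is a toy illustration, not pycorrector's code or API: the corpus `CORPUS`, the confusion set `SOUND_ALIKE`, and the function names are all hypothetical, and a character bigram model with add-one smoothing stands in for a real language model.

```python
import math
from collections import Counter

# Toy corpus and sound-alike confusion set, both made up for illustration.
CORPUS = ["我想配副眼镜", "他戴着眼镜看书", "她的眼睛很大", "保护眼睛很重要"]
SOUND_ALIKE = {"睛": ["镜", "京"], "镜": ["睛", "京"]}


def train_bigram_model(sentences):
    """Count character contexts and bigrams, with sentence boundary tokens."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + list(sentence) + ["</s>"]
        vocab.update(tokens)
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, len(vocab)


def perplexity(sentence, model):
    """Perplexity of a sentence under the add-one-smoothed bigram model."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + list(sentence) + ["</s>"]
    log_prob = sum(
        math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size))
        for prev, cur in zip(tokens, tokens[1:])
    )
    return math.exp(-log_prob / (len(tokens) - 1))


def correct(sentence, model):
    """Try every single-character sound-alike substitution and keep the
    variant with the lowest perplexity (the input itself if none is lower)."""
    best, best_ppl = sentence, perplexity(sentence, model)
    for i, char in enumerate(sentence):
        for candidate in SOUND_ALIKE.get(char, []):
            trial = sentence[:i] + candidate + sentence[i + 1:]
            trial_ppl = perplexity(trial, model)
            if trial_ppl < best_ppl:
                best, best_ppl = trial, trial_ppl
    return best


MODEL = train_bigram_model(CORPUS)
print(correct("配副眼睛", MODEL))  # prints 配副眼镜
```

The sketch only considers one substitution at a time and has no separate detection step; a real system would first narrow down suspicious positions and would also draw candidates from look-alike (stroke-based) confusion sets.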

Question

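Common error types in the Chinese text correction task include:

Homophone words, e.g. 配副眼睛 -> 配副眼镜
Confusable-sound words, e.g. 流浪织女 -> 牛郎织女
Reversed word order, e.g. 伍迪艾伦 -> 艾伦伍迪
Word completion, e.g. 爱有天意 -> 假如爱有天意
Look-alike character errors, e.g. 高梁 -> 高粱
Full pinyin spelling, e.g. xingfu -> 幸福
Pinyin abbreviation, e.g. sz -> 深圳
Grammatical errors, e.g. 想象难以 -> 难以想象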

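Of course, not every business scenario has all of these problems. For example, input methods need to handle the first four types, search engines need to handle all of them, and text correction after speech recognition only needs to handle the first two. "Look-alike character errors" mainly arise with Wubi or stroke-based handwriting input. This project focuses on the correction tasks caused by homophones, confusable sounds, look-alike characters, full pinyin spelling, and grammatical errors.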

2、FASPell

https://github.com/iqiyi/FASPell

This repository (licensed under the GNU General Public License v3.0) contains all the data and code you need to build a state-of-the-art (as of early 2019) Chinese spell checker and replicate the experiments in the original paper:

FASPell: A Fast, Adaptable, Simple, Powerful Chinese Spell Checker Based On DAE-Decoder Paradigm

The paper is published in the Proceedings of the 2019 EMNLP Workshop W-NUT: The 5th Workshop on Noisy User-generated Text.

2-1、APTED algorithm for the Tree Edit Distance

https://github.com/DatabaseGroup/apted

Note that FASPell only adopts string edit distance to compute similarity. If you are interested in using tree edit distance to compute similarity, you need to download the APTED code (from the repository linked above) and compile a tree edit distance executable, apted.jar, into the home directory before running.
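As a rough illustration of the string edit distance similarity mentioned above (not FASPell's own code), the sketch below computes the Levenshtein distance between two strings and turns it into a score between 0 and 1. The function names, the normalization by the longer string's length, and the toneless pinyin strings in the usage lines are assumptions made for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, 1):
        cur = [i]
        for j, char_b in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                        # delete char_a
                cur[j - 1] + 1,                     # insert char_b
                prev[j - 1] + (char_a != char_b),   # substitute or match
            ))
        prev = cur
    return prev[-1]


def similarity(a, b):
    """Map edit distance to [0, 1]; 1.0 means the strings are identical."""
    if not a and not b:
        return 1.0
    return 1 - edit_distance(a, b) / max(len(a), len(b))


# Illustrative comparison of toneless pinyin strings.
print(similarity("jing", "jing"))  # prints 1.0
print(similarity("jing", "jin"))   # prints 0.75
```

The same function works on any string encoding of a character, so it could equally be applied to a stroke or component sequence to approximate visual similarity.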

3、OCR-Corrector

https://github.com/tiantian91091317/OCR-Corrector

Uses a language model to correct OCR recognition errors.