An Implementation of Pseudo Corpus Generation for Chinese Spelling Checking
Abstract: Most research on Chinese spelling checking treats error detection as a sequence tagging task, but these approaches are limited by the source and size of their training corpora. Most existing Chinese spelling check corpora are drawn from CFL (Chinese as a Foreign Language) settings, where the errors are made by learners writing in Chinese, for example in an exam. Building such corpora requires substantial manual effort, which keeps them small, and the errors they contain are distributed differently from those that native Chinese speakers make when typing. As a result, they are difficult to use directly in applications aimed at the Chinese publishing industry. This paper proposes a method that automatically generates a pseudo corpus containing spelling errors from Chinese Wikipedia text. Training on the pseudo corpus improved the model relative to training directly on the training set data, and the model trained on the pseudo corpus performed well on real-world text.
Key words:
- Pseudo corpus generation
- Chinese spelling checking
- Edit distance
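
The abstract does not spell out the generation procedure, so the following is only a minimal sketch of the general idea of pseudo corpus generation for sequence tagging, not the paper's method. It assumes a small hypothetical confusion set (`CONFUSION`) of similar characters and a hypothetical helper `make_pseudo_pair` that swaps characters in a clean sentence and records a 0/1 tag for each position. The names, the `error_rate` parameter, and the substitution-only error model are all illustrative assumptions.

```python
import random

# Hypothetical confusion set (an assumption for illustration): each character
# maps to characters it is commonly confused with. A real set would be much larger.
CONFUSION = {
    "的": ["地", "得"],
    "在": ["再"],
    "做": ["作"],
    "象": ["像"],
}


def make_pseudo_pair(sentence, error_rate=0.3, rng=None):
    """Corrupt a clean sentence and return (noisy_sentence, tags).

    tags[i] is 1 if the character at position i was replaced, else 0,
    which is the label format a sequence tagging model would train on.
    """
    if rng is None:
        rng = random.Random()
    chars, tags = [], []
    for ch in sentence:
        candidates = CONFUSION.get(ch)
        # Only characters present in the confusion set can be corrupted.
        if candidates and rng.random() < error_rate:
            chars.append(rng.choice(candidates))
            tags.append(1)
        else:
            chars.append(ch)
            tags.append(0)
    return "".join(chars), tags


if __name__ == "__main__":
    clean = "我在家做饭"  # stands in for a sentence taken from Wikipedia text
    noisy, tags = make_pseudo_pair(clean, error_rate=0.5, rng=random.Random(42))
    print(clean, noisy, tags)
```

Because only substitutions are applied, the noisy sentence keeps the same length as the clean one, so the tags align position by position with the original text.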
