An Implementation of Pseudo Corpus Generation for Chinese Spelling Checking

HU Rui

School of Information Science and Technology, North China University of Technology, Beijing 100144

doi: 10.19695/j.cnki.cn12-1369.2021.01.55

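Funding:

2020 Beijing College Students' Scientific Research and Entrepreneurship Action Plan Project (source: Beijing Municipal Education Commission)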

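About the author:
HU Rui, male, from Beijing, undergraduate. Research interests: natural language processing and knowledge graphs.

CLC number: TP391.1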
Publication history
- Received: 2020-12-01
- Revised: 2021-01-17
- Published online: 2021-09-23
- Issue date: 2021-01-25

Abstract

Most research on Chinese spelling checking detects errors by sequence tagging, but these methods are limited by the source and size of their training corpora. Most existing corpora are drawn from errors made by learners of Chinese as a Foreign Language (CFL) when writing, for example in exams. Building such corpora by hand is costly, so they remain small, and the errors they contain are distributed differently from those that native speakers make when typing text. This makes them hard to use directly in applications for the Chinese publishing industry. This paper proposes a method that automatically generates a pseudo corpus containing spelling errors from the Chinese Wikipedia corpus. Training on the pseudo corpus improved the model compared with using the training-set data directly, and a model trained on the pseudo corpus performed well on real-world text.

Keywords: pseudo corpus generation; Chinese spelling checking; edit distance
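
The abstract does not spell out the generation procedure, so the following is only a minimal sketch of the general idea: corrupt clean sentences by character substitution and keep per-character tags for a sequence tagging model. The confusion set `CONFUSION_SET`, the function names `make_pseudo_example` and `edit_distance`, and the `error_rate` parameter are all assumptions made for illustration and are not taken from the paper.

```python
import random

# Hypothetical confusion set (an assumption, not from the paper): each
# character maps to characters that are easily typed in its place.
CONFUSION_SET = {
    "在": ["再"],
    "的": ["地", "得"],
    "他": ["她", "它"],
}


def make_pseudo_example(sentence, confusion_set, error_rate, rng):
    """Return (noisy_sentence, tags).

    tags[i] is 1 where the character at position i was replaced, else 0,
    which is the label format a sequence tagging model would train on.
    """
    noisy, tags = [], []
    for ch in sentence:
        candidates = confusion_set.get(ch)
        if candidates and rng.random() < error_rate:
            noisy.append(rng.choice(candidates))
            tags.append(1)
        else:
            noisy.append(ch)
            tags.append(0)
    return "".join(noisy), tags


def edit_distance(a, b):
    """Levenshtein distance between two strings, by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    clean = "他在家的时候"
    noisy, tags = make_pseudo_example(clean, CONFUSION_SET, 0.5, rng)
    print(clean, noisy, tags, edit_distance(clean, noisy))
```

In a sketch like this, `error_rate` controls how dense the injected errors are, and the edit distance between the clean and noisy sentences gives a quick check on how far a generated example has drifted from its source. Clean sentences would come from a corpus such as Chinese Wikipedia, and a realistic confusion set would need to reflect the errors native speakers actually make when typing.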

