Challenges and Thoughts in Making Text Ground Truth for Republican Chinese Newspaper: Taking Jing Bao as an Example
Abstract: Many researchers in Europe and North America have explored the use of machine learning for optical character recognition (OCR), and many projects are producing ground truth (GT) data for this purpose. The situation is different for material in non-Latin scripts (NLS). The Early Chinese Periodicals Online (ECPO) project at Heidelberg University, Germany, began in 2021 to investigate how machine-readable text can be generated from historical Chinese newspapers. ECPO developed a semi-automatic pipeline that uses several machine learning methods (such as convolutional neural networks) to produce machine-readable full text, and chose the Republican-era entertainment tabloid Jing Bao (《晶报》, 1919-1940) as the basis for its experiments. Our paper focuses on two main aspects. First, we describe our ground truth editing work in detail: assembling the editing team, organizing the workflows, establishing processing regulations, and ensuring quality control. Second, we discuss particular challenges in producing the GT sets, including issues in character encoding and problems with variant characters related to Unicode. We produced two sets of ground truth data: textual/structural data (full-text GT) and segmentation data (geometry GT). Based on our work, we point out some pitfalls and provide hints to avoid them in order to make machine learning more efficient. We hope our experiences from the project can be helpful to others working with NLS material.
[V1] | 2024-04-22 12:40:40 | PSSXiv:202405.00478V1