Subjects: Library Science, Information Science >> Information Science submitted time 2023-04-01 Cooperative journals: 《图书情报工作》
Abstract: [Purpose/significance] To explore a method of constructing a humanities knowledge base that supports word- and entity-based retrieval and knowledge mining. [Method/process] This paper constructed a knowledge base of the Zhou, Qin and Han annals of the Zizhitongjian. The 68-volume, 600,000-character text was automatically segmented and part-of-speech tagged; entity information such as persons, locations, GIS coordinates and time was manually annotated; and a word- and entity-based full-text retrieval and map visualization system was designed. Co-occurrence information was used to derive relationships between persons and their travel information, and TF-IDF combined with time-series analysis was used to automatically extract and visualize key historical periods, persons and locations. [Result/conclusion] In-depth information labeling based on words and entities effectively resolves the problems of word boundaries, identical names referring to different persons, and different names referring to the same person, and lays a solid foundation for further research on knowledge mining and knowledge services for ancient books.
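The key-entity extraction step described above can be sketched as TF-IDF computed over time slices of the annotated text: an entity that is frequent within one period but rare across periods scores highest for that period. This is a minimal illustration with toy data; the entity names and period labels are invented stand-ins, not the paper's actual corpus or scoring setup.

```python
import math
from collections import Counter

# Toy corpus: each "document" is the bag of annotated entity mentions
# in one time slice of the chronicle.  Names are illustrative only.
periods = {
    "Qin": ["嬴政", "嬴政", "李斯", "咸阳", "咸阳", "咸阳"],
    "Chu-Han": ["刘邦", "项羽", "项羽", "项羽", "咸阳"],
    "Han": ["刘邦", "刘邦", "萧何", "长安", "长安"],
}

def tf_idf(periods):
    """Score each entity per period: term frequency within the slice
    times log inverse document frequency across slices."""
    n = len(periods)
    df = Counter()
    for ents in periods.values():
        df.update(set(ents))  # count each entity once per slice
    scores = {}
    for period, ents in periods.items():
        tf = Counter(ents)
        total = len(ents)
        scores[period] = {e: (tf[e] / total) * math.log(n / df[e]) for e in tf}
    return scores

scores = tf_idf(periods)
# Highest-scoring entity marks the "key" person/place of each era.
key = {p: max(s, key=s.get) for p, s in scores.items()}
print(key)
```

Note how the IDF factor downweights an entity such as 咸阳 that appears in several periods, even when it is locally frequent, which is what makes the per-period winners distinctive.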
Subjects: Library Science, Information Science >> Information Science submitted time 2023-04-01 Cooperative journals: 《图书情报工作》
Abstract: [Purpose/significance] The classics are carriers of traditional Chinese culture, thought and wisdom. Combined with digital-humanities methods of data acquisition, labeling and analysis, automatic entity recognition in the classics is of great significance for subsequent applied research. [Method/process] A corpus was built from 25 pre-Qin texts that had been automatically segmented and manually annotated. On corpora of different sizes, seven deep learning models (Bi-LSTM, Bi-LSTM-Attention, Bi-LSTM-CRF, Bi-LSTM-CRF-Attention, Bi-RNN, Bi-RNN-CRF and BERT) were used to extract the entities that constitute historical events, and their performance was compared. [Result/conclusion] Trained on the full corpus, the Bi-LSTM-Attention and Bi-RNN-CRF models reached accuracies of 89.79% and 89.33% respectively, confirming the feasibility of applying deep learning to large-scale text datasets.
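Sequence models of the kind compared above typically emit one tag per character in a BIO scheme, which is then decoded into entity spans for evaluation. The sketch below shows such a decoder; the tag inventory (PER, LOC) and the example sentence are illustrative assumptions, not the paper's exact annotation scheme.

```python
def decode_bio(chars, tags):
    """Collect (entity_text, entity_type) spans from character-level
    BIO tags: B- opens a span, matching I- extends it, anything else
    (O, a new B-, or a type mismatch) closes it."""
    entities, buf, etype = [], [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            if buf:
                entities.append(("".join(buf), etype))
            buf, etype = [ch], tag[2:]
        elif tag.startswith("I-") and buf and tag[2:] == etype:
            buf.append(ch)
        else:
            if buf:
                entities.append(("".join(buf), etype))
            buf, etype = [], None
    if buf:  # flush a span that runs to the end of the sentence
        entities.append(("".join(buf), etype))
    return entities

chars = list("晋侯及楚子战于城濮")
tags = ["B-PER", "I-PER", "O", "B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]
print(decode_bio(chars, tags))
```

Because all seven models share this output format, span-level precision and recall can be computed uniformly over the decoded entities, which is what makes the reported accuracy comparison across architectures meaningful.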
Subjects: Library Science, Information Science >> Information Science submitted time 2017-11-08 Cooperative journals: 《数据分析与知识发现》
Abstract: [Objective] To verify the influence of segmentation consistency and corpus category on the efficiency of CRF-based word segmentation for Middle Chinese, and on that basis to further improve segmentation performance and reduce the workload of manual proofreading. [Methods] Taking Middle-period histories, Buddhist scriptures and novels as example corpora, this paper addresses automatic word segmentation of Middle Chinese by optimizing the segmentation principles and combining a CRF model with a dictionary, eliminating the segmentation inconsistencies that commonly arise in manually segmented Middle Chinese. Two feature types, character class and dictionary information, are further introduced into CRF segmentation, and the most suitable segmentation template for each feature is selected through comparative experiments. [Results] The overall F-score of segmentation exceeds 99% in the closed test and reaches 89%-95% in the comprehensive open test. [Limitations] The study of segmentation inconsistency mainly targets two-character words, so the recognition of words of three or more characters (multi-character words) is slightly weaker. [Conclusions] On the premise of effectively improving segmentation consistency, character-class and dictionary-tag features can effectively improve the accuracy of CRF segmentation for Middle Chinese, and the proposed segmentation system can serve multiple categories of Middle Chinese corpora.
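The two extra CRF features named above, character class and dictionary information, can be sketched as a per-character feature extractor whose output columns would be fed to a CRF toolkit. The character-class inventory, the toy lexicon, and the BMES-style dictionary tag are illustrative assumptions, not the paper's actual templates.

```python
# Illustrative stand-ins for the paper's resources.
NUMERALS = set("一二三四五六七八九十百千萬")
LEXICON = {"天下", "諸侯"}  # toy dictionary of known words
MAX_WORD = 2               # longest word in the toy lexicon

def char_class(ch):
    """Coarse character-class feature: numeral, Han character, or other."""
    if ch in NUMERALS:
        return "NUM"
    if "\u4e00" <= ch <= "\u9fff":
        return "HAN"
    return "OTHER"

def dict_tag(sent, i):
    """Dictionary feature in BMES style: B/M/E if some lexicon word
    covers position i at its begin/middle/end, else S."""
    for n in range(MAX_WORD, 1, -1):  # prefer longer matches
        for start in range(max(0, i - n + 1), i + 1):
            if sent[start:start + n] in LEXICON:
                if i == start:
                    return "B"
                if i == start + n - 1:
                    return "E"
                return "M"
    return "S"

def features(sent):
    """One (char, class, dict_tag) row per character, CRF-column style."""
    return [(ch, char_class(ch), dict_tag(sent, i)) for i, ch in enumerate(sent)]

print(features("天下諸侯一"))
```

In a real pipeline each row would become one line of a CRF training file, with a feature template referencing the class and dictionary columns in a context window around the current character.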