
Construction, Performance and Application of New Era People's Daily Segmented Corpus (I): Construction and Evaluation of Corpus (postprint)

Abstract: [Purpose/significance] Constructing a segmented corpus of People's Daily for the new era provides newly annotated data for Chinese information processing and new language resources for analyzing modern Chinese from a diachronic perspective. [Method/process] After analyzing existing Chinese word segmentation corpora, the paper describes the data source, annotation specification, and construction process of the new corpus. The performance of the corpus is then evaluated by building automatic word segmentation models and comparing the results with those of an existing corpus. [Result/conclusion] The New Era People's Daily Segmented Corpus (NEPD) is large in scale, covers a long time span, and follows the basic processing standards for modern Chinese corpora. The January 2018 portion of NEPD is used to build a segmentation model based on conditional random fields (CRF). The performance of the January 2018 People's Daily corpus is evaluated and compared with that of the January 1998 People's Daily corpus. The evaluation indexes show that the new-era corpus performs relatively well overall. The 1998 corpus cannot be replaced, but constructing NEPD is nonetheless necessary.
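The abstract does not name its evaluation indexes or feature design, so the following is only a minimal sketch under stated assumptions: that segmentation is framed as character tagging (the common B/M/E/S scheme for CRF-based segmenters) and that quality is scored with word-level precision, recall, and F1. The function names `to_bmes`, `to_spans`, and `score_segmentation` are hypothetical and are not taken from the paper.

```python
def to_bmes(words):
    """Convert a segmented sentence into per-character B/M/E/S tags.

    Assumption: the B/M/E/S scheme is a common choice for CRF-based
    segmenters; the abstract does not say which scheme was used.
    """
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def to_spans(words):
    """Map a segmented sentence to a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def score_segmentation(gold_sents, pred_sents):
    """Word-level precision, recall and F1 over paired sentences."""
    correct = gold_total = pred_total = 0
    for gold, pred in zip(gold_sents, pred_sents):
        if "".join(gold) != "".join(pred):
            raise ValueError("gold and predicted sentences differ in text")
        gold_spans, pred_spans = to_spans(gold), to_spans(pred)
        correct += len(gold_spans & pred_spans)  # words with matching boundaries
        gold_total += len(gold_spans)
        pred_total += len(pred_spans)
    precision = correct / pred_total if pred_total else 0.0
    recall = correct / gold_total if gold_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Illustrative toy data, not drawn from NEPD.
gold = [["新时代", "人民日报", "语料库"]]
pred = [["新", "时代", "人民日报", "语料库"]]
precision, recall, f1 = score_segmentation(gold, pred)
```

In this toy case two of the four predicted words match the three gold words, so precision is 0.5 and recall is 2/3. Comparing a model trained on January 2018 data with one trained on January 1998 data would mean running the same scoring over each model's output on a shared test set.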

Version History

[V1] 2023-07-26 17:47:02 ChinaXiv:202307.00327V1