Subjects: Library Science,Information Science >> Information Science submitted time 2024-05-08
Abstract: Purpose/Significance Starting from the field of humanities and social sciences, this paper compares the performance of large language models on basic humanities and social sciences knowledge and on academic texts, aiming to provide a systematic large language model evaluation benchmark for the field for the reference of researchers in related areas. Methods/Processes Seven evaluation tasks related to the humanities and social sciences were designed and corresponding metrics were selected. On this basis, current open-source, high-performing general-purpose Chinese large language models were selected and run locally to complete the domain-specific tasks in question-and-answer form, and their performance in the humanities and social sciences was evaluated quantitatively with the chosen metrics. Results/Conclusions Among the open-source models evaluated, Qwen performs best, followed by Baichuan2 and InternLM, while Atom performs worst among both the base models and the chat models; moreover, in most cases the chat model outperforms the corresponding base model.
Subjects: Library Science,Information Science >> Library Science submitted time 2023-10-08 Cooperative journals: 《知识管理论坛》
Abstract: [Purpose/significance] This paper studies mainstream news media using the People's Daily Online corpus, aiming to provide ideas and practical support for research on automatic text summarization, which can then be applied to news and other text information processing and contribute to knowledge aggregation services and information access research. [Method/process] The experimental corpus comprised the People's Daily Online sub-corpora for January 2015, June 2015 and January 2016 from the New Era People's Daily (NEPD) corpus. Extractive summarization algorithms based on TF-IDF, TextRank and others, together with an abstractive summarization model based on the pointer-generator network, were applied, and the resulting summaries were analyzed and evaluated. [Result/conclusion] The experiments implement extractive news summarization algorithms and construct a pointer-generator network model for abstractive news summarization on the People's Daily Online corpus. The experimental results are evaluated with the ROUGE metrics (ROUGE-1, ROUGE-2 and ROUGE-L). This work provides corpus and practical support for automatic news summarization systems.
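The ROUGE scores used in the evaluation above reduce to n-gram overlap between a candidate summary and a reference summary. As a minimal illustrative sketch (not the authors' implementation), ROUGE-N recall can be computed as:

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """ROUGE-N recall: clipped n-gram overlap divided by the number of
    n-grams in the reference summary."""
    def ngrams(tokens, size):
        return Counter(tuple(tokens[i:i + size])
                       for i in range(len(tokens) - size + 1))
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    # Clip each candidate n-gram count by its count in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0
```

For example, `rouge_n("the cat sat".split(), "the cat ran".split(), 1)` gives 2/3, since two of the three reference unigrams are matched. ROUGE-L, which the paper also reports, instead uses the longest common subsequence.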
Subjects: Library Science,Information Science >> Library Science submitted time 2023-08-27 Cooperative journals: 《图书情报工作》
Abstract: [Purpose/significance] Data science is emerging as a new interdisciplinary field combining many disciplines. Extracting entity knowledge from data science recruitment announcements not only helps to understand the development of data science from a market perspective, but also helps to improve the content of data science teaching. [Method/process] Based on recruitment announcements collected from recruitment websites, and combining information science methods of data collection, annotation and organization, a data science corpus was constructed and the corresponding entities were extracted from it. [Result/conclusion] On an annotated corpus of 11,000 recruitment announcements, this paper compared the extraction performance of the Bi-LSTM-CRF, CRF and Bi-LSTM models on data science recruitment entities, determined the final automatic extraction model, designed an automatic extraction platform for data science recruitment entities, and built a network of data science recruitment entities.
Subjects: Library Science,Information Science >> Library Science submitted time 2023-08-26 Cooperative journals: 《图书情报工作》
Abstract: [Purpose/significance] Abstracts concisely state the research purpose, research methods and final conclusions, and are of high exploratory value and significance. [Method/process] Four models (long short-term memory networks, support vector machines, LSTM-CRF and CNN-CRF) were selected to identify the functional structure of 3,672 journal articles from the CNKI database. [Result/conclusion] The long short-term memory network model achieves a highest F value of 69.15%, the LSTM-CRF neural network model a highest F value of 88.76%, and the RNN-CRF model a highest F value of 89.10%. The support vector machine classifier achieves a highest macro-F value of 72.04%. The experimental results provide a valuable reference for model selection in studying the functional structure of academic papers in the field of library and information science.
Subjects: Library Science,Information Science >> Library Science submitted time 2023-07-26 Cooperative journals: 《图书情报工作》
Abstract: [Purpose/significance] Statistics and analysis of sentence length along different dimensions and of vocabulary distribution, based on the New Era People's Daily (NEPD) word segmentation corpus, are conducive not only to a relatively comprehensive and systematic understanding of the linguistic characteristics of contemporary Chinese text, but also to subsequent natural language processing and text mining of the text. [Method/process] Based on the word segmentation data of People's Daily for January 2018 and for January 1998, 6 sentence categories used in the statistics were determined; the sentence length distribution was counted and analyzed in both character and word units, and the static distribution of words was revealed on the basis of Zipf's law. [Result/conclusion] From the perspective of sentence length distribution in the word dimension and the Zipf distribution of vocabulary, both sentence length and vocabulary distribution changed between the 1998 and 2018 corpora over time, but the change is continuous and correlated.
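Zipf's law, used above to characterize the vocabulary, predicts that a word's frequency is roughly proportional to 1/rank^s. As a minimal sketch (an illustration, not the paper's code), the exponent s can be estimated by least squares on the log-log rank-frequency curve:

```python
import math

def zipf_exponent(freqs):
    """Estimate the Zipf exponent s in f(r) ~ C / r^s by least-squares
    regression of log(frequency) on log(rank)."""
    freqs = sorted(freqs, reverse=True)          # rank 1 = most frequent
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope                                # s is the negated slope
```

On a perfectly Zipfian frequency list such as `[1000 / r for r in range(1, 51)]`, the estimate is 1.0; real corpus vocabularies deviate, and the deviation itself is informative.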
Subjects: Library Science,Information Science >> Library Science submitted time 2023-07-26 Cooperative journals: 《图书情报工作》
Abstract: [Purpose/significance] Building deep learning automatic word segmentation models on the New Era People's Daily (NEPD) word segmentation corpus not only provides relevant experience for constructing high-performance segmentation models, but also verifies the performance of the corresponding deep learning models on a concrete natural language processing task. [Method/process] After introducing Bi-directional Long Short-Term Memory (Bi-LSTM) and Bi-directional Long Short-Term Memory with a conditional random field layer (Bi-LSTM-CRF), this paper describes the process, types and details of Chinese word segmentation preprocessing, along with the evaluation indexes, parameters and hardware platform; Bi-LSTM and Bi-LSTM-CRF Chinese automatic word segmentation models were then constructed respectively, and their overall performance was analyzed. [Result/conclusion] Measured by precision, recall and F value, the overall performance of both the Bi-LSTM and Bi-LSTM-CRF Chinese automatic word segmentation models is reasonable. In terms of specific performance, the Bi-LSTM segmentation model is superior to the Bi-LSTM-CRF model, but the difference is very small.
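Character-level segmenters such as the Bi-LSTM(-CRF) models described here are typically trained on per-character position tags. A minimal sketch of the common B/M/E/S encoding (an illustrative assumption; the paper's exact tag set may differ):

```python
def words_to_bmes(words):
    """Convert a segmented sentence to per-character tags:
    S = single-character word, B/M/E = begin/middle/end of a longer word."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

def bmes_to_words(chars, tags):
    """Recover words from characters and predicted BMES tags."""
    words, cur = [], ""
    for ch, tag in zip(chars, tags):
        cur += ch
        if tag in ("E", "S"):     # a word ends here
            words.append(cur)
            cur = ""
    if cur:                       # tolerate a dangling B/M at the end
        words.append(cur)
    return words
```

Training pairs are produced with `words_to_bmes`, and the model's predicted tag sequence is turned back into a segmentation with `bmes_to_words`, which is what precision/recall/F are then computed over.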
Subjects: Library Science,Information Science >> Library Science submitted time 2023-07-26 Cooperative journals: 《图书情报工作》
Abstract: [Purpose/significance] The construction of a segmented People's Daily corpus for the new era provides a new annotated corpus for Chinese information processing, and also offers a new language resource for analyzing modern Chinese from a diachronic perspective. [Method/process] The data source, annotation specification and construction process of the corpus were explained on the basis of analyzing existing Chinese word segmentation corpora; the corpus quality was then evaluated by building automatic word segmentation models and comparing against an existing corpus. [Result/conclusion] The New Era People's Daily Segmented Corpus (NEPD), with its large scale and long time span, follows the basic processing standards of modern Chinese corpora. The January 2018 portion of NEPD was selected to build a segmentation model based on the conditional random field model, and its performance was evaluated and compared with that of the January 1998 People's Daily corpus. The specific evaluation indexes show that the overall quality of the new-era corpus is relatively outstanding. The 1998 corpus cannot be replaced, but constructing the NEPD is nevertheless very necessary.
Subjects: Library Science,Information Science >> Library Science submitted time 2023-04-13
Abstract: Purpose/significance Knowledge mining of plants in pre-Qin classics and the construction of a pre-Qin plant knowledge graph are of great significance for understanding the society and living conditions of ancient Chinese people. Method/process This paper carries out detailed labeling and quantitative analysis of plant words in pre-Qin classics. Based on CRF and a variety of deep learning models, plant named entity recognition models for pre-Qin classics were constructed, and the performance of each model was compared and analyzed to determine the optimal one. A knowledge-graph-oriented knowledge organization model for classics and plants was designed. Result/conclusion The plant entity recognition model based on the domain pre-trained language model SikuRoBERTa performs best, with a harmonic mean (F value) of 85.44%, providing an effective method for entity-based mining, aggregation and visualization of plant knowledge in pre-Qin classics.
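The "harmonic mean" reported for entity recognition models throughout these abstracts is the F1 score, the harmonic mean of precision and recall:

```python
def f1_score(precision, recall):
    """F1: harmonic mean of precision and recall (0 when both are 0).
    Penalizes imbalance: a model with P=1.0, R=0.5 scores only ~0.67."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because the harmonic mean is dominated by the smaller of the two values, a model cannot reach a high F value by trading recall away for precision or vice versa, which is why it is the standard single-number summary for NER.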
Subjects: Library Science,Information Science >> Information Science submitted time 2023-04-01 Cooperative journals: 《图书情报工作》
Abstract: [Purpose/significance] The study of digital humanities in ancient Chinese classics shows a promising future based on the digitization and intelligent processing of ancient texts, because quantitative analysis methods provide new perspectives. [Method/process] The study is based on the data of the Spring and Autumn Annals and the Three Commentaries. With annotation of knowledge about women in these books, the study provides quantitative analysis of names, countries and other important knowledge about pre-Qin Chinese women. It also analyzes inter-state marriages on the basis of the annotated data, measuring marriage activity in depth as an indicator of diplomatic importance. [Result/conclusion] The study gives a new interpretation of the female characters in the books and proposes a measurable, visual research method that provides reliable data verification for relevant research, supplying reliable data for related traditional studies.
Subjects: Library Science,Information Science >> Information Science submitted time 2023-04-01 Cooperative journals: 《图书情报工作》
Abstract: [Purpose/significance] To explore a method of constructing a humanistic knowledge base based on word- and entity-level retrieval and knowledge mining. [Method/process] This paper constructed a corpus of the Zhou, Qin and Han annals of the Zizhitongjian, achieved automatic segmentation and part-of-speech tagging of the 68-volume, 600,000-character text, manually annotated entity information such as persons, locations, GIS coordinates and time, and designed a system for full-text retrieval and map visualization based on words and entities. Co-occurrence information was used to obtain relationships between characters and their travel information. Using TF-IDF and time series analysis, the key periods, people and locations in the history were automatically extracted and illustrated. [Result/conclusion] In-depth information labeling based on words and entities is a good solution to the problems of word boundaries, of identical names referring to different persons, and of different names referring to the same person, and it can solidify the basis for multidisciplinary studies on knowledge mining and knowledge services for ancient books.
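The character relationships above are derived from co-occurrence. As a minimal sketch, assuming entities have already been annotated per sentence (this is illustrative, not the authors' system), counting same-sentence pairs looks like:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_pairs(sentence_entities):
    """Count how often two entities appear in the same sentence — the
    simple signal used to propose relationships between characters.
    `sentence_entities` is a list of entity lists, one per sentence."""
    pairs = Counter()
    for entities in sentence_entities:
        # sort + dedupe so ("A", "B") and ("B", "A") count as one pair
        for a, b in combinations(sorted(set(entities)), 2):
            pairs[(a, b)] += 1
    return pairs
```

Pairs with high counts become candidate edges in the person network; the same idea applied to person-place pairs yields the travel information mentioned above.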
Subjects: Library Science,Information Science >> Information Science submitted time 2023-04-01 Cooperative journals: 《图书情报工作》
Abstract: [Purpose/significance] Crop cultivation has a long history in China. Analyzing the temporal distribution and evolution of ancient crops is of great significance for optimizing modern agricultural planting structures. [Method/process] This paper put forward an analytical process for the temporal distribution and evolution characteristics of crops, comprising four parts: corpus acquisition and digitization, segmentation and entity-relationship extraction, temporal distribution analysis, and evolution analysis, and selected the Shihuozhi chapters from 15 historical books for empirical analysis. [Result/conclusion] Based on the analysis results of the Shihuozhi, the feasibility and effectiveness of the method are verified against relevant historical, economic, philological and other multidisciplinary research data; the method can serve as a reference for analyzing the temporal distribution and evolution characteristics of ancient crops from classics. In the future, however, the level of automation should be improved, the research sample expanded, and the event types refined to further optimize the process.
Subjects: Library Science,Information Science >> Information Science submitted time 2023-04-01 Cooperative journals: 《图书情报工作》
Abstract: [Purpose/significance] Textual variants are a common phenomenon and an important research object in ancient books. Traditional collation of ancient books relies on manually searching a large number of texts for materials, including variants; this work is time-consuming and laborious, and the resulting data may be neither accurate nor comprehensive. Automatic mining of variant sentences by computer can obtain effective information from larger-scale corpora, and collation combined with automatic variant mining enables exhaustive retrieval, which is of great significance to the collation of ancient books and provides new ideas and methods for collation research in the new period. [Method/process] This research automatically mined the variant sentences in the Three Biographies of the Spring and Autumn Period, combining deep learning with the parallel corpora commonly used in machine translation. The results of LSTM and BERT models were compared with those of the classic SVM model, and the variants expressing the same event with different descriptions across two ancient books were further explored and analyzed. [Result/conclusion] The experiment obtained a deep learning model, suited to the Three Biographies of the Spring and Autumn Period, for automatically mining variants that express the same event. It proves the feasibility of integrating new technologies such as deep learning into the construction of ancient book knowledge bases. Meanwhile, the combination of deep learning and parallel corpora can play a significant role in studying variant sentences and provides practical support for applying digital humanities in Chinese language and literature.
Subjects: Library Science,Information Science >> Information Science submitted time 2023-04-01 Cooperative journals: 《图书情报工作》
Abstract: [Purpose/significance] The classics are carriers of Chinese traditional culture, thought and wisdom. Combining digital humanities methods of data acquisition, labeling and analysis, automatic entity recognition in the classics is of great significance for subsequent applied research. [Method/process] A corpus was constructed from 25 pre-Qin texts that had been automatically segmented and manually annotated. On corpora of different sizes and with seven deep learning models (Bi-LSTM, Bi-LSTM-Attention, Bi-LSTM-CRF, Bi-LSTM-CRF-Attention, Bi-RNN, Bi-RNN-CRF and BERT), the entities constituting historical events were extracted and the models' effects compared. [Result/conclusion] The accuracy of the Bi-LSTM-Attention and Bi-RNN-CRF models trained on the full corpus reached 89.79% and 89.33%, respectively, confirming the feasibility of applying deep learning to large-scale text datasets.
Subjects: Library Science,Information Science >> Information Science submitted time 2023-04-01 Cooperative journals: 《图书情报工作》
Abstract: [Purpose/Significance] Relation extraction from academic full texts is the key technology for constructing academic full-text knowledge graphs. Such knowledge graphs can structure document knowledge, improve the efficiency with which researchers retrieve documents, analyze them and grasp research trends, and support implicit knowledge discovery through cognitive reasoning over graphs. [Method/Process] Enhancing relation extraction with external knowledge has proved effective in many studies, but relation extraction in specific fields often lacks available external knowledge. This paper finds that high-confidence knowledge within the full text itself can also assist full-text relation extraction. To that end, based on the dual-system theory of cognitive processes (system 1 being intuitive cognition, system 2 reasoning cognition), a sentence-level model was designed to acquire knowledge; high-confidence knowledge was obtained through distant supervision and then integrated into the final classification layer of a document-level deep learning model. [Result/Conclusion] On the biomedical academic full-text dataset (CDR-revised), the F1 is about 11.13% higher than that of the current state-of-the-art model.
Subjects: Library Science,Information Science >> Information Science submitted time 2017-11-08 Cooperative journals: 《数据分析与知识发现》
Abstract: [Objective] To extract food safety event entities from large-scale food safety event reports. [Methods] Based on past food safety events, and combining information science methods of data acquisition, annotation and organization, multiple distributional features of food safety event entities were fused and a conditional random field model was used to build a food safety event corpus and extract the corresponding entities from it. [Limitations] The feature templates designed for food safety event entity extraction have limited transferability to other domains. [Results] On an annotated food safety event corpus of 15 million characters, by exploiting internal and external features of food safety event entities, an extraction model based on the conditional random field machine learning model was built; its highest F value reaches 91.94%. [Conclusions] Analysis of the extraction results shows that conditional-random-field-based entity extraction is feasible on this domain-specific food corpus.
Subjects: Library Science,Information Science >> Information Science submitted time 2017-11-08 Cooperative journals: 《数据分析与知识发现》
Abstract: [Objective] In the food safety domain, building relevant databases greatly helps the supervision and control of food safety, and automatic word segmentation plays a vital role in building and using indexes and in corpus construction. This paper applies a conditional-random-field character-tagging statistical learning method to the automatic segmentation of a food safety emergency corpus. [Methods] Characteristics of the corpus such as its word length distribution were analyzed, and different experiments on the feature selection and feature templates involved in the segmentation process were carried out to determine how different feature choices and templates affect the segmentation results. [Results] The experiments show that more features do not necessarily yield better segmentation, as feature interference can occur; in the food safety emergency corpus, where two- and three-character words account for 46.62% of the vocabulary, the feature templates representing the current character and the characters immediately before and after it influence segmentation markedly. [Conclusions] Through experiments with different features, feature templates and their combinations, the optimal features and templates for automatically segmenting this corpus were selected; with the 5-tag scheme and the corresponding templates, the F value on the target corpus reaches 92.88%.
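The finding above — that the current character and its immediate neighbours dominate — corresponds to a simple unigram feature template. A hypothetical sketch of such a template function (feature names are illustrative, not the paper's):

```python
def char_features(chars, i):
    """Unigram feature template for character i: the current character
    plus its immediate left and right neighbours, with boundary markers.
    These are the features a CRF tagger would condition on at position i."""
    feats = {"c0": chars[i]}
    feats["c-1"] = chars[i - 1] if i > 0 else "<BOS>"
    feats["c+1"] = chars[i + 1] if i < len(chars) - 1 else "<EOS>"
    return feats
```

A real CRF toolkit expands such templates into binary indicator features per (feature, tag) pair; richer templates (bigrams, wider windows) can be added, but as the experiments show, more features are not always better.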
Subjects: Library Science,Information Science >> Information Science submitted time 2017-11-08 Cooperative journals: 《数据分析与知识发现》
Abstract: [Objective] Building on a summary of current citation metadata extraction methods, this paper explores automatic citation metadata extraction combining semantic knowledge and machine learning. [Methods] A neural network model was used to train word vectors on manually segmented corpora. Exploiting the phenomenon that metadata of the same type cluster in particular regions of the vector space, a support vector machine classification algorithm was used to automatically categorize and label the metadata. [Results] In experiments using foreign-language citation data as the test set, the method achieved high precision and recall, and handled citations containing multiple languages and abbreviations particularly well. [Limitations] Fine-grained extraction of temporal content in citation metadata remains limited. [Conclusions] The experimental results show that the method performs well in automatic discovery and labeling of citation metadata, and greatly improves applicability and fault tolerance.
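The clustering of same-type metadata in the trained vector space is typically measured by cosine similarity. A minimal sketch (illustrative only; the paper's pipeline feeds the vectors to an SVM classifier rather than thresholding similarities):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors: 1.0 for parallel
    vectors, 0.0 for orthogonal ones. Nearby metadata tokens of the same
    type (e.g. author names) tend to score high against each other."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Because word vectors of the same metadata type occupy nearby regions, a linear separator such as an SVM can carve the space into per-type regions, which is the classification step the abstract describes.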
Subjects: Library Science,Information Science >> Information Science submitted time 2017-11-08 Cooperative journals: 《数据分析与知识发现》
Abstract: [Objective] Chinese organization names have complex structure and contain many rare words, making them difficult to recognize; correct recognition matters greatly for downstream information science tasks such as information extraction, information retrieval, knowledge mining and institutional research evaluation. [Methods] Using the deep-learning Recurrent Neural Network (RNN) approach, and oriented to the characteristics of Chinese characters and words, this paper redefines the input and output of organization-name tagging and proposes a character-level recurrent network tagging model. [Results] Against a word-level recurrent neural network baseline, the proposed character-level model clearly improves precision, recall and F value for Chinese organization name recognition, with the F value rising by 1.54%; the improvement is larger on data containing rare words, where the F value rises by 11.05%. [Limitations] Decoding uses a greedy strategy and can fall into local optima; modeling with a conditional random field algorithm might obtain globally optimal results. [Conclusions] The method has a simple architecture and exploits character-level features for modeling, achieving better results than using word features alone.
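The limitation noted above — greedy decoding versus a CRF's global optimum — can be seen on a toy example with hypothetical scores (not the paper's model): greedy decoding picks the best tag at each step independently, while Viterbi decoding maximizes the total emission-plus-transition score over the whole sequence.

```python
def greedy_decode(emissions):
    """Pick the highest-scoring tag at each time step independently,
    ignoring transition scores (the strategy the paper uses)."""
    return [max(range(len(step)), key=step.__getitem__) for step in emissions]

def viterbi_decode(emissions, transitions):
    """Globally optimal tag path under emission + transition scores,
    as a CRF layer would decode."""
    n = len(emissions[0])
    score = list(emissions[0])            # best score ending in each tag
    backptrs = []
    for em in emissions[1:]:
        new_score, ptrs = [], []
        for j in range(n):
            best = max(range(n), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best] + transitions[best][j] + em[j])
            ptrs.append(best)
        score, backptrs = new_score, backptrs + [ptrs]
    tag = max(range(n), key=score.__getitem__)
    path = [tag]
    for ptrs in reversed(backptrs):       # follow back-pointers
        tag = ptrs[tag]
        path.append(tag)
    return path[::-1]
```

With emissions `[[3, 1], [1, 2]]` greedy decoding returns `[0, 1]`, but if the transition score from tag 0 to tag 1 is a heavy penalty (say -5), Viterbi returns `[0, 0]` instead: exactly the kind of local optimum the limitation describes.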