杜志钢
  • An Agricultural News Text Classification Method Based on BERT and Deep Active Learning (基于BERT和深度主动学习的农业新闻文本分类方法)

    Subjects: Other Disciplines >> Synthetic Discipline | Submitted: 2023-03-31 | Cooperative journal: 《农业图书情报学报》

    Abstract: [Purpose/Significance] Most models used in news classification research are trained without active learning. These models share common problems: data cannot be labeled promptly and labeling costs are high, which hinders the analysis of agricultural news. With the explosive growth of online news, it has become even harder to label data, train supervised text classification models, and screen agriculture-related news from diversified online news sources. To address this problem, pool-based active learning and deep active learning, the most commonly used techniques, are applied to select the more valuable and representative samples from the unlabeled data for manual labeling and to construct labeled data sets, improving the efficiency and effectiveness of news classification and agricultural news mining. [Method/Process] Machine learning models commonly used for text classification, such as the random forest, multinomial naive Bayes, and logistic regression classifiers, were each combined with least-confidence active learning to analyze their performance, and the BERT model was combined with three sampling strategies, namely discriminative active learning, deep Bayesian active learning, and least confidence, for deep active learning training. On a news corpus of 19,847 samples crawled from Sina and other news websites and then cleaned, with the goal of screening agriculture-related news from samples covering diverse topics, iterative experiments that added 30 labeled samples per round were run to measure how the F1 score improved with the number of annotations under each combination of methods. In addition, the representativeness and diversity of the samples selected by each sampling function in the BERT-based deep active learning methods were compared, in order to understand the characteristics of each strategy and to inform the selection and improvement of AL strategies in the future. The paper also analyzed how much labeling cost the proposed method can save. [Results/Conclusions] Comparing the machine learning models showed that although the gradient boosting tree and support vector machine classifiers achieve high accuracy, they are not suitable for active learning because they process large-scale, high-dimensional text data inefficiently. After training the remaining machine learning models and the BERT model with the corresponding active learning or deep active learning methods, it was found that active learning significantly improves the training process of every model. Among them, the BERT model combined with the discriminative active learning sampling function achieves the best news text classification performance with the smallest amount of labeled data. The samples selected by the discriminative active learning sampling function are also the most representative and diverse, which explains the source of this method's advantage. It was also found that, for the same task model, the higher the required classification accuracy, the more annotation cost active learning saves compared with non-active learning.
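
To make the training procedure described in the abstract concrete, the sketch below shows a pool-based active learning loop with least-confidence sampling and 30 newly labeled samples per round, as stated above. It uses a TF-IDF plus logistic regression pipeline purely for illustration; the feature pipeline, hyperparameters, and helper names are assumptions rather than the authors' implementation, and in the paper's BERT experiments the task model and the query function (discriminative or deep Bayesian active learning) would be swapped in.

```python
# Minimal sketch, assuming a TF-IDF + logistic regression task model:
# each round, the current model picks the 30 least-confident unlabeled
# samples, they are "manually" labeled (here: revealed from y), and the
# classifier is retrained. Helper names are illustrative, not the paper's.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def least_confidence_query(model, X_pool, batch_size=30):
    """Return indices of the pool samples whose top class probability is lowest."""
    confidence = model.predict_proba(X_pool).max(axis=1)
    return np.argsort(confidence)[:batch_size]


def active_learning_loop(texts, labels, seed_size=30, rounds=10, batch_size=30):
    X = TfidfVectorizer(max_features=20000).fit_transform(texts)
    y = np.asarray(labels)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    labeled = list(range(seed_size))                  # small initial labeled set
    pool = list(range(seed_size, X_train.shape[0]))   # remaining unlabeled pool

    for _ in range(rounds):
        model = LogisticRegression(max_iter=1000)
        model.fit(X_train[labeled], y_train[labeled])

        # Track how the F1 score improves as more annotations are added.
        f1 = f1_score(y_test, model.predict(X_test), average="macro")
        print(f"labeled={len(labeled):5d}  test macro-F1={f1:.3f}")

        # Query the least-confident samples and move them into the labeled set.
        picked = least_confidence_query(model, X_train[pool], batch_size)
        newly_labeled = [pool[i] for i in picked]
        labeled.extend(newly_labeled)
        pool = [i for i in pool if i not in set(newly_labeled)]

    return model
```

The same loop structure carries over to the deep active learning setting: the logistic regression model would be replaced by a BERT classifier fine-tuned on the labeled set each round, and `least_confidence_query` by a discriminative or deep Bayesian acquisition function.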