Subjects: Computer Science >> Integration Theory of Computer Science submitted time 2022-05-10 Cooperative journals: 《计算机应用研究》
Abstract: To address the problem that traditional similarity-based outlier detection algorithms were not effective enough on high-dimensional unbalanced datasets, this paper proposed a novel Ensemble learning and Random projection-based Outlier Detection (EROD) framework. First, the EROD algorithm integrated several random projection methods to reduce the dimensionality of high-dimensional data, which improved data diversity. Second, it integrated several different traditional outlier detectors into a heterogeneous ensemble model, which increased the robustness of the algorithm. Finally, EROD trained the heterogeneous ensemble model on the reduced-dimensional data, used two optimal combinations of the trained models to reduce the total error and obtain the final outlier score of each object, and labeled objects with high outlier scores as outliers. The results showed that the algorithm had an average improvement of 3.6% and 14.45% in AUC and Precision@n compared with traditional outlier detection algorithms and ensemble-learning-based outlier detection algorithms. The EROD algorithm is therefore well suited to detecting anomalies in high-dimensional unbalanced data.
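The pipeline described in this abstract (random projections, then several detectors, then a combined score) can be illustrated with a minimal sketch. It is not the EROD implementation: the two distance-based detectors, the Gaussian projection, the min-max normalization, and the plain averaging step are all assumptions chosen for brevity, and every function name is hypothetical.

```python
import math
import random


def random_projection(data, out_dim, rng):
    """Project each row of `data` onto `out_dim` random Gaussian directions."""
    in_dim = len(data[0])
    matrix = [[rng.gauss(0.0, 1.0) / math.sqrt(out_dim) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return [[sum(w * x for w, x in zip(row, point)) for row in matrix]
            for point in data]


def kth_neighbor_scores(data, k):
    """Detector 1 (assumed): score is the distance to the k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(data):
        dists = sorted(math.dist(p, q) for j, q in enumerate(data) if j != i)
        scores.append(dists[k - 1])
    return scores


def mean_neighbor_scores(data, k):
    """Detector 2 (assumed): score is the mean distance to the k nearest neighbors."""
    scores = []
    for i, p in enumerate(data):
        dists = sorted(math.dist(p, q) for j, q in enumerate(data) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


def min_max(scores):
    """Rescale scores to [0, 1] so that different detectors are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def ensemble_outlier_scores(data, n_projections=5, out_dim=3, k=3, seed=0):
    """Average normalized scores over every (projection, detector) pair.

    Plain averaging is an assumption; the abstract does not specify how the
    trained models are combined.
    """
    rng = random.Random(seed)
    detectors = [kth_neighbor_scores, mean_neighbor_scores]
    totals = [0.0] * len(data)
    runs = 0
    for _ in range(n_projections):
        projected = random_projection(data, out_dim, rng)
        for detector in detectors:
            for i, s in enumerate(min_max(detector(projected, k))):
                totals[i] += s
            runs += 1
    return [t / runs for t in totals]
```

Objects whose averaged score is highest would be reported as outliers; a real system would also need a threshold or a top-n cut-off, which this sketch leaves out.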
Subjects: Computer Science >> Integration Theory of Computer Science submitted time 2019-04-01 Cooperative journals: 《计算机应用研究》
Abstract: To address the long-distance dependency problem of traditional machine learning sentiment classification methods and the tendency of deep learning methods to ignore the sentiment lexicon, this paper proposes a Uyghur sentiment classification method based on an attention mechanism combined with a bidirectional long short-term memory network and a convolutional neural network. The concatenated multi-feature vector is used as the input of the bidirectional long short-term memory network to capture context information, and the attention mechanism and convolutional network are used to capture hidden emotional features of the text, which effectively strengthens the capture of text sentiment semantics. The experimental results show that the F1 values of this method on two-category and five-category Uyghur sentiment data sets are 5.59% and 7.73% higher, respectively, than those of the machine learning method.
Subjects: Computer Science >> Integration Theory of Computer Science submitted time 2019-04-01 Cooperative journals: 《计算机应用研究》
Abstract: Because every MapReduce job independently performs a series of complex operations such as task scheduling and resource allocation, multiple MapReduce jobs coordinated by the same algorithm incur a large amount of redundant disk I/O and repeated resource requests, which leads to inefficient resource utilization during job computation. Big data mining algorithms are usually divided into several MapReduce jobs. Taking the ItemBased algorithm as an example, this paper analyzes the resource efficiency of mining algorithms in multi-MapReduce-job collaboration scenarios. It proposes an ItemBased algorithm based on DistributedCache, which uses DistributedCache to cache I/O data between multiple MapReduce jobs, overcomes the independence between jobs, and reduces the waiting delay between Map and Reduce tasks. The experimental results show that DistributedCache can improve the data reading speed of MapReduce jobs. The algorithm reconstructed with DistributedCache greatly reduces the waiting delay between Map and Reduce tasks and improves resource efficiency by more than three times.
Subjects: Computer Science >> Integration Theory of Computer Science submitted time 2019-01-03 Cooperative journals: 《计算机应用研究》
Abstract: To address the low accuracy of Kazakh parsing and the lack of research on neural-network-based Kazakh parsing, this paper focuses on parsing Kazakh phrase structure. The method is based on the shift-reduce approach, but the stack elements are sentence spans rather than partial trees, so the trees do not need to be binarized during parsing. It uses a bidirectional LSTM to extract features of each sentence span in the context of the whole sentence, and uses a multilayer perceptron to train the parsing model. The resulting Kazakh parsing accuracy reaches 76.92%. The results improve the accuracy of Kazakh parsing and lay a good foundation for Kazakh machine translation and semantic analysis.
Subjects: Computer Science >> Integration Theory of Computer Science submitted time 2018-12-13 Cooperative journals: 《计算机应用研究》
Abstract: Users with different personalities express themselves differently in language, yet existing sentiment analysis work rarely considers the user's personality. To solve this problem, this paper proposes a personality-based microblog sentiment analysis model, PLSTM. The model first uses personality recognition rules to divide microblog texts into five personality sets and one universal set, then trains a sentiment classifier for each set, and finally integrates the six base sentiment classifiers to obtain the final sentiment polarity. The experimental results show that the F1 value of the PLSTM method reaches 96.95%, and that PLSTM improves on commonly used benchmark sentiment analysis models in accuracy, recall, and F1 value.
Subjects: Computer Science >> Integration Theory of Computer Science submitted time 2018-10-11 Cooperative journals: 《计算机应用研究》
Abstract: Because the feature representation for Uyghur text sentiment classification has not been studied systematically, this paper takes traditional n-gram features as a basis, extracts new features and combined features at different scales from Uyghur sentiment corpora, and classifies the corpora as positive or negative with a support vector machine (SVM) classifier. The results indicate that, for Uyghur text sentiment classification, unigram features perform best among the basic features, and that combining unigram features with phrase features further improves classification performance. The best combined features achieve a classification accuracy 1.78% higher than that of unigram features. This paper is the first to make a comprehensive evaluation of the classification performance of different features on a unified data set. The results can serve as a reference for future Uyghur sentiment classification research.
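As a rough illustration of how basic and combined n-gram features can be built from a tokenized sentence, the sketch below counts unigrams and bigrams together. Using bigrams as a stand-in for the paper's phrase features is an assumption, the function names are hypothetical, and the SVM training step is omitted.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def combined_features(tokens, orders=(1, 2)):
    """Count n-grams of every requested order in a single feature bag."""
    features = Counter()
    for n in orders:
        features.update(ngrams(tokens, n))
    return features
```

The resulting counts would typically be mapped to a sparse vector before being passed to a classifier.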
Subjects: Computer Science >> Integration Theory of Computer Science submitted time 2018-08-13 Cooperative journals: 《计算机应用研究》
Abstract: Text vectorization is the basis of text classification, and feature weighting is one of the important factors that directly affect the quality of the text vector representation. Feature weighting schemes based on category information are not accurate enough in expressing the relationship between features and categories: the classification ability of features with the same category frequency cannot be compared, so the distribution of the features within each category should also be considered. This paper combines the inverse category frequency (ICF) and the inner-category entropy of the features into the term weight calculation and constructs two supervised feature weighting schemes. The experimental results on the Uygur text categorization dataset show that this method clearly improves the spatial distribution of the samples and the micro-averaged F1 value of Uygur text classification.
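The two quantities named in this abstract can be computed as in the sketch below. It assumes the common definition ICF(t) = log(|C| / cf(t)), where cf(t) is the number of categories containing term t, and it takes the inner-category entropy to be the Shannon entropy of a term's occurrences across the documents of one category. How the paper combines them into its two weighting schemes is not stated here, so no combined weight is shown; the function names are hypothetical.

```python
import math


def inverse_category_frequency(term, docs_by_category):
    """log(|C| / cf(t)); returns 0.0 for a term that appears in no category."""
    n_categories = len(docs_by_category)
    cf = sum(1 for docs in docs_by_category.values()
             if any(term in doc for doc in docs))
    return math.log(n_categories / cf) if cf else 0.0


def inner_category_entropy(term, docs):
    """Shannon entropy of the term's counts over the documents of one category."""
    counts = [doc.count(term) for doc in docs]
    total = sum(counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in probs)
```

Here each document is a list of tokens and `docs_by_category` maps a category label to its documents. Two terms with the same category frequency get the same ICF, but their entropies differ when one is spread evenly over a category's documents and the other is concentrated in a few.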
Subjects: Computer Science >> Integration Theory of Computer Science submitted time 2018-05-18 Cooperative journals: 《计算机应用研究》
Abstract: To address text filtering in Uyghur web forums, this paper proposes a text filtering method based on term selection and a Rocchio classifier. First, it preprocesses the forum text to remove useless words and extracts stems (terms) based on an N-gram statistical model. Then, it proposes a balanced mutual information term selection method (BMITS), which balances correlation and redundancy, to reduce the dimension of the initial term set and obtain a reduced term set. Finally, it takes the text feature terms as input and uses the Rocchio classifier to filter out bad text. The experimental results show that the proposed method can accurately identify bad text and is effective.
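The final classification step can be sketched as a Rocchio (nearest-centroid) classifier over term-count vectors. This is a minimal illustration under stated assumptions: raw term counts instead of weighted features, cosine similarity as the closeness measure, and no BMITS term selection or stemming. All names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def centroid(vectors):
    """Average a list of sparse term-count vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: count / len(vectors) for term, count in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_rocchio(labelled_docs):
    """Build one centroid per label from (tokens, label) pairs."""
    by_label = defaultdict(list)
    for tokens, label in labelled_docs:
        by_label[label].append(Counter(tokens))
    return {label: centroid(vecs) for label, vecs in by_label.items()}


def classify(tokens, centroids):
    """Assign the label whose centroid is most similar to the document."""
    vec = Counter(tokens)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))
```

In a filtering setting, documents assigned to the undesirable class would be removed; the token lists passed in would already be restricted to the selected terms.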
Subjects: Computer Science >> Integration Theory of Computer Science submitted time 2018-04-17 Cooperative journals: 《计算机应用研究》
Abstract: To address similarity calculation and plagiarism detection for documents written in Uyghur, a content-based Uyghur plagiarism detection (U-PD) method is proposed. First, in the preprocessing stage, the Uyghur texts are segmented, stop words are deleted, stems are extracted, and synonyms are replaced; stem extraction is based on N-gram statistical models. Then, the hash value of each text block is calculated with the BKDRhash algorithm, and the hash fingerprint of the entire document is constructed. Finally, using the hash fingerprints, the document is matched against the document library at the document, paragraph, and sentence levels with the RKR-GST matching algorithm to obtain the document similarity and thereby detect plagiarism. The experimental evaluation on Uyghur documents shows that the proposed method detects plagiarized documents accurately and is feasible and effective.
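The fingerprinting step can be sketched as follows. The hash is the widely used BKDR string hash (multiplier 131, result kept to 31 bits). The block definition (sliding windows of three tokens) is an assumption, and the comparison shown is a simple set overlap standing in for RKR-GST, which is not implemented here. The function names are hypothetical.

```python
def bkdr_hash(text, seed=131):
    """BKDR string hash, masked to a non-negative 31-bit integer."""
    value = 0
    for ch in text:
        value = (value * seed + ord(ch)) & 0x7FFFFFFF
    return value


def fingerprint(tokens, block_size=3):
    """Hash every sliding block of `block_size` tokens (assumed block shape)."""
    return {bkdr_hash(" ".join(tokens[i:i + block_size]))
            for i in range(len(tokens) - block_size + 1)}


def fingerprint_overlap(tokens_a, tokens_b, block_size=3):
    """Share of the smaller fingerprint that also appears in the other one.

    This set overlap is a simplification used in place of RKR-GST matching.
    """
    fp_a = fingerprint(tokens_a, block_size)
    fp_b = fingerprint(tokens_b, block_size)
    if not fp_a or not fp_b:
        return 0.0
    return len(fp_a & fp_b) / min(len(fp_a), len(fp_b))
```

The same functions could be applied per sentence, per paragraph, or per document by changing what token list is passed in.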