ChinaXiv.org: Chinese Academy of Sciences Scientific Paper Preprint Platform

Total results: 9.

Your conditions: College of Information Science and Engineering, Xinjiang University (新疆大学信息科学与工程学院)

1. ChinaXiv:202205.00092

Outlier Detection Algorithm Based on Random Projection and Ensemble Learning

Subjects: Computer Science >> Integration Theory of Computer Science; submitted time: 2022-05-10; Cooperative journals: 《计算机应用研究》 (Application Research of Computers)

郭一阳, 于炯, 杜旭升, 曹铭

Abstract: To address the limited effectiveness of traditional similarity-based outlier detection algorithms on high-dimensional, imbalanced datasets, this paper proposes a novel Ensemble learning and Random projection-based Outlier Detection (EROD) framework. First, EROD integrates several random projection methods to reduce the dimensionality of high-dimensional data, which improves data diversity. Second, it integrates several different traditional outlier detectors into a heterogeneous ensemble model, which increases robustness. Finally, EROD trains the heterogeneous ensemble model on the dimension-reduced data and applies two optimal combinations of the trained models to reduce the total error, yielding a final outlier score for each object; objects with high outlier scores are flagged as outliers. The results show that the algorithm improves AUC and Precision@n by an average of 3.6% and 14.45%, respectively, compared with traditional outlier detection algorithms and ensemble-learning-based outlier detection algorithms. EROD is therefore well suited to detecting anomalies in high-dimensional imbalanced data.
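
The general pattern described in this abstract (random projections feeding several detectors whose scores are then combined) can be illustrated with a minimal sketch. It is not the authors' EROD implementation: the two detectors (k-th nearest neighbor distance and distance to the centroid), the plain score averaging, and every function name below are assumptions chosen for illustration, and the paper's "two optimal combinations" step is not reproduced.

```python
import math
import random


def random_projection(data, k, rng):
    """Project each row of `data` onto k dimensions with a Gaussian random matrix."""
    d = len(data[0])
    matrix = [[rng.gauss(0.0, 1.0) / math.sqrt(k) for _ in range(k)] for _ in range(d)]
    return [[sum(row[i] * matrix[i][j] for i in range(d)) for j in range(k)] for row in data]


def knn_distance_scores(points, n_neighbors=3):
    """Hypothetical detector 1: distance to the n-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[n_neighbors - 1])
    return scores


def centroid_distance_scores(points):
    """Hypothetical detector 2: distance to the mean of all points."""
    dim = len(points[0])
    centroid = [sum(p[j] for p in points) / len(points) for j in range(dim)]
    return [math.dist(p, centroid) for p in points]


def min_max(scores):
    """Rescale scores to [0, 1] so different detectors are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def ensemble_outlier_scores(data, k=3, n_projections=5, seed=0):
    """Average normalized detector scores over several random projections."""
    rng = random.Random(seed)
    totals = [0.0] * len(data)
    runs = 0
    for _ in range(n_projections):
        projected = random_projection(data, k, rng)
        for detector in (knn_distance_scores, centroid_distance_scores):
            for i, s in enumerate(min_max(detector(projected))):
                totals[i] += s
            runs += 1
    return [t / runs for t in totals]


# Toy data: 20 tightly clustered 10-dimensional points plus one planted outlier.
demo_rng = random.Random(42)
demo_data = [[demo_rng.gauss(0.0, 0.1) for _ in range(10)] for _ in range(20)]
demo_data.append([5.0] * 10)
demo_scores = ensemble_outlier_scores(demo_data)
print(max(range(len(demo_scores)), key=demo_scores.__getitem__))
```

Averaging min-max normalized scores is the simplest possible combination rule; it is used here only to keep the sketch short.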

Hits 2341 Downloads 414 Comment 0
2. ChinaXiv:201904.00061

Uyghur Sentiment Classification Based on Multiple Features and Deep Neural Networks

Subjects: Computer Science >> Integration Theory of Computer Science; submitted time: 2019-04-01; Cooperative journals: 《计算机应用研究》 (Application Research of Computers)

买买提阿依甫, 吾守尔·斯拉木, 艾斯卡尔·艾木都拉, 杨文忠, 帕丽旦·木合塔尔

Abstract: To address the long-distance dependency problem in traditional machine learning sentiment classification methods and the tendency of deep learning methods to ignore the sentiment lexicon, this paper proposes a Uyghur sentiment classification method that combines an attention mechanism with a bidirectional long short-term memory (BiLSTM) network and a convolutional neural network. The concatenated multi-feature vector is fed to the BiLSTM network to capture context information, and the attention mechanism and convolutional network capture hidden emotional features in the text, which effectively strengthens the capture of sentiment semantics. Experimental results show that the F1 value of this method on the two-category and five-category Uyghur sentiment datasets is 5.59% and 7.73% higher, respectively, than that of the machine learning method.

Hits 1596 Downloads 840 Comment 0
3. ChinaXiv:201904.00064

Resource Efficiency Optimization of Big Data Mining Algorithms under Multi-MapReduce Job Collaboration

Subjects: Computer Science >> Integration Theory of Computer Science; submitted time: 2019-04-01; Cooperative journals: 《计算机应用研究》 (Application Research of Computers)

廖彬, 张陶, 于炯, 黄静莱, 国冰磊, 刘炎

Abstract: Because every MapReduce job independently performs a series of complex operations such as task scheduling and resource allocation, multiple MapReduce jobs that cooperate within the same algorithm incur a large amount of redundant disk I/O and repeated resource requests, which leads to inefficient resource utilization during job execution. Big data mining algorithms are usually split into several MapReduce jobs. Taking the ItemBased algorithm as an example, this paper analyzes the resource efficiency of mining algorithms in the multi-MapReduce job collaboration scenario. It proposes an ItemBased algorithm built on DistributedCache, which caches I/O data between multiple MapReduce jobs, removes the isolation between jobs, and reduces the waiting delay between Map and Reduce tasks. Experimental results show that DistributedCache improves the data reading speed of MapReduce jobs. The algorithm restructured with DistributedCache greatly reduces the waiting delay between Map and Reduce tasks and improves resource efficiency by more than three times.

Hits 1549 Downloads 872 Comment 0
4. ChinaXiv:201901.00044

Research on Kazakh Syntactic Parsing Based on Sentence Spans

Subjects: Computer Science >> Integration Theory of Computer Science; submitted time: 2019-01-03; Cooperative journals: 《计算机应用研究》 (Application Research of Computers)

柴伟, 古丽拉·阿东别克

Abstract: Kazakh parsing suffers from low accuracy, and there is little research on neural-network-based Kazakh parsing. This paper focuses on parsing Kazakh phrase structure. It builds on the shift-reduce method, but the stack elements are sentence spans rather than partial trees, so the parse tree does not need to be binarized during parsing. It also uses a bidirectional LSTM to extract sentence span features, obtaining a representation of each span in the context of the whole sentence, and uses a multilayer perceptron to train the parsing model. The resulting Kazakh parsing accuracy reaches 76.92%. The results improve the accuracy of Kazakh parsing and lay a good foundation for Kazakh machine translation and semantic analysis.

Hits 1239 Downloads 654 Comment 0
5. ChinaXiv:201812.00120

PLSTM: A Personality-Based Microblog Sentiment Analysis Model

Subjects: Computer Science >> Integration Theory of Computer Science; submitted time: 2018-12-13; Cooperative journals: 《计算机应用研究》 (Application Research of Computers)

袁婷婷, 杨文忠, 仲丽君, 张志豪, 向进勇

Abstract: Users with different personalities express themselves differently in language, yet existing sentiment analysis work rarely considers user personality. To address this, this paper proposes PLSTM, a personality-based microblog sentiment analysis model. The model first uses personality recognition rules to divide microblog text into five personality sets and one universal set, then trains a sentiment classifier for each set, and finally integrates the six base sentiment classifiers to obtain the final sentiment polarity. Experimental results show that the F1 value of PLSTM reaches 96.95%, indicating that PLSTM improves accuracy, recall, and F1 value over commonly used benchmark sentiment analysis models.

Hits 1554 Downloads 884 Comment 0
6. ChinaXiv:201810.00059

Research on Feature Construction for Uyghur Sentiment Classification

Subjects: Computer Science >> Integration Theory of Computer Science; submitted time: 2018-10-11; Cooperative journals: 《计算机应用研究》 (Application Research of Computers)

热西旦木·吐尔洪太, 吾守尔·斯拉木

Abstract: Because the feature representation for Uyghur text sentiment classification has not been studied systematically, this paper takes traditional n-gram features as the basis, extracts new features and combined features from Uyghur sentiment corpora at different scales, and classifies the corpora as positive or negative with a support vector machine (SVM) classifier. The results indicate that, among the basic features, unigram features give the best classification performance for Uyghur text sentiment classification. Combining unigram features with phrase features further improves performance: the best-performing combined features achieve a classification accuracy 1.78% higher than unigram features alone. This paper is the first to comprehensively evaluate the classification performance of different features on a unified dataset. The results can serve as a reference for future Uyghur sentiment classification research.

Hits 1368 Downloads 798 Comment 0
7. ChinaXiv:201808.00093

Text Feature Weight Calculation Based on Category Information and Feature Entropy

Subjects: Computer Science >> Integration Theory of Computer Science; submitted time: 2018-08-13; Cooperative journals: 《计算机应用研究》 (Application Research of Computers)

阿力木江·艾沙, 殷晓雨, 库尔班·吾布力, 李喆

Abstract: Text vectorization is the basis of text classification, and feature weighting is one of the important factors that directly affect the quality of the text vector representation. Feature weighting schemes based on category information do not express the relationship between features and categories accurately enough: features with the same category frequency cannot be distinguished in classification ability, so the distribution of a feature within each category should also be considered. This paper combines the inverse category frequency (ICF) and the inner-category entropy of features in the term weight calculation and constructs two supervised feature weighting schemes. Experimental results on a Uyghur text categorization dataset show that this method clearly improves the spatial distribution of the samples and raises the micro-averaged F1 value of Uyghur text classification.
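
To make the two ingredients named in this abstract concrete, the sketch below computes an inverse category frequency and a normalized inner-category entropy for a term, then multiplies them into one weight. The abstract does not give the paper's formulas, so everything here is an assumption: the smoothed form `log(1 + C / cf)`, the entropy taken over a term's counts across the documents of one category, the combination `tf * icf * (1 + entropy)`, and all function names.

```python
import math


def inverse_category_frequency(term, docs_by_category):
    """Assumed ICF: log(1 + number of categories / categories containing the term)."""
    cf = sum(1 for docs in docs_by_category.values() if any(term in d for d in docs))
    if cf == 0:
        return 0.0
    return math.log(1 + len(docs_by_category) / cf)


def inner_category_entropy(term, docs):
    """Assumed entropy of the term's counts across one category's documents, in [0, 1]."""
    counts = [doc.count(term) for doc in docs]
    total = sum(counts)
    if total == 0 or len(docs) < 2:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return entropy / math.log(len(docs))


def term_weight(term, doc, category, docs_by_category):
    """Hypothetical combination: tf * icf * (1 + normalized inner-category entropy)."""
    tf = doc.count(term)
    icf = inverse_category_frequency(term, docs_by_category)
    spread = inner_category_entropy(term, docs_by_category[category])
    return tf * icf * (1 + spread)


# Toy corpus: each document is a list of tokens.
corpus = {
    "sports": [["ball", "game", "the"], ["ball", "team", "the"]],
    "tech": [["code", "the", "bug"], ["code", "the", "chip"]],
}
first_doc = corpus["sports"][0]
for token in first_doc:
    print(token, round(term_weight(token, first_doc, "sports", corpus), 3))
```

In this toy corpus, a term that appears in only one category and is spread evenly over that category's documents receives the largest weight, which is the intuition the abstract describes.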

Hits 7954 Downloads 1026 Comment 0
8. ChinaXiv:201805.00368

Text Filtering Method for Uyghur Forums Based on Term Selection and the Rocchio Classifier

Subjects: Computer Science >> Integration Theory of Computer Science; submitted time: 2018-05-18; Cooperative journals: 《计算机应用研究》 (Application Research of Computers)

如先姑力·阿布都热西提, 亚森·艾则孜, 艾山·吾买尔, 阿力木江·艾沙

Abstract: To address text filtering in Uyghur web forums, this paper proposes a text filtering method based on term selection and the Rocchio classifier. First, it preprocesses the forum text to remove useless words and extracts stems (terms) based on an N-gram statistical model. Then, it proposes a balanced mutual information term selection method (BMITS), which balances relevance and redundancy, to reduce the dimensionality of the initial term set and obtain a reduced term set. Finally, it takes the text feature terms as input and uses the Rocchio classifier to filter out bad text. Experimental results show that the proposed method identifies bad text accurately and is effective.
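
The final step of this abstract, Rocchio classification, amounts to assigning a document to the class whose centroid vector it most resembles. The sketch below shows that idea on raw term-frequency vectors with cosine similarity. It assumes already-tokenized input and leaves out the paper's stemming and BMITS term selection; the labels and all function names are hypothetical.

```python
import math
from collections import Counter


def centroid(vectors):
    """Average a list of term-frequency vectors into one class prototype."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: count / len(vectors) for term, count in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def train_rocchio(labeled_docs):
    """Build one centroid per label from (tokens, label) pairs."""
    by_label = {}
    for tokens, label in labeled_docs:
        by_label.setdefault(label, []).append(Counter(tokens))
    return {label: centroid(vectors) for label, vectors in by_label.items()}


def classify(centroids, tokens):
    """Return the label whose centroid is most similar to the document."""
    vec = Counter(tokens)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Toy training set with hypothetical labels "bad" and "ok".
training = [
    (["spam", "offer", "win"], "bad"),
    (["win", "prize", "offer"], "bad"),
    (["meeting", "schedule", "today"], "ok"),
    (["project", "meeting", "notes"], "ok"),
]
model = train_rocchio(training)
print(classify(model, ["offer", "prize"]))
```

A filtering system built this way would drop any post classified into the unwanted class; how the paper sets that decision rule is not stated in the abstract.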

Hits 2070 Downloads 1189 Comment 0
9. ChinaXiv:201804.02180

Uyghur Document Similarity Calculation and Plagiarism Detection Method Based on Hierarchical Matching

Subjects: Computer Science >> Integration Theory of Computer Science; submitted time: 2018-04-17; Cooperative journals: 《计算机应用研究》 (Application Research of Computers)

亚森·艾则孜, 艾山·吾买尔, 阿力木江·艾沙

Abstract: To address similarity calculation and plagiarism detection for documents written in Uyghur, a content-based Uyghur plagiarism detection (U-PD) method is proposed. First, in the preprocessing stage, the Uyghur texts are segmented, stop words are deleted, stems are extracted, and synonyms are replaced; stem extraction is based on N-gram statistical models. Then, the hash value of each text block is calculated with the BKDRhash algorithm, and the hash fingerprint of the entire document is constructed. Finally, based on the hash fingerprints, the document is matched against the document library at the document, paragraph, and sentence levels using the RKR-GST matching algorithm, and the document similarity is obtained to detect plagiarism. Experimental evaluation on Uyghur documents shows that the proposed method detects plagiarized documents accurately and is feasible and effective.
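
The hashing step in this abstract can be sketched as follows. BKDRHash is a simple multiplicative string hash; the seed 131 and the 31-bit mask are the commonly used values and are assumed here, since the abstract does not state them. The block size of three words and the Jaccard overlap used for similarity are also assumptions: Jaccard is swapped in as a much simpler stand-in for the RKR-GST matching the paper uses, and the function names are hypothetical.

```python
def bkdr_hash(text, seed=131):
    """BKDRHash over UTF-8 bytes, emulating 32-bit unsigned arithmetic."""
    value = 0
    for byte in text.encode("utf-8"):
        value = (value * seed + byte) & 0xFFFFFFFF
    return value & 0x7FFFFFFF


def fingerprint(text, block_size=3):
    """Hash every run of `block_size` consecutive words into a set of fingerprints."""
    words = text.split()
    if not words:
        return set()
    if len(words) < block_size:
        return {bkdr_hash(" ".join(words))}
    return {
        bkdr_hash(" ".join(words[i:i + block_size]))
        for i in range(len(words) - block_size + 1)
    }


def fingerprint_similarity(text_a, text_b):
    """Jaccard overlap of two fingerprint sets (a stand-in for RKR-GST matching)."""
    fp_a, fp_b = fingerprint(text_a), fingerprint(text_b)
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)


original = "the quick brown fox jumps over the lazy dog"
suspect = "a quick brown fox jumps over a sleeping dog"
print(fingerprint_similarity(original, suspect))
```

Comparing fingerprints of whole documents, then paragraphs, then sentences would mirror the three matching levels the abstract lists, with each level reusing the same hashing routine on smaller units.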

Hits 1958 Downloads 1119 Comment 0