A Network Big Data Classification Method Combining the Spark Framework with a Distributed KNN Classifier (postprint)

Abstract: Existing big data classification methods cannot meet the time and storage requirements of big data applications. To address this limitation, a design for a parallel multi-label k-nearest neighbor classifier for big data, based on the Apache Spark framework, is proposed. To reduce the cost of existing MapReduce schemes by using in-memory operations, the training set is first divided into several partitions using the parallel mechanism of the Apache Spark framework. Then, in the Map phase, the K nearest neighbors of the sample to be predicted are found within each partition, and in the Reduce phase, the final K nearest neighbors are determined from the results of the Map phase. Finally, the label sets of the neighbors are aggregated in parallel, and the target label set of the sample to be predicted is output by maximizing the posterior probability. Experiments were conducted on four big data classification datasets, including PokerHand. The proposed method achieved a lower Hamming loss, which demonstrates its effectiveness.
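
The Map and Reduce phases described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation and does not use Spark itself: partitions are simulated as in-memory lists, and all function names (`map_local_knn`, `reduce_merge`, `predict_labels`) are hypothetical. The final step is also an assumption: instead of the posterior-probability maximization named in the abstract, it uses a simple per-label majority vote among the K neighbors as a stand-in.

```python
import heapq
from functools import reduce


def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_local_knn(partition, query, k):
    """Map phase: the k nearest neighbors of `query` inside one partition.

    Each training sample is a (features, label_set) pair; the result is a
    list of (distance, label_set) pairs sorted by distance.
    """
    scored = ((sq_dist(features, query), labels) for features, labels in partition)
    return heapq.nsmallest(k, scored, key=lambda pair: pair[0])


def reduce_merge(left, right, k):
    """Reduce phase: merge two candidate lists and keep the k closest."""
    return heapq.nsmallest(k, left + right, key=lambda pair: pair[0])


def predict_labels(partitions, query, k):
    """Find the global k nearest neighbors, then aggregate their label sets.

    Assumption: a label is predicted when more than half of the neighbors
    carry it. This majority vote replaces the posterior-probability step
    mentioned in the abstract.
    """
    local = [map_local_knn(p, query, k) for p in partitions]
    neighbors = reduce(lambda a, b: reduce_merge(a, b, k), local)
    counts = {}
    for _, labels in neighbors:
        for label in labels:
            counts[label] = counts.get(label, 0) + 1
    return {label for label, c in counts.items() if 2 * c > len(neighbors)}


# Toy training set split into two partitions (illustrative data only).
partitions = [
    [((0.0, 0.0), {"a"}), ((1.0, 1.0), {"a", "b"})],
    [((0.2, 0.1), {"a", "b"}), ((5.0, 5.0), {"c"})],
]
predicted = predict_labels(partitions, (0.1, 0.1), k=3)
```

Each partition returns at most K candidates, so the Reduce phase only ever merges short lists. In a Spark setting the same two functions would plausibly map onto a per-partition transformation followed by a reduction, but that mapping is an assumption and is not taken from the paper.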

Version History

[V1] 2018-08-13 09:26:13 ChinaXiv:201808.00091V1