Abstract:
Existing big data classification methods cannot meet the time and storage constraints of big data applications. To address this limitation, a design method for a parallel multi-label k-nearest neighbor classifier for big data, based on the Apache Spark framework, is proposed. To reduce the cost of existing MapReduce schemes by using in-memory operations, the training set is first divided into several partitions using the parallel mechanism of the Apache Spark framework. Then, in the Map phase, the K nearest neighbors of the sample to be predicted are found within each partition, and in the Reduce phase, the final K nearest neighbors are determined from the results of the Map phase. Finally, the label sets of the neighbors are aggregated in parallel, and the target label set of the sample to be predicted is output by maximizing the posterior probability. Experiments were conducted on four big data classification datasets, including PokerHand. The proposed method achieved a lower Hamming loss, which demonstrates its effectiveness.
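The Map, Reduce, and posterior-maximization steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: plain Python lists stand in for Spark partitions, Euclidean distance is assumed, and the function names (`local_knn`, `merge_knn`, `count_labels`, `map_decision`) and the probability tables `prior`, `cond_present`, and `cond_absent` are hypothetical inputs that would normally be estimated from the training set.

```python
import heapq
import math
from collections import Counter
from functools import reduce


def local_knn(partition, query, k):
    """Map step: the k nearest neighbors of `query` inside one partition.

    Each partition item is (feature_vector, label_set); the result is a
    list of (distance, label_set) pairs. Euclidean distance is assumed.
    """
    scored = [(math.dist(x, query), labels) for x, labels in partition]
    return heapq.nsmallest(k, scored, key=lambda t: t[0])


def merge_knn(a, b, k):
    """Reduce step: keep the k closest candidates from two local results."""
    return heapq.nsmallest(k, a + b, key=lambda t: t[0])


def count_labels(neighbors):
    """Aggregate the neighbors' label sets into per-label counts."""
    counts = Counter()
    for _, labels in neighbors:
        counts.update(labels)
    return counts


def map_decision(counts, prior, cond_present, cond_absent):
    """Maximum a posteriori rule, applied independently to each label.

    prior[l]           : assumed P(l is present)
    cond_present[l][c] : assumed P(c neighbors carry l | l is present)
    cond_absent[l][c]  : assumed P(c neighbors carry l | l is absent)
    """
    predicted = set()
    for label, p in prior.items():
        c = counts.get(label, 0)
        if p * cond_present[label][c] > (1.0 - p) * cond_absent[label][c]:
            predicted.add(label)
    return predicted


# Toy data: three partitions of (features, label set) pairs.
partitions = [
    [((0.0, 0.0), {"a"}), ((5.0, 5.0), {"b"})],
    [((0.1, 0.0), {"a", "b"}), ((9.0, 9.0), {"b"})],
    [((0.0, 0.2), {"a"}), ((7.0, 7.0), set())],
]
query, k = (0.0, 0.0), 3

# Map over partitions, then reduce the local results to the global k.
local_results = [local_knn(p, query, k) for p in partitions]
neighbors = reduce(lambda a, b: merge_knn(a, b, k), local_results)
counts = count_labels(neighbors)

# Hypothetical probability tables, indexed by neighbor count 0..k.
prior = {"a": 0.5, "b": 0.5}
cond_present = {"a": [0.1, 0.2, 0.3, 0.4], "b": [0.1, 0.2, 0.3, 0.4]}
cond_absent = {"a": [0.4, 0.3, 0.2, 0.1], "b": [0.4, 0.3, 0.2, 0.1]}

predicted = map_decision(counts, prior, cond_present, cond_absent)
```

Because each partition returns at most K candidates, the Reduce step only ever compares small lists, which is what keeps the merge cheap regardless of the training set size.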