A Network Big Data Classification Method Combining the Spark Framework with a Distributed KNN Classifier (postprint)

Author: 曹瑜¹, 王楠²,³, 徐志超²
Institute:

1. 哈尔滨金融学院计算机系

2. 吉林财经大学管信学院

3. 吉林大学计算机学院
Submit Time:2018-08-13 09:26:13

Abstract: Existing big data classification methods struggle to meet the time and storage constraints of big data applications. To address this limitation, a parallel multi-label k-nearest neighbor classifier for big data is proposed, built on the Apache Spark framework. To reduce the cost of existing MapReduce schemes, the method relies on in-memory operations. First, the training set is divided into several partitions using the parallel mechanism of Apache Spark. Then, in the Map phase, the K nearest neighbors of the sample to be predicted are found within each partition, and in the Reduce phase the final K nearest neighbors are determined from the Map-phase results. Finally, the label sets of these neighbors are aggregated in parallel, and the target label set of the sample is output by maximizing the posterior probability. Experiments were conducted on four big data classification datasets, including PokerHand. The proposed method achieved a lower Hamming loss, which supports its effectiveness.
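The pipeline outlined in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the partitions are ordinary lists standing in for Spark partitions, the function names (`local_knn`, `merge_knn`, `neighbor_label_counts`, `map_label_set`) are hypothetical, squared Euclidean distance is an assumed metric, and the prior and likelihood tables are toy values chosen only to show the decision rule. In an ML-KNN style classifier those tables would be estimated from the training set.

```python
import heapq
from functools import reduce


def squared_distance(a, b):
    # Assumed metric: squared Euclidean distance between two feature tuples.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def local_knn(partition, query, k):
    # Map step: the (at most) k nearest candidates inside one partition,
    # returned as (distance, label_set) pairs.
    scored = ((squared_distance(features, query), labels)
              for features, labels in partition)
    return heapq.nsmallest(k, scored, key=lambda pair: pair[0])


def merge_knn(left, right, k):
    # Reduce step: keep the k globally nearest candidates of two partial results.
    return heapq.nsmallest(k, left + right, key=lambda pair: pair[0])


def neighbor_label_counts(neighbors, all_labels):
    # For each label, count how many of the final neighbors carry it.
    return {label: sum(1 for _, labels in neighbors if label in labels)
            for label in all_labels}


def map_label_set(counts, prior, likelihood):
    # Maximum a posteriori rule: keep a label when
    # P(present) * P(count | present) > P(absent) * P(count | absent).
    predicted = set()
    for label, c in counts.items():
        present = prior[label] * likelihood[label][True][c]
        absent = (1.0 - prior[label]) * likelihood[label][False][c]
        if present > absent:
            predicted.add(label)
    return predicted


# Toy training set split into three partitions (illustrative values only).
partitions = [
    [((0.0, 0.0), {"a"}), ((1.0, 0.0), {"a", "b"})],
    [((0.0, 2.0), {"a"}), ((5.0, 5.0), {"b"})],
    [((6.0, 5.0), {"b"}), ((0.5, 0.5), {"a", "b"})],
]
query, k = (0.0, 0.0), 3

partial = [local_knn(p, query, k) for p in partitions]          # Map
neighbors = reduce(lambda l, r: merge_knn(l, r, k), partial)    # Reduce
counts = neighbor_label_counts(neighbors, ["a", "b"])

# Toy probability tables indexed by neighbor count 0..k (assumed values).
prior = {"a": 0.5, "b": 0.3}
table = {True: [0.1, 0.2, 0.3, 0.4], False: [0.4, 0.3, 0.2, 0.1]}
likelihood = {"a": table, "b": table}

predicted = map_label_set(counts, prior, likelihood)
```

On Spark, the Map step would typically run once per partition (for example through `RDD.mapPartitions`) and the merge through a reduce over the partial results, so only k candidates per partition leave each executor.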

Keywords: classification processing; Apache Spark; parallel mechanism; data mining; Hamming loss; K-nearest neighbors

Subject: Computer Science >> Integration Theory of Computer Science

Journal: 计算机应用研究 (Application Research of Computers)

Contribution: Published
Cite as: ChinaXiv:201808.00091 (or this version ChinaXiv:201808.00091V1)
DOI: 10.12074/201808.00091V1
CSTR: 32003.36.ChinaXiv.201808.00091.V1
TXID: 3a377c5a-3298-41b7-83d5-8a62f69f5869
Recommended reference: 曹瑜, 王楠, 徐志超. Spark框架结合分布式KNN分类器的网络大数据分类处理方法. 计算机应用研究. https://chinaxiv.org/abs/201808.00091 [ChinaXiv:201808.00091V1]

