计算机科学 ›› 2019, Vol. 46 ›› Issue (11A): 204-207.
胡赢双, 陆亿红
HU Ying-shuang, LU Yi-hong
摘要: 随着位置大数据的爆炸式增长,传统的串行算法已无法对其进行高效地聚类处理,因此,基于MapReduce框架的并行聚类算法研究逐渐成为热点。聚类算法并行化后的聚类质量通常难以保证,因此对并行化聚类结果进行归约的方法极为重要。首先提出基于网格的改进DBSCAN并行化聚类算法,通过该步骤得到每个数据子集的聚类结果。然后在分析网格与簇的关系,定义网格簇和网格簇的连通、强连通概念的基础上,通过计算网格簇之间的连通权值矩阵,对具有强连通关系的网格簇进行归约,构成基于MapReduce的强连通网格聚类算法。该算法可实现位置大数据集的高效聚类。实验分析表明,基于MapReduce的强连通网格聚类算法对位置大数据的处理具有较高的效率和聚类质量。
中图分类号:
[1]刘经南,方媛,郭迟,等.位置大数据的分析处理研究进展[J].武汉大学学报(信息科学版),2014,39(4):379-385. [2]YUAN J,ZHENG Y,XIE X,et al.T-Drive:Enhancing driving directions with taxi drivers’ intelligence[J].IEEE Trans.on Knowledge & Data Engineering,2013,25(1):220-232. [3]ZHENG Y,XIE X,MA W Y.GeoLife:A collaborative social networking service among user,location and trajectory[J].Bulletin of the TechnicalCommittee on Data Engineering,2010,33(2):32-39. [4]YUAN J,ZHENG Y,XIE X.Discovering regions of differentfunctions in a city using humanmobility and POIs[C]∥Know-ledge Discovery and Data Mining.ACM Press,2012:186-194. [5]郭迟,刘经南,方媛,等.位置大数据的价值提取与协同挖掘方法[J].软件学报,2014,25(4):713-730. [6]林乐轩.基于位置大数据的行人路径预测及人群密度预估系统研究[D].北京:北京邮电大学,2018. [7]TOBLER W,DEICHMANN U,GOTTSEGEN J,et al.World population in a grid of spherical quadrilaterals[J].International Journal of Population Geography,1997,3(3):203-225. [8]李斯凡.基于无监督学习技术的位置大数据分析[D].杭州:浙江理工大学,2017. [9]GUTTMAN A.R-trees:A dynamic index structure for spatial searching[C]∥International Conference on Management of Data.Boston:1984:47-57. [10]ZHAO Q,SHI Y,LIU Q,et al.A grid-growing clustering algorithm for geospatial Data[J].Pattern Recognition Letters,2014,53(53):77-84. [11]KUMAR K M,REDDY A R M.A fast DBSC-AN clustering algorithm by accelerating neighbor searching using Groups me-thod[J].Pattern Recognition,2016,58:39-48. [12]HE Y,TAN H,LUO W,et al.MR-DBSCAN:An efficient parallel density-based clustering algorithm using MapReduce[C]∥2011 IEEE 17th International Conference on Parallel and Distributed Systems.IEEE Computer Society,2011:473-480. [13]KIM Y,SHIM K,KIM M S,et al.DBCURE-MR:An efficient density-based clustering algorithm for large data using MapReduce[J].Information Systems,2014,42(2):15-35. [14]于彦伟,贾召飞,曹磊,等.面向位置大数据的快速密度聚类算法[J].软件学报,2018,29(8):2470-2484. [15]ESTER M,KRIEGEL H P,SANDER J,et al.A density-based algorithm for discovering clusters in large spatial databases with noise[C]∥International Conference on Knowledge Discovery & Data Mining.Portland:AAAI Press,1996:226-231. [16]黄德才.数据仓库与数据挖掘教程[M].北京:清华大学出版社,2016. [17]钱潮恺,黄德才.基于维度频率相异度和强连通融合的混合数据聚类算法[J].模式识别与人工智能,2016,29(1):82-89. [18]余长俊,张燃.云环境下基于Canopy聚类的FCM算法研究[J].计算机科学,2014,41(S1):316-319. [19]GIONIS A,MANNILA H,TSAPARAS P.Clustering aggregation[J].ACM Transactions on Knowledge Discovery from Data (TKDD),2007,1(1):1-30. [20]ZAHN C T.Graph-theoretical methods for detecting and de-scribing gestalt clusters[J].IEEE Transactions on Computers,1971,100(1):68-86. [21]YUAN J,ZHENG Y,XIE X,et al.T-Drive:Enhancing driving directions with taxi drivers’ intelligence[J].IEEE Transactions on Knowledge & Data Engineering,2013,25(1):220-232. |
[1] | 程富豪, 徐泰华, 陈建军, 宋晶晶, 杨习贝. 基于顶点粒k步搜索和粗糙集的强连通分量挖掘算法 Strongly Connected Components Mining Algorithm Based on k-step Search of Vertex Granule and Rough Set Theory 计算机科学, 2022, 49(8): 97-107. https://doi.org/10.11896/jsjkx.210700202 |
[2] | 刘卫明, 安冉, 毛伊敏. 基于聚类和WOA的并行支持向量机算法 Parallel Support Vector Machine Algorithm Based on Clustering and WOA 计算机科学, 2022, 49(7): 64-72. https://doi.org/10.11896/jsjkx.210500040 |
[3] | 叶跃进, 李芳, 陈德训, 郭恒, 陈鑫. 基于国产众核架构的非结构网格分区块重构预处理算法研究 Study on Preprocessing Algorithm for Partition Reconnection of Unstructured-grid Based on Domestic Many-core Architecture 计算机科学, 2022, 49(6): 73-80. https://doi.org/10.11896/jsjkx.210900045 |
[4] | 陈鑫, 李芳, 丁海昕, 孙唯哲, 刘鑫, 陈德训, 叶跃进, 何香. 面向国产异构众核架构的CFD非结构网格计算并行优化方法 Parallel Optimization Method of Unstructured-grid Computing in CFD for DomesticHeterogeneous Many-core Architecture 计算机科学, 2022, 49(6): 99-107. https://doi.org/10.11896/jsjkx.210400157 |
[5] | 封雷, 朱登明, 李兆歆, 王兆其. 一种基于遮罩的稀疏点云滤波算法 Sparse Point Cloud Filtering Algorithm Based on Mask 计算机科学, 2022, 49(5): 25-32. https://doi.org/10.11896/jsjkx.210600129 |
[6] | 刘江, 刘文博, 张矩. OpenFoam中多面体网格生成的MPI+OpenMP混合并行方法 Hybrid MPI+OpenMP Parallel Method on Polyhedral Grid Generation in OpenFoam 计算机科学, 2022, 49(3): 3-10. https://doi.org/10.11896/jsjkx.210700060 |
[7] | 张仁杰, 陈伟, 杭梦鑫, 吴礼发. 基于变分自编码器的不平衡样本异常流量检测 Detection of Abnormal Flow of Imbalanced Samples Based on Variational Autoencoder 计算机科学, 2021, 48(7): 62-69. https://doi.org/10.11896/jsjkx.200600022 |
[8] | 张元鸣, 虞家睿, 蒋建波, 陆佳炜, 肖刚. 面向MapReduce的中间数据传输流水线优化机制 Intermediate Data Transmission Pipeline Optimization Mechanism for MapReduce Framework 计算机科学, 2021, 48(2): 41-46. https://doi.org/10.11896/jsjkx.191000103 |
[9] | 郭杰, 高希然, 陈莉, 傅游, 刘颖. 用数据驱动的编程模型并行多重网格应用 Parallelizing Multigrid Application Using Data-driven Programming Model 计算机科学, 2020, 47(8): 32-40. https://doi.org/10.11896/jsjkx.200500093 |
[10] | 程盛淦, 于浩然, 韦建文, 林新华. 基于定点压缩技术的双层粒子网格算法的设计与优化 Design and Optimization of Two-level Particle-mesh Algorithm Based on Fixed-point Compression 计算机科学, 2020, 47(8): 56-61. https://doi.org/10.11896/jsjkx.200200112 |
[11] | 任帅, 王萌, 范傲雄, 高泽, 徐解, Shahzad KHURRAM, 张弢. 一种零高分辨率3D网格模型的信息隐藏算法 Zero-high-resolution Information Hiding Algorithm for 3D Mesh Model 计算机科学, 2020, 47(7): 328-334. https://doi.org/10.11896/jsjkx.190800021 |
[12] | 罗晋楠, 张济民. 基于扩展Haar特征和DBSCAN的钢轨识别算法 Rail Area Extraction Using Extended Haar-like Features and DBSCAN Clustering 计算机科学, 2020, 47(6A): 153-156. https://doi.org/10.11896/JsJkx.200100008 |
[13] | 庞荣来,林静,张磊. 网格驱动的双向图像拼接算法 Grid-driven Bi-directional Image Stitching Algorithm 计算机科学, 2020, 47(3): 130-136. https://doi.org/10.11896/jsjkx.190100239 |
[14] | 郑磊, 吴俊威, 林俊勉, 潘翔. 交互标记约束的三维网格序列分割 Marker-constrained Interactive Segmentation of 3D Animated Meshes 计算机科学, 2020, 47(11A): 271-275. https://doi.org/10.11896/jsjkx.200400030 |
[15] | 邓定胜. 一种改进的DBSCAN算法在Spark平台上的应用 Application of Improved DBSCAN Algorithm on Spark Platform 计算机科学, 2020, 47(11A): 425-429. https://doi.org/10.11896/jsjkx.190700071 |
|