Data Mining Presentation

Comparing Clustering Algorithms
Partitioning Algorithms: K-Means, DBSCAN Using KD Trees
Hierarchical Algorithms: Agglomerative Clustering, CURE
K-Means Partitional clustering

Prototype based Clustering
Time Complexity: O(I * K * m * n), where I is the number of iterations, K the number of clusters, m the number of points and n the number of attributes
Space Complexity: O((m + K) * n)
Using KD Trees the overall Time Complexity reduces to O(m * log m)
Algorithm:
Select K initial centroids
Repeat
    For each point, find its closest centroid and assign the point to that centroid. This results in the formation of K clusters
    Recompute the centroid for each cluster
until the centroids do not change
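The loop above can be sketched in plain Python. This is a minimal illustration, not the LabVIEW implementation measured in this deck: the function name `kmeans`, the random choice of initial centroids, the `max_iter` cap and the handling of empty clusters are all assumptions made for the example.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-Means on a list of equal-length tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # select K initial centroids (assumed: random picks)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assign each point to its closest centroid (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Recompute each centroid as the mean of the points assigned to it.
        new_centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # assumed: an empty cluster keeps its centroid
        if new_centroids == centroids:  # stop when the centroids do not change
            break
        centroids = new_centroids
    return centroids, labels
```

Each iteration compares all m points with all K centroids over n attributes, which is where the O(I * K * m * n) time bound comes from.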
K-Means (Contd.)
Datasets - SPAETH2 2D dataset of 3360 points
K-Means (Contd.)
Performance Measurements
Compiler Used: LabVIEW 8.2.1
Hardware Used: Intel Core(TM)2 IV 1.73 GHz, 1 GB RAM
Current Status: Done
Time Taken: 355 ms / 3360 points
K-Means (Contd.)
Pros
Simple
Fast for low dimensional data
It can find pure sub clusters if a large number of clusters is specified
Cons
K-Means cannot handle non-globular clusters or clusters of different sizes and densities
K-Means will not identify outliers
K-Means is restricted to data which has the notion of a center (centroid)
Agglomerative Hierarchical Clustering
Start with one-point (singleton) clusters and recursively merge the two or more most similar clusters into one "parent" cluster until the termination criterion is reached.
Algorithms:
MIN (Single Link)
MAX (Complete Link)
Group Average (GA)
MIN: susceptible to noise/outliers
MAX/GA: may not work well with non-globular clusters
CURE tries to handle both problems
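The three similarity criteria differ only in how the distance between two clusters is measured. The sketch below is a naive illustration under assumptions: the function names are hypothetical, Euclidean distance is assumed, "k clusters remaining" is assumed as the termination criterion, and exactly two clusters are merged per step.

```python
import math

def single_link(a, b):
    """MIN: distance between the closest pair of points across two clusters."""
    return min(math.dist(p, q) for p in a for q in b)

def complete_link(a, b):
    """MAX: distance between the farthest pair of points across two clusters."""
    return max(math.dist(p, q) for p in a for q in b)

def group_average(a, b):
    """GA: mean distance over all cross-cluster pairs of points."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def agglomerate(points, k, linkage=single_link):
    """Merge the two most similar clusters until k clusters remain."""
    clusters = [[p] for p in points]  # start with singleton clusters
    while len(clusters) > k:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters
```

Because MIN looks only at the single closest pair, one stray point between two groups can chain them together, which is the noise sensitivity noted above.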
Data Set
2-D data set used
The SPAETH2 dataset is a related collection of data for cluster analysis. (Around 1500 data points)
Algorithm optimization
It involved the implementation of a Minimum Spanning Tree using Kruskal's algorithm
The Union By Rank method is used to speed up the algorithm
Environment: Implemented using MATLAB
Other Tools: Gnuplot
Present Status: Single Link and Complete Link done, Group Average in progress
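Single Link clustering corresponds to running Kruskal's algorithm on the complete distance graph and stopping once the desired number of components remains. The following is a sketch in Python rather than the MATLAB code used for the presentation; the class and function names are hypothetical, and the path halving inside `find` is an extra optimization assumed here on top of union by rank.

```python
import math
from itertools import combinations

class DisjointSet:
    """Union-find structure with union by rank."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving (assumed extra)
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False  # already in the same component
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra  # attach the lower-rank root under the higher-rank root
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return True

def single_link_kruskal(points, k):
    """Add MST edges in increasing length until k components remain; return a label per point."""
    edges = sorted(
        combinations(range(len(points)), 2),
        key=lambda e: math.dist(points[e[0]], points[e[1]]),
    )
    ds = DisjointSet(len(points))
    components = len(points)
    for a, b in edges:
        if components == k:
            break
        if ds.union(a, b):
            components -= 1
    return [ds.find(i) for i in range(len(points))]
```

Union by rank keeps the trees shallow, so each `find` stays cheap even after many merges.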

[Figure: Single Link / CURE on globular clusters, after 64000 iterations, and the final cluster]
[Figure: Single Link / CURE on non-globular clusters]
KD Trees

K Dimensional Trees
Space Partitioning Data Structure
Splitting planes perpendicular to the Coordinate Axes
Useful in Nearest Neighbor Search
Reduces the time complexity of a nearest neighbor query to O(log n) on average
Has been used in many clustering algorithms and other domains
Clustering algorithms use KD Trees extensively to improve their time complexity, e.g. Fast K-Means, Fast DBSCAN
We considered 2 popular clustering algorithms which use the KD Tree approach to speed up clustering and minimize search time
We used an Open Source implementation of KD Trees (available under the GNU GPL)
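To make the data structure concrete, here is a minimal KD Tree with a nearest neighbor query. It is an illustrative sketch only, not the GPL implementation used in the experiments; the dictionary-based node layout and the function names are assumptions.

```python
import math

def build_kdtree(points, depth=0):
    """Recursively split on the median, cycling through the coordinate axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,  # the splitting plane is perpendicular to this axis
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, target, best=None):
    """Return the stored point closest to target, pruning subtrees that cannot improve on best."""
    if node is None:
        return best
    if best is None or math.dist(node["point"], target) < math.dist(best, target):
        best = node["point"]
    diff = target[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, target, best)
    # Cross the splitting plane only if it is closer than the best match so far.
    if abs(diff) < math.dist(best, target):
        best = nearest(far, target, best)
    return best
```

For example, `nearest(build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]), (9, 2))` visits only part of the tree, because most far-side subtrees are pruned by the plane-distance check.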
DBSCAN (Using KD Trees)
Density based Clustering (Maximal Set of Density Connected Points)
O(m) Space Complexity
Using KD Trees the overall Time Complexity reduces to O(m * log m) from O(m^2)
Pros
Fast for low dimensional data
Can discover clusters of arbitrary shapes
Robust to outliers (noise)
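A compact sketch of the basic DBSCAN procedure follows, written in Python rather than the Java used for the measurements. The names `dbscan`, `region_query` and `NOISE` are hypothetical. The neighborhood search here is brute force, which gives the O(m^2) behavior; replacing `region_query` with a KD Tree range search is what would bring the cost toward O(m * log m).

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (brute force, O(m) per query)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_points):
    """Return one label per point: a cluster id starting at 0, or NOISE."""
    labels = [None] * len(points)  # None means not yet visited
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_points:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        labels[i] = cluster  # i is a core point: start a new cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_points:  # j is also a core point: keep expanding
                queue.extend(j_neighbors)
        cluster += 1
    return labels
```

Points that never fall within `eps` of a core point keep the `NOISE` label, which is how outliers are reported.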
DBSCAN - Issues
DBSCAN is very sensitive to clustering parameters MinPoints (Min Neighborhood Points) and EPS (Images Next)
The Algorithm is not partitionable for multiprocessor systems.
DBSCAN fails to identify clusters if density varies and if the data set is too sparse. (Images Next)
Sampling Affects Density Measures
DBSCAN (Contd.)
Performance Measurements
Compiler Used: Java 1.6
Hardware Used: Intel Pentium IV 1.8 GHz (Duo Core), 1 GB RAM

No. of Points    Clustering Time (sec)
1572             3.5
3568             10.9
7502             39.5
10256            78.4
[Chart: DBSCAN Using KD Trees Performance Measures. Clustering time (sec) against number of points (1572, 3568, 7502, 10256) for two series: DBSCAN Using KDTree and Basic DBSCAN]
CURE Hierarchical Clustering

Involves Two Pass clustering
Uses Efficient Sampling Algorithms
Scalable for Large Datasets
The first pass of the algorithm is partitionable, so it can run concurrently on multiple processors (a higher number of partitions helps keep execution time linear as the size of the dataset increases)
Source - CURE: An Efficient Clustering Algorithm for Large Databases. S. Guha, R. Rastogi and K. Shim, 1998.
Each step is important in achieving scalability and efficiency as well as improving concurrency.
Data Structures
KD-Tree to store the data/representative points: O(log n) searching time for nearest neighbors
Min Heap to store the clusters: O(1) time to find the next cluster to be processed
CURE hence has O(n) Space Complexity
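The O(1) lookup comes from the heap property: the smallest key always sits at the root. A small sketch with Python's standard `heapq` module, assuming each heap entry is keyed by the cluster's distance to its closest neighboring cluster (the entries and cluster ids below are made up for illustration):

```python
import heapq

# Hypothetical (distance_to_closest_cluster, cluster_id) entries.
heap = [(4.2, "A"), (0.7, "B"), (2.5, "C"), (1.9, "D")]
heapq.heapify(heap)  # O(n) to build the heap

next_dist, next_cluster = heap[0]  # O(1) peek at the cluster to process next
popped = heapq.heappop(heap)       # O(log n) to remove it and restore heap order
```

Finding the next cluster is constant time, while removing or reinserting a cluster after a merge costs O(log n).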
CURE (Contd.)
Outperforms Basic Hierarchical Clustering by reducing the Time Complexity to O(n^2) from O(n^2 * log n)
Two Steps of Outlier Elimination
After Pre-clustering
Assigning labels to data which was not part of the Sample
Captures the shape of clusters through representative points (well scattered points which determine the boundary of the cluster)
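One way to picture the representative points: pick well scattered points by a farthest-point heuristic, then shrink them toward the cluster centroid. The sketch below assumes that selection procedure and uses hypothetical names (`representatives`, `c` for the number of representatives, `alpha` for the shrink factor); it is not the Java code behind the measurements in this deck.

```python
import math

def representatives(cluster, c, alpha):
    """Pick up to c well scattered points, then shrink them toward the centroid by alpha."""
    dim = len(cluster[0])
    centroid = tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dim))
    scattered = []
    for _ in range(min(c, len(cluster))):
        if not scattered:
            # First pick: the point farthest from the centroid.
            pick = max(cluster, key=lambda p: math.dist(p, centroid))
        else:
            # Later picks: the point farthest from its nearest already-picked point.
            pick = max(cluster, key=lambda p: min(math.dist(p, s) for s in scattered))
        scattered.append(pick)
    # Shrinking pulls the representatives inward, which dampens the effect of outliers.
    return [tuple(p[d] + alpha * (centroid[d] - p[d]) for d in range(dim)) for p in scattered]
```

With `alpha = 0` the representatives stay on the cluster boundary; with `alpha = 1` they all collapse onto the centroid, which makes the cluster behave like a centroid based one.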
CURE - Benefits against Popular Algorithms
K-Means (& Centroid based Algorithms): Unsuitable for non-spherical clusters and clusters of differing sizes.
CLARANS: Needs multiple data scans (R* Trees were proposed later on). CURE uses KD Trees inherently to store the dataset and reuses it across passes.
BIRCH: Identifies only convex or spherical clusters of uniform size.
DBSCAN: No parallelism, high sensitivity to parameters, sampling of data may affect density measures.
CURE (Contd.)
Observations towards Sensitivity to Parameters
Random Sample Size: It should be ensured that the sample represents all existing clusters. The algorithm uses Chernoff Bounds to calculate the size
Shrink Factor of Representative Points
Representative Points Computation Time
Number of Partitions: A very high number of partitions (>50) would not give suitable results, as some partitions may not have sufficient points to cluster.
CURE - Performance
Compiler: Java 1.6
Hardware Used: Intel Pentium IV 1.8 GHz (Duo Core), 1 GB RAM

Clustering Time (sec)
No. of Points    Partition P = 2    Partition P = 3    Partition P = 5
1572             6.4                6.5                6.1
3568             7.8                7.6                7.3
7502             29.4               21.6               12.2
10256            75.7               43.6               21.2
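The effect of partitioning is easier to see as ratios. The short script below only re-expresses the timings reported in this deck (CURE with P = 2 and P = 5, and the DBSCAN timings from the earlier slide); the variable names are illustrative.

```python
# Clustering times in seconds, copied from the performance figures in this deck.
points = [1572, 3568, 7502, 10256]
cure_p2 = [6.4, 7.8, 29.4, 75.7]
cure_p5 = [6.1, 7.3, 12.2, 21.2]
dbscan = [3.5, 10.9, 39.5, 78.4]

# Ratios above 1 mean CURE with P = 5 was the faster of the two.
p2_over_p5 = [a / b for a, b in zip(cure_p2, cure_p5)]
dbscan_over_p5 = [a / b for a, b in zip(dbscan, cure_p5)]

for n, r1, r2 in zip(points, p2_over_p5, dbscan_over_p5):
    print(f"{n:>6} points: P=2 / P=5 = {r1:.2f}, DBSCAN / P=5 = {r2:.2f}")
```

At 10256 points both ratios are above 3.5, while at 1572 points DBSCAN is still the faster algorithm, so the benefit of partitioning shows up only as the dataset grows.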
[Chart: CURE Performance Measurements. Clustering time (sec) against number of points (1572, 3568, 7502, 10256) for four series: P=2, P=3, P=5 and DBSCAN]
Data Sets and Results
SPAETH - http://people.scs.fsu.edu/~burkardt/f_src/spaeth/spaeth.html
Synthetic Data - http://dbkgroup.org/handl/generators/
References
An Efficient k-Means Clustering Algorithm: Analysis and Implementation - Tapas Kanungo, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, Angela Y. Wu.
A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise - Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, KDD '96.
CURE: An Efficient Clustering Algorithm for Large Databases - S. Guha, R. Rastogi and K. Shim, 1998.
Introduction to Clustering Techniques - Leo Wanner.
A Comprehensive Overview of Basic Clustering Algorithms - Glenn Fung.
Introduction to Data Mining - Tan/Steinbach/Kumar.
Thanks!
Presenters: Vasanth Prabhu Sundararaj, Gnana Sundar Rajendiran, Joyesh Mishra
Source: www.cise.ufl.edu/~jmishra/clustering
Tools Used: JDK 1.6, Eclipse, MATLAB, LabVIEW, GnuPlot
This slide was made using Open Office 2.2.1
