VDBSCAN

Download as pdf or txt
Download as pdf or txt
You are on page 1of 4

VDBSCAN:

Varied Density Based Spatial Clustering of Applications with Noise

Peng Liu, Dong Zhou, Naijun Wu


School of Information Management and Engineering,
Shanghai University of Finance and Economics, Shanghai, 200433, China
liupeng@mail.shufe.edu.cn

ABSTRACT
Clustering analysis is a primary method for data mining. Density clustering has such advantages as: its clusters are easy
to understand and it does not limit itself to shapes of clusters. But existing density-based algorithms have trouble in
finding out all the meaningful clusters for datasets with varied densities. This paper introduces a new algorithm called
VDBSCAN for the purpose of varied-density datasets analysis. The basic idea of VDBSCAN is that, before adopting
traditional DBSCAN algorithm, some methods are used to select several values of parameter Eps for different densities
according to a k-dist plot. With different values of Eps, it is possible to find out clusters with varied densities
simultaneity. For each value of Eps, DBSCAN algorithm is adopted in order to make sure that all the clusters with
respect to corresponding density are clustered. And for the next process, the points that have been clustered are ignored,
which avoids marking both denser areas and sparser ones as one cluster. Finally, a synthetic database with 2-dimension
data is used for demonstration, and experiments show that VDBSCAN is efficient in successfully clustering uneven
datasets.

Keywords: density-based clustering, DBSCAN, VDBSCAN, data mining

1. INTRODUCTION 2. RELATED WORKS

Clustering analysis is a primary method for data mining. Density-based approaches apply a local cluster criterion
There are five areas of clustering, which are Partitioning, and are very popular for the purpose of database mining.
Hierarchical, Density, Grid, and Model methods. Density Clusters are regarded as regions in the data space in
clustering methods are very useful to find clusters of any which the objects are dense, and which are separated by
shape, giving the correct parameters (yet hard to regions of low object density (noise).
determine them) [1].
A common way to find regions of high-density in the
Roughly speaking, the goal of a clustering algorithm is to data space is based on grid cell densities [7]. The basic
group the objects of a database into a set of meaningful idea for the algorithm is that the data space is partitioned
subclasses. DBSCAN (Density Based Spatial Clustering into a number of non-overlapping regions or cells, and
of Applications with Noise) is a traditional and cells containing a relatively large number of objects are
widely-accepted density-based clustering method. It can potential cluster centers. However, the success of the
find clusters of arbitrary shapes and sizes, yet may have method depends on the size of the cells which must be
trouble with clusters of varying density. The specified by the user.
density-based algorithms still suffer from several
problems. Traditional algorithms, such as DBSCAN and DBSCAN algorithm is based on center-based approach,
DENCLUE, can have trouble with density if the density one of definitions of density [5]. In the center-based
of clusters varies widely. There are also some approach, density is estimated for a particular point in
improvements which can handle clusters of different the dataset by counting the number of points within a
densities, like OPTICS and Jarvis-Patrick clustering specified radius, Eps, of that point. This includes the
algorithm [4]. However, they lower the cluster validity point itself. The center-based approach to density allows
simultaneously. us to classify a point as a core point, a border point, a
noise or background point. A point is core point if the
We introduce a new improved density-based algorithm in number of points within Eps, a user-specified parameter,
the following, for the purpose of effective clustering exceeds a certain threshold, MinPts, which is also a
analysis of datasets with varied densities. The algorithm user-specified parameter.
is named VDBSCAN (Varied Density Based Spatial
Clustering of Applications with Noise). It selects suitable
The details of DBSCAN algorithm are given in Figure 1
parameters for different density, using k-dist plot, and
[4]. And it can be informally described as follows. Any
adopts DBSCAN algorithm for each chosen parameter.
two core points that are close enough within a distance
Eps of one another are put in the same cluster. Likewise,
any border point that is close enough to a core point is

1-4244-0885-7/07/$20.00 ©2007 IEEE.


put in the same cluster as the core point. Noise points are
discarded.
DBSCAN algorithm.
1: Label all points as core, border, or noise points.
2: Eliminate noise points.
3: Put an edge between all core points that are within Eps of
each other.
4: Make each group of connected core points into a separate
cluster.
5: Assign each border point to one of the clusters with its
associated core points.
Figure 1. DBSCAN algorithm
Figure 2. A sample k-dist plot
OPTICS (Ordering Points to Identify the Clustering
Structure) is an improved method upon DBSCAN, which Because DBSCAN uses a density-based definition of a
is practically the same in runtime and process, but cluster, it is relatively resistant to noise and can handle
represents the clusters of objects by the ordering of clusters of different shapes and sizes. Thus, DBSCAN
objects in the database [2]. The main advantage of can find many clusters that could not be found using
OPTICS is that the algorithm does not limit itself to one some other clustering algorithms, like K-means.
holistic parameter setting, which is a limitation of However, the main weakness of DBSCAN is that it has
traditional DBSCAN, but just displays the point-order trouble when the clusters have greatly varied densities.
information based on densities, instead of clusters. And Suppose that the noise around the denser cluster C1 has
OPTICS is weak at finding out information of clusters in the same density as the other cluster C2. If the Eps
sparse datasets though it is good at finding them in dense threshold is low enough that DBSCAN finds C2 as
areas [3]. cluster, then C1 and the points surrounding it will become
a single cluster. If the Eps threshold is high enough that
Another improvement of the DBSCAN algorithm is DBSCAN finds C1 as a separate cluster, and the points
DENCLUE [6], which is older than OPTICS. surrounding are marked as noise, then C2 and the points
DENCLUE introduces the idea of an influence function surrounding it will also be marked as noise. DBSCAN
that describes the impact of a data point upon its also has trouble with high-dimensional data because
neighborhood. The algorithm has a good mathematical density is more difficult to define for such data. In this
foundation, and scales well because it is able to process paper, we only focus on finding a solution for the main
highly sparse datasets with minimal work (use of grid weakness when DBSCAN meets greatly varied densities.
cells) while it requires very careful selection of
parameters MinPts and Eps [1]. In many real datasets, clusters with respect to different
densities are all useful to analysis. It is necessary to find
WaveCluster [8] is another density-based approach, out both dense clusters and sparse ones. Very different
which applies wavelet transform to the feature space. partial densities may be needed to reveal clusters in
The algorithm is grid-based and only applicable to different regions of the data space. So we introduce
low-dimensional data. Also the density and grid-based VDBSCAN, which is a new improved algorithm of
clustering technique CLIQUE [9] has been proposed for traditional DBSCAN but goes beyond the limitation of
mining in high-dimensional data spaces. single global parameter of DBSCAN, for the purpose of
varied density-based clustering and analysis.
3. DESCRIPTION OF VDBSCAN ALGORITHM
Firstly, VDBSCAN calculates and stores k-dist for each
The basic approach of how to determine the parameters project and partition k-dist plots. Secondly, the number
Eps and MinPts is to look at the behavior of the distance of densities is given intuitively by k-dist plot. Thirdly,
from a point to its kth nearest neighbor, which is called choose parameters Epsi automatically for each density.
k-dist. The k-dists are computed for all the data points Fourthly, scan the dataset and cluster different densities
for some k, sorted in ascending order, and then plotted using corresponding Epsi. And finally, display the valid
using the sorted values, as a result, a sharp change is clusters corresponding with varied densities.
expected to see. The sharp change at the value of k-dist VDBSCAN has two steps: choosing parameters Epsi and
corresponds to a suitable value of Eps. Line A in Figure 2 cluster in varied densities. Details are given in Figure 3.
shows a sample k-dist line. Note that the value of Eps
that is determined in this way depends on k, but does not
change dramatically as k changes.
Partition k-dist plot;
Step
Give thresholds of parameters Epsi 4. EXPERIMENT AND ANALYSIS
1
(i=1,2,…,n);
For each Epsi (i=1,2,…,n) 4.1 Description of Data
Eps=Epsi;
Step Adopt DBSCAN algorithm for points In order to observe and analyze experimental results
that are not marked;
2 directly, 2-dimension data is chosen in our experiment.
Mark points as Ci-t;
Figure 4 shows the data. Obviously, there are two regions
Display all the masked points as
corresponding clusters. with respect to different densities in the data. And data
points of each region are uniformly distributed. The
Figure 3. VDBSCAN algorithm
dataset provides a clustering standard to estimate the
accuracy of the result, for it has strong regularity and
Step 1: Choosing Epsi
obvious clusters. In addition, as it has been already
This is a key step in the process. K-dist plot is drawn for
acknowledged that density-based clustering algorithms
not only selection of parameters Epsi but also analysis of
can find out clusters with any shape, this paper will not
density levels of the dataset. For datasets with widely
discuss the problems of cluster shapes in clustering.
varied density, note that there will be some variation,
depending on the density of the cluster and the random
distribution of points, but for points of the same density
level, the range of variation will not be huge while a
sharp change is expected to see between two density
levels. Thus there will be several smooth curves
connected by greatly variational ones. If there are n
(natural number n>1) different smooth curves in the
k-dist plot, the dataset has n density levels. A dataset is of
varied-density if it has several density levels and of n
varied-density if it has n density levels. Specially, a
dataset is of single-density if its density does not vary
widely, or there is only one smooth curve in its k-dist Figure 4. Data for experiment
plot. Figure 2 shows a sample k-dist plot. Line A shows a
sample k-dist line of a single-density dataset while line B 4.2 Experiment Process and Result
shows a sample line of a three varied-densities dataset.
Compute k-dist. Select k=3. Calculate the distance from
For points that are not in a cluster, such as noise points, each point to its 3rd nearest neighbor, which is called
the corresponding k-dist line rockets, connecting two 3-dist, sort points by 3-dist and plot the sorted values.
smooth curves which stand for two density levels. Line b Figure 5 shows the k-dist plot (k=3).
and d in Figure 2 are such lines, which can be called
level-turning lines. Line b connects line a and c, and line
d connects c and e, while a, c and e stand for different
density levels. Note that line f shows the k-dists of
outliers and is not a level-turning line for it does not
connect two smooth lines.

For different density levels Di, select suitable Epsi. For


example, in Figure 2, there are three density levels. Line
a shows the densest density level and e shows the
sparsest one. Combine line a and b as a sub-k-dist plot to
select Eps1, and then take line c and d as a sub-k-dist plot
for Eps2, e and f for Eps3 finally.

Step 2: Varied-density clustering Points Sorted by Distance to 3rd Nearest Neighbor


Adopt algorithm DBSCAN for each Epsi (natural number Figure 5. K-dist plot (k=3)
i=1,2,3,…,n. n is the number of density levels). Note that
parameters Epsi have been ordered as k-dist line curves, According to k-dist plot (k=3) shown in Figure 5, there
that is Epsi < Epsi+1 (i<n). Before adopt DBSCAN for are two sharp changes at the value of k-dist that
Epsi+1, mark points in clusters corresponding with Epsi as correspond to suitable values of Eps. So select Eps1=2,
Ci-t (t is a natural number), which indicates that the points Eps2 =4, and take the value of k as the MinPts parameter,
belong to the cluster t in density level i. Marked points that is, MinPts=3.
will not be processed by DBSCAN again. Non-marked
points after all the Epsi process are recognized as outliers.
And all the Ci-t are displayed as the results.
Adopt DBSCAN algorithm twice for two different values the efficiency of the new algorithm. However, there are
of Eps. In the first round, Eps1 =2, points marked ' - ' in several opportunities for future research. How to select
Figure 4 are identified as core points, and are clustered as all the parameters automatically is one of the interesting
C1. At the same time, those marked ' o ' or ' . ' are challenges as parameter k has to be chosen subjectively
identified as noise points. Note that, all the clustered in VDBSCAN.
points should be given cluster marks for the further
DBSCAN operation. In the second round, Eps2 =4, and REFERENCES
only non-cluster-marked points are processed. Points
marked ' o ' are identified as cores and those marked ' . ' [1] Jason D. Peterson, “Clustering overview”,
as border points. Both types of points are clustered as C2 http://www.cs.ndsu.nodak.edu/~jasonpet/CSCI779/C
and are given cluster marks. The density of cluster C1 is lustering.pdf.
1 per unit area while that of C2 0.25 per unit area. [2] Pang-Ning Tan, Michael Steinbach, Vipin Kumar,
Non-cluster-marked points are recognized as noise points Introduction to Data Mining, Pearson Education Asia
or anomaly. LTD, 2006.
[3] Jain A. K., Dubes R. C., Algorithms for clustering
4.3 Experiment Conclusion Data, Prentice-Hall, Inc., 1988.
[4] Ester M., Kriegel H.-P., Sander J., Xu X., “A
A synthetic database with 2-dimension data is used for density-based algorithm for discovering clusters in
demonstration. The experiment shows that VDBSCAN is
large spatial databases with noise”, proceeding of 2nd
good at finding out clusters corresponding with varied
International Conference on Knowledge Discovery
densities, and it has the same time complexity as
and Data Mining, Portland, OR, AAAI Press, pp.
DBSCAN.
226-231, 1996.
VDBSCAN algorithm successfully overcomes one of the [5] Mihael Ankerst, Markus M. Breunig, Hans-Peter
main problems of traditional DBSCAN which limits Kriegel, Jörg Sander, “OPTICS: Ordering points to
itself to varied-density datasets. For the experimental identify the clustering structure”, proceeding of 1999
data, traditional DBSCAN algorithm selects only a value ACM-SIGMOD International Conference, ACM
of parameter Eps. Supposing using DBSCAN algorithm Press, pp.49-60, 1999.
in this experiment, as the process of selection of the [6] Sun Xue-gang, Chen Qun-xiu, and Ma Liang, “study
value of Eps using k-dist plot is the same as that in on topic-based web clustering”, The Journal of
VDBSCAN, which has been described before, Eps may Chinese Information Processing, Vol 17, No. 3,
be either 2 or 4. If Eps is 2, cluster C1 will be found out pp.21-26, 2003.
correctly while all points of C2 will be identified as noise [7] Hinneburg A., Keim D., “An efficient approach to
points or anomaly. If the value of Eps is 4, points of both clustering in large multimedia databases with noise”,
C1 and C2 are clustered as a cluster. Thus, traditional Proceeding of 4th International Conference on
DBSCAN algorithm can not distinguish clusters with Knowledge Discovery and Data Mining, New York
respect to different densities, or can not find out them City, NY, 1998.
simultaneously. The experimental results show that [8] Sheikholeslami G., Chatterjee S., Zhang A.,
varied-density datasets can be clustered successfully by “WaveCluster: A multi-resolution clustering
VDBSCAN. approach for very large spatial databases”,
Proceeding 24th International Conference on Very
5. CONCLUSIONS Large Data Bases, pp. 428-439, New York City, NY,
1998.
In this paper, we proposed a new density-based algorithm
[9] Agrawal R., Gehrke J., Gunopulos D., Raghavan P.,
called VDBSCAN in purpose of finding out meaningful
“Automatic subspace clustering of high dimensional
clusters in databases in respect with widely varied
densities. VDBSCAN has the same time complexity as data for data mining applications”, Proceeding ACM
DBSCAN, and can identify clusters with different SIGMOD '98 International Conference on
densities while DBSCAN can not. The experiment shows Management of Data, pp. 94-105, Seattle, WA, 1998.

You might also like