VDBSCAN
VDBSCAN
VDBSCAN
ABSTRACT
Clustering analysis is a primary method for data mining. Density clustering has such advantages as: its clusters are easy
to understand and it does not limit itself to shapes of clusters. But existing density-based algorithms have trouble in
finding out all the meaningful clusters for datasets with varied densities. This paper introduces a new algorithm called
VDBSCAN for the purpose of varied-density datasets analysis. The basic idea of VDBSCAN is that, before adopting
traditional DBSCAN algorithm, some methods are used to select several values of parameter Eps for different densities
according to a k-dist plot. With different values of Eps, it is possible to find out clusters with varied densities
simultaneity. For each value of Eps, DBSCAN algorithm is adopted in order to make sure that all the clusters with
respect to corresponding density are clustered. And for the next process, the points that have been clustered are ignored,
which avoids marking both denser areas and sparser ones as one cluster. Finally, a synthetic database with 2-dimension
data is used for demonstration, and experiments show that VDBSCAN is efficient in successfully clustering uneven
datasets.
Clustering analysis is a primary method for data mining. Density-based approaches apply a local cluster criterion
There are five areas of clustering, which are Partitioning, and are very popular for the purpose of database mining.
Hierarchical, Density, Grid, and Model methods. Density Clusters are regarded as regions in the data space in
clustering methods are very useful to find clusters of any which the objects are dense, and which are separated by
shape, giving the correct parameters (yet hard to regions of low object density (noise).
determine them) [1].
A common way to find regions of high-density in the
Roughly speaking, the goal of a clustering algorithm is to data space is based on grid cell densities [7]. The basic
group the objects of a database into a set of meaningful idea for the algorithm is that the data space is partitioned
subclasses. DBSCAN (Density Based Spatial Clustering into a number of non-overlapping regions or cells, and
of Applications with Noise) is a traditional and cells containing a relatively large number of objects are
widely-accepted density-based clustering method. It can potential cluster centers. However, the success of the
find clusters of arbitrary shapes and sizes, yet may have method depends on the size of the cells which must be
trouble with clusters of varying density. The specified by the user.
density-based algorithms still suffer from several
problems. Traditional algorithms, such as DBSCAN and DBSCAN algorithm is based on center-based approach,
DENCLUE, can have trouble with density if the density one of definitions of density [5]. In the center-based
of clusters varies widely. There are also some approach, density is estimated for a particular point in
improvements which can handle clusters of different the dataset by counting the number of points within a
densities, like OPTICS and Jarvis-Patrick clustering specified radius, Eps, of that point. This includes the
algorithm [4]. However, they lower the cluster validity point itself. The center-based approach to density allows
simultaneously. us to classify a point as a core point, a border point, a
noise or background point. A point is core point if the
We introduce a new improved density-based algorithm in number of points within Eps, a user-specified parameter,
the following, for the purpose of effective clustering exceeds a certain threshold, MinPts, which is also a
analysis of datasets with varied densities. The algorithm user-specified parameter.
is named VDBSCAN (Varied Density Based Spatial
Clustering of Applications with Noise). It selects suitable
The details of DBSCAN algorithm are given in Figure 1
parameters for different density, using k-dist plot, and
[4]. And it can be informally described as follows. Any
adopts DBSCAN algorithm for each chosen parameter.
two core points that are close enough within a distance
Eps of one another are put in the same cluster. Likewise,
any border point that is close enough to a core point is