CLIQUE Algorithm

CLIQUE: A Dimension-Growth Subspace Clustering Method

CLIQUE was the first dimension-growth subspace clustering algorithm.
Clustering starts in single-dimensional subspaces and moves upward toward higher-dimensional subspaces.
The algorithm can be viewed as an integration of density-based and grid-based clustering.
It automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space.
CLIQUE can therefore be considered both density-based and grid-based:
It partitions each dimension into the same number of equal-length intervals.
It thereby partitions an m-dimensional data space into non-overlapping rectangular units.
A unit is dense if the fraction of total data points contained in the unit exceeds an input model parameter (see the sketch below).
A cluster is a maximal set of connected dense units within a subspace.
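To make the partitioning and density test concrete, here is a minimal sketch for a single dimension, assuming NumPy; the parameter names xi (number of intervals per dimension) and tau (density threshold) follow common usage for CLIQUE, but the function itself is illustrative rather than the original implementation.

```python
import numpy as np
from collections import Counter

def dense_intervals_1d(points, dim, xi=10, tau=0.05):
    """Partition dimension `dim` into `xi` equal-length intervals and
    return the interval indices whose fraction of all points exceeds `tau`."""
    values = points[:, dim]
    lo, hi = values.min(), values.max()
    width = (hi - lo) / xi or 1.0            # guard against a constant dimension
    # interval index of every point; clip so the maximum falls in the last bin
    bins = np.minimum(((values - lo) / width).astype(int), xi - 1)
    counts = Counter(bins.tolist())
    n = len(points)
    return {b for b, c in counts.items() if c / n > tau}
```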
Definitions That Need to Be Known
Unit: after forming a grid structure on the space, each rectangular cell is called a unit.
Dense: a unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter.
Cluster: a cluster is defined as a maximal set of connected dense units.
Informal Problem Statement
Given a large set of multidimensional data points, the data space is usually not uniformly occupied by the data points.
CLIQUE's clustering identifies the sparse and the crowded areas in space (or units), thereby discovering the overall distribution patterns of the dataset.
A unit is dense if the fraction of total data points contained in it exceeds an input model parameter.
In CLIQUE, a cluster is defined as a maximal set of connected dense units.
Formal Problem Statement
Let A = {A1, A2, . . . , Ad} be a set of bounded, totally ordered domains and S = A1 × A2 × · · · × Ad a d-dimensional numerical space.
We refer to A1, . . . , Ad as the dimensions (attributes) of S.
The input consists of a set of d-dimensional points V = {v1, v2, . . . , vm}, where vi = (vi1, vi2, . . . , vid).
The j-th component of vi is drawn from domain Aj.
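As an illustration only (the attributes and values below are invented, not taken from the source), the input V can be stored as an m × d array with one row per point vi:

```python
import numpy as np

# Hypothetical input: m = 6 points in a d = 3 dimensional numerical space S,
# with dimensions (attributes) A1 = age, A2 = salary, A3 = vacation weeks.
V = np.array([
    [25, 40000, 2],
    [27, 42000, 3],
    [26, 41000, 2],
    [55, 90000, 6],
    [57, 88000, 5],
    [40, 60000, 4],
], dtype=float)
m, d = V.shape   # number of points and number of dimensions
```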
The CLIQUE Algorithm (cont.)
3. The minimal description of a cluster C, produced by the above procedure, is the minimum possible union of hyperrectangular regions.
For example, A ∪ B is the minimal cluster description of the shaded region, while C ∪ D ∪ E is a non-minimal cluster description of the same region.
CLIQUE Working: A Two-Step Process
1st step: partition the d-dimensional data space.
2nd step: generate the minimal description of each cluster.
1st step: Partitioning
Partitioning is done for each dimension.
The subspaces representing the dense units found in lower dimensions are intersected to form a candidate search space in which dense units of higher dimensionality may exist.
This approach to selecting candidates is quite similar to the Apriori-Gen process of generating candidates.
The expectation here is that if something is dense in a higher-dimensional space, it cannot be sparse in the lower-dimensional spaces.
More formally:
If a k-dimensional unit is dense, then so are its projections in (k-1)-dimensional space.
Given a k-dimensional candidate dense unit, if any of its (k-1)-dimensional projection units is not dense, then the k-dimensional unit cannot be dense.
So, we can generate candidate dense units in k-dimensional space from the dense units found in (k-1)-dimensional space (a sketch of this step follows below).
The resulting space searched is much smaller than the original space.
The dense units are then examined in order to determine the clusters.
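A possible sketch of this candidate-generation step, assuming each dense unit is represented as a frozenset of (dimension, interval-index) pairs; the representation and function name are illustrative, not the paper's exact procedure.

```python
from itertools import combinations

def generate_candidates(dense_prev, k):
    """Join (k-1)-dimensional dense units (frozensets of (dimension, interval)
    pairs) that share k-2 pairs, then prune any candidate that has a
    non-dense (k-1)-dimensional projection (the Apriori-style check)."""
    dense_prev = set(dense_prev)
    candidates = set()
    for u1, u2 in combinations(dense_prev, 2):
        merged = u1 | u2
        # a valid join yields k pairs spanning k distinct dimensions
        if len(merged) == k and len({dim for dim, _ in merged}) == k:
            # every (k-1)-dimensional projection must itself be dense
            if all(merged - {pair} in dense_prev for pair in merged):
                candidates.add(merged)
    return candidates
```

Units surviving this step are only candidates; their actual density still has to be verified against the data.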
Intersection

Dense units found with respect to age for the dimensions salary
and vacation are intersected in order to provide a candidate
search space for dense units of higher dimensionality.
2nd step: Minimal Description
For each cluster, CLIQUE determines the maximal region that covers the cluster of connected dense units.
It then determines a minimal cover (logic description) for each cluster (a simplified sketch follows below).
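A simplified sketch of this step, assuming the maximal rectangular regions have already been enumerated and that each region is represented simply as the set of dense units it covers; the greedy growth of maximal regions used by the original algorithm is not reproduced here.

```python
def minimal_cover(cluster_units, regions):
    """Greedily pick regions (sets of dense units) until every dense unit of
    the cluster is covered, then drop any region whose units are all covered
    by the regions that remain (a removal pass)."""
    uncovered = set(cluster_units)
    chosen = []
    while uncovered and regions:
        best = max(regions, key=lambda r: len(r & uncovered))
        if not best & uncovered:
            break                        # no candidate region covers anything new
        chosen.append(best)
        uncovered -= best

    minimal = list(chosen)
    for region in chosen:
        others = [r for r in minimal if r is not region]
        if others and region <= set().union(*others):
            minimal.remove(region)       # redundant: fully covered by the rest
    return minimal
```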
Effectiveness of CLIQUE
CLIQUE automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces.
It is insensitive to the order of input objects.
It scales linearly with the size of the input.
It is easily scalable with the number of dimensions in the data.
GRID-BASED CLUSTERING METHODS

This is the approach in which we quantize the space into a finite number of cells that form a grid structure, on which all of the operations for clustering are performed.
So, for example, assume that we have a set of records that we want to cluster with respect to two attributes; we then divide the related space (a plane) into a grid structure and find the clusters in it.
[Figure: a 2-D grid over the salary (×10,000) vs. age plane.]
Techniques for Grid-Based Clustering

The following are some techniques that are used to perform grid-based clustering:
CLIQUE (CLustering In QUest)
STING (STatistical Information Grid)
WaveCluster
Looking at CLIQUE as an Example

CLIQUE is used for clustering high-dimensional data present in large tables. By high-dimensional data we mean records that have many attributes.
CLIQUE identifies the dense units in the subspaces of the high-dimensional data space, and uses these subspaces to provide more efficient clustering.
How Does CLIQUE Work?

Let us say that we have a set of records that we would like to cluster in terms of n attributes. So, we are dealing with an n-dimensional space.
MAJOR STEPS:
CLIQUE partitions each subspace that has dimension 1 into the same number of equal-length intervals.
Using this as a basis, it partitions the n-dimensional data space into non-overlapping rectangular units (a small counting sketch follows below).
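A minimal sketch of this partitioning, assuming NumPy; it simply counts how many points fall into each rectangular unit of the n-dimensional grid (xi, the number of intervals per dimension, is an illustrative parameter name).

```python
import numpy as np
from collections import Counter

def count_points_per_unit(points, xi=10):
    """Cut every dimension into `xi` equal-length intervals and count how many
    points fall into each non-overlapping rectangular unit of the grid."""
    lo = points.min(axis=0)
    width = (points.max(axis=0) - lo) / xi
    width[width == 0] = 1.0                      # guard constant dimensions
    bins = np.minimum(((points - lo) / width).astype(int), xi - 1)
    # a unit is identified by its tuple of interval indices, one per dimension
    return Counter(map(tuple, bins.tolist()))
```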
CLIQUE: Major Steps (Cont.)
Next, CLIQUE looks for dense units of higher dimensionality. It does this in the following way:
CLIQUE finds dense units of higher dimensionality by first finding the dense units in the lower-dimensional subspaces. So, for example, if we are dealing with a 3-dimensional space, CLIQUE finds the dense units in the 3 related planes (2-dimensional subspaces).
It then intersects the extensions of the subspaces representing these dense units to form a candidate search space in which dense units of higher dimensionality may exist.
CLIQUE: Major Steps (Cont.)
Each maximal set of connected dense units is considered a cluster (a sketch of this connected-components step follows below).
Using this definition, the dense units in the subspaces are examined in order to find clusters in the subspaces.
The information from the subspaces is then used to find clusters in the n-dimensional space.
It must be noted that all cluster boundaries are either horizontal or vertical; this is due to the nature of the rectangular grid cells.
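The maximal sets of connected dense units can be found with a plain graph traversal over the dense units. A sketch, assuming each dense unit is a dict mapping dimension to interval index and that two units are connected when they share a common face (their interval indices differ by one in exactly one dimension); the representation is illustrative.

```python
from collections import deque

def connected_clusters(dense_units):
    """Group the dense units of one subspace into clusters, where a cluster is
    a maximal set of units connected through common faces."""
    def adjacent(u, v):
        if u.keys() != v.keys():           # units must live in the same subspace
            return False
        diffs = [abs(u[d] - v[d]) for d in u]
        return diffs.count(1) == 1 and diffs.count(0) == len(diffs) - 1

    units = list(dense_units)
    unvisited = set(range(len(units)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        component, queue = [seed], deque([seed])
        while queue:                       # breadth-first search from the seed
            i = queue.popleft()
            for j in [j for j in unvisited if adjacent(units[i], units[j])]:
                unvisited.remove(j)
                component.append(j)
                queue.append(j)
        clusters.append([units[i] for i in component])
    return clusters
```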
Example for CLIQUE
Let us say that we want to cluster a set
of records that have three attributes,
namely, salary, vacation and age.
The data space for this data would be 3-dimensional.
[Figure: a 3-D data space with axes vacation, age, and salary.]
Example (Cont.)

After plotting the data objects, each dimension (i.e., salary, vacation and age) is split into intervals of equal length.
Then we form a 3-dimensional grid on the space, each unit of which would be a 3-D rectangle.
Now, our goal is to find the dense 3-D rectangular units.
Example (Cont.)

To do this, we first find the dense units of the subspaces of this 3-D space.
So, we find the dense units with respect to age and salary; this means that we look at the salary-age plane and find all the 2-D rectangular units that are dense.
We also find the dense 2-D rectangular units for the vacation-age plane.
Example (Cont.)

Now let us try to visualize the dense units of the two planes on the following 3-D figure:
Example (Cont.)
We can extend the dense areas in the
vacation-age plane inwards.
We can extend the dense areas in the
salary-age plane upwards.
The intersection of these two spaces gives us a candidate search space in which 3-dimensional dense units may exist.
We then find the dense units in the
salary-vacation plane and we form an
extension of the subspace that represents
these dense units.
Example (Cont.)
Now, we perform an intersection of
the candidate search space with the
extension of the dense units of the
salary-vacation plane, in order to
get all the 3-d dense units.
So, what was the main idea?
We used the dense units in
subspaces in order to find the dense
units in the 3-dimensional space.
After finding the dense units, it is
very easy to find clusters.
Reflecting upon CLIQUE
Why does CLIQUE confine its search for
dense units in high dimensions to the
intersection of dense units in subspaces?
Because the Apriori property employs
prior knowledge of the items in the search
space so that portions of the space can be
pruned.
The property for CLIQUE says that if a k-dimensional unit is dense, then so are its projections in (k-1)-dimensional space.
Strength and Weakness of CLIQUE
Strength
It automatically finds subspaces of the highest
dimensionality such that high density clusters exist in
those subspaces.
It is quite efficient.
It is insensitive to the order of records in input and
does not presume some canonical data distribution.
It scales linearly with the size of input and has good
scalability as the number of dimensions in the data
increases.
Weakness
The accuracy of the clustering result may be degraded at the expense of the simplicity of the method.
Major Steps of CLIQUE (Summary)
Partition the data space and find the number of points that lie inside each cell of the partition.
Identify the subspaces that contain clusters using the Apriori principle.
Identify clusters:
Determine dense units in all subspaces of interest.
Determine connected dense units in all subspaces of interest.
Generate a minimal description for the clusters:
Determine the maximal regions that cover each cluster of connected dense units.
Determine a minimal cover for each cluster.
(A toy end-to-end sketch of the first passes follows below.)
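Putting the first two of these steps together, here is a toy end-to-end sketch (1-D dense units, then candidate 2-D units obtained by intersection), assuming NumPy; the dataset, the parameter values xi and tau, and the function names are all invented for illustration.

```python
import numpy as np
from itertools import combinations

def clique_first_passes(points, xi=5, tau=0.1):
    """Bin every dimension into `xi` equal-length intervals, keep the 1-D dense
    units, then intersect pairs of 1-D dense units from different dimensions
    and keep the 2-D candidates that are actually dense."""
    n, d = points.shape
    lo = points.min(axis=0)
    width = (points.max(axis=0) - lo) / xi
    width[width == 0] = 1.0                      # guard constant dimensions
    bins = np.minimum(((points - lo) / width).astype(int), xi - 1)

    def is_dense(mask):
        return mask.sum() / n > tau

    # 1-D dense units as (dimension, interval) pairs
    dense_1d = [(dim, b) for dim in range(d) for b in range(xi)
                if is_dense(bins[:, dim] == b)]

    # candidate 2-D units: intersections of 1-D dense units from different dims
    dense_2d = [((d1, b1), (d2, b2))
                for (d1, b1), (d2, b2) in combinations(dense_1d, 2)
                if d1 != d2 and is_dense((bins[:, d1] == b1) & (bins[:, d2] == b2))]
    return dense_1d, dense_2d

# two dense blobs plus scattered noise in an (age, salary, vacation) space
rng = np.random.default_rng(0)
blob_a = rng.normal([25, 4.0, 2.0], [1.0, 0.2, 0.3], size=(40, 3))
blob_b = rng.normal([55, 9.0, 6.0], [1.0, 0.2, 0.3], size=(40, 3))
noise = rng.uniform([20, 3.0, 1.0], [60, 10.0, 7.0], size=(20, 3))
data = np.vstack([blob_a, blob_b, noise])
dense_1d, dense_2d = clique_first_passes(data)
print(len(dense_1d), "dense 1-D units;", len(dense_2d), "candidate-and-dense 2-D units")
```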
[Figure: dense units found with respect to age for the salary (×10,000) and vacation (weeks) dimensions, with a density threshold of 3; the horizontal axis is age (20 to 60) and the vertical axes are salary and vacation intervals 0 to 7.]
