Fuzzy Extensions of The DBScan Clustering Algorithm
Fuzzy Extensions of The DBScan Clustering Algorithm
DOI 10.1007/s00500-016-2435-0
Abstract The DBSCAN algorithm is a well-known density- ous ones, thus allowing to generate clusters with both fuzzy
based clustering approach particularly useful in spatial data cores and fuzzy overlapping borders. Our proposals are com-
mining for its ability to find objects’ groups with heteroge- pared w.r.t. state of the art fuzzy clustering methods over
neous shapes and homogeneous local density distributions real-world datasets.
in the feature space. Furthermore, it can be suitable as scal-
ing down approach to deal with big data for its ability to Keywords Fuzzy clustering · Density-based clustering ·
remove noise. Nevertheless, it suffers for some limitations, DBSCAN clustering
mainly the inability to identify clusters with variable den-
sity distributions and partially overlapping borders, which is
often a characteristics of both scientific data and real-world
1 Introduction
data. To this end, in this work, we propose three fuzzy exten-
sions of the DBSCAN algorithm to generate clusters with
The advent of the big data era has launched new challenges
distinct fuzzy density characteristics. The original version of
to the research community who reacted either by introducing
DBSCAN requires two precise parameters (min Pts and ) to
new algorithms or by extending existing algorithms to man-
define locally dense areas which serve as seeds of the clus-
age large datasets. Specifically, the first approaches focus on
ters. Nevertheless, precise values of both parameters may be
the “scaling up” objective to deal with big data sets. Never-
not appropriate in all regions of the dataset. In the proposed
theless, they risk to become useless in a short time, due the
extensions of DBSCAN, we define soft constraints to model
data continuous growth. CISCO1 estimated that the data on
approximate values of the input parameters. The first exten-
the Internet will increase at a compound annual growth rate
sion, named Fuzzy Core DBSCAN, relaxes the constraint on
of 25% by the year 2017. Thus, to deal with datasets continu-
the neighbourhood’s density to generate clusters with fuzzy
ously growing in size it will be necessary to frequently scale
core points, i.e. cores with distinct density; the second exten-
up algorithms. The second kind of approach aims at scaling
sion, named Fuzzy Border DBSCAN, relaxes to allow the
down, i.e. at synthesizing, the data sets by reducing their size,
generation of clusters with overlapping borders. Finally, the
and to use existing algorithms on the reduced data. Although
third extension, named Fuzzy DBSCAN subsumes the previ-
scaling down may risk to cancel important information it has
Communicated by V. Loia. the chance of reducing the datasets by eliminating noise or
redundant data. Clustering techniques can be categorized as
B Dino Ienco scaling down approaches, since their objective is to identify
dino.ienco@irstea.fr groups of items within the dataset with common character-
Gloria Bordogna istics in a feature space, while removing outliers and noise
bordogna.g@irea.cnr.it which are considered uninteresting for further analysis.
1 IRSTEA, UMR TETIS, Montpellier, France
2 LIRMM, Montpellier, France 1 http://www.cisco.com/c/en/us/solutions/collateral/service-provider/
3 CNR IREA, Milan, Italy global-cloud-index-gci/Cloud_Index_White_Paper.html.
123
D. Ienco, G. Bordogna
Many real-world applications such as social network soft clustering approaches have been defined which gener-
community identification and satellite image analysis need ate clusters with fuzzy overlapping boundaries (Pal et al.
effective means to identify regions that are characterized 2005; Ji et al. 2014; Yager and Filev 1994). Most of the
by locally dense areas in the feature space representing the soft clustering approaches detect fuzzy clusters with same
objects of interests. For instance, such regions may repre- shape, with each object of the dataset belonging to all clus-
sent communities of users linked by the friend of friend ters to a distinct degree. Moreover, even the fuzzy extensions
relationships in social networks, ecosystems may appear as of DBSCAN generate fuzzy clusters with the same charac-
regions characterized by homogeneous feature’ values in teristics of fuzziness, i.e. all clusters have same fain borders
satellite images. To detect such objects, density-based clus- (Ulutagaya and Nasibov 2012; Smiti and Eloudi 2013). In this
tering algorithms have been widely applied. They evaluate paper, we investigate new extensions of the DBSCAN algo-
a local criterion to group objects: clusters are regarded as rithm, defined within the framework of fuzzy set theory, with
regions in the feature space where the objects are densely dis- the aim to cope with the limitation of both classic DBSCAN
tributed and separated by regions with sparse objects, which and soft clustering algorithms. The idea is to define distinct
are considered noise or outliers. DBSCAN extensions capable to manage approximate values
Indeed, in many real applications, such as in satellite of the input parameters, thus less sensible to the input parame-
image analysis, one needs to cope with noise invariably ter setting, and capable to detect possibly fuzzy overlapping
affecting data. Furthermore, one does not have any knowl- clusters with distinct density characteristics and profiles.
edge about the number of clusters, the possible clusters’ There are several real applications in which it could be
shapes and the objects distribution in the feature space. useful specifying approximate values of the input parameters
Finally, crisp clustering algorithms fail to detect the variable and detecting fuzzy clusters with both distinct shapes and dis-
and fuzzy nature of cluster borders which are often faint and tinct density profiles. Consider the detection of communities
overlapped one another. Among the proposed crisp density- of users in a social network based on the FoF relationships:
based clustering algorithms DBSCAN (Ester et al. 1996) is while one can specify easily that the users must have a given
a well-known and widely applied approach as it does not number of degrees of separation from other users on the net-
require to specify in input the number of clusters, it can detect work to belong to the community, it may be questionable
clusters of any shape, and it can remove noisy points. Fur- defining the precise minimum number of users that consti-
thermore, this algorithm is suitable to process big data when tutes a community. In this case, it can be useful to apply a
adopting a spatial index, such as an R-tree, since its com- clustering algorithm, as in our first extension of DBSCAN
plexity varies as O(N ∗ log N ) (Sander et al. 1998). in Bordogna and Ienco (2014), by allowing the specifica-
Nevertheless, it suffers for some drawbacks: first, to drive tion of an approximate density, i.e. an approximate number
the process, this algorithm needs two numeric input para- of users and detecting non-overlapping fuzzy communities
meters, minPts, i.e. the neighbourhood density, and , i.e. where a user can belong to a community to a degree.
the distance to define the local neighbourhood size, which On the other side, in the case, one has to detect stars and
together define the desired local density of the generated clus- galaxies in astronomical optical images, appearing with a
ters. Specifically, minPts is a positive integer specifying the crisp nucleus and faint borders, it can be easier to specify
minimum number of objects that must exist within a max- an approximate local neighbourhood size, as in our second
imum distance in the feature space in order for an object extension of DBSCAN proposed in this paper, and thus detect-
to belong to a cluster. Second, the results of DBSCAN are ing objects with crisp core and faint border.
strongly dependent on the setting of these input parameters Last but not least, there are applications in which objects
that must be chosen with great accuracy (Ester et al. 1996) are characterized by distinct local densities and faint, possi-
by considering both the scale of the dataset and the closeness bly overlapping, borders, such as in remote sensing images,
of the objects in order to achieve both speed and effective- where distinct ecosystems have distinct densities of trees
ness of the results. To set the right values of these parameters (objects) and they may appear merged one another; in this
one generally engages a trial and error exploratory phase in case, it would be useful to allow specifying both an approx-
which the algorithm is run several times with distinct val- imate neighbourhood density, and an approximate local
ues of the input parameters. These repeated trials are costly neighbourhood size to generate fuzzy overlapping clusters
when dealing with big data volumes. A final drawback of as in our third extension of DBSCAN.
DBSCAN is that it detects clusters with sharp boundaries, In the literature, several fuzzy extensions of DBSCAN have
which is a common limitation of all crisp clustering algo- been proposed with the objective of leveraging the setting of
rithms when used to group objects whose distribution has a the precise input parameters (Ulutagaya and Nasibov 2012).
faint and smooth density profile in the feature space. They Nevertheless, none of them have tackled the objective of gen-
draw crisp boundaries to separate clusters, which are often erating fuzzy clusters modelling distinct kind of fuzziness as
somewhat arbitrary. To cope with undesired crisp boundaries, we do in this paper. We leverage the setting of either one or
123
Fuzzy extensions of the DBScan clustering algorithm
both the input parameters of DBSCAN by allowing the speci- simultaneously belong to several clusters (Ji et al. 2014; Pal
fication of soft constraints on both the number of objects and et al. 2005).
the closeness (reachability) between objects. Specifically, the Density-based clustering algorithms grow clusters around
precise value minPts is replaced by a soft constraint defined seeds located in regions of the feature space which are locally
by a pair (min Ptsmin , min Ptsmax ) that specifies an approx- dense of objects. DBSCAN (Ester et al. 1996) is one of the
imate minimum number of objects for defining a cluster, i.e. most popular density-based methods used in data mining due
there is a tolerance on the crisp limit min Ptsmax defined by to both its ability to detect irregularly shaped clusters by
min Ptsmax − min Ptsmin ; in the same way, the precise dis- copying with noise data, and to its relatively low complexity
tance , is replaced by a soft constraints (min , max ) on the that varies as O(N ∗ log N ) when adopting a spatial index,
closeness of objects so that again on the crisp limit min we thus making it suitable to process big data (Sander et al.
have a tolerance defined by max − min . 1998). Nevertheless, its effectiveness in detecting clusters is
The three extensions of DBSCAN generate clusters with strongly dependent on the parameters setting, and this is the
either a fuzzy core, i.e. clusters whose elements are associated main reason that leads to its soft extensions. Besides this
with a numeric membership degree in [0,1] but not overlap- motivation we argue that, in order to properly adopt a soft
ping one another, clusters with fuzzy overlapping border and density-based clustering approach with respect to another
a crisp core, and clusters with both fuzzy core and over- one, one should be able to understand the properties of the
lapping borders. Having three extensions producing clusters generated soft clusters. This is the reason that leads us to
with distinct fuzzy and overlapping properties one can choose define three distinct extensions of DBSCAN each one gener-
the most appropriate for task to accomplish. ating fuzzy clusters with distinct characteristics.
Furthermore, fuzzy clusters allow several advantages: for Ulutagaya and Nasibov (2012) reports a survey of the main
instance, with a single run of the clustering it is possible to fuzzy density-based clustering algorithms, while Shamshir-
summarize several distinct runs of the original approach by band et al. (2014) presents a study in which they show that
specifying distinct thresholds on the membership degrees of density-based clustering algorithm coupled with fuzzy logic
the objects to the clusters. For this reason, it can be employed can efficiently deal with the task of intrusion detection in
as intelligent reduction strategy for big data. In our case, this wireless sensor networks.
allows an easy exploration of the spatial distribution of the The most cited paper (Nasibov and Ulutagay 2009) pro-
objects avoiding the tedious exact setting of the DBSCAN poses a fuzzy extension of the DBSCAN, named fuzzy
parameters (Ester et al. 1996). neighbourhood FN-DBSCAN, whose main characteristic is to
The paper is organized as follow: Sect. 2 discusses related use a fuzzy neighbourhood size definition. In this approach,
work, in Sect. 3 we recall the classic DBSCAN algorithm. the authors address the difficulty of the user in setting the
The clustering algorithm generating clusters with fuzzy core values of the input parameter when the distances of the
points Fuzzy Core DBSCAN, firstly introduced in Bordogna points are in distinct scales, as it happens in astronomical
and Ienco (2014), the extension generating clusters with images. Thus, they first normalize the distances between all
fuzzy overlapping borders Fuzzy Border DBSCAN and the points in [0, 1], and then they allow computing distinct mem-
most general strategy generating fuzzy overlapping clusters bership degrees on the distance to delimit the neighbourhood
(Fuzzy DBSCAN) are introduced in Sect. 4. of points, i.e. the decaying of the membership degrees as a
After the definitions of the three algorithms, Sect. 6 function of the distance. Then, they select as belonging to the
discusses and compares the performance of the different fuzzy neighbourhood of a point only those points belonging
approaches over real-world data sets in comparison with to the support of the membership function. This extension
those yielded by Fuzzy C-Means and Soft-DBSCAN fuzzy of DBSCAN uses a level-based neighbourhood set, instead
clustering algorithms. Section 7 concludes and summarizes of a distance-based neighbourhood size, and it uses the con-
the main achievements. cept of fuzzy cardinality, instead of classical cardinality, for
identifying core points. This last choice causes the creation
(within the same run of the algorithm) of fuzzy clusters with
very heterogeneous local density characteristics: both fuzzy
2 Related work clusters with cores having a huge number of sparse points
(points located at the border of the local neighbourhood of
The relevant works to our proposal are those related to the each other), and fuzzy clusters with small cores, constituted
literature on soft density-based clustering algorithms. Soft only by a few close points. This approach can be consid-
clustering algorithms are modelled within either fuzzy set, ered dual to our first extension, the Fuzzy Core DBSCAN
probability theory or possibilistic typicalities to allow assign- algorithm (Bordogna and Ienco 2014), since we fuzzify the
ing objects to clusters with a full or a partial membership minimum number of points min Pts defining the local neigh-
degree, in this latter case with the possibility for an object to bourhood density, while the distance is maintained crisp.
123
D. Ienco, G. Bordogna
As a consequence, the membership degree of a point to the heterogeneous. To remove noise they first map the distance of
fuzzy core depends on the number of points in its crisp neigh- any point from its k-neighbours and rank the distance values
bourhood. By this choice, and the computation of the local in decreasing order; then they determine the threshold θ on
density based on the classic set cardinality, a point is assigned the distance which corresponds to the first minimum on the
to only one cluster to a distinct extent, thus generating non- ordered values. All points in the first ranked positions having
overlapping clusters with possibly fuzzy cores. Clusters may a distance above the thresholds θ are deemed noisy points
have cores with faint profiles reflecting low density of the and are removed, while the remaining points will belong to
clusters’ nucleus. a cluster. Only these latter points are clustered with the clas-
FNDBSCAN (Parker and Downs 2013) is closer to our sec- sic DBSCAN providing as input parameters min Pts = K
ond extension, named FBorder, in which we fuzzify only the and = θ . By adopting this same procedure, we can imple-
membership of objects belonging to the border of clusters, ment the proposed algorithms: we can determine the most
this way allowing their partial overlapping. Nevertheless, appropriate distance M ax = θ (which delimits the support
differently than FNDBSCAN, FBorder grows clusters’ cores of the membership function defining the approximate size of
around points characterized by homogeneous local density, the local neighbourhood). This way, depending on the data
thus generating clusters with crisp, not overlapping and set, we remove noise and then apply one of the proposed
homogeneously dense cores. algorithms on the remaining points.
Kriegel and Pfeifle (2005) algorithm is employed to clus- Finally, the extension of DBSCAN with fuzzy logic
ter objects whose position is ill-known. The authors propose reported in Shamshirband et al. (2014) shares with our exten-
the FDBSCAN algorithm in which a fuzzy distance mea- sion the idea of generating clusters with distinct fuzziness
sure is defined as the probability that an object is directly properties, as specified by the fuzzy rules: specifically, a
density-reachable from another objects. This problem can hybrid clustering method is introduced, namely a density-
be modelled by our third extension, named FDBScan, that based fuzzy imperialist competitive clustering algorithm
allows defining the local neighbourhood density of any object (D-FICCA), to detect malicious behaviours in wireless sen-
by specifying an approximate number of objects within an sor networks (WSNs) with the aim to enhance the accuracy
approximate maximum distance, thus capturing the uncer- of malicious detection. A density-based clustering algorithm
tainty on the positions of the moving objects, and generating helps to improve the imperialist competitive algorithm for
fuzzy clusters with both faint cores and fuzzy overlapping the formation of arbitrary cluster shapes as well as han-
border. Finally, the most recent soft extension of DBSCAN dling noise. The fuzzy logic controller is introduced to avoid
has been proposed in Smiti and Eloudi (2013) where the possible errors of the worst imperialist action selection strat-
authors combine the classic DBSCAN with the Fuzzy C- egy. The results demonstrate that the proposed framework
Means algorithm (Bezdek et al. 1984) proposing a method achieves higher detection accuracy compared to existing
called soft-DBSCAN. They detect seeds points by the classic approaches.
DBSCAN and in a second phase they compute the degrees of
membership to the clusters around the seeds by relying on
the Fuzzy C-Means clustering algorithm. A similar objective 3 Classic DBScan algorithm
of selecting the seeds to feed the Fuzzy C-Means is pursued
by the mountain method proposed in Yager and Filev (1994). For sake of clarity in the following, we will consider a set
Nevertheless, these extensions do not grow the clusters of objects represented in multidimensional feature space. We
by applying density-reachable criteria as in our proposed can figure out these objects as either cars, taxi cabs, airplanes
approaches. Distinct density characteristics of clusters: faint represented in the feature space defined by their geographic
cores and not overlapping distributions are modelled by coordinates (both 2D or 3D). DBSCAN can be applied to
Fuzzy Core DBSCAN; semi-overlapping distributions with group these objects based on their local densities in the fea-
homogeneous dense cores by our Fuzzy Border DBSCAN ture space. For example, this makes it possible to identify
extension; finally, faint cores and semi-overlapping distri- traffic jams of cars on the roads.
butions by the third extension Fuzzy DBSCAN. Specifically, DBSCAN assigns points of the feature space
Another important issue when using a clustering algorithm defined on R × R × R · · · × R to particular clusters or des-
on big data is its scalability. In this respect, Parker et al. (2010) ignates them as outliers or noise if they are not sufficiently
proposes a scalable implementation of the FN-DBSCAN, close to other points. It determines cluster assignments by
named SFN-DBSCAN, with the objective of improving the assessing the local density at each point using two para-
efficiency when dealing with big data sets. Another efficient meters: distance radius () and minimum number of points
implementation is proposed in Ester et al. (1996). It tackles (min Pts) that must exists within the neighbourhood of
the problem of clustering a huge number of objects strongly the point. A single point which meets the minimum den-
affected by noise when the scale distributions of objects are sity criterion, namely it has min Pts located within distance
123
Fuzzy extensions of the DBScan clustering algorithm
, is designated a core point. Formally, given a set of points Fuzzy Core DBSCAN, for short FCore, is obtained by con-
P = ( p1 , p2 , . . . , pn ), p is a core point if at least a minimum sidering crisp distances and by introducing an approximate
number min Pts of points exist s.t p j ∈ P and || p− p j || < , value of the minimum cardinality of the local neighbourhood
where ||x|| is the Euclidean distance in the n-dimensional fea- of a point. This can be done by substituting the value min Pts
ture space. Two core points pi and p j with i = j belong to with a soft constraint defined by a monotonic non-decreasing
the same cluster c if || pi − p j || < . Both are core points of membership function on the domain of the positive integers.
c ( pi , p j ∈ cor e(c)). All the points that are not core points This soft constraint specifies the minimum approximate num-
and lie within the maximum distance from a core point of ber of points that are required in the local neighbourhood of a
a cluster c are defined as border points of c: p ∈ / cor e(c) is point for starting the generation of a fuzzy core. Let us define
a border point of c if ∃ pi ∈ cor e(c) with || p − pi || < . the piecewise linear membership function as follows:
Finally, the points that are not part of any cluster are consid- ⎧
⎪
⎨1, if n̂ ≥ M pts Max
ered noisy points: p ∈ / cor e(c) is noise if ∀c, pi ∈ cor e(c) n̂−M pts Min
μmin P (n̂) , if M pts Min < n̂ < M pts Max
with || p − pi || < . In the following, the classic DBSCAN ⎪ M pts Max −M pts Min
⎩
algorithm is formalized: 0, if n̂ ≤ M pts Min
(1)
Algorithm 1 DBSCAN(P,,Min Pts)
Require: P : dataset of points
Require: : the maximum distance around a point defining its local neighbourhood This membership function gives the value 1 when the
Require: Min Pts : minimum local density, in points, around a point to be a candidate number n̂ of elements in the neighbourhood of a point is
core point
1: C = 0 greater than M pts Max , a value 0 when n̂ is below M pts Min
2: Cluster s = ∅ and intermediate values when n̂ is in between M pts Min and
3: for all p ∈ P s.t. p is unvisited do
4: mark p as visited M pts Max .
5: neighboursPts = regionQuery( p, )
6: if (si zeo f (neighbour s Pts) <= Min Pts ) then Since users may find it difficult to specify the two values
7: mark p as NOISE M pts Min and M pts Max in the case of big data, we can try
8: else
9: C = next cluster to automatically suggest two appropriate values. This can
10: Cluster s = Cluster s ∪ expandCluster( p, neighbour s Pts, C, , Min Pts ) be done by mapping the number of points of the data sets
11: end if
12: end for which are at a maximum distance among each other below
13: return Cluster s , for increasing values of . This function is monotonically
not decreasing: we then suggest the values of the functions
Algorithm 2 ex pandCluster ( p, neighbour s Pts, C, ,
corresponding to the first two flexes as the appropriate values
Min Pts)
Require: p: the point just marked as visited
of M pts Min and M pts Max .
Require: neighbour s Pts : the neighbourhood of p Another approach is to allow a user to specify two percent-
Require: C : the actual cluster
Require: the distance around a point to compute its local density
age values, %M pts Min and %M pts Max on the total dataset
Require: Min Pts : local density, in points, defining the minimum cardinality of the size, measured in number of objects, and then convert these
neighbourhood of a point to be a candidate core point
1: add p to cluster C percentages to determine M pts Min and M pts Max as follows:
2: for all p ∈ neighbour s Pts do M pts Min = round(%M pts Min ∗N ) and pts Max = round
3: if p is not visited then
(%M pts Max ∗N ), in which N is the total number of objects
4: mark p as visited
in the data set and r ound(m) returns the closest integer to m.
5: neighbour s Pts = regionQuery( p , )
6:
if si zeo f (neighbour s Pts ) > Min Pts then
Let us now define the fuzzy core. Considering a set P of
7: neighbour s Pts = neighbour s Pts ∪ neighbour s Pts
N objects represented by N points p1 , p2 , . . . , p N in the n-
8: end if dimensional space R n , so that each pi has the coordinates
9: end if
10: if p is not yet member of any cluster then xi1 , xi2 , . . . , xin .
11: add p to cluster C Given a point p ∈ P, if n̂ points pi exist in the local neigh-
12: end if
13: end for bourhood of point p, i.e. with pi − p < , s.t. μmin P (n̂) >
14: return C 0 then p is a fuzzy core point with membership degree
to the fuzzy core given by Fuzzycor e( p) = μ Min P (n̂) If
two fuzzy core points pi , p j with Fuzzycor e( pi ) > 0 and
4 Generating clusters with distinct fuzzy Fuzzycor e( p j ) > 0 ∃ with i = j s.t. pi − p < then
characteristics they belong to the same cluster c ( pi , p j ∈ c) and both
are fuzzy core points of c, ( pi , p j ∈ f uzzycor e(c)) with
4.1 Generating clusters with fuzzy cores membership degrees f uzzycor ec ( pi ) = Fuzzycor e( pi )
and f uzzycor ec ( p j ) = Fuzzycor e( p j ). They belong to the
The first extension of the classic DBSCAN algorithm we pro- cluster with membership degree μc ( pi ) = Fuzzycor e( pi )
posed in Bordogna and Ienco (2014), named and μc ( p j ) = Fuzzycor e( p j ).
123
D. Ienco, G. Bordogna
A point p of a cluster c is a border point if it is not a fuzzy point p is crisp, while we introduce a fuzzy assignment (line
core point and ∃ pi ∈ f cor e(c) s.t. pi − p < then p 1) modelled by the fuzzy function μ Min P () defined in Eq. 1.
gets a membership degree to c defined as: The same function is employed when a new fuzzy core point
is detected (line 8). Also in this case, firstly we verify the
density around a given point p w.r.t. M pts Min and then, if
μc ( p) = min pi ∈neighcor e( p) f uzzycor ec ( pi ) (2) the point satisfies the soft constraint to a positive degree, we
add the point to the fuzzy core of cluster C with its associated
where neighcor e( p) = { pi s.t. f uzzycor ec ( pi ) > 0 ∧ membership value.
pi − p < }.
Finally, points p that are neither fuzzy core points nor
border points are considered as noisy points Algorithm 3 Fuzzy Core DBSCAN(P,,M pts M in,
Notice that, the points belonging to a cluster c get dis- M pts Max )
Require: P : dataset of points
tinct membership values to the cluster reflecting the number Require: : the maximum distance around a point defining the point neighbourhood
of their neighbours within a maximum distance . This def- Require: M pts Min , M pts Max : soft constraint on the density around a point to be a
candidate fuzzy core point
inition allows generating fuzzy clusters with a fuzzy core, 1: C = 0
where the membership degrees represent the variable cluster 2: Cluster s = ∅
3: for all p ∈ P s.t. p is unvisited do
density. 4: mark p as visited
Moreover, a border point p can partially belong to a single 5: nPts = regionQuery( p, )
6: if (si zeo f (n Pts) ≤ M pts Min ) then
cluster c since its membership degree is upperbounded by the 7: mark p as NOISE
8: else
minimum membership degree of its neighbours fuzzy core 9: C = next cluster
points. Notice that, this algorithm does not generate overlap- 10: Cluster s = Cluster s ∪ expandClusterFuzzyCore
( p, n Pts, C, , M pts Min , M pts Max )
ping fuzzy clusters, but the support of the fuzzy clusters is 11: end if
still a crisp partition as in the classic DBSCAN: ci ∩ c j = 12: end for
13: return Cluster s
Further property, the FCore DBSCAN reduces to the clas-
sic DBSCAN when the input values M pts Min = M pts Max :
in this case FCore DBSCAN produces the same results of
the classic DBSCAN with min Pts = M pts Min = M pts Max
and same distance . In fact, the level-based soft condition Algorithm 4 ex pandCluster FuzzyCor e( p, n Pts, C, ,
imposed by μmin P is indeed a crisp condition μ Min P (x) ∈ M pts Min , M pts Max )
{0, 1} on the minimum number of points defining the local Require: p: the point just marked as visited
Require: n Pts : the points in the neighbourhood of p
density of the neighbourhood: μmin P (n̂) = 0 when the num- Require: C : the actual cluster
ber of points n̂ within a maximum distance of any point p is Require: the distance around a point to compute its density
Require: M pts Min , M pts Max : soft constraint on the density around a point to be a
less than min Pts = M pts Min = M pts Max , on the contrary candidate fuzzy core point
μmin P (n̂) = 1. In this case, the membership degrees of all 1: add p to C with membership Fuzzycor e( p) = μ Min P (|n Pts|)
2: for all p ∈ n Pts do
fuzzy core points is 1, and thus the fuzzy core reduces to a
3: if p is not visited then
crisp core as in the classic DBSCAN.
4: mark p as visited
The border points are thus defined as in the classic 5: n Pts = regionQuery( p , )
approach too, since their membership degrees are the mini- 6: if si zeo f (n Pts ) > M pts Min then
7: n Pts = n Pts ∪ n Pts
mum of the membership degrees of the core points in their 8:
add p to C with membership Fuzzycor e( p ) = μ Min P (|n Pts |)
123
Fuzzy extensions of the DBScan clustering algorithm
distance instead of asking for a precise numeric parame- c. While a point p is considered as noise if ∀c pi ∈
ter and in defining a soft constraint with a monotonic not cor e(c) s.t. μdist ( pi , p) > 0.
increasing membership function on the positive real domain The strategy is outlined in Algorithm 5 and 6. The outer
of distance values. The soft constraint defines the concept of loop (Algorithm 5) starts the process. Given a point, the
fuzzy neighbourhood size, so that a point can belong to the neighbourhood is selected considering Min . If the Min Pts
fuzzy neighbourhood of another point to a degree in (0,1]. constraint is not satisfied the point is initially marked as
This allows computing a gradual membership to the clusters. NOISE otherwise the creation of a new cluster begins, and the
Differently than the proposal of Nasibov and Ulutagay procedure ex pandCluster Fuzzy Bor der is called. Algo-
(2009) we allow to specify the membership function on the rithm 6 tries to expand the current cluster C as much as
distance as a soft constraint with piecewise linear shape possible. The difference with the original version of DBSCAN
defined by two values Min and Max so that when the lies in the way the border points are managed and detected.
distance is smaller than Min the membership degree is max- Here, we employ a temporary structure f uzzy Bor der Pts
imum (1), when it is greater than Max its membership is null to collect the current set of border points. Border points are
(0) and it decreases linearly when it is in between Min and points with density lower than Min Pts (line 6) but, differ-
Max : ently from the original algorithm, a point can be a border
⎧ point if it is reasonably at a distance from the cluster in
⎪
⎨1, if p − pi || ≤ Min between Min and Max . To verify this second condition,
Max − p− pi ||
μdist ( p, pi) = , if Min < p − pi || < Max we query the neighbourhood of a point p for both Min and
⎪ Max − Min
⎩
0, if p − pi || > Max Max distances (line 2 and 8). Formula 4 specifies that the
(3) membership of a border point is the minimum of the mem-
berships μdist between the point and all the core points of
the cluster directly reachable. In order to compute the min-
In this definition, p − pi || can be defined as either the imum, we need first to detect all core points of the cluster
Euclidean distance or the complement of a cosine similarity and then compute the μc (·) for all the border points (line 15–
distance or any other distance measure more suitable in the 18). Line 15 is particularly important because a point that
application context. was inserted in the temporary structure f uzzy Bor der Pts,
We can then redefine a core point of a cluster with fuzzy successively can verify the condition to be a core point. The
border: given a point p if at least a number min Pts of points difference between the two sets ( f uzzy Bor der Pts and C)
P = { p1 , . . . , pmin Pts }∃ s.t. ∀ pi ∈ P, μdist ( pi, p) = 1 then guarantees that only the border points are considered after
p is a core point. line 15. Note that when min = max this extension reduces
If two core points pi p j with i = j and μdist ( pi , p j ) = 1 to the classic DBSCAN algorithm, since a point will get from
then pi , p j belongs to c, i.e. they define a cluster c with fuzzy Eq. 1 either a zero or a full (1) membership degree to the clus-
border and are core points of c, i.e. pi , p j ∈ cor e(c) and thus ter. This extension is very similar to the approach proposed
they get a membership degree to the cluster μc ( p) = 1. in Nasibov and Ulutagay (2009), since we fuzzify the input
A point p of a cluster that is not a core point is a parameter too. Nevertheless, in our proposal, the core is
fuzzy border point if it satisfies the following: ∀ p s.t. p ∈ / still crisp and not fuzzy as in Nasibov and Ulutagay (2009).
cor e(c) and pi ∈ cor e(c) and μdist ( pi , p) > 0 then p gets Further, differently than in the previous cited paper, minPts
a membership degree to the fuzzy border of cluster c defined is still a numeric value that defines the local density of a core
as: as in the classic DBSCAN. This allows generating fuzzy clus-
ters with a crisp core, and a fuzzy border. More clearly, in
this extension of DBSCAN, all generated clusters have cores
μc ( p) = min pi ∈neighcor e( p) μdist ( p, pi ) (4) with same density but that may differ for the density of their
border, which may have faint overlapping profiles.
where neighcor e( p) = { pi ∈ cor e(c) ∧ μdist ( p, pi ) > 0}
This definition allows generating fuzzy clusters with faint
borders. 4.3 Generating clusters with fuzzy cores and
Moreover, a point p can partially belong to the fuzzy overlapping fuzzy border
borders of more clusters at the same time with distinct mem-
bership values. This allows generating fuzzy clusters with In this subsection, we introduce how to model fuzziness over
overlapping boundaries, i.e. semi-overlapping fuzzy clus- both cores and borders in order to subsume the previous pro-
ters. This is guaranteed by the condition for the selection posed approaches into what is named Fuzzy DBSCAN, i.e.
of the points to be evaluated as border points of clus- F D B Scan. The two soft constraints defined in (1) and (3)
ters which requires that μc ( p) < 1 for each generated replace both min Pts and to allow the definition of the
123
D. Ienco, G. Bordogna
Algorithm 5 Fuzzy Border DBSCAN(P, Max , Min , If μmin P (dens( p)) > 0 then the point p belongs to the
Mint Pts) fuzzycore of a certain cluster with a membership degree
Require: P : dataset of points Fuzzycor e( p) = μmin P (dens( p))
Require: Min Pts : the minimum density around a point to be a candidate core point
Require: Min , Max : soft constraint on the distance around a point defining the point If μmin P (dens( p)) = 0, then p is a border or a noise
fuzzy neighbourhood size point.
1: C = 0
2: Cluster s = ∅ If in the local neighbourhood of a fuzzy core point pi there
3: for all p ∈ P s.t. p is unvisited do exists another fuzzy core point p j , then a cluster c is gener-
4: mark p as visited
5: nPts = regionQuery( p, Min ) ated: ∃ pi , p j , s.t. μdist ( pi , p j ) > 0∧ Fuzzycor e( pi ) > 0∧
6: if (si zeo f (n Pts) ≤ Min Pts ) then
7: mark p as NOISE Fuzzycor e( p j ) > 0 then f uzzycor ec ( pi ) =
8: else Fuzzycor e( pi ) ∧ f uzzycor ec ( p j ) = Fuzzycor e( p j ).
9: C = next cluster
10: Cluster s = Cluster s ∪ expandClusterFuzzyBorder A point p that is not a fuzzy core point is a fuzzy border
( p, n Pts, C, Max , Min , Min Pts ) point of a cluster c, if it satisfies the following condition:
11: end if
12: end for ∃ pi and ∃ p s.t. f uzzycor e( pi ) = 0 ∧ μdist ( p, pi ) > 0 ∧
13: return Cluster s f uzzycor ec ( p) > 0.
If a point is a border point it cannot be a fuzzy core point
Algorithm 6 ex pandCluster Fuzzy Bor der ( p, n Pts, C, of any cluster:
Max , Min , Min Pts)
Require: p: the point just marked as visited c s.t. f uzzycor ec ( p) > 0
Require: n Pts : the points in the neighbourhood of p
Require: C : the actual cluster
Require: Max , Min : soft constraint on the distance around a point defining its fuzzy If all the conditions are respected we define p as a fuzzy
neighbourhood size
Require: Min Pts : the density around a point to be considered a core point border point of a cluster c with a membership function to the
1: add p to C (as core point) cluster defined as:
2: f uzzy Bor der Pts = r egion Quer y( p, Max ) \ n Pts
3: for all p ∈ n Pts do
4: mark p as visited μb( p) = min pi ∈neigh f cor e( p) (min( f uzzycor ec ( pi ),
5: n Pts = regionQuery( p , Min ) μdist ( p, pi ))) (6)
6: if si zeo f (n Pts ) > Min Pts then
7: n Pts = n Pts ∪ n Pts
8:
f uzzy Bor der Pts = r egion Quer y( p , Max ) \ n Pts
where neigh f cor e( p) = { pi s.t. f uzzycor ec ( pi ) > 0 ∧
9: f uzzy Bor der Pts = f uzzy Bor der Pts ∪ f uzzy Bor der Pts μdist ( p, pi ) > 0}
10: add p to C (as core) The procedures are described in Algorithms 7 and 8.
11: else
12: f uzzy Bor der Pts = f uzzy Bor der Pts ∪ p The general schema is similar to the original DBSCAN.
13: end if The main difference concerns the decision between core
14: end for
15: f uzzy Bor der Pts = f uzzy Bor der Pts \ C and border points which is made by considering μmin P (·)
16: for all p ∈ f uzzy Bor der Pts do and the possibility of a point to belong to multiple clus-
17: add p to C (as border point) with membership μc ( p ) Equation (4)
18: end for ters. Note that, this algorithm reduces to either FCore when
19: return C M pts Min = M pts Max or to FBorder when Min = Max
fuzzy local density and the fuzzy local neighbourhood size Algorithm 7 Fuzzy DBSCAN(P, Max , Min ,M pts Max ,
of points respectively: M pts Min )
Require: P : dataset of points
Require: Min , Max : soft constraint on the distance around a point defining the point
– a soft constraint specified by two values (M ptsmin ≤ fuzzy neighbourhood size
Require: M pts Min , M pts Max : soft constraint on the density around a point to be
M ptsmax ) on the Natural domain defines a fuzzy local considered as fuzzy core point
1: C = ∅
dense region; 2: Cluster s = ∅
– a soft constraint specified by a pair (min ≤ max ) on the 3: for all p ∈ P s.t. p is unvisited do
4: mark p as visited
positive reals defines the local fuzzy neighbourhood size 5: nPts = regionQuery( p, Max )
of a point. 6: dens(p) = as in equation (5)
7: if (μmin P (dens( p)) == 0) then
8: mark p as NOISE
9: else
We define the local density of a point p as follows: 10: C = next cluster
11: Cluster s = Cluster s ∪ expandClusterFuzzy
( p, n Pts, C, Max , Min , M pts Max , M pts Min )
dens( p) = μdist ( p, pi ) (5) 12: end if
13: end for
pi ∈neigh( p,max ) 14: return Cluster s
123
Fuzzy extensions of the DBScan clustering algorithm
6 Experiments
123
D. Ienco, G. Bordogna
For FCore, FBorder and FDBScan we vary the soft con- where
straints considering all the possible values combination in
the previous intervals. For each method we retain the solu- i∈C j ∩Dcl μi j
FPrecision(C j , cl) =
tion with the least number of noise points for, in principle, |C j |
the used datasets should not contain noise.
and
6.1 Internal and external clustering validity measures i∈C j ∩Dcl μi j
FRecall(C j , cl) =
|Dcl |
The clustering results are assessed under both internal and
external validity measures. As internal criteria we choose the and the final Fuzzy F-Measure is defined as:
Partition Coefficient (Guillén et al. 2007) and the Fuzzy Per-
formance Index (Smiti and Eloudi 2013), while we employ |C j |
× Fuzzy F-Measure(C j , cl)
the Fuzzy F-Measure as external one (Suanmali et al. 2009), |D|
C j ∈C
which is a combination of Recall and Precision.
We define with D the dataset, with |D| the size of the
Each cluster C j is associated with the class cl that maximizes
dataset, with Dcl the instances of the dataset belonging to
the corresponding Fuzzy F-Measure(C j , cl). The final solu-
class cl, and with C the obtained cluster solution. We indicate
tion is a weighted sum between the Fuzzy F-Measure of a
with μi j the membership degree of the ith object to the jth
cluster C j and its importance considering the clustering solu-
cluster.
tion.
The Partition Coefficient (Guillén et al. 2007) is calculated
as follows:
6.2 Results
|D| |C|
1 We report the evaluation results of the different approaches
PC = × μi2j
|D| in Tables 2, 3 and 4. Table 2 shows the results in term of
i=1 j=1
Fuzzy F-Measure. We can observe that, most of the time,
our proposals outperform the competitors. Considering the
This internal evaluation measure allows to compute the Breast and (Iris) datasets, FCM obtains the highest score,
amount of overlap between clusters. High values of this mea- while our strategies still obtain reasonable and competitive
sure indicate more cluster cohesion and density. results. Regarding the comparison among the three differ-
As second internal measure we employ the Fuzzy Partition ent fuzzy extensions we proposed, we can observe that the
Index (Smiti and Eloudi 2013). This measure is defined as: FBorder strategy always reaches the same or the best score
⎛ ⎞ in term of Fuzzy F-Measure w.r.t. the other extensions. This
|D|
|C| model, contrary to the others we proposed, allows a fuzzi-
|C| μi2j
FPI = 1 − × ⎝1 − ⎠ ness degree only for border points, while it considers that core
|C| − 1 |D|
i=1 j=1 points can belong to only one cluster. The empirical results
underline that the assumption behind the FBorder well fits
This measure evaluates the degree of separation of the the underlined data distribution of the real-world benchmark
fuzzy partition produced by the clustering algorithm. More we considered.
in detail, the Fuzzy Partition Index quantifies the average Tables 3 and 4 summarize the results in term of Partition
cohesion of the clusters according to the membership func- Coefficient and Fuzzy Partition Index of the different algo-
tion of the element of each cluster. Also for this measure, rithms. Both measures highlight the quality of our new fuzzy
high values indicate more cluster cohesion. DBSCAN extensions. We can see that all the three extensions
The external measure we use is the Fuzzy F-Measure yield high values for the internal measures and outperform
(Suanmali et al. 2009). This measure is a fuzzy adaptation any of the competitors.
of the standard F-Measure commonly involved to compare In order to explain this result, we deeply inspected the
clustering results with the reference classification. First of all different clustering solutions. We observed that, first, the
we define the Fuzzy F-Measure for a cluster C j given a class Soft-DBSCAN and the FCM algorithms assign each object
cl as: to more clusters than the FCore, FBorder and FDBScan.
Second, for an object its membership values distribution over
all fuzzy clusters has often a multi-modal shape for both the
FPrecision(C j , cl) ∗ FRecall(C j , cl) Soft-DBSCAN and the FCM. This means that, several clus-
FMeasure(C j , cl) = 2 ×
FPrecision(C j , cl) + FRecall(C j , cl) ters share high membership values for the same object. This
123
Fuzzy extensions of the DBScan clustering algorithm
is not the case for our fuzzy DBSCAN extensions where, in contain core points with different membership values, thus
theory, an object can belong to multiple clusters but, in prac- allowing to detect clusters with heterogeneous densities of
tice, it has membership degrees greater than zero for a limited their nucleus with a single run of the algorithm. The second
number of clusters (usually no more than two or three clus- extension, Fuzzy Border DBSCAN, allows generating semi-
ters), which seems a reasonable characteristics of real data overlapping clusters with fuzzy border and homogeneous
distributions. dense cores. The third extension, Fuzzy DBSCAN, combines
the previous ones thus detecting clusters with both fuzzy core
and fuzzy border points, i.e. heterogeneous dense cores and
7 Conclusion overlapping borders.
The main novelty of the proposal is the intent to con-
In this contribution, we presented three fuzzy extensions of trol distinct fuzzification characteristics of the clusters that
the DBSCAN clustering algorithm, to the aim of modelling can be generated when using a clustering algorithm, thus
distinct density-based characteristics of the objects spatial suiting distinct application domains, such as user commu-
distributions in the feature space. The main characteristics nity detection in social networks with partial membership
of these algorithms are the definition of distinct soft con- either to disjoint communities Fuzzy Core DBSCAN or to
straints to specify the approximate local density of points semi-overlapped communities Fuzzy Border DBSCAN, and
needed for generating a cluster. Specifically, the first exten- ecosystems detection in satellite images Fuzzy DBSCAN.
sion, Fuzzy Core DBSCAN allows assigning a core point to Furthermore, besides leveraging the specification of the
a cluster with a membership value; in doing so, clusters can precise input, the proposals supply with a single run a solu-
123
D. Ienco, G. Bordogna
tion that summarizes multiple runs of the original classic Ji Z, Xia Y, Sun Q, Cao G (2014) Interval-valued possibilistic fuzzy
DBSCAN algorithm. Experimental comparison w.r.t. state of c-means clustering algorithm. Fuzzy Sets Syst 253:138–156
Kriegel H, Pfeifle M (2005) Density-based clustering of uncertain data.
the art fuzzy clustering approaches over real-world datasets In: KDD’05 vol 17, pp 672–677
underlined the higher quality of the results produced by our Nasibov EN, Ulutagay G (2009) Robustness of density-based cluster-
proposals, which better model the fuzzy characteristics of the ing methods with various neighborhood relations. Fuzzy Sets Syst
real datasets. 160(24):3601–3615
Pal NR, Pal K, Keller JM, Bezdek JC (2005) A possibilistic fuzzy c-
means clustering algorithm. IEEE Trans Fuzzy Syst 13(4):517–
Compliance with ethical standards
530
Parker J, Downs J (2013) Footprint generation using fuzzy-
Conflict of interest Dino Ienco and Gloria Bordogna declares that they neighborhood clustering. Geoinformatica 17:283–299
have no conflict of interest. Parker J, Hall L, Kandel A (2010) Scalable fuzzy neighborhood dbscan.
In: IEEE-fuzzy, pp 1–8
Ethical standard This article does not contain any studies with human Sander J, Ester M, Kriegel H, Xu X (1998) Density-based clustering in
participants or animals performed by any of the authors. spatial databases: the algorithm GDBSCAN and its applications.
Data Min Knowl Discov 2(2):169–194
Shamshirband S, Amini A, Anuar NB, Kiah LM, Wah TY, Furnell
S (2014) D-ficca: a density-based fuzzy imperialist competitive
References clustering algorithm for intrusion detection in wireless sensor net-
works. Measurement 55:212–226
Bezdek JC, Ehrlich R, Full W (1984) FCM: the fuzzy c-means clustering Smiti A, Eloudi Z (2013) Soft dbscan: improving dbscan clustering
algorithm. Comput Geosci 10(2):191–203 method using fuzzy set theory. Hum Syst Interact 1:380–385
Bordogna, G., Ienco, D.: Fuzzy core dbscan clustering algorithm. In: Suanmali L, Salim N, Binwahlan MS (2009) Fuzzy logic based method
IPMU, pp. 100–109 (2014) for improving text summarization. CoRR arXiv:0906.4690
Ester M, Kriegel H, Sander J, Xu X (1996) A density-based algorithm Ulutagaya G, Nasibov E (2012) Fuzzy and crisp clustering methods
for discovering clusters in large spatial databases with noise. KDD based on the neighborhood concept: a comprehensive review. J
160:226–231 Intell Fuzzy Syst 23:1–11
Guillén A, González J, Rojas I, Pomares H, Herrera LJ, Valenzuela Yager R, Filev D (1994) Approximate clustering via the mountain
O, Prieto A (2007) Using fuzzy logic to improve a clustering method. IEEE Trans Syst Man Cybern 24(8):1279–1284
technique for function approximation. Neurocomputing 70(16–
18):2853–2860
123