Received July 18, 2017, accepted August 14, 2017, date of publication September 18, 2017,
date of current version November 28, 2017.
Digital Object Identifier 10.1109/ACCESS.2017.2743985
A Clustering Validity Index Based
on Pairing Frequency
HONGYAN CUI1,2,3,5 , (Senior Member, IEEE), KUO ZHANG1,2,3 , YAJUN FANG7 ,
STANISLAV SOBOLEVSKY4,5,6 , CARLO RATTI5 , (Fellow, IEEE),
AND BERTHOLD K. P. HORN7
1 State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China
2 Beijing Laboratory of Advanced Information Networks, Beijing 100876, China
3 Key Laboratory of Network System Architecture and Convergence, Beijing 100876, China
4 Center for Urban Science and Progress, New York University, Brooklyn, NY 10003 USA
5 Senseable City Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
6 Institute of Design and Urban Studies of the National Research University ITMO, 197101 Saint-Petersburg, Russia
7 CSAIL Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
Corresponding author: Hongyan Cui (yan555cui@163.com)
This work was supported in part by the National Natural Science Foundation of China under Grant 61201153, in part by the National 973
Program of China under Grant 2012CB315805, and in part by the National Key Science and Technology Projects under
Grant 2010ZX03004-002-02.
ABSTRACT Clustering is an important problem, which has been applied in many research areas. However,
there is a large variety of clustering algorithms and each could produce quite different results depending on
the choice of algorithm and input parameters, so how to evaluate clustering quality and find out the optimal
clustering algorithm is important. Various clustering validity indices are proposed under this background.
Traditional clustering validity indices can be divided into two categories: internal and external. The former
is mostly based on compactness and separation of data points, which is measured by the distance between
clusters’ centroids, ignoring the shape and density of clusters. The latter needs external information, which
is unavailable in most cases. In this paper, we propose a new clustering validity index for both fuzzy and hard
clustering algorithms. Our new index uses pairwise pattern information from a certain number of interrelated
clustering results, focusing more on logical reasoning than on geometrical features. The proposed index
overcomes some shortcomings of traditional indices. Experiments show that the proposed index performs
better than traditional indices on the artificial and real datasets. Furthermore, we applied the
proposed method to solve two existing problems in the telecommunication field. One is to cluster serving GPRS
support nodes in the city of Chongqing based on service characteristics; the other is to analyze users' preferences.
INDEX TERMS Pairwise pattern, clustering validity, clustering analysis, fuzzy c-means.
I. INTRODUCTION
In hard clustering, each data point is assigned to exactly one
cluster. A well-known example of hard clustering is the k-means
algorithm [1]. In fuzzy clustering, each data point
can belong to multiple clusters. Fuzzy clusters can be easily
converted to hard clusters by assigning each data point to
the cluster in which it has the greatest membership. The most widely used
fuzzy clustering algorithm is fuzzy c-means (FCM) [2]–[4].
FCM selects the centroid of each initial cluster randomly
and computes an initial fuzzy membership matrix, then
iteratively minimizes an objective function until the algorithm
converges. We will use FCM to generate flexible partitions in
this paper.
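The FCM iteration described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in this paper: the function name `fcm`, the choice of random data points as initial centroids, the stopping rule (largest change in any membership below `tol`), and the `max_iter` cap are all assumptions. The membership matrix is stored as c x n, matching the layout of U used later in Section II.

```python
import numpy as np

def fcm(X, c, m=2.0, tol=1e-3, max_iter=300, seed=0):
    """Minimal fuzzy c-means sketch. Returns (U, V): U is c x n, V is c x d."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Assumption: initial centroids are c distinct, randomly chosen data points.
    V = X[rng.choice(n, size=c, replace=False)].copy()
    U = np.zeros((c, n))
    for _ in range(max_iter):
        # Squared Euclidean distance from every centroid to every point (c x n).
        d2 = ((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)  # avoid division by zero at a centroid
        # Membership update: u_ik is proportional to d_ik^(-2/(m-1)),
        # normalized so that each column of U sums to 1.
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=0, keepdims=True)
        # Centroid update: mean of the points weighted by u_ik^m.
        W = U_new ** m
        V = (W @ X) / W.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, V
```

Each column of the returned `U` sums to 1, which is the property the normalization argument in Section II-B relies on.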
Results from different clustering algorithms, or even from the
same algorithm, can differ greatly on the
same dataset, because the input parameters, which largely
determine the behavior of an algorithm, can vary. The
aim of clustering validation (CV) is to find the partition that best fits the
input dataset. With the help of CV, parameters needed for
the algorithm can be tuned more efficiently. For example, the
number of clusters for a clustering process (represented by c)
usually needs to be specified in advance. However, people
often do not have any specific criteria for choosing it; instead,
they usually make an arbitrary choice based on common
sense. In the proposed approach, all the clustering parameters
and results are evaluated by a clustering validity criterion,
and the partition that best fits the dataset is then produced together
with the corresponding value of c [5]–[7].
Clustering validity techniques are classified into two
categories: external validation and internal validation.
2169-3536 2017 IEEE. Translations and content mining are permitted for academic research only.
Personal use is also permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
VOLUME 5, 2017
H. Cui et al.: Clustering Validity Index Based on Pairing Frequency
External validation evaluates a partition by comparing it with
the assumed correct partition result, while internal validation
evaluates a partition by examining just the result. Obviously,
the former can only be applied in limited scenarios,
since in a real application the underlying structure of the
dataset is unknown and the ground-truth partition
is not available [8]. This paper
focuses on internal validation, which often measures the
compactness and separation of clusters.
Many internal indices have been proposed for clustering
validation over the past few decades [9], [10]. In the remainder of this paper we call these Clustering Validity
Indices (CVIs). Typical CVIs for fuzzy clustering include Bezdek’s partition coefficient (PC) [11] and partition
entropy (PE) [12], XB [13], FS [14], the fuzzy hypervolume
validity (FHV) proposed by Gath and Geva [15], SC [16],
the PBM index [17], [18], Wu and Yang’s PCAES index [19],
and Zhang’s VW index [20]. In addition, CVIs for crisp partitions, such as Dunn [2] and Davies and Bouldin [21], are widely used.
By common sense, a good clustering result should exhibit high compactness and significant separation. However,
those CVIs suffer from several inherent shortcomings. For example, variance is a common measure of compactness, and it
tends to prefer hyperspherical clusters. Another problem of the compactness measure is that these indices tend to
decrease monotonically when the number of clusters approaches
the number of data points in the dataset. In addition,
the separation between clusters is usually computed from the geometric centroid of each cluster, ignoring
other features of clusters such as shape, density, and scatter.
The recently proposed OSI [22] uses a measure of
multiple cluster overlap and a separation measure for each
data point, both based on an aggregation operation [23], [24]
of membership degrees. A series of OSI indices can be obtained
using different aggregation operations, but choosing
the appropriate aggregation operation remains a challenge.
In this paper, we present the Pairing Frequency Clustering
Validity Index (PFCVI), a new clustering validity index that aims
to overcome the shortcomings of CVIs based on compactness
and separation measures. The proposed index was inspired
by the following idea: across the different clustering results obtained by the same algorithm for different values of c, if an arbitrary
pair of data points in a dataset is always partitioned into the
same cluster, then the optimal partition should also assign this
pair to the same cluster. We call the phenomenon that a
certain pair of data points always belongs to the same cluster
across different values of c pairing frequency. One advantage
of PFCVI is that it is designed upon logical reasoning based
on statistical analysis of pairwise patterns, rather than the
frequently used compactness or separation measures. Moreover, PFCVI can be applied to both fuzzy clustering and hard
clustering. Lastly, the computational cost of PFCVI does not
depend on the dimension of the feature vector. A procedure
for choosing the optimal value of parameter c (the number
of clusters) from a range of alternatives using PFCVI is
presented. At the end of this paper, evaluation of PFCVI
is performed. Experiments on artificial and real datasets show
that PFCVI is stable and efficient.
The rest of the paper is organized as follows. Section II
describes our proposed clustering validity index in detail.
In section III, we evaluate the new index with artificial and
real datasets. Practical applications are shown in section IV.
Concluding remarks are given in section V.
II. THE PROPOSED CLUSTERING VALIDITY INDEX
BASED ON PAIRING FREQUENCY
In this section, we present the proposed clustering validity
index based on pairing frequency (PFCVI). Unlike traditional
CVIs introduced in section I, our proposed PFCVI provides
a new perspective to the issue of clustering validity.
A. THEORY OF PFCVI
Certain steps should be followed to determine the optimal
value of c using traditional CVIs. First, run a clustering
algorithm several times with c varying in a user-defined
range [c_min, c_max]. Second, compute CVI(c) for each partition
result. Finally, set c_opt so that CVI(c_opt) is optimal within the
predefined range. This search for c_opt uses
each partition result independently. Unlike traditional CVIs,
our proposed PFCVI considers many partition results together in
order to take advantage of global information and logical
reasoning. Given different values of c, a clustering
algorithm produces different results. If a pair of objects
in a dataset is always assigned to the same
cluster across these results, then the optimal partition should also assign the pair
to the same cluster. PFCVI not only tells us whether
a pair of objects should be partitioned into the same cluster,
but also delivers a belief value that indicates the
degree of confidence that the pair belongs to the same
cluster.
B. THE CALCULATION PROCEDURE OF PFCVI
This section describes the steps to calculate PFCVI. The procedure for
selecting c with it is given in the next subsection.
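First, we obtain the membership matrix U = [u_ij] from the
result of a clustering algorithm such as FCM:
$$U_{c \times n} = \begin{bmatrix} u_{11} & u_{12} & \cdots & u_{1n} \\ u_{21} & \cdots & \cdots & u_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ u_{c1} & \cdots & \cdots & u_{cn} \end{bmatrix} \tag{1}$$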
For each object j we define I_j = [u_ij]_max, where the notation
[ ]_max returns the value of i at which u_ij (1 ≤ i ≤ c) reaches
its maximum. We can conclude that a pair of objects x_s, x_k
(1 ≤ s, k ≤ n) belongs to the same cluster under this value of
c if I_s = I_k.
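As a small illustration of this step, the index I_j is simply the row index of the largest entry in column j of U, and the test I_s = I_k can be evaluated for all pairs at once. The helper names `hard_labels` and `same_cluster` are hypothetical, and indices are 0-based here rather than 1-based as in the text.

```python
import numpy as np

def hard_labels(U):
    """I_j for every object j: the row index i that maximizes u_ij (U is c x n)."""
    return np.argmax(U, axis=0)

def same_cluster(U):
    """Boolean n x n matrix whose entry (s, k) is True when I_s == I_k."""
    I = hard_labels(U)
    return I[:, None] == I[None, :]

# Toy membership matrix with c = 3 clusters and n = 3 objects.
U = np.array([[0.7, 0.6, 0.1],
              [0.2, 0.3, 0.1],
              [0.1, 0.1, 0.8]])
print(hard_labels(U))   # [0 0 2]
print(same_cluster(U))  # objects 0 and 1 share a cluster; object 2 does not
```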
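Second, for each value of c, we define:
$$F_c = \begin{bmatrix} f_{11} & f_{12} & \cdots & f_{1n} \\ f_{21} & \cdots & \cdots & f_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ f_{n1} & \cdots & \cdots & f_{nn} \end{bmatrix} \tag{2}$$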
We call F_c a pattern matrix. The element of F_c denoted by
f_sk indicates the degree to which x_s, x_k belong to the same
cluster or different clusters. It is calculated as follows:
$$f_{sk} = \begin{cases} 1 - \dfrac{c}{c-1} \times \left|\max u_s - \max u_k\right| & \text{if } I_s = I_k \\[2mm] -\dfrac{c}{2c-2} \times \left(\max u_s + \max u_k - \dfrac{2}{c}\right) & \text{if } I_s \neq I_k \end{cases} \tag{3}$$
where max u_s = max{u_is : 1 ≤ i ≤ c} and max u_k = max{u_ik :
1 ≤ i ≤ c}. Obviously, F_c is a symmetric matrix, and f_sk
measures the extent to which two points belong
to the same or different clusters. The condition I_s = I_k means
that point s and point k are in the same cluster to some extent,
and a small value of |max u_s − max u_k| means that s and k tend to
belong to the same cluster. For the second condition, I_s ≠ I_k
means that the memberships of point s and point k differ
greatly, and a large value of max u_s + max u_k indicates that s and k
should be assigned to different clusters. These two values must be
normalized. For fuzzy clustering, we have
$\sum_{i=1}^{c} u_{ik} = 1 \ \forall k$, so $\frac{1}{c} \le \max u_s, \max u_k \le 1$. It follows that
$0 \le |\max u_s - \max u_k| \le \frac{c-1}{c}$ and $\frac{2}{c} \le |\max u_s + \max u_k| \le 2$.
To make the normalization meet these bounds, we
use the normalized formula in Eq. (3). Overall, the case
0 < fsk ≤ 1 suggests that xs , xk share the majority of their
membership in the same cluster. The closer fsk is to 1, the
stronger is their affinity for belonging together. Correspondingly, the case −1 ≤ fsk < 0 suggests that xs , xk have little
in common for this value of c. The closer fsk is to -1, the
stronger is the disassociation between xs , xk . For the case of
hard clustering, the following formula is used to calculate fsk
$$f_{sk} = \begin{cases} 1 & \text{if } s, k \text{ in the same cluster} \\ -1 & \text{if } s, k \text{ in different clusters} \end{cases} \tag{4}$$
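Eqs. (3) and (4) translate directly into vectorized NumPy. The sketch below is only an illustration of those two formulas; the function names `pattern_matrix_fuzzy` and `pattern_matrix_hard` are hypothetical, U is assumed to be a c x n membership matrix with c ≥ 2 whose columns sum to 1, and hard labels are assumed to be given as one integer per object.

```python
import numpy as np

def pattern_matrix_fuzzy(U):
    """Pattern matrix F_c from a fuzzy c x n membership matrix, following Eq. (3)."""
    c = U.shape[0]
    I = np.argmax(U, axis=0)              # I_j: cluster of largest membership
    mx = U.max(axis=0)                    # max u_j for every object j
    same = I[:, None] == I[None, :]       # True where I_s == I_k
    diff = np.abs(mx[:, None] - mx[None, :])
    total = mx[:, None] + mx[None, :]
    f_same = 1.0 - (c / (c - 1.0)) * diff                # value in [0, 1]
    f_diff = -(c / (2.0 * c - 2.0)) * (total - 2.0 / c)  # value in [-1, 0]
    return np.where(same, f_same, f_diff)

def pattern_matrix_hard(labels):
    """Pattern matrix F_c from hard cluster labels, following Eq. (4)."""
    labels = np.asarray(labels)
    return np.where(labels[:, None] == labels[None, :], 1.0, -1.0)
```

A quick consistency check on the two forms: when U is crisp (every column is a one-hot vector), max u_s = max u_k = 1, so Eq. (3) gives 1 for pairs in the same cluster and −c/(2c−2) × (2 − 2/c) = −1 for pairs in different clusters, which is exactly Eq. (4).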
Next, we combine a certain number of F_c to obtain the
pairwise pattern matrix, denoted by P and defined as
$$P = \sum_{c=2}^{c_{upper}} F_c$$
where c_upper is a parameter that will be discussed in Section II-D.
Usually c_upper ≥ c_max, and
P can be normalized under this condition as Q:
$$Q = P / (c_{upper} - 1) \tag{5}$$
The matrix Q is the final global pairwise pattern matrix, which
indicates the likelihood that two data points belong to the same cluster
(or to different clusters). From the definition above we can see
that Q takes advantage of the information in all of
the partitions obtained by FCM for the range of c used.
Finally, our proposed clustering validity index PFCVI is
defined as:
$$PFCVI(c) = S(Q \circ F_c) \tag{6}$$
where the notation ‘◦’ represents the Hadamard product, and S represents the sum of the elements of Q◦F_c.
Let us look at the element (q_sk · f_sk) to fully understand PFCVI.
If f_sk and q_sk share the same sign, the pattern of pair (s, k) in
the FCM result for a specific value of c is in accord with
that in the global pattern matrix, so the pair (s, k) contributes positively to PFCVI(c), and vice versa. The ‘‘pattern
of pair (s, k)’’ means the occurrence of (s, k) belonging to the
same cluster or to different clusters. Therefore, a larger value of
PFCVI(c) means that the clustering process with parameter c
is more appropriate for the given dataset, and the c producing the
optimal value of PFCVI will be chosen as the final result.
PFCVI does not compute Euclidean distances, unlike many
other CVIs, so its computational cost is independent of the
dimension of the feature vectors. PFCVI is therefore relatively efficient
when dealing with high-dimensional data.
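Given the pattern matrices, Eqs. (5) and (6) reduce to a sum of matrices and one element-wise product. The sketch below assumes the F_c are held in a dictionary keyed by c (a hypothetical layout) and that S sums every entry of Q ◦ F_c, including the diagonal and both (s, k) and (k, s); the text does not spell out whether the diagonal is excluded, so treat that as an assumption.

```python
import numpy as np

def global_pattern_matrix(F_by_c):
    """Q = P / (c_upper - 1), with P the sum of F_c for c = 2..c_upper (Eq. (5))."""
    c_upper = max(F_by_c)
    P = sum(F_by_c[c] for c in range(2, c_upper + 1))
    return P / (c_upper - 1)

def pfcvi(Q, F_c):
    """PFCVI(c): sum of all entries of the Hadamard product of Q and F_c (Eq. (6))."""
    return float(np.sum(Q * F_c))

# Toy example with n = 4 objects and hard partitions for c = 2 and c = 3.
def hard_F(labels):
    labels = np.asarray(labels)
    return np.where(labels[:, None] == labels[None, :], 1.0, -1.0)

F_by_c = {2: hard_F([0, 0, 1, 1]), 3: hard_F([0, 0, 1, 2])}
Q = global_pattern_matrix(F_by_c)
scores = {c: pfcvi(Q, F) for c, F in F_by_c.items()}
```

Note that nothing here touches the feature vectors: the cost depends on n and on the number of partitions, not on the dimension of the data.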
C. THE PROCEDURE OF SELECTING THE OPTIMAL VALUE
OF c USING PFCVI
1) For each value of c = 2, 3, . . . , c_upper, carry out
the corresponding clustering algorithm and compute F_c
using Eqs. (2) and (3).
2) Compute the matrix Q using Eq. (5).
3) Compute PFCVI(c) for c = c_min, . . . , c_max using Eq. (6).
4) Set c_opt = arg max PFCVI(c), where the maximum is taken over c = c_min, . . . , c_max.
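The four steps can be put together as follows. To keep the sketch short it assumes that the clustering has already been run and that hard labels are available for every c from 2 to c_upper, so F_c is built with Eq. (4); with fuzzy memberships Eq. (3) would be used instead. The function name `select_c_opt` and the dictionary input are illustrative choices, not part of the original procedure.

```python
import numpy as np

def select_c_opt(labels_by_c, c_min, c_max):
    """Return (c_opt, scores) given hard labels for every c in 2..c_upper."""
    c_upper = max(labels_by_c)
    # Step 1: pattern matrix F_c for c = 2..c_upper (hard form, Eq. (4)).
    F = {}
    for c in range(2, c_upper + 1):
        lab = np.asarray(labels_by_c[c])
        F[c] = np.where(lab[:, None] == lab[None, :], 1.0, -1.0)
    # Step 2: global pattern matrix Q (Eq. (5)).
    Q = sum(F.values()) / (c_upper - 1)
    # Step 3: PFCVI(c) for c = c_min..c_max (Eq. (6)).
    scores = {c: float(np.sum(Q * F[c])) for c in range(c_min, c_max + 1)}
    # Step 4: the value of c with the largest PFCVI.
    c_opt = max(scores, key=scores.get)
    return c_opt, scores

# Six objects forming three natural pairs: {0,1}, {2,3}, {4,5}.
labels_by_c = {
    2: [0, 0, 0, 0, 1, 1],   # c = 2 merges the first two pairs
    3: [0, 0, 1, 1, 2, 2],   # c = 3 matches the natural grouping
    4: [0, 0, 1, 1, 2, 3],   # c = 4 splits the last pair
}
c_opt, scores = select_c_opt(labels_by_c, c_min=2, c_max=4)
print(c_opt, scores)
```

In this toy case the partition for c = 3 agrees best with the pairings that recur across all three partitions, so it receives the largest score.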
D. COMMENTS ABOUT cupper
Step 1 of this algorithm generates F_c for c = 2 to c_upper,
and when computing the global pattern matrix Q, the upper
bound of the summation operator is c_upper instead of c_max.
c_upper is a main factor influencing the performance of PFCVI
for the following reasons. If the optimal but unknown number
of clusters in a dataset (denoted by c∗) is much larger than
c_upper (c_upper ≪ c∗), there will be some pairs in all of the
partitions that would eventually be split when c reaches c∗,
so the matrices F_c on hand will add incorrect information to the
global pattern matrix Q. Let us give an example to illustrate
this case. Suppose that xs , xk belong to different clusters.
The value of c (c = 2, 3 . . . , cupper ) may be too small
to split the pair xs , xk into different clusters. We suppose
cupper = 2 and the actual number of clusters in a dataset
c∗ ≫ 2. In this case, many pairs of objects will be incorrectly
paired because there are only two clusters, although most of
them actually belong to different clusters. On the other hand,
if the number of clusters in a dataset is much smaller than
cupper (cupper ≫ c∗), some pairs of objects will be split into
different clusters in the clustering process with a larger value
of parameter c , although they are likely to belong to the
same cluster. This also adds incorrect information into the
global pattern matrix Q. To prevent this ‘‘splitting action’’
from happening, we try to make cupper satisfy the formula:
cupper ≥ cmax .
The datasets used in our experiments are all labeled,
and the maximum number of labeled subsets is less than
10, so we can simply set c_max = min(10, √n) and
c_upper = max(c_max, 0.5√n), which worked well in our
experiments. The optimal choice for c_opt falls in the middle
of (2, c_upper).
When the number of samples n is too large, these heuristics
become useless, so we can simply set cmax = cupper .
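For concreteness, the two heuristics can be evaluated as below. Because c must be an integer, this sketch rounds √n and 0.5√n down; that rounding rule and the function name `cluster_range` are assumptions, as the rounding is not recoverable from the text.

```python
import math

def cluster_range(n):
    """Return (c_max, c_upper) for a dataset of n points using the heuristics above."""
    root = math.sqrt(n)
    c_max = min(10, math.floor(root))             # c_max = min(10, sqrt(n))
    c_upper = max(c_max, math.floor(0.5 * root))  # c_upper = max(c_max, 0.5 sqrt(n))
    return c_max, c_upper

for n in (49, 200, 921):
    print(n, cluster_range(n))
```

For small datasets the two bounds coincide; c_upper only exceeds c_max once 0.5√n is larger than 10, that is, for n above 400.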
FIGURE 1. All the artificial datasets. (a) Over. (b) Over+Noise. (c) Bridge. (d) 3D-GD. (e) Shape. (f) Local Closeness. (g) Shape+Density.
(h) Gaussian+Density. (i) Edge. (j) Size. (k) X30. (l) Local Overlap.
III. EXPERIMENTAL RESULTS
We evaluate the performance of the proposed PFCVI by conducting extensive comparisons with eight CVIs (PC, PE,
XB, FHV, PBMF, PCAES, VW, OSI) using the FCM algorithm.
A small note about OSI is that the standard norms,
namely the min t-norm and the max t-conorm, are
used in our experiment for simplicity and representativeness. As in almost all papers dealing with fuzzy clustering
validity, the fuzzifier exponent m is set to 2, the termination
parameter for FCM convergence is set to 10^-3, and the
Euclidean distance is used. The optimal number of clusters
is searched in the range [c_min = 2, c_max], with c_max = min(10, √n)
in order to ensure a good balance between the number
of clusters and the number of points in the dataset [12]. We
set c_upper = max(c_max, 0.5√n) in our experiments.
In order to reduce the influence of random initialization
for FCM, we run the FCM algorithm 20 times for each
dataset and compute the corresponding CVIs 20 times
too. Then we take the average as the final result. All
the values of CVIs are normalized to fall in the
interval [0, 1].
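The averaging and rescaling step might look like the following. The text does not say which normalization maps the averaged values into [0, 1]; min-max scaling is assumed here purely for illustration, and the function name `average_and_normalize` is hypothetical.

```python
import numpy as np

def average_and_normalize(values):
    """values: array of shape (runs, number of c values) holding one CVI.

    Averages over the runs, then min-max scales the averages to [0, 1]
    (assumed normalization; the paper does not specify the method).
    """
    mean = np.asarray(values, dtype=float).mean(axis=0)
    span = mean.max() - mean.min()
    if span == 0:
        return np.zeros_like(mean)  # constant curve: nothing to rescale
    return (mean - mean.min()) / span

# Two runs of one index evaluated at three values of c.
curve = average_and_normalize([[1.0, 2.0, 3.0],
                               [3.0, 4.0, 5.0]])
```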
A. DATASETS
TABLE 1. Datasets with different properties.
Fifteen datasets with diverse characteristics were used to evaluate our
index, including good separation, overlapping clusters, different
cluster shapes, differences in density, and additional noisy
points. The first twelve datasets are artificial two-dimensional
datasets whose ground truth can be visually assessed by
examining their scatterplots in Fig. 1. The remaining three are
real datasets from the UCI Machine Learning Repository [31].
Table 1 presents brief descriptions of all datasets. The
dataset Over contains 200 points sampled from a mixture
of c = 4 bivariate normal distributions, with 50 points
for each component [see Fig. 1(a)]. In the Over+Noise dataset,
30 noise points sampled from a uniform distribution are
added to the Over dataset to simulate a noisy environment
[see Fig. 1(b)]. The dataset Bridge is composed of four connected clusters [see Fig. 1(c)]. The dataset Local Closeness
consists of 6 clusters, which are likely to be partitioned into
two clusters because of the local closeness [see Fig. 1(f)].
The Iris dataset contains data from three types of Iris,
named Setosa, Versicolor, and Virginica, each
of which contains 50 objects described by 4 dimensions
(features). Although Iris has 3 labeled subsets, 2 of them
overlap substantially. The consensus in most of the
literature is that Iris has only 2 clusters, which is optimal
for most clustering models, but some clustering
algorithms can actually produce three clusters, so 2 or 3 are
commonly regarded as reasonable clustering results for the Iris
dataset. The Breast Cancer dataset contains 699 instances, but
16 of them are removed because they are incomplete. The
Pima dataset consists of 768 instances from two overlapping
labels.
B. EXPERIMENT RESULT
Figs. 2, 3, and 4 plot the average values of all indices for
all the considered datasets (Figs. 3 and 4 for artificial datasets,
Fig. 2 for real datasets). The number of clusters c is
displayed on the horizontal axis and the y-value denotes the
normalized value of the CVIs. The c_opt for each CVI is
shown as a filled dot, and that of PFCVI is denoted by a filled
triangle. Table 2 summarizes the optimal number of clusters
obtained from the tested CVIs on artificial and real datasets.
The expected numbers of clusters are shown in the c∗ column,
which is either the physical number of clusters given by an
expert (real datasets) or can be figured out visually (artificial
datasets). For the datasets Over and Over+Noise, we can say they
have 2 visually apparent clusters, or 4 clusters with three
of them overlapping a little. We can see from Table 2 that
PC, PE, OSI, and VW suggest 2 as the result for c_opt while PFCVI
suggests c_opt = 4, which indicates
that these two datasets may have 2 well-separated clusters and
4 fuzzy clusters. The dataset Bridge with linking points is a
difficult problem for most of the indices. Here only VW and PFCVI
find the right number of clusters. The structures of the Shape
and X30 datasets are relatively easy; therefore, most of the presented indices, including the proposed one, correctly identify
the right values 3 and 4. Data points from the Edge dataset are
sampled from a mixture of 4 Gaussian distributions; however,
the edges of the clusters are blurred. PFCVI
and XB estimate the optimal cluster number to be 4, which
is aligned with expectation. For the Size dataset, PC, OSI, and
PFCVI perform well in estimating the correct number of clusters. The dataset Local Overlap has four clusters with three of
them overlapping, which is very similar to the datasets
Over and Over+Noise. In the same way, we consider 2 and 4
reasonable cluster numbers. On this view, FHV and PFCVI
produced the correct number of clusters, 4, while the others tend
to get 2 as the optimal number, except for XB, PBMF, and PCAES,
which are far from the reasonable cluster number. It is generally accepted that the right number of clusters for the Iris
dataset is two or three (the number of physical classes). Most
of the indices indicate either c = 2 or c = 3. Interestingly,
for c = 2 and c = 3, the values of PFCVI are very close, which may
also indicate that both results are rational. Almost all
indices identify the right number of clusters for the Breast
Cancer dataset, except PBMF and PCAES. For the dataset
Pima, our proposed PFCVI does not identify the right number
of labels, influenced by the ‘‘splitting action’’
phenomenon. In conclusion, none of the
indices correctly recognizes the expected number for all the
datasets. There will hardly be an index that is suitable for a
large number of different datasets. As Pal and Bezdek [32]
stated, ‘‘no matter how good your index is, there is a dataset
out there waiting to trick it (and you)’’.
FIGURE 2. Value of all CVIs with c from cmin to cmax on 3 real datasets. (a) Iris dataset. (b) Breast Cancer dataset. (c) Pima dataset.
TABLE 2. Optimal number of clusters obtained using different CVIs on artificial and real datasets.
FIGURE 3. Average values of all CVIs with c from cmin to cmax on 12 artificial datasets. (a) Over. (b) Over+Noise. (c) Bridge. (d) 3D-GD.
(e) Shape. (f) Local Closeness.
IV. PRACTICAL APPLICATION
In this section, we apply our proposed clustering validity
index in two real tasks. The first one is to cluster Serving
GPRS Support Nodes (SGSNs) in one main city of China
based on service characteristics. The second one is to analyse
user preferences for eight types of internet service based on
their behavior records.
FIGURE 4. Average values of all CVIs with c from cmin to cmax on 12 artificial datasets. (a) Shape+Density. (b) Gaussian+Density.
(c) Edge. (d) Size. (e) X30. (f) Local Overlap.
A. CLUSTER OF SGSNs
We use our proposed index PFCVI combined with FCM to
analyze the SGSNs and divide them based on the characteristics of customers’ behavior. The dataset contains
3270860 usage records with connections to 921 SGSNs.
Each SGSN acts as a data point with eight
feature items: the traffic of three IM (Instant Messaging)
applications (QQ, MSN, WeChat), the traffic of three streaming media
applications (PPTV, PPS, KanKan), the number of records, and the
total traffic in each SGSN. FCM is therefore applied to a set of
n = 921 objects, each represented by an 8-dimensional feature
vector.
TABLE 3. SGSN characteristics of various types.
TABLE 4. c_optimal with respect to c_max ranging over 8, 9, 10, 11, 12.
After choosing c_opt = 8 and hardening the corresponding
fuzzy 8-partition of the data, Table 3 shows that the number
of SGSNs in Cluster III is 833, with no more than 30 in each of the
other seven clusters. The cluster center of traffic amount for
Cluster III is 10^7, while the other cluster centers are 10^8 or 10^9.
That is to say, the magnitudes of the record amount and user
amount in Cluster III are 1-2 orders lower than in the other
clusters. The relationship of SGSNs and real geographical
positions can be used to explain the subscribers’ usage pattern
of traffics. When a subscriber uses traffic service in other
cities or countries instead of his own city, the traffic expense
will be high because of the cost of roaming. So it is easy
to understand that the user would reduce his traffic usage
in cellular networks because of the economic factor when
he roams out of his own city, and he would turn to WiFi
instead. The subscriber may prefer to use more data traffic
in his own city. Now let us look at our clustering results.
Those SGSNs with high traffic amounts are local SGSNs,
like Cluster VI. On the contrary, those SGSNs with lower
traffic amounts in the Cluster III spread all over the country,
the real geographical positions of SGSNs in the Cluster III
involve 25 different provinces, and the traffic amounts passing through these SGSNs are far lower than other SGSNs.
From this, it can be concluded that the SGSNs in clusters V,
VI, and VII are located in the home city.
B. CLUSTERS OF USERS GROUP
We applied this clustering index to cluster an operator’s users
in a certain city, dividing them into user groups with different traffic
characteristics. Each user is represented by an 8-dimensional
feature vector, which records the percentage of different types
of services within an individual’s total records. The services
include Multimedia Message Service (MMS), Web, Instant
Messaging (IM), streaming media, e-mail, phone call, file
transfer/P2P, and other types of services.
Here, c_max and c_upper are set equal and take the values
8, 9, 10, 11, 12. Our clustering results for n = 313505 user
profiles are listed in Table 4. From these results, we
find that there is a big difference between the clusters when
we divide the users into 4 groups. If users are divided
into more clusters, some clusters present similar patterns and
could have been merged into the 4 clusters, so the result of
c_optimal = 4 when c_max = 11 is the most reasonable.
In the case of 4 clusters, for the users in the 1st cluster
and the 2nd cluster, Web and Instant Messaging account
for almost 50% of the traffic record amounts. Users in the 1st
cluster tend to use the Web service more, while users in the
2nd cluster tend to use the IM service more. Users in
the 3rd cluster are mild users of Web and IM, while other
unmentioned applications account for a large share of their
usage. Users in the 4th cluster are heavy users of Web and
IM, and their Web usage weighs far more than their IM usage. We list the
characteristics of the cluster centers in Table 5. Because the
percentages of the other kinds of traffic use are very small,
only Web, IM, and Others out of the 8 dimensions are listed here.
TABLE 5. Cluster centers of 4 clusters.
V. CONCLUSION
In this paper, a new clustering validity index called PFCVI
has been proposed which is based on pairing frequency
instead of compactness-to-separation ratio criteria employed
by some classic CVIs. Another significant difference compared with other CVIs is that PFCVI, with the help of the
global pattern matrix, takes advantage of the information
from more than one clustering process to compute the
value of PFCVI(c). An extensive comparison with eight other
widely used indices shows that our new index performs well
for most of the datasets used in this study.
REFERENCES
[1] S. Theodoridis and K. Koutroumbas, Pattern Recognition. London, U.K.:
Academic, 2006, pp. 529–533.
[2] J. C. Dunn, ‘‘A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters,’’ J. Cybern., vol. 3, no. 3, pp. 32–57,
1973.
[3] J. C. Bezdek, ‘‘A convergence theorem for the fuzzy ISODATA clustering
algorithms,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-2, no. 1,
pp. 1–8, Jan. 1980.
[4] J. C. Bezdek, Pattern Recognition With Fuzzy Objective Function Algorithms. New York, NY, USA, 1981.
[5] J. C. Bezdek, J. M. Keller, R. Krishnapuram, and N. R. Pal, Fuzzy Models
and Algorithms for Pattern Recognition and Image Processing. Norwell,
MA, USA: Kluwer, 1999.
[6] G. W. Milligan and M. C. Cooper, ‘‘An examination of procedures for
determining the number of clusters in a data set,’’ Psychometrika, vol. 50,
no. 2, pp. 159–179, 1985.
[7] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, ‘‘On clustering validation
techniques,’’ J. Intell. Inf. Syst., vol. 17, no. 2, pp. 107–145, 2001.
[8] M. Brun, C. Sima, J. Hua, and B. Lowey, E. Suh, and E. R. Dougherty,
‘‘Model-based evaluation of clustering validation measures,’’ Pattern
Recognit., vol. 40, no. 3, pp. 807–824, 2007.
[9] N. R. Pal and J. C. Bezdek, ‘‘On cluster validity for the fuzzy C-means
model,’’ IEEE Trans. Fuzzy Syst., vol. 3, no. 3, pp. 370–379, Aug. 1995.
[10] W. Wang and Y. Zhang, ‘‘On fuzzy cluster validity indices,’’ Fuzzy Sets
Syst., vol. 158, no. 19, pp. 2095–2117, Oct. 2007.
[11] J. C. Bezdek, ‘‘Numerical taxonomy with fuzzy sets,’’ J. Math. Biol., vol. 1,
no. 1, pp. 57–71, 1974.
[12] J. C. Bezdek, ‘‘Cluster validity with fuzzy sets,’’ J. Cybern., vol. 3, no. 3,
pp. 58–74, 1973.
[13] X. L. Xie and G. Beni, ‘‘A validity measure for fuzzy clustering,’’ IEEE
Trans. Pattern Anal. Mach. Intell., vol. 13, no. 8, pp. 841–847, Aug. 1991.
[14] Y. Fukuyama and M. Sugeno, ‘‘A new method of choosing the number of
clusters for the fuzzy C-means method,’’ in Proc. 5th Fuzzy Syst. Symp.,
1989, pp. 247–250.
[15] I. Gath and A. B. Geva, ‘‘Unsupervised optimal fuzzy clustering,’’ IEEE
Trans. Pattern Anal. Mach. Intell., vol. 11, no. 7, pp. 773–780, Jul. 1989.
[16] N. Zahid, M. Limouri, and A. Essaid, ‘‘A new cluster-validity for fuzzy
clustering,’’ Pattern Recognit., vol. 32, no. 7, pp. 1089–1097, 1999.
[17] M. K. Pakhira, S. Bandyopadhyay, and U. Maulik, ‘‘Validity index for
crisp and fuzzy clusters,’’ Pattern Recognit., vol. 37, no. 3, pp. 487–501,
2004.
[18] M. K. Pakhira, S. Bandyopadhyay, and U. Maulik, ‘‘A study of some
fuzzy cluster validity indices, genetic clustering and application to pixel
classification,’’ Fuzzy Sets Syst., vol. 155, no. 2, pp. 191–214, 2005.
[19] K.-L. Wu and M.-S. Yang, ‘‘A cluster validity index for fuzzy clustering,’’
Pattern Recognit. Lett., vol. 26, no. 9, pp. 1275–1291, 2005.
[20] Y. Zhang, W. Wang, X. Zhang, and L. Yi, ‘‘A cluster validity index for
fuzzy clustering,’’ Inf. Sci., vol. 178, no. 4, pp. 1205–1218, 2008.
[21] D. L. Davies and D. W. Bouldin, ‘‘A cluster separation measure,’’ IEEE
Trans. Pattern Anal. Mach. Intell., vol. PAMI-1, no. 2, pp. 224–227,
Apr. 1979.
[22] H. L. Capitaine and C. Frélicot, ‘‘A cluster-validity index combining an
overlap measure and a separation measure based on fuzzy-aggregation
operators,’’ IEEE Trans. Fuzzy Syst., vol. 19, no. 3, pp. 580–588, Jun. 2011.
[23] T. Calvo, A. Kolesárová, M. Komorníková, and R. Mesiar, Aggregation
Operators: Properties, Classes and Construction Methods. Heidelberg,
Germany: Physica-Verlag, 2002, pp. 3–106.
[24] M. Grabisch, J. Marichal, R. Mesiar, and E. Pap, Aggregation Functions
(Encyclopedia of Mathematics and its Applications Series), vol. 127.
Cambridge, U.K.: Cambridge Univ. Press, 2009.
[25] J. Yu, Q. Cheng, and H. Huang, ‘‘Analysis of the weighting exponent in
the FCM,’’ IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 34, no. 1,
pp. 634–639, Feb. 2004.
[26] D. Dembélé and P. Kastner, ‘‘Fuzzy C-means method for clustering
microarray data,’’ Bioinformatics, vol. 19, no. 8, pp. 973–980, 2003.
[27] M. Bouguessa, S. Wang, and H. Sun, ‘‘An objective approach to cluster
validation,’’ Pattern Recognit. Lett., vol. 27, no. 13, pp. 1419–1430, 2006.
[28] E. P. Klement and R. Mesiar, Logical, Algebraic, Analytic and Probabilistic
Aspects of Triangular Norms. New York, NY, USA: Elsevier, 2005.
[29] L. Mascarilla, M. Berthier, and C. Frélicot, ‘‘A k-order fuzzy OR operator
for pattern classification with k-order ambiguity rejection,’’ Fuzzy Sets
Syst., vol. 159, no. 15, pp. 2011–2029, 2008.
[30] S. H. Kwon, ‘‘Cluster validity index for fuzzy clustering,’’ Electron. Lett.,
vol. 34, no. 22, pp. 2176–2177, Oct. 1998.
[31] A. Asuncion and D. J. Newman. (2007). ‘‘UCI machine learning repository.’’ School Inf. Comput. Sci., Univ. California, Irvine, CA, USA.
Tech. Rep. [Online]. Available: http://archive.ics.uci.edu/ml/datasets.html
[32] N. R. Pal and J. C. Bezdek, ‘‘Correction to ‘on cluster validity for the fuzzy
C-means model’ [Correspondence],’’ IEEE Trans. Fuzzy Syst., vol. 5, no. 1,
pp. 152–153, Feb. 1997.
HONGYAN CUI (SM’14) has conducted research at the Massachusetts Institute of Technology as a visiting
scholar since 2014, and at the CSIRO ICT
Center, Australia, in 2009. She is currently a Professor and
a Ph.D. Supervisor in the School of Information
and Communications Engineering, Beijing University of Posts and Telecommunications. She is
a Founding Partner and a Deputy Director of the
Specialties Committee of Smart Healthcare, China
Ministry of Industry and Information Technology.
She also acts as the Project Leader or a Primary Researcher for more than
ten national research projects. She has published over 70 SCI/Ei papers
and five books. She holds more than ten national patents. Her research
interests include big data analysis and visualization, intelligent resource
management in future networks, cloud technology, and social physics. She
acts as the Publication Chair of IEEE UV’18, and the Track Chair of WPMC’13,
Global Wireless Summit’14, ICC’15, the IEEE UV’16, and Globecom’16.
She is a TPC member of IEEE WPMC’14, ICCC’15, ICC’15 and ’16,
BigData’15, CIoT’16, and Globecom’16. She acts as a Reviewer for
JSAC, Chaos, Transactions on NNLS, Globecom, ICC, WCNC, and WPMC.
KUO ZHANG received the master’s degree
in information and communications engineering
from the Beijing University of Posts and Telecommunications, China, in 2014. He is currently pursuing the Ph.D. degree in computer science at
Rutgers University, New Brunswick, NJ, USA.
He has been an Engineer with Baidu since 2014.
His research interests are machine learning and big
data analysis. He received the Excellent Students
Scholarship Award of BUPT in 2011 and 2012. He
has published three Ei papers and one SCI paper.
YAJUN FANG received the Ph.D. degree from the
Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology (MIT). She is currently a Research Scientist
with the MIT Intelligent Transportation and Research
Center. She is also the Program Coordinator of
the MIT Universal Village (an advanced version
of Smart Cities) Program, and the Conference
Chair of the International Conference on Universal
Village. Her research focuses on machine
vision, machine learning, big data analysis, system theory, and their applications to robotics/autonomous vehicles, intelligent transportation systems,
intelligent healthcare, and future smart cities/universal village.
STANISLAV SOBOLEVSKY received the Ph.D.
degree in 1999 and the D.Sc. (Habilitation) degree
in 2009 in mathematics in Belarus. He has been
an Associate Professor of practice with the Center for Urban Science and Progress, New York
University, since 2015. His former work experience includes research, faculty, and administrative
positions at the Massachusetts Institute of Technology, Belarusian State University, and the Academy of
Science of Belarus. He applies his fundamental
quantitative background to studying human behavior in an urban context and
the city as a complex system through its digital traces: spatio-temporal big data
created by various aspects of human activity. He has authored more than a hundred
research papers in top journals such as PNAS, Scientific Reports, Physical
Review E, PLoS ONE, Royal Society Open Science, EPJ Data Science,
Applied Geography, Environment and Planning B, the International Journal
of GIS, Studies in Applied Mathematics, and others. His research is conducted
in close cooperation with city agencies and industrial partners from banking,
telecom, defense, insurance, and other areas.
CARLO RATTI received the M.Sc. degree in engineering from the Politecnico di Torino, Italy, and
the Ecole des Ponts, France, and the M.Phil. and
Ph.D. degrees in architecture from the University
of Cambridge, U.K. He then moved to the Massachusetts Institute of Technology as a Fulbright Senior
Scholar. He is currently a Professor
of practice of urban technologies with the Massachusetts Institute of Technology, USA, where he
directs the Senseable City Laboratory. He is also a
Founding Partner of the international design and
innovation office Carlo Ratti Associati. His research interests include urban design, human–computer interfaces, electronic media, and the design of public spaces.
BERTHOLD K. P. HORN is currently an Academician and a Professor of computer science
and engineering with the Massachusetts Institute
of Technology. He is with the Computer Science
and Artificial Intelligence Laboratory, Department
of Electrical Engineering and Computer Science.
He published the book Robot Vision, and has been
a Translator for many national languages. He was
elected a Fellow of the American Association for
Artificial Intelligence in 1990 for his significant contributions to the field of AI. He was elected to the National Academy of
Engineering in 2002 for his contributions to computer vision, including the recovery
of three-dimensional geometry from image intensities. He has received
numerous awards, including the Rank Prize for pioneering work leading to practical
vision systems (Rank Prize Funds) in 1989, and the Azriel Rosenfeld Lifetime Achievement Award (IEEE Computer Society) in 2009 for pioneering work on
early vision including optical flow and shape from shading.