A Hash-Based Co-Clustering Algorithm For Categorical Data

Download as pdf or txt
Download as pdf or txt
You are on page 1of 12

Expert Systems With Applications 64 (2016) 24–35

Contents lists available at ScienceDirect

Expert Systems With Applications


journal homepage: www.elsevier.com/locate/eswa

A hash-based co-clustering algorithm for categorical data


Fabrício Olivetti de França
Center of Mathematics, Computing and Cognition (CMCC), Universidade Federal do ABC (UFABC) – Santo André, SP, Brazil

a r t i c l e i n f o a b s t r a c t

Article history: Cluster analysis, or clustering, refers to the analysis of the structural organization of a data set. This anal-
Received 23 February 2015 ysis is performed by grouping together objects of the data that are more similar among themselves than
Revised 16 July 2016
to objects of different groups. The sampled data may be described by numerical features or by a symbolic
Accepted 17 July 2016
representation, known as categorical features. These features often require a transformation into numer-
Available online 20 July 2016
ical data in order to be properly handled by clustering algorithms. The transformation usually assigns a
Keywords: weight for each feature calculated by a measure of importance (i.e., frequency, mutual information). A
Co-clustering problem with the weight assignment is that the values are calculated with respect to the whole set of
Categorical data objects and features. This may pose as a problem when a subset of the features have a higher degree of
Data mining importance to a subset of objects but a lower degree with another subset. One way to deal with such
Text mining problem is to measure the importance of each subset of features only with respect to a subset of ob-
Biclustering
jects. This is known as co-clustering that, similarly to clustering, is the task of finding a subset of objects
and features that presents a higher similarity among themselves than to other subsets of objects and
features. As one might notice, this task has a higher complexity than the traditional clustering and, if
not properly dealt with, may present an scalability issue. In this paper we propose a novel co-clustering
technique, called HBLCoClust, with the objective of extracting a set of co-clusters from a categorical data
set, without the guarantees of an enumerative algorithm, but with the compromise of scalability. This is
done by using a probabilistic clustering algorithm, named Locality Sensitive Hashing, together with the
enumerative algorithm named InClose. The experimental results are competitive when applied to labeled
categorical data sets and text corpora. Additionally, it is shown that the extracted co-clusters can be of
practical use to expert systems such as Recommender Systems and Topic Extraction.
© 2016 Elsevier Ltd. All rights reserved.

1. Introduction (Feng, Chen, & Ye, 2007), frequent patterns of gene expres-
sions (de França & Von Zuben, 2010) and many more.
The abundance of data being collected nowadays demands a set In order to accomplish such task, the objects of the data set are
of tools to automatically extract useful information from them. If described by a set of features measured during the data collection.
the data are partially labeled, a possible information to be extract Each feature can be represented by a numerical quantity, such as
is in the form of a mathematical model that can deduce the label height of a person, amount of gas measured on a car tank, or de-
from a set of measured variables, this characterizes the supervised scriptive characteristic or category, such as the gender of a person
learning. On the other hand, if the data are unlabeled, the informa- or the research topics they are interested.
tion can be extracted by modeling a group structure of the objects The objects described by numerical features can be conve-
that may describe the generating process of the data or may give a niently represented as numerical vector and the objects can be
summarization of the information contained on it. This is referred naturally compared to each other by using distance metrics. It
as unsupervised learning and it is commonly studied by means of makes sense to say that one person has twice the height of an-
clustering algorithms. other.
Data clustering can refer to the task of dividing a data set On the other hand, categorical features lacks these properties
into subset of objects that are more similar to each other than and should be transformed into numerical features in order to be
to the remaining elements of the set. There is a wide range of compared among themselves. For example, describing one person
applications such as segmenting a surveyed population (Morgan as male and another as female does not imply that one is more
& Sonquist, 1963) for market purposes, image quantization than the other.

E-mail address: folivetti@ufabc.edu.br, fabricio.olivetti@gmail.com

http://dx.doi.org/10.1016/j.eswa.2016.07.024
0957-4174/© 2016 Elsevier Ltd. All rights reserved.
F.O. de França / Expert Systems With Applications 64 (2016) 24–35 25

Some similarity metrics were proposed to quantify the differ- can be easily explained, thus improving the interpretability of the
ence between objects of categorical features (Boriah, Chandola, & model.
Kumar, 2008), mostly based on the matching features between two In many situations these relaxations can benefit the cluster
objects. One example is the Jaccard metric that, given the sets of analysis. When a cluster analysis is performed, it is expected that
features for two objects, it calculates the ratio between the cardi- the natural grouping of the data is related to the intended labeling
nality of the intersection between the two sets and the cardinality of the objective of the study. But, this expectation may not hold
of their union. true. For example, when trying to classify a set of animals, the
One problem that must be dealt with when using categorical clustering algorithm may correctly identify that lions, deers and
data is the high dimensionality. Since most clustering algorithms horses belong to the same class, but may incorrectly classify sea
requires a vectorial representation of the objects, the categorical lions, octopus and tuna as belonging to the same class if their com-
features are usually represented as a binary vector, with every po- mon traits are prevalent. The co-clustering of this data set would
sition representing whether the object has a given feature or not. still find these groups but, additionally, would find other groups re-
For example, if one measured feature is whether a person is male lating sea lions with mammals, tuna with other fishes and octopus
or female, it would be represented by a 2-dimensional vector. with invertebrates.
But, some categorical features may span into tens, hundreds Another possibility regards the topics extraction of a textual
or even thousands of vector dimensions. When describing a song data set. In this task it is sought to infer the set of words that
by its genre, each object would be represented as a vector with describes the topic of each document, based on the analysis of the
more than 1500 dimensions. This high dimension exponentially in- whole data set. The usual techniques applied to such task requires
creases the search space and it can cause a loss of precision on the a prior knowledge of how many different topics there are in the
similarity metrics. This is called the Curse of Dimensionality (Har- corpus and then tries to find the features responsible for the gen-
Peled, Indyk, & Motwani, 2012). erative process of each cluster of documents. Again, the restriction
This problem can be dealt with by reducing the dimensional- of a fixed generative process for each document (i.e., the generative
ity of the objects while preserving the similarity relationship be- process of the only cluster it belongs to) may fail to acknowledge
tween them. Probabilistic Dimension Reduction (Har-Peled et al., a second or third topic inside the text. With the co-clustering al-
2012) is the family of algorithms that exploits the probability that gorithm, the document can be grouped with different sets of doc-
two similar objects will be considered to be equal when a subset uments regarding different topics and, as this procedure also high-
of features is randomly sampled and used for comparison. lights the features used to create each cluster, it makes the topic
One of these algorithms, called Minhashing (Broder, 1997; of each cluster explicit.
Zamora, Mendoza, & Allende, 2016), approximates the Jaccard In- Finally, in Recommender Systems a clustering algorithm would
dex between two objects. This algorithm relies on the fact that the group together users with similar tastes. But then again, only the
probability of the first non-zero position of any random permu- prevalent common taste of each set of users will be taken into ac-
tation of the feature vector is the same for two objects is equal to count. If, for example, a given user rates positively many comedy
the Jaccard Index between them. The algorithm generates a smaller movies and just a few action movies, they would be likely grouped
dense representation of the data with this information. together with other users that share a taste for comedy. On the
The reduction of dimensionality has two drawbacks when ap- other hand, the co-clustering algorithm could also assign them to
plied prior the clustering procedure: i) the compact representation a group of users that like action movie. Besides, the taste of each
may hide some seemingly unimportant features that could be used user could be described by the combined set of features of every
to describe a smaller group and, ii) the new set of features will lose group they belong to.
its interpretability, since there is no clear relationship between the But, this flexibility comes with a price, the number of groups
original set and the reduced set. that can be found is usually large, given all the possible combina-
A more direct approach is to perform the data clustering in a tions of subset of objects and features. Some of these groups may
two-way manner by finding the clusters of objects conditioned to be irrelevant to the subject of analysis, rendering a burden to the
a subset of features. This defines the family of algorithms known post-analysis procedure.
as co-clustering. Some co-clustering algorithms coped with this problem by rein-
Data Co-Clustering (Dhillon, Mallela, & Modha, 2003; de França, troducing some of the constraints of the classical clustering, such
2012; Gao & Akoglu, 2014; Labiod & Nadif, 2011), also known as as the search for a pre-specified number of clusters and assign-
biclustering, tries to find subsets of objects and features that max- ing each object to only one group (Dhillon et al., 2003; Labiod &
imizes the similarity of the selected objects when considering the Nadif, 2011). These algorithms retain only the explicit description
chosen features. It exploits the fact that a given object may belong of which features were used as part of the clustering process.
to different categories when viewed by different aspects of its de- Despite theses difficulties, there are some co-clustering algo-
scription. For example, a given news text may report a story about rithms capable of dealing with such flexibility, the most recent be-
the economies of a football team. This document will have terms ing the HBLCoClust (de França, 2012) and CoClusLSH (Gao & Akoglu,
that are related to sports and other terms that relates to economy. 2014). They both have in common the use of a probabilistic dimen-
So, this object can be assigned to the group of sports related docu- sion reduction technique, named Locality Sensitive Hashing (LSH),
ments and the group of economy related document, depending of used to find promising regions with high probability of containing
the selected set of words. co-clusters.
This technique allows for many relaxations of the constraints In the original HBLCoClust algorithm, after the search for
imposed by traditional clustering1 . For example, each object may the promising regions, a graph partitioning technique, called
belong to more than one group, given a different subset of features. METIS (Karypis & Kumar, 2012), was used in order to generate
Additionally, one feature may be used to define different groups, meaningful results, but constraining the algorithm to a pre-defined
associated with different subset of features. number of clusters.2
Moreover, since each group is explicitly defined by a subset
of features, the reason for grouping together a subset of objects

2
It is also worth mentioning that the newest versions of METIS did not compile
1
Note: these relaxations are not present in every co-clustering algorithm. correctly in some systems, thus making the HBLCoClust inaccessible to many users.
26 F.O. de França / Expert Systems With Applications 64 (2016) 24–35

Fig. 1. Example of different quality measures: (a) constant bicluster, (b) constant rows, (c) constant columns, (d) coherent values (rows or columns should exhibit high
correlation).

This paper proposes a reformulation of HBLCoClust algorithm data. These data sets are composed by a set of documents de-
that does not depend of external algorithms while maintaining the scribed by the tokenized terms. A co-cluster of this data set would
scalability and automatically determining the number of groups to represent a subset of documents that exclusively share a subset of
return. The general idea is to pre-process the set of features, elim- these terms. Hopefully, this subset of terms describes the topic of
inating those possessed by the minority of the objects, then the the set of documents.
LSH algorithm is applied to the pre-processed data set in order to Formally, given the set O of objects and the set F of observed
identify promising regions to be explored. These regions are then features, a Co-Cluster C of a subset O ⊂ O, a subset F ⊂ F and a
explored by an enumerative algorithm, called InClose, returning ev- relation R ⊂ O × F can be described as:
 
ery co-cluster contained inside the given regions. Finally, the co- C (O , F  ) = O × F  | o ∈ O , f ∈ F  ∧ (o, f ) ∈ R , (1)
clusters sharing a percentage of their elements are merged in order
to reduce the number of groups to be analyzed. where O and F can also be denoted as the objects and features
The experimental results will evaluate the groups generated by clusters, respectively.
HBLCoClust through the consistency of the grouped objects accord- The relation R contains all the tuples (o, f) inside the data set.
ing to their labels and the information conveyed by the subset of Notice, though, that the constraint that every pair of object and
features of each group when considering each subset of objects, feature of the co-cluster must exist in R may lead to a large set of
and scalability with respect to the number of elements in the data very small groups.
set. So, in order to minimize the number of groups while maximiz-
Additionally, some practical implications of the interpretability ing the size of such groups, the previous equation can be reformu-
of the groups will be illustrated in order to show the usefulness of lated as:
 
Data Co-Clustering to different applications. C ( O , F  ) = O × F  | |R | ≥ ρ · |O × F  | , (2)
In Section 2 the co-clustering definition will be described
where |.| is the cardinality of a set and
in more details as well as some of its practical applications.  
Section 3 explains the proposed algorithm with all of its key as- R = (o, f ) ∈ R | o ∈ O , f ∈ F  . (3)
pects. Next, in Section 4, a complete set of experiments will be Notice that there is no restriction whether an object or a fea-
performed in order to assess how well the proposed algorithm per- ture must belong to only one group. There are other possible vari-
forms on real-world scenarios against other similar co-clustering ations of this problem such as the k, l−Co-Clustering, in which it
algorithms. Finally, Section 5 will give final comments on this work seeks to partition the data into k objects clusters and l features
along with some future perspectives. clusters by maximizing the number of non-null elements from the
submatrices induced by each combination of k and l. Specifically,
2. Co-clustering in this work, the focus will be on the formulation given in Eq. 2.

3. Hash-based linear co-clustering


Co-Clustering refers to the task of finding subsets of rows and
columns from a data matrix such as the values of the extracted
This section will describe some fundamental definitions and al-
submatrix presents a desired relationship (Cheng & Church, 20 0 0;
gorithms in order to better understand the proposed algorithm.
Dhillon et al., 2003; de França, 2012; de França & Von Zuben, 2010;
First, the basics of probabilistic data clustering through random
Gao & Akoglu, 2014; Hartigan, 1972; Labiod & Nadif, 2011; Mirkin,
hash functions will be introduced. Following, the enumerative al-
1996). Usually, the rows and columns are described as objects and
gorithm, called InClose, will be explained and, finally, the proposed
features, respectively, from a Data Mining perspective. The rela-
algorithm will be detailed.
tionship sought by this algorithm will depend of the nature of
the data set. Some possible relationships are the submatrices with 3.1. Locality sensitive hashing
constant values, with constant values along the rows or along the
columns, correlated values, values expressing any consistent order- A popular algorithm when dealing with high dimensional and
ing or dense submatrices from sparse data (see Fig. 1). high volume sets of data is a probabilistic algorithm called Locality
This technique was already applied to a diverse set of applica- Sensitive Hashing (LSH). This algorithm exploits the fact that two
tions such as gene expression analysis (Coelho, de França, & Von very similar samples will likely collide when mapped by a weak
Zuben, 2009; de França, Coelho, & Von Zuben, 2008; de França hash function. In fact, depending on how this hash function is cre-
& Von Zuben, 2010; Mitra & Banka, 2006), text mining (de Cas- ated, the probability of this collision is known to be proportional
tro, de França, Ferreira, Coelho, & Von Zuben, 2010; Dhillon et al., to their similarity.
2003), recommendation systems (de Castro, de França, Ferreira, & One of such hash functions is the Minwise Independent Permu-
Von Zuben, 2007; Symeonidis, Nanopoulos, & Manolopoulos, 2008) tation (Minhash) (Carter & Wegman, 1979; Har-Peled et al., 2012)
and data imputation (de França, Coelho, & Von Zuben, 2013). The that states that the probability of collision of two hashed objects
most popular application of the co-clustering algorithms is the is proportional to their Jaccard similarity. Jaccard Index or Jaccard
search for additive coherence (de França & Von Zuben, 2011) from Similarity is used when the data set is described through sets of
a large gene expression data. categorical features. This similarity can be calculated as:
Recently, there has been an increase of interest on the extrac-
|o1 ∩ o2 |
tion of information of categorical data sets, such as text document J= , (4)
|o1 ∪ o2 |
F.O. de França / Expert Systems With Applications 64 (2016) 24–35 27

where o1 and o2 are the two objects being compared. The Jaccard • Every object in O has every feature in F ,
Index varies from 0 to 1, with the later meaning that the two ob- • For every o ∈ O, o ∈ O , there is one f ∈ F such that (o, f ) ∈ R,
jects are equal. • For every f ∈ F, f ∈ F , there is one o ∈ O such that (o , f) ∈ R.
Given a predefined order for the set of features, the Minhash
algorithm generates a random permutation π of this set and take In other words, this technique seeks the maximal subset of ob-
the first feature describing each object as its representative feature. jects O and the subset of features F that satisfies Eq. (1).
The probability that two objects have the same representative fea- In Andrews (2009) the enumerative algorithm InClose was pro-
ture is given by: posed and further improved in Andrews (2011). This algorithm
guarantees the exact set of formal concepts (i.e., co-clusters) with-
|o1 ∩ o2 |
P (oπ1 = oπ2 ) = , (5) out generating the same solution more than once (i.e., avoiding re-
|o1 ∪ o2 | dundant search).
For the sake of brevity and clarity, the InClose algorithm will
that is equal to the Jaccard Index.
be explained following the notations used in the previous sections
So, the expected value of the Jaccard Index is estimated by av-
(i.e., formal concepts will be called co-clusters from now on) and
eraging the number of collisions of two objects over k independent
with a small adaptation that will be explained later in this section.
random permutations. Since generating random permutations can
The general idea of the algorithm is to recursively enumerate
be computationally expensive, the permutation is approximated by
the Co-Clusters by following the lexicographical order of the fea-
an universal hash function such as the one proposed in Carter and
ture set F, expanding the current co-cluster until it becomes max-
Wegman (1979):
imal and branching the algorithm to explore new candidate solu-
h(x ) = a · x + b mod P, (6) tions.
Initially, the algorithm starts with the initial co-cluster (O , F )
where a and b are randomly chosen numbers from an uniform dis-
where O = O, the entire set of objects and F  = { f }, a subset con-
tribution, and P is a large prime number. The variable x is the value
taining the first feature of F when following the lexicographical or-
to be hashed, i.e., a number associated with the original index of
der.
the ordered set of features. Notice that, this prime number should
By following the lexicographical order of the features in F, the
be at least as large as the number of features. This hash function
algorithm sequentially creates a new set F  = F  ∪ f  for every fea-
will map each original index to an index in the range [0, P].
ture f ∈ F that is positioned after f. For each F  it generates a new
The Minhash of an object j for permutation i is simply the fea-
set O  that satisfies Eq. (1).
ture index x that minimizes the hash function hi (x):
  After this step, the subset F is merged with every generated
mhi (O j ) = arg min hi (x ) | ∀x ∈ O j . (7) set F  whenever the corresponding O = O , in other words, F will
x contain every feature that maintains the integrity of the set O ,
Given a data set of n objects, each object containing an aver- maximizing the co-cluster.
age of m̄ observed features, out of m possible features, and k hash The remaining pairs (O  , F  ) will be passed to a recursive call
functions. The complexity of this algorithm is O(n · m̄ · k ), with of the algorithm together with f  = F  ∩ F  as long as (O  , F  ) is
m̄ << m on sparse data sets. cannonical with respect to f .
This idea was extended in Har-Peled et al. (2012) as an scalable A Co-Cluster (O , F ) is considered cannonical with respect to a
algorithm for the Nearest-Neighbor problem named Locality Sen- feature f if and only if there is no feature f < f that can be inserted
sitive Hashing (LSH). This algorithm creates p hash signatures for into F such as Eq. (1) is still satisfied. This verification is required
each object by grouping together a sequence of k Minhashes. Each to avoid redundancy.
signature represents a bucket and the probability that two objects The InClose algorithm is summarized in Algorithm 1. Notice that
o1 and o2 will collide into the same bucket is given by: two other parameters are included in this algorithm: minObjs and
minFeats, that are used to discard co-clusters smaller than these
Pcol l ision (o1 , o2 ) = 1 − (1 − J k ) p ) (8) thresholds.
where J is their Jaccard Index, k is the number of minhashes used
to build one hash signature and p is the number of hash signa- 3.3. Hash-Based linear co-Clustering algorithm
tures.
The values of k and p can be adjusted to maximize the proba- The Hash-based Linear Co-Clustering algorithm was initially
bility of grouping together the objects with a desired similarity of proposed in de França (2012) as an scalable algorithm to find k Co-
J. Clusters from a categorical data set. This algorithm can be divided
It is interesting to notice that each bucket groups together a in three simple steps:
subset of objects and is described as a subset of k common fea-
1. Find a co-cluster set C composed of a set of tuples (O , F ) by
tures. So, the LSH can be used to find a variable number of co-
applying the LSH algorithm. The set O is composed of the ob-
clusters with k features. We will show in the next sections that
jects inside a bucket and the set F is the set of k features used
these co-clusters can be used to limit the search space prior to the
as a hash signature.
application of an enumerative algorithm.
2. Maximize the co-clusters in C by inserting features in F while
satisfying Eq. (1) and then inserting objects in O obeying the
3.2. Enumerative algorithm for co-clustering same constraint.
3. Induce a graph from the set C where the vertices are the ob-
In mathematics, there is a very similar field of study called jects and the edges connect two objects that belongs to the
Formal Concept Analysis (Andrews, 2011) (FCA). In this field, the same co-cluster. After that, perform the community detection
data set is described as a Formal Context composed of a triple algorithm METIS (Karypis & Kumar, 2012) to find a set of k co-
C = (O, F , R ) where O stands for the set of objects, F is the set of clusters.
features and R⊆O × F is the binary relationship of the objects with
respect to the set of features. If a given object o ∈ O contains the The third step is needed because the second step generates a
feature f ∈ F, then (o, f) ∈ R. In FCA, the objective is to find pairs high number of small co-clusters when dealing with a sparse data
(O , F ), called formal concept of context C such as: set. But, this steps introduced theoretical and technical drawbacks.
28 F.O. de França / Expert Systems With Applications 64 (2016) 24–35

   
Algorithm 1: InClose f)∈R|o ∈ O  ∧f ∈ F  }| ≤ τ sparse · |O  | · |F  |, for an specified 0
≤ τ sparse ≤ 1 [optional].
input : (global) data set of objects, features and relations
6. Merge related co-clusters: merge any two co-clusters that
(O, F , R ),minimum cardinality for the cluster of shares the same subset of features.
objects minObjs and the clusterof features minFeats.
input : (local) initial subset of objects O and features F  , In the first step, the algorithm optionally removes any feature
and currentfeature f . that appears in less than a percentage of the objects O. This step
output: set of Co-Clusters C avoids that, during the next stage, the LSH algorithm samples one
/* Insert new features in lexicographical order. */ of these rare features to compose the hash signature. This may cre-

ate a bucket containing a single object, thus generating an uninter-
Insertions ← ( f  , O ) | f  ∈ F ∧ f  < f ∧ |O | ≥ minObjs
 esting region.
with O = O ∩ {o ∈ O | (o , f  ) ∈ R} ; The second step is similar to the original algorithm. It first gen-
erates p · k random hash functions and then creates p buckets for
/* Generate the maximalco-cluster with the fixed set every object by grouping k keys generated by the hash functions
O . */
  (see Section 3.1). Additionally, it creates another set of p · k ran-
F ← F ∪ f  | ( f  , O ) ∈ Insertions ∧ O = O ; dom hash functions and then creates p buckets for every feature
by grouping the keys generated from the hash functions applied to
/* List of candidates to be recursively expanded. */
 the objects set. This step generates a candidate set of regions to be
Candidates ← (O , F  ∪ { f  }, f  ) | (O , f  ) ∈ explored.
 For every generated region, a new subset of the original data set
Insertions ∧ Cannonical(O , F  ) ;
is created with the subset of objects that contains a relationship
if |F  | ≥ minFeats then with every feature of the specified region and, similarly, a subset of
/* The operator ++ means concatenation. */ features that are related to every object of the region. This creates
  
a sparse subset that contains at least one co-cluster (as defined by
return (O , F  ) + + InClose(O , F  , f  ) | (O , F  , f  ) ∈
 (O , F )).
Candidates ; These regions are then used by the enumerative algorithm In-
Close to find the complete set of co-clusters. Since this algorithm
else   only enumerates co-clusters that satisfies Eq. (1), the next step op-
return InClose(O , F  , f  ) | (O , F  , f  ) ∈ Candidates ; tionally insert objects and features that do not violate Eq. (2). The
order of insertion is defined by the number of missing values each
element introduces to the co-cluster.
Finally, since every co-cluster was generated from a constrained
For the theoretical drawbacks, the complexity of the first two region of the original data set and further expanded by the intro-
steps is O(n) with n representing the cardinality of the set of re- duction of sparseness, some co-clusters may share the same set of
lations R. The third step introduces a complexity proportional to features but with a different set of objects. For this purpose, those
O(m3 ) Kernighan and Lin (1970), with m representing the num- co-clusters sharing the same set of features are merged together.
ber of vertices of the induced graph or, in this case, the cardinality The HBLCoClust algorithm requires a total of six parameters,
of the set of objects O. Also, this algorithms introduce a number some of them optional: the minimum relative frequency of fea-
of different parameters that should be adjusted, together with the tures (τ , optional), the number of buckets (p) and number of hash
inherent parameters of HBLCoClust. Besides, it constrains the co- functions per bucket (k), the minimum cardinality for the objects
clusters to a k−co-clustering algorithm, with a pre-specified k. and features subsets (minObjs and minFeats), required by InClose,
About the technical drawbacks, there were some issues for the and the maximum allowed percentage of missing values (τ sparse ,
compilation process under some Operational Systems that pre- optional). The sensitivity of these parameters will be shown in the
vented the use of the HBLCoClust and the reproduction of the orig- next Section.
inal experiments was compromised. The pseudo-algorithm for HBLCoClust is depicted in
In order to eliminate the dependency of an external implemen- Algorithm 2 and followed by some auxiliary routines in
tation and the constraint of defining the number of clusters, we Algorithms 3 and 4.
now propose a new version of HBLCoClust by following these six One important thing to notice regarding the proposed algorithm
steps: is that the first three steps have a linear complexity with respect to
the cardinality of the relations set R, the fourth step is exponential
1. Remove rare features: remove any feature f ∈ F such as |{o|(o, proportional to the number of clusters contained inside the Regions
f) ∈ R}| ≤ τ · |O|, for an specified 0 ≤ τ ≤ 1 [optional]. created in step 3. The last two steps are quadratic with respect to
2. Find the promising regions with LSH: apply the LSH algorithm the average co-cluster cardinality. We will show in the next Section
to find a set of 2 · p candidates tuples (O , F ) with k features empirical evidence that the whole algorithm scales linearly with
by bucketing the objects with hash signatures from the features respect to the cardinality of R for the tested data sets.
set and, also, bucketing a set of features with hash signatures
from the objects set.
3. Create subsets of the original data set: expand each candidate 3.4. Literature review
(O , F ) creating a set of promising regions (O  , F  ) such as O =
{o ∈ O | (o, f ) ∈ R ∧ f ∈ F  } and F  = { f ∈ F | (o, f ) ∈ R ∧ o ∈ O }. In the literature, the most similar algorithm to HBLCoClust is the
4. Enumerate inside the promising regions: apply the InClose al- one proposed in Gao and Akoglu (2014) and called CoClusLSH. This
gorithm to every promising region thus creating the co-clusters algorithm iteratively applies the LSH algorithm using the set of ob-
set C = {(O , F  )}. The duplicated regions can be avoided by jects and features alternately, and merging the groups defined by
storing the already explored regions on a hash table. the buckets using an entropic metric. The authors tested CoClusLSH
5. Expand the co-clusters by allowing sparseness: sequentially on a diverse set of real world data, presenting competitive numeri-
insert objects and features into every co-cluster such as |{(o, cal results measured by purity, mutual information and number of
F.O. de França / Expert Systems With Applications 64 (2016) 24–35 29

co-clustering problem as a graph partitioning problem. It then op-


Algorithm 2:
timizes the modularity criteria (Newman, 2006) in order to find
input : data set of objects, features and relations a set of k, l-clusters. The obtained results were quantitatively bet-
(O, F , R ),minimum cardinality for the cluster of ter than some of the considered state-of-the-art co-clustering al-
objects minObjs and the clusterof features minFeats, gorithms. But, unlike HBLCoClust and CoClusLSH, this algorithm re-
threshold of minimum frequency of featuresτ , the quires a pre-specified number of clusters and it does not scale lin-
number of buckets p and hash functions per buckets early.
k and themaximum ratio of missing data τsparse . In the next section, we will compare the results obtained by
output: set of Co-Clusters C HBLCoClust with those obtained by CoClusLSH and SpecCo.
/* Step 1: Removal of rare features */
 
F← f | f ∈ F ∧ count ( f ) ≥ τ · |O| ;
4. Experimental results
/* Step 2: LSH */
B ← RandomBucketFunctions( p, k ); In the Section we will present some experiments with the
for o ∈ O do HBLCoClust algorithm on three different types of data sets: categor-
for b ∈ B do ical, textual and rating data. The categorical data sets, as previously
/* Bucket is an associative array where the key discussed, are sets comprised of objects described by one or more
is a set offeatures and the value is a set of features from a set F. The textual data are acquired from text doc-
objects */ uments extracted from newsgroups, in these data sets, each docu-
Bucket[b(o)] ← Bucket[b(o)] ∪ {o}; ment represents one object and every word is a feature from the
set.
/* Step 3: Candidate Regions */
  Finally, the rating data set was specifically created for this pa-
Regions ← Region(O , F  ) | (O , F  ) ∈ Bucket ; per by combining a data set of ratings given by users on a set of
movies and a set of descriptive features for the set of movies. The
/* Step 4: Enumeration
 */ objects for this set are the users and the features are composed
C← InClose(O , F  , f ) | (O , F  ) ∈ Regions ∧ f = head(F  ) ; of the name and keyword of the movies concatenated with words
describing whether the users liked or disliked the movie.
/* Step 5: Expansion */
  For the categorical and textual data sets, the obtained results
C ← Expand(O , F  ) | (O , F  ) ∈ C ; will be quantitatively compared to the algorithms InClose, Co-
/* Step 6: Merge */ ClusLSH and SpecCo, with some exceptions. The InClose algorithm
for c1 , c2 ∈ C × C do results will be available only for the data sets which the total pro-
if F1 = F2 then cessing time did not take more than 5 hours. For the CoClusLSH, we
c1 ← c1 ∪ c2 ; could only obtain the results for the categorical data sets, due to
Remove(c2 ); memory limitations. Finally, due to the unavailability of the SpecCo
source code, we will use the reported values in (Labiod & Nadif,
return C 2011).
The HBLCoClust and InClose algorithms were implemented in
Python 2.7 and it is readily available at https://github.com/folivetti/
Algorithm 3: function Region HBLCoClust. The experiments with the algorithm CoClusLSH were
input : subset of objects O and subset of features F  . performed using the Matlab implementation available at http://
output: subsets O , F  defining a search space. www.cs.sunysb.edu/∼leman/pubs.html. These algorithms were run
  under a Linux Debian 7.6 system on a i5-2450 @ 2.5 GHz machine
O ← o | o ∈ O if (o, f ) ∈ R for ∀ f ∈ F  ; with 6 GB of RAM.
 
F  ← f | f ∈ F if (o, f ) ∈ R for ∀o ∈ O ;
return (O , F  ) 4.1. Data sets and experiments

In order to have a concise discussion of the experiments, they


Algorithm 4: function Expand will be divided into five different subsections: Categorical Data
input : subset of objects O and subset of features F  . Clustering, Text Clustering, Topic Modeling, Recommender Systems,
output: subsets O , F  respecting maximum missing values Scalability and Sensitivity analysis.
rateτsparse . The Categorical Data Clustering subsection will have the goal
of finding groups of objects that share common features. For
O∗ ← sort O by number of sparse features ; this purpose we have selected four well known categorical data
F ∗ ← sort F by number of sparse objects ; sets named: Zoo, House Votes 84’, Soybean Small and Soybean
O ← {o | o ∈ O∗ if constraint is respected }; Large Bache and Lichman (2013). All four data sets are described
F  ← { f | f ∈ F ∗ if constraint is respected }; by categorical features and a label that classifying each object. The
return (O , F  ) features may be many-valued or binary, as described in Labiod and
Nadif (2011), the many-valued features are converted to binary. The
labels will be used to measure the quality of the co-clusters found
groups found (the closest to the number of labels, the better). The by the algorithms. Notice that, even though it is expected a set of
algorithm is linearly scalable regarding the number of objects and clusters grouping together objects with the same label, this does
features on the data set but, it does require that the whole data not imply the inexistence of groups with objects of different labels.
set resides in memory during the processing stage. The Text Clustering subsection will comprehend a similar ex-
Another co-clustering algorithm that recently presented good periment as described for the categorical data clustering. The only
results is the SpecCo (Labiod & Nadif, 2011) which reformulates the practical difference being the high dimensionality. For this set of
30 F.O. de França / Expert Systems With Applications 64 (2016) 24–35

experiments we chose the data sets Classic33 , containing docu- Table 1


Data set properties: number of objects, number of features,
ments from three different collections, and two subsets of the 20-
the number of relations (non-zero elements) and number of
newsgroups (Lang, 1995) data set, named Multi5 and Multi10. classes.
The Multi5 data set contains documents from the groups:
|O| |F| |R| Classes
comp.graphics, rec.sport.baseball, rec.motorcycle, sci.space, and
talk.politics.mideast. The Multi10 data set contains documents Zoo 101 16 738 7
extracted from: alt.atheism, comp.sys.mac.hardware, misc.forsale, Soybean Small 47 21 880 4
rec.autos, rec.sport.hockey, sci.electronics, sci.crypt, sci.med, Soybean Large 307 35 4865 19
House Vote 84 435 16 6568 2
sci.space, talk.politics.guns.
Classic 3 3891 15034 227355 3
Regarding the Topic Modeling subsection, we will just illus- Multi 5 50 0 0 44323 539933 5
trate one possible practical application of co-clustering algorithms. Multi 10 10 0 0 0 64444 987443 10
As such, we will show some evidence that the subset of features Movie-lens 943 4233 154628 -
found by HBLCoClust algorithm for each group can be used to
describe the topic concerning the corresponding subset of docu- Table 2
ments. For this purpose we will use the results from the Text Clus- Parameters used for HBLCoClust on each data set.
tering subsection. Data set minObjs minFeats p k τ τ sparse
Similarly, the Collaborative Filtering subsection will illustrate
how a co-clustering algorithm can be used to find some explicit Zoo 4 6 10 0 0 2 0.0 1.0
Soybean Small 4 8 10 0 0 2 0.1 0.8
information regarding subset of users sharing the same taste. For
Soybean Large 4 10 10 0 0 2 0.0 0.8
this experiment we will merge the information of two well known House Vote 84 10 10 10 0 0 3 0.4 0.8
data sets, named Movielens (Herlocker, Konstan, Borchers, & Riedl, Classic 3 50 4 20 0 0 3 0.2 0.5
1999) and IMDB ftp://ftp.fu-berlin.de/pub/misc/movies/database/ Multi 5 5 5 10 0 0 3 0.95 0.5
in order to generate a categorical data set of movies ratings. Multi 10 5 5 20 0 0 3 0.95 0.5
Movie-lens 2 2 50 0 0 4 0.0 0.8
The Movielens Data Set contains a set of 80,0 0 0 tuples (user,
movie, rating) with the users represented by their id, the movies
represented by their titles and the ratings as a numbered scale be-
tween 1 to 5. grid search with the combination of the values of p = 10 0, 20 0, 30 0
This data was complemented by using the keywords, genre, ac- and k = 2, 3, 4. This limited set of values was due to memory lim-
tors and actress, and directors describing each movie provided by itations imposed by the algorithm.
the IMDB data set. These keywords were used to form the relations Finally, the parameters for the SpecCo algorithm were the same
(user, feature, rating), where rating is the average rating given by as reported in Labiod and Nadif (2011) due to the unavailability of
the user to movies possessing this feature. the source code.
This data set was then converted to a set of tuples (user, fea-
ture_Y) or (user, feature_N), whether the rating was higher than 3 4.2. Metrics
or not, respectively.
The HBLCoClust algorithm was applied separately to the Y and In order to quantify the quality of the Co-Clusters obtained by
N relations in order to generate co-clusters of likes and dislikes for each algorithm, we have chosen three different metrics: Purity,
each subset of users. Normalized Mutual Information and Pointwise Mutual Information.
Every user was then described by means of the features exclu- Purity of a Co-Cluster measures the ratio between the number
sively contained in the clusters of the likes data set they belonged of the most frequent label inside the cluster by the number of ob-
to and the features in the dislikes data set as well. jects in the cluster. It essentially quantifies, for a given cluster, if
To assess the recommendation error, a test set of 20,0 0 0 tu- the majority of its objects agree with respect to their labels.
ples (user, movie, rating) was converted to a categorical data in the Normalized Mutual Information calculates how likely it is to
same way as the training data. For every (user, movie) in the test find an object of a given label if a given co-cluster is selected at
set, the Jaccard Index of the likes and dislikes profiles of this user is random. This metric is related to Purity, but it also verifies the
calculated against the movie features set. The profile with higher compactness of the Co-Clusters set, i.e. if the set has a minimum
similarity is used to decide whether to recommend the movie or number of Co-Clusters and is given by:
not. 1  P (c, l (o))
Besides exemplifying with the generated profiles, the results NMI = P (c, l (o)) × log , (9)
HC Hl P (c )P (l (o))
c∈C,o∈O
will be compared to two commonly used machine learning al-
gorithms: regularized SVD (Funk, 2006) and Naive Bayes (Lewis, where HC , Hl is the entropy of the Co-Clusters set and the labels,
1998). respectively, O is the subset of objects of the co-cluster, and l(o) is
In Table 1 the properties of each data set used on these experi- the label of object o.
ments are summarized. The adopted parameters for the HBLCoClust Finally, the Pointwise Mutual Information measures the likeli-
are depicted in Table 2 for each data set. These values were found hood that the co-occurrence of any two features of a Co-Cluster
by performing a grid search in order to maximize the coverage of was not by chance. This verifies if the subset of features selected
the objects in the data set, with the combination of the following by the Co-Cluster have significance regarding the corresponding
values for each parameter: minOb js = [2, 50], minF eats = [2, 50], objects:
p = {10 0 0, 20 0 0, 30 0 0, 40 0 0, 50 0 0}, k = {2, 3, 4, 5}, τ = [0, 1] with   log P ( f 1, f 2 ) − log P ( f 1 )P ( f 2 )
intervals of 0.05, τsparse = [0, 1] with intervals of 0.1. P MI = − . (10)
For the InClose parameters, we have adopted the same values log P ( f 1, f 2 )
O f ∈C f 1, f 2∈O f
of minObjs and minFeats as chosen for HBLCoClust. The parameters
of the CoClusLSH algorithm was fixed to p = 100 and k = 3 after a 4.3. Categorical data clustering

The results for the first set of experiments, categorical data,


3
ftp://ftp.cs.cornell.edu/pub/smart are depicted in Tables 3 and 4. These Tables show the average
F.O. de França / Expert Systems With Applications 64 (2016) 24–35 31

Table 3 Table 5
Statistics of obtained Co-Clusters set for the Categorical Statistics of obtained Co-Clusters set for the Textual data
data sets. sets.

Zoo # Objs. Feats. Size Classic3 # Objs. Feats. Size

HBLCoClust 27.07 1.00 0.80 83.07 HBLCoClust 219.20 1.00 0.60 2807.23
CoClusLSH 37.00 1.00 1.00 52.00 SpecCo 3.00 1.00 −− −−
InClose 67.00 0.81 0.75 114.00 Multi5 # Objs. Feats. Size
SpecCo 7.00 1.00 −− −− HBLCoClust 1054.97 1.00 0.18 1051.23
Soybean S # Objs. Feats. Size SpecCo 5.00 1.00 −− −−
HBLCoClust 13.80 1.00 57.78 83.13 Multi10 # Objs. Feats. Size
CoClusLSH 10.00 1.00 1.00 225.00 HBLCoClust 2819.33 1.00 0.16 661.00
InClose 225.00 1.00 70.83 121.00 SpecCo 10.00 1.00 −− −−
SpecCo 4.00 1.00 −− −−
Soybean L # Objs. Feats. Size
HBLCoClust 42.40 1.00 0.74 205.73 Table 6
CoClusLSH 20.00 1.00 1.00 1089.00 Obtained results for the Textual data sets.
InClose 6470.00 0.98 0.93 109.00
Classic3 Purity NMI PMI
SpecCo 19.00 1.00 −− −−
House Votes # Objs. Feats. Size HBLCoClust 0.86 0.14 0.20
HBLCoClust 23.30 1.00 0.85 1173.80 SpecCo 0.86 0.73 −−
CoClusLSH 18.00 1.00 1.00 734.00
Multi5 Purity NMI PMI
InClose 124371 0.95 1.00 225.00
HBLCoClust 0.91 0.18 0.37
SpecCo 2.00 1.00 −− −−
SpecCo 0.59 0.53 −−
Multi10 Purity NMI PMI
Table 4 HBLCoClust 0.82 0.14 0.33
Obtained results for the categorical data SpecCo 0.57 0.55 −−
sets.

Zoo Purity NMI PMI


the advantage of generating a compact set of groups, thus maxi-
HBLCoClust 0.88 0.29 0.18
CoClusLSH 0.79 0.25 0.29
mizing this metric. In the Zoo Data Set, though the proposed al-
InClose 0.93 0.19 0.21 gorithm has not obtained the best results, it was close to the best
SpecCo 0.90 0.92 −− values. It is worth highlighting that, a consistent high PMI value
Soybean S Purity NMI PMI means that the algorithm is capable of selecting the significant fea-
HBLCoClust 0.89 0.40 0.25 tures.
CoClusLSH 0.56 0.30 0.08
InClose 0.59 0.09 -0.26
SpecCo 1.00 1.00 −− 4.4. Text clustering
Soybean L Purity NMI PMI
HBLCoClust 0.73 0.26 0.15 For the second set of experiments, regarding the textual data,
CoClusLSH 0.28 0.09 0.06
InClose 0.50 0.07 0.13
neither InClose algorithm nor CoClusLSH could be used due to lim-
SpecCo 0.67 0.78 −− itation in computational or memory complexity. Also, the compar-
House Votes Purity NMI PMI ison will be limited to the reported values in Labiod and Nadif
HBLCoClust 0.91 0.29 0.43 (2011) for the SpecCo algorithm. It should be noticed that, the data
CoClusLSH 0.85 0.24 0.45 sets used in Labiod and Nadif (2011) differs from the ones used in
InClose 0.93 0.10 0.26 our experiment. The Classic3 data set used only a setof 150 doc-
SpecCo 0.87 0.47 −−
uments and 3625 features, and the Multi5 and Multi10 data sets
used a selection of 500 documents and 20 0 0 words. These selec-
tions were performed by means of a supervised algorithm in order
results over 30 runs of HBLCoClust and CoClusLSH, a single run of to use only the most significant word-features regarding the docu-
InClose and the reported results for SpecCo. The first table reports ment labels.
the number of Co-Clusters found by each algorithm, the percent- Notice that, for our experiment with HBLCoClust we have used
age of the covered objects and features and the average size of the the entire data. So, the focus of this experiment will be on assess-
Co-Clusters. The second table reports the average value of Purity, ing if our proposed algorithm is capable of automatically selecting
NMI and PMI. the correct set of features without the use of a supervised algo-
From Table 3 we can notice some interesting properties of each rithm.
algorithm. First of all, InClose algorithm obtained the largest set In Table 5 we can see that, as explained by the aforementioned
of co-clusters with the largest size, something that should be ex- situation, the HBLCoClust algorithm returned a much larger set of
pected since it is an enumerative algorithm. Also, since this algo- clusters when compared to SpecCo. Also, even though dealing with
rithm only finds dense co-clusters, it was not always capable of a larger set of documents, it still managed to cover the entirety of
covering the entire set of objects. Comparing the HBLCoClust and the objects.
CoClusLSH algorithms, the later found a smaller number of groups Interesting, in Table 6 we can see that HBLCoClust obtained a
while covering the entire set of objects, except for the Zoo data set. purity value equal or much higher than the results obtained by
This means that CoClusLSH was more competent when maximizing SpecCo. This means that, even though it resulted in a larger set
the compactness. Since the number of groups is a parameter of of clusters, those clusters are accurate regarding the labels. The
SpecCo, it always returned the exact number of expected clusters. lower value of Purity obtained by SpecCo is a side-effect of the pre-
In Table 4 we can see that, except for the Zoo Data Set, the determined number of clusters, since there are clusters of docu-
HBLCoClust obtained the best, or close to the best, results regard- ments on these data sets pertaining to the same topic that do not
ing Purity and PMI. Regarding the NMI metric, the SpecCo has again share any feature. Concerning the PMI, the HBLCoClust algorithm
32 F.O. de França / Expert Systems With Applications 64 (2016) 24–35

Table 7
Terms extracted from features clusters of Multi5 data set.

sport.baseball reds, houston, standings, cincinnati, colorado, mets, scores, marlins, including, milwaukee, oakland, pirates, city, expos, los, west, indians,
minnesota, ocf, rangers, joseph, white, angels, texas, giants, toronto, pittsburgh, phillies, cardinals, cubs, atlanta, mariners, orioles, mlb,
lost, braves, louis, detroit, teams, athletics, streak, hernandez, san, boston, cleveland, dodgers, sox, seattle, astros, blue, diego, jays,
jtchern, rockies, twins, brewers, tigers, red, francisco, philadelphia, kansas, yesterday, royals, california, padres, berkeley, league, chicago,
florida, angeles, april, montreal, yankees, baltimore, york
motorcycles handlebars, motorcycle, speed, countersteering, foward, handle, faq, awful, turns, debating, ummm, uiuc, happens, turning, pushing, fgc,
convert, unb, explain, duke, unbvm, acpub, slack, zkcl, infante, cbr, eric, cso, csd, methinks, push
politics.mideast later, muslims, ohanus, vol, turkish, involved, roads, argic, hand, muslim, armenian, document, russian, armenians, including, army, sahak,
proceeded, serdar, soul, killed, among, children, published, blood, appressian, mountain, often, exists, turks, armenia, general, soviet,
serve, escape, genocide, melkonian, ways, extermination, passes, closed
sci.space six, rigel, aurora, wings, mary, dfrf, alaska, dryden, spin, military, facility, speak, unknown, digex, prb, flight, pilot, fly, kotfr, air, nsmca, pat,
edwards, mig, wire, fighter, shafer
comp.graphics plot, recommend, pascal, hidden, routines, object, basic, cost, address, bob, cad, mac, info, frame, short, offer, animation, price, sites,
across, package, low, building, directory, removal, documentation, robert, built, recommendations, libraries, various, tasks, shading, fast,
files, code, objects, tools, handle, demo, library, contact, book

Table 8
Profile generated for one of the user on Movielens data set.

Movies Jaws, Back to the Future, Twelve Monkeys, Dumb & Dumber, ...
Like disaster, infidelity, horse, gunfight, USA, automobile, hospital, bathroom, jealousy, racism, elevator, fight, beer, male-nudity, helicopter, impalement,
good-versus-evil, outer-space, murder, washington-d.c., fire, shot-to-death, los-angeles-california, independent-film, small-town, train, drunkenness,
one-man-army, baby, teenage-boy, lifting-someone-into-the-air, redemption, f-word, photograph, tough-guy, gangster, main-character-dies
Dislike second-part, beaten-to-death, haunted-by-the-past, Washington,DistrictofColumbia,USA, cell-phone, vengeance, bulletproof-vest, obsession, book,
die-hard-scenario

still managed to obtain a high positive value, thus assessing the


quality of the select subsets of features for each cluster.

4.5. Topic modeling

The PMI metric is a very popular metric for assessing the qual-
ity of Topic Modeling algorithms. These algorithms try to search
for a set of words of a corpus that correctly describes the topic of
each document. A value of PMI around 0.5 is considered good for
the newsgroup data sets, as we can see in Anandkumar, Valluvan
et al. (2013), while the most well-known algorithm for Topic mod-
eling, Latent Dirichlet Allocation (Blei, Ng, & Jordan, 2001), obtains
a value around 0.2.
To illustrate the quality of the obtained subsets of features, in
Table 7 we report the terms belonging to the set of features of 5
different clusters, one from each topic of Multi5 data set, chosen at
random. It is easy to see that most of these terms are meaningful
words revolving around the corresponding topic.

4.6. Recommender system

Regarding the Movielens data set, the HBLCoClust algorithm ob-


tained a total of 1320 co-clusters covering 60% of the users and Fig. 2. Time complexity estimation.
only 63% of the movies. As such, from the total of 20,0 0 0 test
ratings, we could generate a recommendation (or a not recom-
mended classification) to 12,862 ratings. From this total, the co-
clustering based recommendation obtained an accuracy of 83.4% a regression line. This experiment corroborates with the claim that
against 79.68% and 69.32% obtained by SVD and Naive Bayes, re- the algorithm scales linearly proportional to the cardinality of the
spectively. set of relations.
Additionally, the information given by the profiles of each user Regarding the sensitivity of the parameters, we have devised a
may enrich the post-analysis of the recommender system by giving short experiment in order to assess how the values of each pa-
directions of what can be more relevant to a given user. In Table 8, rameter affects the Purity and the number of clusters found by the
the profiles of a randomly selected user is depicted. algorithm.
For this purpose we have chosen the House Vote84’ data set,
4.7. Scalability and sensitivity of the algorithm the largest of the categorical data sets used in this paper. In
this experiment, we have tested 10 different values for every pa-
As a final experiment, we have measured the time taken by rameter around the values described in Section 4.1 while fix-
HBLCoClust for each one of the experimental data sets with a fixed ing the remaining parameters with the same values from this
set of parameters: p = 10 0 0, k = 3, minOb js = minF eats = 4, τ = table.
0.0, τsparse = 1.0. The results will be reported through line plots where the left
In Fig. 2 we can see the correlation between the time, in sec- y-axis represents the Purity value associated to each value of the
onds, with the number of relations inside each data set, fitted with parameter, and the right y-axis is the number of clusters found by
F.O. de França / Expert Systems With Applications 64 (2016) 24–35 33

Fig. 3. Parameter sensitivity for (a) minObjs and (b) minFeats.

Fig. 4. Parameter sensitivity for (a) k and (b) p.

the algorithm. Notice that the y-axis of all the plots are fixed be- Finally, in Fig. 5, we can see that the pre-processing stage, con-
tween [0.6, 1.0] in order to highlight the significance of the varia- trolled by τ , affects only the number of clusters found, the higher
tions, while the right axis varies according to the results, in order the cutoff, less features will remain in the data set and less clus-
to verify the tendency of generating more or less clusters. ters can be found. The sparse parameter, on the other hand, does
In Fig. 3 we can see that the parameters minObjs and minFeats affect the Purity and the number of clusters. When this parame-
do not seem to affect the Purity of the clusters found by the al- ter is set to 1, it will only find dense clusters and, thus, the purity
gorithm. On the other hand, the more restricted the number of tends to be maximized, but it is less likely to find few larger clus-
objects and features, a smaller number of clusters are found. This ters. When the parameter is set to 0, the whole data set can be
means that the algorithm will maintain the quality of the clusters assigned to the same cluster, thus minimizing the Purity and re-
regardless of how many of them are found. turning a single cluster.
In Fig. 4, we can see the sensitivity analysis for the parameters
pertaining to the LSH algorithm. We can still observe that the Pu- 5. Conclusion
rity is not much affected by these parameters as well. The number
of clusters seems to be more affected by the number of buckets. In this paper a new algorithm for the co-clustering of categor-
More buckets mean that we will create more possibilities for find- ical data was presented. This algorithm consists of six sequential
ing different groups, thus leading to a higher number of clusters at steps and scales linearly with the number of non-empty entries of
the final stage. the data set. This is achieved by using a probabilistic algorithm to
34 F.O. de França / Expert Systems With Applications 64 (2016) 24–35

Fig. 5. Parameter sensitivity for (a) τ and (b) τ sparse .

approximate the partial similarity between objects, thus creating Blei, D. M., Ng, A. Y., & Jordan, M. I. (2001). Latent dirichlet allocation. In Advances
seed co-clusters. These seed co-clusters are used to define regions in neural information processing systems (pp. 601–608).
Boriah, S., Chandola, V., & Kumar, V. (2008). Similarity measures for categorical
of the data set to be searched by an enumerative algorithm called data: A comparative evaluation. Red, 30(2), 3.
InClose, finally merging the Co-Clusters found inside these different Broder, A. Z. (1997). On the resemblance and containment of documents. In Com-
regions. pression and complexity of sequences 1997. proceedings (pp. 21–29). IEEE.
Carter, J., & Wegman, M. N. (1979). Universal classes of hash functions. Journal of
The experiments performed on categorical and textual data Computer and System Sciences, 18(2), 143–154.
showed that this algorithm is at least on par with other co- de Castro, P. A. D., de França, F. O., Ferreira, H. M., Coelho, G. P., & Von
clustering algorithms from the literature, and in many times signif- Zuben, F. J. (2010). Query expansion using an immune-inspired biclustering al-
gorithm. Natural Computing, 1–24.
icantly better. Most noticeably, the algorithm was capable of select-
Cheng, Y., & Church, G. M. (20 0 0). Biclustering of expression data. In Proceedings
ing subsets of features that maximized the point-wise mutual in- of the 8th international conference on intelligent systems for molecular biology
formation, leading to a meaningful set of features describing each (pp. 93–103).
Coelho, G. P., de França, F. O., & Von Zuben, F. J. (2009). Multi-objective biclustering:
cluster.
When non-dominated solutions are not enough. Journal of Mathematical Mod-
Some examples were provided illustrating the practical impli- elling and Algorithms.
cations of having an explicit subset of features. In one of these de Castro, P. A. D., de França, F. O., Ferreira, H. M., & Von Zuben, F. J. (2007). Apply-
examples, the features of the text data clusters were used to de- ing Biclustering to Perform Collaborative Filtering. In Proceedings of the 7th inter-
national conference on intelligent systems design and applications (pp. 421–426).
scribe the topic of a group of documents. In another example, the Rio de Janeiro, Brazil
descriptive features were used to generate a profile of what a given Dhillon, I. S., Mallela, S., & Modha, D. S. (2003). Information-theoretic co-cluster-
user likes or not in a movie. ing. In Proceedings of the ninth acm sigkdd international conference on knowledge
discovery and data mining (pp. 89–98). ACM.
Despite the good results, the proposed algorithm has a ten- Feng, H.-M., Chen, C.-Y., & Ye, F. (2007). Evolutionary fuzzy particle swarm optimiza-
dency of dividing the data sets into a larger set of clusters, when tion vector quantization learning scheme in image compression. Expert Systems
compared to other algorithms, though this may imply a more nat- with Applications, 32(1), 213–222.
de França, F. O., Coelho, G. P., & Von Zuben, F. J. (2008). bicaco: An ant colony in-
ural segmentation of the data. Also, the number of parameters may spired biclustering algorithm. In Ant colony optimization and swarm intelligence
require an additional effort to adjust, in some situations. (pp. 401–402). Springer.
For future research, we intend to propose heuristic algorithms de França, F. O., Coelho, G. P., & Von Zuben, F. J. (2013). Predicting missing val-
ues with biclustering: A coherence-based approach. Pattern Recognition, 46(5),
to automatically suggest the values of some of the required param-
1255–1266.
eters by using entropic metrics and knowledge extracted from ba- de França, F. O. (2012). Scalable overlapping co-clustering of word-document data.
sic statistics of the data set. This will be performed together with In Machine learning and applications (icmla), 2012 11th international conference
on: Vol. 1 (pp. 464–467). IEEE.
a more detailed sensitivity analysis.
de França, F. O., & Von Zuben, F. J. (2010). Finding a high coverage set of 5-biclusters
Additionally, we will explore in more details the idea of us- with swarm intelligence. In Evolutionary computation (cec), 2010 ieee congress on
ing co-clustering algorithms for Topic Modeling and Recommender (pp. 1–8). IEEE.
Systems, as well as some other possible applications. de França, F. O., & Von Zuben, F. J. (2011). Extracting additive and multiplicative
coherent biclusters with swarm intelligence. In Evolutionary computation (cec),
2011 ieee congress on (pp. 632–638). IEEE.
Funk, S. (2006). Netflix update: Try this at home.
References Gao, T., & Akoglu, L. (2014). Fast information-theoretic agglomerative co-clustering.
In H. Wang, & M. Sharaf (Eds.), Databases theory and applications. In Lecture notes
Anandkumar, A., Valluvan, R., et al. (2013). Learning loopy graphical models with in computer science: Vol. 8506 (pp. 147–159). Springer International Publishing.
latent variables: Efficient methods and guarantees. The Annals of Statistics, 41(2), Har-Peled, S., Indyk, P., & Motwani, R. (2012). Approximate nearest neighbor:
401–435. Towards removing the curse of dimensionality.. Theory OF Computing, 8(1),
Andrews, S. (2009). In-close, a fast algorithm for computing formal concepts. Inter- 321–350.
national conference on conceptual structures. Hartigan, J. A. (1972). Direct clustering of a data matrix. Journal of the American
Andrews, S. (2011). In-close2, a high performance formal concept miner. In Concep- Statistical Association (JASA), 67(337), 123–129.
tual structures for discovering knowledge (pp. 50–62). Springer.
Bache, K., & Lichman, M. (2013). UCI machine learning repository. URL http://archive.
ics.uci.edu/ml
F.O. de França / Expert Systems With Applications 64 (2016) 24–35 35

Herlocker, J. L., Konstan, J. A., Borchers, A., & Riedl, J. (1999). An algorithmic frame- Mirkin, B. (1996). Mathematical classification and clustering. Nonconvex optimization
work for performing collaborative filtering. In Proceedings of the 22nd annual and its applications. Springer.
international acm sigir conference on research and development in information re- Mitra, S., & Banka, H. (2006). Multi-objective evolutionary biclustering of gene ex-
trieval (pp. 230–237). ACM. pression data. Pattern Recognition, 39, 2464–2477.
Karypis, G., & Kumar, V. (2012). Metis-serial graph partitioning and fill-reducing matrix Morgan, J. N., & Sonquist, J. A. (1963). Problems in the analysis of survey data, and
ordering. a proposal. Journal of the American statistical association, 58(302), 415–434.
Kernighan, B. W., & Lin, S. (1970). An efficient heuristic procedure for partitioning Newman, M. E. (2006). Modularity and community structure in networks. Proceed-
graphs. Bell System Technical Journal, 49(2), 291–307. ings of the National Academy of Sciences, 103(23), 8577–8582.
Labiod, L., & Nadif, M. (2011). Co-clustering for binary and categorical data with Symeonidis, P., Nanopoulos, A., & Manolopoulos, Y. (2008). Providing justifications
maximum modularity.. In Icdm (pp. 1140–1145). in recommender systems. Systems, Man and Cybernetics, Part A: Systems and Hu-
Lang, K. (1995). Newsweeder: Learning to filter netnews. In Proceedings of the twelfth mans, IEEE Transactions on, 38(6), 1–1272.
international conference on machine learning (pp. 331–339). Zamora, J., Mendoza, M., & Allende, H. (2016). Hashing-based clustering in high di-
Lewis, D. D. (1998). Naive (bayes) at forty: The independence assumption in infor- mensional data. Expert Systems with Applications, 62, 202–211.
mation retrieval. In Machine learning: Ecml-98 (pp. 4–15). Springer.

You might also like