Unit-3
•Definition:
•Association Rule Mining aims to discover interesting relations
between variables in large datasets.
•Example:
•"People who buy bread also tend to buy butter."
•Key Terms:
•Item: Individual product or service.
•Itemset: Collection of one or more items.
Basic Algorithms for Association
Rule Mining
• Apriori
• It applies a two-stage approach to discover frequent itemsets and
confident association rules.
• Frequent Itemset Discovery. To find all frequent itemsets, Apriori
introduces a candidate generation and test strategy. The basic idea is
that it first generates the candidate k-itemsets (k is 1 at the
beginning and is incremented by 1 in each subsequent cycle), and then
tests whether each candidate is frequent.
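As an illustration, the following is a minimal sketch of the candidate generation and test loop described above (plain Python; transactions are assumed to be given as sets of items, and this is not the exact pseudocode from the source):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Minimal Apriori sketch: generate candidate k-itemsets, keep the frequent ones."""
    # 1-itemset candidates
    items = {item for t in transactions for item in t}
    candidates = [frozenset([i]) for i in items]
    frequent, k = {}, 1
    while candidates:
        # Test step: count the support of each candidate in one pass over the data
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(level)
        # Candidate generation: join frequent k-itemsets into (k+1)-itemsets
        prev = list(level)
        k += 1
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune candidates that have an infrequent (k-1)-subset
        candidates = [c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent

# Example with a support threshold of 2 transactions
db = [{"bread", "butter"}, {"bread", "milk"}, {"bread", "butter", "milk"}]
print(apriori(db, min_sup=2))
```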
Association Rule Mining. Given all frequent itemsets, finding all frequent and
confident association rules is straightforward, and the approach is very similar to the
frequent itemset mining algorithm. Because the cost of finding frequent itemsets is
high and accounts for most of the overall work of discovering association rules,
almost all research so far has focused on the frequent itemset generation stage.
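A hedged sketch of this second stage, deriving confident rules from the frequent itemsets returned by the Apriori sketch above (the min_conf threshold and helper names are illustrative, not from the source):

```python
from itertools import combinations

def generate_rules(frequent, min_conf):
    """For each frequent itemset X and non-empty proper subset A,
    emit the rule A -> X\\A if support(X) / support(A) >= min_conf."""
    rules = []
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                # Every subset of a frequent itemset is frequent, so its support is known
                conf = sup / frequent[antecedent]
                if conf >= min_conf:
                    rules.append((set(antecedent), set(itemset - antecedent), conf))
    return rules
```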
Eclat
Many algorithms have been proposed based on the Apriori idea, among which Eclat
[270, 268] is distinct in that it was the first to generate all frequent
itemsets in a depth-first manner, while employing the vertical database layout and using
an intersection-based approach to compute the support of an itemset.
Eclat - Frequent Itemset Mining
Input: A transaction database D, a user-specified threshold min_sup, and a set of
atoms of a sub-lattice S.
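Below is a minimal sketch of the vertical-layout, intersection-based idea (items are mapped to tid-sets and supports are computed by set intersection); it is an illustration, not the pseudocode referenced above:

```python
def eclat(tidsets, min_sup, prefix=frozenset(), out=None):
    """Depth-first Eclat sketch over a vertical layout.
    tidsets: dict mapping an item to the set of transaction ids containing it."""
    if out is None:
        out = {}
    items = sorted(tidsets)
    for i, item in enumerate(items):
        tids = tidsets[item]
        if len(tids) < min_sup:
            continue
        itemset = prefix | {item}
        out[frozenset(itemset)] = len(tids)
        # Build the conditional vertical database by intersecting tid-sets
        suffix = {other: tids & tidsets[other] for other in items[i + 1:]}
        eclat(suffix, min_sup, itemset, out)
    return out

# Vertical layout of the same three transactions used above
vertical = {"bread": {0, 1, 2}, "butter": {0, 2}, "milk": {1, 2}}
print(eclat(vertical, min_sup=2))
```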
FP-growth: The algorithm applies a divide-and-conquer approach.
Apriori-like Algorithms:
PrefixSpan
AprioriAll
Sequential pattern mining was first introduced by Agrawal in [11], where three Apriori-based
algorithms were proposed. Given the transaction database with three attributes customer-id,
transaction-time and purchased-items, the mining process was decomposed into five
phases:
Sort Phase: The original transaction database is sorted based on the customer and transaction
time.
L-itemsets Phase: The sorted database is first scanned to obtain the frequent (or large) 1-
itemsets based on the user-specified support threshold. Suppose, for example, that the minimal
support is 70%; in that case the minimal support count is 2, and the resulting large 1-itemsets are listed.
Transformation Phase: All the large itemsets are mapped to a series of integers and the
original database is converted by replacing the itemsets, for example with the help of a
mapping table.
Sequence Phase: Mine the transformed database and find all frequent sequential patterns.
Maximal Phase: Prune those patterns which are contained in other sequential patterns; in
other words, only maximal sequential patterns remain.
Since most of the phases are straightforward, research focused on the sequence phase.
AprioriAll [11] was first proposed based on the Apriori algorithm in association rule mining
[9]. There are two steps to mine sequential patterns, i.e., candidate generation and test.
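As a small illustration of the test step, the helpers below check whether a candidate sequence is contained in a customer sequence and count its support (names and data layout are illustrative, not from the source):

```python
def contains(candidate, sequence):
    """True if candidate (a list of itemsets) is a subsequence of sequence:
    each candidate element is a subset of some later element of sequence."""
    pos = 0
    for elem in candidate:
        while pos < len(sequence) and not set(elem) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def support(candidate, customer_sequences):
    """Number of customer sequences that contain the candidate."""
    return sum(contains(candidate, s) for s in customer_sequences)

# Each customer sequence is a list of transactions (itemsets), ordered by time
db = [[{"bread"}, {"butter", "milk"}],
      [{"bread", "milk"}, {"butter"}],
      [{"milk"}]]
print(support([{"bread"}, {"butter"}], db))  # 2
```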
GSP
Bayesian Classifiers
Given a sample s and training data D, Bayes' rule states that
Pr(s|D) = Pr(D|s) Pr(s) / Pr(D)
where Pr(s) is the prior probability of the sample s, Pr(D) is the prior probability of the
training data D, Pr(s|D) is the probability of s given D, and Pr(D|s) is the probability of D
given s.
The sample generation process can be modeled as follows: (1) each class c has an associated prior
probability Pr(c), with ∑c Pr(c) = 1; the sample s first randomly chooses a class label with
the corresponding probability; and (2) based on the class-conditional distribution Pr(s|c)
and the chosen class label c, we obtain the overall probability of generating the sample,
i.e., Pr(c)Pr(s|c). The posterior probability that s was generated from class c can thus be
deduced using Bayes' rule:
Pr(c|s) = Pr(c) Pr(s|c) / ∑γ Pr(γ) Pr(s|γ)
where γ ranges over all the classes. The probability Pr(s|c) can be determined by
two kinds of parameters: (1) prior domain knowledge unrelated to the
training data; and (2) parameters estimated from the training samples.
For simplicity, the overall set of parameters is denoted as Θ, and thus the posterior above
can be extended as
Pr(c|s,Θ) = Pr(c|Θ) Pr(s|c,Θ) / ∑γ Pr(γ|Θ) Pr(s|γ,Θ)
The advantages of Bayesian classifiers are: (1) they can outperform other classifiers when the
size of the training data is small; and (2) the model is based on probability theory, which
makes it robust to noise in real data. The limitation of Bayesian classifiers is that they are typically
limited to learning classes that occupy contiguous regions of the instance space.
Naive Bayes Learners
The naive Bayesian classifier is a basic type of Bayesian learner built on the assumption that
all the attributes are conditionally independent given the value of the class C. By
independence we mean probabilistic independence, that is, A is independent of B given C
whenever Pr(A|B,C) = Pr(A|C) for all possible values of A, B and C with Pr(C) > 0. The
pseudo code of the naive Bayesian classifier is shown in the corresponding algorithm.
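As an illustration in its place, here is a minimal naive Bayes sketch (not the referenced pseudo code; categorical attributes, add-one Laplace smoothing and the toy data are assumptions):

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal naive Bayes for categorical attributes with Laplace smoothing."""

    def fit(self, X, y):
        self.n = len(y)
        self.class_counts = Counter(y)                 # used for the priors Pr(c)
        self.value_counts = defaultdict(Counter)       # (class, attr index) -> value counts
        self.domains = defaultdict(set)                # attr index -> observed values
        for xi, c in zip(X, y):
            for j, v in enumerate(xi):
                self.value_counts[(c, j)][v] += 1
                self.domains[j].add(v)
        return self

    def predict(self, x):
        best, best_score = None, float("-inf")
        for c, nc in self.class_counts.items():
            # log Pr(c) + sum_j log Pr(x_j | c), with add-one (Laplace) smoothing
            score = math.log(nc / self.n)
            for j, v in enumerate(x):
                score += math.log((self.value_counts[(c, j)][v] + 1)
                                  / (nc + len(self.domains[j])))
            if score > best_score:
                best, best_score = c, score
        return best

# Toy example: attributes = (weather, wind), class = play?
X = [("sunny", "weak"), ("rain", "strong"), ("sunny", "strong"), ("rain", "weak")]
y = ["yes", "no", "yes", "no"]
print(NaiveBayes().fit(X, y).predict(("sunny", "weak")))   # -> "yes"
```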
Neural Networks Classifier
The neural network (NN) classifier, or artificial neural network (ANN) classifier, is one of the classic
methods used for supervised learning. A neural network is commonly represented as a graph
consisting of densely interconnected elements; the elements are called neurons.
When training a neural network, the inputs are [3]: (1) a series of examples of the items
to be classified; and (2) their proper classifications or targets. Such an input can be viewed
as a vector < X1, X2, ..., Xn, θ, t >, where t is the target or true classification. The neural
network uses these to modify its weights, aiming to match its classifications with the
targets in the training set. Weight modification follows a learning rule. The pseudo code
for training a neural network is shown in the corresponding algorithm.
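A minimal sketch of such a training loop, using a single perceptron and the classic perceptron learning rule (a simplification for illustration; the learning rate, epoch count and data are made up, and this is not the referenced pseudo code):

```python
def train_perceptron(examples, epochs=20, lr=0.1):
    """examples: list of (inputs, target) with target in {0, 1}.
    Returns (weights, bias) after repeatedly applying the perceptron rule."""
    n = len(examples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, t in examples:
            # Forward pass: threshold activation
            y = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            # Learning rule: adjust weights in proportion to the error (t - y)
            err = t - y
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

# Learn the logical AND function
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(data)
print([(1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0) for x, _ in data])
```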
The advantage of neural nets is that they perform very well on difficult, non-linear
domains, where it becomes more and more difficult to use decision trees or rule
induction systems, which cut the space of examples parallel to the attribute axes. One
of the disadvantages of using neural nets for data mining is the slow learning process,
compared to, for example, decision trees; this difference can easily be several
orders of magnitude.
In addition to the above commonly used supervised learning methods, many others
have been developed, such as Discriminative Classification, Maximum Entropy
Learners, SVM, and so forth. Refer to [33] for a more comprehensive survey.
Unsupervised Learning
In this section, we introduce the major techniques of unsupervised learning (or clustering).
Among the large number of approaches that have been proposed, there are three representative
unsupervised learning strategies, i.e., k-means, hierarchical clustering and density-based
clustering.
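As a starting point, here is a minimal k-means sketch (plain Python on 2-D points, with a naive initialization from the first k points; an illustration only):

```python
def kmeans(points, k, iters=100):
    """Minimal k-means on 2-D points: assign each point to the nearest
    centroid, then recompute centroids, until assignments stabilize."""
    centroids = points[:k]                      # naive initialization
    assign = [0] * len(points)
    for _ in range(iters):
        new_assign = [min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                                  + (p[1] - centroids[c][1]) ** 2)
                      for p in points]
        if new_assign == assign:
            break
        assign = new_assign
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    return assign, centroids

pts = [(1, 1), (1.5, 2), (8, 8), (9, 9), (0.5, 1)]
print(kmeans(pts, k=2))
```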
There are mainly two types of algorithms for hierarchical clustering: one works in an
agglomerative, bottom-up manner, where the algorithm starts with all the objects
and successively combines them into clusters; the other works in a divisive, top-down
manner, where the algorithm starts with one whole cluster that includes all objects
and recursively splits the clusters into smaller ones. The second type can be seen
as a reverse procedure of the first one; hence, we mainly introduce the former
in this section.
Hierarchical Agglomerative Clustering
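A minimal sketch of hierarchical agglomerative clustering with single-linkage merging (pure Python; the distance function, linkage choice and stopping rule are illustrative assumptions):

```python
def hac(points, num_clusters):
    """Single-linkage agglomerative clustering: start with each point as its
    own cluster and repeatedly merge the two closest clusters."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        # Find the pair of clusters with the smallest single-linkage distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])   # merge cluster j into cluster i
        del clusters[j]
    return clusters

print(hac([(0, 0), (0, 1), (5, 5), (6, 5), (10, 0)], num_clusters=3))
```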
Density-based Clustering
The DBSCAN algorithm was first introduced by Ester et al. The authors group the objects by
recognizing their density. Given a pre-defined density threshold Eps, those areas with a
density higher than the threshold are considered qualified clusters and the others are
treated as outliers or noise. With the help of this inherent characteristic of cluster density,
DBSCAN can discover clusters of arbitrary shape.
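A compact DBSCAN sketch (pure Python, Euclidean distance; the eps and min_pts values in the example are made up):

```python
def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or -1 for noise.
    A point is a core point if at least min_pts points lie within eps of it."""
    def neighbors(i):
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1                     # noise (may be claimed later as a border point)
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster            # previously noise, now a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:    # j is a core point: keep expanding
                queue.extend(j_neighbors)
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
print(dbscan(pts, eps=1.5, min_pts=2))        # two clusters plus one noise point
```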
Semi-supervised Learning
Semi-supervised learning (or semi-supervised classification) aims to address the
problem by using a large amount of unlabeled data, together with the labeled data, to build
better classifiers. Many approaches have been proposed for semi-supervised classification,
among which the representatives are self-training, co-training, generative models and graph-based
methods. We introduce them in the next few sections.
Self-Training
The basic idea of self-training is as follows: a classifier c is first trained on the labeled data L
(which is small). The classifier c is then used to classify the unlabeled data U. Those
unlabeled examples classified with high confidence (judged by a threshold), together with their
predicted class labels, are removed from U and inserted into L. The two datasets, labeled and
unlabeled, are updated and the procedure is repeated. The pseudo code of self-training is shown
in the corresponding algorithm.
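A minimal self-training loop in that spirit (not the referenced pseudo code; scikit-learn's LogisticRegression is an assumed base learner, and the threshold is illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression   # assumed base learner

def self_training(X_l, y_l, X_u, threshold=0.9, max_iter=10):
    """Repeatedly train on L, label confident examples from U, and move them to L."""
    X_l, y_l, X_u = np.array(X_l), np.array(y_l), np.array(X_u)
    clf = LogisticRegression()
    for _ in range(max_iter):
        if len(X_u) == 0:
            break
        clf.fit(X_l, y_l)
        proba = clf.predict_proba(X_u)
        confident = proba.max(axis=1) >= threshold     # judged by a threshold
        if not confident.any():
            break
        # Move confident unlabeled examples, with their predicted labels, into L
        X_l = np.vstack([X_l, X_u[confident]])
        y_l = np.concatenate([y_l, clf.predict(X_u[confident])])
        X_u = X_u[~confident]
    return clf.fit(X_l, y_l)
```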
The advantages of the self-training approach are: (1) the method is intuitive and simple, and it
can be wrapped around other more complex classifiers; and (2) it can be applied to many real
applications, e.g., word sense disambiguation.
Co-Training
The advantages of the co-training approach are: (1) the method is simple and
based on self-training, and it can be wrapped around other more complex classifiers;
and (2) it is not as sensitive to early errors as self-training.
Generative Models
There is an assumption that both the labeled and the unlabeled datasets are generated by the
same kind of model with similar parameters. One common strategy is to
employ an EM-like (i.e., expectation maximization) algorithm to estimate the most
likely values of the parameters, as sketched below.
A further limitation is that the appropriate model is difficult to choose, and a wrong model may
weaken the effectiveness of the approach.
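A compact sketch of the EM idea for a generative approach, using a two-class, one-dimensional Gaussian mixture (the data, initialization and iteration count are made-up illustrations):

```python
import math

def gaussian(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def semi_supervised_em(labeled, unlabeled, iters=20):
    """labeled: list of (x, c) with c in {0, 1}; unlabeled: list of x.
    Estimates class priors, means and variances using both datasets."""
    # Initialize parameters from the labeled data only
    params = []
    for c in (0, 1):
        xs = [x for x, y in labeled if y == c]
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / len(xs) or 1.0
        params.append([len(xs) / len(labeled), mu, var])   # [prior, mean, variance]
    for _ in range(iters):
        # E-step: soft class memberships for the unlabeled points
        resp = []
        for x in unlabeled:
            p = [params[c][0] * gaussian(x, params[c][1], params[c][2]) for c in (0, 1)]
            s = sum(p) or 1e-12
            resp.append([pi / s for pi in p])
        # M-step: re-estimate parameters from labeled (hard) + unlabeled (soft) assignments
        for c in (0, 1):
            w = [1.0 if y == c else 0.0 for _, y in labeled] + [r[c] for r in resp]
            xs = [x for x, _ in labeled] + list(unlabeled)
            total = sum(w)
            mu = sum(wi * xi for wi, xi in zip(w, xs)) / total
            var = sum(wi * (xi - mu) ** 2 for wi, xi in zip(w, xs)) / total or 1.0
            params[c] = [total / len(xs), mu, var]
    return params

labeled = [(0.0, 0), (0.5, 0), (5.0, 1), (5.5, 1)]
unlabeled = [0.2, 0.3, 4.8, 5.2, 5.1]
print(semi_supervised_em(labeled, unlabeled))
```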
Markov Models
The Markov model is one of the well-known approaches because of its broad applications, such
as speech recognition, financial forecasting, gene prediction, cryptanalysis, natural language
processing, data compression, and so forth.
The common point of these applications is that their goal is to predict or recover a data
sequence that is not immediately observable. For a concrete example, i.e., stock price
prediction, people always want to "guess" the trend of some stock, i.e., whether its price will
go up the next day.
In fact, the Markov model may be the simplest and most effective mathematical model for
analyzing such time-series processes, while the class of Markov models is rich
enough to tackle all of these applications.
Regular Markov Models
•Key Concept: The future is dependent only on the most recent observation.
•Example: If today's price goes up, it’s more likely tomorrow’s price will also go up.
•Markov Assumption: Future prediction depends only on the previous event.
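A tiny illustration of this assumption for the stock example: a two-state Markov chain over {up, down} with a made-up transition matrix, predicting the most likely next state from today's observation only:

```python
# Made-up transition probabilities: transition[today][tomorrow]
transition = {
    "up":   {"up": 0.7, "down": 0.3},
    "down": {"up": 0.4, "down": 0.6},
}

def predict_next(today):
    """Markov assumption: tomorrow depends only on today's state."""
    return max(transition[today], key=transition[today].get)

def sequence_probability(states):
    """Probability of a state sequence given its first state."""
    p = 1.0
    for prev, cur in zip(states, states[1:]):
        p *= transition[prev][cur]
    return p

print(predict_next("up"))                          # 'up'
print(sequence_probability(["up", "up", "down"]))  # 0.7 * 0.3 = 0.21
```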
Homogeneous HMM
Variants of HMM
The K-Nearest-Neighbor (kNN) approach is the most often used recommendation scoring algorithm in
many recommender systems. It compares the current user activity with the historic
records of other users in order to find the top k users who share the most similar behaviors with the
current one.
The neighboring users found are then used to produce a prediction of items that are potentially
rated or visited, but not yet, by the current active user via collaborative filtering approaches.
Therefore, the core component of the kNN algorithm is the similarity function used to
measure the similarity or correlation between users in terms of attribute vectors, in which each
user activity is characterized as a sequence of attributes associated with corresponding weights.
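A minimal sketch of such a similarity function and neighbor selection, using cosine similarity over weighted attribute vectors (the vector layout, user data and k are illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse attribute-weight vectors (dicts)."""
    common = set(u) & set(v)
    dot = sum(u[a] * v[a] for a in common)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def top_k_neighbors(active_user, users, k=2):
    """Return the k users most similar to the active user's activity vector."""
    scores = [(cosine(active_user, profile), uid) for uid, profile in users.items()]
    return sorted(scores, reverse=True)[:k]

users = {
    "u1": {"news": 3, "sports": 1},
    "u2": {"news": 2, "sports": 1, "music": 4},
    "u3": {"music": 5},
}
print(top_k_neighbors({"news": 4, "sports": 2}, users))
```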
Content-based Recommendation
In content-based recommendation, a user is associated with the attributes of the items that
he or she has rated, and a user profile is learned from these attributes to model the interest of the
user.
The recommendation score is computed by measuring the similarity of the attributes the user
has rated with those of items not yet rated, to determine which attributes might be potentially rated by
the same user. As a result of this attribute similarity comparison, the method is essentially a
conventional information processing approach applied to recommendation.
The learned user profile reflects the long-term preference of a user within a period, and can be
updated as more rated attributes representing the user's interest are observed. Content-
based recommendation is helpful for predicting an individual's preference since it is based on
the individual's historic rating data rather than taking others' preferences into
consideration.
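A small sketch of this idea: build a user profile as the average attribute vector of the rated items and score unrated items by cosine similarity against that profile (the item data and attribute names are made up):

```python
import math
from collections import defaultdict

def build_profile(rated_items, item_attrs):
    """User profile = average of the attribute vectors of the items the user rated."""
    profile = defaultdict(float)
    for item in rated_items:
        for attr, w in item_attrs[item].items():
            profile[attr] += w / len(rated_items)
    return dict(profile)

def content_scores(profile, item_attrs, rated_items):
    """Score each unrated item by cosine similarity with the user profile."""
    def cosine(u, v):
        dot = sum(u.get(a, 0) * v.get(a, 0) for a in set(u) | set(v))
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0
    return {item: cosine(profile, attrs)
            for item, attrs in item_attrs.items() if item not in rated_items}

item_attrs = {
    "movie_a": {"action": 1.0, "sci-fi": 1.0},
    "movie_b": {"action": 1.0, "comedy": 1.0},
    "movie_c": {"romance": 1.0, "comedy": 1.0},
}
profile = build_profile(["movie_a"], item_attrs)
print(content_scores(profile, item_attrs, {"movie_a"}))   # movie_b scores higher than movie_c
```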
Collaborative Filtering Recommendation
Collaborative filtering recommendation is probably the most commonly and
widely used technique that has been well developed for recommender systems.
In user-based kNN, the prediction of the rating of the active user a on item i is commonly computed as
P(a,i) = mean(Ra) + ∑j sim(a,j) · (Rj,i − mean(Rj)) / ∑j |sim(a,j)|
Here Rj,i denotes the rating on item i voted by user j, mean(Rj) is the average rating of user j, and only the k most similar
users (i.e., the k nearest neighbors of the active user) are considered in making the recommendation.
In contrast to user-based kNN, the item-based kNN algorithm is a different
collaborative filtering algorithm, which is based on computing the similarity
between two columns, i.e., two items. In item-based kNN systems, a mutual item-
item similarity relation table is constructed first by comparing the item
vectors, in which each item is modeled as a set of ratings given by all users.
To produce the prediction on an item i for user u, it computes the ratio of the sum
of the ratings given by the user on the items that are similar to i to
the sum of the involved item similarities, which is commonly written as
P(u,i) = ∑j sim(i,j) · Ru,j / ∑j |sim(i,j)|
Here Ru,j denotes the rating given by user u on item j, and only the k
most similar items (the k nearest neighbors of item i) are used to generate the
prediction.
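A compact sketch of this item-based prediction (the ratings, similarity values and k are made up for illustration):

```python
def predict_item_based(user_ratings, item_sim, target_item, k=2):
    """Weighted average of the user's ratings on the k items most similar
    to the target item, weighted by the item-item similarities."""
    # Rank the items this user has rated by their similarity to the target item
    neighbors = sorted(
        ((item_sim[target_item][j], r) for j, r in user_ratings.items()
         if j in item_sim[target_item]),
        reverse=True)[:k]
    num = sum(sim * r for sim, r in neighbors)
    den = sum(abs(sim) for sim, _ in neighbors)
    return num / den if den else 0.0

# Pre-computed item-item similarity table (illustrative values)
item_sim = {"item_c": {"item_a": 0.8, "item_b": 0.3}}
user_ratings = {"item_a": 5, "item_b": 2}
print(predict_item_based(user_ratings, item_sim, "item_c"))  # (0.8*5 + 0.3*2) / 1.1
```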
Model-based Recommendation
A model-based collaborative filtering algorithm derives a model from the historic rating
data and, in turn, uses it to make recommendations. To derive the hidden model, a variety of
statistical machine learning algorithms can be employed on the training database, such as
Bayesian networks, neural networks, clustering, latent semantic analysis and so on.
The centroids of the user-session clusters can be considered access patterns/models learned
from web usage data and used to make recommendations by referring to the web objects
visited by other users who share the most similar access task with the current target user.
Although different recommendation algorithms exist in recommender systems,
it is easy to see that these algorithms all execute in a collaborative manner,
and that the recommendation score is dependent on the significance weights.
Social Network Analysis
A social network is a social structure consisting of individuals (or organizations), called nodes, and
ties that connect the nodes through different types of interdependency, such as friendship, affiliation,
communication mode, or relationships of religion, interest or prestige.
Social network analysis aims to reveal or visualize the relationships residing in the network via
network theory. Basically, graph theory is the common principle used to conduct the
analysis.
Since a social network compounds the complicated interaction behaviors between various
individuals from different aspects, the analysis is not an easy job with only traditional
means, and the resulting graph-based structure is often very complex.
Social network analysis has been studied in social science areas such as psychology, sociology,
behavioral science and organization science for several decades.
Nowadays the emergence of web-based communities and hosted services such as social
networking sites, wikis and folksonomies brings tremendous freedom of Web autonomy and
facilitates collaboration and knowledge sharing between users.
Along with the interaction between users and computers, social media are rapidly becoming an
important part of our digital experience, ranging from digital textual information to diverse
multimedia forms.
These aspects and characteristics constitute the core of the second generation of the Web. The close
involvement of computing intelligence science has injected many new methodologies and
technologies into social network analysis research, making it a very active and hot topic in
web research.
Newman and Girvan developed a new approach to community structure detection. They
introduced a two-stage procedure. First, it iteratively removes edges from the
network to split it into communities, where the removed edges are identified by using any one of a
number of possible "betweenness" measures. Second, these measures are iteratively recalculated
after each removal.
The partition operation ceases when the optimization condition is reached. In order to choose an
appropriate number of communities and ensure the strength of the community structure, a modularity
function Q for monitoring community structure is proposed, providing an effective means of
quantitatively measuring the 'goodness' of the discovered clusters.
The modularity of a particular division of a network is computed based on the difference
between the actual number of edges within a community in the division and the expected
number of such edges if they were placed randomly. Hence, finding the underlying
community structure hidden in a network is equivalent to determining the
optimal modularity value over all possible divisions of the network.
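A small sketch of Newman-Girvan modularity computed from an adjacency matrix and a hard community assignment (Q is the sum over communities of e_cc − a_c², where e_cc is the fraction of edges inside community c and a_c the fraction of edge ends attached to it); the example graph is made up:

```python
def modularity(adj, communities):
    """adj: symmetric 0/1 adjacency matrix; communities: community id per node."""
    n = len(adj)
    two_m = sum(sum(row) for row in adj)        # 2 * number of edges
    q = 0.0
    for c in set(communities):
        nodes = [i for i in range(n) if communities[i] == c]
        # Fraction of edges that fall entirely inside community c
        e_cc = sum(adj[i][j] for i in nodes for j in nodes) / two_m
        # Fraction of all edge ends attached to community c
        a_c = sum(adj[i][j] for i in nodes for j in range(n)) / two_m
        q += e_cc - a_c ** 2
    return q

# Two triangles joined by a single edge: a clear two-community structure
adj = [[0,1,1,0,0,0],
       [1,0,1,0,0,0],
       [1,1,0,1,0,0],
       [0,0,1,0,1,1],
       [0,0,0,1,0,1],
       [0,0,0,1,1,0]]
print(modularity(adj, [0,0,0,1,1,1]))   # positive, close to 0.36
print(modularity(adj, [0,1,0,1,0,1]))   # much lower (negative)
```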
Updated Modularity Function using Fuzzy C-means Clustering
Zhang et al. proposed an updated version of the modularity function Q. The main idea of the
work is to replace the "hard division" with a "soft division" by using fuzzy c-means
clustering. Given k communities existing in the network, a corresponding n × k
soft division matrix Uk = [u1,···,uk] is defined, with 0 ≤ uic ≤ 1 for each c = 1,···,k and ∑c uic = 1 for
each i = 1,···,n.
With the introduction of this fuzzy assignment, the membership of each
community can further be defined as Vc = {i | uic > λ, i ∈ V}, where λ is a threshold used to turn a
soft division into a final clustering. The modularity function Q is then reformulated in terms of this soft division.
Based on the updated modularity function Q, Zhang et al. proposed a new algorithm for
detecting community structure by combining spectral mapping and fuzzy c-means
clustering. The algorithm is described as follows.
• Spectral mapping: (i) Compute the diagonal matrix D = (dii), where dii = ∑j aij. (ii) Form
the eigenvector matrix EK = [e1, e2,···,eK] by computing the top K eigenvectors of the
generalized eigensystem Ax = tDx.
• Fuzzy c-means: for each value of k, 2 ≤ k ≤ K: (i) Form the matrix Ek = [e2, e3,···,ek] from
the matrix EK. (ii) Normalize the rows of Ek to unit length using the Euclidean norm.
(iii) Cluster the row vectors of Ek using fuzzy c-means or any other fuzzy clustering
method to obtain a soft assignment matrix Uk.
• Maximizing the modularity function: Pick the k and the corresponding fuzzy partition Uk
that maximize Q(Uk).
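A runnable sketch of the spectral-mapping plus fuzzy c-means pipeline on a toy network (numpy and scipy are assumed dependencies; the fuzzy c-means here is a bare-bones stand-in, and the modularity-maximizing selection of k is omitted, so this is a simplification of the algorithm above rather than a faithful implementation):

```python
import numpy as np
from scipy.linalg import eigh   # generalized symmetric eigensolver

def spectral_embedding(A, K):
    """Top-K eigenvectors of the generalized eigensystem A x = t D x."""
    D = np.diag(A.sum(axis=1))
    vals, vecs = eigh(A, D)            # eigenvalues returned in ascending order
    return vecs[:, -K:]                # columns for the K largest eigenvalues

def fuzzy_cmeans(X, c, m=2.0, iters=100):
    """Very small fuzzy c-means: returns the n x c soft membership matrix U."""
    rng = np.random.default_rng(0)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(iters):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-9
        U = 1.0 / (dist ** (2 / (m - 1)))
        U /= U.sum(axis=1, keepdims=True)
    return U

# Two triangles joined by one edge (same toy network as before)
A = np.array([[0,1,1,0,0,0],
              [1,0,1,0,0,0],
              [1,1,0,1,0,0],
              [0,0,1,0,1,1],
              [0,0,0,1,0,1],
              [0,0,0,1,1,0]], dtype=float)
E = spectral_embedding(A, K=2)
E = E / np.linalg.norm(E, axis=1, keepdims=True)   # normalize rows to unit length
U = fuzzy_cmeans(E, c=2)
print(np.round(U, 2))   # soft memberships; nodes 0-2 and 3-5 separate clearly
```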
The main purpose of the updated modularity function is to detect overlapping
community structure in a network. In Fig. 3.20, an example is presented to illustrate
the detection of overlapping community structure, along with the modularity
function Q values calculated by using k-means and fuzzy spectral clustering.
The Evolution of Social Networks
Another important aspect of social networks is the dynamic and evolutionary change of the
network structure over time. Since a social network is an explicit demonstration of
relationships or interactions of individuals within a network environment, the networked
communities may evolve over time, due to changes in individual roles, social status or even
personal preferences in the network. Hence the analysis of these communities (memberships,
structure and temporal dynamics) is emerging as an interesting and important research issue
within the social network analysis research area.
There are a variety of approaches and algorithms proposed to address the dynamic evolution of
social networks. These algorithms are usually categorized into two main families. The first one
uses an additional temporal dimension along with the traditional networked graphs.
In these approaches, the ordinary two-way graph model is expanded into a three-way tensor
model by adding a temporal dimension. The network at each timestamp is modeled as an
individual networked graph, which is expressed as a matrix. As such, the set of networked graphs at
a sequence of time stamps constitutes a three-way data cube, in which each slice of the data
cube corresponds to a networked graph at a specific time stamp.
For a given tensor X modeling the sequence of network structures, the
process is to find a best low-rank approximation tensor X' that minimizes the Frobenius norm
||X − X'||F, where X ∈ R^(N1×···×Nm) and the core of the approximation lies in R^(R1×···×Rm). The best
approximation X' is decomposed into the product of a core tensor Y and two
projection matrices U1 and U2 for the row and column variables, as shown in the figure.
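A numerically simple illustration of this kind of decomposition: an HOSVD-style rank-(R1, R2) approximation of a node × node × time tensor with numpy (a sketch of the idea, not the specific algorithm used in the cited work):

```python
import numpy as np

def tucker2_approx(X, r1, r2):
    """HOSVD-style rank-(r1, r2) approximation of a 3-way tensor X (n1 x n2 x T):
    X is approximated by a core Y multiplied by projection matrices U1 and U2,
    with the time mode left uncompressed."""
    n1, n2, T = X.shape
    # Mode-1 and mode-2 unfoldings, and their leading left singular vectors
    U1 = np.linalg.svd(X.reshape(n1, -1), full_matrices=False)[0][:, :r1]
    U2 = np.linalg.svd(X.transpose(1, 0, 2).reshape(n2, -1), full_matrices=False)[0][:, :r2]
    # Core tensor Y and the reconstructed approximation
    Y = np.einsum('ijk,ia,jb->abk', X, U1, U2)
    X_hat = np.einsum('abk,ia,jb->ijk', Y, U1, U2)
    return Y, U1, U2, X_hat

# Toy example: the same 6-node graph observed at 3 time stamps, with a small change
A = np.array([[0,1,1,0,0,0],[1,0,1,0,0,0],[1,1,0,1,0,0],
              [0,0,1,0,1,1],[0,0,0,1,0,1],[0,0,0,1,1,0]], dtype=float)
X = np.stack([A, A, A], axis=2)          # shape (6, 6, 3)
X[0, 5, 2] = X[5, 0, 2] = 1.0            # a new tie appears at the last time stamp
Y, U1, U2, X_hat = tucker2_approx(X, r1=2, r2=2)
print(np.round(np.linalg.norm(X - X_hat), 3))   # Frobenius reconstruction error
```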
The second category of temporal analysis of dynamic community structures is the
simultaneous discovery of community structure and its dynamic evolution. In contrast to the classic
community detection and dynamic evolution analysis algorithms, such approaches
take both requirements into account together when conducting social network analysis.
The hypothesis in these approaches is that the evolution of community structures in a network
from one stage to its consecutive stage will not change dramatically; instead, it follows the
routine that the incurred cost is minimized. This assumption is quite reasonable in the real
world, where real observations evolve gradually in a longitudinal manner.
In their approach, they first introduced a cost function to measure the quality of the community
structure at a certain time t and assumed that the formation of a stable community structure
depends on the minimization of this cost function. In particular, the cost function consists of
two parts, a snapshot cost (CS) and a temporal cost (CT), which are typically combined with a trade-off weight α:
cost = α · CS + (1 − α) · CT
The snapshot cost is determined by the distribution difference between the real
similarity matrix of the nodes within the network and the similarity matrix computed from the
formed community structure.
That is, the snapshot cost is the KL-divergence between the two similarity
matrices. The temporal cost, on the other hand, indicates the distribution difference between
the similarity matrices at two consecutive time stamps, using the same formula.
Combining these two types of cost gives the whole cost measure, which is used to guide the
community structure detection.
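A small sketch of how such a combined cost could be computed, normalizing each similarity matrix to a distribution and using KL-divergence (the α value and matrices are illustrative, and this is not the exact formulation of the cited work):

```python
import numpy as np

def kl_divergence(P, Q, eps=1e-12):
    """KL-divergence between two matrices treated as probability distributions."""
    P = P / P.sum() + eps
    Q = Q / Q.sum() + eps
    return float(np.sum(P * np.log(P / Q)))

def evolution_cost(sim_real, sim_community, sim_previous, alpha=0.5):
    """cost = alpha * snapshot cost + (1 - alpha) * temporal cost."""
    snapshot = kl_divergence(sim_real, sim_community)   # fit to the current data
    temporal = kl_divergence(sim_real, sim_previous)    # smoothness over time
    return alpha * snapshot + (1 - alpha) * temporal

# Toy similarity matrices at time t, from the detected communities, and at time t-1
sim_real = np.array([[1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.2, 1.0]])
sim_comm = np.array([[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
sim_prev = np.array([[1.0, 0.8, 0.2], [0.8, 1.0, 0.1], [0.2, 0.1, 1.0]])
print(round(evolution_cost(sim_real, sim_comm, sim_prev), 4))
```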