Mining Graphs
Mining Graphs
Mining Graphs
PRESENTED
PRESENTED BY
BY
HARSHITHA
HARSHITHA MERLIN
MERLIN R
R
INTRODUCTION
Graphs represents a more general class of structures than sets, sequences, lattices, and trees.
There is a broad range of graph applications on the Web and in social networks, information
networks, biological networks, bioinformatics, chemical informatics, computer vision, and
multimedia and text retrieval.
Hence, graph and network mining have become increasingly important and heavily researched.
THEMES
We overview the following major themes:
graph pattern mining
statistical modeling of networks
data cleaning, integration, and validation by network analysis
clustering and classification of graphs and homogeneous network
clustering, ranking, and classification of heterogeneous networks
role discovery and link prediction in information networks
similarity search and OLAP in information networks and
evolution of information networks.
GRAPH PATTERN MINING
WHY ARE GRAPHS USED?
Most of existing data mining algorithms are based on Flat transaction representation, i.e., sets of
items.
Datasets with structures, layers, hierarchy and/or geometry often do not fit well in this flat
transaction setting. For example:
Numerical simulations
3D protein structures
Chemical compounds
Generic XML files
Motivation
• Finding subgraphs capable of compressing the data by abstracting
instances of the substructures
• Identifying conceptually interesting patterns
Alternatively, we can mine the set of closed graphs where a graph g is closed if there exists no proper
supergraph g' that has the same support count as g.
Moreover, there are many variant graph patterns, including approximate frequent graphs, coherent
graphs, and dense graphs.
METHODS FOR MINING GRAPH PATTERNS
Apriori based approach Growth based approach
APPLICATIONS
Mining biochemical structures
The scale-free model assumes that the network follows the power law distribution (also known as the Pareto
distribution or the heavy-tailed distribution)
Social networks exhibit some evolutionary characteristics
Densification power law, which states that networks become increasingly dense over time
Shrinking diameter, where the effective diameter decreases as the network grows.
Data Cleaning, Integration, and
Validation by Information Network
Analysis
DATA CLEANING
Real-world data are often incomplete, noisy, uncertain, and unreliable
Information redundancy can be explored in such networks to perform quality data cleaning, data
integration, information validation, and trustability analysis
For example, we can distinguish authors who share the same names by examining the networked connections
with other heterogeneous objects such as coauthors, publication venues, and terms
In addition, we can identify inaccurate author information by exploring a network built based on author
information provided by multiple booksellers.
TRAINING SET
in many cases, portions of the data serve as the “training set.” That is, relatively clean and reliable data
from information providers can be used to help consolidate the remaining, unreliable portions of the
data.
This reduces the costly efforts of labeling the data by hand and of training on massive, dynamic, real-
world data sets.
CLUSTERING AND CLASSIFICATION OF GRAPHS AND
HOMOGENEOUS NETWORKS
Large graphs and networks have cohesive structures, which are hidden among their massive,
interconnected nodes and links.
For this reason cluster analysis is developed to uncover network structures, discover hidden communities,
hubs, and outliers
Network clustering methods are categorized as either partitioning, hierarchical, or density-based
algorithms.
CLUSTERING, RANKING, AND CLASSIFICATION OF
HETEROGENEOUS NETWORKS
A heterogeneous network contains interconnected nodes and links of different types. Such interconnected
structures contain rich information, which can be used to mutually enhance nodes and links, and
propagate knowledge from one type to another
Clustering and ranking of such heterogeneous networks can be performed hand-in-hand. Highly ranked
nodes/links in a cluster may contribute more than their lower-ranked counterparts in the evaluation of the
cohesiveness of a cluster.
Clustering may help consolidate the high ranking of objects/links dedicated to the cluster. Such mutual
enhancement of ranking and clustering prompted the development of an algorithm called RankClus.
ROLE DISCOVERY AND LINK PREDICTION IN INFORMATION
NETWORKS
There exist many hidden roles or relationships among different nodes/links in a heterogeneous network
To discover such hidden roles, experts can specify constraints based on their background knowledge.
Similarly, link prediction can be performed based on the expected relationships among the candidate
nodes/links.
For example, we may predict which papers an author may write, read, or cite, based on the author’s recent
publication history and the trend of research on similar topics
people refer to link prediction as link mining
object reconciliation (which predicts whether two objects are the same)