Mining Graphs

Download as pptx, pdf, or txt
Download as pptx, pdf, or txt
You are on page 1of 23

MINING GRAPHS AND NETWORKS

PRESENTED
PRESENTED BY
BY
HARSHITHA
HARSHITHA MERLIN
MERLIN R
R
INTRODUCTION
 Graphs represents a more general class of structures than sets, sequences, lattices, and trees.
 There is a broad range of graph applications on the Web and in social networks, information
networks, biological networks, bioinformatics, chemical informatics, computer vision, and
multimedia and text retrieval.
 Hence, graph and network mining have become increasingly important and heavily researched.
THEMES
We overview the following major themes: 
 graph pattern mining
 statistical modeling of networks
 data cleaning, integration, and validation by network analysis
 clustering and classification of graphs and homogeneous network
 clustering, ranking, and classification of heterogeneous networks
 role discovery and link prediction in information networks
 similarity search and OLAP in information networks and
 evolution of information networks.
GRAPH PATTERN MINING
WHY ARE GRAPHS USED?
 Most of existing data mining algorithms are based on Flat transaction representation, i.e., sets of
items. 
 Datasets with structures, layers, hierarchy and/or geometry often do not fit well in this flat
transaction setting. For example:
 Numerical simulations
 3D protein structures
 Chemical compounds
 Generic XML files

3D structure of sodium bicarbonate


EXAMPLES

Yeast protein interaction network Internet


WHAT IS GRAPH PATTERN MINING?
 Graph pattern mining is the mining of frequent
subgraphs (also called (sub)graph patterns) in one or a
set of graphs
 Graph Mining (GM) is essentially the problem of
discovering repetitive subgraphs occurring in the input
graphs
 Sub graphs

 graph structures that occur a significant number of times


across a set of graphs 
 Ex.: Common occurrences of hydroxide-ion 

 A (sub)graph is frequent if its support


(occurrence frequency) in a given dataset is no less than
a minimum support threshold
METHODS FOR MINING GRAPH
PATTERNS
o Apriori-based approach
o Growth–based approach

Motivation 
• Finding subgraphs capable of compressing the data by abstracting
instances of the substructures ​
• Identifying conceptually interesting patterns​

 Alternatively, we can mine the set of closed graphs where a graph g is closed if there exists no proper
supergraph g' that has the same support count as g.
 Moreover, there are many variant graph patterns, including approximate frequent graphs, coherent
graphs, and dense graphs.
METHODS FOR MINING GRAPH PATTERNS
Apriori based approach Growth based approach
APPLICATIONS
 Mining biochemical structures

 Program control flow analysis

 Mining XML structures or Web communities

 Building blocks for graph classification, clustering, compression, comparison, and


correlation analysis
STATISTICAL MODELING
OF NETWORKS
WHAT IS A NETWORK?
 A network consists of a set of nodes, each corresponding
to an object associated with a set of properties, and a set of
edges (or links) connecting those nodes, representing
relationships between objects. 
 A network is 
 homogeneous if all the nodes and links are of the same type,
such as a friend network, a coauthor network, or a web page
network.
 heterogeneous if the nodes and links are of different types,
such as publication networks (linking together authors,
conferences, papers, and contents), and health-care networks
(linking together doctors, nurses, patients, diseases, and
treatments). 
WHAT IS A NETWORK?
 Researchers have proposed multiple statistical models for modeling homogeneous networks.

 The most well-known generative models are


 the random graph model (i.e., the Erdos-Renyi model)

 the Watts-Strogatz model

 the scale-free model

 The scale-free model assumes that the network follows the power law distribution (also known as the Pareto
distribution or the heavy-tailed distribution)
 Social networks exhibit some evolutionary characteristics
 Densification power law, which states that networks become increasingly dense over time

 Shrinking diameter, where the effective diameter decreases as the network grows.
Data Cleaning, Integration, and
Validation by Information Network
Analysis
DATA CLEANING
 Real-world data are often incomplete, noisy, uncertain, and unreliable

 Information redundancy may exist 

 Information redundancy can be explored in such networks to perform quality data cleaning, data
integration, information validation, and trustability analysis
 For example, we can distinguish authors who share the same names by examining the networked connections
with other heterogeneous objects such as coauthors, publication venues, and terms
 In addition, we can identify inaccurate author information by exploring a network built based on author
information provided by multiple booksellers.
TRAINING SET
 in many cases, portions of the data serve as the “training set.” That is, relatively clean and reliable data
from information providers can be used to help consolidate the remaining, unreliable portions of the
data.
 This reduces the costly efforts of labeling the data by hand and of training on massive, dynamic, real-
world data sets.
CLUSTERING AND CLASSIFICATION OF GRAPHS AND
HOMOGENEOUS NETWORKS
 Large graphs and networks have cohesive structures, which are hidden among their massive,
interconnected nodes and links.
 For this reason cluster analysis is developed to uncover network structures, discover hidden communities,
hubs, and outliers
 Network clustering methods are categorized as either partitioning, hierarchical, or density-based
algorithms.
CLUSTERING, RANKING, AND CLASSIFICATION OF
HETEROGENEOUS NETWORKS
 A heterogeneous network contains interconnected nodes and links of different types. Such interconnected
structures contain rich information, which can be used to mutually enhance nodes and links, and
propagate knowledge from one type to another
 Clustering and ranking of such heterogeneous networks can be performed hand-in-hand. Highly ranked
nodes/links in a cluster may contribute more than their lower-ranked counterparts in the evaluation of the
cohesiveness of a cluster.
 Clustering may help consolidate the high ranking of objects/links dedicated to the cluster. Such mutual
enhancement of ranking and clustering prompted the development of an algorithm called RankClus.
ROLE DISCOVERY AND LINK PREDICTION IN INFORMATION
NETWORKS
 There exist many hidden roles or relationships among different nodes/links in a heterogeneous network

 For example, advisor–advisee and leader–follower relationships in a research publication network.

 To discover such hidden roles, experts can specify constraints based on their background knowledge.

 Such constraints help cross-checking and validation in large interconnected networks

 Similarly, link prediction can be performed based on the expected relationships among the candidate
nodes/links. 
 For example, we may predict which papers an author may write, read, or cite, based on the author’s recent
publication history and the trend of research on similar topics
 people refer to link prediction as link mining

 however, link mining covers additional tasks like


 link-based object classification

  object type prediction, 

 link type prediction, 

 link existence prediction, 

 link cardinality estimation, and 

 object reconciliation (which predicts whether two objects are the same)

 group detection (which clusters objects), 

 subgraph identification (which finds characteristic subgraphs within networks)

 metadata mining (which uncovers schema-type information regarding unstructured data).


SIMILARITY SEARCH AND OLAP IN INFORMATION NETWORKS
 Similarity search is a primitive operation in database and web search engines

 A heterogeneous information network consists of multityped, interconnected objects. Examples include


bibliographic networks and social media networks, where two objects are considered similar if they are
linked in a similar way with multityped objects
 object similarity within a network can be determined based on network structures and object properties,
and with similarity measures
 similarity can be defined differently per user. By considering different linkage paths, we can derive
various similarity semantics in a network, which is known as path-based similarity.
 By organizing networks based on similarity and clusters, we can generate multiple hierarchies within a
network. Online analytical processing (OLAP) can then be performed. 
EVOLUTION OF SOCIAL AND INFORMATION NETWORKS
 Networks are dynamic and constantly evolving

 Detecting evolving communities and evolving regularities or anomalies in homogeneous or heterogeneous


networks can help people better understand the structural evolution of networks and predict trends and
irregularities in evolving networks
 For homogeneous networks, the evolving communities discovered are subnetworks consisting of objects
of the same type such as a set of friends or coauthors
 For heterogeneous networks, the communities discovered are subnetworks consisting of objects of
different types, such as a connected set of papers, authors, venues, and terms, from which we can also
derive a set of evolving objects for each type, like evolving authors and themes.
THANK YOU

You might also like