Link-Analysis-AH

Download as pdf or txt
Download as pdf or txt
You are on page 1of 18

Link Analysis

Sukomal Pal
CSE, IIT (BHU)
spal.cse@itbhu.ac.in
HTML webpage

Introduction to IR 2
Bow-tie structure of the Web
A small web graph
Ranking
 Keywords and Proximity-based retrieval
 Traditional models
- Boolean
- VSM
- Probabilistic (BM25, Lang. modelling)
 Link Analysis
 Anchor Text
 PageRank
 Authority and Hub
Other features
- User behavior (Click logs)

Introduction to IR 4
Introduction to IR 5
Use of PageRank
 Google displays the PageRank of each page in the Google
toolbar

 PageRank scores are converted into a logarithmic-like scale and


assume values between 0 and 10


PageRank scores are continuously updated by Google, but they
are exported and made availble to the Google toolbar periodically,
typically every few months.
Use of PageRank


PageRank can be used to raise the weight of important pages:
weight (t , d ) TFIDF (t , d ) PR (d )
where t is an indexed term and d is an indexed page
TFIDF = term-frequency*inverse document frequency
PR(d) = PageRank score of document d

The actual use of PageRank by Google is confidential

Recently Google announced the use of «Hummingbird», a new
class of algorithms which makes use of knowlege bases and of an
improved NLP understanding

To learn more see some relevant pages.
 1. https://en.wikipedia.org/wiki/PageRank
 2. https://www.google.com/search/howsearchworks/
Hubs and Authority
HITS: Hypertext Induced Topic Search

Query dependent. Link analysis carried out over query induced graph

Two kind of important pages:
– hubs are pages that point to good authorities
– authorities are pages that are pointed to by good hubs

Two scores (authority and hub score) for each page
– Hub's value is in the links which exit the node, as it is used in order to select the
pages containing relevant information
– Authority's value is in the links which enter the node, as it is used to describe
contents

Authority: WHO-COVID-update page


Hub: IT blog pointing to good pages hub authority

– Good hubs point to good authorities. Good authorities are pointed by good hubs
HITS: Basics

Given a query, every web page has 2 scores
- hub score
- authority score

For any query, 2 ranked lists

Circular relation:
- A good hub points to many good authorities
- A good authority is pointed to by many good hubs
Hubs and Authority


Perform these updates iteratively
1. Initialize hub & authority scores
2. Compute hub scores
3. Recompute authority scores (based on hubs)
...so on.
HITS: Formulation
⃗h= hub vector
⃗a = authority vector

A = adjacency matrix s.t

where AT denotes the transpose of the matrix A.

Then,

Here, again, h and a on LHS correspond to (t+1)-th iteration when on RHS (t)-th iteration

The eqns are similar to pair of eigen vector equations when <-- are replaced suitably

λh is the eigenvalue of AAT and λa is the eigenvalue of AT A.


HITS: Hub & Authority scores

Both hub and authority scores will tend to
converge

At steady state, the hub and authority scores
will not change in successive iterations
i.e. h(t+1) = h(t)
& a(t+1) = a(t)
Computing Hub & Authority scores
HITS: how to get a subset of pages

Send query q (say leukemia) to text-based IR system to
obtain the root set of pages, i.e. using a boolean or a vector
space model
Expand the base set by radius one to obtain an expanded
graph (base set)
– add pages with outgoing links to pages in the root set
– add pages with incoming links from pages in the root set
This enriches the root set with good authority & hub pages
Example of HITS on query:
japan elementary schools
Issues with HITS

non-unique ranking
- if the graph is not strongly connected, as it is the case for the Web
graph, HITS might not converge to a unique solution

disconnected webs
- upon convergence, HITS assigns non-negative scores only to the largest
connected component

sensitivity to local topology
- subtle changes in the local graph topology might significantly alter HITS
scores

nepotistic links
- a group of sites can endorse each other by adding cross links in order to
artificially increase the HITS scores

topic drift/contamination
-the expansion step might add pages to the base set whose topic might
differ from the original root set
PR vs HITS

HITS computes two scores (hub and authority) for each
page

PageRank computes one score for each page

PageRank scores are similar (but not identical) to authority
scores computed by HITS

In PageRank, page i confers authority to page j if there is a
link between i and j (cfr. matrix A)

In HITS, page i confers authority to page j if there is another
page that links to j (cfr. matrix ETE)

PageRank does not identify top hub pages
HITS vs PR

HITS is query dependent, thus scores can be
computed only at query time. PageRank is
query independent and can be computed offline

HITS computes two scores per page. Returning
top ranking hubs is useful starting point for
users exploring a topic

PageRank is more robust to node
insertion/deletion

You might also like