Link-Analysis-AH
Sukomal Pal
CSE, IIT (BHU)
spal.cse@itbhu.ac.in
HTML webpage
Introduction to IR 2
Bow-tie structure of the Web
A small web graph
Ranking
Keywords and Proximity-based retrieval
Traditional models
- Boolean
- VSM
- Probabilistic (BM25, Lang. modelling)
Link Analysis
Anchor Text
PageRank
Authority and Hub
Other features
- User behavior (Click logs)
Use of PageRank
● Google displays the PageRank of each page in the Google toolbar.
● PageRank scores are continuously updated by Google, but they
are exported and made available to the Google toolbar only periodically,
typically every few months.
Use of PageRank
● PageRank can be used to raise the weight of important pages:
$\mathrm{weight}(t, d) = \mathrm{TFIDF}(t, d) \times \mathrm{PR}(d)$
where t is an indexed term and d is an indexed page,
TFIDF = term frequency × inverse document frequency,
PR(d) = PageRank score of document d
● The actual use of PageRank by Google is confidential.
● Recently Google announced the use of «Hummingbird», a new
class of algorithms that makes use of knowledge bases and of
improved NLP understanding.
● To learn more, see these pages:
1. https://en.wikipedia.org/wiki/PageRank
2. https://www.google.com/search/howsearchworks/
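The weighting rule on this slide can be sketched in a few lines of Python. This is only an illustration, assuming the weight is the plain product of TF-IDF and PageRank; the toy corpus `docs`, the `pagerank` values, and the helper names `tfidf` and `weight` are all hypothetical, and raw term frequency with a natural-log IDF is just one of several common TF-IDF variants.

```python
import math

# Toy corpus and PageRank values; both are made up for illustration.
docs = {
    "d1": "link analysis ranks web pages".split(),
    "d2": "web pages link to other web pages".split(),
    "d3": "boolean retrieval ignores links".split(),
}
pagerank = {"d1": 0.5, "d2": 0.3, "d3": 0.2}  # hypothetical PR(d) scores


def tfidf(term, doc_id):
    """Raw term frequency times log inverse document frequency."""
    tf = docs[doc_id].count(term)
    df = sum(1 for words in docs.values() if term in words)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)


def weight(term, doc_id):
    """weight(t, d) = TFIDF(t, d) * PR(d), assuming the product form."""
    return tfidf(term, doc_id) * pagerank[doc_id]


if __name__ == "__main__":
    for doc_id in docs:
        print(doc_id, weight("web", doc_id))
```

With these made-up numbers, d2 mentions "web" twice but has a lower PageRank than d1, so the final weight reflects both the textual match and the link-based importance.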
Hubs and Authority
HITS: Hypertext Induced Topic Search
● Query dependent: link analysis is carried out over the query-induced graph
● Two kinds of important pages:
– hubs are pages that point to good authorities
– authorities are pages that are pointed to by good hubs
● Two scores (authority score and hub score) for each page
– A hub's value lies in the links that leave the node; it is used to select the
pages containing relevant information
– An authority's value lies in the links that enter the node; it is used to describe
contents
– Good hubs point to good authorities. Good authorities are pointed to by good hubs
HITS: Basics
● Given a query, every web page has 2 scores
- hub score
- authority score
● For any query, 2 ranked lists (one by hub score, one by authority score)
● Circular relation:
- A good hub points to many good authorities
- A good authority is pointed to by many good hubs
Hubs and Authority
● Perform these updates iteratively
1. Initialize hub & authority scores
2. Compute hub scores (based on authorities)
3. Recompute authority scores (based on hubs)
...and so on.
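The iterative procedure above can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function names `hits` and `normalize`, the fixed iteration count, the Euclidean normalization after each round, and the four-page example graph are all assumptions made for the sketch.

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length so they stay bounded."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {p: v / norm for p, v in scores.items()}


def hits(links, iterations=50):
    """Iterate hub/authority updates on a graph {page: [pages it links to]}."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    # Step 1: initialize hub and authority scores.
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Step 2: a hub's score is the sum of the authority scores it points to.
        new_hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Step 3: an authority's score is the sum of the hub scores pointing to it.
        new_auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                new_auth[q] += new_hub[p]
        hub, auth = normalize(new_hub), normalize(new_auth)
    return hub, auth


# Hypothetical query-induced graph: h1 links to a1 and a2, h2 links to a1.
example_links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}

if __name__ == "__main__":
    hub, auth = hits(example_links)
    print("hubs:", sorted(hub, key=hub.get, reverse=True))
    print("authorities:", sorted(auth, key=auth.get, reverse=True))
```

Sorting the pages by each score yields the two ranked lists mentioned earlier. In the example graph, h1 comes out as the better hub because it points to both authorities, and a1 as the better authority because both hubs point to it.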
HITS: Formulation
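$\vec{h}$ = hub vector
$\vec{a}$ = authority vector
$A$ = adjacency matrix of the query-induced graph, with $A_{ij} = 1$ if page i links to page j and 0 otherwise
Then,
$\vec{h} \leftarrow A\,\vec{a}$
$\vec{a} \leftarrow A^{T}\,\vec{h}$
Here, again, $\vec{h}$ and $\vec{a}$ on the LHS correspond to the (t+1)-th iteration, while those on the RHS correspond to the t-th iteration.
The equations resemble a pair of eigenvector equations once each $\leftarrow$ is replaced by $=$ and an eigenvalue is introduced: $\vec{h}$ is then an eigenvector of $A A^{T}$ and $\vec{a}$ an eigenvector of $A^{T} A$.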
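The eigenvector connection can be checked numerically with a small sketch. It assumes the usual matrix form of HITS, where A is the adjacency matrix (A[i][j] = 1 if page i links to page j), the hub vector is updated as A times the authority vector, and the authority vector as A-transpose times the hub vector. The helper names, the unit-length scaling, and the three-page graph are hypothetical choices for illustration.

```python
import math


def mat_vec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def unit(v):
    """Scale v to unit Euclidean length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else v


def hits_matrix(A, iterations=100):
    """Simultaneous updates: both right-hand sides use the iteration-t vectors."""
    At = transpose(A)
    n = len(A)
    h = unit([1.0] * n)
    a = unit([1.0] * n)
    for _ in range(iterations):
        h, a = unit(mat_vec(A, a)), unit(mat_vec(At, h))
    return h, a


# Hypothetical 3-page graph: page 0 links to pages 1 and 2, page 1 links to page 2.
A = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
h, a = hits_matrix(A)

# If h is an eigenvector of A A^T, then (A A^T) h is a scalar multiple of h.
AAt_h = mat_vec(A, mat_vec(transpose(A), h))
lam = math.sqrt(sum(x * x for x in AAt_h))  # eigenvalue estimate, since |h| = 1

if __name__ == "__main__":
    print("hub vector:", h)
    print("authority vector:", a)
    print("eigenvalue estimate:", lam)
```

After enough iterations, applying A A^T to the hub vector only rescales it, which is the eigenvector relation the slide alludes to; the same check works for the authority vector with A^T A.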