GRP 11_page rank algorithms
GRP 11_page rank algorithms
GRP 11_page rank algorithms
Presented by :
RAJ SAWANT 3021147
MOHD.NUMAN SHAIKH 3021148
PRATHAMESH SHEDALE 3021149
TANISHKA SHIRWADKAR 3021150
CONTENT’S
PageRank Overview
Conclusion
PAGE RANK OVERVIEW
PageRank was Developed by Larry Page & Sergey Brin (1996) as part of Google’s foundation to solve the problem of ranking web pages based on
relevance and authority. It was an innovative approach that measured the importance of web pages by analyzing the links between them.
How Does PageRank Work?
It operates on the principle that a webpage's value is determined by the quantity and quality of links pointing to it, treating each link as a "vote" of
confidence. he formula used to calculate the PageRank PR of a page P is:
In this formula, d is the damping factor (typically set to around 0.85), Ti represents each page that links to P, PR(Ti) is the PageRank score of page Ti,
and C(Ti) is the total number of outbound links on page Ti.
Pages with higher PageRank scores are ranked higher in search engine results, indicating their greater relevance and authority, ultimately influencing
the visibility of content on the web.
Example: Wikipedia vs. Personal Blog
Wikipedia: Consider a page on Wikipedia about "Artificial Intelligence." It has numerous backlinks from reputable sources, including academic
journals, news articles, and tech websites. Because it’s linked by many high-authority pages, its PageRank score is high, which means it appears at the
top of search engine results when users search for "Artificial Intelligence."
Personal Blog: In contrast, a personal blog post about "Artificial Intelligence" may have only a few backlinks from lesser-known websites or social
media. Despite potentially having quality content, its lack of authoritative backlinks results in a lower PageRank score, causing it to rank lower in
search results.
POPULARITY CONTEST
)
ES
S)
E
OT
T
O
S(V
(V
KS
NK
IN
LI
L
LIN
S) KS
TE (VO
VO TE
S( S)
NK
LI
Transition Matrices
Transition Matrix in PageRank
A transition matrix (stochastic matrix) represents the probabilities of moving
from one node (webpage) to another in a directed graph.
Each entry PijP_{ij}Pij signifies the probability of transitioning from node iii to
node jjj.
Each row sums to 1, ensuring valid probability distribution.
Essential for computing the importance of webpages in the PageRank algorithm.
Efficient Representation of
Transition Matrices
.Efficient representation speeds up computations by reducing memory usage and computational time.
Sparse matrices are used to store only non-zero entries, which is common in large graphs.
Advantages of Sparse Matrices:
Memory Efficiency: Specialized formats reduce memory usage.
Speed: Faster operations due to skipping zero entries.
Scalability: Enables analysis of larger datasets.
Better Algorithms: Optimized routines enhance performance.
Efficient Computation of PageRank
Iterative Methods to Compute PageRank
The PageRank algorithm is most commonly computed using iterative methods, which approximate the rank values over multiple
iterations until convergence.
Gauss-Seidel Method:
An optimization of the power iteration is the Gauss-Seidel method. In this approach, as soon as a component of the PageRank
vector is updated, it is used immediately in the next calculation, rather than waiting for the next iteration. This can lead to faster
convergence.
PageRank is a widely used algorithm for ranking web pages based on their importance. Implementing PageRank using MapReduce
allows for efficient processing of large-scale web graphs. Here’s a step-by-step overview of the process:
Mapper
Initial PageRank assignment: Each mapper is responsible for assigning an initial PageRank value to each node (web page) in the
graph. This value is typically set to a random value or a uniform distribution.
Emitting outlinks and PageRank: For each node, the mapper emits a key-value pair containing the outlinks (incoming links) and the
initial PageRank value.
Example: If a page has a rank of 10 and links to 5 other pages, the mapper assigns 2 rank points (10 ÷ 5) to each linked page.
Reducer
Aggregating PageRanks: The reducer receives the key-value pairs from the mapper and aggregates the PageRank values for each
node based on its outlinks.
Calculating new PageRank: The reducer applies the PageRank formula to calculate the new PageRank value for each node:
PR(x) = (1 - damping factor) + damping factor * sum(PR(a) * weight(a), a in in_links)
Example: If a page gets contributions from three other pages (say, 2, 4, and 3 rank points), the reducer sums them up to get the new rank
(2 + 4 + 3 = 9).
** Emitting updated PageRank**: The reducer emits the updated PageRank value for each node, along with its outlinks.
PageRankImplementation Using
MapReduce
Convergence
Iterative calculation: The reducer continues to iterate, recalculating PageRanks based
on the updated values from previous iterations.
Convergence check: The algorithm checks for convergence by comparing the differences
between new and old PageRank values. If the difference falls below a threshold, the
algorithm terminates.
Key considerations
Damping factor: A value between 0 and 1 that controls the influence of outgoing links
on a node’s PageRank. A higher damping factor means more emphasis on outgoing
links.
Weighting: The weights assigned to outgoing links can be used to represent the
importance of each link.
Scalability: MapReduce’s parallel processing capabilities make it well-suited for large-
scale web graph processing.
PageRankImplementation Using
MapReduce
In summary:
Mapper spreads the rank from each page to the pages it links to.
Reducer combines the ranks for each page.
Convergence means the algorithm has stabilized, and the ranks are
accurate enough to stop the process.
Conclusion
PageRank is a powerful algorithm used to rank web pages by measuring the importance of nodes in a
network based on their link structure. Efficient representation of transition matrices, such as using
sparse matrices, reduces storage and computational complexity, making large-scale implementations
feasible. Efficient computation of PageRank, leveraging methods like power iteration and matrix-vector
multiplication, further accelerates the process. PageRank's implementation using MapReduce enables
parallelization, distributing the computation across multiple machines for handling massive datasets,
making the algorithm scalable for web-scale graphs.
HARSH PATIL’S GROUP
TOPIC ALLOTMENT
NANDAN’S GROUP
US
AFTER REPEATING THE SAME TOPIC IN 3 PPT’S
LE US: