Page Rank With 13 Cases

Download as pptx, pdf, or txt
Download as pptx, pdf, or txt
You are on page 1of 72

PAGERANK

GROUP 3

ANSHURAJ KAWLECHA
205123021
ARADHYA GUPTA
205123023
ASHISH
205123025
BHARAT BISHNOI
205123027
TOPICS
• INTRODUCTION OF PAGERANK
• HISTORY
• MATHEMATICAL REPRESENTATION
• STRATEGIES FOR PAGERANK
• TOOLS FOR PAGERANK
• ADVANTAGES AND DISADVANTAGES
• DIFFERENT CASE STUDIES
Introduction

 PageRank is a ranking algorithm developed by


Google founders Larry Page and Sergey Brin to
measure the importance of web pages on the
Internet.

 It assesses pages based on their incoming links.


 Key idea: A page is more valuable if linked to by
other important pages.
 Revolutionized web search by improving
relevance and reliability.
Historical Context

PageRank was developed in the late 1990s during the rapid rise of
the internet ,as the number of web pages increased significantly.

•Existing search engines relied on keyword frequency for


ranking, which led to manipulation through techniques like
keyword stuffing.
•PageRank was introduced in the research paper
"The PageRank Citation Ranking: Bringing
Order to the Web," proposing a new way to rank
pages based on their links.

This approach marked a significant shift in


information retrieval, enabling search engines to
provide more relevant and trustworthy results.
The Problem Before PageRank

 Before PageRank came along, search engines had


several problems:
1. Irrelevant Results: Search results were often
filled with pages that didn’t really match what
users were looking for.
2. Ranking Tricks: Some website owners used
sneaky methods to boost their page rankings,
making it hard to trust the results.
3. Handling Growth: As more websites appeared,
traditional search methods struggled to keep up
and deliver useful results.
4.No Trust Evaluation: There wasn’t a good way to tell
which pages were reliable or important, making it difficult
for users to find quality information.

These issues showed that a new approach was needed,


leading to the creation of PageRank, which improved how
search engines ranked pages by looking at their links.
Graph
Representation
of Social  Nodes: Individuals or
Networks entities in the network.
 Edges: Relationships
between nodes (e.g.,
friendships).
 PageRank models the
network as a directed
graph, with edges
representing flow of
influence.
Centrality Measures

 PageRank is a global centrality measure based on


connectivity.
 Other related centrality measures:
 Degree Centrality: Number of direct connections a
node has.
 Betweenness Centrality: How often a node
appears on the shortest path between others.
 Closeness Centrality: How close a node is to all
other nodes.
Random Walk and Markov
Chains

 PageRank is based on the idea of a random walk,


where a 'surfer' moves randomly between nodes.
 The probability of landing on a node represents its
PageRank score.
 This can be modeled using Markov chains with
transitions representing links.
Damping Factor

 The damping factor is a critical component of the


PageRank algorithm, influencing how the algorithm
ranks web pages based on their link structure. It
addresses the problem of infinite loops that can
occur in the context of the web’s interconnected
nature.
 The damping factor is a probability value typically
set between 0 and 1, with a common default value of
around 0.85.
 It represents the likelihood that a user, when
navigating through the web, will continue clicking on
links rather than randomly jumping to a new page.
Purpose of the Damping Factor

•Preventing Infinite Loops: In a web context, a user could


theoretically get stuck in a cycle of pages (e.g., two pages that
link to each other). The damping factor helps mitigate this by
introducing a random jump, ensuring the algorithm does not get
trapped in such loops.

•Incorporating Randomness: The damping factor introduces a


degree of randomness to the user's navigation behavior, which
reflects more realistic browsing patterns. This randomness helps
ensure that even less popular or poorly linked pages have a
chance to be considered in the ranking.
Mathematical
Representation

 In mathematical terms, the PageRank of a page P


can be expressed as:

PageRank (PR) = (1 - d) / + d * Σ (PR(j) / L(j))

Let's break down this formula:

● "d" (the damping factor) represents the probability


that a user will click on a link.
● "PR(j)" is the PageRank of a specific page. "L(j)"
signifies the number of outbound links on page "j".
● "L(j)" signifies the number of outbound links on
page "j".
Basic Concepts

 PageRank form a probability distribution over web


pages, so the sum of all web pages' PageRank will be
one.
 The algorithm determines the initial PageRank value for
each web page. This initial value can be uniform or
based on certain factors, such as the number of
incoming links to the page. The algorithm then
repeatedly calculates the PageRank value of each page,
taking into account the PageRank value of the pages
that are related to the pages. During each iteration, the
PageRank value of the page is updated based on the
sum of the PageRank values of the incoming links.
Pages with more inbound links have a more significant
impact on the landing page's PageRank
Continue…

 PageRank is divided by its number of outbound


links, which divides its influence on the pages it
links to. The iterative process continues until the
PageRank values converge, indicating that the
algorithm has reached a stable rank. The resulting
PageRank scores describe the relative importance
of each web page. Pages with a higher PageRank
score are considered more influential and likely to
appear higher in search engines. PageRank was a
breakthrough in Internet search because it provided
a quantitative measure of the importance of a page
that was largely independent of keyword searches
and other traditional ranking factors.
PSEUDO-CODE

Iterative Algorithm:
for iteration in range(max_iterations):
new_rank = {}
for node in G.nodes:
new_rank[node] = (1 - d) / n # Initialize with (1 - d)/n, where n is the total number of nodes
for node in G.nodes:
for neighbor in G.neighbors(node):
num_neighbors = len(G.neighbors(neighbor))
new_rank[neighbor] += d * (current_rank[node] / num_neighbors) # Update the rank
of the neighbor
# Check for convergence
max_diff = max(abs(new_rank[node] - current_rank[node])
if max_diff < tol:
break
current_rank = new_rank # Update the current rank with the newly calculated ranks
Some Example Based on
damping factor
CONTINUE…
CONTINUE…
STEPS FOR CALCULATION

 1.Rank Distribution:
As the algorithm progresses, the PageRank of the page is
distributed among the outgoing links. For example, if a page has
a high PageRank and many outbound links, each link will
contribute to the overall impact of the page. This division ensures
that the importance of linked pages is shared.
CONTINUE…

 2.Convergence:
The iterative process continues until the PageRank score
stabilizes or converges. Convergence occurs when the difference
in PageRank scores between successive iterations falls below a
certain threshold. At this point, the algorithm has reached a
stable ranking, and the PageRank scores indicate the relative
importance of each web page.
CONTINUE…

 3.Ranking and Display:


Pages are ranked based on their final PageRank scores. Pages
with a higher PageRank score are considered more influential or
essential Search engines can use these points to display search
results, so pages with higher rankings are usually shown closer to
the top. By considering the link structure and updating the
PageRank score iteratively, the algorithm effectively measures
the importance of web pages relative to others. It allows ranking
pages based on their popularity and influence, helping to develop
more accurate and relevant search engines.
Adjacency Matrix
Representation

 The network can be represented as an adjacency


matrix, with elements indicating links between
nodes.
 PageRank uses a stochastic matrix derived from
this adjacency matrix, normalized by each node's
out-degree.
Power Iteration Method

 An iterative algorithm used to compute PageRank


scores.
 Process:
• Initialize all PageRank scores equally.
• Update each node's score based on incoming links
and their respective scores.
• Repeat until the scores converge.
Influence and Diffusion
Models

 PageRank can be used in influence models to


identify influential nodes that affect information
spread.
 Common models:
• Linear Threshold Model: Influence is adopted
based on neighbor influence once a threshold
is reached.
• Independent Cascade Model: Each influenced
node has a probability of influencing its
neighbors.
Variations of PageRank

 Personalized PageRank: Bias random walk


towards specific nodes for personalized relevance.
 Topic-Sensitive PageRank: Adjust the algorithm
to give more weight to nodes relevant to specific
topics.
Multi-layered Networks

 Networks often consist of multiple layers (e.g.,


social and professional).
 PageRank can analyze complex interactions
across these layers, such as how professional
networks influence social relationships.
DIFFERENT PAGERANKING
STRATEGIES
1.PAGERANK
Google's original algorithm that measures page importance based
on incoming links.
•Example:

•Page B only receives a link from Page A, while Page C receives


links from both Page A and Page D.

•Page A links to both Page B and Page C.

•Because Page C has more inbound links, it is considered more


important and thus receives a higher PageRank than Page B.
2.HITS (Hyperlink-Induced Topic Search)
Description: Identifies two roles – hubs (pages linking
to many others) and authorities (pages linked by
many).
• Example :Page X links to several important pages
like Page Y and Page Z, making Page X a hub.
• Page Z is an authority because it is linked by
multiple pages, including Page X and Page Y.
3.SALSA (Stochastic Approach for Link-Structure Analysis)

•Description: Combines aspects of PageRank and HITS, using a


bipartite graph structure to rank pages.

•Example:

• In an e-commerce website graph, User 1 visits Product Page A


and Product Page B.
• These user visits make Product Page A and Product Page B
more authoritative in the overall ranking system.
4.TrustRank
Description: Trust is propagated from manually
selected “trusted” seed pages to combat spam.
• Example : A trusted site like TechCrunch links to
Page K. Since TechCrunch is highly trusted, Page
K receives a boost in PageRank and trustworthiness.
• On the other hand, an untrusted or spammy site
linking to Page K wouldn’t give it any significant
boost.
5.Topic-Sensitive PageRank
Description: Computes different PageRank scores
based on the relevance of specific topics.
• Example : Page M specializes in "healthy eating,"
while Page N focuses on general food recipes.
• For a query related to "healthy eating tips," Page M
ranks higher due to its topic-specific relevance,
even if Page N has more overall links.
6.Time-Aware PageRank
Description: Newer pages receive more weight to
rank time-sensitive content higher.
• Example : A user searches for "latest AI
advancements."
• A new article published last month (let’s call it Page
T) ranks higher than an older article from two years
ago, ensuring the most up-to-date content is
displayed first.
Comparison of PageRank Strategies

Strategy Best Use Case Strengths Weaknesses


PageRank General web ranking Simple, widely used, Can be manipulated
global metric (link farms, spam)

HITS Identifying hubs and Differentiates hubs from Computationally


authorities authorities intensive, query-
dependent
SALSA User-page bipartite Combines HITS and Complex to
networks PageRank features implement

TrustRank Reducing spam, trust- Boosts trusted sources, Relies on trusted seed
based reduces spam page selection

Topic-Sensitive Personalized, topic- More relevant results Higher computational


PageRank specific ranking for specific queries cost

Time-Aware PageRank Time-sensitive content Prioritizes newer pages Disadvantages older


ranking important content
Comparison of Tools to Calculate
PageRank

Tool Best For Advantages Disadvantages

NetworkX (Python) Small to medium Easy to use, good for Slow for large datasets
datasets quick prototyping

Gephi Visualizing data Great for visualizing Limited for large


networks datasets

GraphX (Apache Spark) Large datasets Distributed computing Requires knowledge of


for large graphs Spark and distributed
systems
Neo4j Graph databases Efficient real-time Requires graph
querying database setup

SNAP Large-scale datasets High-performance graph Requires programming


analysis knowledge (Python/C+
+)
GraphTool Large graphs Fast, optimized with C+ More complex than
+ backend NetworkX
Applications of PageRank in
Social Networks

 Community Detection : PageRank helps in detecting and ranking


important members within social network communities by analyzing the
link structure between users.
 Recommendation Systems: By ranking users or content based on
link structures, PageRank can be used to suggest relevant people or
content in social media platforms, improving personalization.
 Influence Analysis: Identifies the most influential users in a network
by ranking nodes based on their connections and how often they are
referenced by others.
 Fraud Detection: Detects suspicious behavior by analyzing patterns of
connections between entities, useful in financial or corporate networks.
 Epidemic Spread Modeling: Predicts how diseases or information
spread in networks by ranking individuals or groups most likely to
contribute to the spread.
ADVANTAGES

 The PageRank algorithm, originally designed for ranking


web pages, offers several advantages when applied to
social network analysis. It helps identify influential
individuals and understand the flow of information or
influence within a network.
1. Considers Both Quantity and Quality of
Connections
Quality of Connections: Unlike simple centrality
measures like degree centrality, which only count the
number of connections (edges), PageRank takes into
account the importance of the nodes that a particular
node is connected to. This means that a person
connected to a few influential individuals may have a higher
rank than someone with many less influential connections.
Balanced Evaluation: This results in a more accurate
reflection of influence in a social network, giving a balanced evaluation
based on both the number and importance of connections.

2.Resilient to Spam and Fake Accounts


• Prevention of Influence Manipulation: In social media or other
networks, individuals might try to artificially inflate their influence by
creating many fake accounts (or spam nodes) to boost their centrality.
However, PageRank is harder to manipulate since it places more value
on being connected to already influential nodes, not just the total
number of connections.
• Focus on Authentic Connections: The algorithm reduces the
impact of such "spammy" connections, making it ideal for analyzing real
influence in a network.
3. Detects Key Influencers
• Finding Influential People: PageRank effectively identifies people
who play critical roles in the network, such as opinion leaders,
community organizers, or key connectors who have influence over
others.

• Applications in Marketing and Outreach: In marketing, this is


particularly useful for targeting influencers who can amplify messages
and content through their connections, helping companies maximize
outreach in social networks

4. Accounts for Indirect Influence


• Influence Beyond Immediate Connections: PageRank takes
into account not only direct connections but also indirect ones. A node’s
importance is influenced by the rank of the nodes it connects to, which
means that indirect relationships can still contribute to a node’s
influence.
• Greater Network Influence Awareness: This allows for a
deeper understanding of influence that goes beyond simply
counting direct relationships, giving a fuller picture of someone's
role within a larger social structure.

5 Effective in Large-Scale Networks


• Scalability: PageRank is scalable and works well with large
networks. Since it was designed for the web, it can handle networks
with millions or even billions of nodes and edges, making it suitable
for analyzing large social networks like Facebook, Twitter, or
LinkedIn.
• Efficient Ranking: The algorithm can efficiently process these
networks to provide accurate rankings without excessive
computational cost.
6.Adaptable to Weighted Networks

• Weighted Connections: PageRank can be adapted to weighted


networks, where certain connections carry more importance than others
(e.g., frequent interactions between two people could be weighted more
heavily than infrequent interactions). This flexibility allows the algorithm
to better model real-world social interactions.
DISADVANTAGES

1. Bias Toward Older or Established Nodes


• Tends to favor older, well-connected nodes,
making it hard for new or emerging nodes to gain
influence quickly.
2. Damping Factor Sensitivity
• The choice of damping factor (commonly set at
0.85) can affect results significantly, and finding
the optimal value is not always straightforward.
• Different networks may require different damping
factors, making the algorithm less adaptable
without fine-tuning.
3. Difficulty in Handling Dynamic Networks
•PageRank struggles with rapidly changing networks, where new
connections are formed or old ones are lost frequently.
•Constant recalculation is computationally expensive for
dynamic, large-scale social networks.

4.Assumes Links Are Positive


• PageRank assumes all connections between nodes are positive
and contribute to influence. It does not consider negative
relationships (e.g., competitors or adversaries), which can also
affect social dynamics.
5.Difficulty in Representing Multilayer Networks
• In complex social networks with multiple types of relationships
(e.g., family, professional, and casual ties), PageRank struggles
to accurately reflect influence across these different layers of
interaction.
6. Vulnerability to Manipulation in Some Networks
• In some contexts, users can still manipulate the algorithm by
creating artificial or "spam" connections to boost their
ranking, especially in smaller networks where a few
connections can have a large impact.
LIMITATIONS OF PAGERANK

1. Focus on Direct Connections


PageRank only considers the structure of
connections and does not account for other
important factors like the nature, strength, or
context of those connections (e.g., the difference
between close friends and casual acquaintances).
2. Bias Toward Highly Connected Nodes
Nodes that have been in the network for a long
time or have many connections tend to dominate,
even if their current level of influence has
decreased over time. This can prevent newer or
emerging nodes from being accurately ranked.
3. Inability to Handle Negative Relationships

The algorithm assumes that all connections (edges)


between nodes are positive or neutral. It doesn’t
account for negative or competitive relationships,
which can significantly impact influence in real-world
social networks.

4. Equal Weight for All Links

Without modification, PageRank treats all links as


equal in importance. In real-world social networks,
some relationships are more influential than others,
which is not captured by the basic algorithm.
5. Static Nature

PageRank was originally designed for static networks,


meaning it struggles with dynamic networks where
nodes and connections change frequently (e.g., fast-
growing social media networks).
Challenges in Applying PageRank to Social
Networks

 Computational Complexity
Calculating PageRank for large-scale social networks (e.g.,
millions or billions of users) is computationally expensive
and time-consuming, requiring iterative processing for
convergence.
 Choosing the Right Damping Factor
The choice of the damping factor (usually set to 0.85) can
greatly impact the results, but determining the optimal
value for different networks is not straightforward. It can
require experimentation and tuning.
 Handling Weighted and Multilayer Networks
Social networks often involve different types of relationships
(e.g., professional, personal, casual), and PageRank does not
natively support multilayer networks. Adapting the algorithm
to weight connections based on their strength or importance
can be complex.

 Manipulation in Smaller Networks

In smaller or niche networks, users may still be able to


manipulate their rank by creating artificial or spammy
connections, thereby inflating their perceived influence.

 Integration with Other Metrics

While PageRank is useful, it is often not sufficient on its own.


Combining it with other centrality measures (e.g., degree
centrality, betweenness centrality) is necessary to get a fuller
picture of network influence, but integrating multiple metrics
can be challenging

 Slow Adaptation to Changes

When the network evolves (e.g., new influential people


emerge), PageRank may be slow to adjust, especially in
networks where influence shifts rapidly. It takes time for the
algorithm to converge after significant structural changes.
CASE STUDIES:

 CASE 1:
• Page D is not having any incoming link , But still it is having
some PR this is because of Concept of Random Surfer.
• We are having Damping factor = 0.85. Which means a surfer
can visit a page via a link with probability of 0.85 and can
teleport to a random page with a probability of 1 - 0.85 =
0.15
• Formula used for page rank = (1-d) + d*(pr(t1)/c(t1) +
pr(t2)/c(t2) + . . . . .)
• Pr(D) = (1-0.85) + 0.85 * 0 = 0.15
CASE 2: A simple hierarchy with some outgoing links

• Home page is having highest PR as it is having most number of


incoming links.
• Average has been dropped , because in this case we are having
External sites that are not having any outgoing link(sink pages).
• Since PageRank relies on the distribution of rank from one page to
another through links, pages that don’t have outgoing links stop the
flow of PageRank. The rank that accumulates on these pages doesn’t
contribute to the overall ranking of other pages
CASE 3:

• In this scenario, instead of having sink pages that trap the


PageRank, those pages link back to the home page or a central
page. This creates a feedback loop where the PageRank that would
have been lost is returned to a key page (the home page),
ensuring that it doesn't disappear but rather stays within the
system.
CASE 4: Writing page reviews

• When a review page links to the reviewed page, it acts as a "vote" in


the PageRank algorithm. The more quality links a page receives, the
higher its PageRank will be.
CASE 5: Concentration of PR in Hierarchy

• When pages are linked in levels or hierarchical manner , PageRank


accumulates more at the higher levels.
• Home page is on the top in hierarchy so it has highest PageRank.
CASE 6: Looping

• Since each page in the loop links to the same number of other
pages, the PageRank is distributed equally among all the
pages.
• No page receives more or less PageRank than the others, resulting
in an equilibrium where every page has the same rank (assuming
there are no external backlinks from other sites or pages).
CASE 7: Extensive linking or Fully Meshed linking

• Since every page links to every other page, the PageRank gets evenly
distributed among all the pages in the network.
• Each page effectively passes a portion of its PageRank to all other
pages, resulting in a situation where, after multiple iterations, each
page has nearly the same amount of PageRank (assuming there
are no external factors or links from outside this group).
CASE 8: Hierarchical - but with a link in and
one out.

• Here we’re assuming that there’s an external website (A)


with many pages and links, and one of its pages holds an
average PageRank of 1.0. We also imagine that the
webmaster has a strong preference for our site, and
there’s a single link from that page directing traffic to our
homepage.
Observations :

 The external site (A) with a high PR (1.0) linking directly to the
home page has a strong influence on the overall site. This single
link increases the PR of not just the home page but also indirectly
boosts the PR of linked internal pages ("About," "Product," and
"More") compared to previous (Case - 05) through a feedback
effect.

 The hierarchical structure of internal pages ensures that incoming


PR is amplified throughout the network. This means that when the
home page PR increases, the PR of child pages increases
proportionally, enhancing the site’s overall visibility and rank

 The way the pages within the site are linked together helps boost
the PR from the external link. This increase in PR gets shared
among other pages on the site, raising their PR as well.

Principle: a well structured site will amplify the effect of any


CASE 9: Looping - but with a link in and a
link out

 In this case , PR of our “Home” page has gone up a little, but


“More” page has gone down
Observation :

 In this case, by introducing an external link


reduces the PR of one of the internal pages (the
"More" page), as the vote from the "Product"
page is split between it and the external site.

 The home page’s PR rises slightly due to the


external link, but the PR for internal pages can
decrease when outgoing links to external sites
dilute the PR passed between internal pages.
Case 10:

Fully meshed, but it has one Vote in and one Vote out.
(here Vote is Name given to the Outgoing edge from a Node).
It is similar to the Case 7 but the only difference is these extra two votes. In a
fully meshed network with one vote in and one vote out, every page ends up
with an equal PageRank due to the circular nature of the links. This illustrates
how the PageRank algorithm evaluates link structure and the influence of
pages, leading to equal ranking in this specific setup
Observation:

1. Increasing the internal links in your site can minimize the


damage to the PageRank when the Votes are given by linking
external Nodes.

2. When all the node are having the same PageRank then this is
used by the Firms to make sure that the ranking of the page
is maintained.

3. “404-Page Not Found” error page can be replaced by a better


error page by using the above Page Ranking
method
Observations:
1. It is much worse than an ordinary hierarchy. Since, the
Pages C and D have very weak incoming links and does
not contribute anything for determining the PageRank
for the Page A.

Calculation of PageRank is very different from what we


observe
CASE 12:
When the documentation of the web Page is very long and it need to be
stored on the web, inorder to do so we split the long documentation into
various small documentation Pages. And this pages are linked with each
other using Circular Doubly linked list data structure, where a page will be
pointing to it’s “Previous” and the “Next” page and the last Page is point to
the home page
Observation:

1. In this Case we can see that there’s a possibility that there can exit a
Page which is having the PageRank greater than the Home Page. The
reason here is that the Page B is getting all the Vote from Page A but,
Page A is getting fraction of Pages from B,C and D.

2. As the number of the Pages to the site is increased then the


possibility is that the Page Rank for the Pages will be distributed
more evenly.

3. So to give the user a better site experience we need to take a hit


against our PR.
CASE 13:
Getting high PageRank the wrong, and the right way.
Here we do an experiment, we see weather we can get
1,000 spam pages pointing to the Home Page, but have
only one link leaving it.
Observation:

1. It does not matter how many pages do we use to make our site but the
average of the PageRank of all the combine will always be 1.000

2. A hierarchical layout can strongly concentrate votes, and therefore the


PageRank is irregular for all the Pages i.e. some have very high while
the other have very less.

3. This technique is used by some malicious sites to improve the page


rank.
REFERENCES

• Wikipedia

• GeeksForGeeks

• “Social Network Analysis” by “Tanmoy Chakraborty”


The End

Thank
You !

You might also like