PageRank Algorithm Journal
PageRank Algorithm Journal
PageRank Algorithm Journal
net/publication/314235791
PageRank Algorithm
CITATIONS READS
0 1,848
2 authors:
Some of the authors of this publication are also working on these related projects:
All content following this page was uploaded by Albi Dode on 05 March 2017.
PageRank Algorithm
Albi Dode1, Silvester Hasani2
1
(Student, School of Innovation, Design and Engineering, Mälardalens Högskola, Sweden)
2
(Student, Computer Science Department, University of New York Tirana, Albania)
Abstract: The way in which the displaying of the web pages is done within a search is not a mystery. It involves
applied math and good computer science knowledge for the right implementation. This relation involves vectors,
matrixes and other mathematical notations. The PageRank vector needs to be calculated, that implies
calculations for a stationary distribution, stochastic matrix. The matrices hold the link structure and the
guidance of the web surfer. As links are added every day, and the number of websites goes beyond billions, the
modification of the web link’s structure in the web affects the PageRank. In order to make this work, search
algorithms need improvements. Problems and misbehaviors may come into place, but this topic pays attention to
many researches which do improvements day by day. Even though it is a simple formula, PageRank runs a
successful business. PageRank may be considered as the right example where applied math and computer
knowledge can be fitted together.
Keywords: Algorithms, Distribution, Google, Math, PageRank.
I. Introduction
PageRank is a topic widely discussed by Search Engine Optimization (SEO) experts. The heart of
PageRank is a mathematical formula which seems to be very complicated, but it is actually simple to understand
if simplified. It is necessary for the web developers to understand the concept of PageRank to make the site
more high ranked. There is a majority of websites available in the market today. If a person sits and calculates
the PageRank for different websites, this may be a tedious work. Since automation is making the life of a person
easier, automating PageRank will have a better effect on the World Wide Web. PageRank is an algorithm that
drives Google. It was invented by Larry Page and Sergey Brin while they were graduate students at Stanford
University and later it became a trademark of google in 1998. PageRank does not have an impact only in the
programming industry, but also have an effect in the economic sector. It is also considered as a business goal of
every firm to be ranked higher in the web page display, that is considered as a SEO strategy. Hyperlink structure
along with SEO strategy plays a vital role in the Page ranking.
There is a math behind every algorithm. Matrix and vector can be considered as the main source of
many achievements. Since the amount of websites is growing day by day, the web sites have to be ranked using
certain algorithms. The specialty of Google’s page rank is that it does not allow spams, which are webpages that
are coded in such a way to manipulate the ranking results and that go against the guidelines established by
Google. It also focuses on the importance of the page when it is pointed by other important nodes. The
functioning of PageRank algorithm also realizes the importance of linear equations and graphs. This report will
provide a short summary of math behind the page ranking algorithm. And also explains the problems in the
current scenario and suggestions for it.
The matrix H for the directed graph with the vertex pointing from i to j is given as below.
Fig.2: Matrix H
The below table demonstrates the number of inlinks for each and every node.
Table 1: Illustration of the number of links of each node of the diagram
Node A B C D E
Number of Inlinks 2 2 2 2 4
In this case, the problem stands on getting spams when a page backlink to other pages. But still, the rank of a
page totally relies on backlinks.
This is determined by:
eq.(1), where the rank value r(p) linking to page p is calculated by rank of the page to the out
degree of that page[3].
It is also thought that a link to a page coming from an important page should boost the page`s importance more
than a link coming from non-important one.
To illustrate that with our example in the figure 1, we will have these steps:
Step 1. Take a 0.85 * a page’s PageRank, and divide it by the number of out links on the page. Step 2. Add that
amount onto a new total for each page it’s being passed to.
Step 3. Add 0.15 to each of those totals.
As we start at zero, we will have 0.85*0=0. This leads that each page will get 0.15 as 0.15+0=0.15. But still we
have the importance based on links. So, now calculations become:
Page A links to pages B, C and E. Page A’s PageRank is 0.15 so it will add 0.85* 0.15 = 0.1275 to the new
PageRank scores of the pages it links to. There are three of them so they each get 0.0425.
Page B links to page A,C,E. Page B’s PageRank is 0.15 so it will add 0.85 * 0.15 =0.1275 to the new PageRank
score of the pages it links to. Since it links to page A,C,E, each gets 0.0425. Page C links to Page A,E , each
0.06375. Page D links to Page B,C,E. each 0.0425.Page E links to none.
DOI: 10.9790/0661-1901030107 www.iosrjournals.org 2 | Page
PageRank Algorithm
As a result,
Page A: 0.15 (base) + 0.1275 +0.0425(from Page C,B) = 0.35
Page B: 0.15 (base) + 0.0425+ 0(from Page A,D) = 0.1925
Page C: 0.15 (base) + 0.0425 (from Page A) + 0.0425 (from Page D) = 0.235
Page D: 0.15 (base) + 0.0425 (from Page B) = 0.1925
Page E: 0.15 (base)
eq.(2),
where di holds the number of the outgoing links of a page i. This recursive function gives a rank to the
page which is weighted by the out-degree number. In order to numerically calculate the above equation, it is
necessary that the graph should possess convergence. Strongly connected graph always has the convergence
property, approaching to a limit, and it is achieved by adding some set of outgoing transitions with probability in
each page. This makes the above equation to be modified as,
eq.(3),
where e is the vector of all 1’s and A is the matrix if there exists a link between i and j and 0 otherwise.
Usually, the value of c is considered as 0.85 or 0.9. It is also necessary to remove all rows and columns from the
matrix A which has 0 entries. This means that we are removing pages with zero out degree i.e., dangling pages
or sink pages. There are cases when the percentage of “follow link” reaches 0, then all pages will nearly have
the same PageRank. The intention behind the above PageRank algorithm can be better understood by a random
surfer method. When a surfer starts browsing from one web page to the other, the navigation to the other page
can be done by following the link provided in the current page uniformly at random. There might be a case
when the surfer reaches a page without any outgoing links then he/she may jump randomly to any other page.
The PageRank value of a page is determined by the probability that at some time the surfer has reached that
page [5].
The matrix for the above diagram using its outlinks is given as:
Let us consider x1, x2, x3 and x4 as the importance of every page. Analyzing the above problem, we can give
the importance of each page as follows:
Among the eigenvectors of the matrix A, we pick the one which consists of only non-negative real values. This
is the eigenvector corresponding to the eigenvalue 1=1, whose entries are:
Since the PageRank has to calculate the relative importance between nodes and the eigenvalues is just a
scalar multiples of each other, we can choose a PageRank vector. This should be the unique values that should
make all the entries sums to 1. It is often referred as probabilistic eigenvector corresponding to 1. The
eigenvector below is the PageRank Vector.
eq.(4),
where PageRank pi is termed as the proportion of time our random surfer spends on a webpage i when the surfer
is allowed to move freely over the webpages.
Inspired by this problem, researchers are focusing in a new area called Personalization. It is an area yet
to be explored as the mathematical calculations are hard. A best known technique is that of TrustRank which
only used the good pages. They rely on the expansion of a set of good pages. The idea is to gain trust from the
good pages and recursively go to the outgoing links. But the problem that arises here is that intruders can leave
bad links somewhere in the page. The solution for this is found by inserting a threshold value, if a page is below
threshold then it is considered as spam [4].
XIII. Conclusion
So far, we have seen how Google calculates PageRank and how its matrix is a combination of
stochastic matrices of link structure and behavior of web surfer. A strong property in the success of this
algorithm is that it uses information not on the page content itself but on the linkage connection between pages.
As the inspiration was from the ranking of scientific articles, PageRank itself can be thought as a link review.
The formula is iteratively calculated and as the web pages keep growing in number and not in a connected
manner, a damping constant is used to ease the duty. The mathematic that goes on with eigenvectors and
eigenvalues makes it such that it converges. There are many factors that make it work. The best value 0.85, as
analyzed in our article, also this convergence speed gives the best ranking. Another good thing besides
removing spamming is that PageRank is query independent. It is an open topic for further improves as storage
algorithms and speed improvements are always welcomed, as it is an area for both computer scientists and
numerical analysts. Further improvements for the future can be of a PageRank algorithm that can understand
the semantic of the sites as more and more complex questions can be answered in a matter of second.
References
[1]. Internet Stats, 2016). Retrieved 2016, from https://internetlivestats.com
[2]. Damping Factor, 2016). Retrieved 2016, from Forum DigitalPoint: https://forums.digitalpoint.com/threads/what-is-the-damping-
factor-in-page-rank-and-whats-the-concept-behind-using-it.1885967/
[3]. BIANCHINI, M., GORI, M., & SCARSELLI, F. (2005). Inside PageRank. ACM Transactions on Internet Technology, 92-128.
[4]. Srivastava, S., & Aggrawal, R. R. (2012). A Modified Algorithm to Handle Dangling Pages using. International Journal of
Computer Applications.
[5]. Chen, Y.-Y., Gan, Q., & Suel, T. (2002). I/O Efficient Techniques for Computing Pagerank.
[6]. Langville, A. N., & Meyer, C. D. (2004). Deeper Inside PageRank. Retrieved from http://meyer. math.ncsu. edu/meyer/ps_ files/
deeperinsidepr.pdf
[7]. Tibshirani, R. (2013). Retrieved from PageRank: http://www.stat.cmu.edu/~ryantibs/datamining/lectures/03-pr.pdf
[8]. Hareshkumar, N., & Garg, D. D. (2011). Random Web Surfer PageRank Algorithm. International Journal of Computer
Applications.
[9]. Xie, Y., Huang, T.-Z., Wen, C., & Wu, D.-A. (2013). An Improved Approach to the PageRank Problems. Journal of Applied
Mathematics.