Ranking of Closeness Centrality For Large-Scale Social Networks
Abstract. Closeness centrality is an important concept in social network analysis. In a graph representing a social network, closeness centrality measures how close a vertex is to all other vertices in the graph. In this paper, we combine existing methods on calculating exact values and approximate values of closeness centrality and present new algorithms to rank the top-k vertices with the highest closeness centrality. We show that under certain conditions, our algorithm is more efficient than the algorithm that calculates the closeness centralities of all vertices.
Introduction
Social networks have been the subject of study for many decades in social science research. In recent years, with the rapid growth of the Internet and the World Wide Web, many large-scale online social networks such as Facebook and Friendster have appeared, and many large-scale social network data sets, such as coauthorship networks, have become easily available online for analysis [Ne04a, Ne04b, EL05, PP02]. A social network is typically represented as a graph, with individual persons represented as vertices, the relationships between pairs of individuals as edges, and the strengths of the relationships represented as the weights on the edges (for the purpose of finding the shortest weighted distance, we can treat lower-weight edges as stronger relationships). Centrality is an important concept in studying social networks [Fr79, NP03]. Conceptually, centrality measures how central an individual is positioned in a social network. Within graph theory and network analysis, various measures (see [KL05] for details) of the centrality of a vertex within a graph have been proposed to determine the relative importance of a vertex within the graph. Four measures of centrality that are widely used in network analysis are degree centrality, betweenness centrality (Footnote 4), closeness centrality, and eigenvector centrality (Footnote 5). In this paper, we focus on shortest-path closeness centrality (or closeness centrality for short) [Ba50, Be65].

Footnote 4. For a graph $G = (V, E)$, the betweenness centrality $C_B(v)$ of a vertex $v$ is
$$C_B(v) = \sum_{s,t \,:\, s \neq t \neq v} \frac{\sigma_{st}(v)}{\sigma_{st}},$$
where $\sigma_{st}$ is the number of shortest paths from $s$ to $t$, and $\sigma_{st}(v)$ is the number of shortest paths from $s$ to $t$ that pass through $v$.

Footnote 5. Given a graph $G = (V, E)$ with adjacency matrix $A$, let $x_i$ be the eigenvector centrality of the $i$-th node $v_i$. Then the vector $x = (x_1, x_2, \ldots, x_n)^T$ is the solution of the equation $Ax = \lambda x$, where $\lambda$ is the greatest eigenvalue of $A$, to ensure that all values $x_i$ are positive by the Perron-Frobenius theorem. Google's PageRank [BP98] is a variant of the eigenvector centrality measure. The PageRank vector $R = (r_1, r_2, \ldots, r_n)^T$, where $r_i$ is the PageRank of webpage $i$ and $n$ is the total number of webpages, is the solution of the equation
$$R = \frac{1-d}{n}\,\mathbf{1} + d\,L\,R.$$
Here $d$ is a damping factor set around 0.85, and $L$ is a modified webpage-adjacency matrix: $l_{ij} = 0$ if page $j$ does not link to page $i$, and the columns are normalised such that $\sum_{i=1}^{n} l_{ij} = 1$ for each $j$, i.e., $l_{ij} = \frac{a_{ij}}{d_j}$, where $a_{ij} = 1$ only if page $j$ has a link to page $i$, and $d_j = \sum_{i=1}^{n} a_{ij}$ is the out-degree of page $j$.
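For concreteness, here is a minimal power-iteration sketch of the PageRank equation in Footnote 5 (our own illustration, not part of the paper; the function name and edge-list representation are ours, and pages with no outgoing links are assumed away so that every column of $L$ normalises):

```python
# Illustration of Footnote 5 (our own sketch, not part of the paper):
# solving R = (1-d)/n * 1 + d*L*R by power iteration.
# Assumes every page has at least one outgoing link, so that each column
# of L is properly normalised (l_ij = a_ij / d_j).

def pagerank(links, n, d=0.85, iterations=100):
    """links: list of (j, i) pairs, meaning page j links to page i."""
    out_degree = [0] * n
    for j, _ in links:
        out_degree[j] += 1
    r = [1.0 / n] * n                        # start from the uniform vector
    for _ in range(iterations):
        nxt = [(1.0 - d) / n] * n            # the (1-d)/n * 1 term
        for j, i in links:
            nxt[i] += d * r[j] / out_degree[j]   # the d*L*R term
        r = nxt
    return r

# Tiny example: pages with links 0 -> 1, 1 -> 2, 2 -> 0, 2 -> 1.
print(pagerank([(0, 1), (1, 2), (2, 0), (2, 1)], n=3))
```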
The closeness centrality of a vertex in a graph is the inverse of the average shortest-path distance from the vertex to any other vertex in the graph. It can be viewed as a measure of the efficiency of each vertex (individual) in spreading information to all other vertices. The larger the closeness centrality of a vertex, the shorter the average distance from the vertex to any other vertex, and thus the better positioned the vertex is for spreading information to other vertices. The closeness centralities of all vertices can be calculated by solving the all-pairs shortest-paths problem, which can be solved by various algorithms taking $O(nm + n^2 \log n)$ time [Jo77, FT87], where $n$ is the number of vertices and $m$ is the number of edges of the graph. However, these algorithms are not efficient enough for large-scale social networks with millions or more vertices. In [EW04], Eppstein and Wang developed an approximation algorithm that calculates the closeness centralities of all vertices in $O(\frac{\log n}{\epsilon^2}(n \log n + m))$ time, within an additive error of $\epsilon \Delta$ for the inverse of the closeness centrality (with probability at least $1 - \frac{1}{n}$), where $\epsilon > 0$ and $\Delta$ is the diameter of the graph.

However, applications may be more interested in ranking the vertices with high closeness centralities than in the actual values of the closeness centralities of all vertices. Suppose we want to use the approximation algorithm of [EW04] to rank the closeness centralities of all vertices. Since the average shortest-path distances are bounded above by $\Delta$, the average difference in average distance (the inverse of closeness centrality) between the $i$-th-ranked vertex and the $(i+1)$-th-ranked vertex (for any $i = 1, \ldots, n-1$) is $O(\frac{\Delta}{n})$. To obtain a reasonable ranking result, we would like to control the additive error of each estimate of closeness centrality to within $O(\frac{\Delta}{n})$, which means we set $\epsilon$ to $\Theta(\frac{1}{n})$. Then the algorithm takes $O(n^2 \log n\,(n \log n + m))$ time, which is worse than the exact algorithm. Therefore, we cannot use either purely the exact algorithm or purely the approximation algorithm to rank the closeness centralities of vertices.

In this paper, we show a method for ranking the top-k vertices with the highest closeness centralities, combining the approximation algorithm and the exact algorithm. We first provide a basic ranking algorithm TOPRANK(k), and show that under certain conditions the algorithm ranks all top-k vertices with the highest closeness centralities (with high probability) in $O((k + n^{2/3} \log^{1/3} n)(n \log n + m))$ time, which is better than $O(n(n \log n + m))$ (when $k = o(n)$), the time needed by a brute-force algorithm that simply computes all average shortest distances and then ranks them. We then use a heuristic to further improve the algorithm. Our work can be viewed as a first step toward designing and evaluating efficient algorithms for finding the top-ranking vertices with the highest closeness centralities. We discuss at the end several open problems and future directions for this work.
Preliminary
We consider a connected weighted undirected graph $G = (V, E)$ with $n$ vertices and $m$ edges ($|V| = n$, $|E| = m$). We use $d(v, u)$ to denote the length of a shortest path between $v$ and $u$, and $\Delta$ to denote the diameter of the graph $G$, i.e., $\Delta = \max_{v, u \in V} d(v, u)$. The closeness centrality $c_v$ of vertex $v$ [Be65] is defined as
$$c_v = \frac{n-1}{\sum_{u \in V} d(v, u)}. \qquad (2.1)$$
In other words, the closeness centrality of $v$ is the inverse of the average (shortest-path) distance from $v$ to any other vertex in the graph. The higher $c_v$ is, the shorter the average distance from $v$ to other vertices, and the more important $v$ is by this measure. Other definitions of closeness centrality exist. For example, some define the closeness centrality of a vertex $v$ as $\frac{1}{\sum_{u \in V} d(v, u)}$ [Sa66], and some define the closeness centrality as the mean geodesic distance (i.e., the shortest-path distance) between a vertex $v$ and all other vertices reachable from it, i.e., $\frac{\sum_{u \in V'} d(v, u)}{n' - 1}$, where $n' \geq 2$ is the size of the network's connected component $V'$ reachable from $v$. In this paper, we focus on the closeness centrality defined in equation (2.1).

The problem to solve in this paper is to find the top-k vertices with the highest closeness centralities and rank them, where $k$ is a parameter of the algorithm. To solve this problem, we combine the exact algorithm [Jo77, FT87] and the approximation algorithm [EW04] for computing average shortest-path distances to rank vertices by closeness centrality. The exact algorithm iterates Dijkstra's single-source shortest-paths (SSSP for short) algorithm $n$ times, once for each of the $n$ vertices, to compute the average shortest-path distances. The original Dijkstra's SSSP algorithm [Di59] computes all shortest-path distances from one vertex, and it can be efficiently implemented in $O(n \log n + m)$ time [Jo77, FT87]. The approximation algorithm RAND given in [EW04] also uses Dijkstra's SSSP algorithm. RAND samples $\ell$ vertices uniformly at random and computes SSSP from each sample vertex. RAND estimates the closeness centrality of a vertex using the average of the shortest-path distances from the vertex to the $\ell$ sample vertices instead of to all $n$ vertices. The following bound on the accuracy of the approximation is given in [EW04], which utilizes Hoeffding's theorem [Ho63]:
$$\Pr\left\{\left|\frac{1}{\hat{c}_v} - \frac{1}{c_v}\right| \geq \epsilon \Delta\right\} \leq \frac{2}{n^{\frac{2\ell\epsilon^2}{\log n}\left(\frac{n-1}{n}\right)^2}}, \qquad (2.2)$$
for any small positive value $\epsilon$, where $\hat{c}_v$ is the estimated closeness centrality of vertex $v$. Let $a_v$ be the average shortest-path distance of vertex $v$, i.e.,
$$a_v = \frac{\sum_{u \in V} d(v, u)}{n - 1} = \frac{1}{c_v}.$$
With this notation, the bound (2.2) can be rewritten as
$$\Pr\left\{\left|\hat{a}_v - a_v\right| \geq \epsilon \Delta\right\} \leq \frac{2}{n^{\frac{2\ell\epsilon^2}{\log n}\left(\frac{n-1}{n}\right)^2}}, \qquad (2.3)$$
where $\hat{a}_v$ is the estimated average distance of vertex $v$ to all other vertices. If the algorithm uses $\ell = \frac{\alpha \log n}{\epsilon^2}$ samples (where $\alpha > 1$ is a constant), the probability of error at each vertex is bounded above by $\frac{1}{n^2}$, and the probability of an error anywhere in the graph is then bounded from above by $\frac{1}{n}$ ($\geq 1 - (1 - \frac{1}{n^2})^n$). This means that the approximation algorithm calculates the average shortest-path distances of all vertices in $O(\frac{\log n}{\epsilon^2}(n \log n + m))$ time, within an additive error of $\epsilon \Delta$, with probability at least $1 - \frac{1}{n}$, i.e., with high probability (w.h.p.).
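To make these preliminaries concrete, the following Python sketch (our own illustration, not the authors' code) computes the exact average distance $a_v$ and the RAND estimates $\hat{a}_v$. It assumes a connected weighted undirected graph given as an adjacency dict, and uses a binary-heap Dijkstra, whose $O((n+m)\log n)$ bound is slightly weaker than the $O(n \log n + m)$ Fibonacci-heap bound cited above:

```python
import heapq
import math
import random

def dijkstra(adj, s):
    """SSSP from s; adj[u] is a list of (v, w) pairs. Binary-heap version,
    O((n+m) log n), slightly weaker than the O(n log n + m) bound in the text."""
    dist = {s: 0.0}
    pq = [(0.0, s)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def exact_avg_distance(adj, v):
    """a_v = sum_u d(v, u) / (n - 1); the closeness centrality is c_v = 1 / a_v."""
    dist = dijkstra(adj, v)
    return sum(dist.values()) / (len(adj) - 1)

def rand_estimates(adj, num_samples):
    """RAND [EW04]: run SSSP from num_samples uniformly sampled vertices and
    average distances to the samples only; the n/(n-1) factor keeps the
    estimate of a_v unbiased."""
    n = len(adj)
    samples = random.sample(list(adj), num_samples)
    dists = [dijkstra(adj, u) for u in samples]   # one SSSP per sample
    a_hat = {v: n / (n - 1) * sum(d[v] for d in dists) / num_samples
             for v in adj}
    return a_hat, samples, dists
```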
Ranking algorithms
Our top-k ranking algorithm is based on the approximation algorithm as well as the exact algorithm. The idea is to first use the approximation algorithm with $\ell$ samples to obtain the estimated average distances of all vertices, and then find a candidate set $\hat{E}$ of the top-$k'$ vertices with the smallest estimated average distances. We need to guarantee that all final top-k vertices with the exact average shortest distances are included in the set $\hat{E}$ with high probability. Thus, we need to carefully choose a number $k' > k$ using the bound given in formula (2.3). Once we find the set $\hat{E}$, we can use the exact algorithm to compute the exact average distances for all vertices in $\hat{E}$ and rank them accordingly to find the final top-k vertices with the highest closeness centralities. The key of the algorithm is to find the right balance between the sample size $\ell$ and the candidate set size $k'$: if we use too small a sample size $\ell$, the candidate set size $k'$ could be too large, but if we try to make $k'$ small, the sample size $\ell$ may be too large. Ideally, we want an optimal $\ell$ that minimizes $\ell + k'$, so that the total time of both the approximation algorithm and the computation of exact closeness centralities of vertices in the candidate set is minimized. In this section, we show the basic algorithm first, and then provide a further improvement of the algorithm with a heuristic.

3.1 Basic ranking algorithm
We name the vertices in $V$ as $v_1, v_2, \ldots, v_n$ such that $a_{v_1} \leq a_{v_2} \leq \cdots \leq a_{v_n}$. Let $\hat{a}_v$ be the estimated average distance of vertex $v$ obtained by the approximation algorithm based on sampling. Figure 1 shows our basic ranking algorithm TOPRANK(k), where $k$ is the input parameter specifying the number of top-ranking vertices the algorithm should extract. The algorithm also has a configuration parameter $\ell$, which is the number of samples used by the RAND algorithm in the first step. We will specify the value of $\ell$ in Lemma 2. The function $f(\ell)$ in step 4 is defined as follows:
$$f(\ell) = \sqrt{\frac{\alpha \log n}{\ell}}$$
(where $\alpha > 1$ is a constant), such that the probability of the estimation error for any vertex being at least $f(\ell) \cdot \hat{\Delta}$ is bounded above by $\frac{1}{2n^2}$, based on inequality (2.3) (setting $\epsilon = f(\ell)$ and using $\hat{\Delta} \geq \Delta$, which is shown in the proof of Lemma 1).
Algorithm TOPRANK(k)
1. Use the approximation algorithm RAND with a set $S$ of $\ell$ sampled vertices to obtain the estimated average distance $\hat{a}_v$ for each vertex $v$.
   // Rename all vertices to $\hat{v}_1, \hat{v}_2, \ldots, \hat{v}_n$ such that $\hat{a}_{\hat{v}_1} \leq \hat{a}_{\hat{v}_2} \leq \cdots \leq \hat{a}_{\hat{v}_n}$.
2. Find $\hat{v}_k$.
3. Let $\hat{\Delta} = 2 \min_{u \in S} \max_{v \in V} d(u, v)$.
   // $d(u, v)$ for all $u \in S$, $v \in V$ have been calculated at step 1, and $\hat{\Delta}$ is determined in $O(\ell n)$ time.
4. Compute the candidate set $\hat{E}$ as the set of vertices whose estimated average distances are less than or equal to $\hat{a}_{\hat{v}_k} + 2 f(\ell) \cdot \hat{\Delta}$.
5. Calculate the exact average shortest-path distances of all vertices in $\hat{E}$.
6. Sort the exact average distances and find the top-k vertices as the output.

Fig. 1. Algorithm for ranking the top-k vertices with the highest closeness centralities.
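The following Python sketch mirrors Figure 1 step by step (an illustrative rendering under our reading of the pseudocode, reusing the hypothetical dijkstra, exact_avg_distance, and rand_estimates helpers sketched in Section 2; $\alpha > 1$ is the constant in $f(\ell)$):

```python
import math

def toprank(adj, k, ell, alpha=1.1):
    """Sketch of TOPRANK(k) from Figure 1 (illustrative, not the authors' code)."""
    n = len(adj)
    # Step 1: RAND with a set S of ell sampled vertices.
    a_hat, samples, dists = rand_estimates(adj, ell)
    # Step 2: the k-th vertex by estimated average distance.
    order = sorted(adj, key=lambda v: a_hat[v])
    kth_estimate = a_hat[order[k - 1]]
    # Step 3: Delta_hat = 2 * min over u in S of max over v of d(u, v).
    delta_hat = 2 * min(max(d.values()) for d in dists)
    # Step 4: candidate set E_hat, with f(ell) = sqrt(alpha * log n / ell).
    f = math.sqrt(alpha * math.log(n) / ell)
    candidates = [v for v in adj if a_hat[v] <= kth_estimate + 2 * f * delta_hat]
    # Steps 5-6: exact average distances for the candidates only, then rank.
    exact = {v: exact_avg_distance(adj, v) for v in candidates}
    return sorted(exact, key=exact.get)[:k]
```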
Lemma 1. Algorithm TOPRANK(k) given in Figure 1 ranks the top-k vertices with the highest closeness centralities correctly w.h.p., with any configuration parameter $\ell$.

Proof. We show that the set $\hat{E}$ computed at step 4 of algorithm TOPRANK(k) contains all top-k vertices with the exact shortest distances w.h.p. Let $T = \{v_1, \ldots, v_k\}$ and $\hat{T} = \{\hat{v}_1, \ldots, \hat{v}_k\}$. Since for any vertex $v$, the probability of the estimate $\hat{a}_v$ exceeding the error range $f(\ell) \cdot \hat{\Delta}$ is bounded above by $\frac{1}{2n^2}$, i.e., $\Pr\{\neg(\hat{a}_v - f(\ell)\hat{\Delta} \leq a_v \leq \hat{a}_v + f(\ell)\hat{\Delta})\} \leq \frac{1}{2n^2}$, we have
$$\Pr\left\{\bigwedge_{v \in T} \hat{a}_v \leq a_v + f(\ell)\hat{\Delta} \leq a_{v_k} + f(\ell)\hat{\Delta}\right\} \geq 1 - \frac{k}{2n^2}, \quad \text{and}$$
$$\Pr\left\{\bigwedge_{v \in \hat{T}} a_v \leq \hat{a}_v + f(\ell)\hat{\Delta} \leq \hat{a}_{\hat{v}_k} + f(\ell)\hat{\Delta}\right\} \geq 1 - \frac{k}{2n^2}.$$
The latter inequality means that, with error probability of at most $\frac{k}{2n^2}$, there are at least $k$ vertices whose real average distances are less than or equal to $\hat{a}_{\hat{v}_k} + f(\ell)\hat{\Delta}$, which means $a_{v_k} \leq \hat{a}_{\hat{v}_k} + f(\ell)\hat{\Delta}$ with error probability bounded above by $\frac{k}{2n^2}$. Then, combining the two inequalities, $\hat{a}_v \leq a_{v_k} + f(\ell)\hat{\Delta} \leq \hat{a}_{\hat{v}_k} + 2f(\ell)\hat{\Delta}$ for all $v \in T$, with error probability bounded above by $\frac{k}{n^2}$. Moreover, we have $\hat{\Delta} \geq \Delta$, because for any $u \in S$ we have
$$2 \max_{v \in V} d(u, v) \geq \max_{v, w \in V} \left(d(v, u) + d(u, w)\right) \geq \max_{v, w \in V} d(v, w) = \Delta,$$
so the error range $f(\ell)\hat{\Delta}$ used above indeed covers the range $f(\ell)\Delta$ guaranteed by inequality (2.3). Therefore, $T \subseteq \hat{E}$ with probability at least $1 - \frac{k}{n^2} \geq 1 - \frac{1}{n}$, i.e., w.h.p. Since steps 5 and 6 compute the exact average distances of all vertices in $\hat{E}$ and rank them, the top-k vertices are ranked correctly w.h.p. □
Lemma 2. If the distribution of the estimated average distances is uniform with range $c\Delta$ ($c$ is a constant number), then TOPRANK(k) takes $O((k + n^{2/3} \log^{1/3} n)(n \log n + m))$ time, when we choose $\ell = \Theta(n^{2/3} \log^{1/3} n)$.

Proof. TOPRANK(k) takes $O(\ell(n \log n + m))$ time at step 1, because the SSSP algorithm takes $O(n \log n + m)$ time and TOPRANK(k) iterates the SSSP algorithm $\ell$ times. Since the distribution of the estimated average distances is uniform with range $c\Delta$, there are $n \cdot \frac{2f(\ell)\hat{\Delta}}{c\Delta}$ vertices between $\hat{a}_{\hat{v}_k}$ and $\hat{a}_{\hat{v}_k} + 2f(\ell)\hat{\Delta}$, and $n \cdot \frac{2f(\ell)\hat{\Delta}}{c\Delta}$ is $O(n f(\ell))$ because $\hat{\Delta} = 2 \min_{u \in S} \max_{v \in V} d(u, v) \leq 2 \max_{u, v \in V} d(u, v) = 2\Delta$. So the number of vertices in $\hat{E}$ is $k + O(n f(\ell))$, and TOPRANK(k) takes $O((k + O(n f(\ell)))(n \log n + m))$ time at step 5. Therefore, we select an $\ell$ that minimizes the total running time of steps 1 and 5. In other words, we choose $\ell$ to minimize $\ell + n f(\ell) = \ell + n \sqrt{\frac{\alpha \log n}{\ell}}$; setting the derivative with respect to $\ell$ to zero gives $\ell^{3/2} = \Theta(n \log^{1/2} n)$, which implies $\ell = \Theta(n^{2/3} \log^{1/3} n)$. Then TOPRANK(k) takes $O(n^{2/3} \log^{1/3} n\,(n \log n + m))$ time at step 1, and $O((k + n^{2/3} \log^{1/3} n)(n \log n + m))$ time at step 5. Obviously, TOPRANK(k) takes at most $O(n^{2/3} \log^{1/3} n\,(n \log n + m))$ time at the other steps. So TOPRANK(k) takes $O((k + n^{2/3} \log^{1/3} n)(n \log n + m))$ total running time. □

Combining Lemmata 1 and 2, we arrive at the following theorem.

Theorem 1. If the distribution of the estimated average distances is uniform with range $c\Delta$ ($c$ is a constant number), then algorithm TOPRANK(k) given in Figure 1 ranks the top-k vertices with the highest closeness centralities in $O((k + n^{2/3} \log^{1/3} n)(n \log n + m))$ time w.h.p., when we choose $\ell = \Theta(n^{2/3} \log^{1/3} n)$.
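As a quick sanity check on the choice of $\ell$ in Lemma 2 (purely illustrative, with made-up numbers), one can tabulate the two cost terms $\ell$ and $n f(\ell)$, measured in SSSP computations, around $\ell = n^{2/3} \log^{1/3} n$:

```python
import math

def sssp_count(n, ell, alpha=1.1):
    """Dominant SSSP counts of steps 1 and 5, ignoring k and constant factors:
    ell sampled SSSPs plus O(n * f(ell)) candidate SSSPs."""
    return ell + n * math.sqrt(alpha * math.log(n) / ell)

n = 10**6
ell_star = round(n ** (2 / 3) * math.log(n) ** (1 / 3))  # Lemma 2's choice
for ell in (ell_star // 4, ell_star, 4 * ell_star):
    print(ell, round(sssp_count(n, ell)))
# The total is smallest near ell_star and grows when ell is much smaller
# (too many candidates) or much larger (too many samples).
```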
Theorem 1 only addresses the case where the distribution of the estimated average distances is uniform. In this case, the complexity of TOPRANK(k) is better than that of a brute-force algorithm that simply computes all average shortest distances and ranks them, which takes $O(n(n \log n + m))$ time (assuming $k = o(n)$). Even though the theorem covers only the case of the uniform distribution, it can be applied to more general situations, as we now explain. Given an estimated average distance $x$, its density $d(x)$ is the number of vertices whose estimated average distance is around $x$. The uniform distribution means that the density $d(x)$ is the same anywhere in the range of $x$. Any other distribution has an average density $\bar{d}$, which is the average of $d(x)$ over all $x$'s. Suppose that the distribution is such that when $x$ is sufficiently small, $d(x) \leq \bar{d}$ (this property requires further investigation, but we believe it is reasonable for social networks). Let $x_0$ be the largest value such that $d(x) \leq \bar{d}$ for all $x \leq x_0$. Then, in our algorithm, as long as $\hat{a}_{\hat{v}_k} + 2f(\ell)\hat{\Delta} \leq x_0$, the number of vertices between $\hat{a}_{\hat{v}_k}$ and $\hat{a}_{\hat{v}_k} + 2f(\ell)\hat{\Delta}$ is at most $n \cdot \frac{2f(\ell)\hat{\Delta}}{c\Delta}$, as given in the proof of Lemma 2. Thus, Lemma 2 uses a conservative upper bound for this number, and it still holds for distributions with the above property.

Even with the above generalization, however, the savings from $O(n(n \log n + m))$ to $O((k + n^{2/3} \log^{1/3} n)(n \log n + m))$ is still not very significant. Ideally, we would like a ranking algorithm that runs in $O(\mathrm{poly}(k)(n \log n + m))$ time, where $\mathrm{poly}(k)$ is a polynomial in $k$, which means the number of SSSP calculations is related only to $k$, not to $n$. This may be possible for small $k$ when the distribution of the estimated average distances is not uniform but follows some other distribution, such as the normal distribution. In this case, the number of additional SSSP computations for vertices in the range from $\hat{a}_{\hat{v}_k}$ to $\hat{a}_{\hat{v}_k} + 2f(\ell)\hat{\Delta}$ could be small and not related to $n$. We leave this as future research work (see more discussion in Section 4).

3.2 Improving the algorithm with a heuristic
Algorithm TOPRANK2(k)
1. Use the approximation algorithm RAND with a set $S$ of $\ell$ sampled vertices to obtain the estimated average distance $\hat{a}_v$ for each vertex $v$.
   // Rename all vertices to $\hat{v}_1, \hat{v}_2, \ldots, \hat{v}_n$ such that $\hat{a}_{\hat{v}_1} \leq \hat{a}_{\hat{v}_2} \leq \cdots \leq \hat{a}_{\hat{v}_n}$.
2. Find $\hat{v}_k$.
3. Let $\hat{\Delta} = 2 \min_{u \in S} \max_{v \in V} d(u, v)$.
4. Compute the candidate set $\hat{E}$ as the set of vertices whose estimated average distances are less than or equal to $\hat{a}_{\hat{v}_k} + 2 f(\ell) \cdot \hat{\Delta}$.
5. repeat
6.   $p \leftarrow |\hat{E}|$
7.   Select $q$ additional vertices $S^+$ as new samples uniformly at random.
8.   Update the estimated average distances of all vertices using the new samples in $S^+$ (this requires computing SSSP for all new sample vertices).
9.   $S \leftarrow S \cup S^+$; $\ell \leftarrow \ell + q$; $\hat{\Delta} \leftarrow \min(\hat{\Delta},\, 2 \min_{u \in S^+} \max_{v \in V} d(u, v))$
     // Rename all vertices to $\hat{v}_1, \hat{v}_2, \ldots, \hat{v}_n$ such that $\hat{a}_{\hat{v}_1} \leq \hat{a}_{\hat{v}_2} \leq \cdots \leq \hat{a}_{\hat{v}_n}$.
10.  Find $\hat{v}_k$.
11.  Compute the candidate set $\hat{E}$ as the set of vertices whose estimated average distances are less than or equal to $\hat{a}_{\hat{v}_k} + 2 f(\ell) \cdot \hat{\Delta}$.
12.  $p' \leftarrow |\hat{E}|$
13. until $p - p' \leq q$
14. Calculate the exact average shortest-path distances of all vertices in $\hat{E}$.
15. Sort the exact average distances and find the top-k vertices as the output.

Fig. 2. Improved algorithm for ranking the top-k vertices with the highest closeness centralities.
The algorithm in Figure 1 spends the majority of its computation on the following two steps: (1) step 1, computing SSSP for the $\ell$ samples, and (2) step 5, computing SSSP for all candidates in the set $\hat{E}$. The key to reducing the running time of the algorithm is to find the right sample size $\ell$ that minimizes $\ell + |\hat{E}|$, the total number of SSSP calculations. However, this number is difficult to obtain before running the algorithm, especially when the distribution of the average distances is unknown. In this section, we improve the algorithm with a heuristic to achieve the above goal.

The idea of the heuristic is to incrementally add new samples to compute more accurate estimates of the average distances of all vertices. In each iteration, $q$ new sample vertices are added. After computing the new estimated average distances with these $q$ new samples, we obtain a new candidate set $\hat{E}$. If the size of the candidate set $\hat{E}$ decreases by more than $q$, then we know that the savings from reducing the number of candidates outweigh the cost of adding more samples, and in this case we continue with the next iteration of adding more samples. This procedure ends when the cost of adding more samples outweighs the savings obtained from the reduced number of candidates. Figure 2 presents this heuristic algorithm (a compact sketch in Python follows below). Essentially, this is a dynamic way of finding the optimal $\ell$ that minimizes $\ell + |\hat{E}|$ (or of stopping when the increase $\delta\ell$ in the sample size is no longer offset by the corresponding decrease $\delta|\hat{E}|$ in the candidate set size). The initial value of $\ell$ in step 1 can be obtained based on Theorem 1 if we know that the distribution of the estimated average distances is uniform. Otherwise, we can choose a basic value, for example $k$, since we need to compute at least $k$ SSSPs in step 14 in any case. The incremental unit $q$ could be a small value, for example $\log n$. However, we do not know yet whether $|\hat{E}|$ strictly decreases when the sample size $\ell$ for estimating the average distances increases, nor whether the rate of decrease of $|\hat{E}|$ slows down as more and more sample vertices are added. Therefore, it is not guaranteed that the heuristic algorithm always stops at the optimal sample size $\ell$. An open problem is to study the conditions under which the change of $|\hat{E}|$ with respect to the change of the sample size $\ell$ indeed has the above properties, so that the heuristic algorithm indeed provides the most efficient solution.
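Here is the promised compact sketch of the loop of Figure 2 (again our illustration, reusing the hypothetical dijkstra and exact_avg_distance helpers from Section 2; it maintains running distance sums so that step 8 only requires SSSP from the $q$ new samples):

```python
import math
import random

def toprank2(adj, k, ell0, q, alpha=1.1):
    """Sketch of TOPRANK2(k) from Figure 2: keep adding q samples per round
    while each round shrinks the candidate set by more than q."""
    n = len(adj)
    sampled, dist_sum = set(), {v: 0.0 for v in adj}
    delta_hat = math.inf

    def add_samples(count):                     # steps 7-9
        nonlocal delta_hat
        for u in random.sample([v for v in adj if v not in sampled], count):
            d = dijkstra(adj, u)                # one SSSP per new sample
            delta_hat = min(delta_hat, 2 * max(d.values()))
            for v in adj:
                dist_sum[v] += d[v]
            sampled.add(u)

    def candidates():                           # steps 10-11
        ell = len(sampled)
        a_hat = {v: n / (n - 1) * dist_sum[v] / ell for v in adj}
        kth_estimate = sorted(a_hat.values())[k - 1]
        f = math.sqrt(alpha * math.log(n) / ell)
        return [v for v in adj if a_hat[v] <= kth_estimate + 2 * f * delta_hat]

    add_samples(ell0)                           # steps 1-4
    cand = candidates()
    while True:                                 # steps 5-13
        p = len(cand)
        add_samples(q)
        cand = candidates()
        if len(cand) >= p - q:                  # until p - p' <= q
            break
    exact = {v: exact_avg_distance(adj, v) for v in cand}   # step 14
    return sorted(exact, key=exact.get)[:k]                 # step 15
```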
Conclusion and future work

This paper can be viewed as a first step towards the design of more efficient algorithms for obtaining the highest-ranked closeness centrality vertices. By combining the approximation algorithm with the exact algorithm, we obtain an algorithm that has better complexity than the brute-force exact algorithm. There are many directions in which to extend this study. First, as mentioned in the previous section, we are interested in more efficient algorithms such that the number of SSSP computations is related only to $k$, not to $n$. This may be possible for some classes of social networks with certain properties on their average-distance distributions.
Second, the condition under which the heuristic algorithm results in the least number of SSSP computations is an open problem and would be quite interesting to study. Third, we may be able to obtain faster algorithms if we relax the problem requirement. Instead of finding all top-k vertices with high probability, we may allow the output to have an error bound $t$, which is the number of vertices that should be ranked within the top-k vertices but are missed in the output. Generally, given an algorithm for top-k queries of various centralities, we define the hit-ratio of an output $A = \{v_1, v_2, \ldots, v_s\}$ (not necessarily of size $k$), denoted $\eta(A)$, as $\frac{|A \cap Q_k|}{k}$, where $Q_k$ is the actual set of top-k vertices. Let $r(v_i)$ denote the actual rank of $v_i$. Here, we require that $r(v_i) < r(v_{i+1})$ for $i \in [1, s-1]$; thus $r(v_i) \geq i$. We define the accuracy of $A$, denoted $\beta(A)$, as
$$\beta(A) = \frac{s(s+1)}{2 \sum_{i=1}^{s} r(v_i)}$$
(other definitions of accuracy are possible). Clearly, $\eta(A) \in [0, 1]$ and $\beta(A) \in [0, 1]$. The hit-ratio and accuracy of an algorithm are then its worst-case performance over all possible inputs. We may then only require that $\Pr(\eta(A) \geq 1 - \epsilon_1,\, \beta(A) \geq 1 - \epsilon_2) \geq 1 - \delta$, for sufficiently small $\epsilon_1$, $\epsilon_2$, and $\delta$. Such relaxations of the requirement may allow much more efficient algorithms, since we observe that the complexity of our algorithm arises mainly because we need to include all vertices in the extra range from $\hat{a}_{\hat{v}_k}$ to $\hat{a}_{\hat{v}_k} + 2f(\ell)\hat{\Delta}$ in order to include all top-k vertices with high probability. Fourth, we would like to study the stability of our random-sampling algorithms for top-k queries. Costenbader and Valente [CV03] studied the stability of various centrality measures. Their study shows that, with a 50% sample of the original network nodes, average correlations for the closeness measure ranged from 0.54 to 0.71. Finally, we can look into other types of centralities and study how to rank them efficiently using the techniques in this paper. For example, Brandes [Br00] presents an algorithm for the betweenness centrality of weighted graphs with time complexity $O(n(m + n \log n))$. Brandes and Pich [BP00] present the first approximation algorithm for betweenness centrality. Improvements over their method were presented recently in [GS08, BK07].
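The two relaxed quality measures are straightforward to compute for a given output; a small sketch, using our reconstruction of the definitions above (the function names, and the symbols $\eta$, $\beta$, and $s$, are ours):

```python
def hit_ratio(output, q_k):
    """eta(A) = |A intersect Q_k| / k, where Q_k is the actual set of top-k vertices."""
    return len(set(output) & set(q_k)) / len(q_k)

def accuracy(output, true_rank):
    """beta(A) = s(s+1) / (2 * sum of actual ranks); ranks are sorted so the
    requirement r(v_i) < r(v_{i+1}) holds. beta = 1 iff A is exactly the top-s."""
    s = len(output)
    ranks = sorted(true_rank[v] for v in output)
    return s * (s + 1) / (2 * sum(ranks))

# Example: actual ranking a, b, c, d, e (ranks 1..5), true top-3 = {a, b, c}.
true_rank = {v: i + 1 for i, v in enumerate("abcde")}
print(hit_ratio(["a", "b", "d"], ["a", "b", "c"]))  # 2/3
print(accuracy(["a", "b", "d"], true_rank))         # 12 / (2*(1+2+4)) = 6/7
```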
To conclude, we would like to provide a brief comparison of our algorithms with the well-known PageRank algorithm [BP98]. PageRank is an algorithm used to rank the importance of webpages, which are viewed as vertices connected by directed edges (hyperlinks). As explained in Footnote 5, PageRank is a variant of eigenvector centrality. Thus, it is a different measure from closeness centrality. More importantly, by definition the PageRank of a vertex (a webpage) depends on the PageRanks of the other vertices linking to it, so the PageRank calculation requires computing the PageRank values of all vertices, even if only the top-k PageRank vertices are desired. For the closeness centrality measure, however, our algorithms do not need to compute the closeness centralities of all vertices. Instead, we may start with rough estimates of the closeness centralities of the vertices, and by refining the estimates we reduce the candidate set containing the top-k vertices. This results in reduced time complexity in our computation.
References
[BK07] Bader, D. A., Kintali, S., Madduri, K., Mihail, M.: Approximating Betweenness Centrality. In: Proceedings of the 5th Workshop on Algorithms and Models for the Web-Graph (2007) 124–137
[Ba50] Bavelas, A.: Communication patterns in task-oriented groups. The Journal of the Acoustical Society of America 22(6) (1950) 725–730
[Be65] Beauchamp, M. A.: An improved index of centrality. Behavioral Science 10(2) (1965) 161–163
[Br00] Brandes, U.: Faster Evaluation of Shortest-Path Based Centrality Indices. Konstanzer Schriften in Mathematik und Informatik 120 (2000)
[BP00] Brandes, U., Pich, C.: Centrality Estimation in Large Networks. International Journal of Bifurcation and Chaos in Applied Sciences and Engineering 17(7) (2007) 2303–2318
[BP98] Brin, S., Page, L.: The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems 30 (1998) 107–117
[CV03] Costenbader, E., Valente, T. W.: The stability of centrality measures when networks are sampled. Social Networks 25 (2003) 283–307
[Di59] Dijkstra, E. W.: A note on two problems in connexion with graphs. Numerische Mathematik 1 (1959) 269–271
[EL05] Elmacioglu, E., Lee, D.: On six degrees of separation in DBLP-DB and more. ACM SIGMOD Record 34(2) (2005) 33–40
[EW04] Eppstein, D., Wang, J.: Fast Approximation of Centrality. Journal of Graph Algorithms and Applications 8 (2004) 39–45
[FT87] Fredman, M. L., Tarjan, R. E.: Fibonacci heaps and their uses in improved network optimization algorithms. Journal of the ACM 34(3) (1987) 596–615
[Fr79] Freeman, L. C.: Centrality in social networks: conceptual clarification. Social Networks 1(3) (1978/79) 215–239
[GS08] Geisberger, R., Sanders, P., Schultes, D.: Better Approximation of Betweenness Centrality. In: Proceedings of the 10th Workshop on Algorithm Engineering and Experiments (ALENEX'08), SIAM (2008) 90–100
[Ho63] Hoeffding, W.: Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58(301) (1963) 13–30
[Jo77] Johnson, D. B.: Efficient algorithms for shortest paths in sparse networks. Journal of the ACM 24(1) (1977) 1–13
[KL05] Koschützki, D., Lehmann, K. A., Peeters, L., Richter, S., Tenfelde-Podehl, D., Zlotowski, O.: Centrality indices. In: Network Analysis (2005) 16–61
[Ne04a] Newman, M. E. J.: Coauthorship networks and patterns of scientific collaboration. Proceedings of the National Academy of Sciences 101 (2004) 5200–5205
[Ne04b] Newman, M. E. J.: Who is the best connected scientist? A study of scientific coauthorship networks. In: Complex Networks (2004) 337–370
[NP03] Newman, M. E. J., Park, J.: Why social networks are different from other types of networks. Physical Review E 68 (2003) 036122
[PP02] Potterat, J. J., Phillips-Plummer, L., Muth, S. Q., Rothenberg, R. B., Woodhouse, D. E., Maldonado-Long, T. S., Zimmerman, H. P., Muth, J. B.: Risk network structure in the early epidemic phase of HIV transmission in Colorado Springs. Sexually Transmitted Infections 78 (2002) i159–i163
[Sa66] Sabidussi, G.: The centrality index of a graph. Psychometrika 31(4) (1966) 581–603