
Routing betweenness centrality

2010, Journal of The ACM


Routing Betweenness Centrality
by Shlomi Dolev, Yuval Elovici, and Rami Puzis
Ben-Gurion University of the Negev
Technical Report #2009-09, August 10, 2009

Abstract

The Betweenness centrality measure is often used in social and computer communication networks to estimate the potential monitoring and control capabilities a vertex may have on data flowing in the network. In this paper we define the Routing Betweenness Centrality (RBC) measure, which generalizes previously well-known Betweenness measures such as the Shortest Path Betweenness, Flow Betweenness, and Traffic Load Centrality by considering network flows created by arbitrary loop-free routing strategies. We present algorithms for computing the RBC of all the individual vertices in the network and algorithms for computing the RBC of a given group of vertices, where the RBC of a group of vertices represents their potential to collaboratively monitor and control data flows in the network. Two types of collaboration are considered: (i) conjunctive, where the group is a sequence of vertices controlling traffic and all members of the sequence process the traffic in the order defined by the sequence; and (ii) disjunctive, where the group is a set of vertices controlling traffic and at least one member of the set processes the traffic. The algorithms presented in this paper also take into consideration different sampling rates of network monitors, accommodate arbitrary communication patterns between the vertices (traffic matrices), and can be applied to groups consisting of vertices and/or edges. For routing strategies that depend on both the source and the target of the message, we present algorithms with time complexity of $O(n^2 m)$, where n is the number of vertices in the network and m is the number of edges in the routing tree (or the routing directed acyclic graph (DAG) for multi-path routing strategies).
The time complexity can be reduced by an order of n if we assume that the routing decisions depend solely on the target of the messages. Finally, we show that a preprocessing of $O(n^2 m)$ time supports computations of RBC of sequences in $O(kn)$ time and computations of RBC of sets in $O(k^3 n)$ time, where k is the number of vertices in the sequence or the set.

1 Introduction

Networks are commonly used to represent a domain, a problem, or a complex dynamic system in a large variety of scopes [31], for example social networks [33], protein interactions [9], urban structure [27], and computer communication networks [15, 35]. Various centrality measures such as Degree, Closeness, and Betweenness [8, 17] were introduced in order to analyze networks and understand both the global dynamics of the networks and the roles played by individual nodes. Many naturally evolved complex networks are characterized by a power-law distribution of the Degree and Betweenness-Centrality measures of their nodes [5, 15]. Such networks, referred to as Scale-Free Networks [3, 4], are highly resistant to random damage but are easily partitioned by removing the most central nodes [6]. The centrality characteristics of nodes are also important in the field of computational epidemiology, where it has been shown that immunizing central nodes can significantly reduce the impact of epidemics [26]. In the scope of the Internet, Jackson et al. [22] suggest placing monitors on links of the autonomous system level topology of the Internet, with end nodes having the highest Degree. The computation of Group Betweenness-Centrality was suggested in [30] to identify groups of autonomous systems which can collaborate to trace the communications of as many Internet users as possible.

In this paper we concentrate on Betweenness-Centrality measures [2, 16], originally defined to estimate the control an individual may have over communication flows in social networks.
Betweenness-Centrality measures may be used to estimate the monitoring capabilities, control capabilities, and/or functional importance of nodes in communication networks. The concept of Betweenness-Centrality evolved into a broad class of diverse measures that consider different types of network flows [7]. Betweenness, which is now referred to as Shortest Path Betweenness-Centrality (SPBC), assumes that only shortest paths are used to transfer the network flow. Traffic Load Centrality (TLC) [11, 19, 24] is a variant of betweenness that also assumes that traffic flows over shortest paths, but uses a different routing mechanism. When devising a routing strategy in a commercial communication network, factors such as load balancing, fault tolerance, and service level agreements must be considered. Unfortunately, these factors may lead to traffic flows that are not routed along shortest paths to the target and are therefore ignored by SPBC and TLC.

Table 1: Time and space complexity of the proposed algorithms.

Section / Alg.   Scope         Routing depends on   Space    Preproc. time   Query time
--------------   -----------   ------------------   ------   -------------   ----------
4.1 / 1          All nodes     source and target    O(m)     -               O(n^2 m)
4.2 / 2          Sequence      source and target    O(m)     -               O(n^2 m)
4.3 / 3          Set (nodes)   source and target    O(m)     -               O(n^2 m)
5.1 / 4          All nodes     only target          O(m)     -               O(nm)
5.2 / 5          Sequence      only target          O(m)     -               O(nm)
5.2 / 6          Sequence      only target          O(n^3)   O(n^2 m)        O(nk)
5.3 / 7          Set (nodes)   only target          O(m)     -               O(nm)
5.3 / 8          Set (nodes)   only target          O(n^3)   O(n^2 m)        O(k^3 n)
5.4 / 9          Set (links)   only target          O(n^3)   O(n^2 m)        O(k^3 n)
5.4 / 10         Set (mixed)   only target          O(n^3)   O(n^2 m)        O(k^3 n)

n: number of nodes in the network; m: maximal number of edges in routing trees (or routing directed acyclic graphs (DAGs) for multi-path routing schemes); k: number of nodes in a sequence or a set.

Flow Betweenness-Centrality (FBC), proposed by Freeman et al. [18], equally considers routes of all lengths and assumes that routes are simple (containing no cycles).
While assuming simple routes is reasonable for communication networks, routing strategies in computer networks usually do prefer shorter paths over longer paths. Random Walk Betweenness-Centrality (RWBC), proposed by Newman [25], assumes that shorter paths are used more than longer ones. However, RWBC assumes that routes may contain cycles, which is not the case in most communication networks. Besides the path length issue, each one of the above Betweenness-Centrality measures assumes a fixed communication model that does not fully match routing strategies used in communication networks such as the Internet. In this paper we propose a more flexible and realistic measure called Routing Betweenness Centrality, which accommodates a wide class of routing strategies.

Most routing protocols create routing tables that match the destination address of a packet with one output port. Occasionally routing tables are changed if one of the links attached to a router is unavailable due to malfunction or congestion. During the time period when routing tables do not change, they create spanning trees rooted at every target node in the network. Some routing protocols maintain shortest paths to the target, while others balance the traffic load on the network by forwarding superfluous traffic through less loaded routes which are not necessarily shortest [1, 23]. Routing protocols may utilize multiple paths from source to target which are not necessarily shortest, but it is important to note that in the stable state they do not contain loops.

Routing Betweenness-Centrality (RBC), as defined in this paper, accommodates arbitrary loop-free routing schemes where routing decisions depend on the packet target alone or on both the source and the target of the packet. It is easy to show that SPBC, TLC, and FBC are particular cases of RBC.
We elaborate on SPBC, TLC, and FBC in Section 2 and show how to define a routing strategy that matches their communication model in Section 3. We also add the notion of sampling rate, which was not considered by prior algorithms for computing Betweenness-Centrality measures. We present a set of algorithms for computing RBC of individual nodes, sequences of nodes (e.g. links), and sets of nodes and/or links. Table 1 summarizes the algorithms discussed in this paper. In Section 4 we present algorithms that, given a loop-free routing scheme, compute RBC by topologically sorting all nodes between each source-target pair. In Section 5 we reduce the complexity of RBC computations for routing schemes where the routing decisions are affected only by the target of the packet, and show how to efficiently compute RBC of sets consisting of both links and nodes. Conclusions appear in Section 6. Symbols and notation principles used in this paper are summarized in the Appendix.

2 Preliminaries on Betweenness-Centrality

Shortest Path Betweenness Centrality (SPBC) was introduced in social sciences to measure the potential influence of an individual over the information flow in a social network [2, 16]. SPBC is defined as the sum of fractions of all shortest paths between each pair of nodes in a network which traverse a given node:

$$SPBC(v) = \sum_{s \neq v \neq t} \frac{\sigma_{s,t}(v)}{\sigma_{s,t}},$$

where $\sigma_{s,t}$ is the number of shortest paths connecting s and t, and $\sigma_{s,t}(v)$ is the number of shortest paths between s and t that traverse v. Assume, for example, that the network uses a shortest-path routing scheme where the route is randomly chosen out of all shortest paths from source to target. Assume also that every node sends one packet to every other node. In this case, SPBC of a node v is the expected number of packets that traverse v. Brandes has shown in [10] that shortest paths from a single source s to all other nodes can be efficiently aggregated by traversing the nodes in the order of non-increasing distance from s.
Efficient aggregation of shortest paths is used to compute SPBC of all nodes in a network in $O(|V||E|)$ time, where V is the set of nodes and E is the set of links in the network.

SPBC can naturally be extended to Group Betweenness-Centrality (GBC) [14]. GBC of a set of nodes M is defined as the sum of fractions of all shortest paths which traverse at least one node in M. GBC of a single group can be computed in $O(|V||E|)$ time [11, 30], or in $O(|M|^3)$ time following a preprocessing that takes $O(|V|^3)$ time. The definition of GBC resembles the definition of the effectiveness of a group of distributed network monitors, which is defined as the probability that a random packet is sampled at least once by at least one of the monitors [12]. Moreover, Holme has shown in [21] that the SPBC of a node is highly correlated with the fraction of time that the node is occupied by traffic. SPBC was also used by Yan et al. [34] for predicting and avoiding congestion. These findings indicate that SPBC can be used as a heuristic in many network-related tasks such as designing routing protocols, optimizing deployment of network monitors, finding bottlenecks in the network, etc.

Unfortunately, in practice, not all the shortest paths between source and target have the same probability of transferring a packet, as assumed by SPBC. Traffic Load Centrality (TLC) assumes a more realistic routing strategy, where every node forwards the packet to a neighbor chosen randomly out of the neighbors that are closest to the target. Like SPBC, TLC can be computed in $O(|V||E|)$ time for all nodes in the network [11, 24]. The group variant of TLC can also be computed in $O(|V||E|)$ time, as was indicated in [11]. To the best of our knowledge, there are no algorithms that use preprocessing to reduce the time required to compute TLC of a group of nodes. SPBC and TLC possess similar statistical properties; however, normalized SPBC and TLC can differ by up to 30% for individual nodes in large networks [36].
The drawback of the SPBC and TLC measures is that they are both limited to shortest-path routes, while in practice traffic flows may deviate from the shortest paths to increase the network performance.

There are some Betweenness-Centrality measures that are not limited to shortest paths. Freeman et al. [18] introduce Maximal-Flow Betweenness-Centrality (FBC), which equally considers all paths from source to target. Roughly speaking, FBC of the node v is the sum, over all pairs of nodes, of the fraction of the maximal flow between the pair that is transferred by v:

$$FBC(v) = \sum_{s \neq v \neq t} \frac{\phi_{s,t}(v)}{\phi_{s,t}},$$

where $\phi_{s,t}$ is the maximal flow between s and t, and $\phi_{s,t}(v)$ is the portion of this flow that is transferred by the node v. Since the maximal flow can utilize different routes from s to t, $\phi_{s,t}(v)$ should be averaged over all the possibilities. The main drawback of FBC when applied to communication networks is that it does not prioritize routes according to their lengths, while in practice most of the traffic is routed through shortest paths.

The work of Borgatti and Everett [8] categorizes Betweenness-Centrality measures according to the types of routes assumed and provides valuable insights into the formulation and computation of generic Betweenness-Centrality measures.

Routing Betweenness-Centrality (RBC), defined in this paper, is a generalization of SPBC, FBC, and TLC. We present algorithms for computing RBC of individual nodes as well as sets and sequences of nodes. Algorithms presented in this paper are applicable to the general case of loop-free routing schemes where the routing decisions depend on both the source and the target of a packet, and to source-oblivious schemes. For the latter routing schemes we show that RBC can be computed with the same time complexity as SPBC and TLC (namely in $O(|V||E|)$ time). We also show how these times can be reduced using preprocessing when the size of the evaluated group is small compared to the size of the network.
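The SPBC definition given at the start of this section can be illustrated with a short Python sketch. It is not the aggregation technique of Brandes [10]; it is a plain pairwise evaluation of the formula, assuming an unweighted graph given as a hypothetical adjacency dictionary `adj` and summing over ordered (s, t) pairs, in line with the reading that every node sends one packet to every other node. The function names are illustrative only.

```python
from collections import deque
from itertools import permutations


def shortest_path_counts(adj, s):
    """BFS from s. Returns (dist, sigma), where sigma[v] is the number of
    shortest paths from s to v in an unweighted graph."""
    dist = {s: 0}
    sigma = {s: 1}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:              # first time w is reached
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:     # u precedes w on a shortest path
                sigma[w] += sigma[u]
    return dist, sigma


def spbc(adj):
    """SPBC(v) = sum over ordered pairs s != v != t of sigma_st(v) / sigma_st."""
    info = {s: shortest_path_counts(adj, s) for s in adj}
    score = {v: 0.0 for v in adj}
    for s, t in permutations(adj, 2):
        dist_s, sigma_s = info[s]
        if t not in dist_s:                # t unreachable from s
            continue
        for v in adj:
            if v == s or v == t:
                continue
            dist_v, sigma_v = info[v]
            # v lies on a shortest s-t path iff the distances add up
            if v in dist_s and t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                score[v] += sigma_s[v] * sigma_v[t] / sigma_s[t]
    return score


# Path graph a - b - c: only b is an intermediate node.
example = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(spbc(example))  # {'a': 0.0, 'b': 2.0, 'c': 0.0}
```

This evaluation runs one breadth-first search per node and then checks every (s, t, v) triple, so it is slower than the $O(|V||E|)$ bound quoted above; it is meant only to make the definition concrete.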
3 Routing scheme representation

Throughout this paper we assume a loop-free routing scheme. We ignore temporary loops created by routing oscillations and treat routing oscillations as unavoidable noise in the system. Instead, we are interested in a superposition of all stable-state routing tables. Each routing decision made along a network path is dictated by the network topology and the status of the network. Link failures and congestion cause routing decisions across the network to change from time to time. We assume that either the routing decisions are deterministic or the probabilities for specific routing decisions can be determined (for example, by analyzing historical behavior of the network).

Formally, let G = (V, E) be a communication network topology where V is a set of n nodes and E is a set of links between the nodes. We do not allow self loops such as (v, v) ∈ E. Let T be a traffic matrix where T(s, t) is the number of packets sent from a source node s to a target node t. In general, T(s, t) can represent any quantity of interest such as the number of bytes, the number of sessions, or the importance of communication between s and t.

Assume, for example, that a group of monitors is installed on nodes in a network. The total number of bytes or the total importance of the communication passing through this group can be regarded as its monitoring potential. However, the actual volume of information being monitored depends on the sampling rates of the monitors ($0 \le \rho_v \le 1$). Our goal is to compute the total expected number of packets sampled by groups of collaborating monitors. We distinguish between two types of groups: sequences, where packets should be sampled by all the members in the order defined by the sequence, and sets, where packets should be sampled by at least one member.
Let $R(s, u, v, t) = p$ be a quaternary function representing the averaged routing scheme, where p is the probability that u will forward to v a packet with source address s and target address t. Note that we assume that all routing decisions (such as (s, u, v, t)) are independent. We will use the “don’t care” symbol $*$ to indicate any value. For example, $R(*, u, v, *) = 0$ if there is no link from u to v, and $R(*, v, v, *) = 1$ by convention. R defines a directed acyclic graph (DAG) for each source-target pair. The complexity of the algorithms described in this paper depends on the number of links in these DAGs. We denote the maximal number of links in all routing DAGs relevant to the network as m.

The routing scheme R can represent various policies of message or flow transfer. We can embed in R some message transfer methods assumed by different Betweenness-Centrality measures. In these cases RBC will produce the same values as would be produced by the original Betweenness-Centrality measure. We will use Figure 1 as a sample network for the following examples.

[Figure 1: Sample network with traffic flowing from s to t according to the SPBC, TLC, and FBC flow models. Panels: (a) SPBC, (b) TLC, (c) FBC.]

TLC: nodes forward packets to one of the neighbors which are closest to the target, with equal probability. In this case $R(s, u, v, t)$ is equal to one divided by the number of u’s neighbors that are closest to t. For example, in Figure 1, $R(s, v_1, v_2, t) = 0.5$.

SPBC: nodes forward packets to one of the neighbors which are closest to the target. The probability that u forwards to v a packet targeted at t is equal to the fraction of shortest paths from u to t that pass through v: $R(s, u, v, t) = \sigma_{u,t}(v) / \sigma_{u,t}$. For example, in Figure 1, $R(s, s, v_1, t) = 2/3$ since there are three shortest paths between s and t, two of which pass through $v_1$.
FBC: for each s, t pair, nodes forward packets from s to one of their neighbors so as to produce a maximal flow between s and t. The probability that u forwards to v a packet from s to t is proportional to the portion of the s-t flow carried by the undirected link (u, v): $R(s, u, v, t) = \phi_{s,t}((u, v)) / \phi_{s,t}(u)$. For example, if we assume that in Figure 1 the capacity of all links is 0.5, then $R(s, v_1, v_2, t) = 0.5$ since the link $(v_1, v_5)$ is not utilized by the maximal flow between s and t.

4 Routing Betweenness-Centrality

In this section we define Routing Betweenness-Centrality (RBC), focusing on routing schemes where the routing decisions depend on the source and the target of a packet. In Section 5 we will show how the computation of RBC can be optimized when the routing decisions are source-oblivious. In the next three subsections we present algorithms for computing RBC of individual nodes, sequences of nodes, and sets of nodes.

4.1 RBC of individual nodes

Assume that a packet is introduced to the network by source node s and destined to leave the network at target node t. Let $\delta_{s,t}(v)$ be the probability that this packet will pass through the node v. We will refer to $\delta_{s,t}(v)$ and its variants as the pairwise dependency of s and t on the intermediate v. $\delta_{s,t}(v) \cdot T(s, t)$ is the expected number of packets sent from s to t that pass through v. Note that for the special cases where v equals s, t, or both, it holds that $\delta_{s,t}(s) = \delta_{s,t}(t) = 1$.

$\delta_{s,t}(v)$ can be recursively computed for arbitrary v ∈ V based on the loop-free routing strategy $R(s, u, v, t)$. Let $Pred_{s,t}(v)$ be the set of all immediate predecessors of v on the way to t: $Pred_{s,t}(v) = \{u \mid R(s, u, v, t) > 0\}$. Let u be a predecessor of v on the way from s to t. The probability that a packet will pass through v after visiting u is $R(s, u, v, t)$. Hence, the pairwise dependency of s and t on v can be computed using the pairwise dependency of s and t on v’s predecessors:
$$\delta_{s,t}(s) = 1, \qquad \delta_{s,t}(v) = \sum_{u \in Pred_{s,t}(v)} \delta_{s,t}(u) \cdot R(s, u, v, t) \tag{1}$$

Since we assume loop-free routing, $Pred_{s,t}$ defines a directed acyclic graph (DAG) [20], as shown in Figure 2a. Therefore, we can compute $\delta_{s,t}(v)$ for all v ∈ V in $O(m)$ time in the worst case. All we need to do is topologically sort the DAG induced by $Pred_{s,t}$ and iteratively apply Equation 1 on all nodes, starting from s.

[Figure 2: Example of a routing DAG from $s_1$ to $t_1$ (dashed gray arrows). In this example we assume $T(s_1, t_1) = 10$. The numbers on the arrows in sub-figures (a), (b), and (c) indicate the delta values contributed by the topologically sorted nodes to their successors in Algorithms 1, 2, and 3 respectively. Panels: (a) Single nodes, (b) Sequence $(v_1, v_3)$, (c) Set $\{v_1, v_3\}$.]

Let RBC of a node v, denoted $\delta_{\bullet,\bullet}(v)$, be the expected number of packets that pass through v:

$$\delta_{\bullet,\bullet}(v) = \sum_{s,t \in V} \delta_{s,t}(v) \cdot T(s, t) \tag{2}$$

$\delta_{\bullet,\bullet}(v)$ can be regarded as the potential of v to inspect or alter communications in the network. Equation 2 resembles the original definition of SPBC with two exceptions. First, each $\delta_{s,t}(v)$ is multiplied by the number of packets sent from s to t to compute the traffic load on v. Second, end points are included in the summation to accommodate communications originating from (or destined to) the investigated node. Algorithm 1 computes the RBC of all individual nodes in $O(n^2 m)$ time using Equations 1 and 2.

Algorithm 1: RBC of nodes
  Input: G(V, E), R, T
  Output: RBC[1..|V|]
  Data: delta[1..|V|]
  for all v ∈ V: RBC[v] = 0
  for s, t ∈ V do
      ▷ topological sort
      E' = {(u, v) | R(s, u, v, t) > 0}
      D = directed acyclic graph (V, E')
      {s = v_0 ≼ v_1 ≼ ... ≼ v_n = t} = topologically sorted nodes of D
      ▷ init delta
      for all v ∈ V: delta[v] = 0
      delta[s] = T(s, t)
      ▷ accumulate δ_{•,•}(v)
      for i = 0 to n do
          for v_j ∈ successors(v_i) do
              delta[v_j] += delta[v_i] · R(s, v_i, v_j, t)
      for v ∈ V do
          RBC[v] += delta[v]
  return RBC

Algorithm 1 is composed of an outer loop that iterates over all s-t pairs of nodes and of three inner stages. In the first stage the algorithm creates the routing DAG with single source s and single sink t. In the second stage the delta array is initialized (bold number in Figure 2). Entry delta[v] of this array represents the expected number of packets from s to t that pass through v: $\delta_{s,t}(v) \cdot T(s, t)$. Finally, in the third stage the expected number of packets from s to t that pass through each one of the nodes is computed, and these values are accumulated according to Equation 2 to form the RBC values of all nodes. Most of the following algorithms will use the same template and similar content.

4.2 RBC of ordered sequences

In this subsection we define RBC of ordered sequences of nodes. A link is a special case of a sequence of size two where the members of the sequence are connected. Betweenness-Centrality of a sequence measures the extent to which packets traverse all the nodes in the sequence in a given order. For example, RBC of a sequence of monitors can reveal the level of redundant traffic inspection. The SPBC of sequences was first mentioned in [28] as a technique to speed up the computation of shortest-path Group Betweenness-Centrality (GBC) by an order of magnitude. We will also use the concept of sequence Betweenness-Centrality to speed up the computation of RBC of sets in Section 5.3.

Let $S = (s_1, \ldots, s_k)$ be a sequence of nodes. Let $\tilde{\delta}_{s,t}(S)$ be the probability that a single packet emanating from s and targeted at t will pass through all nodes in the sequence S, first through $s_1$, then through $s_2$, and so on until $s_k$.
$\tilde{\delta}_{s,t}(S) \cdot T(s, t)$ is the expected number of packets sent from s to t that pass through S. The sequence S can be any finite sequence of nodes. If the same node appears more than once, all successive appearances of the node can be reduced to one instance, for example $\tilde{\delta}_{s,t}((u, v, v, v, w)) = \tilde{\delta}_{s,t}((u, v, w))$. On the other hand, if two appearances of a node in the sequence S are separated by a different node, this creates a cycle and $\tilde{\delta}_{s,t}(S)$ is equal to zero according to the assumption of loop-free routing. For the same reason, $\tilde{\delta}_{s,t}(S)$ is equal to zero if S contains s following some other nodes, for example $\tilde{\delta}_{s,t}((v, \ldots, s, \ldots)) = 0$. The following set of equations recursively computes the probability that a packet will pass through the sequence S:

$$
\begin{aligned}
\tilde{\delta}_{s,t}((s)) &= 1 \\
(v_{k-1} = v_k):\quad \tilde{\delta}_{s,t}((\ldots, v_{k-1}, v_k)) &= \tilde{\delta}_{s,t}((\ldots, v_{k-1})) \\
(v_{k-1} \neq v_k):\quad \tilde{\delta}_{s,t}((\ldots, v_{k-1}, v_k)) &= \sum_{u \in Pred_{s,t}(v_k)} \tilde{\delta}_{s,t}((\ldots, v_{k-1}, u)) \cdot R(s, u, v_k, t)
\end{aligned}
\tag{3}
$$

The set of predecessors ($Pred_{s,t}(r)$) remains the same as in the previous subsection. Therefore, Equation 3 can also be solved in $O(m)$ time, similarly to Equation 1.

Let $S_\rho = (s_1, \ldots, s_k)$ be a sequence of nodes with sampling rates $\rho_{s_1}, \ldots, \rho_{s_k}$ respectively. For simplicity of the following discussion we assume that all nodes in S are different. We will denote by S the same sequence of nodes disregarding their sampling rates. The probability that a packet from s to t will be sampled by all nodes in $S_\rho$ is the probability that it will pass through S multiplied by the product of the sampling rates of all nodes in the sequence. RBC of an ordered sequence of nodes $S_\rho$ (denoted by $\tilde{\delta}_{\bullet,\bullet}(S_\rho)$) is defined as the expected number of packets sampled by all nodes in $S_\rho$ in a given order:

$$\tilde{\delta}_{\bullet,\bullet}(S_\rho) = \prod_{r \in S} \rho_r \cdot \sum_{s,t \in V} \tilde{\delta}_{s,t}(S) \cdot T(s, t) \tag{4}$$

Note that the RBC of a directed link (u, v) ∈ E and of a single node w ∈ V is simply the RBC of the sequences (u, v) and (w) respectively.
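The recursion of Equation 3 and the per-pair term of Equation 4 can be sketched in Python as follows. This is an illustrative sketch rather than the paper's Algorithm 2: it restarts the propagation of Equation 1 from each sequence member in turn, which follows the same recursion but costs one pass per member. The routing DAG for a fixed (s, t) pair is assumed to be given as a hypothetical dictionary `succ`, where `succ[u][v]` stands for $R(s, u, v, t)$; all function names and the sample DAG are made up for illustration. `graphlib` requires Python 3.9 or later.

```python
import math
from graphlib import TopologicalSorter


def topological_order(succ):
    """Nodes of the routing DAG, predecessors first."""
    preds = {u: set() for u in succ}
    for u, out in succ.items():
        for v in out:
            preds.setdefault(v, set()).add(u)
    return list(TopologicalSorter(preds).static_order())


def sequence_dependency(succ, s, seq):
    """Probability that a packet injected at s visits the nodes of seq in order
    (Equation 3). For a one-node sequence (v) this equals delta_{s,t}(v)."""
    order = topological_order(succ)
    mass, start = 1.0, s
    for target in seq:
        # Propagate only the traffic that already passed the previous member.
        delta = dict.fromkeys(order, 0.0)
        delta[start] = mass
        for u in order:
            for v, p in succ.get(u, {}).items():
                delta[v] += delta[u] * p
        mass, start = delta[target], target
    return mass


def sampled_sequence_traffic(succ, s, seq, traffic, rho):
    """Contribution of one (s, t) pair to Equation 4, assuming distinct nodes in seq."""
    return math.prod(rho[v] for v in seq) * sequence_dependency(succ, s, seq) * traffic


# Hypothetical routing DAG from "s" to "t".
succ = {
    "s": {"a": 0.5, "b": 0.5},
    "a": {"c": 1.0},
    "b": {"c": 0.5, "t": 0.5},
    "c": {"t": 1.0},
}
print(sequence_dependency(succ, "s", ["c"]))       # 0.75
print(sequence_dependency(succ, "s", ["b", "c"]))  # 0.25
print(sequence_dependency(succ, "s", ["c", "b"]))  # 0.0 (b never follows c)
```

Summing `sampled_sequence_traffic` over all (s, t) pairs, each with its own routing DAG, would give the sequence RBC of Equation 4 under these assumptions.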
Equations 3 and 4 can be used to compute RBC of one sequence of nodes in $O(n^2 m)$ time. $\tilde{\delta}_{\bullet,\bullet}(S_\rho)$ is computed by Algorithm 2, by propagating only the portion of traffic that was sampled by the monitors in $S_\rho$.

Algorithm 2: RBC of sequences (with sampling)
  Input: G(V, E), R, T, ρ, S = (s_0, s_1, ..., s_l)   (i ≠ j ⇒ s_i ≠ s_j)
  Output: RBC of S
  Data: delta[1..|V|]
  RBCofS = 0
  for s, t ∈ V do
      ▷ topological sort
      ▷ init delta
      ▷ accumulate δ̃_{•,•}(S_ρ)
      k = 0
      for i = 0 to n do
          if v_i = s_k then k += 1
          for v_j ∈ successors(v_i) do
              if v_j ≺ s_k or v_j is s_k then
                  delta[v_j] += delta[v_i] · R(s, v_i, v_j, t)
      RBCofS += delta[t] · Π_{s_i ∈ S} ρ_{s_i}
  return RBCofS

4.3 RBC of sets

In this subsection we define the set variant of RBC. Generally, Betweenness-Centrality of a group of nodes measures the extent to which packets traverse at least one of the nodes in the group. The concept of centrality was first applied to groups and classes of nodes in networks by Everett and Borgatti in [14]. The set variant of RBC can be used, for example, for estimating the expected effectiveness of distributed monitors.

Algorithm 3: RBC of sets (with sampling)
  Input: G(V, E), R, T, ρ
  Output: RBC
  Data: delta[1..|V|], totalTraffic
  RBC = 0
  for s, t ∈ V do
      ▷ topological sort
      ▷ init delta
      ▷ accumulate δ̈_{•,•}(M_ρ)
      totalTraffic = Σ_{i=0}^{n−1} delta[i]
      for i = 0 to n do
          delta[v_i] = delta[v_i] · (1 − ρ_{v_i})
          for v_j ∈ successors(v_i) do
              delta[v_j] += delta[v_i] · R(s, v_i, v_j, t)
      RBC += (totalTraffic − delta[t])
  return RBC

Let $M = \{v_0, \ldots, v_k\}$ be a set of nodes. Let $\ddot{\delta}_{s,t}(M)$ be the probability that a packet from s to t will pass through at least one of the nodes in M. $\ddot{\delta}_{s,t}(M) \cdot T(s, t)$ is the expected number of packets sent from s to t that pass through M. If we disregard sampling rates, the RBC of the set M is:

$$\ddot{\delta}_{\bullet,\bullet}(M) = \sum_{s,t \in V} \ddot{\delta}_{s,t}(M) \cdot T(s, t).$$

Let $\rho_v$ be the sampling rate of the monitor installed on the node v. Let $M_\rho = \{v \mid \rho_v > 0\}$ be a set of nodes with positive sampling rates.
$M_\rho$ can be regarded as a fuzzy set where $\rho_v$ is the extent to which v belongs to $M_\rho$. Let $\ddot{\delta}_{s,t}(M_\rho)$ be the probability that a packet from s to t will be sampled by at least one of the nodes in M. For the sake of simplicity we prefer to compute $\ddot{\delta}_{s,t}(M_\rho)$ using its complementary probability, namely the probability that a packet from s to t will not be sampled by monitors in M.

Assume, for example, that each sampled packet is marked by the monitors. Let $\lambda^{M_\rho}_{s,t}(v)$ be the probability that a packet from s to t will pass through v without being marked, either before arriving at v or by v itself. The probability that a packet from s to t will not be marked by v is $1 - \rho_v$. Therefore, the probability that the packet will leave s without being marked is $1 - \rho_s$. Let u be a predecessor of v. The product $\lambda^{M_\rho}_{s,t}(u) \cdot R(s, u, v, t)$ is the probability that the packet will reach v through u without being marked. Summing these products over all predecessors of v gives the probability that the packet will get to v without being marked, as shown in Equation 5.

$$
\begin{aligned}
\lambda^{M_\rho}_{s,t}(s) &= 1 - \rho_s \\
\lambda^{M_\rho}_{s,t}(v) &= (1 - \rho_v) \cdot \sum_{u \in Pred_{s,t}(v)} \lambda^{M_\rho}_{s,t}(u) \cdot R(s, u, v, t)
\end{aligned}
\tag{5}
$$

$\lambda^{M_\rho}_{s,t}(t)$ is the probability that the packet from s to t will not be sampled by any of the monitors. Therefore $(1 - \lambda^{M_\rho}_{s,t}(t)) \cdot T(s, t)$ is the expected number of distinct packets from s to t captured by the monitors:

$$\ddot{\delta}_{s,t}(M_\rho) = 1 - \lambda^{M_\rho}_{s,t}(t).$$

The RBC of the fuzzy set $M_\rho$ is the expected number of packets sampled by at least one node in $M_\rho$ and can be computed using the complementary probabilities as described in Equation 6.

$$\ddot{\delta}_{\bullet,\bullet}(M_\rho) = \sum_{s,t \in V} \left(1 - \lambda^{M_\rho}_{s,t}(t)\right) \cdot T(s, t) \tag{6}$$

Assume a node v ∈ V and sampling rates ρ such that $\rho_v = 1$ and, for each $u \neq v$, $\rho_u = 0$. In this case $\delta_{\bullet,\bullet}(v) = \ddot{\delta}_{\bullet,\bullet}(\{v\}_\rho)$, making RBC of sets a valid generalization of RBC of single nodes.
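Equation 5 and the per-pair term of Equation 6 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a transcription of Algorithm 3: the routing DAG for a fixed (s, t) pair is assumed to be a hypothetical dictionary `succ` with `succ[u][v]` standing for $R(s, u, v, t)$, `rho` maps nodes to sampling rates (missing nodes are treated as rate 0), and the function names are illustrative. `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def topological_order(succ):
    """Nodes of the routing DAG, predecessors first."""
    preds = {u: set() for u in succ}
    for u, out in succ.items():
        for v in out:
            preds.setdefault(v, set()).add(u)
    return list(TopologicalSorter(preds).static_order())


def unsampled_probability(succ, s, t, rho):
    """lambda_{s,t}(t) of Equation 5: probability that a packet from s
    reaches t without being sampled by any monitor."""
    order = topological_order(succ)
    lam = dict.fromkeys(order, 0.0)
    lam[s] = 1.0                            # the packet enters at s unmarked
    for u in order:
        lam[u] *= 1.0 - rho.get(u, 0.0)     # survive sampling at u itself
        for v, p in succ.get(u, {}).items():
            lam[v] += lam[u] * p            # forward only unmarked traffic
    return lam[t]


def set_dependency(succ, s, t, rho):
    """Probability that at least one monitor samples the packet."""
    return 1.0 - unsampled_probability(succ, s, t, rho)


# Hypothetical routing DAG from "s" to "t".
succ = {
    "s": {"a": 0.5, "b": 0.5},
    "a": {"c": 1.0},
    "b": {"c": 0.5, "t": 0.5},
    "c": {"t": 1.0},
}
print(set_dependency(succ, "s", "t", {"a": 1.0, "c": 1.0}))  # 0.75
print(set_dependency(succ, "s", "t", {"c": 1.0}))            # 0.75, same as delta_{s,t}(c)
print(set_dependency(succ, "s", "t", {"b": 0.5}))            # 0.25
```

In the first call only the route s, b, t avoids both monitors, and it carries probability 0.25, so the set captures 0.75 of the traffic. The second call illustrates the remark above that a single node with sampling rate 1 reproduces the single-node dependency.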
In the following discussions we will occasionally omit the subscript ρ when referring solely to the nodes in M or when sampling rates are assumed to be 0 or 1.

Equations 5 and 6 can be used to compute RBC of one group of monitors with given sampling rates in $O(n^2 m)$ time, as shown in Algorithm 3. In the input to Algorithm 3 we use ρ to represent nodes with positive sampling rates. In the propagation stage of Algorithm 3, only the traffic that was not sampled propagates until it reaches t. Algorithm 3 is composed of an outer loop with three inner stages, similarly to Algorithm 1. The first two stages remain intact. The third stage implements Equation 5 to fill the delta array with the expected number of packets from s to t that were not captured before or at the respective node. In addition, instead of computing RBC of all nodes in the network, the algorithm computes the total expected number of packets that were captured by at least one monitor, according to Equation 6.

This concludes the definition of RBC and its computation methods for routing strategies where the routing decisions depend on both the source and the target of a packet. Next we will show how the assumption of source-oblivious routing reduces the time complexity of the presented algorithms from $O(n^2 m)$ to $O(nm)$.

5 Computing RBC for source-oblivious routing

In this section we describe how the computation of RBC can be optimized when assuming a source-oblivious routing scheme. We revise the computation of RBC of single nodes, sets, and sequences and present their respective algorithms with minimal changes.

5.1 RBC of individual nodes

Let $\delta_{\bullet,t}(r)$ be the expected number of packets targeted at t that pass through the node r, as defined by Equation 7.

$$\delta_{\bullet,t}(r) = \sum_{s \in V} \delta_{s,t}(r) \cdot T(s, t) \tag{7}$$

$\delta_{\bullet,t}(r)$ estimates the ability of r to monitor traffic flows targeted at t. We will refer to $\delta_{\bullet,t}(r)$ as the target dependency of t on r.
In this and the following subsections we will show how to compute RBC of individual nodes, sequences, and sets by aggregating target dependencies. Since target dependency is a summation of pairwise dependencies over all sources, RBC of the node r is a summation of target dependencies over all targets, as shown in Equation 8.

$$\delta_{\bullet,\bullet}(r) = \sum_{t \in V} \delta_{\bullet,t}(r) \tag{8}$$

If we are able to compute target dependency directly, without using Equation 7, the computation of $\delta_{\bullet,\bullet}(r)$ can be accelerated by replacing the loop over all s-t pairs in Algorithm 1 by a loop over all target nodes t only. Next, we will show that target dependency can be computed recursively, similarly to the computation of pairwise dependency. The similarity between these computations allows us to introduce only minimal changes to the pseudocode of Algorithm 1 in order to adapt it to source-oblivious routing strategies and reduce its complexity.

Let $Pred_t(v)$ be the set of all predecessors of v on the way to t: $Pred_t(v) = \{u \mid R(*, u, v, t) > 0\}$. In contrast to $Pred_{s,t}(v)$ defined in Section 4, here the set of possible predecessors of v is not influenced by the source of the communication. Let u be a predecessor of v on the way to t. The probability that a packet passes through v after visiting u is $R(*, u, v, t)$. The expected number of packets targeted at t that can be monitored by v includes packets introduced to the network by v ($T(v, t)$) and all packets introduced or forwarded by v’s predecessors, as described by Equation 9.

$$\delta_{\bullet,t}(v) = T(v, t) + \sum_{u \in Pred_t(v)} \delta_{\bullet,t}(u) \cdot R(*, u, v, t) \tag{9}$$

This equation can be derived directly from Equations 1 and 7, which describe the computation of $\delta_{s,t}(v)$ and define $\delta_{\bullet,t}(v)$ respectively. Since we assume loop-free routing, $Pred_t$ defines a DAG similarly to $Pred_{s,t}$, but this time the DAG has multiple sources and a single sink t, as shown in Figure 3. Equation 9 allows computing the values of $\delta_{\bullet,t}(v)$ for all v ∈ V in $O(m)$ time in the worst case.
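Equations 8 and 9 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the paper's pseudocode: for each target t the source-oblivious routing DAG is assumed to be a hypothetical dictionary `succ_t` with `succ_t[u][v]` standing for $R(*, u, v, t)$, the traffic matrix is a dictionary keyed by (s, t) pairs, and the sample DAG and all names are made up for illustration. `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def topological_order(succ):
    """Nodes of the routing DAG, predecessors first."""
    preds = {u: set() for u in succ}
    for u, out in succ.items():
        for v in out:
            preds.setdefault(v, set()).add(u)
    return list(TopologicalSorter(preds).static_order())


def target_dependency(succ_t, traffic_to_t):
    """delta_{.,t}(v) for every v (Equation 9).
    traffic_to_t[v] stands for T(v, t); missing nodes send nothing."""
    order = topological_order(succ_t)
    # Every node starts with the traffic it injects itself.
    delta = {v: float(traffic_to_t.get(v, 0.0)) for v in order}
    for u in order:
        for v, p in succ_t.get(u, {}).items():
            delta[v] += delta[u] * p        # forward injected and transit traffic
    return delta


def rbc_source_oblivious(routing, traffic):
    """Sum target dependencies over all targets (Equation 8).
    routing[t] is the DAG towards t; traffic[(s, t)] stands for T(s, t)."""
    rbc = {}
    for t, succ_t in routing.items():
        to_t = {s: x for (s, target), x in traffic.items() if target == t}
        for v, d in target_dependency(succ_t, to_t).items():
            rbc[v] = rbc.get(v, 0.0) + d
    return rbc


# Hypothetical DAG towards "t" with two sources.
succ_t = {
    "s1": {"v": 1.0},
    "s2": {"v": 0.5, "w": 0.5},
    "v": {"t": 1.0},
    "w": {"t": 1.0},
}
print(target_dependency(succ_t, {"s1": 10, "s2": 10}))
# v sees 15 packets (10 from s1 plus 5 from s2), w sees 5, t receives all 20.
```

A single pass over the DAG handles all sources at once, which is where the saving over one pass per (s, t) pair comes from.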
The structural similarity of Equations 1 and 9 suggests that the same process can be used to compute $\delta_{s,t}(v)$ and $\delta_{\bullet,t}(v)$. In fact, by changing the "init delta" stage of Algorithm 1 as shown in Algorithm 4, we make the accumulation stage fill the delta array with target dependencies instead of pairwise dependencies. Algorithm 4 initializes each entry delta[$v$] with $T(v,t)$ instead of assigning $T(s,t)$ to delta[$s$] and zero to all other entries. In contrast to Algorithms 1, 2, and 3, which loop through all $s$-$t$ pairs of nodes, we need to loop only through the target nodes to compute RBC given a source-oblivious routing strategy.

Figure 3: Example of a source-oblivious routing DAG with a single sink $t$ and two sources (dashed gray lines). In this example we assume $T(s_1,t_1) = T(s_2,t_1) = 10$ and $T(v_i,t_1) = 0$. The numbers on the arrows in sub-figures (a), (b), and (c) indicate the delta values contributed by the topologically sorted nodes to their successors in Algorithms 1, 2, and 3 respectively. Panels: (a) Single nodes; (b) Sequence $(v_1, v_3)$; (c) Set $\{v_1, v_3\}$.

Algorithm 4 loops once through all target nodes $t \in V$, performing a three-stage operation similar to Algorithm 1. In the first stage, the algorithm builds the routing DAG with multiple sources and a single sink (as opposed to the single-source, single-sink DAG built by the algorithms in the previous section) and sorts its nodes topologically. In the second stage, the delta array is initialized to $T(v,t)$. For example, in Figure 3, $T(s_1,t_1) = T(s_2,t_1) = 10$. Finally, in the third stage, the algorithm traverses the topologically sorted nodes of the network and aggregates
RBC values.

Algorithm 4: s-oblivious RBC of nodes
    Input: G(V, E), R, T
    Output: RBC[1..|V|]
    Data: delta[1..|V|]
    ∀v ∈ V, RBC[v] = 0
    for t ∈ V do
        ▼ topological sort
            E′ = {(u, v) | R(·, u, v, t) > 0}
            D = directed acyclic graph (V, E′)
            {v0 ⪯ v1 ⪯ … ⪯ vn = t} = topologically sorted nodes of D
        ▼ init delta
            ∀v ∈ V, delta[v] = T(v, t)
        ◮ accumulate δ•,•(v)
    return RBC

The third stage remains the same as in Algorithm 1, despite the fact that the delta array now represents target dependencies and not pairwise dependencies. Algorithm 4 iterates once over all nodes in the network and performs, for each one of them, a computation that takes at most $O(m)$ steps. Thus, the overall complexity of the algorithm is $O(nm)$. This is a factor of $n$ faster than Algorithm 1, whose complexity is $O(n^2 m)$. Next we present the equations which adapt the RBC computation of sequences and sets to the semantics of target dependencies.

5.2 RBC of ordered sequences

Employing target dependency. Let $S_\rho = (s_1, \ldots, s_k)$ be a sequence of nodes with sampling rates $\rho_{s_1}, \ldots, \rho_{s_k}$ respectively. Let $\tilde{\delta}_{\bullet,t}(S)$ be the expected number of packets targeted at $t$ that pass through all nodes in the sequence $S$:

$$\tilde{\delta}_{\bullet,t}(S) = \sum_{s \in V} \tilde{\delta}_{s,t}(S) \cdot T(s,t).$$

Accordingly, $\tilde{\delta}_{\bullet,t}(S) \cdot \prod_{v \in S} \rho_v$ is the expected number of packets targeted at $t$ that are sampled by all nodes in the sequence. Equations 10 and 11 describe the RBC of the sequence $S_\rho$ in terms of $\tilde{\delta}_{\bullet,t}(S_\rho)$.

$$\tilde{\delta}_{\bullet,t}((v)) = \delta_{\bullet,t}(v)$$
$$(v_{k-1} = v_k):\quad \tilde{\delta}_{\bullet,t}((\ldots, v_{k-1}, v_k)) = \tilde{\delta}_{\bullet,t}((\ldots, v_{k-1}))$$
$$(v_{k-1} \neq v_k):\quad \tilde{\delta}_{\bullet,t}((\ldots, v_{k-1}, v_k)) = \sum_{u \in \mathrm{Pred}_t(v_k)} \tilde{\delta}_{\bullet,t}((\ldots, v_{k-1}, u)) \cdot R(\cdot, u, v_k, t) \qquad (10)$$

$$\tilde{\delta}_{\bullet,\bullet}(S_\rho) = \prod_{v \in S} \rho_v \cdot \sum_{t \in V} \tilde{\delta}_{\bullet,t}(S) \qquad (11)$$

Algorithm 5: s-oblivious RBC of sequences (with sampling)
    Input: G(V, E), R, T, ρ, S = {s0, …, sk}
    Output: RBC
    Data: delta[1..|V|]
    RBC = 0
    for t ∈ V do
        ◮ topological sort
        ▼ init delta
            for v ∈ V do
                if v ≺ s0 or v is s0 then delta[v] = T(v, t)
                else delta[v] = 0
        ◮ accumulate δ̃•,•(Sρ)
    return RBC

Algorithm 5 computes the RBC of a sequence of monitors in $O(nm)$ time using Equations 10 and 11. During the iteration over all target nodes this algorithm sorts nodes in the same way as Algorithm 4 and accumulates betweenness in the same way as Algorithm 3. Entries of the delta[$v$] array represent the expected number of packets sampled by all monitors in the sequence preceding $v$ in the topological order. In particular, all entries delta[$v$] preceding the first element in the sequence are initialized to $T(v,t)$.

Using precomputed data. Next we will closely examine the probability that a packet sent from $s$ to $t$ will pass through $u$ and then through $v$, namely $\tilde{\delta}_{s,t}((u,v))$. Consider Figure 4 as an example. Assume a packet targeted at $t$ that has reached $u$. The probability that this packet will pass through $v$ on its way to $t$ depends neither on the source of the packet nor on the routing decisions made thus far. Therefore, we can multiply the probability that the packet from $s$ to $t$ will reach $u$, $\delta_{s,t}(u)$, by the probability that a packet from $u$ to $t$ will reach $v$, $\delta_{u,t}(v)$, to get the probability that a packet from $s$ to $t$ will pass through both $u$ and $v$: $\tilde{\delta}_{s,t}((u,v)) = \delta_{s,t}(u) \cdot \delta_{u,t}(v)$. We can add more nodes to the sequence $(u,v)$ using the following lemma:

Lemma 1 (Dependency chaining) Let $S = (s_1, \ldots, s_k)$ be an ordered sequence of nodes. The probability that a packet sent from a node $s$ to a different node $t$ will pass through all nodes in $S$ in the given order is:

$$\tilde{\delta}_{s,t}((s_1, \ldots, s_k)) = \delta_{s,t}(s_1) \cdot \tilde{\delta}_{s_1,t}((s_2, \ldots, s_k)).$$

Proof: The following proof is based on the fact that the probability of a packet passing through $(s_2, \ldots, s_k)$, assuming that the packet already visited $s_1$, does not depend on the source of the packet, since we assume that the routing scheme under investigation is source-oblivious.

First we will prove the lemma for $\tilde{\delta}_{s,t}((s_1, \ldots, s_k)) = 0$. $\tilde{\delta}_{s,t}((s_1, \ldots, s_k))$ is the probability that a packet emanating from $s$ and targeted at $t$ will first pass through $s_1$, then through $s_2$, and so on, until $s_k$. This probability is non-zero if, and only if, there is at least one route from $s$ to $t$ that passes through $(s_1, \ldots, s_k)$ in this order. Such a route exists if, and only if, there is a route from $s$ to $t$ traversing $s_1$ and there is a complementary route from $s_1$ to $t$ that includes the nodes $s_2, \ldots, s_k$. Hence:

$$\tilde{\delta}_{s,t}((s_1, \ldots, s_k)) = 0 \iff \delta_{s,t}(s_1) \cdot \tilde{\delta}_{s_1,t}((s_2, \ldots, s_k)) = 0.$$

Before we continue the proof for the more general case $\tilde{\delta}_{s,t}((s_1, \ldots, s_k)) \neq 0$, we will show that for any set of nodes there is at most one permutation $L$ of these nodes for which $\tilde{\delta}_{s,t}(L) > 0$. Note that $\tilde{\delta}$ is defined as a probability and therefore cannot be negative.

Proposition 1 Let $s, t \in V$ be two nodes in the network. Let $M \subseteq V$ be a subset of nodes. Let $L_1$ and $L_2$ be two permutations of $M$. Then for any loop-free routing strategy where routing decisions depend solely on $s$ and $t$ (or only on $t$ in the case of source-oblivious routing), the following two options are mutually exclusive unless $L_1 = L_2$:
1. $\tilde{\delta}_{s,t}(L_1) > 0$
2. $\tilde{\delta}_{s,t}(L_2) > 0$.

Proof: Let $L_1 = (v_1, \ldots, v_l)$ and $L_2 = (u_1, \ldots, u_l)$ be two different permutations of $M$ such that $\tilde{\delta}_{s,t}(L_1) > 0$ and $\tilde{\delta}_{s,t}(L_2) > 0$. Let $i$ be the lowest integer such that $v_i \neq u_i$. Let $j$ be the index of the node $v_i$ in $L_2$ ($v_i = u_j$). Let $k$ be the index of the node $u_i$ in $L_1$ ($u_i = v_k$). Since every node appears only once in each permutation and $i$ is the lowest index at which the permutations differ, it holds that $i < j$ and $i < k$. Without loss of generality assume that $j \leq k$.
This means that for each one of the permutations there is at least one route from $s$ to $t$ passing through all the nodes in the order defined by the permutation. In particular, there is at least one route from $s$ to $t$ passing through $(v_i, \ldots, v_k)$, and similarly for $(u_i, \ldots, u_j)$. Since routing decisions depend solely on $s$ and $t$, there is a non-zero probability that a packet from $s$ to $t$ will reach $u_j = v_i$ through $(u_i, \ldots, u_j = v_i)$ and continue back to $u_i = v_k$ through $(v_i, \ldots, v_k = u_i)$, in contradiction to the assumption that the routing strategy is loop-free.

Note that the existence of a route from $s$ to $t$ through $(s_1, \ldots, s_k)$ implies that there is no route that passes through these nodes in a different order, according to Proposition 1 above. Therefore, if $\tilde{\delta}_{s,t}((s_1, \ldots, s_k)) > 0$ then the order of the nodes $s, s_1, \ldots, s_k, t$ is well defined (with $s$ being the first node).

Assume that $\tilde{\delta}_{s,t}((s_1, \ldots, s_k)) > 0$. Let the event $Z_v$ represent all cases where the packet passes through node $v$. Let the event $T_v$ represent all cases where the packet is targeted toward $v$. "Targeting" here is different from "passing through", since the target of a packet has an effect on routing decisions along the traversed path.

$$\tilde{\delta}_{s,t}((s_1, \ldots, s_k)) = \Pr\left[\bigcap_{i=1}^{k} Z_{s_i} \,\middle|\, Z_s \cap T_t\right] \qquad (12)$$

The next equation immediately follows from Equation 12, since $\bigcap_{i=1}^{k} Z_{s_i}$ can be decomposed into two joint events, $Z_{s_1}$ and $\bigcap_{i=2}^{k} Z_{s_i}$.

$$\tilde{\delta}_{s,t}((s_1, \ldots, s_k)) = \Pr\left[Z_{s_1} \mid Z_s \cap T_t\right] \cdot \Pr\left[\bigcap_{i=2}^{k} Z_{s_i} \,\middle|\, Z_s \cap Z_{s_1} \cap T_t\right] \qquad (13)$$

Since we are dealing with source-oblivious routing, the nodes that the packet passed prior to passing through $s_1$ (in particular the source node $s$) have no effect on the remaining routing decisions. Therefore $s$ has no effect on the probability of a packet targeted at $t$ passing through $s_2, \ldots, s_k$ after visiting $s_1$.

$$\tilde{\delta}_{s,t}((s_1, \ldots, s_k)) = \Pr\left[Z_{s_1} \mid Z_s \cap T_t\right] \cdot \Pr\left[\bigcap_{i=2}^{k} Z_{s_i} \,\middle|\, Z_{s_1} \cap T_t\right] \qquad (14)$$

Finally, according to the definitions of $\delta$ and $\tilde{\delta}$, the proof of Lemma 1 can be completed.

$$\tilde{\delta}_{s,t}((s_1, \ldots, s_k)) = \delta_{s,t}(s_1) \cdot \tilde{\delta}_{s_1,t}((s_2, \ldots, s_k)) \qquad (15)$$

Figure 4 (nodes $s$, $u$, $v$, $t$): In this figure assume that packets are sent from $s$ to $t$ and are forwarded by $u$ and $v$ from left to right. The probability that an arbitrary packet sent from $s$ to $t$ will pass through $u$ and $v$ is smaller than the probability that it will pass through $u$. $\delta_{s,t}(u) = \frac{1}{3}$, $\delta_{s,t}(v) = \frac{1}{2}$, $\delta_{s,v}(u) = \frac{1}{2}$, and $\delta_{u,t}(v) = \frac{1}{2}$. $\tilde{\delta}_{s,t}((u,v)) = \frac{1}{3} \cdot \frac{1}{2} = \frac{1}{6}$ since we have two decision points: first at $s$ and then at $u$. Note that $\frac{1}{6} = \tilde{\delta}_{s,t}((u,v)) \neq \delta_{s,v}(u) \cdot \delta_{s,t}(v) = \frac{1}{4}$. This is because $\delta_{s,v}(u)$ does not consider the ultimate target ($t$) and ignores one possible path from $s$ to $t$.

Using Lemma 1, pairwise dependency on a sequence can be represented as a product of pairwise dependencies on single nodes:

$$\tilde{\delta}_{s,t}((s_1, \ldots, s_k)) = \delta_{s,t}(s_1) \cdot \delta_{s_1,t}(s_2) \cdot \ldots \cdot \delta_{s_{k-1},t}(s_k) \qquad (16)$$

Multiplying Equation 16 by $T(s,t)$ and summing it over all sources $s \in V$ results in the following target dependency chain:

$$\tilde{\delta}_{\bullet,t}((s_1, \ldots, s_k)) = \delta_{\bullet,t}(s_1) \cdot \prod_{i=2}^{k} \delta_{s_{i-1},t}(s_i) \qquad (17)$$

Equations 16 and 17 can be used to compute $\tilde{\delta}_{s,t}(S)$ and $\tilde{\delta}_{\bullet,t}(S)$ respectively in $O(|S|)$ steps, given the values of $\delta_{s,t}(s_i)$ and $\delta_{\bullet,t}(s_1)$. Consequently, $\tilde{\delta}_{\bullet,\bullet}(S)$ can be computed in $O(n \cdot |S|)$ steps using the summation over all target nodes:

$$\tilde{\delta}_{\bullet,\bullet}(S) = \sum_{t \in V} \tilde{\delta}_{\bullet,t}(S).$$

The pseudo-code for the computation can be found in Algorithm 6. The pseudo-code is straightforward and contains two nested loops, where the first one iterates over all target nodes in the network. The second loop iterates over the sequence members, multiplying the pairwise dependencies.

Algorithm 6: s-oblivious RBC of sequences (with sampling, after preprocessing)
    Input: G(V, E), R, T, ρ, S = {s0, …, sk}, δs,t(v), δ•,t(v)
    Output: RBC
    Data: delta
    RBC = 0
    for t ∈ V do
        delta = δ•,t(s0) · ρ_{s0}
        for i = 0 to k − 1 do
            delta *= δ_{si,t}(s_{i+1}) · ρ_{s_{i+1}}
        RBC += delta
    return RBC

5.3 RBC of sets

Employing target dependency. Let $\lambda^{M_\rho}_{\bullet,t}(v)$ be the expected number of packets targeted at $t$ that reach $v$ without being captured by any of the nodes in $M_\rho$:

$$\lambda^{M_\rho}_{\bullet,t}(v) = \sum_{s \in V} \lambda^{M_\rho}_{s,t}(v) \cdot T(s,t).$$

The following equations describe the RBC of the fuzzy set $M_\rho$ in terms of $\lambda^{M_\rho}_{\bullet,t}(v)$:

$$\lambda^{M_\rho}_{\bullet,t}(v) = (1 - \rho_v) \cdot T(v,t) + \sum_{u \in \mathrm{Pred}_t(v)} \lambda^{M_\rho}_{\bullet,t}(u) \cdot R(\cdot, u, v, t) \cdot (1 - \rho_v) \qquad (18)$$

$$\ddot{\delta}_{\bullet,\bullet}(M_\rho) = \sum_{t \in V} \left( \sum_{s \in V} T(s,t) - \lambda^{M_\rho}_{\bullet,t}(t) \right) \qquad (19)$$

Algorithm 7 computes the RBC of a set of monitors installed on nodes in a communication network with a source-oblivious routing strategy, given the sampling rates of the monitors. This algorithm iterates over all nodes in the network and, in each iteration, sorts nodes in the same way as Algorithm 4. It initializes each entry of the delta[$v$] array to $(1 - \rho_v) \cdot T(v,t)$ and accumulates betweenness similarly to Algorithm 3. Thus, the time complexity of Algorithm 7 is $O(nm)$.

Algorithm 7: s-oblivious RBC of sets (with sampling)
    Input: G(V, E), R, T, ρ, M
    Output: RBC
    Data: delta[1..|V|]
    RBC = 0
    for t ∈ V do
        ◮ topological sort
        ▼ init delta
            ∀v ∈ V, delta[v] = (1 − ρv) · T(v, t)
        ◮ accumulate δ̈•,•(Mρ)
    return RBC

Contribution to RBC of a set. In this subsection we assume that a set of monitors $X$ is installed on nodes in a network and that their sampling rates are specified by $\rho$. We investigate the expected number of unsampled packets that can be sampled by additional monitors. We will refer to this measure as the contribution of individual nodes, sets of nodes, or sequences of nodes to the RBC of $X_\rho$. In Section 4.3 we have defined $\lambda^{X_\rho}_{s,t}(v)$ as the probability that a packet from $s$ to $t$ will pass through $v$ without being sampled by monitors in $X_\rho$. This probability gives no information regarding the probability that this packet will be sampled after passing through $v$.
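Returning to Equations 18 and 19, the per-target part of the set computation can be sketched as follows. This is an illustrative sketch only: the function name `captured_for_target`, the dictionary encodings of $R$, $T$ and $\rho$ for one fixed target, and the example DAG are assumptions, not part of the original algorithms.

```python
from fractions import Fraction
from graphlib import TopologicalSorter  # Python 3.9+


def captured_for_target(nodes, target, routing, traffic, rho):
    """Expected number of packets targeted at `target` that are sampled
    by at least one monitor (the inner term of Equation 19).

    routing[(u, v)] stands for R(., u, v, t), traffic[v] for T(v, t) and
    rho[v] for the sampling rate of v (0 if v carries no monitor).
    """
    preds = {v: [] for v in nodes}
    for (u, v), prob in routing.items():
        if prob > 0:
            preds[v].append(u)

    unsampled = {}  # plays the role of lambda in Equation 18
    for v in TopologicalSorter(preds).static_order():
        arriving = traffic.get(v, 0) + sum(
            unsampled[u] * routing[(u, v)] for u in preds[v]
        )
        # Only the fraction (1 - rho_v) leaves v without being sampled.
        unsampled[v] = (1 - rho.get(v, 0)) * arriving
    return sum(traffic.values()) - unsampled[target]


# Hypothetical DAG with sink "t" and two sources sending 10 packets each.
half = Fraction(1, 2)
nodes = ["s1", "s2", "v1", "v2", "t"]
routing = {
    ("s1", "v1"): 1,
    ("s2", "v1"): half,
    ("s2", "v2"): half,
    ("v1", "t"): 1,
    ("v2", "t"): 1,
}
traffic = {"s1": 10, "s2": 10}

one_monitor = captured_for_target(nodes, "t", routing, traffic, {"v1": half})
two_monitors = captured_for_target(
    nodes, "t", routing, traffic, {"v1": half, "s2": Fraction(1, 5)}
)
print(one_monitor, two_monitors)
```

Summing the returned value over all targets gives the quantity in Equation 19. Factoring $(1 - \rho_v)$ out of both terms of Equation 18 is purely cosmetic; the recursion is unchanged.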
Let $\chi^{X_\rho}_{s,t}(w)$ be the probability that a packet from $s$ to $t$ will pass through $w$ and will not be sampled by any of the monitors in $X_\rho$ (neither before nor after visiting $w$). Assume that $w$ monitors the traffic with sampling rate $\rho_w > 0$. Then $\chi^{X_\rho}_{s,t}(w) \cdot \rho_w \cdot T(s,t)$ is the expected number of packets from $s$ to $t$ that were sampled only by $w$ and not by other monitors. In other words, $\chi^{X_\rho}_{s,t}(w) \cdot \rho_w$ is the contribution of $w$ to the capability of the monitors to sample traffic between $s$ and $t$. $\chi^{X_\rho}_{s,t}(u)$ can be computed for any $u \in V$ by starting with $X = \emptyset$ and adding nodes to $X$ one at a time using the following lemma:

Lemma 2 (Pairwise dependency contribution) Let $X = \{v_1, \ldots, v_k\}$ be a set of nodes with sampling rates specified by $\rho_{v_1}, \ldots, \rho_{v_k}$ respectively. Let $w$ be a node with sampling rate $\rho_w$. For any $u \in V$ it holds that:

$$(u = w):\quad \chi^{X_\rho \cup \{w\}_\rho}_{s,t}(u) = \chi^{X_\rho}_{s,t}(u) \cdot (1 - \rho_w)$$
$$(u \neq w):\quad \chi^{X_\rho \cup \{w\}_\rho}_{s,t}(u) = \chi^{X_\rho}_{s,t}(u) - \chi^{X_\rho}_{s,t}(u) \cdot \chi^{X_\rho}_{u,t}(w) \cdot \rho_w - \chi^{X_\rho}_{s,t}(w) \cdot \chi^{X_\rho}_{w,t}(u) \cdot \rho_w \qquad (20)$$

Proof: This lemma describes the computation of the probability that a packet from $s$ to $t$ will pass through $u$ without being sampled either by monitors in $X$ or by $w$. The guiding principle of the computation is the following: to discard packets that were sampled by $w$, we need to subtract from $\chi^{X_\rho}_{s,t}(u)$ the probability of the packet being sampled by $w$ either before or after passing through $u$.

The probability of a packet passing through $w$ without being sampled by $w$ or any other node in $X$, namely $\chi^{X_\rho \cup \{w\}_\rho}_{s,t}(w)$, equals its probability of passing through $w$ without being sampled by any node in $X$ and not being sampled by $w$. Being sampled by $w$ and being sampled by any node in $X$ are independent events (assuming that $w \notin X$). Hence the first case of the lemma.

Assume $w \neq u$. Let the event $Y_v$ represent the cases where the packet from $s$ to $t$ passes through the node $v$. Let the event $Z_v$ represent the cases where this packet was sampled by the node $v$.
Let $\overline{Z}_v$ represent the cases where this packet was not sampled by the node $v$. $\chi^{X_\rho}_{s,t}(u)$ is the probability that a packet from $s$ to $t$ was not sampled by any node in $X$, but passes through $u$:

$$\chi^{X_\rho}_{s,t}(u) = \Pr\left[\bigcap_{v \in X} \overline{Z}_v \cap Y_u\right]. \qquad (21)$$

$\Pr\left[\bigcap_{v \in X} \overline{Z}_v \cap Y_u\right]$ can be decomposed into two cases, packets that were sampled by $w$ and packets that were not:

$$\chi^{X_\rho}_{s,t}(u) = \Pr\left[\bigcap_{v \in X} \overline{Z}_v \cap Z_w \cap Y_u\right] + \Pr\left[\bigcap_{v \in X} \overline{Z}_v \cap \overline{Z}_w \cap Y_u\right]. \qquad (22)$$

Assume a packet from $s$ to $t$ that was not sampled by any node in $X$. The above equation yields that the probability of this packet passing through $u$ without being sampled by $w$ (case $\overline{Z}_w \cap Y_u$) is equal to the probability of the packet passing through $u$ minus the probability of the packet passing through $u$ while being sampled by $w$ (case $Z_w \cap Y_u$). According to Proposition 1, the packet can pass through both $u$ and $w$ either by passing first through $u$ and then through $w$, or vice versa. Note that a packet can be sampled by $w$ only if it passes through $w$. Therefore, the case $Z_w \cap Y_u$ can be represented as the sum of two sequence dependencies multiplied by the sampling rate of $w$: $\tilde{\delta}_{s,t}((w,u)) \cdot \rho_w + \tilde{\delta}_{s,t}((u,w)) \cdot \rho_w$. Moreover, the proof of Proposition 1 can easily be translated from $\delta$ to $\chi$ by excluding packets that are sampled by some node in $X$. By replacing the probabilities in Equation 22 with the respective pairwise dependencies we get:

$$\chi^{X_\rho}_{s,t}(u) = \left( \tilde{\chi}^{X_\rho}_{s,t}((u,w)) \cdot \rho_w + \tilde{\chi}^{X_\rho}_{s,t}((w,u)) \cdot \rho_w \right) + \chi^{X_\rho \cup \{w\}_\rho}_{s,t}(u). \qquad (23)$$

According to the Dependency Chaining Lemma, which can also be adjusted to $\chi$ by considering only packets that were not sampled by $X$, $\tilde{\chi}^{X_\rho}_{s,t}((u,w))$ can be decomposed into a product of two pairwise dependencies, completing the proof:

$$\chi^{X}_{s,t}(u) - \chi^{X}_{s,t}(u) \cdot \chi^{X}_{u,t}(w) \cdot \rho_w - \chi^{X}_{s,t}(w) \cdot \chi^{X}_{w,t}(u) \cdot \rho_w = \chi^{X \cup \{w\}}_{s,t}(u). \qquad (24)$$

Let $\chi^{X_\rho}_{\bullet,t}(u) = \sum_{s \in V} \chi^{X_\rho}_{s,t}(u) \cdot T(s,t)$ be the expected number of packets targeted at $t$ that will pass through $u$ and will not be sampled by any of the monitors in $X_\rho$.
Lemma 2 can be used to compute $\chi^{X_\rho \cup \{w\}_\rho}_{\bullet,t}(u)$ by multiplying Equation 20 by $T(s,t)$ and summing it over all sources.

$$(u = w):\quad \chi^{X_\rho \cup \{w\}_\rho}_{\bullet,t}(u) = \chi^{X_\rho}_{\bullet,t}(u) \cdot (1 - \rho_w)$$
$$(u \neq w):\quad \chi^{X_\rho \cup \{w\}_\rho}_{\bullet,t}(u) = \chi^{X_\rho}_{\bullet,t}(u) - \chi^{X_\rho}_{\bullet,t}(u) \cdot \chi^{X_\rho}_{u,t}(w) \cdot \rho_w - \chi^{X_\rho}_{\bullet,t}(w) \cdot \chi^{X_\rho}_{w,t}(u) \cdot \rho_w \qquad (25)$$

Let $\chi^{X_\rho}_{\bullet,\bullet}(w) = \sum_{t \in V} \chi^{X_\rho}_{\bullet,t}(w)$ be the expected number of packets between all source-target pairs that pass through $w$ without being sampled by any node in $X$. $\chi^{X_\rho}_{\bullet,\bullet}(w) \cdot \rho_w$ can be considered as the contribution of $w$ to the RBC of $X_\rho$:

$$\ddot{\delta}_{\bullet,\bullet}(X_\rho \cup \{w\}_\rho) = \ddot{\delta}_{\bullet,\bullet}(X_\rho) + \chi^{X_\rho}_{\bullet,\bullet}(w) \cdot \rho_w.$$

Using precomputed data. We assume that all $\delta_{s,t}(v)$, $\delta_{\bullet,t}(v)$, and $\delta_{\bullet,\bullet}(v)$ values are computed using Algorithm 1 and stored in a data structure with $O(1)$ store and retrieval, such as matrices or a hash table. The speed-up methods presented here are valid for source-oblivious routing strategies. In particular, we assume that all routing decisions specified by the probabilities $R(\cdot, u, v, t)$ are independent.

In order to make the discussion more intuitive we will use an example from set theory. Let $A$, $B$, and $C$ be three sets. We need to compute the size of their union, namely, the number of elements belonging to at least one of the sets. Assume that we can easily compute the size of an intersection but not the size of a union. In order to overcome this difficulty we can use the Inclusion-Exclusion rule as follows:

$$|A \cup B \cup C| = |A| + |B| + |C| - |A \cap B| - |A \cap C| - |B \cap C| + |A \cap B \cap C|.$$

Regrouping the addends results in:

$$|A \cup B \cup C| = |A| + |B \cap \overline{A}| + |C \cap \overline{A} \cap \overline{B}|,$$

where $|B \cap \overline{A}| = |B| - |A \cap B|$ accounts for all elements that belong to $B$ and do not belong to $A$. In our case the size of a union can be associated with the RBC of sets, where we account for packets sampled by at least one monitor in the set. The size of an intersection can be associated with the RBC of sequences, where we account for packets sampled by all monitors in the sequence.
Finally, the semantics of $|B \cap \overline{A}|$ are similar to the semantics of $\chi^{\{u\}}_{s,t}(v)$: the probability that a packet from $s$ to $t$ passes through $v$ without passing through $u$. Next we will apply the technique demonstrated in the above example to compute the expected number of packets sampled by a set of monitors. Pairwise dependency on a set of monitors can be computed by summing the contributions of the set members, as described by the following lemma:

Lemma 3 (Summing dependency contributions) Let $M = \{v_0, \ldots, v_k\}$ be a set of nodes with sampling rates specified by $\rho_{v_0}, \ldots, \rho_{v_k}$ respectively. Let $M(i) = \{v_0, \ldots, v_i\}$ be a subset of $M$. It holds that:

$$\ddot{\delta}_{s,t}(M_\rho) = \sum_{i=0}^{k} \chi^{M(i-1)_\rho}_{s,t}(v_i) \cdot \rho_{v_i}.$$

Proof: Lemma 3 describes an iterative computation of $\ddot{\delta}_{s,t}(M_\rho)$. In each iteration we accumulate the contribution of $v_i$, namely the uncovered traffic flows sampled by $v_i$, to the pairwise dependency. Let the event $Z_v$ represent all cases where the packet from $s$ to $t$ is sampled by node $v$. Let the event $\overline{Z}_v$ represent all cases where the packet from $s$ to $t$ is not sampled by node $v$. Let the event $Y_v$ represent all cases where the packet from $s$ to $t$ passes through node $v$. By definition, $\ddot{\delta}_{s,t}(\{v_1, \ldots, v_k\}_\rho)$ is the probability that the packet from $s$ to $t$ will be sampled by at least one of the $k$ nodes $v_1, \ldots, v_k$:

$$\ddot{\delta}_{s,t}(\{v_1, \ldots, v_k\}_\rho) = \Pr\left[\bigcup_{i=1}^{k} Z_{v_i}\right].$$

We can substitute the right-hand term in the above equation by a summation as follows:

$$\ddot{\delta}_{s,t}(\{v_1, \ldots, v_k\}_\rho) = \sum_{i=1}^{k} \Pr\left[Z_{v_i} \cap \bigcap_{j=1}^{i-1} \overline{Z}_{v_j}\right].$$

For each $i$, the term inside the summation above is the probability that the packet from $s$ to $t$ will be sampled by $v_i$ without being sampled by $v_1, \ldots, v_{i-1}$. According to the definition of $\chi$:

$$\Pr\left[Y_{v_i} \cap \bigcap_{j=1}^{i-1} \overline{Z}_{v_j}\right] = \chi^{\{v_1, \ldots, v_{i-1}\}}_{s,t}(v_i).$$

The sampling rate of the node $v_i$ is $\rho_{v_i}$, and therefore we can substitute the term inside the summation by $\chi^{\{v_1, \ldots, v_{i-1}\}}_{s,t}(v_i) \cdot \rho_{v_i}$, completing the proof:

$$\ddot{\delta}_{s,t}(\{v_1, \ldots, v_k\}_\rho) = \sum_{i=1}^{k} \chi^{\{v_1, \ldots, v_{i-1}\}}_{s,t}(v_i) \cdot \rho_{v_i}.$$

Algorithm 8: s-oblivious RBC of sets (with sampling, after preprocessing)
    Input: G(V, E), R, T, ρ, M = {v0, …, vk}, δs,t(v), δ•,t(v)
    Output: RBC
    Data: pdep[k × k × n], tdep[k × n], npdep[k × k × n], ntdep[k × n]
    RBC = 0
    for s, v ∈ M, t ∈ V do pdep[s, v, t] = δs,t(v)
    for v ∈ M, t ∈ V do tdep[v, t] = δ•,t(v)
    ▼ account for M
    for w ∈ M do
        for t ∈ V do
            RBC += tdep[w, t] · ρw
            for u ∈ M do
                if u = w then
                    ntdep[u, t] = tdep[u, t] · (1 − ρw)
                else
                    ntdep[u, t] = tdep[u, t] − tdep[u, t] · pdep[u, w, t] · ρw − tdep[w, t] · pdep[w, u, t] · ρw
                for s ∈ M do
                    if u = w then
                        npdep[s, u, t] = pdep[s, u, t] · (1 − ρw)
                    else
                        npdep[s, u, t] = pdep[s, u, t] − pdep[s, u, t] · pdep[u, w, t] · ρw − pdep[s, w, t] · pdep[w, u, t] · ρw
        tdep = ntdep; pdep = npdep
    return RBC

Lemma 3 provides us with a tool for the iterative computation of pairwise dependency on sets of nodes. Summing $\ddot{\delta}_{s,t}(M_\rho)$ over all sources while multiplying each addend by $T(s,t)$ results in Equation 26, which describes the iterative computation of target dependency on a set of nodes.

$$\ddot{\delta}_{\bullet,t}(M_\rho) = \sum_{i=0}^{k} \chi^{M(i-1)_\rho}_{\bullet,t}(v_i) \cdot \rho_{v_i} \qquad (26)$$

Summing Equation 26 over all targets results in the iterative computation of the RBC of a set of nodes:

$$\ddot{\delta}_{\bullet,\bullet}(M_\rho) = \sum_{t \in V} \sum_{i=0}^{k} \chi^{M(i-1)_\rho}_{\bullet,t}(v_i) \cdot \rho_{v_i} \qquad (27)$$

In Algorithm 8 we compute the RBC of a given set by iterating over all nodes in the set and summing their marginal contributions, as described in Equation 27. The marginal contributions are computed using Lemma 2. During the algorithm we maintain two matrices. One is the three-dimensional matrix of pairwise dependencies and the other is the two-dimensional matrix of target dependencies. The last dimension in these matrices is of size $n$, while the other dimensions are of size $k$. We use Equation 20 to update pairwise dependencies and Equation 25 to update target dependencies.
Algorithm 8 is composed of an initialization phase, where precomputed values are copied into temporary matrices, and four nested loops that compute the RBC of the input set of nodes. The temporary arrays pdep and tdep maintain pairwise and target dependencies respectively. Initially the values in these arrays correspond to $\chi^{\emptyset}_{s,t}(v) = \delta_{s,t}(v)$ and $\chi^{\emptyset}_{\bullet,t}(v) = \delta_{\bullet,t}(v)$. In each iteration of the outer loop we process one node from the input set $M$. The marginal contributions of nodes to the RBC of $M$ are aggregated according to Equation 27 during the first inner loop (which iterates over all $t \in V$). The second and third inner loops iterate over nodes in $M$ and update the entries of the tdep and pdep matrices respectively, according to Equations 25 and 20. After the first update of all the values in these matrices they correspond to $\chi^{M(1)_\rho}_{s,t}(v)$ and $\chi^{M(1)_\rho}_{\bullet,t}(v)$, where $M(1)$ contains only the first node in $M$. During the second iteration we process one more node from $M$ and update the matrices again, and so on until we have processed all nodes in $M$. The overall time complexity of Algorithm 8 is $O(k^3 n)$, where $k$ is the size of $M$ and $n$ is the number of nodes in the network.

5.4 RBC of sets of edges and mixed sets

In many applications the traffic is monitored by tapping the communication links and not the nodes. The problem of monitoring links can easily be reduced to the problem of monitoring nodes by adding a phantom node in the middle of the monitored link and updating the routing scheme $R$ and the traffic matrix $T$ appropriately. Algorithms that do not use pre-processing can be configured to skip the phantom nodes in their main loop. This is an intuitive optimization, since phantom nodes do not introduce traffic to the network. In this case, adding phantom nodes does not increase the complexity of these algorithms but makes each iteration of the main loop twice as long as before.
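The phantom-node reduction described above can be illustrated with a small sketch. The helper name `add_phantom_node` and the per-target dictionary encoding of $R$ are assumptions made for this example; the paper does not prescribe a concrete data structure.

```python
from fractions import Fraction


def add_phantom_node(routing, link, phantom):
    """Return a copy of a per-target routing table in which the monitored
    link (u, v) is split into (u, phantom) and (phantom, v).

    routing[(u, v)] stands for R(., u, v, t) for one fixed target t.
    """
    u, v = link
    updated = dict(routing)
    prob = updated.pop((u, v))
    updated[(u, phantom)] = prob  # u now forwards to the phantom node
    updated[(phantom, v)] = 1     # the phantom node always forwards to v
    return updated


# Hypothetical routing toward "t": u splits traffic between v and x.
routing = {
    ("u", "v"): Fraction(1, 2),
    ("u", "x"): Fraction(1, 2),
    ("v", "t"): 1,
    ("x", "t"): 1,
}
split = add_phantom_node(routing, ("u", "v"), "p")
print(sorted(split))
```

In this sketch the traffic matrix needs no new non-zero entries, because the phantom node neither sends nor receives packets. A monitor on the link $(u, v)$ then becomes a monitor on the node `p`, and the same transformation would have to be repeated for every target's routing table.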
In algorithms that use pre-processing, after adding phantom nodes, the size of the pre-computed matrices will depend on the number of edges and not only on the number of nodes. Thus, adding phantom nodes increases both the time and the space complexity of the preprocessing stage in Algorithms 6 and 8 from $O(n^2 m)$ time and $O(n^3)$ space to $O(|E|^2 m)$ time and $O(|E|^3)$ space.

In order to avoid the addition of phantom nodes we should remember that the RBC of a (directed) link is the RBC of the sequence consisting of both of its nodes. Therefore, the RBC of a sequence of directed links is the RBC of the sequence of nodes comprising the links and can be computed using Algorithms 2, 5, and 6. The RBC of a set of links can be computed iteratively, by taking into account one link at a time, similarly to the RBC of a set of nodes. Let $(u,v)$ be a link tapped by a monitor with sampling rate $\rho_{(u,v)}$. The expected number of packets that will pass through $w$ and will not be sampled by that monitor is equal to $\delta_{\bullet,\bullet}(w)$ minus the RBC of the sequences $(w,u,v)$ and $(u,v,w)$. Note that if the sequence $(u,v)$ represents a link and $u \neq w \neq v$, then the RBC of the sequence $(u,w,v)$ is zero.

Algorithm 9: s-oblivious RBC of sets of links (with sampling, after preprocessing)
    Input: G(V, E), R, T, ρ, Q = {(u1, v1), …, (uk, vk)}, δs,t(v), δ•,t(v)
    Output: RBC
    Data: pdep, npdep, tdep, ntdep
    RBC = 0
    X = set of nodes comprising Q
    for s, v ∈ X, t ∈ V do pdep[s, v, t] = δs,t(v)
    for v ∈ X, t ∈ V do tdep[v, t] = δ•,t(v)
    ▼ account for Q
    for (u, v) ∈ Q do
        for t ∈ V do
            RBC += tdep[u, t] · pdep[u, v, t]
            for w ∈ X do
                ntdep[w, t] = tdep[w, t] − tdep[w, t] · pdep[w, u, t] · pdep[u, v, t] − tdep[u, t] · pdep[u, v, t] · pdep[v, w, t]
                for s ∈ X do
                    npdep[s, w, t] = pdep[s, w, t] − pdep[s, w, t] · pdep[w, u, t] · pdep[u, v, t] − pdep[s, u, t] · pdep[u, v, t] · pdep[v, w, t]
        tdep = ntdep; pdep = npdep
    return RBC
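Treating a link as the two-node sequence of its endpoints means that its RBC can be obtained by the chaining of Equation 17, in the style of Algorithm 6. The sketch below is illustrative only: the function name `sequence_rbc` and the dictionary layout of the precomputed dependencies are assumptions. The pairwise values are those quoted in the caption of Figure 4, while the traffic volume $T(s,t) = 6$ is a hypothetical choice.

```python
from fractions import Fraction


def sequence_rbc(sequence, targets, tdep, pdep, rho):
    """Chain precomputed dependencies along an ordered sequence of nodes.

    tdep[(v, t)] stands for the target dependency of t on v and
    pdep[(s, v, t)] for the pairwise dependency of (s, t) on v.
    """
    total = 0
    for t in targets:
        first = sequence[0]
        delta = tdep[(first, t)] * rho[first]
        # Multiply by the dependency of each consecutive pair (Equation 17).
        for prev, nxt in zip(sequence, sequence[1:]):
            delta *= pdep[(prev, nxt, t)] * rho[nxt]
        total += delta
    return total


# Figure 4 values: a packet from s to t passes u with probability 1/3,
# and a packet from u to t passes v with probability 1/2.
# Assumption: s is the only source and T(s, t) = 6.
tdep = {("u", "t"): 6 * Fraction(1, 3)}
pdep = {("u", "v", "t"): Fraction(1, 2)}
rho = {"u": 1, "v": 1}

link_rbc = sequence_rbc(("u", "v"), ["t"], tdep, pdep, rho)
print(link_rbc)
```

With both sampling rates set to 1 the result is the expected number of packets that traverse $u$ and then $v$, which is $6 \cdot \frac{1}{3} \cdot \frac{1}{2} = 1$ here. The per-node rates in `rho` follow Algorithm 6; a single link rate $\rho_{(u,v)}$ would have to be applied as one additional factor.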
We can modify the recursive Equation 20 to compute the probability that a packet from $s$ to $t$ will pass through $w$ and will not be sampled by any link monitor in a given set. Let $Q(i) = \{(u_1, v_1), \ldots, (u_i, v_i)\}$ be a set of links with sampling rates specified by $\rho_{(u_1,v_1)}, \ldots, \rho_{(u_i,v_i)}$ respectively, and let $(u, v) = (u_{i+1}, v_{i+1})$ be the next link to be added.

$$\chi^{Q(i+1)_\rho}_{s,t}(w) = \chi^{Q(i)_\rho}_{s,t}(w) - \tilde{\chi}^{Q(i)_\rho}_{s,t}((w,u,v)) - \tilde{\chi}^{Q(i)_\rho}_{s,t}((u,v,w)) \qquad (28)$$

where

$$\tilde{\chi}^{Q(i)_\rho}_{s,t}((w,u,v)) = \chi^{Q(i)_\rho}_{s,t}(w) \cdot \chi^{Q(i)_\rho}_{w,t}(u_{i+1}) \cdot \chi^{Q(i)_\rho}_{u_{i+1},t}(v_{i+1})$$

and

$$\tilde{\chi}^{Q(i)_\rho}_{s,t}((u,v,w)) = \chi^{Q(i)_\rho}_{s,t}(u_{i+1}) \cdot \chi^{Q(i)_\rho}_{u_{i+1},t}(v_{i+1}) \cdot \chi^{Q(i)_\rho}_{v_{i+1},t}(w).$$

Target dependency with respect to a set of link monitors, $\chi^{Q(i+1)_\rho}_{\bullet,t}(w)$, can be computed using the equation above if we substitute pairwise dependencies with target dependencies.

Algorithm 10: s-oblivious RBC of sets of links and nodes (with sampling, after preprocessing)
    Input: G(V, E), R, T, ρ, M, Q, δs,t(v), δ•,t(v)
    Output: RBC
    Data: pdep, npdep, tdep, ntdep
    RBC = 0
    X = set of nodes comprising Q and M
    for s, v ∈ X, t ∈ V do pdep[s, v, t] = δs,t(v)
    for v ∈ X, t ∈ V do tdep[v, t] = δ•,t(v)
    ◮ account for M
    ◮ account for Q
    return RBC

Algorithm 9 computes the RBC of a set of links similarly to the way that Algorithm 8 computes the RBC of a set of nodes, but instead of implementing Equations 20 and 25 it implements Equation 28 and its target dependency variant. Another difference between Algorithms 9 and 8 is the size of the pdep and tdep matrices. The size of the first dimensions of these matrices is now equal to the number of nodes attached to the input links. Algorithm 9 has the same four nested loops as Algorithm 8. In addition, the number of nodes comprising the links in a given set of size $k$ is at least $k + 1$ and at most $2k$. Therefore the time complexity and the space complexity of Algorithm 9 are $O(k^3 n)$ and $O(k^2 n)$ respectively, the same as for Algorithm 8.
Both algorithms can be combined in order to compute the RBC of a set of monitors that includes monitors installed on nodes and monitors installed on links, as shown in Algorithm 10. Let $M$ be the set of monitored nodes and $Q$ be the set of monitored links. Let $X = M \cup \{u, v : (u,v) \in Q\}$ be the set of nodes comprising $M$ and $Q$. First, all the data relevant to $X$ is copied into the tdep and pdep matrices. Afterwards, the cores of both algorithms are executed consecutively to compute the RBC of the mixed set $M \cup Q$.

6 Conclusions

In this paper we have defined a new Betweenness-Centrality measure called Routing Betweenness-Centrality (RBC), which is a generalization of well-known betweenness centrality measures such as Shortest-Path Betweenness Centrality, Traffic Load Centrality, and Flow Betweenness Centrality. RBC measures the extent to which nodes or groups of nodes are exposed to the traffic given any loop-free routing strategy. The algorithms presented in this paper are easily modified to compute the RBC of groups consisting of links and/or nodes (see Section 5.4). In fact, more sophisticated combinations of policies for traffic monitoring and control are supported. Using the methods presented in this paper we can compute the expected number of packets each one of which satisfies a predicate in disjunctive normal form with at most one negation clause. For example, packets each one of which is sampled by q, u, and v, or by w and x but is neither sampled by y nor by w.

The computational complexity of our algorithms depends on whether the routing scheme is source-dependent or source-oblivious. Generally speaking, when the routing decisions in the network depend on both the source and the target of a packet, the time complexity of RBC computation is a factor of $n$ higher than in the source-oblivious cases.
For source-oblivious routing schemes, our RBC algorithms can be used to compute the Shortest Path Betweenness-Centrality and Traffic Load Centrality with complexity matching the state of the art, while also being capable of computing a larger variety of Betweenness-Centrality measures. We show that preprocessing can dramatically reduce the time required for a single computation of the RBC of a sequence. Preprocessing can also reduce the time required to compute the RBC of sets when the size of the investigated set is smaller than the cube root of $m$ (the number of edges in the routing tree or the routing DAG of the given routing scheme).

Since RBC is more general than existing definitions of Betweenness-Centrality and is capable of better reflecting routing schemes in communication networks, we believe that many applications of RBC will be found in the near future. We have already found RBC useful for predicting the effectiveness and the cost of passive network monitoring. RBC can be used in conjunction with various combinatorial optimization techniques and approximation algorithms, such as those described in [13, 29, 32], for optimizing the placement of passive monitors within the communication network. Other obvious applications include simulation-free prediction of congestion in communication networks, and the design and examination of routing strategies and network layout for, say, balancing the traffic load in the network and assuring service level agreements.

Acknowledgments

The authors would like to thank Omer Zohar for implementing and testing all RBC algorithms and department members who contributed valuable remarks on this paper.

References

[1] Optimized multipath. http://www.faster-light.net/omp/.
[2] J. M. Anthonisse. The rush in a directed graph. Technical Report BN 9/71, Stichting Mathematisch Centrum, Amsterdam, 1971.
[3] A.-L. Barabasi and R. Albert. Emergence of scaling in random networks. Science, 286:509–512, 1999.
[4] A.-L. Barabasi, R. Albert, and H. Jeong. Scale-free characteristics of random networks: the topology of the world-wide web. Physica A, 281:69–77, 2000.
[5] M. Barthélemy. Betweenness centrality in large complex networks. The European Physical Journal B – Condensed Matter, 38(2):163–168, March 2004.
[6] B. Bollobas and O. Riordan. Robustness and vulnerability of scale-free random graphs. Internet Mathematics, 1(1):1–35, 2003.
[7] S. P. Borgatti. Centrality and network flow. Social Networks, 27:55–71, 2005.
[8] S. P. Borgatti and M. G. Everett. A graph-theoretic perspective on centrality. Social Networks, 28(4):466–484, Oct. 2006.
[9] P. Bork, L. J. Jensen, C. von Mering, A. K. Ramani, I. Lee, and E. M. Marcotte. Protein interaction networks from yeast to human. Curr. Opin. Struct. Biol., 14(3):292–299, Jun. 2004.
[10] U. Brandes. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 25(2):163–177, 2001.
[11] U. Brandes. On variants of shortest-path betweenness centrality and their generic computation. Social Networks, 30(2):136–145, 2008.
[12] G. R. Cantieni, G. Iannaccone, C. Barakat, C. Diot, and P. Thiran. Reformulating the monitor placement problem: optimal network-wide sampling. In CoNEXT ’06: Proceedings of the 2006 ACM CoNEXT conference, pages 1–12, New York, NY, USA, 2006. ACM.
[13] S. Dolev, Y. Elovici, R. Puzis, and P. Zilberman. Incremental deployment of network monitors based on group betweenness centrality. To appear in Information Processing Letters.
[14] M. G. Everett and S. P. Borgatti. The centrality of groups and classes. Journal of Mathematical Sociology, 23(3):181–201, 1999.
[15] M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet topology. SIGCOMM Comput. Comm. Rev., 29(4):251–262, 1999.
[16] L. C. Freeman. A set of measures of centrality based on betweenness. Sociometry, 40(1):35–41, 1977.
[17] L. C. Freeman. Centrality in social networks: Conceptual clarification. Social Networks, 1:215–239, 1979.
[18] L. C. Freeman, S. P. Borgatti, and D. R. White. Centrality in valued graphs: A measure of betweenness based on network flow. Social Networks, 13(2):141–154, Jun. 1991.
[19] K.-I. Goh, B. Kahng, and D. Kim. Universal behavior of load distribution in scale-free networks. Phys. Rev. Lett., 87(27):278701, Dec. 2001.
[20] F. Harary, R. Z. Norman, and D. Cartwright. Structural models. An introduction to the theory of directed graphs. John Wiley and Sons, New York, 1965.
[21] P. Holme. Congestion and centrality in traffic flow on complex networks. Advances in Complex Systems, 6(2):163–176, 2003.
[22] A. W. Jackson, W. Milliken, C. A. Santivanez, M. Condell, and W. T. Strayer. A topological analysis of monitor placement. In NCA 2007: Sixth IEEE International Symposium on Network Computing and Applications, pages 169–178, July 2007.
[23] J. Moy. RFC 2328 - OSPF version 2. http://www.ietf.org/rfc/rfc2328.txt, Apr. 1998.
[24] M. E. J. Newman. Scientific collaboration networks. II. Shortest paths, weighted networks, and centrality. Phys. Rev. E, 64:016132, 2001.
[25] M. E. J. Newman. A measure of betweenness centrality based on random walks. Social Networks, 27(1):39–54, Jan. 2005.
[26] R. Pastor-Satorras and A. Vespignani. Immunization of complex networks. Phys. Rev. E, 65:036104, 2002.
[27] S. Porta, P. Crucitti, and V. Latora. The network analysis of urban streets: a primal approach. Environment and Planning B: Planning and Design, 33(5):705–725, September 2006.
[28] R. Puzis, Y. Elovici, and S. Dolev. Fast algorithm for successive computation of group betweenness centrality. Phys. Rev. E, 76(5):056709, 2007.
[29] R. Puzis, Y. Elovici, and S. Dolev. Finding the most prominent group in complex networks. AI Comm., 20:287–296, 2007.
[30] R. Puzis, D. Yagil, Y. Elovici, and D. Braha. Collaborative attack on internet users’ anonymity. Internet Research, 19(1):60–77.
[31] S. H. Strogatz. Exploring complex networks. Nature, 410:268–276, March 2001.
[32] K. Suh, Y. Guo, J.
Kurose, and D. Towsley. Locating network monitors: Complexity, heuristics, and coverage. Computer Communications, 29:1564–1577, 2006.
[33] S. Wasserman and K. Faust. Social network analysis: Methods and applications. Cambridge University Press, Cambridge, England, 1994.
[34] G. Yan, T. Zhou, B. Hu, Z.-Q. Fu, and B.-H. Wang. Efficient routing on complex networks. Phys. Rev. E, 73:046108, 2006.
[35] S.-H. Yook, H. Jeong, and A.-L. Barabasi. Modeling the internet’s large-scale topology. Proceedings of the National Academy of Sciences, 99(21):13382–13386, Oct. 2002.
[36] T. Zhou, J.-G. Liu, and B.-H. Wang. Notes on the algorithm for calculating betweenness. Chinese Physics Letters, 23:2327–2329, Aug. 2006.

Table 2: Notations

i, j, k, l : Natural numbers (indexes or sizes of sets and sequences).
n : The number of nodes in the network.
m : The maximal number of edges in the routing tree (or the routing DAG for multi-path routing schemes).
r, s, t, u, v, w, x, y, z : Nodes.
G(V, E) : A network with nodes V and edges E.
M, X : Sets of nodes.
S : Sequence of nodes.
ρ_v : Sampling rate of v.
M_ρ, X_ρ, S_ρ : Sets and sequences where the sampling rate of members is specified by ρ.
δ_{s,t}(v), δ̈_{s,t}(M), δ̃_{s,t}(S) : Pairwise dependency of s and t on a node v, a set M, or a sequence S respectively.
δ_{•,t}(v), δ̈_{•,t}(M), δ̃_{•,t}(S) : Target dependency of t on a node v, a set M, or a sequence S respectively.
δ_{•,•}(v), δ̈_{•,•}(M), δ̃_{•,•}(S) : The Betweenness-Centrality of a node v, a set M, or a sequence S respectively.
λ^{X_ρ}_{s,t}(v), λ^{X_ρ}_{•,t}(v) : Same as δ_{s,t}(v) and δ_{•,t}(v) but accounts only for packets not sampled by the monitors in X prior to reaching v.
χ^{X_ρ} : Same as δ but accounts only for packets not sampled by the monitors in X.
p, Pr : Probability.
σ_{s,t}, σ_{s,t}(v) : Number of shortest paths between s and t and the number of them that pass through v (used for computing SPBC).
φ_{s,t}, φ_{s,t}(v) : Maximal flow between s and t and flow transferred by v respectively (used for computing FBC).

APPENDIX A.
Notations

Next we summarize the symbols and the notation principles used in this paper (see Table 2). The uppercase Latin letter G represents a network. Other uppercase Latin letters are used to represent sets and sequences. For example, V is the set of nodes in G, M ⊂ V is an arbitrary set of nodes in G, and S is an arbitrary sequence of nodes. E denotes the set of edges in G. The lowercase Latin letter n denotes the number of nodes in the network. The letter m denotes the number of edges in a routing DAG rooted at some target vertex. When m is used to specify the complexity of an algorithm, it represents the maximal number of edges in any routing DAG. Natural numbers and indexes of arrays and sequences are denoted by i, j, k, or l. The letters r, s, t, u, v, w, x, y, and z represent nodes. p and Pr are used to represent a probability. R(s, u, v, t) is a quaternary function that encodes the routing scheme in the network. Greek letters represent an influence on the traffic. ρ is used to denote the sampling rate of monitors. In previous subsections we defined several pairwise dependencies of a pair of source-target nodes on another node (δ_{s,t}(v)), a set of nodes (δ̈_{s,t}(M)), or a sequence of nodes (δ̃_{s,t}(S)). In general, the double-dot accent ¨ is added to functions related to the RBC of sets and the tilde accent ˜ is added to functions related to the RBC of sequences. λ, which was defined in Section 4.3, and χ, which is defined in Section 5.3, also represent pairwise dependencies like δ. However, λ refers to the probability of a packet not being sampled prior to reaching the argument node or nodes, and χ refers to the probability of a packet not being sampled at all. We use a bullet (•) with δ, λ, or χ instead of the subscript indexes to indicate that the pairwise dependency is summed over all sources and/or targets. RBC is denoted using bullets instead of both subscript indexes s and t (δ_{•,•}(v)).
Target dependency will be denoted in the next section using a bullet instead of the first subscript index (δ•,t (v)).
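To make the bullet notation concrete, the following minimal sketch computes the pairwise dependency, the target dependency, and the betweenness of single nodes for the shortest-path special case, where δ_{s,t}(v) = σ_{s,t}(v)/σ_{s,t}. It is an illustration only and not one of the algorithms of this paper: it assumes an undirected, unweighted graph, a uniform traffic matrix with one packet per ordered pair s ≠ t, and that the endpoints s and t count as lying on the path. The function names and the adjacency-dictionary representation are hypothetical.

```python
from collections import deque
from itertools import permutations


def shortest_path_counts(adj, source):
    """BFS from source; sigma[v] is the number of shortest source-v paths."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:            # first time w is reached
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:   # u precedes w on a shortest path
                sigma[w] += sigma[u]
    return dist, sigma


def pairwise_dependency(adj):
    """delta[(s, t)][v] = sigma_{s,t}(v) / sigma_{s,t} (endpoints included).

    Assumption: the graph is undirected, so distances and path counts
    measured from t equal those measured towards t.
    """
    info = {s: shortest_path_counts(adj, s) for s in adj}
    delta = {}
    for s, t in permutations(adj, 2):
        dist_s, sigma_s = info[s]
        dist_t, sigma_t = info[t]
        if t not in dist_s:
            continue                     # t is unreachable from s
        delta[(s, t)] = {
            v: sigma_s[v] * sigma_t[v] / sigma_s[t]
            for v in adj
            if v in dist_s and v in dist_t
            and dist_s[v] + dist_t[v] == dist_s[t]
        }
    return delta


def target_dependency(delta, t):
    """delta_{.,t}(v): pairwise dependency summed over all sources s."""
    total = {}
    for (_, target), per_node in delta.items():
        if target == t:
            for v, value in per_node.items():
                total[v] = total.get(v, 0.0) + value
    return total


def betweenness(delta):
    """delta_{.,.}(v): pairwise dependency summed over all pairs (s, t)."""
    total = {}
    for per_node in delta.values():
        for v, value in per_node.items():
            total[v] = total.get(v, 0.0) + value
    return total


# Path graph a - b - c: every packet passes through b or terminates there.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
delta = pairwise_dependency(path)
print(target_dependency(delta, "c"))  # {'a': 1.0, 'b': 2.0, 'c': 2.0}
print(betweenness(delta))             # {'a': 4.0, 'b': 6.0, 'c': 4.0}
```

Replacing the uniform traffic assumption with an arbitrary traffic matrix would amount to weighting each delta[(s, t)] by the number of packets sent from s to t before summing.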