Communication-Efficient Implementation of Join in Sensor Networks

Himanshu Gupta

2005, Database Systems for Advanced Applications

A sensor network is a wireless ad hoc network of resource-constrained sensor nodes. In this article, we address the problem of communication-efficient implementation of the SQL "join" operator in sensor networks. We design an optimal join-implementation algorithm that provably incurs minimum communication cost under certain reasonable assumptions. In addition, we design a much faster suboptimal heuristic that empirically delivers a near-optimal solution. We evaluate the performance of our designed algorithms through extensive simulations.

Vishal Chowdhary and Himanshu Gupta
SUNY, Stony Brook, NY 11754. {vishal,hgupta}@cs.sunysb.edu

1 Introduction

A sensor network consists of sensor nodes with a short-range radio and on-board processing capability forming a multi-hop network of an irregular topology. Each sensor node can sense certain physical phenomena like light, temperature, or vibration. There are many exciting applications [3, 13, 14] of such sensor networks, including monitoring and surveillance systems in both military and civilian contexts, and building smart environments and infrastructures such as intelligent transportation systems and smart homes.

In a sensor network, sensor nodes generate data items that are simply readings of one or more sensing devices on the node. Thus, a sensor network can be viewed as a distributed database system where each sensor node generates a stream of data tuples. Appropriately enough, the term sensor database is increasingly being used in research literature. Like a database, the sensor network is queried to gather and/or process the sensed data tuples. Database queries in SQL are a very general representation of queries over data, and efficient implementation of SQL queries is of great significance because of the enormous amount of data present in a typical sensor network.
Since sensor nodes have limited battery energy, the distributed implementation of SQL queries in sensor networks must minimize the communication cost incurred, which is the main consumer of battery energy [31]. In this article, we address how to efficiently execute database queries in a sensor network, when the data distributed across sensors in a sensor network is viewed as relational database tables. In particular, we address communication-efficient in-network processing of the join operator, which is essentially a Cartesian product of the operand tables followed by a predicate selection. We design an optimal algorithm for a join operation that provably incurs minimum communication cost in dense sensor networks under some reasonable assumptions of communication cost and computation model. We also design a much faster suboptimal heuristic that empirically performs very close to the optimal algorithm, and results in significant savings over the naive approaches.

The rest of the paper is organized as follows. We start with modeling the sensor network as a database in Section 2. In Section 3, we present various algorithms for in-network implementation of the join operator, along with certain generalizations. We present our experiment results in Section 4. Related work is discussed in Section 5, and concluding remarks are presented in Section 6.

2 Sensor Network Databases

A sensor network consists of a large number of sensors distributed randomly in a geographical region. Each sensor has limited on-board processing capability and is equipped with sensing devices. A sensor node also has a radio which is used to communicate directly with some of the sensors around it. Two sensor nodes S1 and S2 can directly communicate with each other if and only if the distance between them is less than the transmission radius. Sensor nodes may indirectly communicate with each other through other intermediate nodes, thus forming a multi-hop network.
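The connectivity rule above can be made concrete with a small sketch. It assumes that nodes are points in the plane and that all nodes share one transmission radius; the function name `hop_count` and the brute-force neighbor scan are illustrative choices, not part of the algorithms in this article.

```python
import math
from collections import deque


def hop_count(nodes, radius, src, dst):
    """Return the number of hops between nodes[src] and nodes[dst].

    Two nodes are neighbors iff their Euclidean distance is strictly
    less than `radius`. Returns None when dst is unreachable.
    """
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        cur, hops = queue.popleft()
        if cur == dst:
            return hops
        for nxt in range(len(nodes)):
            # Brute-force neighbor test; fine for a small illustration.
            if nxt not in seen and math.dist(nodes[cur], nodes[nxt]) < radius:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None


# Hypothetical layout: three nodes on a line plus one isolated node.
layout = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (5.0, 5.0)]
print(hop_count(layout, 1.5, 0, 2))
print(hop_count(layout, 1.5, 0, 3))
```

In this layout the first and third nodes are out of direct range of each other and communicate through the middle node, while the fourth node is disconnected.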
We assume that each sensor node in the sensor network has a limited storage capacity of m units. Also, sensors have limited battery energy, which must be conserved for prolonged unattended operation. Thus, we focus on minimization of communication cost (hence, energy cost) as the key performance criterion of the join implementation strategies.

2.1 Modeling the Sensor Network as a Database

In a sensor network, the data generated by the sensor nodes is simply the readings of one or more sensing devices on the node. Thus, the data present in a sensor network can be modeled as relational database tables, wherein each sensor produces data records/tuples of a certain format and semantics. In some sense, a relational database table is a collection of similar-typed tuples from a group of sensors in the network. Due to the spatial and real-time nature of the data generated, a tuple usually has timeStamp and nodeLocation as attributes. In a sensor network, relational database tables are typically stream database tables [2] partitioned horizontally across (or generated by) a set of sensors in the network.

In-network Implementation. A plausible implementation of a sensor network database query engine could be to have an external database system handle all the queries over the network. In such a realization, all the data from each sensor node in the network is sent to the external system that handles the execution of queries completely. Such an implementation would incur very high communication costs and congestion-related bottlenecks. Thus, prior research has proposed query engines that would execute the queries within the network with little external help. In particular, [18] shows that in-network implementation of database queries is fundamental to achieving energy-efficient communication in sensor networks. The focus of this article is communication-efficient in-network implementation of the join operator.
As selection and projection are unary operators and operate on each tuple independently, they could be efficiently implemented using efficient routing and topology construction techniques. The union operation can be reduced to duplicate elimination, and the difference and intersection operations can be reduced to the join operation. Implementation of other database operators (aggregation, duplicate elimination, and outerjoins) is challenging and is the focus of our future work.

Querying and Cost Model. A query in a sensor network is initiated at a node called the query source, and the result of the query is required to be routed back to the query source for storage and/or consumption. A stream database table may be generated by a set of sensor nodes in a closed geographical region. The optimization algorithms proposed in this article, which determine how to implement the join operation efficiently, are run at the query source. As typical sensor network queries are long running, the query source can gather all the catalogue information needed (estimated sizes and locations of the operand relations, join selectivity factor to estimate the size of the join result, density of the network) by initially sampling the operand tables. As mentioned before, we concentrate on implementations that minimize communication cost. We define the total communication cost incurred as the total data transfer between neighboring sensor nodes.

3 In-network Implementation of Join

In this section, we develop communication-efficient algorithms for implementation of a join operation in sensor networks. We start by assuming that the operand tables are static (non-streaming). Later in the section, we describe how our algorithms can be generalized for stream database tables, as data in a sensor network is better represented as stream database tables.
The SQL join operator is used to correlate data from multiple tables, and can be defined as a selection predicate over the cross-product of a pair of tables; a join of tables R and S is denoted as R ⋈ S. Consider a join operation, initiated by a query source node Q, involving two static (non-streaming) tables R and S distributed horizontally across some geographical regions R and S in the network. We assume that the geographic regions are disjoint and small relative to the distances between the query source and the operand table regions (see [10] for a discussion on relaxation of this assumption). If we do not make any assumptions about the join predicates involved, each data tuple of table R should be paired with every tuple of S and checked for the join condition. The joined tuple is then routed (if it passes the join selection condition) to the query source Q where all the tuples are accumulated or consumed. Given that each sensor node has limited memory resources, we need to find appropriate regions in the network that would take the responsibility of computing the join. In particular, we may need to store and process the relations at some intermediate location before routing the result to the query source.

A simple nested-loop implementation of a join used in traditional databases is to generate the cross product (all pairs of tuples), and then extract those pairs that satisfy the selection predicate of the join. More involved implementations of a join operator widely used in database systems are merge-sort and hash-join. These classical methods are unsuitable for direct implementation in sensor networks due to the limited memory resources at each node in the network. Moreover, the traditional join algorithms focus on minimizing computation cost, while in sensor networks the primary performance criterion is communication cost. Below, we discuss various techniques for efficient implementation of the join operation in sensor networks.

Naive Approach.
A simple way to compute R ⋈ S could be to route the tuples of S from their original location S to the region R, broadcast the S-tuples in the region R, compute the join within the region R, and then route the joined tuples to the query source Q. We refer to this approach as the Naive approach. Note that the roles of the tables R and S can be interchanged in the above approach.

Centroid Approach. The Centroid approach computes the join operation in a circular region around some point C in the sensor network. In particular, let Pc be the smallest circular region around C such that the region Pc has at least |R|/m sensor nodes to store the table R. First, we route both operand tables to C. Second, we distribute R and broadcast S in the region Pc around C. Lastly, we compute the join operation, and route the resulting tuples of (R ⋈ S) to the query source Q. Since the communication cost incurred in the second step is independent of the choice of C, it is easy to see that the communication cost incurred in the above approach is minimized when the point C is the weighted centroid of the triangle formed by R, S, and Q. Here, the choice of the centroid point C is weighted by the sizes of R, S, and (R ⋈ S).

3.1 Optimal Join Algorithm

In this section, we present an algorithm that constructs an optimal region for computing the join operation using minimum communication cost. We assume that the sensor network is sufficiently dense that we can find a sensor node at any point in the region. To formally prove the claim of optimality, we need to restrict ourselves to a class of join algorithms called Distribute-Broadcast Join Algorithms (defined below). In effect, our claim of optimality states that the proposed join algorithm incurs less communication cost than any distribute-broadcast join algorithm.

Definition 1.
A join algorithm to compute R ⋈ S in a sensor network is a distribute-broadcast join algorithm if the join is processed by first distributing the table R in some region P (other than the region R storing R; see footnote 1) of the sensor network, followed by broadcasting the relation S within the region P to compute the join. The joined tuples are then routed from each sensor in the region P to the query source.

[Fig. 1. Possible Shape of an Optimal Join-Region. Panels (a) and (b) show the query source Q, Table R, Table S, the points Cr, Cs, Cq (and Cq2), the paths Pr and Ps, and the join-region P; panel (a) also shows the circular region PO.]

As before, consider a query source Q and regions R and S that store the static operand tables R and S in a sensor network. The key challenge in designing an optimal algorithm for implementation of a join operation is to select a region P for processing the join in such a way that the total communication cost is minimized. We use the term join-region to refer to a region in the sensor network that is responsible for computing the join.

Shape of an Optimal Join-Region. Theorem 1 (see [10] for proof) shows that the join-region P that incurs minimum communication cost has a shape as shown in Figure 1 (a) or (b). In particular, the optimal join-region P is formed using three points Cr, Cs, and Cq in the sensor network (typically these points will lie within the △RSQ). More precisely, given three points Cr, Cs, and Cq in the sensor network, the region P takes one of the following forms:

1. Region P is formed of the paths Pr = (Cr, Cq) and Ps = (Cs, Cq), the line segment CqQ, and a circular region PO of appropriate radius around Q. See Figure 1 (a).
2. Region P is formed of the paths Pr = (Cr, Cq) and Ps = (Cs, Cq), and a part of the line segment CqQ. See Figure 1 (b).

The total number of sensors in the region P is l = |R|/m, where |R| is the size of the table R which will be distributed over the region P, and m is the memory size of each sensor node.

Theorem 1.
The shape of the join-region P used by a distribute-broadcast join algorithm that incurs optimal communication cost is as described above or as depicted in Figure 1 (a) or (b).

Footnote 1. Else, the algorithm will be identical to one of the Naive Approaches.

Theorem 1 restricts the shape of an optimal join-region. However, there are still an infinite number of possible join-regions of shapes depicted in Figure 1. Below, we further restrict the shape of an optimal join-region by characterizing the equations of the paths Pr and Ps, which connect Cr and Cs respectively to Cq. We start with a definition.

Definition 2. The sensor length between a region X and a point y in a sensor network plane is denoted as d(X, y) and is defined as the average weighted distance, in terms of number of hops/sensors, between the region X and the point y. Here, the distance between a point x ∈ X and y is weighted by the amount of data residing at x.

Optimizing Paths Pr and Ps. Consider an optimal join-region P that implements a join operation using minimum communication cost. From Theorem 1, we know that the region P is of the shape depicted in Figure 1 (a) or (b). The total communication cost T incurred in processing of a join using the region P is

T = |R|d(R, Cr) + |S|d(S, Cs) + |R ⋈ S|d(P, Q) + |R||P|/2 + |S||P|,

where the first two terms represent the cost of routing R and S to Cr and Cs respectively, the third term represents the cost of routing the result R ⋈ S from P to Q, and the last two terms represent the cost of distributing R and broadcasting S in the region P. Here, we assume that lack of global knowledge about the other sensors' locations and available memory capacities precludes the possibility of distributing or broadcasting more efficiently than doing it in a simple linear manner. Now, the only component of cost T that depends on the shape of P is |R ⋈ S|d(P, Q). Let P′ = P − Pr − Ps, i.e., the region P without the paths Pr and Ps.
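As a rough numeric illustration of the cost expression T, the following sketch evaluates its five terms. The function and parameter names are hypothetical, and the sensor lengths d(R, Cr), d(S, Cs), and d(P, Q) are assumed to be supplied as plain numbers rather than computed from a network.

```python
def join_cost(size_r, size_s, size_join, d_r, d_s, d_pq, region_nodes):
    """Evaluate T for a join-region P containing `region_nodes` sensors.

    size_r, size_s : |R| and |S|
    size_join      : size of the join result
    d_r, d_s, d_pq : sensor lengths d(R, Cr), d(S, Cs), d(P, Q)
    """
    route_r = size_r * d_r                    # route R to Cr
    route_s = size_s * d_s                    # route S to Cs
    route_result = size_join * d_pq           # route the result from P to Q
    distribute_r = size_r * region_nodes / 2  # distribute R over P
    broadcast_s = size_s * region_nodes       # broadcast S in P
    return route_r + route_s + route_result + distribute_r + broadcast_s


# Hypothetical inputs: |R| = |S| = 600, a 360-tuple result, |P| = 2.
print(join_cost(600, 600, 360, d_r=10, d_s=12, d_pq=8, region_nodes=2))
```

Only the third term depends on the shape of P, which is why the derivation concentrates on d(P, Q).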
Since the result R ⋈ S is evenly spread along the entire region P, we have

$$d(P, Q) = \frac{1}{|P|}\left(|P'|\,d(P', Q) + |P_r|\,d(P_r, Q) + |P_s|\,d(P_s, Q)\right),$$

where the notation |B| for a region B denotes the number of sensor nodes in the region B. For a given set of points Cr, Cs, and Cq, the total communication cost T is minimized when the path Pr is constructed such that |Pr|d(Pr, Q) is minimized. Otherwise, we could reconstruct Pr with a smaller |Pr|d(Pr, Q), and remove/add sensor nodes from the end of the region P′ (see footnote 2) to maintain |P| = |R|/m. Removal of sensor nodes from P′ will always reduce T, and it can be shown that addition of sensor nodes to the end of the region P′ will not increase the cost more than the reduction achieved by optimizing Pr. Similarly, the path Ps could be optimized independently.

Footnote 2. Here, by the end of the region P′, we mean either the circular part PO or the line segment CqCq2 depending on the shape.

We now derive the equation of the path Pr that minimizes |Pr|d(Pr, Q) for a given Cr and Cq. Consider an arbitrary point R(x, y) along the optimal path Pr. The length of an infinitesimally small segment of the path Pr beginning at R(x, y) is $\sqrt{(dx)^2 + (dy)^2}$, and the average distance of this segment from Q is $\sqrt{x^2 + y^2}$, if the coordinates of Q are (0, 0). The sum of all these distances over the path Pr is

$$F = \int_{x_1}^{x_2} \sqrt{x^2 + y^2}\,\sqrt{1 + (y')^2}\,dx.$$

To get the equation for the path Pr, we would need to determine the extremals of the above function F. Using the technique of calculus of variations [15], we can show that the extremal values of F satisfy the Euler-Lagrange differential equation. The equation of the path Pr can thus be computed as (we omit the details):

$$\beta = x^2 \cos\alpha + 2xy \sin\alpha - y^2 \cos\alpha,$$

where the constants α and β are evaluated by substituting the coordinates of Cr and Cq in the equation.

Optimal Join Algorithm.
Given points Cr, Cs, Cq, and Q, let Pr and Ps be the optimized paths connecting Cr and Cs to Cq respectively as described above. For a given triplet of points (Cr, Cs, Cq), the optimal join-region P is as follows. Let l = |R|/m and lY = |Pr| + |Ps| + |CqQ|.

- When lY < l, P = Pr ∪ Ps ∪ CqQ ∪ PO, where PO is a circular region around Q such that |PO| = l − (|CqQ| + |Pr| + |Ps|). See Figure 1 (a).
- When lY ≥ l, P = Pr ∪ Ps ∪ CqCq2, where Cq2 is such that |CqCq2| = l − (|Pr| + |Ps|). See Figure 1 (b).

Now, we can construct an optimal join-region to compute a join operation for tables R and S and the query source Q, by considering all possible triples of points Cr, Cs, and Cq in the sensor network, and picking the triplet (Cr, Cs, Cq) that results in a join-region P (as described above) with minimum communication cost. The time complexity of the above algorithm is O(n^3), where n is the total number of sensor nodes in the sensor network.

Suboptimal Heuristic. The high time complexity of the optimal algorithm described above makes it impractical for large sensor networks. Thus, we propose a suboptimal heuristic that runs in O(n^{3/2}) time, and incidentally performs very well in practice. Essentially, for a given Cr, we stipulate that Cs should be symmetrically (|R|d(R, Cr) = |S|d(S, Cs)) located in the △RQS. In addition, we approximate paths Pr and Ps to be straight line segments, and choose the point Cq on the median of the △CrCsQ. See Figure 2. Thus, for each candidate point Cr in the sensor network, we determine Cs and search for the best Cq on the median of △CrCsQ.

[Fig. 2. Heuristic. The figure shows Q, Table R, Table S, the points Cr, Cs, Cq, and M, and the condition |R|d(R, Cr) = |S|d(S, Cs).]

3.2 Join Implementation for Stream Database Tables

In the previous subsection, we discussed implementation of the join operation in a sensor network for static database tables.
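The two cases used in Section 3.1 to build the join-region for a given triplet (Cr, Cs, Cq) can be sketched as follows. This is only an illustration: the function name and the returned dictionary are hypothetical, path lengths are assumed to be given in numbers of sensors, and rounding l = |R|/m up to a whole number of nodes is an assumption of this sketch.

```python
import math


def join_region_parts(size_r, m, len_pr, len_ps, len_cq_q):
    """Split the l = |R|/m sensors of the join-region into its parts.

    len_pr, len_ps : |Pr| and |Ps|, in sensors
    len_cq_q       : |CqQ|, in sensors
    """
    l = math.ceil(size_r / m)  # assumption: round up to whole nodes
    l_y = len_pr + len_ps + len_cq_q
    if l_y < l:
        # Figure 1 (a): the whole segment CqQ is used and the remaining
        # sensors form the circular region PO around Q.
        return {"shape": "a", "segment": len_cq_q, "disc": l - l_y}
    # Figure 1 (b): only the part CqCq2 of the segment is used.
    # The case |Pr| + |Ps| > l is not guarded against here.
    return {"shape": "b", "segment": l - (len_pr + len_ps), "disc": 0}


# Hypothetical inputs: |R| = 8000 tuples, m = 300 tuples per node.
print(join_region_parts(8000, 300, len_pr=5, len_ps=6, len_cq_q=4))
```

An exhaustive search in the style of the optimal algorithm would call such a routine once per candidate triplet and keep the triplet with the smallest total cost.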
Since sensor network data is better represented as stream database tables, we now generalize the algorithms to handle stream database tables. First, we present our model of stream database tables in sensor networks.

Data Streams in Sensor Networks. As for the case of static tables, a stream database table R corresponding to a data stream in a sensor network is associated with a region R, where each node in R is continually generating tuples for the table R. To deal with the unbounded size of stream database tables, the tables are usually restricted to a finite set of tuples called the sliding window [1, 12, 27]. In effect, we expire or archive tuples from the data stream based on some criteria so that the total number of stored tuples does not exceed the bounded window size. We use WR to denote the sliding window for a stream database table R.

Naive Approach for Stream Tables. In the Naive Approach, we use the region R (or S) to store the windows WR and WS of the stream tables R and S (see footnote 3). Each sensor node in the region R uses a |WR|/(|WR| + |WS|) fraction of its local memory to store tuples of WR, and the remaining fraction of the memory to store tuples of WS. To perform the join operation, each newly generated tuple (of R or S) is broadcast to all the nodes in the region R, and is also stored in some node of R with available memory. Note that the generated data tuples of S need to be first routed from the region S to the region R. The resulting joined tuples are routed from R to the query source Q.

Generalizing Other Approaches. The other approaches, viz. Centroid Approach, Optimal Algorithm, and Suboptimal Heuristic, use a join-region that is separate from the regions R and S. These algorithms are generalized to handle stream database tables as follows. First, the strategy to choose the join-region P remains the same as before for static tables, except for the size of the join-region.
For stream database tables, the chosen join-region is used to store WR as well as WS, with each sensor node in the join-region using a |WR|/(|WR| + |WS|) fraction of its memory to store tuples of WR, and the rest to store tuples of WS. Each newly generated tuple (of R or S) is routed from its source node in R or S to the join-region P, and broadcast to all the nodes in P. The resulting joined tuples are then routed to Q. As part of the broadcast process (without incurring any additional communication cost), each generated tuple of R (or S) is also stored at some node in P with available memory.

Footnote 3. If the total memory of the nodes in R is not sufficient to store WR and WS, then the region R is expanded to include more sensor nodes.

4 Performance Evaluation

In this section, we compare the performance of Naive Approach, Centroid Algorithm, Optimal Algorithm, and Suboptimal Heuristic. In our previous discussion, we have assumed dense sensor networks where we can find a sensor node at any desirable point in the region. On real sensor networks, we use our proposed algorithms in conjunction with the trajectory based forwarding (TBF) routing technique [28], which works by forwarding packets to nodes closest to the intended path/trajectory. More specifically, to form the Pr, Ps, and CqQ (or CqCq2) parts of the join-region, we use nodes that are closest to uniformly spaced points on the geometrically constructed paths. In addition, each algorithm is generalized for stream database tables as discussed in Section 3.2. We refer to the generalized algorithms as Naive, Centroid, OptBased, and Suboptimal Heuristic respectively.

[Fig. 3. Total communication cost for various transmission radii (t), and fixed △RSQ. Panels: (a) t = 0.13 units, (b) t = 0.15 units, (c) t = 0.18 units. Axes: Join Selectivity Factor vs. Total Communication Cost (x 10^3). Curves: Naive, Centroid, Suboptimal Heuristic, OptBased.]

Definition 3. Given instances of relations R and S and a join predicate, the join-selectivity factor (f) is the probability that a random pair of tuples from R and S will satisfy the given join predicate. In other words, the join-selectivity factor is the ratio of the size of R ⋈ S to the size of the Cartesian product, i.e., f = |R ⋈ S|/(|R||S|).

Parameter Values and Experiments. We generated random sensor networks by randomly placing 10,000 sensors with uniform transmission radius (t) in an area of 10 × 10 units. For the purposes of comparing the performance of our algorithms, varying the number of sensors is tantamount to varying the transmission radius. Thus, we fix the number of sensors to be 10,000 and measure performance for different transmission radii. Memory size of a sensor node is 300 tuples, and the size of each of the sliding windows WR and WS of stream tables R and S is 8,000 tuples. For simplicity, we chose uniform data generation rates for R and S streams. In each of the experiments, we measure communication cost incurred in processing 8,000 newly generated tuples of R and S each, after the join-region is already filled with previously generated tuples.
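To illustrate Definition 3, the sketch below measures the join-selectivity factor of two small in-memory tables by brute force and converts a given factor into an expected result size. The function names and the toy tables are hypothetical; the experiments in this section estimate f from samples gathered at the query source rather than by enumerating all pairs.

```python
def join_selectivity(r, s, predicate):
    """Return f = |R join S| / (|R||S|) by checking every pair of tuples."""
    matches = sum(1 for a in r for b in s if predicate(a, b))
    return matches / (len(r) * len(s))


def expected_join_size(f, size_r, size_s):
    """Expected number of joined tuples for selectivity factor f."""
    return f * size_r * size_s


# Toy equi-join on integer keys (hypothetical data).
table_r = [1, 2, 3, 4]
table_s = [2, 4, 6]
print(join_selectivity(table_r, table_s, lambda a, b: a == b))

# Two full 8,000-tuple windows joined at f = 10^-4.
print(expected_join_size(1e-4, 8000, 8000))
```

With windows of 8,000 tuples each, f = 10^-4 corresponds to 6,400 joined tuples, which gives a sense of how much result traffic the lowest selectivity setting generates.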
We use the GPSR [19] algorithm to route tuples. Catalogue information is gathered for non-Naive approaches by collecting a small sample of data streams at the query source. In the first set of experiments, we consider a fixed △RSQ and calculate the total communication cost for various transmission radii and join-selectivity factors. Next, we fix the transmission radius and calculate the total communication cost for various join-selectivity factors and various shapes/sizes of the △RSQ.

[Fig. 5. Total communication cost for various △RSQ. Here, t = 0.15. Panels: (a) f = 10^-4, (b) f = 5 * 10^-3, (c) f = 10^-2. Axes: Area of Triangle QRS vs. Total Communication Cost (x 10^3). Curves: Naive, Centroid, Suboptimal Heuristic, OptBased.]

Fixed Triangle RSQ. In this set of experiments (Figure 3), we fix the locations of regions R, S, and query source Q and measure the performance of our algorithms for various values of transmission radii and join-selectivity factors. In particular, we choose coordinates (0,0), (5,9.5), and (9.5,0) for R, Q, and S respectively. We have looked at three transmission radii, viz. 0.13, 0.15, and 0.18 units. Lower transmission radii left the sensor network disconnected, and the trend observed for these three transmission radii values is sufficient to infer behavior for larger transmission radii (see Figure 4). From Figure 3 (a)-(c), we can see that the Suboptimal Heuristic performs very close to the OptBased Algorithm, and significantly outperforms (up to 100%) the Naive and Centroid Approaches for most parameter values.
The performance of the Naive approach worsens drastically with the increase in the join-selectivity factor, since the routing cost of the joined tuples from the join region (R or S) to the query source Q becomes more dominant.

[Fig. 4. Here, f = 0.05. Axes: Transmission Radius vs. Total Communication Cost (x 10^3). Curves: Naive, Centroid, Suboptimal Heuristic, OptBased.]

Fixed Transmission Radius (0.15 units). We also observe the performance of various algorithms for different sizes and shapes of △RSQ. In particular, we fix the transmission radius of each sensor node in the network to be 0.15 units, and generate various △RSQ's as follows. We fix locations of regions R and S, and select many locations of the query source Q with the constraint that the area of the △RSQ is between 10% and 50% of the total sensor network area. For each such generated △RSQ, we run all the four algorithms for three representative join-selectivity factor values, viz. 10^-4, 5 * 10^-3, and 10^-2. See Figure 5. Again we observe that the Suboptimal Heuristic performs very close to the OptBased Algorithm, and incurs much less communication cost than the Naive and Centroid Approaches for all join-selectivity factor values.

Summary. From the above experiments, we observe that the Suboptimal Heuristic performs very close to the OptBased Algorithm, but performs substantially better than the Centroid and Naive Approaches for a wide range of sensor network parameters. The savings in communication cost reduce with the increase in join-selectivity factor and/or transmission radius. We expect the join-selectivity factor to be relatively low in large sensor networks because of large sizes of operand tables and data generated having only local spatial and temporal data correlations. Moreover, since sensor nodes have the capability to adjust transmission power, effective topology control [30, 32] is used to minimize transmission radius at each node to conserve overall energy.
Thus, the Suboptimal Heuristic is a natural choice for efficient implementation of join in sensor networks, and should result in substantial energy savings in practice.

5 Related Work

The vision of a sensor network as a database has been proposed by many works [5, 16, 26], and simple query engines such as TinyDB [26] have been built for sensor networks. In particular, the COUGAR project [5, 33, 34] at Cornell University is one of the first attempts to model a sensor network as a database system. The TinyDB Project [26] at Berkeley also investigates query processing techniques for sensor networks. However, TinyDB implements very limited functionality [25] of the traditional database language SQL. A plausible implementation of an SQL query engine for sensor networks could be to ship all sensor nodes' data to an external server that handles the execution of queries completely [21]. Such an implementation would incur high communication costs and congestion-related bottlenecks. In particular, [18] shows that in-network implementation of database queries is fundamental to conserving energy in sensor networks. Thus, recent research has focused on in-network implementation of database queries. However, prior research has only addressed limited SQL functionality: single queries involving simple aggregations [22, 24, 34] and/or selections [25] over single tables [23], or local joins [34]. So far, it has been considered that correlations such as median computation or joins should be computed on a single node [4, 25, 34]. In particular, [4] addresses the problem of operator placement for in-network query processing, assuming that each operator is executed locally and fully on a single sensor node. The problem of distributed and communication-efficient implementation of join has not been addressed yet in the context of sensor networks.

In addition, there has been a large body of work done on efficient query processing in data stream processing systems [6, 8, 9, 27].
In particular, [11] approximates sliding window joins over data streams, and [17] designs join algorithms for joining multiple data streams constrained by a sliding time window. However, a data stream processing system is not necessarily distributed, and hence minimizing communication cost is not the focus of that research. There has been a lot of work on query processing in distributed database systems [7, 20, 29], but sensor networks differ significantly from distributed database systems because of their multi-hop communication cost model and resource limitations.

6 Conclusions

Sensor networks are capable of generating large amounts of data. Hence, efficient query processing in sensor networks is of great importance. Since sensor nodes have limited battery power and memory resources, designing communication-efficient distributed implementations of database queries is a key research challenge. In this article, we have focussed on implementation of the join operator, which is one of the core operators of database query languages. In particular, we have designed an Optimal Algorithm that incurs minimum communication cost for implementation of join in sensor networks under certain reasonable assumptions. Moreover, we reduced the time complexity of the Optimal Algorithm to design a Suboptimal Heuristic, and showed through extensive simulations that the Suboptimal Heuristic performs very close to the Optimal Algorithm. Techniques developed in this article are shown to result in substantial energy savings over simpler approaches for a wide range of sensor network parameters.

References

1. D. J. Abadi, D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and S. Zdonik. Aurora: a new model and architecture for data stream management. The VLDB Journal, 12(2):120–139, 2003.
2. B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In Proceedings of the ACM Symposium on Principles of Database Systems (PODS), 2002.
3. B. Badrinath, M. Srivastava, K. Mills, J. Scholtz, and K. Sollins, editors. Special Issue on Smart Spaces and Environments, IEEE Personal Communications, 2000.
4. B. Bonfils and P. Bonnet. Adaptive and decentralized operator placement for in-network query processing. In Proceedings of the International Workshop on Information Processing in Sensor Networks (IPSN), 2003.
5. P. Bonnet, J. Gehrke, and P. Seshadri. Towards sensor database systems. In Proceedings of the International Conference on Mobile Data Management, 2001.
6. D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. Zdonik. Monitoring streams - A new class of data management applications. In Proceedings of the International Conference on Very Large Data Bases (VLDB), 2002.
7. S. Ceri and G. Pelagatti. Distributed Database Design: Principles and Systems. McGraw-Hill (New York NY), 1984.
8. S. Chandrasekaran, O. Cooper, A. Deshpande, M. J. Franklin, J. M. Hellerstein, W. Hong, S. Krishnamurthy, S. R. Madden, F. Reiss, and M. A. Shah. TelegraphCQ: Continuous dataflow processing. In Proceedings of the ACM SIGMOD Conference on Management of Data, 2003.
9. J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: a scalable continuous query system for internet databases. In Proceedings of the ACM SIGMOD Conference on Management of Data, 2000.
10. V. Chowdhary and H. Gupta. Communication-efficient implementation of join in sensor networks. Technical report, SUNY, Stony Brook, Computer Science Department, 2004.
11. A. Das, J. Gehrke, and M. Riedewald. Approximate join processing over data streams. In Proceedings of the ACM SIGMOD Conference on Management of Data, 2003.
12. L. Ding, N. Mehta, E. Rundensteiner, and G. Heineman. Joining punctuated streams. In Proceedings of the International Conference on Extending Database Technology, 2004.
13. D. Estrin, R. Govindan, and J. Heidemann, editors. Special Issue on Embedding the Internet, Communications of the ACM, volume 43, 2000.
14. D. Estrin, R. Govindan, J. S. Heidemann, and S. Kumar. Next century challenges: Scalable coordination in sensor networks. In Proceedings of the International Conference on Mobile Computing and Networking (MobiCom), 1999.
15. I. Gelfand and S. Fomin. Calculus of Variations. Dover Publications, 2000.
16. R. Govindan, J. Hellerstein, W. Hong, S. Madden, M. Franklin, and S. Shenker. The sensor network as a database. Technical report, University of Southern California, Computer Science Department, 2002.
17. M. Hammad, W. Aref, A. Catlin, M. Elfeky, and A. Elmagarmid. A stream database server for sensor applications. Technical report, Purdue University, Department of Computer Science, 2002.
18. J. S. Heidemann, F. Silva, C. Intanagonwiwat, R. Govindan, D. Estrin, and D. Ganesan. Building efficient wireless sensor networks with low-level naming. In Symposium on Operating Systems Principles, 2001.
19. B. Karp and H. Kung. GPSR: greedy perimeter stateless routing for wireless networks. In Proceedings of the International Conference on Mobile Computing and Networking (MobiCom), 2000.
20. D. Kossmann. The state of the art in distributed query processing. ACM Computing Surveys, 32(4), 2000.
21. S. Madden and M. Franklin. Fjording the stream: An architecture for queries over streaming sensor data. In Proceedings of the International Conference on Data Engineering (ICDE), 2002.
22. S. Madden, M. Franklin, J. Hellerstein, and W. Hong. TAG: A tiny aggregation service for ad-hoc sensor networks. In Proceedings of the Symposium on Operating Systems Design and Implementation (OSDI), 2002.
23. S. Madden and J. M. Hellerstein. Distributing queries over low-power wireless sensor networks. In Proceedings of the ACM SIGMOD Conference on Management of Data, 2002.
24. S. Madden, R. Szewczyk, M. Franklin, and D. Culler. Supporting aggregate queries over ad-hoc wireless sensor networks. In Workshop on Mobile Computing and Systems Applications, 2002.
25. S. R. Madden, M. J. Franklin, J. M. Hellerstein, and W. Hong. The design of an acquisitional query processor for sensor networks. In Proceedings of the ACM SIGMOD Conference on Management of Data, 2003.
26. S. R. Madden, J. M. Hellerstein, and W. Hong. TinyDB: In-network query processing in TinyOS. http://telegraph.cs.berkeley.edu/tinydb, Sept. 2003.
27. R. Motwani, J. Widom, A. Arasu, B. Babcock, S. Babu, M. Datar, G. Manku, C. Olston, J. Rosenstein, and R. Varma. Query processing, approximation, and resource management in a data stream management system. In Proceedings of the International Conference on Innovative Data Systems Research (CIDR), 2003.
28. B. Nath and D. Niculescu. Routing on a curve. In Proceedings of the Workshop on Hot Topics in Networks, 2002.
29. M. T. Ozsu and P. Valduriez. Principles of Distributed Database Systems. Prentice Hall, 1999.
30. J. Pan, Y. T. Hou, L. Cai, Y. Shi, and S. X. Shen. Topology control for wireless sensor networks. In Proceedings of the International Conference on Mobile Computing and Networking (MobiCom), 2003.
31. G. Pottie and W. Kaiser. Wireless integrated sensor networks. Communications of the ACM, 43, 2000.
32. R. Ramanathan and R. Rosales-Hain. Topology control in multihop wireless networks using transmit power adjustment. In Proceedings of the IEEE INFOCOM, 2000.
33. Y. Yao and J. Gehrke. The cougar approach to in-network query processing in sensor networks. In SIGMOD Record, 2002.
34. Y. Yao and J. Gehrke. Query processing for sensor networks. In Proceedings of the International Conference on Innovative Data Systems Research (CIDR), 2003.
