Communication-Efficient Implementation of Join
in Sensor Networks
Vishal Chowdhary and Himanshu Gupta
SUNY, Stony Brook, NY 11754.
vishal,hgupta@cs.sunysb.edu
Abstract. A sensor network is a wireless ad hoc network of resource-constrained sensor nodes. In this article, we address the problem of
communication-efficient implementation of the SQL “join” operator in
sensor networks. We design an optimal join-implementation algorithm
that provably incurs minimum communication cost under certain reasonable assumptions. In addition, we design a much faster suboptimal
heuristic that empirically delivers a near-optimal solution. We evaluate
the performance of our designed algorithms through extensive simulations.
1
Introduction
A sensor network consists of sensor nodes with a short-range radio and on-board
processing capability forming a multi-hop network of an irregular topology. Each
sensor node can sense certain physical phenomena like light, temperature, or vibration. There are many exciting applications [3, 13, 14] of such sensor networks,
including monitoring and surveillance systems in both military and civilian contexts, building smart environments and infrastructures such as intelligent transportation systems and smart homes. In a sensor network, sensor nodes generate data items that are simply readings of one or more sensing devices on the
node. Thus, a sensor network can be viewed as a distributed database system
where each sensor node generates a stream of data tuples. Appropriately enough,
the term sensor database is increasingly being used in research literature. Like
a database, the sensor network is queried to gather and/or process the sensed
data tuples. Database queries in SQL are a very general representation of queries
over data, and efficient implementation of SQL queries is of great significance
because of the enormous amount of data present in a typical sensor network.
Since sensor nodes have limited battery energy, the distributed implementation
of SQL queries in sensor networks must minimize the communication cost incurred, which is the main consumer of battery energy [31].
In this article, we address how to efficiently execute database queries in a
sensor network, when the data distributed across sensors in a sensor network is
viewed as relational database tables. In particular, we address communication-efficient in-network processing of the join operator, which is essentially a Cartesian product of the operand tables followed by a predicate selection. We design
an optimal algorithm for a join operation that provably incurs minimum communication cost in dense sensor networks under some reasonable assumptions of
communication cost and computation model. We also design a much faster suboptimal heuristic that empirically performs very close to the optimal algorithm,
and results in significant savings over the naive approaches.
The rest of the paper is organized as follows. We start with modeling the
sensor network as a database in Section 2. In Section 3, we present various
algorithms for in-network implementation of the join operator, along with certain
generalizations. We present our experimental results in Section 4. Related work is
discussed in Section 5, and concluding remarks are presented in Section 6.
2
Sensor Network Databases
A sensor network consists of a large number of sensors distributed randomly in a
geographical region. Each sensor has limited on-board processing capability and
is equipped with sensing devices. A sensor node also has a radio which is used
to communicate directly with some of the sensors around it. Two sensor nodes
S1 and S2 can directly communicate with each other if and only if the distance
between them is less than the transmission radius. Sensor nodes may indirectly
communicate with each other through other intermediate nodes – thus, forming
a multi-hop network. We assume that each sensor node in the sensor network
has a limited storage capacity of m units. Also, sensors have limited battery
energy, which must be conserved for prolonged unattended operation. Thus, we
have focused on minimization of communication cost (hence, energy cost) as the
key performance criterion of the join implementation strategies.
2.1
Modeling the Sensor Network as a Database
In a sensor network, the data generated by the sensor nodes is simply the readings of one or more sensing devices on the node. Thus, the data present in a
sensor network can be modeled as relational database tables, wherein each sensor produces data records/tuples of a certain format and semantics. In some
sense, a relational database table is a collection of similar-typed tuples from a
group of sensors in the network. Due to the spatial and real-time nature of the
data generated, a tuple usually has timeStamp and nodeLocation as attributes.
In a sensor network, relational database tables are typically stream database
tables [2] partitioned horizontally across (or generated by) a set of sensors in the
network.
In-network Implementation. A plausible implementation of a sensor network
database query engine could be to have an external database system handle all
the queries over the network. In such a realization, all the data from each sensor
node in the network is sent to the external system that handles the execution of
queries completely. Such an implementation would incur very high communication costs and congestion-related bottlenecks. Thus, prior research has proposed
query engines that would execute the queries within the network with little external help. In particular, [18] shows that in-network implementation of database
queries is fundamental to achieving energy-efficient communication in sensor
networks. The focus of this article is communication-efficient in-network implementation of the join operator. As selection and projection are unary operators
and operate on each tuple independently, they could be efficiently implemented
using efficient routing and topology construction techniques. Union operation
can be reduced to duplicate elimination, and the difference and intersection operations can be reduced to the join operation. Implementation of other database
operators (aggregation, duplicate elimination, and outerjoins) is challenging and
is the focus of our future work.
Querying and Cost Model. A query in a sensor network is initiated at a node
called query source and the result of the query is required to be routed back to
the query source for storage and/or consumption. A stream database table may
be generated by a set of sensor nodes in a closed geographical region. The optimization algorithms, proposed in this article, to determine how to implement
the join operation efficiently, are run at the query source. As typical sensor network queries are long running, the query source can gather all the catalogue
information needed (estimated sizes and locations of the operand relations, join
selectivity factor to estimate the size of the join result, density of the network)
by initially sampling the operand tables. As mentioned before, we concentrate on
implementations that minimize communication cost. We define the total communication cost incurred as the total data transfer between neighboring sensor
nodes.
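As a rough illustration of this cost model, the sketch below charges one unit of cost per data unit per hop, so routing a table over a multi-hop path costs its size times the number of hops. The function names and the hop-count estimate (Euclidean distance divided by the transmission radius, rounded up) are assumptions made for illustration and presume a dense network; they are not part of the model stated above.

```python
import math

def estimated_hops(src, dst, radius):
    """Rough hop count between two points, assuming a dense network in
    which each hop advances about one transmission radius (assumption)."""
    return math.ceil(math.dist(src, dst) / radius)

def routing_cost(num_units, src, dst, radius):
    """Total data transfer between neighboring nodes: every data unit is
    retransmitted once per hop along the path."""
    return num_units * estimated_hops(src, dst, radius)
```

For example, routing 100 tuples over a distance of 5 units with a transmission radius of 1 unit is charged 500 under this sketch.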
3
In-network Implementation of Join
In this section, we develop communication-efficient algorithms for implementation of a join operation in sensor networks. We start with assuming that the
operand tables are static (non-streaming). Later in the section, we describe how
our algorithms can be generalized for stream database tables, as data in a sensor
network is better represented as data stream tables.
The SQL join operator is used to correlate data from multiple tables, and
can be defined as a selection predicate over the cross-product of a pair of tables; a join of R and S tables is denoted as R ⋈ S. Consider a join operation,
initiated by a query source node Q, involving two static (non-streaming) tables
R and S distributed horizontally across some geographical regions R and S in
the network. We assume that the geographic regions are disjoint and small relative to the distances between the query source and the operand table regions
(see [10] for a discussion on relaxation of this assumption). If we do not make
any assumptions about the join predicates involved, each data tuple of table R
should be paired with every tuple of S and checked for the join condition. The
joined tuple is then routed (if it passes the join selection condition) to the query
source Q where all the tuples are accumulated or consumed. Given that each
sensor node has limited memory resources, we need to find out appropriate regions in the network that would take the responsibility of computing the join. In
particular, we may need to store and process the relations at some intermediate
location before routing the result to the query source.
A simple nested-loop implementation of a join used in traditional databases
is to generate the cross product (all pairs of tuples), and then extract those pairs
that satisfy the selection predicate of the join. More involved implementations
of a join operator widely used in database systems are the sort-merge join and the hash join. These classical methods are unsuitable for direct implementation in sensor
networks due to the limited memory resources at each node in the network.
Moreover, the traditional join algorithms focus on minimizing computation cost,
while in sensor networks the primary performance criterion is communication
cost. Below, we discuss various techniques for efficient implementation of the
join operation in sensor networks.
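For reference, the nested-loop formulation mentioned above can be sketched as follows. This is a centralized, in-memory illustration only; the tuple layout (nodeLocation, timeStamp, value) and the sample readings are hypothetical.

```python
def nested_loop_join(r_tuples, s_tuples, predicate):
    """Pair every tuple of R with every tuple of S and keep the pairs
    that satisfy the join predicate."""
    return [(r, s) for r in r_tuples for s in s_tuples if predicate(r, s)]

# Hypothetical readings: (nodeLocation, timeStamp, value)
R = [("n1", 10, 21.5), ("n2", 11, 22.0)]
S = [("n7", 10, 0.3), ("n8", 12, 0.9)]

def same_time(r, s):
    """Example join predicate: equal timestamps."""
    return r[1] == s[1]

result = nested_loop_join(R, S, same_time)
```

Every tuple of R meets every tuple of S, which is exactly the pairing requirement that makes the placement of the join computation in the network matter.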
Naive Approach. A simple way to compute R ⋈ S could be to route the tuples
of S from their original location S to the region R, broadcast the S-tuples in the
region R, compute the join within the region R, and then route the joined tuples
to the query source Q. We refer to this approach as the Naive approach. Note
that the roles of the tables R and S can be interchanged in the above approach.
Centroid Approach. Centroid approach is to compute the join operation in
a circular region around some point C in the sensor network. In particular, let
Pc be the smallest circular region around C such that the region Pc has at least
|R|/m sensor nodes to store the table R. First, we route both operand tables
to C. Second, we distribute R and broadcast S in the region Pc around C. Lastly,
we compute the join operation, and route the resulting tuples of (R ⋈ S) to the
query source Q. Since the communication cost incurred in the second step is
independent of the choice of C, it is easy to see that the communication cost
incurred in the above approach is minimized when the point C is the weighted
centroid of the triangle formed by R, S, and Q. Here, the choice of the centroid
point C is weighted by the sizes of R, S, and (R ⋈ S).
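A minimal sketch of the centroid computation follows, reading "weighted centroid" as the weighted average of the three locations with weights |R|, |S|, and |R ⋈ S|. This reading, and the representation of each region by a single point, are assumptions made for illustration.

```python
def weighted_centroid(points, weights):
    """Weighted average of 2-D points; weights are the data sizes
    associated with each point (assumed reading of 'weighted centroid')."""
    total = sum(weights)
    x = sum(w * p[0] for p, w in zip(points, weights)) / total
    y = sum(w * p[1] for p, w in zip(points, weights)) / total
    return (x, y)
```

A larger join result pulls C toward Q, while larger operand tables pull it toward their own regions.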
3.1
Optimal Join Algorithm
In this section, we present an algorithm that constructs an optimal region for
computing the join operation using minimum communication cost. We assume
that the sensor network is sufficiently dense that we can find a sensor node at
any point in the region. To formally prove the claim of optimality, we need to
restrict ourselves to a class of join algorithms called Distribute-Broadcast Join
Algorithms (defined below). In effect, our claim of optimality states that the
proposed join algorithm incurs no more communication cost than any other distribute-broadcast join algorithm.
Definition 1. A join algorithm to compute R ⋈ S in a sensor network is a
distribute-broadcast join algorithm if the join is processed by first distributing
the table R in some region P (other than the region R storing R; see footnote 1) of the sensor
network followed by broadcasting the relation S within the region P to compute
the join. The joined tuples are then routed from each sensor in the region P to
the query source.
Fig. 1. Possible Shape of an Optimal Join-Region. Panel (a): paths Pr and Ps, the segment Cq Q, and a circular region PO around Q. Panel (b): paths Pr and Ps and the partial segment Cq Cq2.
As before, consider a query source Q and regions R and S that store the static
operand tables R and S in a sensor network. The key challenge in designing an
optimal algorithm for implementation of a join operation is to select a region
P for processing the join in such a way that the total communication cost is
minimized. We use the term join-region to refer to a region in the sensor network
that is responsible for computing the join.
Shape of an Optimal Join-Region. Theorem 1 (see [10] for proof) shows
that the join-region P that incurs minimum communication cost has a shape as
shown in Figure 1 (a) or (b). In particular, the optimal join-region P is formed
using three points Cr , Cs , and Cq in the sensor network (typically these points
will lie within the △RSQ). More precisely, given three points Cr , Cs , and Cq in
the sensor network, the region P takes one of the following forms:
1. Region P is formed of the paths Pr = (Cr , Cq ) and Ps = (Cs , Cq ), the line
segment Cq Q, and a circular region PO of appropriate radius around Q. See
Figure 1 (a).
2. Region P is formed of the paths Pr = (Cr , Cq ) and Ps = (Cs , Cq ), and a
part of the line segment Cq Q. See Figure 1 (b).
The total number of sensors in the region P is l = |R|/m, where |R| is the
size of the table R which will be distributed over the region P , and m is the
memory size of each sensor node.
Theorem 1. The shape of the join-region P used by a distribute-broadcast join
algorithm that incurs optimal communication cost is as described above or as
depicted in Figure 1 (a) or (b).
Footnote 1. Otherwise, the algorithm will be identical to one of the Naive Approaches.
Theorem 1 restricts the shape of an optimal join-region. However, there are
still an infinite number of possible join-regions of shapes depicted in Figure 1.
Below, we further restrict the shape of an optimal join-region by characterizing
the equations of the paths Pr and Ps , which connect Cr and Cs respectively to
Cq . We start with a definition.
Definition 2. The sensor length between a region X and a point y in a sensor
network plane is denoted as d(X , y) and is defined as the average weighted distance, in terms of number of hops/sensors, between the region X and the point
y. Here, the distance between a point x ∈ X and y is weighted by the amount of
data residing at x.
Optimizing Paths Pr and Ps . Consider an optimal join-region P that implements a join operation using minimum communication cost. From Theorem 1,
we know that the region P is of the shape depicted in Figure 1 (a) or (b). The
total communication cost T incurred in processing of a join using the region P
is
$$T = |R|\,d(R, C_r) + |S|\,d(S, C_s) + |R \bowtie S|\,d(P, Q) + |R||P|/2 + |S||P|,$$
where the first two terms represent the cost of routing R and S to Cr and
Cs respectively, the third term represents the cost of routing the result R ⋈ S
from P to Q, and the last two terms represent the cost of distributing R and
broadcasting S in the region P . Here, we assume that lack of global knowledge
about the other sensors’ locations and available memory capacities preclude the
possibility of distributing or broadcasting more efficiently than doing it in a
simple linear manner. Now, the only component of cost T that depends on the
shape of P is |R ⋈ S| d(P, Q). Let P′ = P − Pr − Ps , i.e., the region P without the
paths Pr and Ps . Since the result R ⋈ S is evenly spread along the entire region
P , we have
$$d(P, Q) = \frac{1}{|P|}\bigl(|P'|\,d(P', Q) + |P_r|\,d(P_r, Q) + |P_s|\,d(P_s, Q)\bigr),$$
where the notation |B| for a region B denotes the number of sensor nodes in the region
B. For a given set of points Cr , Cs , and Cq , the total communication cost T is
minimized when the path Pr is constructed such that |Pr | d(Pr , Q) is minimized.
Otherwise, we could reconstruct Pr with a smaller |Pr | d(Pr , Q), and remove/add
sensor nodes from the end (see footnote 2) of the region P′ to maintain |P | = |R|/m. Removal
of sensor nodes from P ′ will always reduce T , and it can be shown that addition
of sensor nodes to the end of the region P ′ will not increase the cost more
than the reduction achieved by optimizing Pr . Similarly, the path Ps could be
optimized independently.
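The cost expression T and the decomposition of d(P, Q) can be evaluated with a short sketch such as the one below. The function names and the representation of each part of P as a pair (number of nodes, average hop distance to Q) are assumptions made for illustration.

```python
def sensor_length_P(parts):
    """d(P, Q) as the node-count-weighted average of d(part, Q).
    parts: list of (num_nodes, avg_hops_to_Q) for P', Pr, and Ps."""
    total_nodes = sum(n for n, _ in parts)
    return sum(n * d for n, d in parts) / total_nodes

def total_cost(size_R, size_S, size_join, d_R_Cr, d_S_Cs, parts):
    """Total communication cost T for a join-region P made of `parts`."""
    size_P = sum(n for n, _ in parts)
    return (size_R * d_R_Cr                        # route R to Cr
            + size_S * d_S_Cs                      # route S to Cs
            + size_join * sensor_length_P(parts)   # route the result to Q
            + size_R * size_P / 2                  # distribute R over P
            + size_S * size_P)                     # broadcast S in P
```

Only the third term changes when the shape of P changes for a fixed |P|, which is why the optimization of the paths focuses on |Pr| d(Pr, Q) and |Ps| d(Ps, Q).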
Footnote 2. Here, by the end of the region P′ , we mean either the circular part PO or the line
segment Cq Cq2 depending on the shape.
We now derive the equation of the path Pr that minimizes |Pr | d(Pr , Q) for
a given Cr and Cq . Consider an arbitrary point R(x, y) along the optimal path
Pr . The length of an infinitesimally small segment of the path Pr beginning at
R(x, y) is $\sqrt{(dx)^2 + (dy)^2}$, and the average distance of this segment from Q is
$\sqrt{x^2 + y^2}$, if the coordinates of Q are (0, 0). The sum of all these distances over the
path Pr is
$$F = \int_{x_1}^{x_2} \sqrt{x^2 + y^2}\,\sqrt{1 + (y')^2}\;dx.$$
To get the equation for the path
Pr , we would need to determine the extremals of the above function F . Using
the technique of calculus of variations [15], we can show that the extremal values
of F satisfy the Euler-Lagrange differential equation. The equation of the path
Pr can thus be computed as (we omit the details):
$$\beta = x^2 \cos\alpha + 2xy \sin\alpha - y^2 \cos\alpha$$
where the constants α and β are evaluated by substituting for coordinates of Cr
and Cq in the equation.
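One way to carry out this substitution is sketched below. Writing u = x^2 − y^2 and v = 2xy, the path equation becomes β = u cos α + v sin α, which is linear in (cos α, sin α), so the two endpoint conditions determine α (up to adding π, which merely flips the sign of β) and then β. The function names are hypothetical, and Q is assumed to be at the origin as in the derivation.

```python
import math

def path_constants(cr, cq):
    """Find (alpha, beta) such that the curve
    beta = x^2 cos(alpha) + 2xy sin(alpha) - y^2 cos(alpha)
    passes through both cr and cq (Q is assumed to be at the origin)."""
    u1, v1 = cr[0] ** 2 - cr[1] ** 2, 2 * cr[0] * cr[1]
    u2, v2 = cq[0] ** 2 - cq[1] ** 2, 2 * cq[0] * cq[1]
    # Both points give the same beta: (u1-u2) cos(a) + (v1-v2) sin(a) = 0
    alpha = math.atan2(u2 - u1, v1 - v2)
    beta = u1 * math.cos(alpha) + v1 * math.sin(alpha)
    return alpha, beta

def on_path(point, alpha, beta, tol=1e-9):
    """Check whether a point satisfies the path equation."""
    x, y = point
    lhs = (x * x * math.cos(alpha) + 2 * x * y * math.sin(alpha)
           - y * y * math.cos(alpha))
    return abs(lhs - beta) <= tol
```

For instance, with Cr = (4, 1) and Cq = (2, 2) the sketch yields the curve xy = 4, which passes through both points.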
Optimal Join Algorithm. Given points Cr , Cs , Cq , and Q, let Pr and Ps be
the optimized paths connecting Cr and Cs to Cq respectively as described above.
For a given triplet of points (Cr , Cs , Cq ), the optimal join-region P is as follows.
Let l = |R|/m and lY = |Pr | + |Ps | + |Cq Q|.
– When lY < l, P = Pr ∪ Ps ∪ Cq Q ∪ PO , where PO is a circular region
around Q such that |PO | = l − (|Cq Q| + |Pr | + |Ps |). See Figure 1 (a).
– When lY ≥ l, P = Pr ∪ Ps ∪ Cq Cq2 , where Cq2 is such that |Cq Cq2 | =
l − (|Pr | + |Ps |). See Figure 1 (b).
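The case analysis above can be written as a small helper. This is a sketch under the assumption that all lengths are measured in number of sensor nodes and that |Pr| + |Ps| does not exceed l; the function name and return format are hypothetical.

```python
def join_region_shape(size_R, m, len_Pr, len_Ps, len_CqQ):
    """Return which form of Figure 1 applies, together with the size of
    the end part (|PO| for form 'a', |Cq Cq2| for form 'b')."""
    l = size_R / m                       # nodes needed to store R
    l_Y = len_Pr + len_Ps + len_CqQ      # nodes on the paths and segment Cq Q
    if l_Y < l:
        # Figure 1 (a): the leftover nodes form a circular region around Q
        return "a", l - l_Y
    # Figure 1 (b): only part of the segment Cq Q is needed
    return "b", l - (len_Pr + len_Ps)
```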
Now, we can construct an optimal join-region to compute a join operation
for tables R and S and the query source Q, by considering all possible triplets of
points Cr , Cs , and Cq in the sensor network, and picking the triplet (Cr , Cs , Cq )
that results in a join-region P (as described above) with minimum communication
cost. The time complexity of the above algorithm is O(n^3), where n is the total
number of sensor nodes in the sensor network.
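The exhaustive search can be sketched as follows. The callback region_cost is a hypothetical, caller-supplied function that builds the join-region for a given triplet and returns its total communication cost; it stands in for the construction described above.

```python
from itertools import product

def best_triplet(nodes, region_cost):
    """Try every (Cr, Cs, Cq) triplet of node positions and return the
    cheapest one together with its cost."""
    best, best_cost = None, float("inf")
    for cr, cs, cq in product(nodes, repeat=3):   # n^3 triplets
        cost = region_cost(cr, cs, cq)
        if cost < best_cost:
            best, best_cost = (cr, cs, cq), cost
    return best, best_cost
```

The triple loop makes the cubic number of candidate triplets explicit.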
Suboptimal Heuristic. The high time complexity of the optimal algorithm
described above makes it impractical for large sensor networks.
Thus, we propose a suboptimal heuristic that runs
in O(n^(3/2)) time, and incidentally performs very
well in practice. Essentially, for a given Cr , we
stipulate that Cs should be symmetrically (|R|d(R, Cr ) =
|S|d(S, Cs )) located in the △RQS. In addition, we
approximate paths Pr and Ps to be straight line
segments, and choose the point Cq on the median
of the △Cr Cs Q. See Figure 2. Thus, for each point
as Cr in the sensor network, we determine Cs and
search for the best Cq on the median of △Cr Cs Q.
Fig. 2. Heuristic (showing Q, Cq , Cr , Cs , M, and the regions of Table R and Table S, with |R|d(R, Cr ) = |S|d(S, Cs )).
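The search for Cq along the median can be sketched as below. We assume here that the median in question runs from the midpoint M of the segment Cr Cs to Q, which is our reading of Figure 2; the number of candidate points k and the callback region_cost are hypothetical parameters.

```python
def median_candidates(cr, cs, q, k):
    """Return k+1 evenly spaced candidate positions for Cq on the median
    from the midpoint M of segment Cr-Cs to Q (assumed reading of Fig. 2)."""
    mx, my = (cr[0] + cs[0]) / 2, (cr[1] + cs[1]) / 2
    return [(mx + (q[0] - mx) * i / k, my + (q[1] - my) * i / k)
            for i in range(k + 1)]

def best_cq(cr, cs, q, k, region_cost):
    """Pick the candidate Cq with the lowest join-region cost."""
    return min(median_candidates(cr, cs, q, k),
               key=lambda cq: region_cost(cr, cs, cq))
```

Restricting Cs by symmetry and Cq to a line reduces the search from all triplets to one line search per choice of Cr.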
3.2
Join Implementation for Stream Database Tables
In the previous subsection, we discussed implementation of the join operation
in a sensor network for static database tables. Since sensor network data is
better represented as stream database tables, we now generalize the algorithms
to handle stream database tables. First, we start with presenting our model of
stream database tables in sensor networks.
Data Streams in Sensor Networks. As for the case of static tables, a stream
database table R corresponding to a data stream in a sensor network is associated
with a region R, where each node in R is continually generating tuples for the
table R. To deal with the unbounded size of stream database tables, the tables
are usually restricted to a finite set of tuples called the sliding window [1, 12, 27].
In effect, we expire or archive tuples from the data stream based on some criteria
so that the total number of stored tuples does not exceed the bounded window
size. We use WR to denote the sliding window for a stream database table R.
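A count-based sliding window, one of several possible expiry criteria, can be sketched as follows. The class name and interface are hypothetical; the sketch keeps the most recent tuples and drops the oldest one once the window is full.

```python
from collections import deque

class SlidingWindow:
    """Count-based sliding window: keeps only the most recent tuples."""

    def __init__(self, max_tuples):
        self._tuples = deque(maxlen=max_tuples)

    def insert(self, tup):
        # A deque with maxlen discards the oldest entry when it is full.
        self._tuples.append(tup)

    def __iter__(self):
        return iter(self._tuples)

    def __len__(self):
        return len(self._tuples)
```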
Naive Approach for Stream Tables. In the Naive Approach, we use the
region R (or S) to store the windows WR and WS of the stream tables R and
S (see footnote 3). Each sensor node in the region R uses a |WR |/(|WR | + |WS |) fraction of its
local memory to store tuples of WR , and the remaining fraction of the memory
to store tuples of WS . To perform the join operation, each newly generated tuple
(of R or S) is broadcast to all the nodes in the region R, and is also stored in
some node of R with available memory. Note that the generated data tuples of
S need to be first routed from the region S to the region R. The resulting joined
tuples are routed from R to the query source Q.
Generalizing Other Approaches. The other approaches viz. Centroid Approach, Optimal Algorithm, and Suboptimal Heuristic, use a join-region that is
separate from the regions R and S. These algorithms are generalized to handle
stream database tables as follows. First, the strategy to choose the join-region P
remains the same as before for static tables, except for the size of the join-region.
For stream database tables, the chosen join-region is used to store WR as well
as WS , with each sensor node in the join-region using a |WR |/(|WR | + |WS |) fraction
of its memory to store tuples of WR , and the rest to store tuples of WS . Each
newly generated tuple (of R or S) is routed from its source node in R or S to the
join-region P , and broadcast to all the nodes in P . The resulting joined tuples
are then routed to Q. As part of the broadcast process (without incurring any
additional communication cost), each generated tuple of R (or S) is also stored
at some node in P with available memory.
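The per-tuple processing described above can be mimicked on a single machine as in the sketch below: each arriving tuple probes the window of the other stream and is then stored in its own window. This is a centralized simulation of what the broadcast within P achieves, not a distributed implementation; the class and function names are hypothetical.

```python
from collections import deque

def memory_share(w_r, w_s):
    """Fraction of each node's memory devoted to tuples of WR."""
    return w_r / (w_r + w_s)

class WindowJoin:
    """Symmetric join over two count-based sliding windows."""

    def __init__(self, w_r, w_s, predicate):
        self.windows = {"R": deque(maxlen=w_r), "S": deque(maxlen=w_s)}
        self.predicate = predicate

    def on_tuple(self, source, tup):
        """Probe the opposite window with the new tuple, then store it."""
        if source == "R":
            out = [(tup, s) for s in self.windows["S"] if self.predicate(tup, s)]
        else:
            out = [(r, tup) for r in self.windows["R"] if self.predicate(r, tup)]
        self.windows[source].append(tup)
        return out
```

With equal window sizes, as in the experiments of Section 4, memory_share returns 0.5, so each node splits its memory evenly between the two windows.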
4
Performance Evaluation
In this section, we compare the performance of Naive Approach, Centroid Algorithm, Optimal Algorithm, and Suboptimal Heuristic. In our previous discussion,
we have assumed dense sensor networks where we can find a sensor node at any
desirable point in the region. On real sensor networks, we use our proposed algorithms in conjunction with the trajectory based forwarding (TBF) routing technique [28], which works by forwarding packets to nodes closest to the intended
path/trajectory. More specifically, to form the Pr , Ps, and Cq Q (or Cq Cq2 ) parts
of the join-region, we use nodes that are closest to uniformly spaced points on
the geometrically constructed paths. In addition, each algorithm is generalized
for stream database tables as discussed in Section 3.2. We refer to the generalized
algorithms as Naive, Centroid, OptBased, and Suboptimal Heuristic respectively.
Footnote 3. If the total memory of the nodes in R is not sufficient to store WR and WS , then
the region R is expanded to include more sensor nodes.
Fig. 3. Total communication cost for various transmission radii (t), and fixed △RSQ. Panels plot total communication cost (x 10^3) against the join selectivity factor for (a) t = 0.13 units, (b) t = 0.15 units, and (c) t = 0.18 units; curves: Naive, Centroid, Suboptimal Heuristic, OptBased.
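The node-selection step used to realize a geometric path on a real network (taking the nodes closest to uniformly spaced points on the path) can be sketched for a straight segment as follows. This illustrates only the geometric selection, not the TBF protocol itself; the function name and parameters are hypothetical.

```python
import math

def nodes_along_segment(nodes, start, end, count):
    """For each of `count` uniformly spaced points on the segment
    start-end, pick the closest node (duplicates removed, order kept)."""
    chosen = []
    for i in range(count):
        t = i / (count - 1) if count > 1 else 0.0
        target = (start[0] + t * (end[0] - start[0]),
                  start[1] + t * (end[1] - start[1]))
        nearest = min(nodes, key=lambda n: math.dist(n, target))
        if nearest not in chosen:
            chosen.append(nearest)
    return chosen
```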
Definition 3. Given instances of relations R and S and a join predicate, the
join-selectivity factor (f) is the probability that a random pair of tuples from R
and S will satisfy the given join predicate. In other words, the join selectivity
factor is the ratio of the size of R ⋈ S to the size of the Cartesian product, i.e.,
f = |R ⋈ S|/(|R||S|).
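As a small worked example of Definition 3, the sketch below computes f by counting matching pairs; the function name and the toy relations are hypothetical. Conversely, a known f gives the expected result size f|R||S|; for instance, f = 10^-4 with |R| = |S| = 8,000 corresponds to 6,400 joined tuples.

```python
def join_selectivity(r_tuples, s_tuples, predicate):
    """Ratio of the join result size to the size of the Cartesian product."""
    matches = sum(1 for r in r_tuples for s in s_tuples if predicate(r, s))
    return matches / (len(r_tuples) * len(s_tuples))

# Toy relations: 2 of the 20 pairs satisfy the equality predicate.
f = join_selectivity([1, 2, 3, 4], [2, 4, 6, 8, 10], lambda r, s: r == s)
```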
Parameter Values and Experiments. We generated random sensor networks
by randomly placing 10,000 sensors with uniform transmission radius (t) in an
area of 10×10 units. For the purposes of comparing the performance of our algorithms, varying the number of sensors is tantamount to varying the transmission
radius. Thus, we fix the number of sensors to be 10,000 and measure performance
for different transmission radii. Memory size of a sensor node is 300 tuples, and
the size of each of the sliding windows WR and WS of stream tables R and S is
8,000 tuples. For simplicity, we chose uniform data generation rates for R and S
streams. In each of the experiments, we measure communication cost incurred in
processing 8,000 newly generated tuples of R and S each, after the join-region is
already filled with previously generated tuples. We use the GPSR [19] algorithm
to route tuples. Catalogue information is gathered for non-Naive approaches by
collecting a small sample of data streams at the query source. In the first set
of experiments, we consider a fixed △RSQ and calculate the total communication cost for various transmission radii and join-selectivity factors. Next, we fix
the transmission radius and calculate the total communication cost for various
join-selectivity factors and various shapes/sizes of the △RSQ.
Fixed Triangle RSQ. In this set of experiments (Figure 3), we fix the locations of regions R, S, and query source Q and measure the performance of our
algorithms for various values of transmission radii and join-selectivity factors.
In particular, we choose coordinates (0,0), (5,9.5), and (9.5,0) for R, Q, and S
respectively.
We have looked at three transmission radii viz. 0.13, 0.15, and 0.18 units. Lower
transmission radii left the sensor network disconnected, and the trend observed
for these three transmission radii values is sufficient to infer behavior for larger
transmission radii (see Figure 4). From Figure 3 (a)-(c), we can see that the
Suboptimal Heuristic performs very close to the OptBased Algorithm, and significantly outperforms (up to 100%) the Naive and Centroid Approaches for most
parameter values. The performance of the Naive approach worsens drastically
with the increase in the join-selectivity factor, since the routing cost of the joined
tuples from the join region (R or S) to the query source Q becomes more dominant.
Fig. 5. Total communication cost for various △RSQ. Here, t = 0.15. Panels plot total communication cost (x 10^3) against the area of triangle QRS for (a) f = 10^-4, (b) f = 5*10^-3, and (c) f = 10^-2; curves: Naive, Centroid, Suboptimal Heuristic, OptBased.
Fixed Transmission Radius (0.15 units). We
also observe the performance of various algorithms
for different sizes and shapes of △RSQ. In particular, we fix the transmission radius of each sensor
node in the network to be 0.15 units, and generate various △RSQ's as follows. We fix locations
of regions R and S, and select many locations of
the query source Q with the constraint that the
area of the △RSQ is between 10% and 50% of the
total sensor network area. For each such generated △RSQ, we run all the four algorithms for
three representative join-selectivity factor values
viz. 10^-4, 5*10^-3, and 10^-2. See Figure 5. Again we observe that the Suboptimal Heuristic performs very close to the OptBased Algorithm, and incurs
much less communication cost than the Naive and Centroid Approaches for all
join-selectivity factor values.
Fig. 4. Total communication cost (x 10^3) against transmission radius for Naive, Centroid, Suboptimal Heuristic, and OptBased. Here, f = 0.05.
Summary. From the above experiments, we observe that the Suboptimal Heuristic performs very close to the OptBased Algorithm, but performs substantially
better than the Centroid and Naive Approaches for a wide range of sensor network parameters. The savings in communication cost decrease with an increase in
join-selectivity factor and/or transmission radius. We expect the join-selectivity
factor to be relatively low in large sensor networks because of large sizes of
operand tables and data generated having only local spatial and temporal data
correlations. Moreover, since sensor nodes have the capability to adjust transmission power, effective topology control [30, 32] is used to minimize transmission
radius at each node to conserve overall energy. Thus, the Suboptimal Heuristic
is a natural choice for efficient implementation of join in sensor networks, and
should result in substantial energy savings in practice.
5
Related Work
The vision of sensor network as a database has been proposed by many works [5,
16, 26], and simple query engines such as TinyDB [26] have been built for sensor
networks. In particular, the COUGAR project [5, 33, 34] at Cornell University
is one of the first attempts to model a sensor network as a database system.
The TinyDB Project [26] at Berkeley also investigates query processing techniques for sensor networks. However, TinyDB implements very limited functionality [25] of the traditional database language SQL. A plausible implementation
of an SQL query engine for sensor networks could be to ship all sensor nodes’
data to an external server that handles the execution of queries completely [21].
Such an implementation would incur high communication costs and congestion-related bottlenecks. In particular, [18] shows that in-network implementation
of database queries is fundamental to conserving energy in sensor networks.
Thus, recent research has focussed on in-network implementation of database
queries. However, prior research has only addressed limited SQL functionality
– single queries involving simple aggregations [22, 24, 34] and/or selections [25]
over single tables [23], or local joins [34]. So far, it has been considered that correlations such as median computation or joins should be computed on a single
node [4, 25, 34]. In particular, [4] addresses the problem of operator placement for
in-network query processing, assuming that each operator is executed locally and
fully on a single sensor node. The problem of distributed and communication-efficient implementation of join has not been addressed yet in the context of
sensor networks.
In addition, there has been a large body of work done on efficient query
processing in data stream processing systems [6, 8, 9, 27]. In particular, [11] approximates sliding window joins over data streams and [17] has designed join
algorithms for joining multiple data streams constrained by a sliding time window. However, a data stream processing system is not necessarily distributed
and hence, minimizing communication cost is not the focus of the research.
There has been a lot of work on query processing in distributed database systems [7, 20, 29], but sensor networks differ significantly from distributed database
systems because of their multi-hop communication cost model and resource limitations.
6
Conclusions
Sensor networks are capable of generating large amounts of data. Hence, efficient
query processing in sensor networks is of great importance. Since sensor nodes
have limited battery power and memory resources, designing communication-efficient distributed implementation of database queries is a key research challenge. In this article, we have focussed on implementation of the join operator,
which is one of the core operators of database query languages. In particular, we
have designed an Optimal Algorithm that incurs minimum communication cost
for implementation of join in sensor networks under certain reasonable assumptions. Moreover, we reduced the time complexity of the Optimal Algorithm to
design a Suboptimal Heuristic, and showed through extensive simulations that
the Suboptimal Heuristic performs very close to the Optimal Algorithm. Techniques developed in this article are shown to result in substantial energy savings
over simpler approaches for a wide range of sensor network parameters.
References
1. D. J. Abadi, D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and
S. Zdonik. Aurora: a new model and architecture for data stream management. The VLDB Journal, 12(2):120–139, 2003.
2. B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In
Proceedings of the ACM Symposium on Principles of Database Systems (PODS), 2002.
3. B. Badrinath, M. Srivastava, K. Mills, J. Scholtz, and K. Sollins, editors. Special Issue on Smart Spaces and
Environments, IEEE Personal Communications, 2000.
4. B. Bonfils and P. Bonnet. Adaptive and decentralized operator placement for in-network query processing. In
Proceedings of the International Workshop on Information Processing in Sensor Networks (IPSN), 2003.
5. P. Bonnet, J. Gehrke, and P. Seshadri. Towards sensor database systems. In Proceedings of the International
Conference on Mobile Data Management, 2001.
6. D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and
S. Zdonik. Monitoring streams - A new class of data management applications. In Proceedings of the International
Conference on Very Large Data Bases (VLDB), 2002.
7. S. Ceri and G. Pelagatti. Distributed Database Design: Principles and Systems. McGraw-Hill (New York NY), 1984.
8. S. Chandrasekaran, O. Cooper, A. Deshpande, M. J. Franklin, J. M. Hellerstein, W. Hong, S. Krishnamurthy,
S. R. Madden, F. Reiss, and M. A. Shah. TelegraphCQ: Continuous dataflow processing. In Proceedings of the
ACM SIGMOD Conference on Management of Data, 2003.
9. J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: a scalable continuous query system for internet
databases. In Proceedings of the ACM SIGMOD Conference on Management of Data, 2000.
10. V. Chowdhary and H. Gupta. Communication-efficient implementation of join in sensor networks. Technical
report, SUNY, Stony Brook, Computer Science Department, 2004.
11. A. Das, J. Gehrke, and M. Riedewald. Approximate join processing over data streams. In Proceedings of the ACM
SIGMOD Conference on Management of Data, 2003.
12. L. Ding, N. Mehta, E. Rundensteiner, and G. Heineman. Joining punctuated streams. In Proceedings of the
International Conference on Extending Database Technology, 2004.
13. D. Estrin, R. Govindan, and J. Heidemann, editors. Special Issue on Embedding the Internet, Communications of
the ACM, volume 43, 2000.
14. D. Estrin, R. Govindan, J. S. Heidemann, and S. Kumar. Next century challenges: Scalable coordination in
sensor networks. In Proceedings of the International Conference on Mobile Computing and Networking (MobiCom), 1999.
15. I. Gelfand and S. Fomin. Calculus of Variations. Dover Publications, 2000.
16. R. Govindan, J. Hellerstein, W. Hong, S. Madden, M. Franklin, and S. Shenker. The sensor network as a
database. Technical report, University of Southern California, Computer Science Department, 2002.
17. M. Hammad, W. Aref, A. Catlin, M. Elfeky, and A. Elmagarmid. A stream database server for sensor applications. Technical report, Purdue University, Department of Computer Science, 2002.
18. J. S. Heidemann, F. Silva, C. Intanagonwiwat, R. Govindan, D. Estrin, and D. Ganesan. Building efficient
wireless sensor networks with low-level naming. In Symposium on Operating Systems Principles, 2001.
19. B. Karp and H. Kung. GPSR: Greedy perimeter stateless routing for wireless networks. In Proceedings of the
International Conference on Mobile Computing and Networking (MobiCom), 2000.
20. D. Kossmann. The state of the art in distributed query processing. ACM Computing Surveys, 32(4), 2000.
21. S. Madden and M. Franklin. Fjording the stream: An architecture for queries over streaming sensor data. In
Proceedings of the International Conference on Data Engineering (ICDE), 2002.
22. S. Madden, M. Franklin, J. Hellerstein, and W. Hong. TAG: A tiny aggregation service for ad-hoc sensor
networks. In Proceedings of the Symposium on Operating Systems Design and Implementation (OSDI), 2002.
23. S. Madden and J. M. Hellerstein. Distributing queries over low-power wireless sensor networks. In Proceedings
of the ACM SIGMOD Conference on Management of Data, 2002.
24. S. Madden, R. Szewczyk, M. Franklin, and D. Culler. Supporting aggregate queries over ad-hoc wireless sensor
networks. In Workshop on Mobile Computing and Systems Applications, 2002.
25. S. R. Madden, M. J. Franklin, J. M. Hellerstein, and W. Hong. The design of an acquisitional query processor
for sensor networks. In Proceedings of the ACM SIGMOD Conference on Management of Data, 2003.
26. S. R. Madden, J. M. Hellerstein, and W. Hong. TinyDB: In-network query processing in TinyOS.
http://telegraph.cs.berkeley.edu/tinydb, Sept. 2003.
27. R. Motwani, J. Widom, A. Arasu, B. Babcock, S. Babu, M. Datar, G. Manku, C. Olston, J. Rosenstein, and
R. Varma. Query processing, approximation, and resource management in a data stream management system.
In Proceedings of the International Conference on Innovative Data Systems Research (CIDR), 2003.
28. B. Nath and D. Niculescu. Routing on a curve. In Proceedings of the Workshop on Hot Topics in Networks, 2002.
29. M. T. Ozsu and P. Valduriez. Principles of Distributed Database Systems. Prentice Hall, 1999.
30. J. Pan, Y. T. Hou, L. Cai, Y. Shi, and S. X. Shen. Topology control for wireless sensor networks. In Proceedings
of the International Conference on Mobile Computing and Networking (MobiCom), 2003.
31. G. Pottie and W. Kaiser. Wireless integrated sensor networks. Communications of the ACM, 43, 2000.
32. R. Ramanathan and R. Rosales-Hain. Topology control in multihop wireless networks using transmit power
adjustment. In Proceedings of the IEEE INFOCOM, 2000.
33. Y. Yao and J. Gehrke. The Cougar approach to in-network query processing in sensor networks. In SIGMOD
Record, 2002.
34. Y. Yao and J. Gehrke. Query processing for sensor networks. In Proceedings of the International Conference on
Innovative Data Systems Research (CIDR), 2003.