

International Journal of Database Theory and Application
Vol.10, No.8 (2017), pp.43-56
http://dx.doi.org/10.14257/ijdta.2017.10.8.05

Fragment Allocation and Replication in Distributed Databases

Ali Amiri
Department of MSIS
Spears School of Business
Oklahoma State University
Stillwater, OK, 74078, USA
amiri@okstate.edu

Abstract
We study the problem of designing a distributed database system. We develop
optimization models for the problem that deals simultaneously with two major design
issues, namely which fragments to replicate, and where to store those fragments and
replicas. Given the difficulty of the problem, we propose a solution algorithm based on a
new formulation of the problem in which every server is allocated a fragment
combination from a set of combinations generated by a randomized greedy heuristic. The
results of a computational study show that the algorithm outperforms a standard branch
& bound technique for large instances of the problem.

Keywords: Distributed Database; Fragment Allocation; Fragment Replication; Optimization

1. Introduction
A distributed database is a single logical database that is spread physically across
computers in multiple locations that are connected by a data communications network.
The adoption of distributed databases is driven by various business conditions such as
the global nature of business operations and transactions, the distribution and autonomy of
business units, and telecommunications costs and reliability [7]. A distributed database
offers several potential benefits over a centralized database, such as improved availability,
reliability, and performance and lower costs. For those benefits to be realized, a distributed
database should be properly designed with respect to three important issues: how to
partition the database into fragments, which fragments to replicate, and where to store
those fragments and replicas [1,11].
As stated in [12], logical database design deals with determining the contents of a
database independently of physical implementation considerations. The design process is
typically guided by a statement of user requirements and needs in the form of a
set of user views [12]. Each view defines the data contents and transactions which are
required by a specific job or function that a user (or group of users) performs.
Each view requires the identification of the database fragments that are needed to process
retrieval and update transactions of the corresponding user view. The task of fragmenting
the database effectively is a difficult one in itself. However, a variety of approaches exist
for fragmenting databases [2,4], and this paper assumes that the fragments have already
been identified based on the definition of the user views.
The response to a query in a distributed database environment may require assembling
data from several different sites (although with location transparency, the user is unaware
of this need). The volume of data transmitted over the communication network depends
heavily on the type of the query and the locations of the fragments and their replicas
involved in the query. A retrieval query uses only one copy of each fragment in the
query; whereas an update query uses every copy of each fragment in the query to maintain
up-to-date data. Besides the way a query is formulated by a user view, query optimization
by the database management system affects the query execution efficiency, especially
when the query requires fragments located in different sites. In this paper, we design the
distributed database in such a way that all the fragments required by a user view are
located in the same site. This approach simplifies query optimization and improves
response time to users, especially when processing retrieval queries. As this approach
allows replication of fragments, an update query still requires updating every copy of the
fragments in the queries. While the nonredundant fragment allocation scheme (i.e., storing
one copy of each fragment in the database) makes the problem less difficult to solve, it is
a very restrictive perspective. Under that scheme, the strategy is first to find a
nonredundant solution by ignoring replication and second to apply a heuristic to that
solution to decide how to replicate fragments [6,8]. The quality of the final solution
obtained using this non-integrated strategy is inferior to that of the integrated
approach (which considers fragment replication and allocation simultaneously) that
we adopt in this paper [8].
Many studies have considered various aspects of the distributed database design
problem in a variety of contexts. A comprehensive review of these studies can be found
in [9]. Here, we consider studies that are most related to the topic of our paper. Menon [7]
presented new integer programming formulations for the nonredundant version of the
fragment allocation problem (where exactly one copy of each fragment exists across all
sites). These formulations are solved using the commercial integer programming solver
CPLEX [5]. Menon [7] reported computational test results which show that his
formulations are more effective than prior formulations using up to 200 fragments and 10
servers.
Hababeh et al. [3] developed a method for grouping the distributed sites into clusters
and customizing the database fragments allocation to the clusters and their sites. The
method proceeds in three steps: (i) it groups the computing sites into disjoint clusters, (ii)
it allocates the fragments to the clusters, and (iii) it allocates the fragments within each
cluster to its sites independently of the other clusters and sites. Hababeh et al. [3]
reported results of computational tests using a database system with 6 sites and 8
fragments.
Sen et al. [10] studied a general version of the problem where (i) the files/segments
have to be clustered and the clusters need to be assigned to a given number of servers
whose locations are to be chosen from among a pre-determined set of potential locations,
and (ii) query traffic between the users and the servers is routed over a fully connected
backbone network. This general problem is solved indirectly by solving first the data
partitioning problem and then the segment allocation problem.
In this paper, we present an optimization model for the distributed database design
problem (DDDP) that deals simultaneously with two major design issues, namely which
fragments to replicate, and where to store those fragments and replicas. Again, we assume
that the database fragments have already been determined based on the definition of the
user views. We develop a mathematical programming model of the DDDP which allows a
user view to be assigned to a server only if that server is allocated all the fragments
required by the user view. In particular, the model considers the objective of minimizing
total system cost which is composed of three components: (i) the communication and
processing costs of retrieval transactions of the user views, (ii) the communication and
processing costs of update transactions of the user views, and (iii) the cost of storing and
maintaining the fragments and their replicas on the servers. The decisions to make are (i)
how to assign user views to the servers to process their transactions, and (ii) how to
replicate the fragments and allocate them to the servers. A solution to the problem should
satisfy the following requirements: (i) each user view is assigned to one server to process
its transactions, (ii) a user view can be assigned to a server only if all the fragments
needed by the user view transactions are stored on that server, and (iii) the processing
capacities of the servers are not exceeded.
Given the difficulty of the distributed database design problem, we propose a solution
algorithm based on a new formulation of the problem where every server is allocated a
fragment combination from among all combinations or from a well-generated set of
combinations. A combination consists of a group of fragments to be allocated to a server.
We conducted a computational study to assess the effectiveness of the solution algorithm. The
results show that the proposed algorithm outperforms a standard branch & bound
technique based on the first formulation of the problem for large instances of the problem.
The remainder of the paper is organized as follows. Section 2 presents two integer
programming formulations of the problem. Two solution procedures are described in
Section 3. Computational results are reported in Section 4 while Section 5 concludes.

2. Problem Formulations
Define $N$ as the set of user views that need to be assigned to the set $M$ of servers/sites.
Define $F$ as the set of fragments that need to be allocated to the servers, $F'_i$ as the set of
fragments needed by user view $i \in N$ in retrieval transactions, and $F''_i$ as the set of
fragments needed by user view $i \in N$ in update transactions. Define the parameters as
follows: $d'_i$ is the processing demand of the retrieval transactions of user view $i \in N$;
$d''_{ik}$ is the processing demand of the update transactions of user view $i \in N$ for fragment
$k \in F$; $C'_{ij}$ is the communication and processing cost of the retrieval transactions of user
view $i \in N$ when it is assigned to server $j \in M$; $C''_{ikj}$ is the communication and processing
cost of the update transactions of user view $i \in N$ for fragment $k \in F$ when it is stored at
server $j \in M$; $C_{kj}$ is the cost of storing and maintaining fragment $k \in F$ in server $j \in M$;
and $Q_j$ is the processing capacity of server $j \in M$. Define the decision variables as

$$X_{ij} = \begin{cases} 1 & \text{if user view } i \text{ is assigned to server } j \in M \\ 0 & \text{otherwise} \end{cases}$$

$$Y_{kj} = \begin{cases} 1 & \text{if fragment } k \in F \text{ is stored in server } j \in M \\ 0 & \text{otherwise} \end{cases}$$
Given these definitions, the distributed database design problem (DDDP) can be
formulated as model DDDP1 below.

DDDP1:

$$\min \;\; \sum_{i \in N} \sum_{j \in M} C'_{ij} X_{ij} \;+\; \sum_{i \in N} \sum_{k \in F''_i} \sum_{j \in M} C''_{ikj} Y_{kj} \;+\; \sum_{j \in M} \sum_{k \in F} C_{kj} Y_{kj} \qquad (1)$$

s.t.

$$\sum_{j \in M} X_{ij} = 1 \qquad \forall i \in N \qquad (2)$$

$$X_{ij} \le Y_{kj} \qquad \forall i \in N,\; k \in F'_i \cup F''_i,\; j \in M \qquad (3)$$

$$\sum_{i \in N} d'_i X_{ij} + \sum_{i \in N} \sum_{k \in F''_i} d''_{ik} Y_{kj} \le Q_j \qquad \forall j \in M \qquad (4)$$

$$X_{ij},\, Y_{kj} \in \{0,1\} \qquad \forall i \in N,\; k \in F,\; j \in M \qquad (5)$$

The goal is to minimize the total communication and processing cost of user view
retrieval and update transactions and the cost of setting up and maintaining fragments on the
servers. Constraints (2) ensure that each user view is assigned to one server to handle its
retrieval and update transactions. Constraints (3) ensure that if a user view is assigned to
a server, then every fragment required by the user view is stored on that server.
Constraints (4) limit the total processing demand of the user views assigned to a server to
the capacity of the server. The first and second terms on the left-hand side of constraints
(4) represent the processing requirements of retrieval and update transactions,
respectively. Constraints (5) restrict the decision variables to be binary.
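To make the formulation concrete, here is a minimal sketch of DDDP1 built with the open-source PuLP modeler and its bundled CBC solver (the paper itself solves the model with CPLEX); the toy data for three user views, two fragments, and two servers are invented purely for illustration.

```python
# Minimal sketch of formulation DDDP1 using PuLP and its bundled CBC solver
# (the paper uses CPLEX).  The toy data below -- three user views, two
# fragments, two servers -- are invented for illustration only.
import pulp

N = [0, 1, 2]                                  # user views
M = [0, 1]                                     # servers
F = [0, 1]                                     # fragments
Fr = {0: {0}, 1: {1}, 2: {0, 1}}               # F'_i : fragments read by view i
Fu = {0: {0}, 1: set(), 2: {1}}                # F''_i: fragments updated by view i
d_r = {0: 30, 1: 20, 2: 50}                    # retrieval demand d'_i
d_u = {(0, 0): 10, (2, 1): 15}                 # update demand d''_ik
C_r = {(i, j): 10 * (i + j + 1) for i in N for j in M}          # C'_ij
C_u = {(i, k, j): 5 for i in N for k in Fu[i] for j in M}       # C''_ikj
C_s = {(k, j): 40 for k in F for j in M}                        # storage cost C_kj
Q = {0: 100, 1: 100}                           # server capacities Q_j

prob = pulp.LpProblem("DDDP1", pulp.LpMinimize)
X = pulp.LpVariable.dicts("X", (N, M), cat="Binary")   # view i assigned to server j
Y = pulp.LpVariable.dicts("Y", (F, M), cat="Binary")   # fragment k stored on server j

# Objective (1): retrieval cost + update cost + storage/maintenance cost
prob += (pulp.lpSum(C_r[i, j] * X[i][j] for i in N for j in M)
         + pulp.lpSum(C_u[i, k, j] * Y[k][j] for i in N for k in Fu[i] for j in M)
         + pulp.lpSum(C_s[k, j] * Y[k][j] for j in M for k in F))

for i in N:                                    # (2) each view goes to exactly one server
    prob += pulp.lpSum(X[i][j] for j in M) == 1
for i in N:                                    # (3) that server must hold every fragment
    for k in Fr[i] | Fu[i]:                    #     the view reads or updates
        for j in M:
            prob += X[i][j] <= Y[k][j]
for j in M:                                    # (4) server capacity
    prob += (pulp.lpSum(d_r[i] * X[i][j] for i in N)
             + pulp.lpSum(d_u[i, k] * Y[k][j] for i in N for k in Fu[i])) <= Q[j]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.LpStatus[prob.status], pulp.value(prob.objective))
```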
Since the distributed database design problem (DDDP1) is NP-complete, it is
extremely difficult to solve it optimally in an acceptable computing time. This difficulty is
the direct result of the fact that both the allocation of fragments to the servers and the
assignment of user views to the servers have to be handled concurrently. We conducted
computational tests that show that the linear programming relaxation of (DDDP1)
produces non-integer values of the decision variables and, hence, a standard branch &
bound technique using formulation (DDDP1) is not likely to generate an optimal solution
in reasonable time. We next develop a second formulation of the distributed database
design problem that will be used in solving the problem more effectively.
A new formulation for the distributed database design problem can be obtained by
viewing it as a covering problem. This formulation stems from the observation that in
any feasible solution to the problem, the user views have to be covered by the servers they
are assigned to, with a user view being allocated to one server that maintains all the
fragments the user view needs in its retrieval and update transactions. As a result, we can
solve the problem using all combinations of fragments and assign the “best” combinations
to the servers for which the total cost is minimized. Every combination of fragments is a
valid configuration that can be allocated to a server.
Using the above insight, the distributed database design problem can now be
reformulated in the following way. Define G as the set of all combinations, where a
combination is simply a subset of fragments that can be assigned to a server. Define
parameter 𝐺𝑖 to be the set of combinations which contain all fragments used in processing
transactions of user view 𝑖 ∈ 𝑁. Define parameter 𝑎𝑘𝑔 to be 1 if fragment k is included in
combination g. Define a binary variable 𝑊𝑗𝑔 to be 1 if server j is allocated fragment
combination g and 0 otherwise. Each server needs to be assigned exactly one combination; the
empty combination is included to allow for the possibility that a server is not used. This can be
stated mathematically as

$$\sum_{g \in G} W_{jg} = 1 \qquad \forall j \in M$$

Each user view should be assigned to one server whose configuration should include
all fragments that the user view needs. Mathematically, this is equivalent to saying

$$X_{ij} \le \sum_{g \in G_i} W_{jg} \qquad \forall i \in N,\; j \in M$$

The distributed database design problem can now be reformulated as DDDP2.


DDDP2:

$$\min \;\; \sum_{i \in N} \sum_{j \in M} C'_{ij} X_{ij} \;+\; \sum_{i \in N} \sum_{j \in M} \sum_{g \in G} \sum_{k \in F''_i :\, a_{kg} = 1} C''_{ikj} W_{jg} \;+\; \sum_{j \in M} \sum_{g \in G} \sum_{k \in F :\, a_{kg} = 1} C_{kj} W_{jg} \qquad (6)$$

s.t.

$$\sum_{j \in M} X_{ij} = 1 \qquad \forall i \in N \qquad (7)$$

$$X_{ij} \le \sum_{g \in G_i} W_{jg} \qquad \forall j \in M,\; i \in N \qquad (8)$$

$$\sum_{g \in G} W_{jg} = 1 \qquad \forall j \in M \qquad (9)$$

$$\sum_{i \in N} d'_i X_{ij} + \sum_{i \in N} \sum_{g \in G} \sum_{k \in F''_i :\, a_{kg} = 1} d''_{ik} W_{jg} \le Q_j \qquad \forall j \in M \qquad (10)$$

$$X_{ij},\, W_{jg} \in \{0,1\} \qquad \forall j \in M,\; i \in N,\; g \in G \qquad (11)$$


The objective (6) minimizes the overall cost, which is made up of the same three components as in (1).
Constraints (7) serve the same purpose as constraints (2). Constraints (8) ensure that a
user view can be allocated to a server only if the server is assigned a configuration that
has all the fragments needed by the user view. Constraints (9) state that each server is
allocated a fragment combination. Constraints (10) serve the same purpose as constraints
(4). Constraints (11) represent the binary restrictions on the variables.
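To make the covering view concrete, the short sketch below (ours, not from the paper) enumerates the full combination set G, the coefficients a_kg, and the sets G_i for a toy instance; it also makes plain why full enumeration is only workable for very small fragment sets, since there are 2^|F| combinations.

```python
# Illustrative sketch (not from the paper): enumerating the combination set G,
# the coefficients a_kg, and the sets G_i for a toy instance.  With |F| fragments
# there are 2^|F| combinations, which is why Section 3 generates only a subset.
from itertools import combinations

F = [0, 1, 2, 3]                               # fragments
views_need = {0: {0, 1}, 1: {2}, 2: {1, 3}}    # F'_i U F''_i for each user view (toy data)

# G: every subset of F, including the empty combination (an unused server)
G = [frozenset(c) for r in range(len(F) + 1) for c in combinations(F, r)]

# a_kg = 1 if fragment k belongs to combination g, 0 otherwise
a = {(k, g): int(k in g) for k in F for g in G}

# G_i: combinations that contain every fragment user view i needs
G_i = {i: [g for g in G if need <= g] for i, need in views_need.items()}

print(len(G))        # 16 combinations for 4 fragments
print(len(G_i[0]))   # 4 combinations contain both fragments needed by view 0
```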
The following example illustrates the definition of the problem. There are 20 user
views, 20 fragments, and 5 servers. The parameters for the retrieval and update
transactions are given in Table 1. All servers have the same capacity of 3663. The storage and
maintenance costs for the 20 fragments are {371, 617, 414, 313, 73, 936, 464, 220, 412,
924, 578, 831, 707, 540, 461, 477, 326, 558, 227, 60} for all servers. One possible
solution has a total cost of 61706 and consists of assigning user views {6, 7, 9, 10, 12, 20}
to server 1, user views {11, 16, 18} to server 2, user views {4, 14, 15, 17} to server 3,
user views {2, 3, 5, 19} to server 4, and user views {1, 8, 13} to server 5, and allocating
fragments {1, 2, 3, 5, 6, 7, 8, 9, 11, 13, 14, 15, 18, 19, 20} to server 1, fragments {1, 2, 3,
4, 5, 6, 7, 8, 10, 11, 12, 15, 16, 17, 18, 19, 20} to server 2, fragments {1, 5, 6, 7, 8, 10, 11,
12, 13, 14, 15, 16, 17, 18, 19, 20} to server 3, fragments {1, 2, 4, 8, 9, 10, 11, 13, 14, 17,
19, 20} to server 4, and fragments {1, 3, 6, 8, 10, 13, 16, 19} to server 5. The utilizations
of the five servers are 96%, 94%, 98%, 55%, and 39%, respectively.


3. CONFIG: A Solution Procedure based on Formulation (DDDP2)


While formulation (DDDP2) is equivalent to formulation (DDDP1), it hides the fact that
there is an exponential number of configurations, and enumerating all configurations is
not a reasonable undertaking. Therefore, we describe a randomized greedy algorithm
(RGA) to solve the distributed database design problem and, as a by-product, to identify
good configurations which can be fed into formulation (DDDP2) to solve the problem. A
simple deterministic greedy algorithm involves sequentially selecting a pair of user view
and server and allocating the user view to the server. The criterion used in the selection is
the net increase in overall cost per unit of demand of the user view. For every user view,
we identify the least expensive server to assign the user view to, taking into account the
fragments that are already allocated to the server. From all pairs of user views and servers,
we choose the “best” pair of user view and server that produces the least increase in
overall cost per unit of demand of the user view. The heuristic finishes when we assign all
user views to the servers. To escape a possible local optimum, we randomize the search
by perturbing the costs as follows. If we let $\Delta_{ij}$ denote the increase in total cost if user
view $i$ is allocated to server $j$, then we use $\tilde{\Delta}_{ij} = (1 + \varepsilon_{ij}) \Delta_{ij}$ in selecting the next pair of user view
and server, where $\varepsilon_{ij}$ is a random number drawn from the uniform distribution $U[-0.05, 0.05]$.
An outline of the algorithm RGA is described below.
The following notation is used in describing the algorithm.
$C$: overall cost, initially $C = 0$
$\bar{N}$: set of unassigned user views, initially $\bar{N} = N$
$\bar{\bar{N}}$: set of processed user views (i.e., user views already assigned to servers), initially $\bar{\bar{N}} = \emptyset$
$F_j$: set of fragments allocated to server $j \in M$, initially $F_j = \emptyset$
$\bar{Q}_j$: remaining capacity of server $j \in M$, initially $\bar{Q}_j = Q_j$

Randomized Greedy Algorithm (RGA):
Let $B^*$ be the best feasible solution found so far.
Repeat 500 times
  Step 1: Initialize
    $C = 0$, $\bar{N} = N$, $\bar{\bar{N}} = \emptyset$, $F_j = \emptyset$, $\bar{Q}_j = Q_j$ for all $j \in M$; $B$ is the solution to be constructed.
  Step 2: User view selection and allocation
    While $\bar{N} \neq \emptyset$ do
      For every pair $(i, j) \in \bar{N} \times M$, compute $\tilde{\Delta}_{ij} = (1 + \varepsilon_{ij}) \Delta_{ij}$; the remaining capacity of
      server $j$ (i.e., $\bar{Q}_j$) must be large enough to handle the additional retrieval and update demand.
      Select the pair $(i^*, j^*)$ with the smallest $\tilde{\Delta}_{ij}$. Assign user view $i^*$ to server $j^*$.
      Update the current solution $B$, the cost $C$, $\bar{N} = \bar{N} \setminus \{i^*\}$, $\bar{\bar{N}} = \bar{\bar{N}} \cup \{i^*\}$,
      $F_{j^*} = F_{j^*} \cup F'_{i^*} \cup F''_{i^*}$, and $\bar{Q}_{j^*}$.
    End While
    If solution $B$ has a better objective function value than solution $B^*$, then set $B^* = B$.
End Repeat
Retain $B^*$ as the best feasible solution and stop.
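The outline above can be rendered in Python roughly as follows; this is a sketch under our own assumptions, with the cost increase Δ_ij supplied by a hypothetical helper cost_increase and the capacity check simplified relative to constraint (4).

```python
# Sketch of the randomized greedy algorithm (RGA).  cost_increase(i, j, frags_on_j)
# is a hypothetical helper returning Delta_ij, the net increase in total cost of
# assigning user view i to server j given the fragments already stored on j.
import math
import random

def rga(views, servers, frag_read, frag_upd, d_r, d_u, Q, cost_increase,
        iterations=500, eps=0.05):
    best_cost, best_sol = math.inf, None
    for _ in range(iterations):
        unassigned = set(views)                    # N-bar
        frags = {j: set() for j in servers}        # F_j, fragments placed on server j
        cap = dict(Q)                              # Q-bar_j, remaining capacity
        assign, total, feasible = {}, 0.0, True
        while unassigned:
            best_pair, best_noisy, best_true = None, math.inf, 0.0
            for i in unassigned:
                extra = d_r[i] + sum(d_u[i, k] for k in frag_upd[i])
                for j in servers:
                    if cap[j] < extra:             # simplified capacity check; the paper
                        continue                   # also counts update load of other views
                    delta = cost_increase(i, j, frags[j])
                    noisy = (1 + random.uniform(-eps, eps)) * delta   # (1 + eps_ij) * Delta_ij
                    if noisy < best_noisy:
                        best_pair, best_noisy, best_true = (i, j), noisy, delta
            if best_pair is None:                  # no feasible pair left: restart
                feasible = False
                break
            i, j = best_pair
            assign[i] = j                          # assign view i* to server j*
            total += best_true
            unassigned.remove(i)
            frags[j] |= set(frag_read[i]) | set(frag_upd[i])
            cap[j] -= d_r[i] + sum(d_u[i, k] for k in frag_upd[i])
        if feasible and total < best_cost:         # keep the best solution B*
            best_cost, best_sol = total, (assign, frags)
    return best_cost, best_sol
```

In the paper's setting, cost_increase would presumably account for the retrieval cost C'_ij plus the update and storage costs of any fragments in F'_i ∪ F''_i not yet stored on server j; it is left abstract here.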
During the execution of the algorithm RGA, many distinct server configurations are
produced; each configuration identifies a set of fragments to store on a server. Such
configurations can be fed into formulation (DDDP2) which can be solved with a standard
branch & bound algorithm to produce theoretically the best feasible solution to the
original problem using those configurations. We refer to this procedure as algorithm
CONFIG. Given that the subset of configurations employed does not necessarily contain
the complete set of best configurations, the optimal solution to the distributed database
design problem (DDDP1) is not ensured. In order to expedite the branch and bound
algorithm, we limit the number of configurations used in formulation (DDDP2) to the 200
that are part of the "best" solutions generated during the application of RGA. The reason
for setting up this limit is that we observed that when we solve model (DDDP2) using all
the distinct configurations generated by RGA the solution quality deteriorates significantly
because the solver has to spend a lot of time evaluating all those configurations rather
than concentrating on the most promising ones.
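The configuration-harvesting step of CONFIG might be sketched as follows (our illustration, not the paper's code): configurations are collected from the RGA solutions in order of solution cost, de-duplicated, and capped at 200 before being passed to formulation (DDDP2).

```python
# Illustrative sketch of CONFIG's configuration-selection step: harvest the server
# configurations appearing in RGA solutions, starting from the cheapest solutions,
# de-duplicate them, and keep at most `limit` of them for formulation (DDDP2).
def select_configurations(rga_solutions, limit=200):
    """rga_solutions: list of (total_cost, {server: set_of_fragments}) tuples."""
    chosen, seen = [], set()
    for cost, frags_by_server in sorted(rga_solutions, key=lambda s: s[0]):
        for config in map(frozenset, frags_by_server.values()):
            if config not in seen and len(chosen) < limit:
                seen.add(config)
                chosen.append(config)
    # the empty combination is always admitted so that a server may be left unused
    if frozenset() not in seen:
        chosen.append(frozenset())
    return chosen
```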
We illustrate the performance of the two formulations using the small example
described earlier. The solution produced using formulation (DDDP1) is optimal with a
total cost of 51908 and consists of assigning user views {6, 8, 9, 12, 15, 20} to server 1,
user views {7, 11} to server 2, user views {2, 3, 5, 10, 13, 14, 16} to server 3, and user
views {1, 4, 17, 18, 19} to server 5, and allocating fragments {1, 2, 3, 5, 6, 11, 13, 14,
15, 16, 18, 19, 20} to server 1, fragments {1, 3, 5, 7, 9, 15, 20} to server 2, fragments {1,
2, 3, 6, 7, 8, 10, 13, 14, 16, 17, 18, 19, 20} to server 3, and fragments {1, 4, 5, 6, 7, 8, 9,
10, 11, 12, 14, 17, 19, 20} to server 5. The utilizations of the five servers are 99%, 29%,
98%, 0%, and 97%, respectively. The solution produced using algorithm CONFIG has a
total cost of 52358, corresponding to an optimality gap of only 0.87%, and it consists of
assigning user views {6, 8, 9, 12, 15, 20} to server 1, user views {7, 11} to server 2, user
views {1, 3, 5, 10, 13, 14, 16} to server 3, and user views {2, 4, 17, 18, 19} to server 5,
and allocating fragments {1, 2, 3, 5, 6, 11, 13, 14, 15, 16, 18, 19, 20} to server 1,
fragments {1, 3, 5, 7, 9, 15, 20} to server 2, fragments {1, 2, 3, 6, 7, 8, 10, 13, 14, 16, 17,
18, 19} to server 3, and fragments {1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 17, 19, 20} to
server 5. The utilizations of the five servers are 99%, 29%, 95%, 0%, and 99%,
respectively.

4. Computational Study and Results


We conducted a computational study to assess the effectiveness of the proposed algorithm in
solving the distributed database design problem (DDDP). We coded the algorithm in
Visual Studio .NET and ran it on an Intel Core i7 at 3.6 GHz with 8 GB of RAM. We describe
here the process of generating the test problems and present and discuss the results.

4.1. Test Problems


We explore the effect of various parameters of the problem, namely number of user
views, number of fragments, and number of servers on the effectiveness of the solution
procedures using the test problems. We varied the number of user views between 60 and
180, the number of fragments between 10 and 30, and the number of servers between 5
and 25. Ten instances in each category with similar structure were created randomly and
solved so as to assess properly the effectiveness of the solution methods on that problem
structure. We used in each problem instance three categories of user views of equal sizes:
small (S), medium (M), and large (L). Each small user view $i \in N$ requires a number of
fragments $n_i$ drawn uniformly from $[0.1|F|, 0.2|F|]$, where $|F|$ is the total number of
fragments in the database, and its retrieval demand $d'_i$ is uniformly distributed over
$[20 n_i, 40 n_i]$. Each medium user view $i \in N$ requires a number of fragments $n_i$ drawn
uniformly from $[0.2|F|, 0.3|F|]$, and its retrieval demand $d'_i$ is uniformly distributed over
$[50 n_i, 70 n_i]$. Each large user view $i \in N$ requires a number of fragments $n_i$ drawn
uniformly from $[0.3|F|, 0.5|F|]$, and its retrieval demand $d'_i$ is uniformly distributed over
$[80 n_i, 100 n_i]$. For each user view $i \in N$, the number of fragments that need to be
updated is $0.5 n_i$, and the update processing demand $d''_{ik}$ is drawn uniformly from $[10, 20]$
if the user view is small, from $[25, 35]$ if the user view is medium, and from $[40, 50]$ if the
user view is large. The capacity of each server is set to $\lfloor 1.5 \cdot \mathit{OverallDem} / |M| \rfloor$, where
$\mathit{OverallDem}$ is the total retrieval and update demand of all user views. The
communication and processing cost $C'_{ij}$ of the retrieval transactions of user view $i \in N$
when it is assigned to server $j \in M$ is set equal to $d'_i \cdot c'_i$, where $c'_i$ is a uniform random
number in $[2, 6]$; the values of $C''_{ikj}$ are set equal to $d''_{ik} \cdot c''_{ik}$, where $c''_{ik}$ is a uniform
random number in $[1, 2]$; and the values of $C_{kj}$ were generated from the uniform
distribution $[50, 1000]$.
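For concreteness, the generation scheme just described can be sketched as follows; the numeric ranges follow the text, while the function and variable names are ours and purely illustrative.

```python
# Sketch of the test-problem generator described above; names are ours, the
# numeric ranges follow the text (S/M/L user views, capacity 1.5 * demand / |M|, ...).
import math
import random

def generate_instance(n_views, n_frags, n_servers, seed=0):
    rng = random.Random(seed)
    ranges = {                 # (fragment share, retrieval rate per fragment, update demand)
        "S": ((0.1, 0.2), (20, 40), (10, 20)),
        "M": ((0.2, 0.3), (50, 70), (25, 35)),
        "L": ((0.3, 0.5), (80, 100), (40, 50)),
    }
    views = []
    for v in range(n_views):
        cat = "SML"[v % 3]                                   # three equal-size categories
        (flo, fhi), (rlo, rhi), (ulo, uhi) = ranges[cat]
        n_i = max(1, round(rng.uniform(flo, fhi) * n_frags))
        frags = rng.sample(range(n_frags), n_i)              # fragments used by the view
        d_ret = rng.uniform(rlo * n_i, rhi * n_i)            # retrieval demand d'_i
        upd = frags[: max(1, round(0.5 * n_i))]              # half the fragments are updated
        d_upd = {k: rng.uniform(ulo, uhi) for k in upd}      # update demand d''_ik
        c_ret = rng.uniform(2, 6)                            # c'_i, as stated in the text
        c_upd = {k: rng.uniform(1, 2) for k in upd}          # c''_ik
        views.append({
            "frags": frags, "d_ret": d_ret, "d_upd": d_upd,
            "C_ret": {j: d_ret * c_ret for j in range(n_servers)},           # C'_ij
            "C_upd": {(k, j): d_upd[k] * c_upd[k] for k in upd
                      for j in range(n_servers)},                            # C''_ikj
        })
    overall_dem = sum(v["d_ret"] + sum(v["d_upd"].values()) for v in views)
    Q = math.floor(1.5 * overall_dem / n_servers)            # identical server capacities
    C_store = {(k, j): rng.uniform(50, 1000)                 # storage/maintenance cost C_kj
               for k in range(n_frags) for j in range(n_servers)}
    return views, Q, C_store
```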

4.2. Results and Analysis


The results shown in Tables 2-5 are the averages over the ten instances in each category.
In Tables 2, 3 and 5, “Iteration Count” represents the number of simplex iterations, “Node
Count” represents the number of nodes in the branch & bound tree, and “Gap” is the
optimality gap between the best feasible solutions and lower bounds. Table 2 reports the
results from solving the distributed database problem with formulation (DDDP1) by
running CPLEX with its default settings for two hours. As shown in the table, the
difficulty of the problem increases in general when its size increases as indicated by the
higher optimality gaps. For instance, when the number of servers is 5, the optimal
solutions were obtained for all test problems, and when the number of servers is 25, the
average gap jumped to 53%. This confirms that it is difficult to solve the problem
optimally, or even approximately, using the original formulation (DDDP1).

Table 3 reports the results of the tests to evaluate the effectiveness of solving the
problem using algorithm CONFIG which incorporates the set of fragment combinations
produced by the heuristic RGA into formulation DDDP2. Each combination defines a
subset of fragments that can be allocated to a server. The average number of such
combinations used in formulation DDDP2 is 38.1. The distributed database design
problem (DDDP2) is computationally less difficult than the problem with formulation
(DDDP1). The average optimality gap for the solutions obtained using CONFIG is only
2.83%. However, it is worth noting that the problem with formulation (DDDP2) is still
NP-complete, and hence its complexity can grow drastically with problem size. Even
though a small number of configurations are used in formulation (DDDP2), the
effectiveness of this formulation is on average better than that of formulation (DDDP1).
Indeed, the solutions produced using CONFIG are on average 2.20% cheaper than the
solutions produced using formulation (DDDP1). Algorithm CONFIG based on
formulation (DDDP2) increasingly outperforms formulation (DDDP1) as the number of servers
increases. For instance, when there are 5 servers, both CONFIG and CPLEX applied to
(DDDP1) obtained optimal solutions; however, when there are 25 servers, the costs of the
solutions produced using CONFIG are on average 6.56% lower than those produced using
formulation (DDDP1).

Table 4 sheds some light on the characteristics of solutions obtained using CONFIG
such as allocation of cost among its three components, server capacity utilization, number
of fragments stored in the servers, and number of user views allocated to the servers.
When the number of servers increases, the contributions of the update and fragment
components increase at the expense of the retrieval component. For example, for the test
problems with 180 user views, when the number of servers is 5, the shares of the retrieval,
update, and fragment components are 79.9%, 12.3%, and 7.8%, respectively; whereas
those numbers become 35.9%, 39.3%, and 24.8% when the number of servers is 25. In
addition, when the number of servers increases, server capacity utilization increases, and
the number of user views per server decreases; but, the number of fragments per server
does not change significantly.

Table 5 shows the computational results on the performance of algorithm CONFIG as a
result of varying the number of fragments in the database. The table also shows that
CONFIG produced solutions which have 2.71% lower costs than those produced using
formulation (DDDP1). This advantage seems to grow as the number of
fragments increases. For instance, when there are 10 fragments in the database, solutions
produced using CONFIG have on average 1.32% lower costs than solutions produced
using formulation (DDDP1). However, when the number of fragments increases to 30,
solutions produced using CONFIG have on average 4.86% lower costs than those produced
using formulation (DDDP1).


5. Conclusion
A distributed database should be properly designed to benefit from the potential
advantages it offers. Effective distribution of the database fragments plays a crucial role
in the functioning of the database, affecting both cost and performance. We developed
optimization models for the problem that deals simultaneously with two major design
issues, namely which fragments to replicate, and where to store those fragments and
replicas. Since the problem is very complex, we developed a new formulation of the
problem where every server is assigned a fragment combination from a subset of
combinations generated by a randomized greedy algorithm whose
randomization feature allows the exploration of a larger search space. We conducted an
elaborate computational study which shows that the reformulation is far more effective
than the original formulation especially for large size instances.


References
[1] H. I. Abdalla, “A New Data Re-allocation Model for Distributed Database Systems”, International
Journal of Database Theory and Application vol. 5, no. 2, (2012), pp. 45-60.
[2] P. R. Bhuyar, A. D. Gawande and A. B. Deshmukh, “Horizontal Fragmentation Technique in
Distributed Database”, International Journal of Scientific and Research Publications, vol. 2, no. 5,
(2012), pp. 1-7.
[3] I. O. Hababeh, M. Ramachandran and N. Bowring, “A High-Performance Computing Method for Data
Allocation in Distributed Database Systems”, The Journal of Supercomputing, vol. 39, no. 1, (2007), pp.
3-18.
[4] Y. F. Huang and C. J. Lai, “Integrating Frequent Pattern Clustering and Branch-and-Bound Approaches
for Data Partitioning”, Information Sciences, vol. 328, (2016), pp. 288-301.
[5] IBM ILOG CPLEX Optimization Studio 12.5, IBM (2012).
[6] K. Karlapalem and N. M. Pun, “Query-Driven Data Allocation Algorithms for Distributed Database
Systems”, International Conference on Database and Expert Systems Application, Springer Berlin
Heidelberg, (1997), pp. 347-356.
[7] S. Menon, “Allocating Fragments in Distributed Databases”, IEEE Transactions on Parallel and
Distributed Systems, vol. 16, no. 7, (2005), pp. 577-585.
[8] M. T. Özsu and P. Valduriez, “Principles of Distributed Database Systems”, Springer Science &
Business Media, (2011).
[9] G. Sen, M. Krishnamoorthy, N. Rangaraj and V. Narayanan, “Facility Location Models to Locate Data
in Information Networks: A Literature Review”, Annals of Operations Research, vol. 246, no. 1, (2015),
pp. 1-36.
[10] G. Sen, M. Krishnamoorthy, N. Rangaraj and V. Narayanan, “Exact Approaches for Static Data
Segment Allocation Problem in an Information Network”, Computers & Operations Research, vol. 62,
(2015), pp. 282-295.
[11] S. Song, “Design of Distributed Database Systems: An Iterative Genetic Algorithm”, Journal of
Intelligent Information Systems, vol. 45, no. 1, (2015), pp. 29-59.
[12] V. C. Storey and R. C. Goldstein, “A methodology for creating user views in database design”, ACM
Transactions on Database Systems, vol. 13, no. 3, (1988), pp. 305-338.

Author
Ali Amiri received the MBA and Ph.D. degrees in Management
Science/information systems from The Ohio State University,
Columbus, OH, in 1988 and 1992, respectively. He is a Professor
of Management Science and Information Systems at Oklahoma
State University. His research interests include data
communications, electronic commerce, data mining, and database
management. His papers have appeared in a variety of journals
including IEEE Transactions on Communications, European
Journal of Operational Research, Computers and Operations
Research, INFORMS Journal on Computing, Decision Support
Systems, ACM Transactions on Internet Technology, Information
Sciences, and Naval Research Logistics.
