Fragment Allocation and Replication in Distributed Database Systems
Ali Amiri
Department of MSIS
Spears School of Business
Oklahoma State University
Stillwater, OK, 74078, USA
amiri@okstate.edu
Abstract
We study the problem of designing a distributed database system. We develop optimization models that deal simultaneously with two major design issues, namely which fragments to replicate and where to store those fragments and replicas. Given the difficulty of the problem, we propose a solution algorithm based on a new formulation in which every server is allocated a fragment combination from a set of combinations generated by a randomized greedy heuristic. The results of a computational study show that the algorithm outperforms a standard branch & bound technique for large instances of the problem.
1. Introduction
A distributed database is a single logical database that is spread physically across
computers in multiple locations that are connected by a data communications network.
The adoption of distributed databases is fostered by various business conditions, such as the global nature of business operations and transactions, the distribution and autonomy of business units, and telecommunications costs and reliability [7]. A distributed database offers several potential benefits over a centralized database, such as improved availability, reliability, and performance, and lower costs. For those benefits to be realized, a distributed database should be properly designed with respect to three important issues: how to partition the database into fragments, which fragments to replicate, and where to store those fragments and replicas [1,11].
As stated in [12], logical database design deals with determining the contents of a
database independently of the physical implementation considerations. The guide of the
design process is typically the statement of user view requirements/needs in the form of a
set of user views [12]. Each view defines the data contents and transactions which are
required by a specific job/function that a user (or group of users) performs.
Each view requires the identification of the database fragments that are needed to process
retrieval and update transactions of the corresponding user view. The task of fragmenting
the database effectively is a difficult one in itself. However, a variety of approaches exist
for fragmenting databases [2,4], and this paper assumes that the fragments have already
been identified based on the definition of the user views.
The response to a query in a distributed database environment may require assembling
data from several different sites (although with location transparency, the user is unaware
of this need). The volume of data transmitted over the communication network depends
heavily on the type of the query and the locations of the fragments and their replicas
involved in the query. A retrieval query uses only one copy of each fragment in the
query; whereas an update query uses every copy of each fragment in the query to maintain
up-to-date data. Besides the way a query is formulated by a user view, query optimization
by the database management system affects the query execution efficiency, especially
when the query requires fragments located in different sites. In this paper, we design the
distributed database in such a way that all the fragments required by a user view are
located in the same site. This approach simplifies query optimization and improves
response time to users, especially when processing retrieval queries. As this approach
allows replication of fragments, an update query still requires updating every copy of the
fragments in the queries. While the nonredundant fragment allocation scheme (i.e., storing
one copy of each fragment in the database) makes the problem less difficult to solve, it is
a very restrictive perspective. Under that scheme, the strategy is first to find a
nonredundant solution by ignoring replication and second to apply a heuristic to that
solution to decide how to replicate fragments [6,8]. The quality of the final solution obtained using this non-integrated strategy is inferior to that of the integrated approach (i.e., one that considers fragment replication and allocation simultaneously) that we adopt in this paper [8].
Many studies have considered various aspects of the distributed database design
problem in a variety of contexts. A comprehensive review of these studies can be found
in [9]. Here, we consider studies that are most related to the topic of our paper. Menon [7]
presented new integer programming formulations for the nonredundant version of the
fragment allocation problem (where exactly one copy of each fragment exists across all
sites). These formulations are solved using the commercial integer programming solver
CPLEX [5]. Menon [7] reported computational test results showing that his formulations are more effective than prior formulations on instances with up to 200 fragments and 10 servers.
Hababeh et al. [3] developed a method for grouping the distributed sites into clusters
and customizing the database fragments allocation to the clusters and their sites. The
method proceeds in three steps: (i) it groups the computing sites into disjoint clusters, (ii)
it allocates the fragments to the clusters, and (iii) it allocates the fragments within each
cluster to its sites independently of the other clusters and sites. Hababeh et al. [3]
reported results of computational tests using a database system with 6 sites and 8
fragments.
Sen et al. [10] studied a general version of the problem where (i) the files/segments
have to be clustered and the clusters need to be assigned to a given number of servers
whose locations are to be chosen from among a pre-determined set of potential locations,
and (ii) query traffic between the users and the servers is routed over a fully connected
backbone network. This general problem is solved indirectly by solving first the data
partitioning problem and then the segment allocation problem.
In this paper, we present an optimization model for the distributed database design
problem (DDDP) that deals simultaneously with two major design issues, namely which
fragments to replicate, and where to store those fragments and replicas. Again, we assume
that the database fragments have already been determined based on the definition of the
user views. We develop a mathematical programming model of the DDDP which allows a
user view to be assigned to a server only if that server is allocated all the fragments
required by the user view. In particular, the model considers the objective of minimizing
total system cost which is composed of three components: (i) the communication and
processing costs of retrieval transactions of the user views, (ii) the communication and
processing costs of update transactions of the user views, and (iii) the cost of storing and
maintaining the fragments and their replicas on the servers. The decisions to make are (i)
how to assign user views to the servers to process their transactions, and (ii) how to
replicate the fragments and allocate them to the servers. A solution to the problem should
satisfy the following requirements: (i) each user view is assigned to one server to process
its transactions, (ii) a user view can be assigned to a server only if all the fragments
needed by the user view transactions are stored on that server, and (iii) the processing
capacities of the servers are not exceeded.
Given the difficulty of the distributed database design problem, we propose a solution algorithm based on a new formulation of the problem in which every server is allocated a fragment combination, either from the set of all combinations or from a well-constructed subset of combinations. A combination consists of a group of fragments to be allocated to a server.
We conducted a computational study to assess the effectiveness of the solution algorithm. The results show that the proposed algorithm outperforms a standard branch & bound technique based on the first formulation for large instances of the problem.
The remainder of the paper is organized as follows. Section 2 presents two integer
programming formulations of the problem. Two solution procedures are described in
Section 3. Computational results are reported in Section 4 while Section 5 concludes.
2. Problem Formulations
Define $N$ as the set of user views that need to be assigned to the set $M$ of servers/sites. Define $F$ as the set of fragments that need to be allocated to the servers, $F_i'$ as the set of fragments needed by user view $i \in N$ in retrieval transactions, and $F_i''$ as the set of fragments needed by user view $i \in N$ in update transactions. Define the following parameters: $d_i'$, the processing demand of retrieval transactions of user view $i \in N$; $d_{ik}''$, the processing demand of update transactions of fragment $k \in F_i''$ of user view $i \in N$; $C_{ij}'$, the communication and processing cost of retrieval transactions of user view $i \in N$ when it is assigned to server $j \in M$; $C_{ikj}''$, the communication and processing cost of update transactions of user view $i \in N$ for fragment $k \in F$ when it is stored at server $j \in M$; $C_{kj}$, the cost of storing and maintaining fragment $k \in F$ on server $j \in M$; and $Q_j$, the processing capacity of server $j \in M$. Define the decision variables as

$$X_{ij} = \begin{cases} 1 & \text{if user view } i \text{ is assigned to server } j \in M \\ 0 & \text{otherwise} \end{cases}$$

$$Y_{kj} = \begin{cases} 1 & \text{if fragment } k \in F \text{ is stored on server } j \in M \\ 0 & \text{otherwise} \end{cases}$$
Given these definitions, the distributed database design problem (DDDP) can be formulated as model DDDP1 below.

DDDP1:

$$\min \; \sum_{i \in N} \sum_{j \in M} C_{ij}' X_{ij} + \sum_{i \in N} \sum_{k \in F_i''} \sum_{j \in M} C_{ikj}'' Y_{kj} + \sum_{k \in F} \sum_{j \in M} C_{kj} Y_{kj} \qquad (1)$$

s.t.

$$\sum_{j \in M} X_{ij} = 1 \qquad \forall i \in N \qquad (2)$$

$$X_{ij} \le Y_{kj} \qquad \forall i \in N, \; k \in F_i' \cup F_i'', \; j \in M \qquad (3)$$

$$\sum_{i \in N} d_i' X_{ij} + \sum_{i \in N} \sum_{k \in F_i''} d_{ik}'' Y_{kj} \le Q_j \qquad \forall j \in M \qquad (4)$$

$$X_{ij}, Y_{kj} \in \{0, 1\} \qquad \forall i \in N, \; k \in F, \; j \in M \qquad (5)$$
The goal is to minimize the total communication and processing cost of user view retrieval and update transactions plus the cost of storing and maintaining fragments on the servers. Constraints (2) ensure that each user view is assigned to one server to handle its retrieval and update transactions. Constraints (3) ensure that if a user view is assigned to a server, then every fragment required by the user view is stored on that server. Constraints (4) limit the total processing demand of the user views assigned to a server to the capacity of the server; the first and second terms on the left-hand side represent the processing requirements of retrieval and update transactions, respectively. Constraints (5) restrict the decision variables to be binary.
Since the distributed database design problem (DDDP1) is NP-complete, it is extremely difficult to solve optimally in an acceptable computing time. This difficulty is
the direct result of the fact that both the allocation of fragments to the servers and the
assignment of user views to the servers have to be handled concurrently. We conducted
computational tests that show that the linear programming relaxation of (DDDP1)
produces non-integer values of the decision variables and, hence, a standard branch &
bound technique using formulation (DDDP1) is not likely to generate an optimal solution
in reasonable time. We next develop a second formulation of the distributed database
design problem that will be used in solving the problem more effectively.
A new formulation for the distributed database design problem can be obtained by
viewing it as a covering problem. This formulation stems from the observation that in
any feasible solution to the problem, the user views have to be covered by the servers they
are assigned to, with a user view being allocated to one server that maintains all the
fragments the user view needs in its retrieval and update transactions. As a result, we can solve the problem by considering combinations of fragments and assigning the "best" combinations to the servers so that the total cost is minimized. Every combination of fragments is a valid configuration that can be allocated to a server.
Using the above insight, the distributed database design problem can now be reformulated in the following way. Define $G$ as the set of all combinations, where a combination is simply a subset of fragments that can be assigned to a server. Define $G_i$ to be the set of combinations that contain all fragments used in processing transactions of user view $i \in N$. Define parameter $a_{kg}$ to be 1 if fragment $k$ is included in combination $g$, and 0 otherwise. Define a binary variable $W_{jg}$ to be 1 if server $j$ is allocated fragment combination $g$, and 0 if not. Each server needs to be assigned exactly one combination, including possibly the empty combination, which leaves open the possibility that a server is not used. This can be stated mathematically as

$$\sum_{g \in G} W_{jg} = 1 \qquad \forall j \in M$$

Each user view should be assigned to one server whose configuration includes all fragments that the user view needs. Mathematically, this is equivalent to saying

$$X_{ij} \le \sum_{g \in G_i} W_{jg} \qquad \forall i \in N, \; j \in M$$
The distributed database design problem can then be stated as model DDDP2 below.

DDDP2:

$$\min \; \sum_{i \in N} \sum_{j \in M} C_{ij}' X_{ij} + \sum_{i \in N} \sum_{j \in M} \sum_{g \in G} \sum_{k \in F_i'' : a_{kg} = 1} C_{ikj}'' W_{jg} + \sum_{j \in M} \sum_{g \in G} \sum_{k \in F : a_{kg} = 1} C_{kj} W_{jg} \qquad (6)$$

s.t.

$$\sum_{j \in M} X_{ij} = 1 \qquad \forall i \in N \qquad (7)$$

$$X_{ij} \le \sum_{g \in G_i} W_{jg} \qquad \forall i \in N, \; j \in M \qquad (8)$$

$$\sum_{g \in G} W_{jg} = 1 \qquad \forall j \in M \qquad (9)$$

$$\sum_{i \in N} d_i' X_{ij} + \sum_{i \in N} \sum_{g \in G} \sum_{k \in F_i'' : a_{kg} = 1} d_{ik}'' W_{jg} \le Q_j \qquad \forall j \in M \qquad (10)$$

$$X_{ij}, W_{jg} \in \{0, 1\} \qquad \forall i \in N, \; j \in M, \; g \in G \qquad (11)$$
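To see the covering reformulation in action, the sketch below enumerates the full power set of fragments as $G$ on a tiny synthetic instance and solves DDDP2 with PuLP. With $|F|$ fragments there are $2^{|F|}$ combinations, so full enumeration is only viable at toy scale; the data layout mirrors the DDDP1 sketch and is invented for illustration.

```python
import itertools
import random
import pulp

random.seed(0)

# Synthetic instance (same hypothetical layout as the DDDP1 sketch).
N, M, F = list(range(5)), list(range(2)), list(range(4))
Fr = {i: set(random.sample(F, 2)) for i in N}
Fu = {i: set(random.sample(F, 1)) for i in N}
dr = {i: random.randint(5, 15) for i in N}
du = {(i, k): random.randint(1, 5) for i in N for k in Fu[i]}
Cr = {(i, j): random.randint(10, 100) for i in N for j in M}
Cu = {(i, k, j): random.randint(1, 20) for i in N for k in Fu[i] for j in M}
Cs = {(k, j): random.randint(5, 50) for k in F for j in M}
Q = {j: 80 for j in M}

# G: every subset of fragments, including the empty combination (unused server).
G = [frozenset(c) for r in range(len(F) + 1)
     for c in itertools.combinations(F, r)]
Gidx = list(range(len(G)))
# G_i: combinations containing every fragment that view i reads or updates.
Gi = {i: [g for g in Gidx if (Fr[i] | Fu[i]) <= G[g]] for i in N}

prob = pulp.LpProblem("DDDP2", pulp.LpMinimize)
X = pulp.LpVariable.dicts("X", (N, M), cat=pulp.LpBinary)
W = pulp.LpVariable.dicts("W", (M, Gidx), cat=pulp.LpBinary)

# Objective (6): retrieval cost + update/storage cost of fragments in the
# combination chosen for each server (a_kg = 1 encoded as k in G[g]).
prob += (pulp.lpSum(Cr[i, j] * X[i][j] for i in N for j in M)
         + pulp.lpSum(Cu[i, k, j] * W[j][g] for i in N for j in M
                      for g in Gidx for k in Fu[i] if k in G[g])
         + pulp.lpSum(Cs[k, j] * W[j][g]
                      for j in M for g in Gidx for k in G[g]))

for i in N:  # (7) assign each view to exactly one server
    prob += pulp.lpSum(X[i][j] for j in M) == 1
for i in N:  # (8) only to a server whose combination covers the view
    for j in M:
        prob += X[i][j] <= pulp.lpSum(W[j][g] for g in Gi[i])
for j in M:  # (9) exactly one combination per server
    prob += pulp.lpSum(W[j][g] for g in Gidx) == 1
for j in M:  # (10) capacity
    prob += (pulp.lpSum(dr[i] * X[i][j] for i in N)
             + pulp.lpSum(du[i, k] * W[j][g] for i in N
                          for g in Gidx for k in Fu[i] if k in G[g])) <= Q[j]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.LpStatus[prob.status], pulp.value(prob.objective))
```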
3. Solution Procedures

A simple deterministic greedy algorithm involves sequentially selecting a pair of user view and server and allocating the user view to the server. The criterion used in the selection is the net increase in overall cost per unit of demand of the user view. For every user view, we identify the least expensive server to assign the user view to, taking into account the fragments that are already allocated to the server. From all pairs of user views and servers, we choose the "best" pair that produces the least increase in overall cost per unit of demand of the user view. The heuristic finishes when all user views have been assigned to servers. To escape a possible local optimum, we randomize the search by perturbing the costs as follows. If we let $\Delta_{ij}$ denote the increase in total cost if user view $i$ is allocated to server $j$, then we use $\tilde{\Delta}_{ij} = (1 + \varepsilon_{ij}) \Delta_{ij}$ in selecting the next pair of user view and server, where $\varepsilon_{ij}$ is a random number drawn from the uniform distribution $U[-0.05, 0.05]$.
An outline of the algorithm RGA is given below, using the following notation:

$C$: overall cost, initially $C = 0$
$\bar{N}$: set of unassigned user views, initially $\bar{N} = N$
$\bar{\bar{N}}$: set of processed user views (i.e., user views assigned to servers), initially $\bar{\bar{N}} = \emptyset$
$F_j$: set of fragments allocated to server $j \in M$, initially $F_j = \emptyset$
$\bar{Q}_j$: remaining capacity of server $j \in M$, initially $\bar{Q}_j = Q_j$

Randomized Greedy Algorithm (RGA):
Let $B^*$ be the best feasible solution found so far.
Repeat 500 times:
  Step 1: Initialize.
    Set $C = 0$, $\bar{N} = N$, $\bar{\bar{N}} = \emptyset$, $F_j = \emptyset$ and $\bar{Q}_j = Q_j$ for all $j \in M$; let $B$ be the solution under construction.
  Step 2: User view selection and allocation.
    While $\bar{N} \neq \emptyset$ do
      For every pair $(i, j) \in \bar{N} \times M$, compute $\tilde{\Delta}_{ij} = (1 + \varepsilon_{ij}) \Delta_{ij}$; the remaining capacity $\bar{Q}_j$ of server $j$ must be large enough to handle the additional retrieval and update demand.
      Select the pair $(i^*, j^*)$ with the smallest $\tilde{\Delta}_{ij}$ and assign user view $i^*$ to server $j^*$.
      Update the current solution $B$, the overall cost $C$, $\bar{N} = \bar{N} \setminus \{i^*\}$, $\bar{\bar{N}} = \bar{\bar{N}} \cup \{i^*\}$, $F_{j^*} = F_{j^*} \cup F_{i^*}' \cup F_{i^*}''$, and the remaining capacity $\bar{Q}_{j^*}$.
    End While
    If solution $B$ has a better objective function value than solution $B^*$, then set $B^* = B$.
End Repeat
Retain $B^*$ as the best feasible solution and stop.
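The following compact Python sketch implements the RGA search scheme on synthetic data. The incremental cost and load bookkeeping is a simplified reading of $\Delta_{ij}$, in which each fragment copy newly added to a server brings its storage cost plus the update cost and update load of every view that updates that fragment; it is meant to illustrate the randomized greedy mechanics rather than reproduce the paper's exact implementation.

```python
import random

random.seed(1)

# Synthetic instance (data invented, notation follows the paper).
N, M, F = list(range(8)), list(range(3)), list(range(6))
Fr = {i: set(random.sample(F, 2)) for i in N}                  # F'_i
Fu = {i: set(random.sample(F, 1)) for i in N}                  # F''_i
dr = {i: random.randint(5, 15) for i in N}                     # d'_i
du = {(i, k): random.randint(1, 5) for i in N for k in Fu[i]}  # d''_ik
Cr = {(i, j): random.randint(10, 100) for i in N for j in M}   # C'_ij
Cu = {(i, k, j): random.randint(1, 20)
      for i in N for k in Fu[i] for j in M}                    # C''_ikj
Cs = {(k, j): random.randint(5, 50) for k in F for j in M}     # C_kj
Q = {j: 80 for j in M}                                         # Q_j

def increment(i, j, frags_j):
    """Simplified Delta_ij and added load if view i goes to server j, whose
    current fragment set is frags_j: new copies bring storage cost plus the
    update cost/load of every view that updates them."""
    new = (Fr[i] | Fu[i]) - frags_j
    cost = Cr[i, j] + sum(Cs[k, j] for k in new)
    cost += sum(Cu[h, k, j] for h in N for k in new if k in Fu[h])
    load = dr[i] + sum(du[h, k] for h in N for k in new if k in Fu[h])
    return cost, load

best_cost, best_sol = float("inf"), None
for _ in range(500):                       # 500 randomized restarts
    unassigned, assign = set(N), {}
    frags = {j: set() for j in M}          # F_j, fragments on each server
    qbar = dict(Q)                         # remaining capacities
    total, feasible = 0.0, True
    while unassigned:
        best = None
        for i in unassigned:
            for j in M:
                cost, load = increment(i, j, frags[j])
                if load > qbar[j]:
                    continue               # server j cannot absorb the demand
                # perturbed cost per unit of demand, eps ~ U[-0.05, 0.05]
                key = (1 + random.uniform(-0.05, 0.05)) * cost / dr[i]
                if best is None or key < best[0]:
                    best = (key, i, j, cost, load)
        if best is None:                   # no feasible pair: abandon this run
            feasible = False
            break
        _, i, j, cost, load = best
        assign[i] = j
        frags[j] |= Fr[i] | Fu[i]
        qbar[j] -= load
        total += cost
        unassigned.remove(i)
    if feasible and total < best_cost:     # keep the best solution B*
        best_cost, best_sol = total, dict(assign)

print("best cost:", best_cost, "assignment:", best_sol)
```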
During the execution of algorithm RGA, many distinct server configurations are produced; each configuration identifies a set of fragments to store on a server. Such configurations can be fed into formulation (DDDP2), which can then be solved with a standard branch & bound algorithm to produce, in theory, the best feasible solution to the original problem using those configurations. We refer to this procedure as algorithm CONFIG. Given that the subset of configurations employed does not necessarily contain the complete set of best configurations, an optimal solution to the distributed database design problem (DDDP1) is not guaranteed. To expedite the branch & bound algorithm, we limit the number of configurations used in formulation (DDDP2) to the 200 that are part of the "best" solutions generated during the application of RGA. The reason for setting this limit is that we observed that when model (DDDP2) is solved using all the distinct configurations generated by RGA, solution quality deteriorates significantly because the solver has to spend a lot of time evaluating all those configurations rather than concentrating on the most promising ones.
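A sketch of how this capping step might look is given below; collect_configs and its signature are our own illustration built on top of the RGA sketch, not code from the paper. Sorting solutions by cost first means the cap retains configurations drawn from the "best" RGA solutions.

```python
from typing import Dict, List, Set, Tuple

def collect_configs(solutions: List[Tuple[float, Dict[int, Set[int]]]],
                    cap: int = 200) -> List[frozenset]:
    """Collect distinct non-empty server configurations from RGA solutions.
    Each solution is (total_cost, {server: fragments stored on it}); cheaper
    solutions are scanned first, and at most `cap` configurations are kept.
    The empty combination is added separately when building G for DDDP2."""
    seen: Set[frozenset] = set()
    configs: List[frozenset] = []
    for _, servers in sorted(solutions, key=lambda s: s[0]):
        for frags in servers.values():
            g = frozenset(frags)
            if g and g not in seen:
                seen.add(g)
                configs.append(g)
                if len(configs) == cap:
                    return configs
    return configs
```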
We illustrate the performance of the two formulations using the small example described earlier. The solution produced using formulation (DDDP1) is optimal with a total cost of 51908 and consists of assigning user views {6, 8, 9, 12, 15, 20} to server 1, user views {7, 11} to server 2, user views {2, 3, 5, 10, 13, 14, 16} to server 3, and user views {1, 4, 17, 18, 19} to server 5, and allocating fragments {1, 2, 3, 5, 6, 11, 13, 14, 15, 16, 18, 19, 20} to server 1, fragments {1, 3, 5, 7, 9, 15, 20} to server 2, fragments {1, 2, 3, 6, 7, 8, 10, 13, 14, 16, 17, 18, 19, 20} to server 3, and fragments {1, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 17, 19, 20} to server 5. The utilizations of the five servers are 99%, 29%, 98%, 0%, and 97%, respectively. The solution produced using algorithm CONFIG has a total cost of 52358, corresponding to an optimality gap of only 0.87%, and it consists of assigning user views {6, 8, 9, 12, 15, 20} to server 1, user views {7, 11} to server 2, user views {1, 3, 5, 10, 13, 14, 16} to server 3, and user views {2, 4, 17, 18, 19} to server 5, and allocating fragments {1, 2, 3, 5, 6, 11, 13, 14, 15, 16, 18, 19, 20} to server 1, fragments {1, 3, 5, 7, 9, 15, 20} to server 2, fragments {1, 2, 3, 6, 7, 8, 10, 13, 14, 16, 17, 18, 19} to server 3, and fragments {1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 17, 19, 20} to server 5. The utilizations of the five servers are 99%, 29%, 95%, 0%, and 99%, respectively.
4. Computational Results

number in [1, 2]; and the values of $C_{kj}$ were generated from the uniform distribution $U[50, 1000]$.
Table 3 reports the results of the tests to evaluate the effectiveness of solving the
problem using algorithm CONFIG which incorporates the set of fragment combinations
produced by the heuristic RGA into formulation DDDP2. Each combination defines a
subset of fragments that can be allocated to a server. The average number of such
combinations used in formulation DDDP2 is 38.1. The distributed database design
problem (DDDP2) is computationally less difficult than the problem with formulation
(DDDP1). The average optimality gap for the solutions obtained using CONFIG is only
2.83%. However, it is worth noting that the problem with formulation (DDDP2) is still
NP-complete, and hence its complexity can grow drastically with problem size. Even
though a small number of configurations are used in formulation (DDDP2), the
effectiveness of this formulation is on average better than that of formulation (DDDP1).
Indeed, the solutions produced using CONFIG are on average 2.20% cheaper than the
solutions produced using formulation (DDDP1). Algorithm CONFIG, based on formulation (DDDP2), increasingly outperforms formulation (DDDP1) as the number of servers grows. For instance, when there are 5 servers, both CONFIG and CPLEX applied to (DDDP1) obtain optimal solutions; however, when there are 25 servers, the costs of the solutions produced using CONFIG are on average 6.56% lower than those produced using formulation (DDDP1).
Table 4 sheds some light on the characteristics of the solutions obtained using CONFIG, such as the allocation of cost among its three components, server capacity utilization, the number of fragments stored on the servers, and the number of user views allocated to the servers. When the number of servers increases, the contributions of the update and fragment components increase at the expense of the retrieval component. For example, for the test problem with 180 user views, when the number of servers is 5, the shares of the retrieval, update, and fragment components are 79.9%, 12.3%, and 7.8%, respectively, whereas those numbers become 35.9%, 39.3%, and 24.8% when the number of servers is 25. In addition, when the number of servers increases, server capacity utilization increases and the number of user views per server decreases, but the number of fragments per server does not change significantly.
5. Conclusion
A distributed database should be properly designed to benefit from the potential
advantages it offers. Effective distribution of the database fragments plays a crucial role
in the functioning of the database, affecting both cost and performance. We developed optimization models that deal simultaneously with two major design issues, namely which fragments to replicate and where to store those fragments and replicas. Since the problem is very complex, we developed a new formulation of the problem in which every server is assigned a fragment combination from a subset of combinations produced by a randomized greedy algorithm, whose randomization feature allows the exploration of a larger search space. We conducted an extensive computational study which shows that the reformulation is far more effective than the original formulation, especially for large instances.
References
[1] H. I. Abdalla, “A New Data Re-allocation Model for Distributed Database Systems”, International
Journal of Database Theory and Application vol. 5, no. 2, (2012), pp. 45-60.
[2] P. R. Bhuyar, A. D. Gawande and A. B. Deshmukh, “Horizontal Fragmentation Technique in
Distributed Database”, International Journal of Scientific and Research Publications, vol. 2, no. 5,
(2012), pp. 1-7.
[3] I. O. Hababeh, M. Ramachandran and N. Bowring, “A High-Performance Computing Method for Data
Allocation in Distributed Database Systems”, The Journal of Supercomputing, vol. 39, no. 1, (2007), pp.
3-18.
[4] Y. F. Huang and C. J. Lai, “Integrating Frequent Pattern Clustering and Branch-and-Bound Approaches
for Data Partitioning”, Information Sciences, vol. 328, (2016), 288-301.
[5] IBM ILOG CPLEX Optimization Studio 12.5, IBM (2012).
[6] K. Karlapalem and N. M. Pun, “Query-Driven Data Allocation Algorithms for Distributed Database
Systems”, International Conference on Database and Expert Systems Application, Springer Berlin
Heidelberg, (1997), pp. 347-356.
[7] S. Menon, “Allocating Fragments in Distributed Databases”, IEEE Transactions on Parallel and
Distributed Systems, vol. 16, no. 7, (2005), 577-585.
[8] M. T. Özsu and P. Valduriez, “Principles of Distributed Database Systems”, Springer Science & Business Media, (2011).
[9] G. Sen, M. Krishnamoorthy, N. Rangaraj and V. Narayanan, “Facility Location Models to Locate Data
in Information Networks: A Literature Review”, Annals of Operations Research, vol. 246, no. 1, (2015),
pp. 1-36.
[10] G. Sen, M. Krishnamoorthy, N. Rangaraj and V. Narayanan, “Exact Approaches for Static Data
Segment Allocation Problem in an Information Network”, Computers & Operations Research, vol. 62,
(2015), pp. 282-295.
[11] S. Song, “Design of Distributed Database Systems: An Iterative Genetic Algorithm”, Journal of
Intelligent Information Systems, vol. 45, no. 1, (2015), pp. 29-59.
[12] V. C. Storey and R. C. Goldstein, “A methodology for creating user views in database design”, ACM
Transactions on Database Systems, vol. 13, no. 3, (1988), pp. 305-338.
Author
Ali Amiri received the MBA and Ph.D. degrees in Management Science/Information Systems from The Ohio State University,
Columbus, OH, in 1988 and 1992, respectively. He is a Professor
of Management Science and Information Systems at Oklahoma
State University. His research interests include data
communications, electronic commerce, data mining, and database
management. His papers have appeared in a variety of journals
including IEEE Transactions on Communications, European
Journal of Operational Research, Computers and Operations
Research, INFORMS Journal on Computing, Decision Support
Systems, ACM Transactions on Internet Technology, Information
Sciences, and Naval Research Logistics.