Experimental Study of Data Migration Algorithms
Experimental Study of Data Migration Algorithms
Experimental Study of Data Migration Algorithms
net/publication/227273595
CITATIONS READS
48 198
8 authors, including:
Jared Saia
University of New Mexico
138 PUBLICATIONS 3,113 CITATIONS
SEE PROFILE
All content following this page was uploaded by Jared Saia on 28 May 2014.
Eric Anderson1 , Joe Hall2 , Jason Hartline2 , Michael Hobbs1 , Anna R. Karlin2 ,
Jared Saia2 , Ram Swaminathan1 , and John Wilkes1
1
Storage Systems Program, Hewlett-Packard Laboratories
Palo Alto, CA 94304
{anderse,mjhobbs,swaram,wilkes}@hpl.hp.com
2
Department of Computer Science and Engineering, University of Washington
Seattle, WA 98195
{jkh,hartline,karlin,saia}@cs.washington.edu
1 Introduction
The performance of modern day large-scale storage systems (such as disk farms)
can be improved by balancing the load across devices. Unfortunately, the optimal
data layout is likely to change over time because of workload changes, device
additions, or device failures. Thus, it may be desirable to periodically compute
a new assignment of data to devices [3,5,6,11], either at regular intervals or on
demand as system changes occur. Once the new assignment is computed, the
data must be migrated from the old configuration to the new configuration.
This migration must be done as efficiently as possible to minimize the impact
of the migration on the system. The large size of the data objects (gigabytes
are common) and the the large amount of total data (can be petabytes) makes
migration a process which can easily take several days.
In this paper, we consider the problem of finding an efficient migration plan.
We focus solely on the offline migration problem i.e. we ignore the load im-
posed by user requests for data objects during the migration. Our motivation
for studying this problem lies in migrating data for large-scale storage system
management tools such as Hippodrome [2]. Hippodrome automatically adapts to
G. Brodal et al. (Eds.): WAE 2001, LNCS 2141, pp. 145–158, 2001.
c Springer-Verlag Berlin Heidelberg 2001
146 Eric Anderson et al.
• k
• ◦
k k k
k
• • • •
k k
It is easy to see that n/3 is a worst case lower bound on the number of bypass
nodes needed to perform a migration in ∆ stages – consider the example demand
graph consisting of k disjoint copies of the 3-cycle (n = 3k).
In this paper, we introduce two new migration algorithms. The primary focus
of our work is on the empirical evaluation of these algorithms, and the migration
algorithms introduced in [7].
For the case where there are no space constraints, we evaluate two algorithms
which use indirection. We introduce the first of these in this paper; it is called
Max-Degree-Matching. This algorithm can find a migration taking ∆ steps while
using no more than 2n/3 bypass nodes. We compare this to 2-factoring [7] which
finds a migration plan taking 2 ∆/2 steps by using no more than n/3 bypass
nodes [7]. While 2-factoring has better theoretical bounds than Max-Degree-
Matching, we will see that Max-Degree-Matching uses fewer bypass nodes on
almost all tested demand graphs.
For migration with space constraints, we introduce a new algorithm, Greedy-
Matching, which uses no bypass nodes. We know of no good bound on the num-
ber of time steps taken by Greedy-Matching in the worst case; however, in our
experiments, Greedy-Matching often returned plans with very close to ∆ time
steps and never took more than 3∆/2 time steps. This compares favorably with
4-factoring direct [7] which also never uses bypass nodes but which always takes
essentially 3∆/2 time steps.
The paper is organized as follows. In Section 2, we describe the algorithms
we have evaluated for indirect migration without space constraints. In Section 3,
we describe the algorithms we have evaluated for migration with space con-
straints. Section 4 describes how we create the demand graphs on which we test
the migration algorithms while Sections 5 and 6 describe our experimental re-
sults. Section 7 gives an analysis and discussion of these results and Section 8
summarizes and gives directions for future research.
148 Eric Anderson et al.
4-factoring direct and 4-factoring indirect. Hall et al. show that 4-factoring di-
rect finds a 6 ∆/4 stage migration without bypass nodes and that 4-factoring
indirect finds a 4 ∆/4 stage migration plan using at most n/3 bypass nodes.
In our implementation of 4-factoring indirect, we again use the heuristic of
using nodes with only dummy edges in a particular stage as bypass nodes for
that stage.
4 Experimental Setup
The following table summarizes the theoretical results known for each algo-
rithm2 .
Algorithm Type Space Plan Length Worst Case
Constraints Max. Bypass Nodes
2-factoring [7] indirect No 2 ∆/2 n/3
Max-Degree-Matching indirect No ∆ 2n/3
Greedy-Matching direct Yes unknown 0
4-factoring direct [7] direct Yes 6 ∆/4 0
4-factoring indirect [7] indirect Yes 4 ∆/4 n/3
5.2 Performance
Figure 1 shows the performance of the algorithms on the load-balancing graphs
in terms of the number of bypass nodes used and the time steps taken. The x-axis
in each plot gives the index of the graph which is consistent across both plots.
The indices of the graphs are clustered according to the sets the graphs are from
with the first, second, third and fourth sets appearing left to right, separated by
solid lines.
An Experimental Study of Data Migration Algorithms 151
The first plot shows the number of bypass nodes used by 2-factoring, 4-
factoring indirect and Max-Degree-Matching. We see that Max-Degree-Matching
uses 0 bypass nodes on most of the graphs and never uses more than 1. The
number of bypass nodes used by 2-factoring and 4-factoring indirect are always
between 0 and 6, even for the graphs with about 300 nodes. The second plot
shows the number of stages required divided by ∆ for Greedy-Matching. Recall
that this ratio for 2-factoring,Max-Degree-Matching and 4-factoring indirect is
essentially 1 while the ratio for 4-factoring direct is essentially 1.5. In the graphs
in the second and third set, Greedy-Matching almost always has a ratio near 1.
However in the first set, Greedy-Matching has a ratio exceeding 1.2 on several
of the graphs and a ratio of more than 1.4 on one of them. In all cases, Greedy-
Matching has a ratio less than 4-factoring direct.
We note the following important points: (1) On all of the graphs, the number
of bypass nodes needed is less than 6 while the theoretical upper bounds are
significantly higher. In fact, Max-Degree-Matching used no bypass nodes for the
majority of the graphs (2) Greedy-Matching always takes fewer stages than 4-
factoring direct.
Varying Number of Nodes The three plots in the left column of Figure 2
give results for random graphs where the edge density is fixed and the number
of nodes varies. The first plot in this column shows the number of bypass nodes
used for General Graphs with edge density fixed at 10 as the number of nodes
increases from 100 to 1200. We see that Max-Degree-Matching and 2-factoring
consistently use no bypass nodes. 4-factoring indirect uses between 2 and 3
bypass nodes and surprisingly this number does not increase as the number of
nodes in the graph increases.
The second plot shows the number of bypass nodes used for Regular Graphs
with ∆ = 10 as the number of nodes increases from 100 to 1200. We see that
the number of bypass nodes needed by Max-Degree-Matching stays relatively
constant at 1 as the number of nodes increases. The number of bypass nodes
used by 2-factoring and 4-factoring indirect are very similar, starting at 3 and
growing very slowly to 6, approximately linearly with a slope of 1/900.
The third plot shows the number of bypass nodes used on Zipf Graphs with
minimum degree 1 as the number of nodes increases. In this graph, 2-factoring
152 Eric Anderson et al.
5
Bypass nodes required
0
0 10 20 30 40 50 60 70
Graph number
1.6
stages required/delta
1.5
1.4
1.3
1.2
1.1
1
0 10 20 30 40 50 60 70
Graph number
Fig. 1. The top plot gives the number of bypass nodes required for the algorithms
2-factoring, 4-factoring indirect and Max-Degree-Matching on each of the Load-
Balancing Graphs. The bottom plot gives the ratio of time steps required to ∆
for Greedy-Matching on each of the Load-Balancing Graphs. The three solid lines
in both plots divide the four sets of Load-Balancing Graphs
An Experimental Study of Data Migration Algorithms 153
Varying Edge Density The three plots in the right column of Figure 2 show
the number of bypass nodes used for graphs with a fixed number of nodes as the
edge density varies. The first plot in the column shows the number of bypass
nodes used on General Graphs, when the number of nodes is fixed at 100, and
edge density is varied from 20 to 200. We see that the number of bypass nodes
used by Max-Degree-Matching is always 0. The number of bypass nodes used
by 2 and 4-factoring indirect increases very slowly, approximately linearly with
a slope of about 1/60. Specifically, the number used by 2-factoring increases from
1/2 to 6 while the number used by 4-factoring indirect increases from 4 to 6.
The second plot shows the number of bypass nodes used on Regular Graphs,
when the number of nodes is fixed at 100 and ∆ is varied from 20 to 200.
The number of bypass nodes used by Max-Degree-Matching stays relatively flat
varying slightly between 1/2 and 1. The number of bypass nodes used by 2-
factoring and 4-factoring indirect increases near linearly with a larger slope of
1/30, increasing from 4 to 12 for 2-factoring and from 4 to 10 for 4-factoring
indirect.
The third plot shows the number of bypass nodes used on Zipf Graphs, when
the number of nodes is fixed at 146 and the minimum degree is varied from 1 to
10. 2-factoring here again always uses 0 bypass nodes. The Max-Degree-Matching
curve again stays relatively flat varying between 1/4 and 1. 4-factoring indirect
varies slightly, from 2 to 4, again near linearly with a slope of 1/5.
We suspect that our heuristic of using nodes with only dummy edges as
bypass nodes in a stage helps 2-factoring significantly on Zipf Graphs since
there are so many nodes with small degree and hence many dummy self-loops.
For General and Regular Graphs, the migration plans Greedy-Matching found
never took more than ∆ + 1 time steps. Since the other algorithms we tested
are guaranteed to have plans taking less than ∆ + 3, we present no plots of
the number of time steps required for these algorithms on General and Regular
Graphs.
As shown in Figure 3, the number of stages used by Greedy-Matching for
Zipf Graphs is significantly worse than for the other types of random graphs. We
note however that it always performs better than 4-factoring direct. The first
plot shows that the number of extra stages used by Greedy-Matching for Zipf
Graphs with minimum degree 1 varies from 2 to 4 as the number of nodes varies
from 100 to 800. The second plot shows that the number of extra stages used
by Greedy-Matching for Zipf Graphs with 146 nodes varies from 1 to 11 as the
minimum degree of the graphs varies from 1 to 10. High density Zipf graphs are
the one weakness we found forGreedy-Matching.
154 Eric Anderson et al.
Bypass nodes
8
2
6
1.5
1 4
0.5 2
0 0
0 200 400 600 800 1000 1200 0 20 40 60 80 100 120 140 160 180 200
Number of Nodes Density
Bypass nodes
8
4
6
3
2 4
1 2
0 0
0 200 400 600 800 1000 1200 0 20 40 60 80 100 120 140 160 180 200
Number of Nodes Delta
4
Bypass nodes
Bypass nodes
2
2
1
1
0 0
0 100 200 300 400 500 600 700 1 2 3 4 5 6 7 8 9 10
Number of Nodes Minimum Degree
Fig. 2. These six plots give the number of bypass nodes needed for 2-factoring,
4-factoring direct and Max-Degree-Matching for the General, Regular and Zipf
Graphs. The three plots in the left column give the number of bypass nodes
needed as the number of nodes in the random graphs increase. The three plots
in the right column give the number of bypass nodes needed as the density of
the random graphs increase. The plots in the first row are for General Graphs,
plots in the second row are for Regular Graphs and plots in the third row are
for Zipf Graphs
An Experimental Study of Data Migration Algorithms 155
40
100
30
20
50
10
0 0
0 100 200 300 400 500 600 700 1 2 3 4 5 6 7 8 9 10
Number of Nodes Minimum Degree
Fig. 3. Number of steps above Delta needed for Greedy-Matching on Zipf Graphs
7 Analysis
Our major empirical conclusions for the graphs tested are:
performed so much better than 2-factoring and 4-factoring indirect despite its
worse theoretical bound.
8 Conclusion
We have introduced two new data migration algorithms and have empirically
evaluated their performance compared with two algorithms from [7]. The metrics
we used to evaluate the algorithms are: (1) the number of time steps required to
perform the migration, and (2) the number of bypass nodes used as intermediate
storage devices. We have found on several types of random and load-balancing
graphs that the new algorithms outperform the algorithms from [7] on these two
metrics despite the fact that the theoretical bounds for the new algorithms are
not as good. Not surprisingly, we have also found that for all the algorithms
tested, the theoretical bounds are overly pessimistic. We conclude that many of
the algorithms described in this paper are both practical and effective for data
migration.
There are several directions for future work. Real world devices have different
I/O speeds. For example, one device might be able handle sending or receiving
twice as many objects per stage as another device. We want good approximation
algorithms for migration with different device speeds. Also in some important
cases, a complete graph is a poor approximation to the network topology. For
example, a wide area network typically has a very sparse topology which is more
closely related to a tree than to a complete graph. We want good approximation
algorithms for commonly occuring topologies (such as trees) and in general for
arbitrary topologies. Saia [10] gives some preliminary approximation algorithms
for migration with variable device speeds and different network topologies.
Another direction for future work is designing algorithms which work well
for the online migration problem. In this paper, we have ignored loads imposed
by user requests in devising a migration plan. A better technique for creating
a migration plan would be to migrate the objects in such a way that we inter-
fere as little as possible with the ability of the devices to satisfy user requests
and at the same time improve the load balancing behavior of the network as
quickly as possible. This may require adaptive algorithms since user requests are
unpredictable.
A final direction for future work is designing algorithms which make use of
free nodes when available but do not require them to perform well. In particular,
we want a good approximation algorithm for migration with indirection when no
external bypass nodes are available. To the best of our knowledge, no algorithm
with an approximation ratio better than 3/2 for this problem is known at this
time.
An Experimental Study of Data Migration Algorithms 157
References
1. E. Anderson, J. Hall, J. Hartline, M. Hobbs, A. Karlin, J. Saia, R. Swaminathan,
and J. Wilkes. An Experimental Study of Data Migration Algorithms. Technical
report, University of Washington, 2001. 148, 149, 150
2. E. Anderson, M. Hobbs, K. Keeton, S. Spence, M. Uysal, and A. Veitch. Hippo-
drome: running circles around storage administration. Submitted to Symposium
on Operating System Principles, 2001. 145, 150
3. E. Borowsky, R. Golding, A. Merchant, L. Schreier, E. Shriver, M. Spasojevic, and
J. Wilkes. Using attribute-managed storage to achieve QoS. In 5th Intl. Workshop
on Quality of Service, Columbia Univ., New York, June 1997. 145
4. Transaction Processing Performance Council. TPC Benchmark D (Decision Sup-
port) Standard Specification Revision 2.1. 1996. 150
5. B. Gavish and O. R. Liu Sheng. Dynamic file migration in distributed computer
systems. Communications of the ACM, 33:177–189, 1990. 145
6. I. Golubchik, S. Khuller S. Khanna, R. Thurimella, and A. Zhu. Approximation
Algorithms for Data Placement on Parallel Disks. In Proceedings of the Eleventh
Annual ACM-SIAM Symposium on Discrete ALgorithms, pages 223–232, 2000.
145
7. J. Hall, J. Hartline, A. Karlin, J. Saia, and J. Wilkes. On algorithms for efficient
data migration. In 12th annual ACM-SIAM Symposium on Discrete Algorithms,
2001. 145, 147, 149, 156 p
8. S. Micali and V. Vazirani. An O( |V ||E|) algorithm for finding a maximum
matching in general graphs. In 21st Annual Symposium on Foundations of Com-
puter Science, 1980. 148, 158
9. T. Nishizeki and K. Kashiwagi. On the 1.1 edge-coloring of multigraphs. In SIAM
Journal on Discrete Mathematics, volume 3, pages 391–410, 1990. 147
10. J. Saia. Data Migration with Edge Capacities and Machine Speeds. Technical
report, University of Washington, 2001. 156
11. J. Wolf. The Placement Optimization Problem: a practical solution to the disk
file assignment problem. In Proceedings of the ACM SIGMETRICS international
conference on Measurement and modeling of computer systems, pages 1–10, 1989.
145
158 Eric Anderson et al.
A Appendix
A.1 Max-Degree-Matching