Inside Dropbox: Understanding Personal Cloud Storage Services
ABSTRACT

Personal cloud storage services are gaining popularity. With a rush of providers to enter the market and an increasing offer of cheap storage space, it is to be expected that cloud storage will soon generate a high amount of Internet traffic. Very little is known about the architecture and the performance of such systems, and the workload they have to face. This understanding is essential for designing efficient cloud storage systems and predicting their impact on the network.

This paper presents a characterization of Dropbox, the leading solution in personal cloud storage in our datasets. By means of passive measurements, we analyze data from four vantage points in Europe, collected during 42 consecutive days. Our contributions are threefold: Firstly, we are the first to study Dropbox, which we show to be the most widely-used cloud storage system, already accounting for a volume equivalent to around one third of the YouTube traffic at campus networks on some days. Secondly, we characterize the workload typical users in different environments generate to the system, highlighting how this reflects on network traffic. Lastly, our results show possible performance bottlenecks caused by both the current system architecture and the storage protocol. This is exacerbated for users connected far from control and storage data-centers.

Categories and Subject Descriptors

C.2 [Computer-Communication Networks]: Miscellaneous; C.4 [Performance of Systems]: Measurement Techniques

General Terms

Measurement, Performance

Keywords

Dropbox, Cloud Storage, Internet Measurement.

1. INTRODUCTION

Recent years have seen the introduction of cloud-based services [18], offering people and enterprises computing and storage capacity on remote data-centers and abstracting away the complexity of hardware management. We witness a gold rush to offer storage capabilities on the Internet, with players like Microsoft, Google and Amazon directly entering the market at the end of April 2012. They face a crowded scenario against popular solutions like SugarSync, Box.com, UbuntuOne, and Dropbox. The latter, active since 2007, currently counts over 50 million users, uploading more than 500 million files daily1.

1 http://www.dropbox.com/news

It is thus not surprising that cloud storage has gained increasing momentum within the research community. Some works explicitly consider system architecture design [12], while others focus on security and privacy issues concerning the storage of user data [19]. Considering commercial offers, little is known, with most players providing proprietary solutions and not willing to share information. Some studies present a comparison among different storage providers [13, 17]: by running benchmarks, they focus on the performance achieved by users, but miss the characterization of the typical usage of a cloud service, and the impact of user and system behavior on personal storage applications.

In this paper, we provide a characterization of cloud-based storage systems. We analyze traffic collected from two university campuses and two Points of Presence (POP) in a large Internet Service Provider (ISP) for 42 consecutive days. We first devise a methodology for monitoring cloud storage traffic, which, being based on TLS encryption, is not straightforward to understand. We then focus on Dropbox, which we show to be the most widely-used cloud storage system in our datasets. Dropbox already accounts for about 100GB of daily traffic in one of the monitored networks – i.e., 4% of the total traffic or around one third of the YouTube traffic at the same network. We focus first on the service performance characterization, highlighting possible bottlenecks and suggesting countermeasures. Then, we detail user habits, thus providing an extensive characterization of the workload the system has to face.

To the best of our knowledge, we are the first to provide an analysis of Dropbox usage on the Internet. The authors of [11] compare Dropbox, Mozy, Carbonite and CrashPlan, but only a simplistic active experiment is provided to assess them. In [16], the possibility of unauthorized data access and the security implications of storing data in Dropbox are analyzed. We follow a similar methodology to dissect the Dropbox protocol, but focus on a completely different topic. Considering storage systems in general, [8, 9] study security and privacy implications of the deployment of data deduplication – the mechanism in place in Dropbox for avoiding the storage of duplicate data. Similarly, [1] presents a performance analysis of the Amazon Web Services (AWS) in general, but does not provide insights into personal storage. Finally, several works characterize popular services, such as social networks [7, 15] or YouTube [3, 6]. Our work goes in a similar direction, shedding light on Dropbox and possibly other related systems. Our main findings are:
• We already see a significant amount of traffic related to Dropbox, especially on campus networks, where people with more technical knowledge are found. We expect these systems to become popular also at home, where penetration is already above 6%.

• We highlight that Dropbox performance is mainly driven by the distance between clients and storage data-centers. In addition, short data transfer sizes coupled with a per-chunk acknowledgment mechanism impair transfer throughput, which is as little as 530kbits/s on average. A bundling scheme, delayed acknowledgments, or a finer placement of storage servers could be adopted to improve performance.

• Considering home users' behavior, four groups are clear: 7% of people only upload data; around 26% only download, and up to 37% of people do both. The remaining 30% abandon their clients running, seldom exchanging files.

• Interestingly, one of the most appreciated features of Dropbox is the simplified ability to share content: 30% of home users have more than one linked device, and 70% share at least one folder. At campuses, the number of shared folders increases, with 40% of users sharing more than 5 folders.

Our findings show that personal cloud storage applications are data hungry, and user behavior deeply affects their network requirements. We believe that our results are useful for both the research community and ISPs to understand and to anticipate the impact of massive adoption of such solutions. Similarly, our analysis of the Dropbox performance is a reference for those designing protocols and provisioning data-centers for similar services, with valuable lessons about possible bottlenecks introduced by some design decisions.

The remainder of this paper is organized as follows: Sec. 2 provides insight into the Dropbox architecture. Sec. 3 describes our data collection and compares the popularity of well-known cloud-based storage systems. Sec. 4 presents a characterization of Dropbox performance. User habits and the generated workload are presented in Sec. 5. While those sections mostly focus on the usage of the Dropbox client software, Sec. 6 discusses the less popular Web interface. Finally, Sec. 7 concludes this paper, and Appendix A provides some additional characteristics of Dropbox storage traffic.

2. DROPBOX OVERVIEW

2.1 The Dropbox Client

The Dropbox native client is implemented mostly in Python, using third-party libraries such as librsync2. The application is available for Microsoft Windows, Apple OS X and Linux3. The basic object in the system is a chunk of data with size of up to 4MB. Files larger than that are split into several chunks, each treated as an independent object. Each chunk is identified by a SHA256 hash value, which is part of meta-data descriptions of files. Dropbox reduces the amount of exchanged data by using delta encoding when transmitting chunks. It also keeps locally in each device a database of meta-data information (updated via incremental updates) and compresses chunks before submitting them. In addition, the client offers the user the ability to control the maximum download and upload speed.

2 http://librsync.sourceforge.net/
3 Mobile device applications access Dropbox on demand using APIs; those are not considered in this work.
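To make the chunking scheme concrete, the following minimal sketch (ours, not Dropbox code) splits a file into chunks of at most 4MB and computes the SHA256 hash that identifies each chunk; delta encoding and compression are omitted:

import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # maximum chunk size used by the client (4MB)

def chunk_hashes(path):
    """Return the SHA256 hex digest of each (at most 4MB) chunk of a file."""
    hashes = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            hashes.append(hashlib.sha256(chunk).hexdigest())
    return hashes

A file of, e.g., 9MB is thus handled as three independent objects (4MB + 4MB + 1MB), each referenced by its hash in the file meta-data.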
Two major components can be identified in the Dropbox architecture: the control and the data storage servers. The former are under direct control of Dropbox Inc., while Amazon Elastic Compute Cloud (EC2) and Simple Storage Service (S3) are used as storage servers. In both cases, sub-domains of dropbox.com are used for identifying the different parts of the service offering a specific functionality, as detailed in Tab. 1. HTTPS is used to access all services, except the notification service, which runs over HTTP.

Table 1: Domain names used by different Dropbox services. Numeric suffixes are replaced by an X letter.

sub-domain          Data-center   Description
client-lb/clientX   Dropbox       Meta-data
notifyX             Dropbox       Notifications
api                 Dropbox       API control
www                 Dropbox       Web servers
d                   Dropbox       Event logs
dl                  Amazon        Direct links
dl-clientX          Amazon        Client storage
dl-debugX           Amazon        Back-traces
dl-web              Amazon        Web storage
api-content         Amazon        API storage

2.2 Understanding Dropbox Internals

To characterize the usage of the service from passive measurements, we first gained an understanding of the Dropbox client protocol. We performed several active experiments to observe what information is exchanged after a particular operation. For instance, among others, we documented the traffic generated when adding or removing files on local folders, when downloading new files and when creating new folders. During our data collection, Dropbox client version 1.2.52 was being distributed as the stable version4.

4 http://www.dropbox.com/release_notes

Since most client communications are encrypted with TLS, and no description about the protocol is provided by Dropbox, we set up a local testbed, in which a Linux PC running the Dropbox client was instructed to use a Squid proxy server under our control. On the latter, the module SSL-bump5 was used to terminate SSL connections and save decrypted traffic flows. The memory area where the Dropbox application stores trusted certificate authorities was modified at run-time to replace the original Dropbox Inc. certificate by the self-signed one signing the proxy server. By means of this setup, we were able to observe and to understand the Dropbox client communication.

5 http://wiki.squid-cache.org/Features/SslBump
Figure 1: An example of the Dropbox protocol. (Message exchange between the client, the Dropbox control servers (client/client-lb) and the Amazon storage servers (dl-client): register_host, list, commit_batch, need_blocks, store and ok messages.)
For instance, Fig. 1 illustrates the messages we observed while committing a batch of chunks. After registering with the Dropbox control center via a clientX.dropbox.com server, the list command retrieves meta-data updates. As soon as new files are locally added, a commit_batch command (on client-lb.dropbox.com) submits meta-data information. This can trigger store operations, performed directly with Amazon servers (on dl-clientX.dropbox.com). Each chunk store operation is acknowledged by one OK message. As we will see in Sec. 4, this acknowledgment mechanism might originate performance bottlenecks. Finally, as chunks are successfully submitted, the client exchanges messages with the central Dropbox servers (again on client-lb.dropbox.com) to conclude the transactions. Note that these messages committing transactions might occur in parallel with newer store operations.

A complete description of the Dropbox protocols is outside the scope of this paper. We, however, exploit this knowledge to tag the passively observed TCP flows with the likely commands executed by the client, even if we have no access to the content of the (encrypted) connections. In the following, we describe the protocols used to exchange data with the Dropbox control servers and with the storage servers at Amazon.

2.3 Client Control Flows

The Dropbox client exchanges control information mostly with servers managed directly by Dropbox Inc. We identified three sub-groups of control servers: (i) notification, (ii) meta-data administration, and (iii) system-log servers. System-log servers collect run-time information about the clients, including exception back-traces (via Amazon, on dl-debug.dropbox.com), and other event logs possibly useful for system optimization (on d.dropbox.com). Since flows to those servers are not directly related to the usage of the system and do not carry much data, they have not been considered further. In the following, we describe the key TCP flows to the meta-data and notification servers.

2.3.1 Notification Protocol

The Dropbox client keeps continuously opened a TCP connection to a notification server (notifyX.dropbox.com), used for receiving information about changes performed elsewhere. Contrary to other traffic, notification connections are not encrypted. Delayed HTTP responses are used to implement a push mechanism: a notification request is sent by the local client asking for eventual changes; the server response is received periodically about 60 seconds later in case of no change; after receiving it, the client immediately sends a new request. Changes on the central storage are instead advertised as soon as they are performed.

Each device linked to Dropbox has a unique identifier (host_int). Unique identifiers (called namespaces) are also used for each shared folder. The client identifier is sent in notification requests, together with the current list of namespaces. Devices can, therefore, be identified in network traces by passively watching notification flows. Finally, different devices belonging to a single user can be inferred as well, by comparing namespace lists.
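The following sketch emulates this long-polling behavior. It is only an illustration of the mechanism described above: the host, path and parameter names are hypothetical, not the actual Dropbox notification API.

import requests  # third-party HTTP client library

def handle_changes(payload):
    """Placeholder: in a real client this would start a synchronization."""
    print("changes advertised:", payload)

def notification_loop(host_int, namespaces):
    # The device identifier and the list of namespaces are sent with every
    # request, which is what allows devices to be identified in traces.
    params = {"host_int": host_int,
              "ns_list": ",".join(str(n) for n in namespaces)}
    while True:
        # With no change, the server replies only after about 60 seconds;
        # a longer timeout avoids treating the delayed response as an error.
        resp = requests.get("http://notify.example.net/subscribe",
                            params=params, timeout=90)
        if resp.ok and resp.text.strip() not in ("", "{}"):
            handle_changes(resp.text)
        # a new request is issued immediately after each response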
2.3.2 Meta-data Information Protocol

Authentication and file meta-data administration messages are exchanged with a separate set of servers (client-lb.dropbox.com and/or clientX.dropbox.com). Typically, synchronization transactions start with messages to meta-data servers, followed by a batch of either store or retrieve operations through the Amazon systems. As data chunks are successfully exchanged, the client sends messages to meta-data servers to conclude the transactions (see Fig. 1). Due to an aggressive TCP connection timeout handling, several short TLS connections to meta-data servers can be observed during this procedure.

Server responses to client messages can include general control parameters. For instance, our experiments in the testbed reveal that the current version of the protocol limits the number of chunks to be transferred to at most 100 per transaction. That is, if more than 100 chunks need to be exchanged, the operation is split into several batches, each of at most 100 chunks. Such a parameter shapes the traffic produced by the client, as analysed in Sec. 4.
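As a small illustration of this batching rule (illustrative helper, not Dropbox code), a list of chunks can be split as follows:

MAX_CHUNKS_PER_BATCH = 100  # protocol limit observed in our testbed

def split_into_batches(chunks, limit=MAX_CHUNKS_PER_BATCH):
    """Split a list of chunks into consecutive batches of at most 100 items."""
    return [chunks[i:i + limit] for i in range(0, len(chunks), limit)]

# Example: 250 chunks are exchanged in three batches of 100, 100 and 50 chunks.
assert [len(b) for b in split_into_batches(list(range(250)))] == [100, 100, 50]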
2.4 Data Storage Flows

As illustrated in Fig. 1, all store and retrieve operations are handled by the Amazon systems. More than 500 distinct domain names (dl-clientX.dropbox.com) point to Amazon servers. A subset of those aliases is sent to clients regularly. Clients rotate in the received lists when executing storage operations, distributing the workload.

Typically, storage flows carry either store commands or retrieve commands. This permits storage flows to be divided in two groups by checking the amount of data downloaded and uploaded in each flow. By means of the data collected in our test environment, we documented the overhead of store and retrieve commands and derived a method for labeling the flows. Furthermore, we identified a direct relationship between the number of TCP segments with the PSH flag set in storage flows and the number of transported chunks. Appendix A presents more details about our methodology, as well as some results validating that the models built in our test environment represent the traffic generated by real users satisfactorily. We use this knowledge in the next sections for characterizing the system performance and the workload.
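A much simplified version of this labeling heuristic is sketched below; the real method, detailed in Appendix A, also accounts for per-command overheads measured in our testbed.

def label_storage_flow(bytes_up, bytes_down):
    """Label a storage flow as 'store' or 'retrieve' from its byte counts.

    Simplified sketch: since a storage flow carries either store or retrieve
    commands (never both), the dominant direction identifies the operation.
    """
    return "store" if bytes_up >= bytes_down else "retrieve"

# Example: a flow uploading 8MB while downloading a few kB of acknowledgments
# is labeled as a store flow.
print(label_storage_flow(bytes_up=8_000_000, bytes_down=12_000))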
2.5 Web Interface and Other Protocols

Content stored in Dropbox can also be accessed through Web interfaces. A separate set of domain names is used to identify the different services and can thus be exploited to distinguish the performed operations. For example, URLs containing dl-web.dropbox.com are used when downloading private content from user accounts. The domain dl.dropbox.com provides public direct links to shared files. As shown in Sec. 6, the latter is the preferred Dropbox Web interface.
In addition, other protocols are available, like the LAN Synchronization Protocol and the public APIs. However, these protocols do not provide useful information for the aim of this paper. They are therefore mostly ignored in the remainder of this paper.

3. DATASETS AND POPULARITY

3.1 Methodology

We rely on passive measurements to analyze the Dropbox traffic in operational networks. We use Tstat [5], an open source monitoring tool developed at Politecnico di Torino, to collect data. Tstat monitors each TCP connection, exposing information about more than 100 metrics6, including client and server IP addresses, amount of exchanged data, eventual retransmitted segments, Round Trip Time (RTT) and the number of TCP segments that had the PSH flag set [14].

6 See http://tstat.tlc.polito.it for details.

Specifically targeting the analysis of Dropbox traffic, we implemented several additional features. First, given that Dropbox relies on HTTPS, we extracted the TLS/SSL certificates offered by the server using a classic DPI approach. Our analysis shows that the string *.dropbox.com is used to sign all communications with the servers. This is instrumental for traffic classification of the services. Second, we augmented the exposed information by labeling server IP addresses with the original Fully Qualified Domain Name (FQDN) the client requested to the DNS server [2]. This is a key feature to reveal information on the server that is being contacted (see Tab. 1) and allows us to identify each specific Dropbox-related service. Third, Tstat was instructed to expose the list of device identifiers and folder namespaces exchanged with the notification servers.
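The sketch below illustrates how these two pieces of information can be combined to attribute a flow to a specific Dropbox service. It is a hypothetical helper of ours, with the sub-domain patterns taken from Tab. 1:

# Map sub-domain prefixes (cf. Tab. 1) to the Dropbox service they identify.
# Order matters: more specific prefixes must be tested first.
SERVICE_BY_PREFIX = {
    "client":      "Meta-data",
    "notify":      "Notifications",
    "api-content": "API storage",
    "api":         "API control",
    "www":         "Web servers",
    "d.":          "Event logs",
    "dl-client":   "Client storage",
    "dl-debug":    "Back-traces",
    "dl-web":      "Web storage",
    "dl.":         "Direct links",
}

def classify_flow(tls_server_name, dns_fqdn):
    """Return the Dropbox service a flow belongs to, or None.

    A flow is attributed to Dropbox when the TLS certificate / server name or
    the DNS-derived FQDN matches *.dropbox.com; the specific service is then
    derived from the sub-domain the client resolved.
    """
    name = dns_fqdn or tls_server_name or ""
    if not name.endswith(".dropbox.com"):
        return None
    sub = name[: -len(".dropbox.com")] + "."  # trailing dot helps short names
    for prefix, service in SERVICE_BY_PREFIX.items():
        if sub.startswith(prefix):
            return service
    return "other Dropbox sub-domain"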
3.2 Datasets

We installed Tstat at 4 vantage points in 2 European countries and collected data from March 24, 2012 to May 5, 2012. This setup provided a pool of unique datasets, allowing us to analyze the use of cloud storage in different environments, which vary in both the access technology and the typical user habits. Tab. 2 summarizes our datasets, showing, for each vantage point, the access technologies present in the monitored network, the number of unique client IP addresses, and the total amount of data observed during the whole period.

Table 2: Datasets overview.

Name      Type            IP Addrs.  Vol. (GB)
Campus 1  Wired           400        5,320
Campus 2  Wired/Wireless  2,528      55,054
Home 1    FTTH/ADSL       18,785     509,909
Home 2    ADSL            13,723     301,448

The datasets labeled Home 1 and Home 2 consist of ADSL and Fiber to the Home (FTTH) customers of a nation-wide ISP. Customers are provided with static IP addresses, but they might use WiFi routers at home to share the connection. Campus 1 and Campus 2 were instead collected in academic environments: Campus 1 mostly monitors wired workstations in research and administrative offices of the Computer Science Department in a European university. Campus 2 accounts for all traffic at the border routers of a second university, including campus-wide wireless access points and student houses. In this latter scenario, NAT and HTTP proxy-ing are very common, and DNS traffic was not exposed to the probe. For privacy reasons, our probes export only flows and the extra information described in the previous section. All payload data are discarded directly in the probe. To complete our analysis, a second dataset was collected in Campus 1 in June/July 2012.

3.3 Popularity of Different Storage Providers

We present a comparison of the popularity of cloud-based storage systems. We explicitly consider the following services: Dropbox, Google Drive, Apple iCloud and Microsoft SkyDrive. Other less known services (e.g., SugarSync, Box.com and UbuntuOne) were aggregated into the Others group. We rely on both the extracted TLS server name and DNS FQDN to classify flows as belonging to each service.

We first study the popularity of the different services in terms of unique clients. We use the Home 1 dataset because IP addresses are statically assigned to households and, therefore, are a reliable estimation of the number of installations. Fig. 2(a) reports7 the number of distinct IP addresses that contacted at least once a storage service in a given day. iCloud is the most accessed service, with about 2,100 households (11.1%), showing the high popularity of Apple devices. Dropbox comes second, with about 1,300 households (6.9%). Other services are much less popular (e.g., 1.7% for SkyDrive). Interestingly, Google Drive appears immediately on the day of its launch (April 24th, 2012).

7 A probe outage is visible on April 21, 2012.

Figure 2: Popularity of cloud storage in Home 1: (a) IP addresses; (b) data volume.

Fig. 2(b) reports the total data volume for each service in Home 1. Dropbox tops all other services by one order of magnitude (note the logarithmic y-scale), with more than 20GB of data exchanged every day. iCloud volume is limited despite the higher number of devices, because the service does not allow users to synchronize arbitrary files. SkyDrive and Google Drive show a sudden increase in volume after their public launch in April.
Figure 3: YouTube and Dropbox in Campus 2 (share of the total traffic per day).

Figure 4: Traffic share of Dropbox servers (fraction of bytes and flows per vantage point, split into client, Web and API storage, client, notification and Web control, system log and other flows).
Fig. 3 compares the Dropbox and YouTube shares of the total traffic volume in Campus 2. Apart from the variation reflecting the weekly and holiday pattern, a high fraction is seen for Dropbox daily. Note that in this network the traffic exchanged with Dropbox is close to 100GB per working day: that is already 4% of all traffic, or a volume equivalent to about one third of YouTube traffic in the same day!

These findings highlight an increasing interest for cloud-based storage systems, showing that people are eager to make use of remote storage space. Cloud-based storage is already popular, with 6-12% of home users regularly accessing one or more of the services. Dropbox is by far the most used system in terms of traffic volume. Its overall traffic is summarized in Tab. 3, where we can see the number of flows, data volume, and devices linked to Dropbox in the monitored networks. Our datasets account for more than 11,000 Dropbox devices, uniquely identified by their host_int (see Sec. 2.3.1). The traffic generated by the Web interface and by public APIs is also included. In total, more than 3.5TB were exchanged with Dropbox servers during our capture. In the following, we restrict our attention to Dropbox only.

Table 3: Total Dropbox traffic in the datasets.

Name      Flows      Vol. (GB)  Devices
Campus 1  167,189    146        283
Campus 2  1,902,824  1,814      6,609
Home 1    1,438,369  1,153      3,350
Home 2    693,086    506        1,313
Total     4,204,666  3,624      11,561

4. DROPBOX PERFORMANCE

4.1 Traffic Breakdown: Storage and Control

To understand the performance of the Dropbox service and its architecture, we first study the amount of traffic handled by the different sets of servers. Fig. 4 shows the resulting traffic breakdown in terms of traffic volume and number of flows. From the figure, it emerges that the Dropbox client application is responsible for more than 80% of the traffic volume in all vantage points, which shows that this application is highly preferred over the Web interfaces for exchanging data. A significant portion of the volume (from 7% to 10%) is generated by direct link downloads and the main Web interface (both represented as Web in Fig. 4). In home networks, a small but non-negligible volume is seen to the Dropbox API (up to 4%). Finally, the data volume caused by control messages is negligible in all datasets.

When considering the number of flows, the control servers are the major contributors (more than 80% of the flows, depending on the dataset). The difference in the percentage of notification flows – around 15% in Campus 2, Home 1 and Home 2, and less than 3% in Campus 1 – is caused by the difference in the typical duration of Dropbox sessions in those networks, which will be further studied in Sec. 5.5.

4.2 Server Locations and RTT

We showed in previous sections that the Dropbox client protocol relies on different servers to accomplish typical tasks such as file synchronization. Dropbox distributes the load among its servers both by rotating IP addresses in DNS responses and by providing different lists of DNS names to each client. In the following, we want to understand the geographical deployment of this architecture and its consequences on the perceived RTT.

4.2.1 Server Locations

Names in Tab. 1 terminated by a numerical suffix are normally resolved to a single server IP address8 and clients are in charge of selecting which server will be used in a request. For instance, meta-data servers are currently addressed by a fixed pool of 10 IP addresses and notification servers by a pool of 20 IP addresses. Storage servers are addressed by more than 600 IP addresses from Amazon data-centers. Fig. 5 shows the number of contacted storage servers per day in our vantage points. The figure points out that clients in Campus 1 and Home 2 do not reach all storage servers daily. In both Campus 2 and Home 1, more servers are instead contacted because of the higher number of devices on those vantage points. Routing information suggests that all these control and storage servers are located in the U.S. Our experiments, however, do not provide a finer granularity regarding physical server locations.

8 Meta-data servers are addressed in both ways, depending on the executed command, via client-lb or clientX.

In order to verify the current Dropbox setup worldwide, we performed active measurements using PlanetLab. By selecting nodes from 13 countries in 6 continents, we checked which IP addresses are obtained when resolving the Dropbox DNS names seen in our passive measurements, as well as the
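A reduced, single-vantage-point version of this active check could look like the sketch below (an illustrative script of ours, not the PlanetLab deployment; the host names merely follow the naming patterns of Tab. 1):

import socket

# Example names following the patterns of Tab. 1 (numeric suffixes vary).
NAMES = ["client-lb.dropbox.com", "notify1.dropbox.com", "dl-client1.dropbox.com"]

def resolve_all(names):
    """Map each DNS name to the sorted set of IPv4 addresses it resolves to."""
    result = {}
    for name in names:
        try:
            infos = socket.getaddrinfo(name, 443, socket.AF_INET, socket.SOCK_STREAM)
            result[name] = sorted({info[4][0] for info in infos})
        except socket.gaierror:
            result[name] = []  # name not resolvable from this vantage point
    return result

if __name__ == "__main__":
    for name, addrs in resolve_all(NAMES).items():
        print(name, addrs)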
Figure 5: Number of storage server IP addresses contacted per day at each vantage point.

Figure: CDF of the flow sizes (bytes) of store and retrieve flows at each vantage point.
… than 10ms). We assume that they are caused by changes in the IP route, since the same behavior is not noticeable …

… distributions reinforce our conjecture about the dominance of deltas and small files in Dropbox usage habits: most flows are very small and composed by few chunks. Most of the remaining flows have the maximum allowed number of chunks per batch and, therefore, are strongly shaped by the protocol design of Dropbox.

4.4 Storage Throughput

Our measurements in Sec. 4.2 indicate that Dropbox relies on centralized data-centers for control and storage. This raises the question on the service performance for users not located near those data-centers.
The throughput of the storage operations is certainly one of the key performance metrics. Fig. 9 depicts the throughput achieved by each storage flow in Campus 2. The figure shows separate plots for the retrieve and store operations. Similar plots would be obtained using the Campus 1 dataset. Home 1 and Home 2 are left out of this analysis since the access technology (ADSL, in particular) might be a bottleneck for the system in those networks. The x-axis represents the number of bytes transferred in the flow, already subtracting the typical SSL overheads (see Appendix A for details), while the y-axis shows the throughput calculated as the ratio between transferred bytes and duration of each flow (note the logarithmic scales). The duration was accounted as the time between the first TCP SYN packet and the last packet with payload in the flow, ignoring connection termination delays. Flows are represented by different marks according to their number of chunks.

Figure 9: Throughput of storage flows in Campus 2 ((a) store, (b) retrieve; flows are marked according to their number of chunks: 1, 2-5, 6-50, 51-100).

Overall, the throughput is remarkably low. The average throughput (marked with dashed horizontal lines in the figure) is not higher than 462kbits/s for store flows and 797kbits/s for retrieve flows in Campus 2 (359kbits/s and 783kbits/s in Campus 1, respectively). In general, the highest observed throughput (close to 10Mbits/s in both datasets) is only achieved by flows carrying more than 1MB. Moreover, flows achieve lower throughput as the number of chunks increases. This can be seen by the concentration of flows with a high number of chunks in the bottom part of the plots for any given size.

TCP start-up times and application-layer sequential acknowledgments are two major factors limiting the throughput, affecting flows with a small amount of data and flows with a large number of chunks, respectively. In both cases, the high RTT between clients and data-centers amplifies the effects. In the following, those problems are detailed.

4.4.1 TCP Start-up Effects

Flows carrying a small amount of data are limited by TCP slow start-up times. This is particularly relevant in the analyzed campus networks, since both data link capacity and RTT to storage data-centers are high in these networks. Fig. 9 shows the maximum throughput θ for completing the transfer of a specific amount of data, assuming that flows stayed in the TCP slow start phase. We computed the latency as in [4], with an initial congestion window of 3 segments and adjusting the formula to include overheads (e.g., the 3 RTTs of SSL handshakes in the current Dropbox setup).
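The bound can be reproduced with a back-of-the-envelope model such as the following sketch. It is a simplification under our own assumptions (exponential window growth from 3 segments, negligible transmission time, 1 RTT of TCP handshake plus the 3 RTTs of SSL handshakes), not the exact formula of [4]:

import math

def slow_start_bound(nbytes, rtt=0.1, mss=1460, init_cwnd=3, handshake_rtts=4):
    """Rough upper bound on throughput (bits/s) for a transfer that never
    leaves TCP slow start; handshake_rtts covers TCP plus SSL handshakes."""
    segments = math.ceil(nbytes / mss)
    # rounds needed to send `segments` with a window doubling every RTT
    rounds = math.ceil(math.log2(segments / init_cwnd + 1))
    latency = (handshake_rtts + rounds) * rtt
    return 8 * nbytes / latency

# e.g., a 100kB transfer over a 100ms RTT path is bounded to ~0.89 Mbits/s
print(round(slow_start_bound(100_000) / 1e6, 2), "Mbits/s")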
Fig. 9 shows that θ approximates the maximum throughput satisfactorily. This is clear for flows with a single chunk, which suffer less from application-layer impediments. The bound is not tight for retrieve flows, because their latency includes at least 1 server reaction time.
Table 4: Performance in Campus 1 before and after the deployment of a bundling mechanism (Mar/Apr vs. Jun/Jul).
Figure 12: Distribution of the number of devices per household using the Dropbox client.

Figure 13: Number of namespaces per device.

… with more than 1 device (around 25% of the total), at least 1 folder is shared among the devices. This further confirms our findings about the typical use of Dropbox among heavy users for the synchronization of devices. Since the LAN Sync Protocol traffic does not reach our probes, we cannot quantify the amount of bandwidth saved in these households by the use of the protocol. We can conclude, however, that no more than 25% of the households are profiting from that. The remaining users always rely on central storage data-centers for their data transfers.

5.3 Shared Folders

In order to measure to what extent Dropbox is being used for content sharing or collaborative work, we analyze namespace identifications in Home 1 and Campus 1 traffic (in Home 2 and Campus 2 this information was not exposed). Fig. 13 shows the distribution of the number of namespaces per device. By analyzing Campus 1 data on different dates, it is possible to conclude that the number of namespaces per device is not stationary and has a slightly increasing trend. Fig. 13 was built considering the last observed number of namespaces on each device in our datasets.

In both networks the number of users with only 1 namespace (the users' root folder) is small: 13% in Campus 1 and 28% in Home 1. In general, users in Campus 1 have more namespaces than in Home 1. The percentage of users having 5 or more namespaces is equal to 50% in the former, and 23% in the latter. When considering only IP addresses assigned to workstations in Campus 1, each device has on average 3.86 namespaces.

5.4 Daily Usage

We characterize whether the use of the Dropbox client has any typical seasonality. Fig. 14 shows the time series of the number of distinct devices starting up a Dropbox session in each vantage point per day. The time series are normalized by the total number of devices in each dataset. Around 40% of all devices start at least one session every day in home networks, including weekends11. In campus networks, on the other hand, there is a strong weekly seasonality.

11 Note the exceptions around holidays in April and May.

Figure 14: Distinct device start-ups per day – fraction of the number of devices in each probe.

At a finer time scale (1 hour bins), we observe that the service usage follows a clear day-night pattern. Fig. 15 depicts the daily usage of the Dropbox client. All plots were produced by averaging the quantities per interval across all working days in our datasets.

Fig. 15(a) shows the fraction of distinct devices that start a session in each interval, while Fig. 15(b) depicts the fraction of devices that are active (i.e., are connected to Dropbox) per time interval. From these figures we can see that Dropbox usage varies strongly in different locations, following the presence of users in the environment. For instance, in Campus 1, session start-ups have a clear relation with employees' office hours. Session start-ups are better distributed during the day in Campus 2 as a consequence of the transit of students at wireless access points. In home networks, peaks of start-ups are seen early in the morning and during the evenings. Overall, all time series of active devices (Fig. 15(b)) are smooth, showing that the number of active users at any time of the day is easily predictable.
Figure 15: Daily usage of the Dropbox client per time interval: (a) session start-ups; (b) active devices.

Figure 16: Distribution of session durations.

5.5 Session Duration
REFERENCES

[10] … Home Gateway Characteristics. In Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement, IMC'10, pages 260–266, 2010.
[11] W. Hu, T. Yang, and J. N. Matthews. The Good, the Bad and the Ugly of Consumer Cloud Storage. ACM SIGOPS Operating Systems Review, 44(3):110–115, 2010.
[12] A. Lenk, M. Klems, J. Nimis, S. Tai, and T. Sandholm. What's Inside the Cloud? An Architectural Map of the Cloud Landscape. In Proceedings of the 2009 ICSE Workshop on Software Engineering Challenges of Cloud Computing, CLOUD'09, pages 23–31, 2009.
[13] A. Li, X. Yang, S. Kandula, and M. Zhang. CloudCmp: Comparing Public Cloud Providers. In Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement, IMC'10, pages 1–14, 2010.
[14] M. Mellia, M. Meo, L. Muscariello, and D. Rossi. Passive Analysis of TCP Anomalies. Computer Networks, 52(14):2663–2676, 2008.
[15] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee. Measurement and Analysis of Online Social Networks. In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, IMC'07, pages 29–42, 2007.
[16] M. Mulazzani, S. Schrittwieser, M. Leithner, M. Huber, and E. Weippl. Dark Clouds on the Horizon: Using Cloud Storage as Attack Vector and Online Slack Space. In Proceedings of the 20th USENIX Conference on Security, SEC'11, 2011.
[17] G. Wang and T. E. Ng. The Impact of Virtualization on Network Performance of Amazon EC2 Data Center. In Proceedings of the 29th IEEE INFOCOM, pages 1–9, 2010.
[18] Q. Zhang, L. Cheng, and R. Boutaba. Cloud Computing: State-of-the-Art and Research Challenges. Journal of Internet Services and Applications, 1:7–18, 2010.
[19] M. Zhou, R. Zhang, W. Xie, W. Qian, and A. Zhou. Security and Privacy in Cloud Computing: A Survey. In Sixth International Conference on Semantics Knowledge and Grid, SKG'10, pages 105–112, 2010.

APPENDIX

A. STORAGE TRAFFIC IN DETAILS

A.1 Typical Flows

Fig. 19 shows typical storage flows observed in our testbed. All packets exchanged during initial and final handshakes are depicted. The data transfer phases (in gray) are shortened for the sake of space. Key elements for our methodology, such as TCP segments with the PSH flag set and flow durations, are highlighted. To confirm that these models are valid in real clients, Tstat in Campus 1 was set to record statistics about the first 10 messages delimited by TCP segments with the PSH flag set. In the following, more details of our methodology and the results of this validation are presented.

Figure 19: Typical flows in storage operations ((a) store, (b) retrieve).

A.2 Tagging Storage Flows

Storage flows are first identified using FQDNs and SSL certificate names, as explained in Sec. 3.1. After that, they are classified based on the number of bytes sent by each endpoint of the TCP connection. The method was built based on the assumption that a storage flow is used either for storing chunks or for retrieving chunks, but never for both. This assumption is supported by two factors: (i) when both operations happen in parallel, Dropbox uses separate connections to speed up synchronization; (ii) idle storage connections are kept open waiting for new commands only for a short time interval (60s).

Our assumption could possibly be violated during this idle interval. In practice, however, this seems to be hardly the case. Fig. 20 illustrates that by plotting the number of bytes in storage flows in Campus 1. Flows are concentrated near the axes, as expected under our assumption.
Finally, we quantify the possible error caused by violations of our assumption. In all vantage points, flows tagged as store download less than 1% of the total storage volume. Since this includes protocol overheads (e.g., HTTP OK messages …

Figure 20: Bytes exchanged in storage flows in Campus 1. Note the logarithmic scales.

A.3 Number of Chunks

The number of chunks transported in a storage flow (c) is estimated by counting TCP segments with the PSH flag set (s) in the reverse direction of the transfer, as indicated in Fig. 19. For retrieve flows, c = (s − 2)/2. For store flows, c = s − 3 or c = s − 2, depending on whether the connection is passively closed by the server or not. This can be inferred from the time difference between the last packet with payload from the client and the last one from the server: when the server closes an idle connection, the difference is expected to be around 1 minute (otherwise, only a few seconds). Tstat already records the timestamps of such packets by default.
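The estimate can be written down directly as in the sketch below; the 30-second threshold separating the two store cases, and the association of the 1-minute idle close with c = s − 3, are our own reading of the behavior described above.

def estimate_chunks(operation, s, last_client_payload_ts, last_server_payload_ts):
    """Estimate the number of chunks c in a storage flow from the number s of
    TCP segments with the PSH flag set in the reverse direction of the transfer."""
    if operation == "retrieve":
        return (s - 2) // 2
    # store flow: infer whether the server passively closed an idle connection
    # (time gap of about one minute) or the client closed it (a few seconds).
    gap = abs(last_server_payload_ts - last_client_payload_ts)
    return s - 3 if gap > 30.0 else s - 2  # threshold chosen by us

# Example: a retrieve flow with 22 PSH segments in the reverse direction
# is estimated to carry (22 - 2) / 2 = 10 chunks.
print(estimate_chunks("retrieve", 22, 0.0, 0.0))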
We validate this relation by dividing the amount of payload (without typical SSL handshakes) in the reverse direction of a transfer by c. This proportion has to be equal to the overhead needed per storage operation. Fig. 21 shows that for the vast majority of store flows the proportion is about 309 bytes per chunk, as expected. In Home 2, the apparently misbehaving client described in Sec. 4.3 biases the distribution: most flows from this client lack acknowledgment messages. Most retrieve flows have a proportion between 362 and 426 bytes per chunk, which are typical sizes of the HTTP request in this command. The exceptions (3% – 8%) might be caused by packet loss in our probes as well as by the flows that both stored and retrieved chunks. Our method underestimates the number of chunks in those cases.

Figure 21: Payload in the reverse direction of storage operations per estimated number of chunks.