Dynamo: Amazon'S Highly Available Key-Value Store: Csci 8101: Advanced Operating Systems Presented By: Chaithra KN

Download as ppt, pdf, or txt
Download as ppt, pdf, or txt
You are on page 1of 23

Dynamo: Amazon's Highly Available Key-Value Store

CScI 8101 : Advanced Operating Systems Presented by: Chaithra KN

Clouded Data: Comprehending Scalable Data Management Systems


Managing petabytes of data for millions of users - challenge for big internet based enterprises such as Google, Yahoo!, and Amazon.

Traditional Relational Database Systems unsuitable (expensive, complex data model, repartitioning is complex)
An architectural redesign of data management systems with requirements like: - High scalability - High availability - Low latency - Weaker consistency - Lower application generality

Clouded Data: Comprehending Scalable Data Management Systems

Some services do not require complex querying and management functionality offered by RDBMS - access data by primary-key only Downgrade some of the service guarantees of traditional RDBMS There is a shift towards NoSQL or non-relational data stores key-value stores. But relational databases are still needed. Example - in online stores, shopping cart and session management can be handled with a key-value system but the checkout process needs a traditional database

Amazon's Dynamo in a nutshell

Highly decentralized architecture similar to Peer-to-peer systems Provides key-value only interface. The value could be just a blob, the data store is not aware of what the contents of value are. Targeted mainly at applications that need an always writeable data store where no updates are rejected Built for latency sensitive applications that require atleast 99.9% of read and write operations to be performed within a few hundred milliseconds Sacrifices consistency under certain failure conditions to

System Assumptions and Requirements

Query Model: simple read and write operations to a data item that is uniquely identified by a key no complex querying or accesses ranging multiple data items ACID Properties: Does not provide consistency and isolation guarantees Efficiency: latency requirements which are generally measured at the 99.9th percentile of the distribution built for applications that have stringent SLA requirements Ex: A service that guarantees to provide a resonse within 300ms for 99.9% of its requests for a peak client load of 500 requests per second

Design requirements

Incremental scalability Symmetry Decentralization Heterogenity Conflict resolution is pushed to reads in order to ensure that writes are never rejected.

Consistent Hashing
Partitioning algorithm used to dynamically partition the data. The output range of a hash function is treated as a fixed circular space or ring. Each node in the system is assigned a random value within this space which represents its position in the ring. Data item's key hashed to determine the node which will process it. Virtual nodes each node assigned to multiple positions Advantage - departure or arrival of a node only affects its immediate neighbors and other nodes remain unaffected.

No need to store system state i.e. no trackers needed to keep track of where my data is going into.

Consistent hashing

Advantages of virtual nodes

If a node becomes unavailable the load handled by this node is evenly dispersed across the remaining available nodes. When a node becomes available again, the newly available node accepts a roughly equivalent amount of load from each of the other available nodes. The number of virtual nodes that a node is responsible can be decided based on its capacity, accounting for heterogeneity in the physical infrastructure Incremental scalability and heterogenity achieved.

Replication

Needed to achieve high availability and durability

Replicates data on N hosts N is configurable

Data stored locally + replicated on coordinator's N-1 clockwise successor nodes.

Preference list list of nodes responsible for storing a particular key.

Vector clocks

Eventual consistency updates propagated to all replicas asynchronously. All replicas eventually become consistent through reconciliation measures taken at the time of read operations. Multiple versions of data exist updates can happen to older versions divergent versions reconciled later. When newer versions subsume previous versions system decides the final version. Version branching conflicting versions of an object the client performs the reconciliation (collapse). One vector clock associated with every version of every object (node,counter) pair. While updating, a client must specify which version it is updating using context obtained from an earlier read. Reconciliation during reads

System interface, execution of get and put


get(key) put(key, context, object) context contains information like version of the object Uses consistency protocol similar to quorum-like systems. R minimum number of nodes that must participate in a successful read operation. W - minimum number of nodes that must participate in a successful write operation. Quorum-like system : R + W > N N top N nodes from the preference list Strict quorum affects availablility and durability.

Selecting a node
Two strategies that a client can use to select a node: (1) route its request through a generic load balancer that will select a node based on load information, or (2) use a partition-aware client library that routes requests directly to the appropriate coordinator nodes. The advantage of the first approach - client does not have to link any code specific to Dynamo in its application

Second strategy - can achieve lower latency because it skips a potential forwarding step.

Sloppy quorum and hinted handoff

Sloppy quorum used to handle temporary server failures and network partitions.

All read and write operations are performed on first N healthy nodes from the preference list, which may or may not always be the first N nodes encountered while walking the consistent hashing ring. * The replacement node will get a hint in its metadata using which it will send replica to the intended recipient has recovered Hinted handoff

* To account for node failures, preference list contains more than N nodes.

Other techniques

Merkel trees to detect inconsistencies between replicas and to minimize the amount of transferred data.

Gossipping Nodes can be added and removed multiple times Gossip-based protocol propagates membership changes and maintains an eventually consistent view of membership.

Implementation

Each storage node has 3 software components: - Request coordination - Membership and failure detection - Local persistence engine

Balancing performance and durability


Buffered read and writes Missing writes if server crashes - One of the N replicas can perform a durable write without affecting performance

Balancing performance and durability

Positives and Negatives


+ No need to store system data separately.No concept of dynamic system state with consistency guarantees on that. + A single node is mapped onto multiple virtual nodes. This leads to uniform data and load distribution. + Merkel tree approach very less data transfer needed to maintain consistency of replicas. + Client applications can tune the value of N, W and R to achieve their desired level of performance, availability and durability. - Static provisioning of hash space can lead to over-provisioning - Conflict reconciliations at application layer to solve inconsistency problems might increase software development cost.

Comparing PNUTS and Dynamo


PNUTS - Hashed / Ordered tables - Generation based versioning - Communication through Pub / Sub YMB infrastructure (optimized for geographically separated replicas) - Partitioning into tablets - Timeline based consistency Dynamo

- Key value pairs - Vector clocks used - Gossip based - Consistent hashing - Eventual consistency and reconciliation

Clouded Data
Ad hoc systems - It remains to be seen if such application level coordination without support from the data management service will be sufficient with more complex operations and heavier workloads. Atomic objects to single data items is this sufficient?

You might also like