Dynamo: Amazon's Highly Available Key-Value Store
CSCI 8101: Advanced Operating Systems
Presented by: Chaithra KN
Traditional relational database systems are unsuitable: expensive, complex data model, and repartitioning is complex.
An architectural redesign of data management systems with requirements such as:
- High scalability
- High availability
- Low latency
- Weaker consistency
- Lower application generality
- Some services do not require the complex querying and management functionality offered by an RDBMS; they access data by primary key only.
- Some of the service guarantees of a traditional RDBMS can be relaxed.
- There is a shift towards NoSQL, i.e., non-relational key-value stores. But relational databases are still needed.
- Example: in online stores, the shopping cart and session management can be handled by a key-value system, but the checkout process needs a traditional database.
- Highly decentralized architecture, similar to peer-to-peer systems.
- Provides a key-value-only interface. The value is an opaque blob; the data store is not aware of its contents.
- Targeted mainly at applications that need an "always writeable" data store where no updates are rejected.
- Built for latency-sensitive applications that require at least 99.9% of read and write operations to be performed within a few hundred milliseconds.
- Sacrifices consistency under certain failure conditions to achieve this level of availability.
- Query model: simple read and write operations on a data item that is uniquely identified by a key; no complex querying or accesses spanning multiple data items.
- ACID properties: does not provide the consistency and isolation guarantees.
- Efficiency: latency requirements are generally measured at the 99.9th percentile of the distribution; built for applications with stringent SLA requirements. Example: a service that guarantees a response within 300 ms for 99.9% of its requests at a peak client load of 500 requests per second.
Design requirements
- Incremental scalability
- Symmetry
- Decentralization
- Heterogeneity
- Conflict resolution is pushed to reads in order to ensure that writes are never rejected.
Consistent Hashing
- Partitioning algorithm used to dynamically partition the data.
- The output range of a hash function is treated as a fixed circular space, or "ring".
- Each node in the system is assigned a random value within this space, which represents its position on the ring.
- A data item's key is hashed to determine the node that will process it.
- Virtual nodes: each node is assigned to multiple positions on the ring.
- Advantage: the departure or arrival of a node only affects its immediate neighbors; other nodes remain unaffected.
No need to store global system state, i.e., no separate tracker is needed to keep track of where data is placed.
Consistent hashing
- If a node becomes unavailable, the load it handled is evenly dispersed across the remaining available nodes.
- When a node becomes available again, it accepts a roughly equivalent amount of load from each of the other available nodes.
- The number of virtual nodes a node is responsible for can be decided based on its capacity, accounting for heterogeneity in the physical infrastructure.
- Incremental scalability and heterogeneity are thereby achieved.
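The ring walk described above can be sketched in a few lines. This is a minimal illustration, not Dynamo's implementation: the class name, the MD5 hash choice, and the vnode-naming scheme (`"node#i"`) are all assumptions for the sketch.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Sketch of a consistent-hash ring with virtual nodes (hypothetical API)."""

    def __init__(self):
        self._ring = []        # sorted list of (hash, node) positions
        self._positions = {}   # node -> list of its position hashes

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node: str, vnodes: int = 8):
        # More virtual nodes => proportionally more load (heterogeneity).
        hashes = [self._hash(f"{node}#{i}") for i in range(vnodes)]
        self._positions[node] = hashes
        for h in hashes:
            bisect.insort(self._ring, (h, node))

    def remove_node(self, node: str):
        # Only this node's positions disappear; other keys are unaffected.
        for h in self._positions.pop(node):
            self._ring.remove((h, node))

    def get_node(self, key: str) -> str:
        # The first position clockwise from the key's hash owns the key.
        h = self._hash(key)
        i = bisect.bisect_right(self._ring, (h, chr(0x10FFFF)))
        return self._ring[i % len(self._ring)][1]
```

Removing a node only reassigns the keys it owned; every other key keeps its owner, which is the incremental-scalability property the slides describe.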
Replication
Vector clocks
- Eventual consistency: updates are propagated to all replicas asynchronously. All replicas eventually become consistent through reconciliation performed at read time.
- Multiple versions of the data can exist; updates can happen to older versions, and divergent versions are reconciled later.
- When newer versions subsume previous versions, the system decides the final version. On version branching (conflicting versions of an object), the client performs the reconciliation (collapse).
- One vector clock, a list of (node, counter) pairs, is associated with every version of every object.
- While updating, a client must specify which version it is updating, using the context obtained from an earlier read.
- Reconciliation happens during reads.
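The (node, counter) comparison that decides "subsumes vs. branch" can be sketched as three small functions. A minimal sketch, assuming clocks are plain dicts from node name to counter; the function names are hypothetical.

```python
def vc_increment(clock: dict, node: str) -> dict:
    """Return a new vector clock with this node's counter bumped."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new

def vc_descends(a: dict, b: dict) -> bool:
    """True if version a subsumes b: every counter in b is <= a's."""
    return all(a.get(n, 0) >= c for n, c in b.items())

def vc_conflict(a: dict, b: dict) -> bool:
    """Neither descends from the other: divergent branches.
    The system cannot pick a winner; the client reconciles at read time."""
    return not vc_descends(a, b) and not vc_descends(b, a)
```

For example, two writes coordinated by different nodes starting from the same parent version yield clocks that conflict, which is exactly the branching case the client must collapse.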
- get(key), put(key, context, object); the context contains information such as the version of the object.
- Uses a consistency protocol similar to quorum-like systems.
- R: minimum number of nodes that must participate in a successful read operation. W: minimum number of nodes that must participate in a successful write operation.
- Quorum-like system: R + W > N, where N is the number of top nodes taken from the preference list.
- A strict quorum would affect availability and durability.
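The R/W/N relationship above can be made concrete with a toy in-memory sketch. This is an assumption-laden illustration, not Dynamo's protocol: replicas are plain dicts, the coordinator is implicit, and a real system would continue writing the remaining replicas asynchronously after the W-th ack.

```python
# Quorum-like read/write sketch over in-memory "replicas".
N, R, W = 3, 2, 2          # R + W > N ensures read/write sets overlap
assert R + W > N

replicas = [dict() for _ in range(N)]   # one store per node in the preference list

def put(key, value):
    """Write to replicas in preference-list order; succeed after W acks."""
    acks = 0
    for store in replicas:
        store[key] = value
        acks += 1
        if acks >= W:
            return True   # remaining replicas would be updated asynchronously
    return acks >= W

def get(key):
    """Collect versions from R replicas; divergent versions go to the client."""
    return [store[key] for store in replicas[:R] if key in store]
```

Because R + W > N, any read set intersects any successful write set, so a read always sees at least one up-to-date version even though the last replica may lag.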
Selecting a node
Two strategies that a client can use to select a node: (1) route its request through a generic load balancer that will select a node based on load information, or (2) use a partition-aware client library that routes requests directly to the appropriate coordinator nodes. The advantage of the first approach is that the client does not have to link any Dynamo-specific code into its application.
The second strategy can achieve lower latency because it skips a potential forwarding step.
- A "sloppy quorum" is used to handle temporary server failures and network partitions: all read and write operations are performed on the first N healthy nodes from the preference list, which may not always be the first N nodes encountered while walking the consistent-hashing ring.
- Hinted handoff: the replacement node keeps a hint in its metadata identifying the intended recipient; once that node has recovered, the replacement sends the replica back to it.
- To account for node failures, the preference list contains more than N nodes.
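The walk-past-unhealthy-nodes behavior and the hint bookkeeping can be sketched as follows. This is a simplified model under stated assumptions: the `Node` class, `up` flag, and function names are hypothetical, and real hinted replicas live in a separate local database scanned periodically.

```python
# Sloppy quorum + hinted handoff sketch (hypothetical node objects).
class Node:
    def __init__(self, name, up=True):
        self.name, self.up = name, up
        self.data = {}      # regular replicas
        self.hints = []     # (intended_node, key, value) held for down nodes

def write_with_handoff(preference_list, key, value):
    """Write to the first healthy node; it records hints for skipped nodes."""
    for i, node in enumerate(preference_list):
        if node.up:
            node.data[key] = value
            for skipped in preference_list[:i]:
                node.hints.append((skipped, key, value))
            return node
    raise RuntimeError("no healthy node in preference list")

def deliver_hints(holder):
    """When an intended recipient recovers, hand its replica back."""
    remaining = []
    for target, key, value in holder.hints:
        if target.up:
            target.data[key] = value
        else:
            remaining.append((target, key, value))
    holder.hints = remaining
```

The write never fails as long as one node in the (longer-than-N) preference list is healthy, which is how Dynamo keeps its "always writeable" property through transient failures.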
Other techniques
Merkle trees are used to detect inconsistencies between replicas and to minimize the amount of data transferred during synchronization.
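The core idea: two replicas exchange root hashes for a key range; equal roots mean the range is in sync with no data transferred, and unequal roots let them descend into subtrees to localize the difference. A minimal root-computation sketch, assuming SHA-256 and last-leaf duplication for odd levels (real trees key subtrees by ring ranges):

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Root hash over a list of leaf byte-strings (sketch)."""
    level = [_h(leaf) for leaf in leaves] or [_h(b"")]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])           # duplicate last hash to pair up
        level = [_h(level[i] + level[i + 1])  # parent = hash of child pair
                 for i in range(0, len(level), 2)]
    return level[0]
```

A single stale value anywhere in the range changes the root, so replicas detect divergence by comparing one hash instead of shipping the whole key range.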
Gossip: nodes can be added and removed multiple times. A gossip-based protocol propagates membership changes and maintains an eventually consistent view of membership.
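One round of such a membership gossip can be sketched as each node reconciling its view with one random peer, higher version numbers winning. A toy model, not Dynamo's protocol: the view layout (member name to version counter) and the push-pull merge are assumptions for the sketch.

```python
import random

def gossip_round(views, rng=random.Random(0)):
    """One gossip round over views: node -> {member: version}.
    Each node merges histories with a random peer; newer versions win."""
    nodes = list(views)
    for node in nodes:
        peer = rng.choice([n for n in nodes if n != node])
        merged = dict(views[peer])
        for member, ver in views[node].items():
            if merged.get(member, -1) < ver:
                merged[member] = ver
        views[node] = merged          # both sides end up with the
        views[peer] = dict(merged)    # reconciled view (push-pull)
```

After a few rounds every node's view converges on the latest membership change, i.e., the views become eventually consistent without any central coordinator.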
Implementation
Each storage node has three software components:
- Request coordination
- Membership and failure detection
- Local persistence engine
Buffered reads and writes: writes can be lost if a server crashes before its buffer is flushed. To reduce this risk, one of the N replicas performs a durable write, without affecting overall performance.
- Key-value pairs
- Vector clocks used
- Gossip-based
- Consistent hashing
- Eventual consistency and reconciliation
Clouded Data
Ad hoc systems: it remains to be seen whether such application-level coordination, without support from the data management service, will be sufficient for more complex operations and heavier workloads. Atomicity is limited to single data items; is this sufficient?