Download as PPTX, PDF, TXT or read online from Scribd
Download as pptx, pdf, or txt
You are on page 1of 30
Amazon Dynamo DB
- a term project presentation
CSc 244 | CSU Sacramento Spring 2014 Instructor : Dr. Ying Jin 8 th May, 2014 Team 1 : Chetan Nagarkar Saransh Agenda NoSQL background CAP theorem AWS offerings Dynamo DB Dynamo DB Architecture How to use Pricing Advantages Limitations References Questions
2 NoSQL background Traditional databases have a concrete schema.
4 Types of NoSQL systems 1. Key-Value Pair : Every item is a set of key-value pair Voldemort(LinkedIn), DynamoDB(Amazon), Redis(VMWare), etc. 2. Document Oriented : Each key is paired with a complex data structure called document. Documents may contain multiple key-value pairs CouchDB, MongoDB, etc. 3. Column Family Stores : Key value is mapped to set of columns optimized for query over large datasets. Cassandra(Facebook), HyperTable, Hbase, BigTable(Google) etc. 4. Graph Databases : Contains information about network, social connections Neo4j, FlockDB(Twitter).
5 ACID properties in traditional RDBMs
Atomicity Requires each transaction to be all or nothing Consistency Brings databases from one valid state to another Isolation Takes care of concurrent execution and avoid overwritting Durability Commited transactions will remain so in the event of power loss, crashes etc
6 CAP theorem It states : A distributed system possibly cannot guarantee all of the following simultaneously Consistency: All nodes see the same data at the same time
Availability: The system is always on, i.e., every request receives a response, whether it serves the request or not.
Partition Tolerance: System continues to operate in case of system failure or message loss. 7 8 Amazon Web Services(AWS) offerings Launched in 2006, AWS is used by thousands of companies today for cloud based computing services. Includes an array of services: Remote Computing(EC2), Identity(IAM), Database services(SimpleDB, RDS, ElastiCache, DynamoDB, etc.) 8 Regions and Availability zones : 9 Code Name ap-northeast-1 Asia Pacific (Tokyo) Region ap-southeast-1 Asia Pacific (Singapore) Region ap-southeast-2 Asia Pacific (Sydney) Region eu-west-1 EU (Ireland) Region sa-east-1 South America (Sao Paulo) Region us-east-1 US East (Northern Virginia) Region us-west-1 US West (Northern California) Region us-west-2 US West (Oregon) Region Dynamo DB Dynamo DB is a managed and eventually consistent NoSQL database service to support highly available data access. A flexible data model with key/attribute pairs. No schema required. Optimized for availability to maximize : Data consistency Durability Performance
ALL THIS WITHOUT THE OPERATIONAL BURDEN 10 What is Fully Managed? Never worry about: Hardware provisioning Cross-availability zone replication Hardware and Software updates Monitoring and handling of hardware failures Replicas automatically generated.
11 DynamoDB Architecture True distributed architecture Data is spread across hundreds of servers called storage nodes Hundreds of servers form a cluster in the form of a ring Client application can connect using one of the two approaches Routing using a load balancer Client-library that reflects Dynamos partitioning scheme and can determine the storage host to connect DynamoDB is designed to be always writable storage solution Allows multiple versions of data on multiple storage nodes Conflict resolution happens while reads and NOT during writes Syntactic conflict resolution Symantec conflict resolution
Put(key,context object) Determines where replicas of the object should be placed based on key,& write replicas to disk Context Info is stored along with object so as to verify validity of context object If at least W-1 nodes respond, write is successful Get(key)- Operation locates all object replicas associated with key in storage system Returns a single or list of object with conflict version along with context Client reconcile divergent versions and supersede current version with based on context Node handling read and write operation is called coordinator For reads and writes dynamo uses consistency protocol Consistency protocol has 3 variables N Number of replicas of data to be read or written W Number of nodes that must participate in successful write operation R Number of machines contacted in read operation R+W > N Latency determined by slowest of R or W replicas. Thus R&W less than N DynamoDB System Interface get() and put() DynamoDB Architecture - Partitioning Data is partitioned over multiple hosts called storage nodes (ring) Uses consistent hashing to dynamically partition data across storage hosts Two problems associated with Consistent hashing Hashing of storage hosts can cause imbalance of data and load Consistent hashing treats every storage host as same capacity
A [3,4] B [1] C[2] 1 2 3 4 3 4 A[4] B[1] 1 2 D[2,3] Initial Situation Situation after C left and D joined DynamoDB Architecture - Partitioning Solution Virtual hosts mapped to physical hosts (tokens) Number of virtual hosts for a physical host depends on capacity of physical host Virtual nodes are function-mapped to physical nodes
A B C D H I J C B Physical Host Virtual Host DynamoDB Architecture Scaling Gossip protocol used for node membership Every second, storage node randomly contact a peer to bilaterally reconcile persisted membership history Doing so, membership changes are spread and eventually consistent membership view is formed When a new members joins, adjacent nodes adjust their object and replica ownership When a member leaves adjacent nodes adjust their object and replica and distributes keys owned by leaving member A B C D E F G H Replication factor = 3 DynamoDB Architecture Data Versioning Eventual Consistency data propagates asynchronously Its possible to have multiple version on different storage nodes If latest data is not available on storage node, client will update the old version of data Conflict resolution is done using vector clock Vector clock is a metadata information added to data when using get() or put() Vector clock effectively a list of (node, counter) pairs One vector clock is associated with every version of every object One can determine two version has causal ordering or parallel branches by examining vector clocks Client specify vector clock information while reading or writing data in the form of context Dynamo tries to resolve conflict among multiple version using syntactic reconciliation If syntactic reconciliation does not work, client has to resolve conflict using semantic reconciliation
18 Strongly consistent vs Eventually Consistent Multiple copies of same item to ensure durability Takes time for update to propagate multiple copies Eventually consistent After write happens, immediate read may not give latest value Strongly consistent Requires additional read capacity unit Gets most up-to-date version of item value Eventually consistent read consumes half the read capacity unit as strongly consistent read.
How to use AWS provides SDKs to interact with DynamoDB for the following platforms/languages: Android iOS Java .Net
In Java, there are two means to interact with your DynamoDB table: AWS Management Console AWS Toolkit for Eclipse 20
Python PHP Node.js Ruby Management Console 21 Console create table 22 Inside Amazon DynamoDB Primary key (mandatory for every table) Hash or Hash + Range Data model in the form of tables Data stored in the form of items (name value attributes) Secondary Indexes for improved performance Local secondary index Global secondary index Scalar data type (number, string etc) or multi-valued data type (sets)
key=value key=value key=value key=value Table Item (64KB max) Attributes Create Instance in Java 24 Using AWS SDK: AWS Toolkit for Eclipse 25 Pricing In the free tier, AWS offers 100 MB of free data storage per month. Read capacity: 1 read = 4 KB data Write capacity: 1 write = 1 KB data With eventually consistant reads, you get twice reads/sec In free tier you get 5 writes/sec & 10 reads/sec Or 432,000 writes and 864,000 writes per day or 40 million data operations per month. 26 Advantages Easy Administration: No h/w or s/w provisioning, setup or configuration Flexible: Secondary Indexes: query on whichever attribute Fast, predictable performance: SSDs, single digit latency Built-in fault tolerance Stong consistency and atomic counters Automatic Data Replication: Data stored across 3 availability zones Security: AWS Identity Access Management(IAM) CloudWatch Alarms: Alarms are raised if you are near to crossing provisioned throughput or storage. Provisioned throughput/Cost effective: Pay how you use
27 Limitations 64KB limit on item size (row size) 1 MB limit on fetching data Pay more if you want strongly consistent data Size is multiple of 4KB (provisioning throughput wastage) Cannot join tables Indexes should be created during table creation only No triggers or server side scripts Limited comparison capability (no not_null, contains etc)