DynamoDB Design Patterns
Siva Raghupathy
4/27/2012
Amazon Web Services - DynamoDB Design Patterns
Table of Contents
Introduction
Storing items over 64K
    Problem
    Solution
        Break large attribute(s) across multiple DynamoDB items
        Store large attributes in S3
    Gotchas
Searching non-key attributes (creating secondary indexes)
    Problem
    Solution
    Gotchas
Storing and analyzing time series (log) data
    Problem
    Solution
Modeling one-to-many (1:N) and many-to-many (M:N) relationships
    Problem
    Solution
Avoiding hot keys and hot partitions
    Hot key example
        Problem
        Solution
    Hot partition (write) example
        Problem
        Solution
    Hot partition (read) example
        Problem
        Solution
    Summary
CloudSearch enabling DynamoDB data
    Problem
    Solution
Conclusion
Introduction
Amazon DynamoDB is a fully managed non-relational (NoSQL) database service built from the ground up for massive scale, predictable performance, low latency, and high durability.
While DynamoDB automatically handles most of the heavy lifting of database administration and scaling, it imposes some constraints. For example, the size of an item cannot exceed 64 kilobytes (KB). A table must have a primary key, which must be either a single hash key or a hash-range composite key. Data can be efficiently accessed only via the primary key, and there are no secondary indexes. Developers must also avoid hot keys and hot partitions to achieve massive scale. These constraints (for the most part) ensure that a developer writes applications that scale with predictable performance and avoid bad surprises as the workload increases.
This paper illustrates design patterns for working around these constraints. For example, you can store more than 64 KB of data pertaining to an item across multiple items in a table using a hash-range key. You can create and maintain your own secondary indexes. Or you can CloudSearch-enable your DynamoDB data. The paper also outlines best practices for storing time series data, avoiding hot keys and hot partitions, and modeling 1:N and M:N relationships.
Storing items over 64K
Problem
If an item can grow large because it contains a possibly large list of attributes, consider pivoting it from horizontal to vertical, i.e. switching from a hash key to a hash-range key.
Let us consider the following scenario. You have an application that stores inbox messages in
DynamoDB. Here is the inbox schema:
INBOX_MESSAGE
MESSAGE-ID
MAILBOX-ID
DATE
FROM
TO
SUBJECT
BODY
While most messages may fit within 64K, there is a chance that the message body can be greater than 64K.
Solution
You can use one of the following techniques for storing larger items:
Break large attribute(s) across multiple DynamoDB items
Option 1: If the range key is free, store the message across multiple items that share a hash key (MESSAGE-ID) and use a PART attribute as the range key. Insert the header (PART = 0) first, then insert the additional parts as separate items carrying just the BODY. When getting a message, Query on the hash key (for example, MESSAGE-ID = 1) with no range key condition. This returns PART = 0, 1, 2, and 3, and you can assemble the BODY parts in your application.
Option 2: If the range key is already in use, you can store the overflow items in a separate table (INBOX_MESSAGE_BODY_OVERFLOW).
INBOX_MESSAGE
MAILBOX-ID   DATE
(hash key)   (range key)
1            '12/12/2011'   MESSAGE-ID = 'some UUID', FROM = 'user1', TO = 'user2',
                            SUBJECT = 'DynamoDB Best practices', BODY = 'The first few Kbytes…'
INBOX_MESSAGE_BODY_OVERFLOW
MESSAGE-ID   PART
(hash key)   (range key)
1            1            BODY = 'the next 64K'
1            2            BODY = 'the next 64K'
1            3            EOM
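As a sketch of Option 1, the splitting and reassembly logic might look like the following, with plain dicts standing in for DynamoDB writes and Query results; the helper names and the header-in-PART-0 layout are illustrative assumptions, not a prescribed API:

```python
# Split a large message BODY across multiple items that share a hash key
# (MESSAGE-ID) and use PART as the range key, then reassemble it.
# NOTE: the 64 KB chunk size ignores the space taken by the other attributes.

MAX_PART_BYTES = 64 * 1024

def split_message(message_id, header, body):
    """Return the items to write: PART = 0 carries the header attributes
    and the first chunk; PART = 1..n carry just the remaining BODY chunks."""
    chunks = [body[i:i + MAX_PART_BYTES]
              for i in range(0, len(body), MAX_PART_BYTES)] or ['']
    first = dict(header)
    first.update({'MESSAGE-ID': message_id, 'PART': 0, 'BODY': chunks[0]})
    items = [first]
    for part, chunk in enumerate(chunks[1:], start=1):
        items.append({'MESSAGE-ID': message_id, 'PART': part, 'BODY': chunk})
    return items

def assemble_body(items):
    """Rebuild BODY from the items a Query on MESSAGE-ID would return."""
    return ''.join(i['BODY'] for i in sorted(items, key=lambda i: i['PART']))
```

Since the writes across these items are not transactional (see the Gotchas below), a real client would also need to handle partial-write failures.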
Store large attributes in S3
For example, if you are modeling an email inbox using a DynamoDB table (INBOX_MESSAGE), you can store the message body in S3.
INBOX_MESSAGE
MESSAGE-ID
(hash key)
1            FROM = 'user1', TO = 'user2', DATE = '12/12/2011',
             SUBJECT = 'DynamoDB Best practices', BODY = 'The first few Kbytes…',
             BODY_OVERFLOW = 'S3 bucket + key'
In the above example, the first section of the message body is stored in the DynamoDB attribute BODY. The overflow is stored in S3, and the BODY_OVERFLOW attribute's value is set to the S3 bucket + key.
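A minimal sketch of this split, with dicts standing in for the DynamoDB table and the S3 bucket; the 8 KB inline limit and the key layout are assumptions for illustration only:

```python
# Keep the first few kilobytes of BODY inline in the DynamoDB item and
# point at the remainder in S3 via a BODY_OVERFLOW attribute.

INLINE_LIMIT = 8 * 1024  # leave room for the other attributes in the 64 KB item

def store_message(item_store, s3_store, message_id, item, body):
    item = dict(item, BODY=body[:INLINE_LIMIT])
    if len(body) > INLINE_LIMIT:
        s3_key = 'inbox-bodies/%s' % message_id   # hypothetical bucket layout
        s3_store[s3_key] = body[INLINE_LIMIT:]    # stands in for an S3 put
        item['BODY_OVERFLOW'] = s3_key
    item_store[message_id] = item                 # stands in for PutItem

def load_body(item_store, s3_store, message_id):
    item = item_store[message_id]
    overflow = s3_store.get(item.get('BODY_OVERFLOW'), '')
    return item['BODY'] + overflow
```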
Gotchas
- Since writes across multiple DynamoDB items are not transactional, the client needs to handle any failure scenarios.
- Since DynamoDB and S3 do not support (distributed) transactions between them, there may be orphaned S3 objects or DynamoDB items.
- S3 keys can be up to 1024 bytes long and S3 bucket names up to 255 characters long. Depending on the bucket and key names you choose, the overflow pointer may consume up to 1024 + 255 (1279) bytes.
Searching non-key attributes (creating secondary indexes)
Problem
Data can be efficiently accessed only via the primary key, and DynamoDB has no built-in secondary indexes. How do you efficiently search non-key attributes?
Solution
You can create your own secondary index tables. For example, let us consider a table:
USERS
UID FNAME LNAME
(hash key)
101 Tim White
201 Tim Black
301 Ted White
401 Keith Brown
501 Keith White
601 Keith Black
FNAME_index:
FNAME        UID
(hash key)   (range key)
Tim          101
Tim          201
Ted          301
Keith        401
Keith        501
Keith        601

LNAME_index:
LNAME        UID
(hash key)   (range key)
White        101
Black        201
White        301
Brown        401
White        501
Black        601
Note: by using FNAME or LNAME as the hash key of the USERS table, you could have avoided creating one of these secondary index tables.
To answer Query 1 (find all users with FNAME = 'Tim'):
- Query the FNAME_index table, where FNAME (hash key) = 'Tim', and do not specify a range key condition. This returns UID = (101, 201).
- Then look up the USERS table using BatchGetItem.
To answer Query 2 (find all users with LNAME = 'Black'):
- Query the LNAME_index table, where LNAME (hash key) = 'Black', and do not specify a range key condition. This returns UID = (201, 601).
- Then look up the USERS table using BatchGetItem.
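The index-then-batch-get flow can be simulated as follows, with dicts standing in for the USERS table and an index table; the function names are illustrative, not SDK calls:

```python
# Hand-maintained secondary index: LNAME_index maps LNAME (hash key) to the
# set of UIDs (range key), and a lookup does Query + BatchGetItem.

USERS = {101: ('Tim', 'White'), 201: ('Tim', 'Black'), 301: ('Ted', 'White'),
         401: ('Keith', 'Brown'), 501: ('Keith', 'White'), 601: ('Keith', 'Black')}

LNAME_INDEX = {}
for uid, (fname, lname) in USERS.items():    # maintained on every write
    LNAME_INDEX.setdefault(lname, []).append(uid)

def users_by_lname(lname):
    uids = LNAME_INDEX.get(lname, [])        # Query on the index table
    # BatchGetItem on the primary table for the returned UIDs:
    return [(uid,) + USERS[uid] for uid in sorted(uids)]
```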
Gotchas
- Since the hash keys of the index tables can only be queried for equality, you cannot do range queries, e.g. "give me all items in the USERS table where FNAME begins with 'T'".
- When you write data to the primary table, you need to maintain the secondary index tables as well.
Storing and analyzing time series (log) data
Problem
Time series (log) data arrives continuously: new data is write-heavy, older data is mostly read, and old data must eventually be purged. Deleting old items one at a time consumes provisioned throughput. How should such data be organized?
Solution
Pre-create N tables, one for each time period (e.g. one per hour/day/month). Periodically add new tables and delete older ones. Deleting a table is much more efficient and costs much less than deleting old items within a table, but the application must handle the sharding across the N tables. New tables are usually write-heavy and older tables read-heavy, and having multiple tables makes it easy to adjust the provisioned throughput levels to your needs.
Let us assume you are storing historic data with the schema Event_id (hash key), Timestamp, Attribute1, …, AttributeN. You can pre-create one table per month and insert records into the appropriate table based on the timestamp:
JAN2012
Event_id     Timestamp    Attribute1   ….   AttributeN
(hash key)

FEB2012
Event_id     Timestamp    Attribute1   ….   AttributeN
(hash key)

…

DEC2012
Event_id     Timestamp    Attribute1   ….   AttributeN
(hash key)
For example, if the Timestamp of a row is "2012-01-17 01:35:05", it goes into the JAN2012 table. Since DynamoDB automatically partitions by the hash key (Event_id), you can provision the required read or write throughput per table. For example, you may provision higher reads and writes per second for the current month but much less for the previous months. When you create the JAN2013 table, you can use Amazon Elastic MapReduce and Hive to archive the JAN2012 table to S3 and then delete the JAN2012 table.
Modeling one-to-many (1:N) and many-to-many (M:N) relationships
Problem
You can only create two types of tables in DynamoDB: a table with a hash key, or a table with a composite hash-range key. How can you use these to model entities and the 1:N or M:N relationships between entities in an application?
Solution
Let us consider an online gaming example with two entities: USERS and GAMES. There is a many-to-many (M:N) relationship between users and games, i.e. one user can play many games and one game can have many users. The application wants to store user profiles, game metadata, the highest score a user posted in each game, plus a leader board. This can be modeled with DynamoDB tables as follows:
User scores can be kept in a USER_GAMES table with a hash-range key:
Hash Key        Range Key        Attributes
UserId = bob    GameId = Game1   HighScore = 10500, ScoreDate = 2011-10-20
UserId = fred   GameId = Game2   HighScore = 12000, ScoreDate = 2012-01-10
UserId = bob    GameId = Game3   HighScore = 20000, ScoreDate = 2012-02-12
Note: For the leader board, you can keep a separate table whose range key is Score+UserId, where the UserId suffix makes the row unique in the event two users in a game post the same score.
The USER_GAMES table models the 1:N relationship between a user and games. The Query API can be used to retrieve all the games played by a user. If you want to find the high scores achieved by user bob, you can Query the USER_GAMES table with hash key = bob and the range key set to a specific game, or with no range key condition if you want the high scores for all of bob's games.
Similarly, to find all the users of a game you need a GAME_USERS table:
Hash Key Range key
GameId = Game1 UserId = bob
GameId = Game2 UserId = fred
GameId = Game3 UserId = bob
Even if you don't need to find all the users of a game, you need the GAME_USERS table to maintain referential integrity: if a game is deleted, you can find all the users that played it from the GAME_USERS table and use that to delete the now-unwanted data in the USER_GAMES table. The USER_GAMES and GAME_USERS tables together model the many-to-many (M:N) relationship between games and users. Both tables are also needed to maintain referential integrity even if you only want to model a 1:N relationship between games and users, or vice versa.
Summary:
One-to-many (1:N) and many-to-many (M:N) relationships between two entities can be modeled in DynamoDB using two join tables. The first join table has the first entity's key as the hash key and the second entity's key as the range key; the second join table has the inverse.
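A sketch of maintaining the pair of join tables on every write, so both directions of the M:N relationship stay queryable; nested dicts keyed by hash key (with the range key inside) stand in for the two tables:

```python
# USER_GAMES: hash UserId, range GameId; GAME_USERS: the inverse join table.

def record_score(user_games, game_users, user_id, game_id, high_score, score_date):
    user_games.setdefault(user_id, {})[game_id] = {
        'HighScore': high_score, 'ScoreDate': score_date}
    game_users.setdefault(game_id, {})[user_id] = {}   # keep the inverse in sync

def games_of(user_games, user_id):
    return sorted(user_games.get(user_id, {}))   # Query: hash key = user

def users_of(game_users, game_id):
    return sorted(game_users.get(game_id, {}))   # Query on the inverse table
```

Note that the two writes are not transactional, so a real client must handle the failure case where one table was updated and the other was not.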
Avoiding hot keys and hot partitions
Hot key example
Problem
Consider a MOVIE table that tracks the number of tickets sold per movie. A blockbuster such as "Avatar" concentrates a very high update rate on a single hash key:
MOVIE:
MNAME
(hash key)
Avatar       TicketCount = 32,000,000
Solution
Use multiple keys (aliases) instead of a single hot key. You can generate these aliases by prefixing or suffixing a random number N, within a known range, to the hot hash key. We suggest a large range, such as N = 100 to 200. For example, the "Avatar" ticket count can be stored under multiple aliases (1_Avatar, 2_Avatar, …, N_Avatar):
MNAME
(hash key)
1_Avatar     TicketCount = 4,000,000
2_Avatar     TicketCount = 2,000,000
3_Avatar     TicketCount = 4,000,000
….
N_Avatar     TicketCount = 4,000,000
When updating the ticket count for "Avatar", the application randomly picks a prefix from 1 to N and updates one of the N possible keys. This spreads the IO across multiple keys and potentially multiple partitions. To answer the question "find the ticket count for Avatar", the application can use the BatchGetItem API to retrieve the ticket counts for all the aliases (1_Avatar, 2_Avatar, 3_Avatar, …, N_Avatar) and sum them.
Hot partition (write) example
Problem
Suppose you bulk-import data that is sorted by the hash key (UserId). Since DynamoDB partitions by the hash key, writing the rows in this order hammers one partition at a time:
Sorted data:
UserId       Timestamp    …
(hash key)   (range key)
U1           1
U1           2
U1           …
U1           1000000000
U2           1
U2           2
U2           …
U2           1000000000
…
Solution
To avoid this issue, export the data sorted by Timestamp, or by any other column that causes a random distribution of hash keys, before loading it. Think of this as shuffling your card deck before dealing the cards!
Hot partition (read) example
Problem
Suppose each item in a table carries a LargeAttribute of about 2 KB. A Query (ID (hash key) = XXX, with optional range key conditions) that returns 1,000 items can drain about 2,000 read capacity units from a single partition. This may exhaust the read capacity units of that partition and put it in a read outage for a short time, especially when multiple concurrent queries go to the same partition (ID (hash key) = XXX).
Solution
Since the number of read capacity units consumed by a query equals the cumulative size of the items returned (rounded up in 1 KB increments), making the items much smaller keeps the read capacity consumed low even when the number of items returned is large. To avoid the issue above, put the large attribute in a side (overflow) table and fetch it with a batch get based on the query results, using the following scheme:
TABLE1
ID (hash key)
Timestamp (range key)
Attribute1
Attribute2
Attribute3
….
AttributeN
LargeAttributeUUID

TABLE1_OVERFLOW
LargeAttributeUUID (hash key)
LargeAttribute
In this case, if the query returns 1,000 items and each attribute is about 10 bytes, the item size may be around 70 bytes (including 16 bytes for the LargeAttributeUUID), so the query may consume about 70 read units on TABLE1 instead of the 2,000 read units discussed earlier. The application can then do a batch get on the list of LargeAttributeUUIDs returned by the query. The batch get distributes the IO uniformly across multiple partitions of the overflow table (TABLE1_OVERFLOW). Even though the total read units consumed might be the same across the two tables (TABLE1 and TABLE1_OVERFLOW), you avoid the hot partition.
Summary
To avoid hot keys, distribute the workload across more keys. To avoid hot partitions, distribute the workload across multiple partitions, for example by moving large attributes into an overflow table.
CloudSearch enabling DynamoDB data
Problem
DynamoDB queries are limited to the primary key, and hand-maintained index tables do not support rich (e.g. full-text) search. How can you index and search DynamoDB data with Amazon CloudSearch?
Solution
An application calling DynamoDB PutItem, UpdateItem, or DeleteItem can log versioned changes (via the API's ReturnValues parameter, with ALL_NEW etc. in some cases) into a DynamoDB update stream table. Each DynamoDB change should be mapped to a "type": "add" or "type": "delete" SDF record: a PutItem is an add, an UpdateItem is a delete followed by an add, and a DeleteItem maps to a delete. Use the DynamoDB EMR/Hive handler to convert the update stream table into CSV or directly into an SDF file and dump it to S3. These SDF files can then be loaded into CloudSearch to update its indexes. You should use one DynamoDB update stream table per day (or per week, etc.) and drop the old tables after moving them to S3 via EMR.
Here is an example:
Let us assume I am maintaining movies in a DynamoDB Movies table. When the DynamoDB client application (in your app tier) updates the Movies table, it logs the changes in a Movies_update_stream_<day> table; I am using one table per day here. At the end of the day, I export the data to S3 and from there import it into CloudSearch.
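The mapping from logged changes to SDF records might look like the sketch below. The field layout follows CloudSearch's 2012-era SDF batch format, but treat the exact record shape (and the omitted version handling for updates) as an assumption to check against the SDF documentation:

```python
# Map one logged DynamoDB change to a list of SDF records:
# PutItem -> add, DeleteItem -> delete, UpdateItem -> delete followed by add.

def to_sdf(change):
    add = {'type': 'add', 'id': change['id'], 'version': change['version'],
           'lang': 'en', 'fields': change.get('fields', {})}
    delete = {'type': 'delete', 'id': change['id'], 'version': change['version']}
    if change['op'] == 'PutItem':
        return [add]
    if change['op'] == 'DeleteItem':
        return [delete]
    if change['op'] == 'UpdateItem':   # per the text: a delete followed by an add
        return [delete, add]
    raise ValueError('unknown op: %r' % change['op'])
```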
Conclusion
DynamoDB should be used when your applications do not require complex queries or transactions, need to scale now or in the future, and demand low latency and high durability. While items greater than 64K in size can be stored in DynamoDB across multiple items using the range key, large blobs should be stored in S3.
For storing time series data, applications should create a table per day, week, etc., and move old tables into S3 using the DynamoDB Hive handler.
You can create secondary index tables in DynamoDB for efficiently searching non-key attributes. You can
also use CloudSearch for indexing and searching DynamoDB attributes.
You can avoid hot keys by picking the right key structure. You can avoid hot partitions by placing large attributes in a separate overflow table and/or randomizing the workload across partitions (hash keys). One-to-many (1:N) and many-to-many (M:N) relationships between two entities can be modeled in DynamoDB using join tables.
As shown above, most of the constraints imposed by DynamoDB can be worked around using a combination of these design patterns. While these constraints create more work for the developer up front, they (for the most part) force a developer to write applications that scale with predictable performance and avoid nasty surprises as the workload increases.