DynamoDB Design Patterns
Siva Raghupathy
4/27/2012


Table of Contents
Introduction
Storing items over 64K
    Problem
    Solution
        Break large attribute(s) across multiple DynamoDB items
        Store large attributes in S3
    Gotchas
Searching non-key attributes (creating Secondary Indexes)
    Problem
    Solution
    Gotchas
Storing and analyzing time series (log) data
    Problem
    Solution
Modeling one-to-many (1:N) and many-to-many (M:N) relationships
    Problem
    Solution
Avoiding hot keys and hot partitions
    Hot key example
        Problem
        Solution
    Hot partition (write) example
        Problem
        Solution
    Hot partition (read) example
        Problem
        Solution
    Summary
CloudSearch enabling DynamoDB data
    Problem
    Solution
Conclusion


Introduction
Amazon DynamoDB is a fully managed non-relational (NoSQL) database service built from the ground up for massive scale, predictable performance, low latency, and high durability.

While DynamoDB automatically handles most of the heavy lifting when it comes to database administration and scaling, it imposes some constraints. For example, the size of an item cannot exceed 64 kilobytes (KB). A table must have a primary key. The primary key must be a single hash key or a hash-range composite key. Data can be efficiently accessed only via the primary key. There are no secondary indexes. Developers should avoid hot keys and hot partitions to achieve massive scale. These constraints (for the most part) ensure that a developer writes applications that scale with predictable performance and avoid bad surprises as the workload increases.

This paper illustrates design patterns for working around these constraints. For example, you can store more than 64KB of data pertaining to an item by spreading it across multiple items in a table with a hash-range key. You can create and maintain your own secondary indexes. Or you can CloudSearch enable your DynamoDB data. It also outlines best practices for storing time series data, avoiding hot keys and hot partitions, and modeling 1:N and M:N relationships.

Storing items over 64K


Problem
Amazon DynamoDB item size, which includes the sum of attribute-name and attribute-value lengths, cannot exceed 64 kilobytes (KB). But there are scenarios where you may need to store more than 64K for an item.

If the item can grow large because it contains a possibly long list of attributes, consider pivoting it from horizontal to vertical, i.e., switching from a hash key to a hash-range key.

Let us consider the following scenario. You have an application that stores inbox messages in
DynamoDB. Here is the inbox schema:

INBOX_MESSAGE
MESSAGE-ID
MAILBOX-ID
DATE
FROM
TO
SUBJECT
BODY

The MESSAGE-ID is a globally unique ID, which can serve as the primary key.


An alternate primary key is MAILBOX-ID + DATE.

While most messages may fit within 64K, there is a chance that the message body will be greater than 64K.


Solution
You can use one of the following techniques for storing larger items:

Break large attribute(s) across multiple DynamoDB items

Option 1: Use a range key to overflow a > 64K item across multiple items.


INBOX_MESSAGE
MESSAGE-ID   PART
(hash key)   (range key)   Attributes
1            0             FROM = ‘user1’, TO = ‘user2’, DATE = ‘12/12/2011’,
                           SUBJECT = ‘DynamoDB Best practices’,
                           BODY = ‘The first few Kbytes…’
1            1             BODY = ‘the next 64k’
1            2             BODY = ‘the next 64k’
1            3             EOM

You can insert the header (PART = 0) first and then insert the additional parts as separate items with just the BODY. When getting a message, Query for the hash key (for example, MESSAGE-ID = 1) with no range key condition. This will return PART = 0, 1, 2, and 3. You can assemble the BODY parts in your application.
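
As a minimal sketch of that read path, using the Python boto3 SDK (which post-dates this paper); table and attribute names follow the example above:

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("INBOX_MESSAGE")

# Query with no range key condition returns every PART of message 1,
# sorted by the range key (PART) in ascending order.
resp = table.query(KeyConditionExpression=Key("MESSAGE-ID").eq(1))
items = resp["Items"]

header = items[0]  # PART = 0 carries FROM, TO, DATE, SUBJECT
# Concatenate the BODY chunks in order to reassemble the full body.
body = "".join(item.get("BODY", "") for item in items)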

Option 2: If the range key is already in use, you can store the overflow items in a separate table (INBOX_MESSAGE_BODY_OVERFLOW).

INBOX_MESSAGE
MAILBOX-ID   DATE
(hash key)   (range key)    Attributes
1            ‘12/12/2011’   MESSAGE-ID = ‘some UUID’, FROM = ‘user1’,
                            TO = ‘user2’, SUBJECT = ‘DynamoDB Best practices’,
                            BODY = ‘The first few Kbytes…’

INBOX_MESSAGE_BODY_OVERFLOW
MESSAGE-ID   PART
(hash key)   (range key)   Attributes
1            1             BODY = ‘the next 64k’
1            2             BODY = ‘the next 64k’
1            3             EOM

Store large attributes in S3


Use this approach only when the attribute is very large; otherwise, leaving the attribute in DynamoDB will result in lower latencies.

- Create an S3 bucket in the same AWS region.


- Store the large attribute in the S3 bucket by:
  o Generating a unique S3 key (this can be table key + attribute name)
  o Putting the key and attribute value in S3
- Store the S3 reference (bucket name + key) as a DynamoDB attribute value

For example, if you are modeling an email inbox using a DynamoDB table (INBOX_MESSAGE), you
can store the message body in S3.

INBOX_MESSAGE
MESSAGE-ID
(hash key)   Attributes
1            FROM = ‘user1’, TO = ‘user2’, DATE = ‘12/12/2011’,
             SUBJECT = ‘DynamoDB Best practices’,
             BODY = ‘The first few Kbytes…’,
             BODY_OVERFLOW = ‘S3bucket+key’

In the above example, the first section of the message body is stored in the DynamoDB attribute BODY. The overflow is stored in S3, and the BODY_OVERFLOW attribute’s value is set to the S3 bucket + key.
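
A sketch of the write path under this scheme (boto3 again; the bucket name and the few-KB cutoff are illustrative assumptions):

import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("INBOX_MESSAGE")

body_text = "..."                     # the full, possibly > 64K, message body
bucket = "inbox-overflow"             # bucket created in the same AWS region
key = "INBOX_MESSAGE/1/BODY"          # unique key: table key + attribute name

# Put the large attribute value in S3 ...
s3.put_object(Bucket=bucket, Key=key, Body=body_text.encode("utf-8"))

# ... and store only the first few KB plus the S3 reference in DynamoDB.
table.put_item(Item={
    "MESSAGE-ID": 1,
    "FROM": "user1",
    "TO": "user2",
    "DATE": "12/12/2011",
    "SUBJECT": "DynamoDB Best practices",
    "BODY": body_text[:4096],
    "BODY_OVERFLOW": bucket + "/" + key,
})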

Gotchas
- Since writes across multiple DynamoDB items are not transactional, the client needs to handle any failure scenarios.
- Since neither DynamoDB nor S3 supports (distributed) transactions, there may be orphaned S3 objects or DynamoDB items.
- S3 keys can be up to 1024 bytes long and S3 bucket names up to 255 characters long. Depending on the bucket and key names you choose, the reference may consume up to 1024 + 255 (1279) bytes.

Searching non-key attributes (creating Secondary Indexes)


Problem
DynamoDB does not support secondary indexes for efficiently searching non-key attributes. How can I
solve this issue?

Solution
You can create your own secondary index tables. For example, let us consider a table:

USERS
UID          FNAME   LNAME
(hash key)
101          Tim     White
201          Tim     Black
301          Ted     White
401          Keith   Brown
501          Keith   White
601          Keith   Black


To efficiently answer the following queries:


Query1: Get all the items in the USERS table where FNAME = ‘Tim’
Query2: Get all the items in the USERS table where LNAME = ‘Black’

You can create a secondary index table for FNAME and LNAME respectively:

FNAME_index
FNAME        UID
(hash key)   (range key)
Tim          101
Tim          201
Ted          301
Keith        401
Keith        501
Keith        601

LNAME_index
LNAME        UID
(hash key)   (range key)
White        101
Black        201
White        301
Brown        401
White        501
Black        601

Note: by using FNAME or LNAME as the hash key in the USERS table, you could have avoided creating one of these secondary index tables.

To answer Query1:
- Query the FNAME_index table, where FNAME (hash key) = ‘Tim’, and do not specify a range key condition. This will return UID = (101, 201).
- Then look up the USERS table using BatchGetItem.

To answer Query2:
- Query the LNAME_index table, where LNAME (hash key) = ‘Black’, and do not specify a range key condition. This will return UID = (201, 601).
- Then look up the USERS table using BatchGetItem.
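
A sketch of the two-step Query1 lookup (boto3; table names follow the example above):

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")

# Step 1: query the index table for the hash key, no range key condition.
resp = dynamodb.Table("FNAME_index").query(
    KeyConditionExpression=Key("FNAME").eq("Tim"))
uids = [item["UID"] for item in resp["Items"]]   # -> [101, 201]

# Step 2: batch-get the full items from the primary USERS table.
users = dynamodb.batch_get_item(RequestItems={
    "USERS": {"Keys": [{"UID": uid} for uid in uids]},
})["Responses"]["USERS"]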

Gotchas
- Since the hash keys of the index tables can only be queried for equality, you cannot do range queries, i.e., “give me all items in the USERS table where FNAME begins with ‘T’”.
- When you write data to the primary table, you need to maintain the secondary index tables as well (see the sketch below).
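
A hedged sketch of that write-through maintenance; the new user values are illustrative, and without transactions the client is responsible for cleaning up after partial failures:

import boto3

dynamodb = boto3.resource("dynamodb")

def put_user(uid, fname, lname):
    # Write the primary item, then mirror the key into both index tables.
    dynamodb.Table("USERS").put_item(
        Item={"UID": uid, "FNAME": fname, "LNAME": lname})
    dynamodb.Table("FNAME_index").put_item(Item={"FNAME": fname, "UID": uid})
    dynamodb.Table("LNAME_index").put_item(Item={"LNAME": lname, "UID": uid})

put_user(701, "Tim", "Brown")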


Storing and analyzing time series (log) data


Problem
Your application stores historic or time series data. You want to keep data for the last N time periods and delete older data. Classic scenarios are:
- Logging, where you may want to delete or age out old data, i.e., only keep logs for the last N days/weeks/months.
- Round-robin or circular buffers, where you overwrite the oldest data so you don’t use excessive storage.
- Historic data tables, where current data is updatable while old data is read-only.

Solution
Pre-create N tables, one for each time period (e.g., one per hour/day/month). Periodically add new tables and delete older tables. Deleting a table is much more efficient and costs much less than deleting old items within a table, but the application must handle the sharding across N tables. Usually new tables are write-heavy and older tables are read-heavy; having multiple tables makes it easy to adjust the provisioned throughput levels according to your needs.

Let us assume you are storing historic data in a table with the following schema:

Event_id (hash key)   Timestamp   Attribute1 … AttributeN

Your application needs to keep the last 12 months of historic data.

You can pre-create one table per month and insert records into the appropriate table based on
timestamp.

JAN2012
Event_id (hash key)   Timestamp   Attribute1 … AttributeN

FEB2012
Event_id (hash key)   Timestamp   Attribute1 … AttributeN

…

DEC2012
Event_id (hash key)   Timestamp   Attribute1 … AttributeN

For example, if the Timestamp for a row is “2012-01-17 01:35:05”, it will go into the JAN2012 table. Since DynamoDB automatically partitions by the hash key (Event_id), you can provision the required read or write throughput for each table. For example, you may provision more reads and writes per second for the current month and much less for the previous months. When you create the JAN2013 table, you can use Amazon Elastic MapReduce and Hive to archive the JAN2012 table to S3, and then delete the JAN2012 table.
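
A sketch of routing writes to the correct monthly table (boto3; the helper name and event values are illustrative, and the timestamp format follows the example above):

import boto3
from datetime import datetime

dynamodb = boto3.resource("dynamodb")

def put_event(event_id, timestamp, **attributes):
    # "2012-01-17 01:35:05" -> "JAN2012": derive the table name from the month.
    month = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    table_name = month.strftime("%b%Y").upper()
    dynamodb.Table(table_name).put_item(
        Item={"Event_id": event_id, "Timestamp": timestamp, **attributes})

put_event("event-42", "2012-01-17 01:35:05", Attribute1="value1")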


Modeling one-to-many (1:N) and many-to-many (M:N) relationships

Problem
You can only create two types of tables in DynamoDB: a table with a hash key or a table with a hash-range composite key. How can I use these to model entities and 1:N or M:N relationships between entities in an application?

Solution
Let us consider an online gaming example with two entities, USERS and GAMES. There is a many-to-many (M:N) relationship between users and games, i.e., one user can play many games and a game can have many users. The application wants to store user profiles, game metadata, the highest score that a user posted in a game, plus a leaderboard. This can be modeled using DynamoDB tables as follows:

User profiles are kept in a USERS table with a hash key:


Hash key        Attributes
UserId = bob    Email = bob@gmail.com, JoinDate = 2011-11-15
UserId = fred   Email = fred@yahoo.com, JoinDate = 2011-12-01, Sex = M

Game metadata can be kept in a GAMES table with a hash key:


Hash key         Attributes
GameId = Game1   LaunchDate = 2011-10-15, Version = 2
GameId = Game2   LaunchDate = 2010-05-12, Version = 3
GameId = Game3   LaunchDate = 2012-01-20, Version = 1

User scores can be kept in a USER_GAMES table with a hash-range key:

Hash key        Range key        Attributes
UserId = bob    GameId = Game1   HighScore = 10500, ScoreDate = 2011-10-20
UserId = fred   GameId = Game2   HighScore = 12000, ScoreDate = 2012-01-10
UserId = bob    GameId = Game3   HighScore = 20000, ScoreDate = 2012-02-12

A LEADERBOARD can be implemented as a table with a hash-range key:


Hash key         Range key                       Attributes
GameId = Game1   Score_UserID = 000010500_bob    UserId = bob
GameId = Game2   Score_UserID = 000012000_fred   UserId = fred

Note: The range key is Score + UserID, where the UserID suffix is added to make the row unique in the event that two users in a game have the same score. The score is zero-padded so that its lexicographic (string) sort order matches numeric order (see the sketch below).
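
A sketch of composing that range key; the 9-digit pad width is an assumption that simply needs to exceed the largest possible score:

def score_range_key(score, user_id):
    # Zero-pad so that lexicographic (string) order matches numeric order,
    # then suffix the user id to keep the key unique on score ties.
    return "%09d_%s" % (score, user_id)

score_range_key(10500, "bob")    # -> "000010500_bob"
score_range_key(12000, "fred")   # -> "000012000_fred"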

The USER_GAMES table models the 1:N relationship between a user and games. The Query API can be used to retrieve all the games played by a user.

If you want to find the high scores achieved by user bob, you can Query the USER_GAMES table with hash key = bob and the range key set to a specific game, or with no range key condition if you want the high scores for all games.
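
For example (boto3):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("USER_GAMES")

# High scores for all games bob has played (no range key condition) ...
all_games = table.query(KeyConditionExpression=Key("UserId").eq("bob"))

# ... or for one specific game (range key condition added).
one_game = table.query(
    KeyConditionExpression=Key("UserId").eq("bob") & Key("GameId").eq("Game1"))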


Similarly, to find all the users of a game, you need a GAME_USERS table:

Hash key         Range key
GameId = Game1   UserId = bob
GameId = Game2   UserId = fred
GameId = Game3   UserId = bob

Even if you don’t need to find all the users for a game, you need the GAME_USERS table to maintain referential integrity, i.e., if a game is deleted, you can find all the users that played the game from the GAME_USERS table and use it to delete the unwanted data in the USER_GAMES table.

The USER_GAMES table along with GAME_USERS allows us to model the many-to-many (M:N) relationship between games and users. Both tables are also needed to maintain referential integrity even if you want to model just a 1:N relationship between games and users, or vice versa.

Summary:
One-to-many (1:N) and many-to-many (M:N) relationships between two entities can be modeled in DynamoDB using two join tables. The first join table has the first entity’s key as the hash key and the second entity’s key as the range key; the second join table has the inverse.

Avoiding hot keys and hot partitions


To achieve the provisioned throughput (read and write capacity units) associated with a table, Amazon DynamoDB automatically partitions the items in a table by the hash key element of the table’s primary key and distributes these partitions across multiple machines. The provisioned throughput is evenly divided among the partitions and reserved. There is no sharing (or bursting) across partitions, and hence no noisy-neighbor issues. Thus, if an application’s workload is evenly distributed across the hash key space, and thereby across all the partitions, the table can sustain the full provisioned throughput. But if the workload is skewed around a few hot keys or partitions, the throughput that the table can sustain will be limited by those hot partitions (where the hot keys reside). This will result in ProvisionedThroughputExceededException errors even though the read or write request rates are well below the provisioned throughput.

Hot key example


Problem
Your BoxOffice application has a MOVIE table in which you maintain the ticket count and ticket sales of movies. This table has a hash key “MovieName”. There is a hot movie in town, “Avatar”, and the BoxOffice application is doing 2,000 updates per second for this movie. Even though you provisioned 10,000 write capacity units for the MOVIE table, your application is getting ProvisionedThroughputExceededException errors, i.e., the partition where the “Avatar” key resides cannot sustain 2,000 write capacity units.

MOVIE:
MNAME
(hash key)   Attributes
Avatar       TicketCount = 32,000,000

Solution
Use multiple keys (aliases) instead of a single hot key. You can generate these aliases by prefixing or suffixing a random number (N) within a known range to the hot hash key. We suggest a large range, such as N = 100 to 200.

For example, the “Avatar” ticket count can be stored under multiple aliases (1_Avatar, 2_Avatar, …, N_Avatar).

MNAME
(hash key)   Attributes
1_Avatar     TicketCount = 4,000,000
2_Avatar     TicketCount = 2,000,000
3_Avatar     TicketCount = 4,000,000
…
N_Avatar     TicketCount = 4,000,000

While updating the ticket count for “Avatar”, the application can randomly pick a number from 1 to N and update one of the N possible keys. This will spread the IO across multiple keys and potentially multiple partitions.

To answer the question “find the ticket count for ‘Avatar’”, the application can use the BatchGetItem API to retrieve the ticket counts for all the aliases (1_Avatar, 2_Avatar, 3_Avatar, …, N_Avatar) and sum them.
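
A sketch of both paths (boto3; N = 100 is an assumed value in the suggested range, and conveniently matches BatchGetItem’s limit of 100 keys per call):

import random
import boto3

N = 100
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("MOVIE")

def record_sales(movie, count):
    # Write path: pick one of the N aliases at random and increment it.
    alias = "%d_%s" % (random.randint(1, N), movie)   # e.g. "37_Avatar"
    table.update_item(
        Key={"MNAME": alias},
        UpdateExpression="ADD TicketCount :c",
        ExpressionAttributeValues={":c": count})

def ticket_count(movie):
    # Read path: batch-get all N aliases and sum their counts.
    keys = [{"MNAME": "%d_%s" % (i, movie)} for i in range(1, N + 1)]
    resp = dynamodb.batch_get_item(RequestItems={"MOVIE": {"Keys": keys}})
    return sum(item["TicketCount"] for item in resp["Responses"]["MOVIE"])

record_sales("Avatar", 2)
total = ticket_count("Avatar")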

Hot partition (write) example


Problem
You are loading data into a table with a composite hash-range key from multiple workers. The data is sorted by UserId (hash key) and Timestamp (range key). This results in bursts of inserts into the partition where UserId “U1” is stored, then another burst into the partition where UserId “U2” is stored, and so on. You are getting ProvisionedThroughputExceededException errors even though the aggregate throughput is less than the provisioned throughput associated with the table. This is because of hot partitions, i.e., the rate of writes exceeds the per-partition provisioned throughput.

Sorted data:
UserId       Timestamp   …
(hash key)   (range key)
U1           1
U1           2
U1           …
U1           1000000000
U2           1
U2           2
U2           …
U2           1000000000

Solution
To avoid this issue, you can export the data sorted by Timestamp, or by any other column that produces a random distribution of hash keys. Think of this as shuffling your card deck before dealing the cards!

Data sorted by Timestamp:

UserId       Timestamp
(hash key)   (range key)
U1           1
U2           1
…            …
U1           2
U2           2
…            …
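
A minimal sketch of the “shuffle the deck” fix in a loader (boto3; the EVENTS table name and row layout are assumptions):

import random
import boto3

# Rows as exported: sorted by UserId, then Timestamp.
rows = [{"UserId": "U%d" % u, "Timestamp": t}
        for u in (1, 2) for t in range(1, 1001)]

# Shuffle before loading so consecutive writes land on different hash keys
# (and therefore on different partitions).
random.shuffle(rows)

table = boto3.resource("dynamodb").Table("EVENTS")
with table.batch_writer() as batch:
    for row in rows:
        batch.put_item(Item=row)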

Hot partition (read) example


Problem
I have a table with a hash-range composite key and one or more large attributes. I have queries that return a lot of data, and sometimes the queries return ProvisionedThroughputExceededException errors. Why?

Solution
Queries that return a large number of items have the potential to drain a significant amount of read capacity units from a partition, potentially putting it in a read outage for a short time. To avoid this, you can put the large attribute in a side (overflow) table and fetch it with a batch get based on the query results. Since the number of read units consumed by a query equals the cumulative size of the items returned (rounded up in 1KB increments), making the items much smaller means the query consumes only a small number of read capacity units even if the number of items returned is large.

For example, let us consider the following table:

TABLE1
ID (hash key)
Timestamp (range key)
Attributes: Attribute1, Attribute2, Attribute3, …, AttributeN, LargeAttribute

The size of LargeAttribute is 2K. A Query (ID (hash key) = XXX, with optional range key conditions) that returns 1,000 items can drain about 2,000+ read capacity units from a partition. This may exhaust the read capacity units in the partition and put it in a read outage, especially when there are multiple concurrent queries going to the same partition (ID (hash key) = XXX).

To avoid this issue, you can use the following scheme and put the large attribute in an overflow table:

TABLE1
ID (hash key)
Timestamp (range key)
Attributes: Attribute1, Attribute2, Attribute3, …, AttributeN, LargeAttributeUUID

TABLE1_OVERFLOW
LargeAttributeUUID (hash key)
Attributes: LargeAttribute

In this case, if the query returns 1,000 items and each attribute is about 10 bytes, the item size may be around 70 bytes (including 16 bytes for the LargeAttributeUUID), and hence the query may consume about 70 read units on TABLE1 instead of the 2,000 read units discussed earlier. The application can then do a batch get on the list of LargeAttributeUUIDs returned by the query. The batch get will distribute the IO uniformly across multiple partitions of the overflow table (TABLE1_OVERFLOW). Even though the total read units consumed might be the same across the two tables (TABLE1 and TABLE1_OVERFLOW), you avoid a hot partition.
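
A sketch of the read path against the overflow scheme (boto3; note that BatchGetItem accepts at most 100 keys per call, hence the chunking):

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")

# Cheap query against the slim TABLE1: small items, few read units.
resp = dynamodb.Table("TABLE1").query(
    KeyConditionExpression=Key("ID").eq("XXX"))
uuids = [item["LargeAttributeUUID"] for item in resp["Items"]]

# Fetch the large attributes; the batch get spreads IO across the
# overflow table's partitions instead of hammering one partition.
large = {}
for i in range(0, len(uuids), 100):
    chunk = [{"LargeAttributeUUID": u} for u in uuids[i:i + 100]]
    out = dynamodb.batch_get_item(
        RequestItems={"TABLE1_OVERFLOW": {"Keys": chunk}})
    for item in out["Responses"]["TABLE1_OVERFLOW"]:
        large[item["LargeAttributeUUID"]] = item["LargeAttribute"]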

Summary
 To avoid hot keys, distribute the workload across more keys.
 To avoid hot partitions, distribute the workload across multiple partitions.
 To avoid hot partitions caused by large attributes, move those attributes to an overflow table.

CloudSearch enabling DynamoDB data


Problem
I want to use DynamoDB as my high throughput, high durability, low latency data store. Yet I want the flexibility of CloudSearch to full-text index and search my data. How do I enable this scenario?

Solution
An application calling DynamoDB PutItem, UpdateItem, or DeleteItem can log versioned changes (via the ReturnValues API parameter with ALL_NEW, etc., in some cases) into a DynamoDB update stream table. DynamoDB changes should be mapped to either “type”:“add” or “type”:“delete” SDF records: PutItem is an add, UpdateItem is a delete followed by an add, and DeleteItem maps to a delete. Use the DynamoDB EMR/Hive handler to convert the update stream table into CSV or directly into an SDF file and dump it to S3. These SDF files can be loaded into CloudSearch to update its indexes. You should use one DynamoDB update stream table per day (or per week, etc.) and drop the old tables after moving them to S3 via EMR.
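
A hedged sketch of those mapping rules (Python; the function shape and record fields are assumptions based on the 2012-era SDF layout, not a published API; only the add/delete semantics come from the text above):

def to_sdf(op, doc_id, version, fields=None):
    # One CloudSearch SDF record; "add" records also carry the fields.
    record = {"type": op, "id": doc_id, "version": version}
    if op == "add":
        record["lang"] = "en"
        record["fields"] = fields or {}
    return [record]

def change_to_sdf(api_call, doc_id, version, new_image=None):
    if api_call == "PutItem":        # PutItem is an add
        return to_sdf("add", doc_id, version, new_image)
    if api_call == "UpdateItem":     # UpdateItem is a delete followed by an add
        return (to_sdf("delete", doc_id, version) +
                to_sdf("add", doc_id, version + 1, new_image))
    if api_call == "DeleteItem":     # DeleteItem maps to a delete
        return to_sdf("delete", doc_id, version)

change_to_sdf("PutItem", "movie_123", 1, {"title": "Avatar"})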

Here is an example:


Let us assume I am maintaining movies in a DynamoDB Movies table. When the DynamoDB client application (in your app tier) updates the Movies table, it logs the changes in a Movies_update_stream_<day> table; I am using one table per day here. At the end of the day, I export the data to S3 and from there import it into CloudSearch.

[Figure: Propagating incremental changes from the DynamoDB Movies table to CloudSearch]

An example SDF record generated from the Movies_update_stream_2012_04_16 table might look like the following (all field names and values here are illustrative assumptions):
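
{
  "type": "add",
  "id": "movie_123",
  "version": 1334534400,
  "lang": "en",
  "fields": {
    "title": "Avatar",
    "genre": "Sci-Fi",
    "rating": "PG-13"
  }
}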


Conclusion
DynamoDB should be used when your applications do not require complex queries or transactions, need to scale now or in the future, and demand low latency and high durability.

While items greater than 64K in size can be stored in DynamoDB across multiple items using the range key, large blobs should be stored in S3.

For storing time series data, applications should create a table per day, per week, etc., and move old tables into S3 using the DynamoDB Hive handler.

You can create secondary index tables in DynamoDB for efficiently searching non-key attributes. You can
also use CloudSearch for indexing and searching DynamoDB attributes.

You can avoid hot keys by picking the right key structure. You can avoid hot partitions by placing large attributes in a separate overflow table and/or randomizing the workload across partitions (hash keys).

One-to-many (1:N) and many-to-many (M:N) relationships between two entities can be modeled in DynamoDB using join tables.

As shown above, most of the constraints imposed by DynamoDB can be worked around using a combination of these design patterns. While these constraints create more work for the developer upfront, they (for the most part) force a developer to write applications that scale with predictable performance and avoid nasty surprises as the workload increases.
