The DynamoDB Book
Alex DeBrie
Foreword
1. What is DynamoDB?
2.4. Conclusion
3. Advanced Concepts
3.3. Partitions
3.4. Consistency
3.7. Conclusion
4.2. Query
4.3. Scan
4.5. Conclusion
5.4. Summary
6. Expressions
15.3. Adding a new entity type into an existing item collection
15.4. Adding a new entity type into a new item collection
data into a single table!
When I got home that night, I watched the video recording of the
same session. Sure enough, my ears weren’t deceiving me. Rick
Houlihan was an assassin, and DynamoDB was his weapon of
choice.
Over the next month, I spent my free time trying to decipher the
ins and outs of Rick’s talk. I compared my notes against DynamoDB
documentation and tried replicating his models in my own
examples. The more I read, the more excited I became. My
Christmas holiday was focused on sharing this knowledge with
others, and in January 2018, I published DynamoDBGuide.com, a
website aimed at sharing my newfound love for DynamoDB.
In the time since hitting publish on that site, I’ve learned much
more about DynamoDB. DynamoDBGuide.com will get you
started, but it’s not going to take you to the top of the mountain.
This is the book I wish I had when I started with DynamoDB. The
first few chapters will warm you up with the basics on DynamoDB
features and characteristics. But to paraphrase James Carville, "It’s
the data model, stupid!" The hard part about DynamoDB is making
the shift from an RDBMS mindset to a NoSQL mindset. We’ll go
deep on DynamoDB data modeling in this book, from discussion of
the DynamoDB API, to various strategies to use when using
DynamoDB, to five full-length walkthroughs. There are some
things you can only learn by doing, but it doesn’t hurt to have a
guide along the way.
And that Rick Houlihan guy? He now has the most popular
re:Invent session year after year, enjoys a cult following on Twitter,
and somehow agreed to write the foreword to this book.
Acknowledgements
This book would not have been possible without help from so
many people. I’m sure to leave out many of them.
Thanks to my parents, siblings, in-laws, and extended family
members that stood by me as I moved from lawyer to developer to
"self-employed" (unemployed?) author. I’m grateful for the support
from all of you.
Foreword
For some, data modeling is a passion. Identifying relationships,
structuring the data, and designing optimal queries is like solving a
complex puzzle. Like a builder when a job is done and a structure is
standing where an empty lot used to be, the feelings are the same:
satisfaction and achievement, validation of skill, and pride in the
product of hard work.
Data has been my life for a long time. Throughout a career that has
spanned three decades, data modeling has been a constant. I cannot
remember working on a project where I did not have a hand in the
data layer implementation, and if there is one place in the stack I
am most comfortable, it is there. The data layer is not usually
considered to be the most exciting component, which meant I
owned it. Determining the best way to model the data, crafting the
most efficient queries, and generally making things run faster and
cheaper has always been a big part of my developer life.
data. Create business objects in code and let the underlying ORM
tooling manage the database for you. This type of abstraction can
produce reasonable results when relational databases are queried to
produce “views” of data required by the application. A “view” is by
definition an abstraction of the data model since it is a
representation of data stored across many tables and is not
persisted in the same form that the data is presented. The
developer specifies a set of queries to populate business objects, and
then they can leverage the ORM framework to handle the more
complex process of managing data stored in multi-table relational
models.
What this all means is in the end the data actually is the code. There
is no getting away from the fact that abstraction and optimization
are polar opposites from a design perspective. Data model
abstraction makes it easy for the developer to deliver a working
system, but abstraction requires a system that makes assumptions
about access patterns which leads to less than optimal solutions. It is
not possible to be agnostic to all access patterns and at the same
time be optimized for any but the simplest. In the end DBA skills
are always required to make things run smoothly at scale. There is
just no avoiding it.
Today it becomes immediately apparent after the most cursory
analysis that the cost dynamics of managing applications at scale
have completely reversed. Storage is literally pennies per gigabyte
and CPU time is where the money is being spent. In today’s IT
environment, the relational database has become a liability, not an
asset, and the CPU cost of executing ad hoc queries at high
transaction rates against large datasets has become a barrier to scale.
The era of Big Data has pushed the traditional relational database
platforms to the breaking point and beyond, and has produced
requirements that simply cannot be met by legacy relational
database technology.
which makes NoSQL a perfect choice.
After all is said and done, I have come to realize that there is one
underlying principle when it comes to NoSQL. At the core of all
NoSQL databases there is a collection of disparate objects, tied
together by indexed common attributes, and queried with
conditional select statements to produce result sets. NoSQL does
not join data across tables; it effectively achieves the same result by
storing everything in one table and using indexes to group items. I
will never get tired of watching a team I am working with turn the
corner and have what I call the “light bulb” moment when they
finally understand data modeling for NoSQL, and how truly
powerful and flexible it really is.
Data is code.
Rick Houlihan
Chapter 1. What is DynamoDB?
The future is here, it’s just not evenly-distributed.
The future is here, but some people are still using legacy databases.
I’ve been using DynamoDB for the last five years, and I can’t go
back to any other database. The billing model, the permissions
system, the scalability, the connection model—it’s an improvement
from other databases in so many ways.
But first, let’s cut loose the inaccurate information you may have
picked up around DynamoDB and NoSQL over the years.
You may have heard that DynamoDB can only handle simple
access patterns. Insert an individual item and read it back; anything
more complex, you’ll need to use a "real" database.
companies rely on DynamoDB to handle high-volume transactions
against core data at scale.
to your existing model. In the event you need to modify existing
records, there’s a straightforward pattern for doing so.
But truly schemaless data is madness. Good luck reading out the
garbage you’ve written into your table.
While it’s true that DynamoDB won’t enforce a schema to the extent
that a relational database will, you will still need a schema
somewhere in your application. Rather than validating your data at
the database level, it will now be an application-level concern.
You still need to plan your data model. You still need to think about
object properties. Don’t use NoSQL as an excuse to skimp on your
job.
databases. Let’s see some of the features that distinguish
DynamoDB from other databases.
DynamoDB has support for two similar data models. First, you can
use DynamoDB as a key-value store. Think of a key-value store like
a giant, distributed hash table (in your programming language of
choice, it may be called a dictionary or a map). A hash table
contains a large number of elements, each of which is uniquely
identifiable by a key. You can get, set, update, and delete these
elements by referring to their primary keys. Hash tables are a
commonly-used data structure because of their fast, consistent
performance no matter the size of the data set.
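In code, key-value access is as simple as it sounds. Here's a rough sketch using the Python client; the table and key names are just for illustration:

import boto3

client = boto3.client('dynamodb')

# Fetch a single element by its key, just like a hash table lookup.
response = client.get_item(
    TableName='Users',
    Key={ 'Username': { 'S': 'alexdebrie' } }
)
item = response.get('Item')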
The problem with a key-value store is that you can only retrieve
one record at a time. But what if you want to retrieve multiple
records? For example, I might want to read the ten most recent
readings for my IoT sensor, or I may want to fetch a Customer and
all of the Customer’s Orders over the last 6 months.
To handle these more complex access patterns, you can also use
DynamoDB as a wide-column store. A wide-column store is like a
super-charged version of a hash table where the value for each
record in your hash table is a B-tree. A B-tree is another
commonly-used data structure that allows you to quickly find a
particular item in the data structure while also allowing for range
queries. Think of a B-tree like a phone book. It is relatively easy to
open a phone book and find the entry for DeBrie, Alex. It’s also
straightforward to find all entries with the last name "Cook" or to
find all entries between "Benioff, Marc" and "Bezos, Jeff".
To my readers born after 1995, a "phone book" was a real-life physical book
containing the names, addresses, and phone numbers of everyone in a
particular location. The phone book was in alphabetical order by last name.
Imagine that! This was before fancy things like Google and Facebook rendered
phone books obsolete for anything other than kindling.
will continue to get lightning-fast response times as your database
grows. You can scale your database to 10 TB, and you won’t see any
performance degradation from when your database was only 10 GB.
There are DynamoDB users with tables over 100 TB, and they still
see response times in the single-digit milliseconds.
The HTTP-based model can be a bit slower than the persistent TCP
connection model for some requests, since there isn’t a readily-
available connection to use for requests. However, persistent
connections have downsides as well. You need some initialization
time to create the initial connection. Further, holding a persistent
connection requires resources on the database server, so most
databases limit the number of open connections to a database. For
example, PostgreSQL, the popular open-source relational database
engine, sets the default number of maximum connections to 100.
• IAM authentication
DynamoDB uses AWS IAM for authentication and authorization of
database requests rather than a username and password model that
is common with other database systems.
If you want to get more granular, you can even limit IAM
permissions such that the authenticated user may only operate on
DynamoDB items with certain primary keys or may only view
certain attributes on allowed keys.
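As a rough sketch of what that fine-grained access control can look like, here is an IAM policy expressed as a Python dict (in practice you would attach it as JSON). The table ARN, policy variable, and attribute names are assumptions for illustration:

# Allow reads only on items whose partition key matches the caller's identity,
# and only expose a couple of attributes, via the dynamodb:LeadingKeys and
# dynamodb:Attributes condition keys.
fine_grained_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["dynamodb:GetItem", "dynamodb:Query"],
        "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/Users",
        "Condition": {
            "ForAllValues:StringEquals": {
                "dynamodb:LeadingKeys": ["${aws:userid}"],
                "dynamodb:Attributes": ["Username", "FirstName", "LastName"]
            }
        }
    }]
}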
• Infrastructure-as-code friendly
For most databases, managing the database via infrastructure-as-
code is an awkward endeavor. You can do part of the work to
prepare your database for use in your application—such as creating
the database and configuring network access—using a tool like
Terraform or CloudFormation. However, there’s another set of
tasks, such as creating database users, initializing tables, or
performing table migrations, that don’t fit well in an infrastructure-
as-code world. You will often need to perform additional
administrative tasks on your database outside your infrastructure-
as-code workflow.
There are two amazing things about DynamoDB’s pricing model.
The first is that you can tweak your read and write throughput
separately. Usually your database is fighting over the same
resources for handling read and write workflows. Not so with
DynamoDB. If you have a write-heavy workload, you can crank up
your write throughput while leaving your read throughput low.
If you don’t know your access patterns or don’t want to take the
time to capacity plan your workload, you can use On-Demand
Pricing from DynamoDB. With this pricing model, you pay per
request rather than provisioning a fixed amount of capacity. The
per-request price is higher than the provisioned mode, but it can
still save you money if you have a spiky workload that doesn’t take
full advantage of your provisioned capacity.
The best part is that you can switch between pricing models over
time. I recommend that you start with On-Demand Pricing as you
develop a baseline traffic level for your application. Once you feel
like you have a strong understanding of your traffic needs, you can
switch to defining your Provisioned Throughput to lower costs.
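Switching is a single API call. A sketch with the Python client, using a made-up table name and throughput numbers:

import boto3

client = boto3.client('dynamodb')

# Move an existing table from On-Demand to Provisioned capacity once you
# understand your baseline traffic. Read and write throughput are tuned
# independently, which is handy for a write-heavy workload.
client.update_table(
    TableName='MyTable',
    BillingMode='PROVISIONED',
    ProvisionedThroughput={
        'ReadCapacityUnits': 10,
        'WriteCapacityUnits': 50
    }
)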
microservices and partly due to the popularization of technologies
like Apache Kafka and AWS Lambda. There are a number of efforts
to retrofit change data capture onto existing databases in order to
enable other services to react to changes in a database. With
DynamoDB, this feature is built directly into the core and doesn’t
require difficult workarounds and additional infrastructure
maintenance.
• Fully-managed
Just as more and more companies are moving from on-prem data
centers to cloud-based compute, we’re seeing people prefer fully-
managed database solutions due to the complexity required in
keeping a database running. AWS has an enormous team of world-
leading experts in building and maintaining critical infrastructure.
It’s likely they are able to handle this better than you and in a more
cost-effective manner.
DynamoDB. The next question to ask is when you should choose
DynamoDB in your application. There’s been a proliferation of
database options in recent years, and you should carefully consider
your application needs when deciding on your database.
In the last few years, two areas have been driving the adoption of
DynamoDB:
1.2.1. Hyperscale
The first core use case for DynamoDB is for hyper-scale
applications. This is what DynamoDB was made for. To understand
this, let’s have a history lesson.
From the 1970s through the year 2000, the relational database
management system (RDBMS) reigned supreme. Oracle, Microsoft
SQL Server, and IBM’s DB2 were popular proprietary choices,
while MySQL and PostgreSQL took off as open-source offerings.
The relational database was built for a world where storage was the
limiting factor. Memory and disk were expensive in the early
computing age, so relational databases optimized for storage. Data
would be written once, in a normalized fashion, to avoid
duplicating information. Developers learned SQL (which stands for
"Structured Query Language") as a way to reassemble related bits of
data spread across multiple locations. SQL and database
normalization were focal points of developer education.
In the early 2000s, we started to see the first signs of cracks in the
RDBMS monopoly. A few factors contributed to this breakup. First,
the price of storage fell through the floor. Hard drives went from
costing $200,000 per GB in 1980 to $0.03 per GB in 2014.
Conserving on storage was no longer economically necessary.
Amazon.com has been known for its "Cyber Monday" deals. This is
an online version of "Black Friday", the major shopping day right
after Thanksgiving where people are gearing up for the holiday
season. As Cyber Monday became more and more popular,
Amazon had trouble keeping up with the load on their
infrastructure. In 2004, this came to a head with a number of
scaling challenges that led Amazon to rethink some of their core
infrastructure.
The public got a peek behind the curtain of this new infrastructure
through the Dynamo Paper. Published by a group of Amazon.com
engineers in 2007, the Dynamo Paper described a new kind of
database. It rethought a number of key assumptions underlying
relational databases based on the needs of modern applications.
creating Dynamo were:
form of DynamoDB in 2012.
The first paradigm shift was due to the plummeting cost of storage
and the increasing performance needs due to the internet. The
second paradigm shift was in the rise of what I call 'hyper-
ephemeral compute'. Again, let’s look at the full history to
understand this change.
For the last forty years, the basic compute model for applications
has been relatively consistent. Generally, you have some amount of
CPU, memory, and disk available somewhere as a server. You
execute a command to start a long-running process to run your
application. Your application might reach out to other resources,
such as a database server, to help perform its actions. At some point,
your application will be restarted or killed.
Each of the shifts above was important, but you still had the same
general shape: a long-running compute instance that handled
multiple requests over its lifetime.
popularized the notion of event-driven compute. You upload your
code to AWS, and AWS will execute it in response to an event.
Because our compute may not be created until there’s a live event,
ready and waiting to be processed, we need the compute
provisioning speed to be as fast as possible. AWS Lambda has
optimized a lot of this by making it lightning fast to pull down your
code and start its execution, but you need to do your part as well.
And doing your part means avoiding long initialization steps. You
don’t have the time to set up persistent database connection pools.
since you don’t need network partitioning, your ephemeral
compute is able to access your database without first setting up the
proper network configuration.
1.3. Comparisons to other databases
• Relational databases
• MongoDB
• Apache Cassandra
web application frameworks. Finally, a relational database gives you
a ton of flexibility in your data model, and you can iterate on your
application more easily.
For me, I still choose DynamoDB every time because I think the
benefits of serverless applications are so strong. That said, you need
to really learn DynamoDB patterns, think about your access
patterns upfront, and understand when you should deviate from
DynamoDB practices built for scale.
While these indexes give you additional power, they come at a cost.
Using these special indexes is likely to hurt you as your data
scales. When you’re talking about an immense scale, you need to
use a targeted, specialized tool rather than a more generic one. It’s
like the difference between a power saw and a Swiss army knife—
the Swiss army knife is more adaptable to more situations, but the
power saw can handle some jobs that a Swiss army knife never
could.
The other thing to consider is the tradeoff between lock-in and the
hosting options. If you’re concerned about cloud provider lock-in,
MongoDB is a better bet as you can run it on AWS, Azure, GCP, or
the RaspberryPi in your closet. Again, my bias is toward not
worrying about lock-in. AWS has, thus far, acted in the best
interests of its users, and there are tremendous benefits of going all-
in on a single cloud.
You should also consider how you’re going to host your MongoDB
database. You can choose to self-host on your server instances, but I
would strongly recommend against it for reasons further discussed
in the Apache Cassandra section below. Your data is the most
valuable part of your application, and you should rely on someone
with some expertise to protect it for you.
1.3.3. DynamoDB vs. Apache Cassandra
Apache Cassandra is an open-source NoSQL database that was
created at Facebook and donated to the Apache Foundation. It’s
most similar to DynamoDB in terms of the data model. Like
DynamoDB, it uses a wide-column data model. Most of the data
modeling recommendations for DynamoDB also apply to
Cassandra, with some limited exceptions for the difference in
feature set between the two databases.
If the data model is the same and lock-in isn’t a concern to you, the
only difference remaining is how you host the database. If you
choose DynamoDB, you get a fully-managed database: no servers to
provision, plus failover management, automated backups, and
granular billing controls. If you choose Cassandra, you have to hire
a team of engineers whose entire job it is to make sure a giant
cluster of machines with your company’s most crucial asset doesn’t
disappear.
is available.
Chapter 2. Core Concepts in
DynamoDB
Chapter Summary
This chapter covers the core concepts of DynamoDB. For those
new to DynamoDB, understanding these concepts and
terminology will help with the subsequent sections of this book.
Sections
1. Basic vocabulary
2. A deeper look at primary keys & secondary indexes
3. The importance of item collections
Now that we know what DynamoDB is and when we should use it,
let’s learn some of the key concepts within DynamoDB. This
chapter will introduce the vocabulary of DynamoDB—tables, items,
attributes, etc.--with comparisons to relational databases where
relevant. Then we’ll take a deeper look at primary keys, secondary
indexes, and item collections, which are three of the foundational
concepts in DynamoDB.
of them.
2.1.1. Table
The first basic concept in DynamoDB is a table. A DynamoDB table
is similar in some ways to a table in a relational database or a
collection in MongoDB. It is a grouping of records that
conceptually belong together.
2.1.2. Item
An item is a single record in a DynamoDB table. It is comparable to
a row in a relational database or a document in MongoDB.
2.1.3. Attributes
A DynamoDB item is made up of attributes, which are typed data
values holding information about the element. For example, if you
had an item representing a User, you might have an attribute
named "Username" with a value of "alexdebrie".
programming language. Each element in a set must be the same
type, and there are three set types: string sets, number sets, and
binary sets. Sets are useful for tracking uniqueness in a particular
domain. In the GitHub example in Chapter 21, we use a set to
track the different reactions (e.g. heart, thumbs up, smiley face)
that a User has attached to a particular issue or pull request.
You will likely use scalars for most of your attributes, but the set
and document types are very powerful. You can use sets to keep
track of unique items, making it easy to track the number of distinct
elements without needing to make multiple round trips to the
database. Likewise, the document types are useful for several
things, particularly when denormalizing the data in your table.
Each item in your table must include the primary key. If you
attempt to write an item without the primary key, it will be rejected.
Further, each item in your table is uniquely identifiable by its
primary key. If you attempt to write an item using a primary key
that already exists, it will overwrite the existing item (unless you
explicitly state that it shouldn’t overwrite, in which case the write
will be rejected).
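Here's a quick sketch of what explicitly stating that the write shouldn't overwrite looks like in practice, using the Python client and an illustrative Users table:

# The condition expression makes DynamoDB reject the write with a
# ConditionalCheckFailedException if an item with this primary key already exists.
client.put_item(
    TableName='Users',
    Item={
        'Username': { 'S': 'alexdebrie' },
        'FirstName': { 'S': 'Alex' }
    },
    ConditionExpression='attribute_not_exists(Username)'
)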
Primary key selection and design is the most important part of data
modeling with DynamoDB. Almost all of your data access will be
driven off primary keys, so you need to choose them wisely. We
will discuss primary keys further in this chapter and in subsequent
chapters.
When you create a secondary index on your table, you specify the
primary keys for your secondary index, just like when you’re
creating a table. AWS will copy all items from your main table into
the secondary index in the reshaped form. You can then make
queries against the secondary index.
The first four concepts can be seen in the following table containing
some example users:
We have three records in our example. All the records together are
our table (outlined in red). An individual record is called an item.
You can see the item for Jeff Bezos outlined in blue. Each item has a
primary key of Username, which is outlined in green. Finally, there
are other attributes, like FirstName, LastName, and Birthdate, which
are outlined in black.
In this section, we’ll cover the two kinds of primary keys, the two
kinds of secondary indexes, and the concept of projection in
secondary indexes.
2.2.1. Types of primary keys
In DynamoDB, there are two kinds of primary keys:
You may occasionally see a partition key called a "hash key" and a
sort key called a "range key". I’ll stick with the "partition key" and
"sort key" terminology in this book.
The type of primary key you choose will depend on your access
patterns. A simple primary key allows you to fetch only a single
item at a time. It works well for one-to-one operations where you
are only operating on individual items.
• Local secondary indexes
• Global secondary indexes
A local secondary index uses the same partition key as your table’s
primary key but a different sort key. This can be a nice fit when
you are often filtering your data by the same top-level property but
have access patterns to filter your dataset further. The partition key
can act as the top-level property, and the different sort key
arrangements will act as your more granular filters.
There are a few other differences to note between local and global
secondary indexes. For global secondary indexes, you need to
provision additional throughput for the secondary index. The read
and write throughput for the index is separate from the core table’s
throughput. This is not the case for local secondary indexes, which
use the throughput from the core table.
consistency. Data is replicated from the core table to global
secondary indexes in an asynchronous manner. This means it’s
possible that the data returned in your global secondary index does
not reflect the latest writes in your main table. The delay in
replication from the main table to the global secondary indexes
isn’t large, but it may be something you need to account for in your
application.
On the other hand, local secondary indexes allow you to opt for
strongly-consistent reads if you want them. Strongly-consistent reads
on local secondary indexes consume more read throughput than
eventually-consistent reads, but they can be beneficial if you have
strict requirements around consistency.
2.3. The importance of item collections
One example I’ll use a few times in this book is a table that includes
actors and actresses and the movies in which they’ve played roles.
We could model this with a composite primary key where the
partition key is Actor and the sort key is Movie.
There are four movie roles in this table. Notice that two of those
movie roles have the same partition key: Tom Hanks. Those two
movie role items are said to be in the same item collection.
Likewise, the single movie role for Natalie Portman is in an item
collection, even though it only has one item in it.
Item collections are important for two reasons. First, they are useful
for partitioning. DynamoDB partitions your data across a number
of nodes in a way that allows for consistent performance as you
scale. However, all items with the same partition key will be kept on
the same storage node. This is important for performance reasons.
DynamoDB partitioning is discussed further in the next chapter.
2.4. Conclusion
You shouldn’t be an expert in these topics yet, but they are the
foundational building blocks of DynamoDB. Almost all of your data
modeling will be focused on designing the right primary key and
secondary indexes so that you’re building the item collections to
handle your needs.
Chapter 3. Advanced Concepts
Chapter Summary
This chapter covers advanced concepts in DynamoDB. While
these concepts aren’t strictly necessary, they are useful for
getting the most out of DynamoDB.
Sections
• DynamoDB Streams
• Time-to-live (TTL)
• Partitions
• Consistency
• DynamoDB Limits
• Overloading keys and indexes
data modeling with DynamoDB. Finally, the concept of overloaded
keys and indexes is a data modeling concept that will be used
frequently in your data modeling.
With DynamoDB streams, you can create a stream of data that
includes a record of each change to an item in your table.
Whenever an item is written, updated, or deleted, a record
containing the details of that change will be written to your
DynamoDB stream. You can then process this stream with AWS
Lambda or other compute infrastructure.
DynamoDB streams enable a variety of use cases, from using
DynamoDB as a work queue to broadcasting event updates across
microservices. The combination of DynamoDB Streams with
serverless compute with AWS Lambda gives you a fully-managed
system to react to database changes.
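A stream consumer is just a function that receives batches of change records. Here's a minimal sketch of a Lambda handler, assuming the stream is configured to include new and old item images:

def handler(event, context):
    # Each record describes a single item-level change from the stream.
    for record in event['Records']:
        if record['eventName'] == 'INSERT':
            new_image = record['dynamodb']['NewImage']
            print(f"New item written: {new_image}")
        elif record['eventName'] == 'REMOVE':
            old_image = record['dynamodb']['OldImage']
            print(f"Item deleted: {old_image}")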
your specified attribute. This timestamp should state the time after
which the item should be deleted. DynamoDB will periodically
review your table and delete items that have your TTL attribute set
to a time before the current time.
One nice feature of TTL is that you don’t need to set it on all
items in your table. For items that you don’t want to automatically
expire, you can simply not set the TTL attribute on the item. This
can be useful when you are storing items with different mechanics
in the table. Imagine an access keys table where user-generated
tokens are active until intentionally deactivated, while machine-
generated tokens are expired after ten minutes.
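As a sketch, writing a machine-generated token with a ten-minute TTL might look like the following. The table and attribute names are assumptions, and TTL would need to be enabled on the table with ExpiresAt as the configured attribute:

import time
import boto3

client = boto3.client('dynamodb')

# The TTL attribute must be a Number holding an epoch timestamp in seconds.
client.put_item(
    TableName='AccessKeys',
    Item={
        'TokenId': { 'S': 'mach-1234' },
        'TokenType': { 'S': 'MACHINE' },
        'ExpiresAt': { 'N': str(int(time.time()) + 600) }
    }
)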
3.3. Partitions
and it does that by sharding your data across multiple server
instances.
chapter.
3.4. Consistency
The primary node for a partition holds the canonical, correct data
for the items in that node. When a write request comes in, the
primary node will commit the write and send the write to one of
two secondary nodes for the partition. This ensures the write is
saved in the event of a loss of a single node.
After the primary node responds to the client to indicate that the
write was successful, it then asynchronously replicates the write to a
third storage node.
each partition. These secondary nodes serve a few purposes. First,
they provide fault-tolerance in case the primary node goes down.
Because that data is stored on two other nodes, DynamoDB can
handle a failure of one node without data loss.
With that in mind, let’s look at the two consistency options available
with DynamoDB:
• Strong consistency
• Eventual consistency
With strong consistency, any item you read from DynamoDB will
reflect all writes that occurred prior to the read being executed. In
contrast, with eventual consistency, it’s possible the item(s) you read
will not reflect all prior writes.
Finally, there are two times you need to think about consistency
with DynamoDB.
First, whenever you are reading data from your base table, you can
choose your consistency level. By default, DynamoDB will make an
eventually-consistent read, meaning that your read may go to a
secondary node and may show slightly stale data. However, you can
opt into a strongly-consistent read by passing
ConsistentRead=True in your API call. An eventually-consistent
read consumes half the write capacity of a strongly-consistent read
and is a good choice for many applications.
use in a table’s name to the maximum length of a partition key.
Most of these are minutia that we won’t cover here. I want to cover
a few high-salience limits that may affect how you model your data
in DynamoDB.
This limit will affect you most commonly as you denormalize your
data. When you have a one-to-many relationship, you may be
tempted to store all the related items on the parent item rather than
splitting this out. This works for many situations but can blow up if
you have an unbounded number of related items.
This 1MB limit is crucial to keeping DynamoDB’s promise of
consistent single-digit millisecond response times. If you have a request that
will address more than 1MB of data, you will need to paginate
through the results by making follow-up requests to DynamoDB.
In the next chapter, we’ll see why the 1MB limit is crucial for
DynamoDB to ensure you don’t write a query that won’t scale.
This is pretty high traffic volume, and not many users will hit it,
though it’s definitely possible. If this is something your application
could hit, you’ll need to look into read or write sharding your data.
indexes. If the items in a global secondary index for a partition key
exceed 10 GB in total storage, they will be split across different
partitions under the hood. This will happen transparently to you—
one of the significant benefits of a fully-managed database.
In the examples we’ve shown thus far, like the Users table or the
Movie Roles table in the previous chapter, we’ve had pretty simple
examples. Our tables had just a single type of entity, and the
primary key patterns were straightforward.
For an example of what this looks like, imagine you had a SaaS
application. Organizations signed up for your application, and each
Organization had multiple Users that belonged to the Organization.
Let’s start with a table that just has our Organization items in it:
In the image above, we have two Organization items—one for
Berkshire Hathaway and one for Facebook. There are two things
worth noting here.
First, notice how generic the names of the partition key and sort
key are. Rather than having the partition key named OrgName,
the partition key is titled PK, and the sort key is SK. That’s because
we will also be putting User items into this table, and Users don’t
have an OrgName. They have a UserName.
Second, notice that the PK and SK values have prefixes. The pattern
for both is ORG#<OrgName>. We do this for a few reasons. First, it
helps to identify the type of item that we’re looking at. Second, it
helps avoid overlap between different item types in a table.
Remember that a primary key must be unique across all items in a
table. If we didn’t have this prefix, we could run into accidental
overwrites. Imagine if the real estate company Keller Williams
signed up for our application, and the musician Keller Williams was
a user of our application. The two could overwrite each other!
Let’s edit our table to add Users now. A table with both
Organization and User entities might look as follows:
Here we’ve added three Users to our existing Organization items.
Our User items use a PK value of ORG#<OrgName> and an SK value of
USER#<UserName>.
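To make the overloading concrete, here's roughly what the underlying items might look like when written (the specific names and attributes are illustrative):

# An Organization item: PK and SK hold the same ORG#<OrgName> value.
organization = {
    'PK': { 'S': 'ORG#BerkshireHathaway' },
    'SK': { 'S': 'ORG#BerkshireHathaway' },
    'OrgName': { 'S': 'Berkshire Hathaway' }
}

# A User item: same PK as its Organization, but a USER#<UserName> sort key.
user = {
    'PK': { 'S': 'ORG#BerkshireHathaway' },
    'SK': { 'S': 'USER#WarrenBuffett' },
    'UserName': { 'S': 'Warren Buffett' },
    'OrgName': { 'S': 'Berkshire Hathaway' }
}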
This concept of using generic names for your primary keys and
using different values depending on the type of item is known as
overloading your keys. You will do this with both your primary keys
and your secondary indexes to enable the access patterns you need.
If you feel confused, that’s OK. In the coming chapters, we’ll slowly
build up to how and why you want to use key overloading to handle
your access patterns. I wanted to introduce the idea here so that
you’re not confused if you see PK and SK examples in subsequent
chapters.
3.7. Conclusion
Chapter 4. The Three API
Action Types
Chapter Summary
This chapter covers the types of API actions in DynamoDB.
You will learn about the three types of API actions and when
they are useful.
Sections
1. Background on the DynamoDB API
2. Item-based actions
3. Queries
4. Scans
5. How DynamoDB enforces efficient data access
In contrast, you usually interact with DynamoDB by using the AWS
SDK or a third-party library in your programming language of
choice. These SDKs expose a few API methods to write to and read
from your DynamoDB table.
In this chapter, we’ll learn about the core API actions with
DynamoDB. The API for DynamoDB is small but powerful. This
makes it easy to learn the core actions while still providing a
flexible way to model and interact with your data.
1. Item-based actions
2. Queries
3. Scans
The API actions are divided based on what you’re operating on.
In the sections below, we’ll walk through the details of these three
categories.
There are three rules around item-based actions. First, the full
primary key must be specified in your request. Second, all actions to
alter data—writes, updates, or deletes—must use an item-based
action. Finally, all item-based actions must be performed on your
main table, not a secondary index.
Single-item actions must include the entire primary key of the item(s) being
referenced.
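For example, with the composite primary key from the Movies and Actors table, a GetItem call has to provide both the partition key and the sort key. A sketch:

# Both elements of the composite primary key are required to address the item.
response = client.get_item(
    TableName='MoviesAndActors',
    Key={
        'Actor': { 'S': 'Tom Hanks' },
        'Movie': { 'S': 'Cast Away' }
    }
)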
In addition to the core single-item actions above, there are two sub-
categories of single-item API actions—batch actions and transaction
actions. These categories are used for reading and writing multiple
DynamoDB items in a single request. While these operate on
multiple items at once, I still classify them as item-based actions
because you must specify the exact items on which you want to
operate. The separate requests are split up and processed once they
hit the DynamoDB router, and the batch requests simply save you
from making multiple trips.
There is a subtle difference between the batch API actions and the
transactional API actions. In a batch API request, your reads or
writes can succeed or fail independently. The failure of one write
won’t affect the other writes in the batch.
With the transactional API actions, on the other hand, all of your
reads or writes will succeed or fail together. The failure of a single
write in your transaction will cause the other writes to be rolled
back.
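Here's a sketch of the batch read flavor, fetching two specific items from our example table in one request; each key succeeds or fails on its own:

response = client.batch_get_item(
    RequestItems={
        'MoviesAndActors': {
            'Keys': [
                { 'Actor': { 'S': 'Tom Hanks' }, 'Movie': { 'S': 'Cast Away' } },
                { 'Actor': { 'S': 'Tom Hanks' }, 'Movie': { 'S': 'Toy Story' } }
            ]
        }
    }
)
# Any keys DynamoDB couldn't process this time show up in 'UnprocessedKeys'.
unprocessed = response.get('UnprocessedKeys')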
4.2. Query
The second category of API actions is the Query API action. The
Query API action lets you retrieve multiple items with the same
partition key. This is a powerful operation, particularly when
modeling and retrieving data that includes relations. You can use
the Query API to easily fetch all related objects in a one-to-many
relationship or a many-to-many relationship.
You can use the Query operation on either your base table or a
secondary index. When making a Query, you must include a
partition key in your request. In our example, this can be useful to
find all the roles an actor has played.
items = client.query(
    TableName='MoviesAndActors',
    KeyConditionExpression='#actor = :actor',
    ExpressionAttributeNames={
        '#actor': 'Actor'
    },
    ExpressionAttributeValues={
        ':actor': { 'S': 'Tom Hanks' }
    }
)
Put aside the funky syntax in the key condition expression for
now—we cover that in more detail in the following chapters.
This Query would return two items—Tom Hanks in Cast Away and
Tom Hanks in Toy Story.
Remember that all items with the same partition key are in the
same item collection. Thus, the Query operation is how you
efficiently read items in an item collection. This is why you
carefully structure your item collections to handle your access
patterns. This will be a major theme of the subsequent chapters.
While the partition key is required, you may also choose to specify
conditions on the sort key in a Query operation. In our example,
imagine we want to get all of Tom Hanks' roles in movies where the
title is between A and M in the alphabet. We could use the following
Query action:
items = client.query(
    TableName='MoviesAndActors',
    KeyConditionExpression='#actor = :actor AND #movie BETWEEN :a AND :m',
    ExpressionAttributeNames={
        '#actor': 'Actor',
        '#movie': 'Movie'
    },
    ExpressionAttributeValues={
        ':actor': { 'S': 'Tom Hanks' },
        ':a': { 'S': 'A' },
        ':m': { 'S': 'M' }
    }
)
This would return a single item—Tom Hanks in Cast Away—as it is
the only item that satisfies both the partition key requirement and
the sort key requirement.
As mentioned, you can use the Query API on either the main table
or a secondary index. While we’re here, let’s see how to use the
Query operation on a secondary index.
With our example, we can query movie roles by the actor’s name.
But what if we want to query by a movie? Our current pattern
doesn’t allow this, as the partition key must be included in every
request.
Our secondary index will look like this:
Notice that we have the same four items as in our previous table.
The primary key has changed, but the data is the same.
items = client.query(
    TableName='MoviesAndActors',
    IndexName='MoviesIndex',
    KeyConditionExpression='#movie = :movie',
    ExpressionAttributeNames={
        '#movie': 'Movie'
    },
    ExpressionAttributeValues={
        ':movie': { 'S': 'Toy Story' }
    }
)
You’ll use the Query operation quite heavily with DynamoDB. It’s
an efficient way to return a large number of items in a single
request. Further, the conditions on the sort key can provide
powerful filter capabilities on your table. We’ll learn more about
conditions later in this chapter.
4.3. Scan
The final kind of API action is the Scan. The Scan API is the
bluntest tool in the DynamoDB toolbox. By way of analogy, item-
based actions are like a pair of tweezers, deftly operating on the
exact item you want. The Query call is like a shovel—grabbing a
larger amount of items but still small enough to avoid grabbing
everything.
A Scan will grab everything in a table. If you have a large table, this
will be infeasible in a single request, so it will paginate. Your first
request in a Scan call will read a bunch of data and send it back to
you, along with a pagination key. You’ll need to make another call,
using the pagination key to indicate to DynamoDB where you left
off.
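The pagination loop is explicit in your code. A sketch:

# Keep scanning until DynamoDB stops returning a LastEvaluatedKey.
items = []
kwargs = { 'TableName': 'MoviesAndActors' }
while True:
    response = client.scan(**kwargs)
    items.extend(response['Items'])
    last_key = response.get('LastEvaluatedKey')
    if not last_key:
        break
    kwargs['ExclusiveStartKey'] = last_key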
There are a few special occasions when it’s appropriate to use the
Scan operation, but you should seldom use it during a latency-
sensitive job, such as an HTTP request in your web application.
The times you may consider using the Scan operation are:
In sum, don’t use Scans.
The DynamoDB API may seem limited, but it’s very intentional.
The key point to understand about DynamoDB is that it won’t let
you write a bad query. And by 'bad query', I mean a query that will
degrade in performance as it scales.
This is why all the single-item actions and the Query action
require a partition key. No matter how large your table becomes,
including the partition key makes it a constant time operation to
find the item or item collection that you want.
All the single-item actions also require the sort key (if using a
composite primary key) so that the single-item actions are constant
time for the entire operation. But the Query action is different. The
Query action fetches multiple items. So how does the Query action
stay efficient?
Note that the Query action only allows you to fetch a contiguous
block of items within a particular item collection. You can do
operations like >=, <=, begins_with(), or between, but you can’t do
contains() or ends_with(). This is because an item collection is
ordered and stored as a B-tree. Remember that a B-tree is like a
phone book or a dictionary. If you go to a dictionary, it’s trivial to
find all words between "hippopotamus" and "igloo". It’s much
harder to find all words that end in "-ing".
Finally, just to really put a cap on how slow an operation can be,
DynamoDB limits all Query and Scan operations to 1MB of data in
total. Thus, even if you have an item collection with thousands of
items and you’re trying to fetch the entire thing, you’ll still be
bounded in how slow an individual request can be. If you want to
fetch all those items, you’ll need to make multiple, serial requests to
DynamoDB to page through the data. Because this is explicit—
you’ll need to write code that uses the LastEvaluatedKey
parameter—it is much more apparent to you when you’re writing
an access pattern that won’t scale.
In shorter form, let’s review those steps and the time complexity of
each one again:
4.5. Conclusion
In the second category is the Query API action. The Query action
operates on a single item collection in a DynamoDB table. It allows
you to fetch multiple items in a single request and will be crucial
for handling relationships and advanced access patterns.
Finally, the third category contains the Scan API action. In general,
you should avoid the Scan API as it operates on your entire table.
However, there are some useful patterns that use Scans, and we’ll
see those in upcoming chapters.
Now that we know these broad ideas about the DynamoDB API,
we’ll look at some more practical advice on the DynamoDB API in
the next two chapters.
Chapter 5. Using the
DynamoDB API
Chapter Summary
This chapter looks at the mechanics of using the DynamoDB
API. You will learn some high-level tips for interacting with
DynamoDB in your application code.
Sections
1. Learn how expression names and values work.
2. Don’t use an ORM.
3. Understand the optional properties on individual requests
That said, it is helpful to know a few of the tricky bits and hidden
areas of the DynamoDB API. We’ll focus on those here.
5.1. Learn how expression names and
values work
The first tip around using the DynamoDB API is that you should
spend the time necessary to understand how expression attribute
names and values work. Expressions are necessary to efficiently
query data in DynamoDB. The mechanics around expressions are
covered in the next chapter, but we’ll do some introductory work
on expression names and values here.
items = client.query(
    TableName='MoviesAndActors',
    KeyConditionExpression='#actor = :actor AND #movie BETWEEN :a AND :m',
    ExpressionAttributeNames={
        '#actor': 'Actor',
        '#movie': 'Movie'
    },
    ExpressionAttributeValues={
        ':actor': { 'S': 'Tom Hanks' },
        ':a': { 'S': 'A' },
        ':m': { 'S': 'M' }
    }
)
The first thing to note is that there are two types of placeholders.
Some placeholders start with a #, like #actor and #movie, and other
placeholders start with a ":", like :actor, :a, and :m.
The ones that start with colons are your expression attribute values.
They are used to represent the value of the attribute you are
evaluating in your request. Look at the
ExpressionAttributeValues property in our request. You’ll see
that there are matching keys in the object for the three colon-
prefixed placeholders in our request.
ExpressionAttributeValues={
    ':actor': { 'S': 'Tom Hanks' },
    ':a': { 'S': 'A' },
    ':m': { 'S': 'M' }
}
But why do we need this substitution? Why can’t we just write our
attribute values directly into the expression?
DynamoDB server, the parsing logic is much more complicated. It
will need to parse out all the attribute values by matching opening
and closing brackets. This can be a complicated operation,
particularly if you are writing a nested object with multiple levels.
By splitting this out in a separate property, DynamoDB doesn’t
have to parse this string.
Now let’s look at the placeholders that start with a #. These are your
expression attribute names. These are used to specify the name of the
attribute you are evaluating in your request. Like
ExpressionAttributeValues, you’ll see that we have
corresponding properties in our ExpressionAttributeNames
parameter in the call.
• Bucket
• By
• Count
• Month
• Name
• Timestamp
• Timezone
SQL as well, but they won’t be as performant as writing the SQL
yourself. Your choice in that situation may vary.
First, ODMs push you to model data incorrectly. ORMs make some
sense in a relational world because there’s a single way to model
data. Each object type will get its own table, and relations are
handled via foreign keys. Fetching related data involves following
the foreign key relationships.
This isn’t the case with DynamoDB. All of your object types are
crammed into a single table, and sometimes you have multiple
object types in a single DynamoDB item. Further, fetching an object
and its related objects isn’t straightforward like in SQL—it will
depend heavily on the design of your primary key.
The second reason to avoid ODMs is that they don’t really save you
much time or code compared to the basic AWS SDK. Part of the
benefit of ORMs is that they remove giant SQL strings from your
application code. That’s not happening with DynamoDB.
DynamoDB is API-driven, so you’ll have a native method for each
API action you want to perform. Your ORM will mostly be
replicating the same parameters as the AWS SDK, with no real gain
in ease or readability.
There are two exceptions I’d make to my "No ODM / Use the bare
SDK" stance. The first exception is for tools like the Document
Client provided by AWS in their Node.js SDK. The Document
Client is a thin wrapper on top of the core AWS SDK. The usage is
quite similar to the core SDK—you’ll still be doing the direct API
actions, and you’ll be specifying most of the same parameters. The
difference is in working with attribute values.
With the core AWS SDK, you need to specify an attribute type
whenever you’re reading or writing an item. This results in a lot of
annoying boilerplate, like the following:
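A sketch of that typed format, shown with the Python client used elsewhere in this book (the Node.js core SDK requires the same shape; the item contents are illustrative):

# Every value is wrapped in a type descriptor: 'S' for string, 'N' for number
# (sent as a string), 'BOOL' for boolean, 'SS' for a string set, and so on.
client.put_item(
    TableName='Users',
    Item={
        'Username': { 'S': 'alexdebrie' },
        'Age': { 'N': '34' },
        'IsAdmin': { 'BOOL': True },
        'Roles': { 'SS': ['Author', 'Admin'] }
    }
)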
With the Document Client, attribute types are inferred for you.
This means you can use regular JavaScript types, and they’ll get
converted into the proper attribute type by the Document Client
before sending it to the DynamoDB server.
The second exception to my policy involves libraries like Jeremy
Daly’s DynamoDB Toolbox for Node.js. DynamoDB Toolbox is
explicitly not an ORM. However, it does help you define entity
types in your application and map those to your DynamoDB table.
It’s not going to do all the work to query the table for you, but it
does simplify a lot of the boilerplate around interacting with
DynamoDB.
• ConsistentRead
• ScanIndexForward
• ReturnValues
• ReturnConsumedCapacity
• ReturnItemCollectionMetrics
The first three properties could affect the items you receive back in
your DynamoDB request. The last two can return additional metric
information about your table usage.
Let’s review each of them.
5.3.1. ConsistentRead
In the Advanced Concepts chapter, we discussed the different
consistency modes DynamoDB provides. By default, reads from
DynamoDB are eventually consistent, meaning that the reads will
likely, but not definitely, reflect all write operations that have
happened before the given read operation.
For some use cases, eventual consistency may not be good enough.
You may want to ensure that you’re reading an item that contains
all writes that have occurred to this point. To do that, you would
need to indicate you want a strongly-consistent read.
• GetItem
• BatchGetItem
• Query
• Scan
There are two more aspects to consistent reads that are worth
knowing. First, opting into a strongly-consistent read consumes
more read request units than using an eventually-consistent read.
Each read request unit allows you to read an item up to 4KB in size.
If you use the default of an eventually-consistent read, your read
request units will be cut in half. Thus, reading a single item of up to
4KB in size would only cost you half of a read request unit. Opting
into a strongly-consistent read will consume the full read request
unit.
The second thing to note about consistency is related to secondary
indexes. Note that the ConsistentRead property is available on the
Query and Scan operations, which are the two API actions you can
use with secondary indexes. If you are using a local secondary index,
you may opt into strong consistency by passing
ConsistentRead=True. However, you may not request strong
consistency when using a global secondary index. All reads from a
global secondary index are eventually consistent.
5.3.2. ScanIndexForward
The second API property to know is the ScanIndexForward
property. This property is available only on the Query operation,
and it controls which way you are reading results from the sort key.
A common access pattern for your application may be to find the
most recent 20 readings for a particular sensor. To do this, you
would use a Query where the partition key is the ID for the sensor
you want.
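To read the sort key (a timestamp, in this example) newest-first, you flip ScanIndexForward to False. A sketch with illustrative table, key, and attribute names:

response = client.query(
    TableName='SensorReadings',
    KeyConditionExpression='#id = :id',
    ExpressionAttributeNames={ '#id': 'SensorId' },
    ExpressionAttributeValues={ ':id': { 'S': 'sensor-1234' } },
    ScanIndexForward=False,  # newest readings first
    Limit=20                 # only the 20 most recent
)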
Using ScanIndexForward can help with finding the most recent
timestamps or reading items in reverse alphabetical order. It
can also help with "pre-joining" your data in one-to-many
relationships. For more on this, check out the one-to-many
relationship strategies in Chapter 11.
5.3.3. ReturnValues
When working with items in DynamoDB, you may execute an
operation against an item without knowing the full current state of
the item. A few examples of this are:
information about the existing or updated item from DynamoDB
after the call finishes. For example, if you were incrementing a
counter during an update, you may want to know the value of the
counter after the update is complete. Or if you deleted an item, you
may want to view the item as it looked before deletion.
• PutItem
• UpdateItem
• DeleteItem
• TransactWriteItem (here, the property is referred to as
ReturnValuesOnConditionCheckFailure)
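For the counter example above, here is a sketch of asking for the post-update value back; the table and attribute names are illustrative:

response = client.update_item(
    TableName='Counters',
    Key={ 'CounterName': { 'S': 'PageViews' } },
    # Initialize the counter to zero if it doesn't exist yet, then increment it.
    UpdateExpression='SET #count = if_not_exists(#count, :zero) + :incr',
    ExpressionAttributeNames={ '#count': 'Count' },
    ExpressionAttributeValues={
        ':zero': { 'N': '0' },
        ':incr': { 'N': '1' }
    },
    ReturnValues='UPDATED_NEW'
)
new_count = response['Attributes']['Count']['N']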
For an example of the ReturnValues attribute in action, check out
the auto-incrementing integers strategy in Chapter 16 and the
implementation in the GitHub example in Chapter 21.
5.3.4. ReturnConsumedCapacity
The fourth property we’ll review is different from the first three, as
it won’t change the item(s) that will be returned from your request.
Rather, it is an option to include additional metrics with your
request.
There are a few levels of detail you can get with the consumed
capacity. You can specify ReturnConsumedCapacity=INDEXES, which
will include detailed information on not only the consumed
capacity used for your base table but also for any secondary indexes
that were involved in the operation. If you don’t need that level of
precision, you can specify ReturnConsumedCapacity=TOTAL to
simply receive an overall summary of the capacity consumed in the
operation.
The returned data for your consumed capacity will look similar to
the following:
'ConsumedCapacity': {
    'TableName': 'GitHub',
    'CapacityUnits': 123.0,
    'ReadCapacityUnits': 123.0,
    'WriteCapacityUnits': 123.0,
    'Table': {
        'ReadCapacityUnits': 123.0,
        'WriteCapacityUnits': 123.0,
        'CapacityUnits': 123.0
    },
    'GlobalSecondaryIndexes': {
        'GSI1': {
            'ReadCapacityUnits': 123.0,
            'WriteCapacityUnits': 123.0,
            'CapacityUnits': 123.0
        }
    }
}
First, you may use these metrics if you are in the early stages of
designing your table. Perhaps you feel pretty good about your table
design. To confirm it will work as intended, you can build out a
prototype and throw some sample traffic to see how it will perform.
During your testing, you can use the consumed capacity metrics to
get a sense of which access patterns use a lot of capacity or to
understand how much capacity you’ll need if your traffic goes to 10
or 100 times your testing.
5.3.5. ReturnItemCollectionMetrics
The final API property we’ll cover is
ReturnItemCollectionMetrics. Like the ReturnConsumedCapacity
property, this will not change the item(s) returned from your
request. It will only include additional metadata about your table.
If you have a local secondary index on a table, all the items in the
local secondary index are included as part of the same item
collection as the base table. Thus, in the example below, I have a
local secondary index on my Actors and Movies table where the key
schema has a partition key of Actor and a sort key of Year.
Recall that local secondary indexes must use the same partition key
as the base table. When calculating the item collection size for Tom
Hanks, I would include both the size of the items in the base table
and the size of the items in the local secondary index.
Here’s the critical part: If your table has a local secondary index,
then a single item collection cannot be larger than 10GB in size. If
you try to write an item that would exceed this limit, the write will
be rejected.
'ItemCollectionMetrics': {
    'ItemCollectionKey': {
        'S': 'USER#alexdebrie'
    },
    'SizeEstimateRangeGB': [
        123.0,
    ]
}
5.4. Summary
Chapter 6. Expressions
Chapter Summary
This chapter covers DynamoDB expressions that are used in
the DynamoDB API. You will learn about the five kinds of
expressions and when they are helpful.
Sections
1. Key Condition Expressions
2. Filter Expressions
3. Projection Expressions
4. Condition Expressions
5. Update Expressions
As you’re working with the DynamoDB API, a lot of your time will
be spent writing various kinds of expressions. Expressions are
statements that operate on your items. They’re sort of like mini-
SQL statements. In this chapter, we’ll do a deep dive into
expressions in DynamoDB.
• Projection Expressions: Used in all read operations to describe
which attributes you want to return on items that were read
• Condition Expressions: Used in write operations to assert the
existing condition (or non-condition) of an item before writing to
it
• Update Expressions: Used in the UpdateItem call to describe the
desired updates to an existing item
The key condition expression is the expression you’ll use the most.
It is used on every Query operation to describe which items you
want to fetch.
Remember that the Query API call must include a partition key in
the request. It may also include conditions on the sort key in the
request. The key condition expression is how you express these
parameters in your request. A key condition can be used only on
elements of the primary key, not on other attributes on the items.
Let’s start with a simple Query request using just the partition key.
Returning to our popular Movies and Actors example in a previous
chapter, the Query request below shows the
KeyConditionExpression, the ExpressionAttributeNames, and the
ExpressionAttributeValues for fetching all movie roles for Natalie
Portman:
result = dynamodb.query(
    TableName='MovieRoles',
    KeyConditionExpression="#a = :a",
    ExpressionAttributeNames={
        "#a": "Actor"
    },
    ExpressionAttributeValues={
        ":a": { "S": "Natalie Portman" }
    }
)
You can also use conditions on the sort key in your key condition
expression. This is useful for finding a specific subset of your data.
All elements with a given partition key are sorted according to the
sort key (hence the name). If your sort key is a number, it will be
ordered numerically. If your sort key is a string, it will be sorted according to its UTF-8 bytes, which is essentially alphabetical order, with the caveats that all uppercase letters sort before lowercase letters and that characters outside the alphabet have their own positions in the ordering. For more on sorting in DynamoDB, see Chapter 13.
You can use simple comparisons in your sort key conditions, such
as greater than (>), less than (<), or equal to (=). In our example, we
could use it to filter our query results by the name of the movie
title:
result = dynamodb.query(
    TableName='MovieRoles',
    KeyConditionExpression="#a = :a AND #m > :title",
    ExpressionAttributeNames={
        "#a": "Actor",
        "#m": "Movie"
    },
    ExpressionAttributeValues={
        ":a": { "S": "Natalie Portman" },
        ":title": { "S": "N" }
    }
)
In this example, we are specifying that the Movie value must come after the letter "N".
We have five orders here. The middle three are from the same
CustomerId: 36ab55a589e4. Imagine that customer wants to view all
orders between January 10 and January 20. We could handle that
with the following Query:
result = dynamodb.query(
    TableName='CustomerOrders',
    KeyConditionExpression="#c = :c AND #ot BETWEEN :start AND :end",
    ExpressionAttributeNames={
        "#c": "CustomerId",
        "#ot": "OrderTime"
    },
    ExpressionAttributeValues={
        ":c": { "S": "36ab55a589e4" },
        ":start": { "S": "2020-01-10T00:00:00.000000" },
        ":end": { "S": "2020-01-20T00:00:00.000000" }
    }
)
While you can use greater than, less than, equal to, or begins_with,
one little secret is that every condition on the sort key can be
expressed with the BETWEEN operator. Because of that, I almost
always use it in my expressions.
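For instance, a begins_with condition can usually be rewritten as a BETWEEN on the same prefix. The sketch below assumes a hypothetical table whose sort key attribute is a generic SK with values that start with ORDER#; the upper bound is simply the prefix followed by a high code point.

import boto3

dynamodb = boto3.client('dynamodb')

# Roughly equivalent to begins_with(#sk, 'ORDER#')
result = dynamodb.query(
    TableName='CustomerOrders',
    KeyConditionExpression="#c = :c AND #sk BETWEEN :start AND :end",
    ExpressionAttributeNames={
        "#c": "CustomerId",
        "#sk": "SK"
    },
    ExpressionAttributeValues={
        ":c": { "S": "36ab55a589e4" },
        ":start": { "S": "ORDER#" },
        ":end": { "S": "ORDER#\uffff" }
    }
)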
Let’s see an example with our Movies & Actors table. Recall that our
base table uses Actor as the partition key and Movie as the sort key.
However, that means if I want to filter on something like Year or
Genre, I can’t do it with my key condition expression. This is where
the filter expression helps me out. If I wanted to find all of Tom
Hanks’ movies where the Genre was "Drama", I could write the
following Query:
result = dynamodb.query(
    TableName='MovieRoles',
    KeyConditionExpression="#actor = :actor",
    FilterExpression="#genre = :genre",
    ExpressionAttributeNames={
        "#actor": "Actor",
        "#genre": "Genre"
    },
    ExpressionAttributeValues={
        ":actor": { "S": "Tom Hanks" },
        ":genre": { "S": "Drama" }
    }
)
The filter expression can also be used in the Scan operation. In our
table example, we could use the same Genre=Drama filter expression
to find the two Dramas in our table—Tom Hanks in Cast Away and
Natalie Portman in Black Swan.
With a tiny table that includes five small items, it seems like the
filter expression is a way to enable any variety of access patterns on
your table. After all, we can filter on any property in the table, just
like SQL!
This is not the case in reality. To understand this, you need to know
how DynamoDB’s read limits interact with the expression
evaluation ordering.
A Query or Scan proceeds in three steps: (1) items are read from the table, up to a maximum of 1MB of data; (2) if a filter expression is present, items that don’t match it are removed; (3) the remaining items are returned to the client. The key point to understand is that the 1MB limit is applied in step 1, before the filter expression is applied.
Imagine our Movies & Actors table was 1 GB in size but that the
combined size of the Drama movies was only 100KB. You might
expect the Scan operation to return all the Drama movies in one
request since it is under the 1MB limit. However, since the filter
expression is not applied until after the items are read, your client will need to make roughly 1,000 paginated requests to scan the full table. Many of these requests will return empty results, as all of the non-matching items have been filtered out.
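To make that paging behavior concrete, here is a rough sketch of the client-side loop; it keeps issuing Scan requests with the Genre filter until DynamoDB stops returning a LastEvaluatedKey.

import boto3

dynamodb = boto3.client('dynamodb')

items = []
kwargs = {
    'TableName': 'MovieRoles',
    'FilterExpression': '#genre = :genre',
    'ExpressionAttributeNames': { '#genre': 'Genre' },
    'ExpressionAttributeValues': { ':genre': { 'S': 'Drama' } }
}

while True:
    # Each Scan reads up to 1MB of data before the filter is applied.
    resp = dynamodb.scan(**kwargs)
    items.extend(resp['Items'])
    if 'LastEvaluatedKey' not in resp:
        break
    kwargs['ExclusiveStartKey'] = resp['LastEvaluatedKey']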
A filter expression isn’t a silver bullet that will save you from
modeling your data properly. If you want to filter your data, you
need to make sure your access patterns are built directly into your
primary keys. Filter expressions can save you a bit of data sent over the wire, but they won’t help you find data more quickly.
2. Easier application filtering logic. If you fetch all the items and filter some away, it can be easier to handle than in your API request to DynamoDB. This is more of a personal preference.
3. Better validation around time-to-live (TTL) expiry. When using
DynamoDB TTL, AWS states that items are generally deleted
within 48 hours of their TTL expiry. This is a wide range! If
you’re counting on expiration as a part of your business logic,
you could get incorrect results. To help guard against this, you could write a filter expression that removes all items that should have expired by now, even if DynamoDB hasn’t actually deleted them yet.
To see an example of this in action, check out Chapter 17.
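As a minimal sketch of that guard, assuming a hypothetical table whose items carry an epoch-seconds TTL attribute, you could filter out anything that should already be gone:

import time

import boto3

dynamodb = boto3.client('dynamodb')

# Exclude items whose TTL has passed, even if DynamoDB hasn't
# physically deleted them yet.
result = dynamodb.query(
    TableName='SessionStore',
    KeyConditionExpression="#pk = :pk",
    FilterExpression="#ttl > :now",
    ExpressionAttributeNames={
        "#pk": "PK",
        "#ttl": "TTL"
    },
    ExpressionAttributeValues={
        ":pk": { "S": "SESSION#alexdebrie" },
        ":now": { "N": str(int(time.time())) }
    }
)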
Let’s alter our Movies & Actors example to imagine that each item
included a CoverImage attribute that represented the cover artwork
for the movie as a giant blob of binary data. Now our table looks as
follows:
For many of our requests, the large CoverImage attribute may not
be necessary. Sending nearly 350KB of image data per item is a
waste of bandwidth and memory in my application.
We can use a projection expression to retrieve only the attributes we need:
result = dynamodb.query(
    TableName='MovieRoles',
    KeyConditionExpression="#actor = :actor",
    ProjectionExpression="#actor, #movie, #role, #year, #genre",
    ExpressionAttributeNames={
        "#actor": "Actor",
        "#movie": "Movie",
        "#role": "Role",
        "#year": "Year",
        "#genre": "Genre"
    },
    ExpressionAttributeValues={
        ":actor": { "S": "Tom Hanks" }
    }
)
Note that some attribute names, like Year, conflict with DynamoDB reserved words. Rather than keeping track of which words are reserved for which attribute, I prefer to just use expression attribute names for all attributes.
The first three expressions we’ve discussed are for read operations.
The next two expressions are for write operations. We’ll review
condition expressions first.
Condition expressions are useful in a number of situations, including:
• To avoid overwriting an existing item when using PutItem;
• To prevent an UpdateItem operation from putting an item in a
bad state, such as reducing an account balance below 0;
• To assert that a given user is the owner of an item when calling
DeleteItem.
Condition expressions can operate on any attribute on your item,
not just those in the primary key. This is because condition
expressions are used with item-based actions where the item in
question has already been identified by passing the key in a
different parameter.
Let’s walk through a few quick examples where these can be useful.
import datetime

result = dynamodb.put_item(
    TableName='Users',
    Item={
        "Username": { "S": "bountyhunter1" },
        "Name": { "S": "Boba Fett" },
        "CreatedAt": { "S": datetime.datetime.now().isoformat() }
    },
    ConditionExpression="attribute_not_exists(#username)",
    ExpressionAttributeNames={
        "#username": "Username"
    }
)
In the PutItem call above, we include a ConditionExpression parameter that asserts that there is no existing item with the same username.
result = dynamodb.update_item(
    TableName='WorkQueue',
    Key={
        "PK": { "S": "Tracker" }
    },
    ConditionExpression="size(#inprogress) <= 10",
    UpdateExpression="ADD #inprogress :id",
    ExpressionAttributeNames={
        "#inprogress": "InProgress"
    },
    ExpressionAttributeValues={
        ":id": { "SS": [ <jobId> ] }
    }
)
When we have a job that we want to move into the guarded stage, we run
an UpdateItem call to attempt to add the job ID into the InProgress
set. In our UpdateItem call, we specify a condition expression that
the current size of the InProgress set is less than or equal to 10. If
it’s more than 10, we have the maximum number of jobs in that
stage and need to wait until a spot is open.
Each organization has an item in the table that describes the current
subscription plan they’re on (SubscriptionType), as well as a set of
usernames that have the authority to change the subscription plan
or the payment details (Admins). Before changing the billing details,
you need to confirm that the user making the request is an admin.
result = dynamodb.update_item(
    TableName='BillingDetails',
    Key={
        "PK": { "S": 'Amazon' }
    },
    ConditionExpression="contains(#a, :user)",
    UpdateExpression="SET #st = :type",
    ExpressionAttributeNames={
        "#a": "Admins",
        "#st": "SubscriptionType"
    },
    ExpressionAttributeValues={
        ":user": { "S": 'Jeff Bezos' },
        ":type": { "S": 'Pro' }
    }
)
DynamoDB transactions can help us here. The TransactWriteItems API allows you to operate on up to 10 items in a single request. They can be
a combination of different write operations—PutItem, UpdateItem,
or DeleteItem—or they can be ConditionChecks, which simply
assert a condition about a particular item.
Let’s see how a ConditionCheck can help us with our problem. Let’s
assume a similar setup as the previous example. In this example, a
user wants to delete their subscription altogether. Further, the list of
admins is kept on a different item because the admins are used to
verify a number of different requests.
result = dynamodb.transact_write_items(
    TransactItems=[
        {
            "ConditionCheck": {
                "Key": {
                    "PK": { "S": "Admins#<orgId>" }
                },
                "TableName": "SaasApp",
                "ConditionExpression": "contains(#a, :user)",
                "ExpressionAttributeNames": {
                    "#a": "Admins"
                },
                "ExpressionAttributeValues": {
                    ":user": { "S": <username> }
                }
            }
        },
        {
            "Delete": {
                "Key": {
                    "PK": { "S": "Billing#<orgId>" }
                },
                "TableName": "SaasApp"
            }
        }
    ]
)
There are two operations in this transaction. First, there is a ConditionCheck on the Admins item to assert that the requesting user is an administrator. Second, there is a Delete operation to remove the billing record for this organization. These operations will succeed or fail together. If
the condition check fails because the user is not an administrator,
the billing record will not be deleted.
Note that when using the UpdateItem API, you will only alter the
properties you specify. If the item already exists in the table, the
attributes that you don’t specify will remain the same as before the
update operation. If you don’t want this behavior, you should use
the PutItem API, which will completely overwrite the item with
only the properties you give it.
You may use any combination of the four update expression verbs (SET, REMOVE, ADD, and DELETE) in a single update statement, and you may use multiple operations for a single verb.
If you have multiple operations for a single verb, you only state the
verb once and use commas to separate the clauses, as shown below:
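A minimal sketch of what that looks like, assuming a Users table with Name and UpdatedAt attributes:

import datetime

import boto3

dynamodb = boto3.client('dynamodb')

result = dynamodb.update_item(
    TableName='Users',
    Key={
        "Username": { "S": "python_fan" }
    },
    # One SET verb, two clauses separated by a comma.
    UpdateExpression="SET #name = :name, #updatedAt = :updatedAt",
    ExpressionAttributeNames={
        "#name": "Name",
        "#updatedAt": "UpdatedAt"
    },
    ExpressionAttributeValues={
        ":name": { "S": "Alex DeBrie" },
        ":updatedAt": { "S": datetime.datetime.now().isoformat() }
    }
)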
In the operation above, we’re setting both the Name and UpdatedAt
attributes to new values.
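You can also combine different verbs in one expression. Reusing the client from the sketch above and assuming the same item has an InProgress attribute we no longer need:

result = dynamodb.update_item(
    TableName='Users',
    Key={
        "Username": { "S": "python_fan" }
    },
    # SET and REMOVE clauses can appear in the same expression.
    UpdateExpression="SET #name = :name, #updatedAt = :updatedAt REMOVE #inprogress",
    ExpressionAttributeNames={
        "#name": "Name",
        "#updatedAt": "UpdatedAt",
        "#inprogress": "InProgress"
    },
    ExpressionAttributeValues={
        ":name": { "S": "Alex DeBrie" },
        ":updatedAt": { "S": datetime.datetime.now().isoformat() }
    }
)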
In the example above, we’re using our same SET clause to update
the Name and UpdatedAt properties, and we’re using a REMOVE
clause to delete the InProgress attribute from our item.
result = dynamodb.update_item(
    TableName='Users',
    Key={
        "Username": { "S": "python_fan" }
    },
    UpdateExpression="SET #picture = :url",
    ExpressionAttributeNames={
        "#picture": "ProfilePictureUrl"
    },
    ExpressionAttributeValues={
        ":url": { "S": <https://....> }
    }
)
result = dynamodb.update_item(
    TableName='Users',
    Key={
        "Username": { "S": "python_fan" }
    },
    UpdateExpression="REMOVE #picture",
    ExpressionAttributeNames={
        "#picture": "ProfilePictureUrl"
    }
)
This removes the ProfilePictureUrl attribute entirely, which handles the case where a user no longer has a profile picture and we want to delete it from their item.
result = dynamodb.update_item(
    TableName='PageViews',
    Key={
        "Page": { "S": "ContactUsPage" }
    },
    UpdateExpression="SET #views = #views + :inc",
    ExpressionAttributeNames={
        "#views": "PageViews"
    },
    ExpressionAttributeValues={
        ":inc": { "N": "1" }
    }
)
This increments the PageViews attribute in a single request, without first reading the item, which avoids race conditions if the page is viewed multiple times in the same period.
Let’s see how that works in practice using the example below.
result = dynamodb.update_item(
    TableName='Users',
    Key={
        "Username": { "S": "python_fan" }
    },
    UpdateExpression="SET #phone.#mobile = :cell",
    ExpressionAttributeNames={
        "#phone": "PhoneNumbers",
        "#mobile": "MobileNumber"
    },
    ExpressionAttributeValues={
        ":cell": { "S": "+1-555-555-5555" }
    }
)
Using the SET verb, you can operate directly on a nested property
in your map attribute. In our example, we’re setting the PhoneNumbers.MobileNumber property in our update. Like the
increment example, this saves us from making multiple requests
first to read the existing attribute and then update the full attribute.
6.5.5. Adding and removing from a set
The final example I want to cover is manipulating items in a set
type attribute. We saw an example in the Condition Expression
section, where we were checking that a user was an administrator
before allowing a particular update expression. Now, let’s see how
we can add and remove administrators from that set attribute.
First, you can add a new user to the administration set using the
ADD verb:
result = dynamodb.update_item(
    TableName="SaasApp",
    Key={
        "PK": { "S": "Admins#<orgId>" }
    },
    UpdateExpression="ADD #a :user",
    ExpressionAttributeNames={
        "#a": "Admins"
    },
    ExpressionAttributeValues={
        ":user": { "SS": ["an_admin_user"] }
    }
)
Similarly, you can remove elements from the set with the DELETE verb:
result = dynamodb.update_item(
    TableName="SaasApp",
    Key={
        "PK": { "S": "Admins#<orgId>" }
    },
    UpdateExpression="DELETE #a :user",
    ExpressionAttributeNames={
        "#a": "Admins"
    },
    ExpressionAttributeValues={
        ":user": { "SS": ["an_admin_user"] }
    }
)
This is the exact same shape as our ADD operation, but with the verb switched to DELETE, which removes the given elements from the set.
Note that you can add or remove multiple elements of a set in a single request. Simply update your expression attribute value to contain multiple items, as shown below:
result = dynamodb.update_item(
    TableName="SaasApp",
    Key={
        "PK": { "S": "Admins#<orgId>" }
    },
    UpdateExpression="ADD #a :user",
    ExpressionAttributeNames={
        "#a": "Admins"
    },
    ExpressionAttributeValues={
        ":user": { "SS": ["an_admin_user", "another_user"] }
    }
)
6.6. Summary
In this chapter, we covered the five types of expressions in DynamoDB. The first three, key condition expressions, filter expressions, and projection expressions, are used for read-based operations, while the last two, condition expressions and update expressions, are used for write-based operations. Understanding the mechanics of these operations, particularly the key condition expression and the write-based expressions, is crucial for effective data modeling with DynamoDB.
Chapter 7. How to approach
data modeling in DynamoDB
Chapter Summary
This chapter details the process of building a data model with
DynamoDB. First, it includes some key differences between
data modeling with a relational database and data modeling
with DynamoDB. Then, it describes the steps you should follow
to create an efficient data model in DynamoDB.
Sections
1. Differences with relational databases
2. Steps for modeling with DynamoDB
In this chapter and the next few chapters, we will drill in on data
modeling with DynamoDB. This chapter will look at how data
modeling with a NoSQL database is different than a relational
database, as well as the concrete steps you should follow when
modeling with DynamoDB.
7.1. Differences with relational databases
It’s a bad idea to model your data in DynamoDB the same way you
model your data in a relational database. The entire point of using a
NoSQL datastore is to get some benefit you couldn’t get with a
relational database. If you model the data in the same way, you not
only won’t get that benefit but you will also end up with a solution
that’s worse than using the relational database!
Below are a few key areas where DynamoDB differs from relational
databases.
7.1.1. Joins
In a relational database, you use the JOIN operator to connect data
from different tables in your query. It’s a powerful way to
reassemble your data and provide flexibility in your access patterns.
You won’t find information about joins in the DynamoDB
documentation because there are no joins in DynamoDB. Joins are
inefficient at scale, and DynamoDB is built for scale. Rather than
reassembling your data at read time with a join, you should
preassemble your data in the exact shape that is needed for a read
operation.
In the next chapter, we’ll take a look at how you can get join-like
functionality with DynamoDB.
7.1.2. Normalization
Data normalization. First, second, and third normal form. You
probably learned these concepts as you got started with relational
databases. For those unfamiliar with the technical jargon,
'normalization' is basically the database version of the popular code
mantra of "Don’t Repeat Yourself" (or, "DRY"). If you have data that
is duplicated across records in a table, you should split the record
out into a separate table and refer to that record from your original
table.
Basics of Normalization
Normalization comes in degrees, from first normal form (1NF) and second normal form (2NF) all the way to sixth normal form (6NF), with a few unnumbered forms like elementary key normal form (EKNF) in between.
The first three normal forms—1NF, 2NF, and 3NF—are the most
commonly used ones in applications, so that’s all we’ll cover here.
Imagine you have an online clothing store. You might have a single,
denormalized table that stores all of your items. In its denormalized
form, it looks as follows:
We no longer have a Categories column in our Items table. Rather,
we’ve added two tables. First, we split Categories out into a separate
table with two columns: CategoryId and Category. Then, we have a
linking table for ItemsCategories, which maps each Item to its
respective categories.
With the data split across three tables, we would need to use joins to
pull it back together. The join to retrieve an item with its categories
might look as follows:
SELECT *
FROM items
JOIN items_categories ON items.item = items_categories.item
  AND items.size = items_categories.size
JOIN categories ON items_categories.category_id = categories.category_id
WHERE items.item = 'Nebraska hat'
This would retrieve the item and all item categories for the
Nebraska hat.
We have removed Size and Price columns from the Items table
and moved them to an ItemsPrices table. Now our Items table’s key
is Item, and all non-key attributes in the Items table depend on the
Item. The ItemsPrices table contains the various entries for sizes
and prices.
Accordingly, let’s split them out into a separate table to achieve
third normal form:
Let’s put it all together to see our full data model in third normal
form:
Rather than a single Items table, we now have five different tables.
There is heavy use of IDs to link different tables together, and
there’s no duplication of data across rows or columns.
Now that we understand what normalization is, let’s see why it’s
useful.
Benefits of Normalization

Why denormalize with DynamoDB
In a relational database, each entity type is put into a different table.
Customers will be in one table, CustomerOrders will be in another
table, and InventoryItems will be in a third table. Each table has a
defined set of columns that are present on every record in the
table—Customers have names, dates of birth, and an email address,
while CustomerOrders have an order date, delivery date, and total
cost.
You put multiple entity types in a single table. Rather than having a
Customers table and a CustomerOrders table, your application will
have a single table that includes both Customers and
CustomerOrders (and likely other entities as well). Then, you
design your primary key such that you can fetch both a Customer
and the matching CustomerOrders in a single request.
There are a few implications of this approach that will feel odd
coming from a relational background.
Second, the attributes on your records will vary. You can’t count on
each item in a table having the same attributes, as you can in a
relational database. The attributes on a Customer are significantly
different than the attributes on a CustomerOrder. This isn’t much
of an issue for your application code, as the low-level details should
be abstracted away in your data access logic. However, it can add
some complexity when browsing your data in the console or when
exporting your table to an external system for analytical processing.
We’ll take a deeper look at the what and why behind single-table
design in the next chapter.
7.1.4. Filtering
Data access is one large filtering problem. You rarely want to
retrieve all data records at a time. Rather, you want the specific
record or records that match a set of conditions.
This filtering approach requires more work upfront but will provide long-term benefits as it scales. This is how
DynamoDB is able to provide consistently fast performance as you
scale. The sub-10 millisecond response times you get when you
have 1 gigabyte of data is the same response time you get as you
scale to a terabyte of data and beyond. The same cannot be said
about relational databases.
In this section, you will learn the steps for modeling with DynamoDB. At a high level, these steps are:
1. Create an entity-relationship diagram (ERD)
2. Define your access patterns
3. Model your primary key structure
4. Handle additional access patterns with secondary indexes
This process can take some time to learn, particularly for those that
are used to modeling data in relational, SQL-based systems. Even
for experienced DynamoDB users, modeling your data can take a
few iterations as you work to create a lean, efficient table.
7.2.1. Create an entity-relationship diagram (ERD)
The first step in your data modeling process is to create an entity-
relationship diagram, or ERD. If you got a CS degree in college,
ERDs may be old hat to you. If you didn’t get a CS degree, don’t
worry—neither did I! ERDs are learnable and will make it easier for
you to think about your data.
The ERD for your Notes application would look like this:
First, note that there are two rectangular boxes, titled "Users" and
"Notes". These are the entities in your application. Usually, these are
the nouns you use when talking about your application—Users,
Notes, Orders, Organizations, etc. Note that each entity lists a
number of attributes for that entity. Our User entity has a
username, email address, and date created, while the Note entity
has a note Id, an owner, a title, a date created, and a body.
Second, notice that there is a diamond with some lines connecting
the User and Note entities. This indicates a relationship between the
two entities. The line shown indicates a one-to-many relationship,
as one User can own many notes.
In a relational database, relationships are configured via foreign keys, and you design your data in a way that can accommodate flexible query patterns in the future.
This is not the case when data modeling in DynamoDB. You design
your data to handle the specific access patterns you have, rather
than designing for flexibility in the future.
Thus, your next step after creating an ERD is to define your data
access patterns. All of them. Be specific and thorough. Failure to do
this correctly may lead to problems down the line as you find your
DynamoDB table isn’t as flexible to new patterns as your relational
database was.
There are two different strategies you can use for building these
access patterns. One is the API-centric approach, which is common
if you’re planning to implement a REST API. With the API-centric
approach, you list out each of the API endpoints you want to
support in your application, as well as the expected shape you
would like to return in your response.
The second approach for building your access patterns is the UI-
centric approach. This is better if you’re doing more server-side
rendering or if you’re using a 'backends-for-frontends' approach to
an API. With the UI-centric approach, you look at each of the
screens in your application and the URLs that will match those
screens. As you do so, identify the different bits of information you
need to assemble to build out the screen. Jot those down, and those
become your access patterns.
Entity Access Pattern Index Parameters Notes
Create Session
Get Session
In the left-hand side of the chart, describe the access pattern you
have. If you have unique needs around an access pattern, be sure to
list those in the Notes column.
As you design your data model, you will fill in the right-hand side
of your chart. This column describes the DynamoDB API call you
will use and any details about the call—the table or index you use,
the parameters used in your API call, and any notes.
I cannot express strongly enough how important this step is. You
can handle almost any data model with DynamoDB provided that
you design for your access patterns up front. The biggest problem I
see users face is failing to account for their patterns up front, then
finding themselves stuck once their data model has solidified.
More often than not, these users blame DynamoDB for the
problem, when the problem was with how the person used the tool.
If you tried to use a screwdriver to rake leaves, would you say the
screwdriver is a useless tool, or would you say you used it for the
wrong job?
The primary key is the foundation of your table, so you should start
there. I model my primary key using the following four steps.
First, I create an 'entity chart' that is used to track the different types
of items I’ll have in my table. The entity chart tracks the type of
item, the primary key structure for each item, and any additional
notes and properties.
Entity PK SK
Repo
Issue
Pull Request
Fork
Comment
Reaction
User
Organization
Payment Plan
Table 5. GitHub model entity chart
As you build out your data model, the rows in your entity chart
may change. You may remove entities that appear in your ERD
because they aren’t tracked as separate items in DynamoDB. For
example, you may use a list or map attribute type to represent
related objects on a particular item. This denormalized approach
means there’s not a separate item for each of the related entities.
You can see more about this strategy in Chapter 11 on one-to-many
relationship strategies.
You may also need to add rows to your entity chart. A common example is a many-to-many relationship where you need to add an item type to represent the relationship between entities. You could also add an item type
solely to ensure uniqueness on a particular attribute for an item
type. You can see the many-to-many example in the GitHub
example in Chapter 21, and you can see the uniqueness pattern in
the e-commerce example in Chapter 19.
The last step is to start designing the primary key format for each
entity type. Make sure you satisfy the uniqueness requirements
first. If you have some additional flexibility in your key design after
handling uniqueness, try to solve some of the "fetch many" access
patterns you have.
Remember that your client must know the primary key at read
time or otherwise make costly additional queries to figure out the
primary key.
A common anti-pattern I see people use is to add a CreatedAt
timestamp into their primary key. This will help to ensure the
primary key for your item is unique, but will that timestamp be
available when you need to retrieve or update that item? If not, use something that will be available, or figure out how you will make this information available to the client.
As you decide on the key format for each entity, record it in your entity chart:
Entity PK SK
Customer CUSTOMER#<CustomerId> METADATA#<CustomerId>
This will help you maintain clarity as you build out your access
patterns chart and will serve as a nice artifact for development and
post-development.
When you’re just starting out with data modeling in DynamoDB, it
can be overwhelming to think about where to start. Resist the urge
to give up. Dive in somewhere and start modeling. It will take a few
iterations, even for experienced DynamoDB users.
As you gain experience, you’ll get a feel for which areas in your
ERD are going to be the trickiest to model and, thus, the places you
should think about first. Once you have those modeled, the rest of
the data model often falls into place.
New users often want to add a secondary index for each read
pattern. This is overkill and will cost more. Instead, you can
overload your secondary indexes just like you overload your
primary key. Use generic attribute names like GSI1PK and GSI1SK
for your secondary indexes and handle multiple access patterns
within a single secondary index.
7.3. Conclusion
Chapter 8. The What, Why, and
When of Single-Table Design in
DynamoDB
Chapter Summary
When modeling with DynamoDB, use as few tables as possible.
Ideally, you can handle an entire application with a single table.
This chapter discusses the reasons behind single-table design.
Sections
1. What is single-table design
2. Why is single-table design needed
3. The downsides of single-table design
4. Two times where the downsides of single-table design
outweigh the benefits
In this chapter, we’ll cover the following topics:
• What is single-table design
• Why is single-table design needed
• The downsides of single-table design
• Two times where the downsides of single-table design outweigh
the benefits
At the end of this section, we’ll also do a quick look at some other,
smaller benefits of single-table design.
Each Order belongs to a certain Customer, and you use foreign
keys to refer from a record in one table to a record in another.
These foreign keys act as pointers—if I need more information
about a Customer that placed a particular Order, I can follow the
foreign key reference to retrieve items about the Customer.
To follow these pointers, the SQL language for querying relational
databases has a concept of joins. Joins allow you to combine records
from two or more tables at read-time.
But as an application developer, you still need some of the benefits
of relational joins. And one of the big benefits of joins is the ability
to get multiple, heterogeneous items from your database in a single
request.
8.1.3. The solution: pre-join your data into item
collections
So how do you get fast, consistent performance from DynamoDB
without making multiple requests to your database? By pre-joining
your data using item collections.
You can see there are two items for Tom Hanks—Cast Away and
Toy Story. Because they have the same partition key of Tom Hanks,
they are in the same item collection.
If you were modeling Users and Orders in DynamoDB, you would make sure all Order records live in the same item collection as
User record to which they belong.
First, there is some operational overhead with each table you have
in DynamoDB. Even though DynamoDB is fully-managed and
pretty hands-off compared to a relational database, you still need to
configure alarms, monitor metrics, etc. If you have one table with
all items in it rather than eight separate tables, you reduce the
number of alarms and metrics to watch.
While these two benefits are real, they’re pretty marginal. The
operations burden on DynamoDB is quite low, and the pricing will
only save you a bit of money on the margins. Further, if you are
using DynamoDB On-Demand pricing, you won’t save any money
by going to a multi-table design.
When modeling a single-table design in DynamoDB, you start with
your access patterns first. Think hard (and write down!) how you
will access your data, then carefully model your table to satisfy
those access patterns. When doing this, you will organize your
items into collections such that each access pattern can be handled
with as few requests as possible—ideally a single request.
Once you have your table modeled out, then you put it into action
and write the code to implement it. Done properly, this will work
great. Your application will be able to scale infinitely with no
degradation in performance.
DynamoDB is built for online transaction processing (OLTP) workloads, and AWS wants you to use other, purpose-built databases for OLAP. To
do this, you’ll need to get your data from DynamoDB into another
system.
If you have a single table design, getting it into the proper format
for an analytics system can be tricky. You’ve denormalized your
data and twisted it into a pretzel that’s designed to handle your
exact use cases. Now you need to unwind that table and re-
normalize it so that it’s useful for analytics.
— Forrest Brazeal
There are two situations where the downsides of single-table design can outweigh the benefits:
• in new applications where developer agility is more important
than application performance;
• in applications using GraphQL.
We’ll explore each of these below. But first I want to emphasize that
these are exceptions, not general guidance. When modeling with
DynamoDB, you should be following best practices. This includes
denormalization, single-table design, and other proper NoSQL
modeling principles. And even if you opt into a multi-table design,
you should understand single-table design to know why it’s not a
good fit for your specific application.
A new application may not absolutely require the scaling capabilities of DynamoDB to start,
and you may not know how your application will evolve over time.
Notice how there are two separate requests to DynamoDB. First,
there’s a request to fetch the User, then there’s a follow up request
to fetch the Orders for the given User. Because multiple requests
must be made and these requests must be made serially, there’s
going to be a slower response time for clients of your backend
application.
For some use cases, this may be acceptable. Not all applications
need to have sub-30ms response times. If your application is fine
with 100ms response times, the increased flexibility and easier
analytics for early-stage use cases might be worth the slower
performance.
For the past few years, many applications have opted for a REST-
based API on the backend and a single-page application on the
frontend. It might look as follows:
In a REST-based API, you have different resources which generally
map to an entity in your application, such as Users or Orders. You
can perform CRUD-like (Create, Read, Update, Delete) operations
on these resources by using different HTTP verbs, like GET, POST,
or PUT, to indicate the operation you want to perform.
In the example above, the client has to make two requests—one to
get the User, and one to get the most recent Orders for a user.
With GraphQL, you can fetch all the data you need for a page in a single request. For example, a single GraphQL query could fetch the User with id 112233, certain attributes about the user (including firstName, lastName, and addresses), and all of the orders that are owned by that user.
The web browser makes a single request to our backend server. The
contents of that request will be our GraphQL query. The GraphQL implementation on the backend will parse the query and handle it.
In a way, this mirrors our discussion earlier about why you want to
use single-table design with DynamoDB. We only want to make a
single request to DynamoDB to fetch heterogeneous items, just like the frontend wants to make a single request to the backend to fetch heterogeneous resources. It sounds like a match made in heaven.
For different types in your query, such as User and Order in our
example, you would usually have a resolver that would make a
database request to resolve the value. The resolver would be given
some arguments to indicate which instances of that type should be
fetched, and then the resolver will fetch and return the data.
In this flow, our backend is making multiple, serial requests to
DynamoDB to fulfill our access pattern. This is exactly what we’re
trying to avoid with single-table design!
None of this goes to say that you can’t use DynamoDB with
GraphQL—you absolutely can. I just think it’s a waste to spend time
on a single-table design when using GraphQL with DynamoDB.
Because GraphQL entities are resolved separately, I think it’s fine to
model each entity in a separate table. It will allow for more
flexibility and make it easier for analytics purposes going forward.
8.4. Conclusion
Chapter 9. From modeling to
implementation
Chapter Summary
This chapter reviews converting your DynamoDB data model
from concept to code. It includes tips for a successful, efficient,
and extendable implementation.
Sections
1. Separate application attributes from indexing attributes
2. Implement your data model at the very boundary of your
application
3. Don’t reuse attributes across multiple indexes
4. Add a "Type" attribute to every item
5. Write scripts to help debug access patterns
6. Shorten attribute names to save storage
In this chapter, we’ll cover some tips to help you implement your DynamoDB data model in your application code.
response = client.get_item(**kwargs)
print(response['Item'])
"""
{
    "PK": { "S": "USER#alexdebrie" },
    "SK": { "S": "USER#alexdebrie" },
    "GSI1PK": { "S": "ORG#facebook" },
    "GSI1SK": { "S": "USER#alexdebrie" },
    "Username": { "S": "alexdebrie" },
    "FirstName": { "S": "Alex" },
    "LastName": { "S": "DeBrie" },
    "OrganizationName": { "S": "Facebook" },
    ...
}
"""
Notice that the first four attributes are all related to my DynamoDB
data model but have no meaning in my application business logic. I
refer to these as 'indexing attributes', as they’re only there for
indexing your data in DynamoDB. The remaining attributes, such as Username, FirstName, and LastName, are properties that are actually useful in my application.
I advise you to keep a separation between these two kinds of
attributes. Your application attributes will often inform parts of
your indexing attributes. Here, the Username attribute is used to fill
in pieces of the PK, SK, and GSI1SK templates.
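As a rough sketch of how that separation can look in code, here is a hypothetical helper that derives the indexing attributes from the application attributes before writing the item:

def user_to_item(username, first_name, last_name, org_name):
    # The application attributes carry the business data; the indexing
    # attributes (PK, SK, GSI1PK, GSI1SK) are derived from them.
    return {
        "PK": { "S": f"USER#{username}" },
        "SK": { "S": f"USER#{username}" },
        "GSI1PK": { "S": f"ORG#{org_name.lower()}" },
        "GSI1SK": { "S": f"USER#{username}" },
        "Username": { "S": username },
        "FirstName": { "S": first_name },
        "LastName": { "S": last_name },
        "OrganizationName": { "S": org_name }
    }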
In the previous example, we printed out one of the User items that
may be saved in our DynamoDB table. You’ll notice there are two
things that are odd about that item. First, it includes the indexing
attributes that we just discussed. Second, each attribute value is a
map with a single key and value indicating the DynamoDB type
and the value for the attribute.
user = data.get_user(username='alexdebrie')
print(user)
# User(username="alexdebrie", first_name="Alex", ...)

def get_user(username):
    resp = client.get_item(
        TableName='AppTable',
        Key={
            'PK': { 'S': f'USER#{username}' },
            'SK': { 'S': f'USER#{username}' }
        }
    )
    return User(
        username=resp['Item']['Username']['S'],
        first_name=resp['Item']['FirstName']['S'],
        last_name=resp['Item']['LastName']['S'],
    )
9.3. Don’t reuse attributes across multiple
indexes
The next tip I have is partly a data modeling tip but it fits well here.
As discussed above, you will have indexing attributes that are solely
for properly indexing your data in DynamoDB. Let’s print out our
User object from above to see those attributes again.
response = client.get_item(**kwargs)
print(response['Item'])
"""
{
    "PK": { "S": "USER#alexdebrie" },
    "SK": { "S": "USER#alexdebrie" },
    "GSI1PK": { "S": "ORG#facebook" },
    "GSI1SK": { "S": "USER#alexdebrie" },
    "Username": { "S": "alexdebrie" },
    "FirstName": { "S": "Alex" },
    "LastName": { "S": "DeBrie" },
    "OrganizationName": { "S": "Facebook" },
    ...
}
"""
When you do this, you may notice that SK and GSI1SK are the same
value. And because they’re the same value, you may be tempted to
skip adding GSI1SK altogether and make your GSI1 index use a key
schema of GSI1PK and SK.
Don’t do this.
Save yourself the pain. For each global secondary index you use,
give it a generic name of GSI<Number>. Then, use GSI<Number>PK
and GSI<Number>SK for your attribute types.
In the migration strategies discussed in Chapter 15, we discuss that
you may need to modify existing items to decorate them with new
indexing attributes. To do that, you do a background ETL operation
that scans your table, finds the items you need to modify, and adds
the attributes.
When doing this, you usually only need to update certain entity
types. I like to use a filter expression on the Type attribute when
scanning my table to ensure I’m only getting the items that I need.
It simplifies the logic in my ETL script.
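A hedged sketch of such an ETL scan, assuming a table named AppTable and a Type attribute whose value is the entity name:

import boto3

dynamodb = boto3.client('dynamodb')

kwargs = {
    'TableName': 'AppTable',
    'FilterExpression': '#type = :type',
    'ExpressionAttributeNames': { '#type': 'Type' },
    'ExpressionAttributeValues': { ':type': { 'S': 'User' } }
}

while True:
    resp = dynamodb.scan(**kwargs)
    for item in resp['Items']:
        # Decorate the item with the new indexing attributes here.
        pass
    if 'LastEvaluatedKey' not in resp:
        break
    kwargs['ExclusiveStartKey'] = resp['LastEvaluatedKey']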
Rather than use the DynamoDB console, I recommend writing little
scripts that can be used to debug your access patterns. These scripts
can be called via a command-line interface (CLI) in the terminal. A
script should take the parameters required for an access pattern—a
username or an order ID—and pass that to your data access code.
Then it can print out the results.
# scripts/get_user.py
import click

import data


@click.command()
@click.option('--username', help='Username of user to retrieve.')
def get_user(username):
    user = data.get_user(username)
    print(user)


if __name__ == '__main__':
    get_user()
For a simple access pattern to fetch a single item, it may not seem
that helpful. However, if you’re retrieving multiple related items
from a global secondary index with complex conditions on the sort
key, these little scripts can be lifesavers. Write them at the same
time you’re implementing your data model.
Because attribute names are stored with every item and count toward its size, there is a different approach you can take to save on storage. This is a pretty advanced pattern that I would recommend only for the largest tables and for those that are heavily into the DynamoDB mindset.
In line with this, you can abbreviate your attribute names when
saving items to DynamoDB to reduce storage costs. For example,
imagine the following code to save a User in your application:
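The original code sample isn't reproduced here, but a minimal sketch of the idea looks like the following; the abbreviations are illustrative, and the mapping between short and full names should live in one place in your data access layer.

import boto3

dynamodb = boto3.client('dynamodb')

# Short attribute names ('u', 'fn', 'ln') are stored with every item,
# which saves storage at the cost of readability.
dynamodb.put_item(
    TableName='AppTable',
    Item={
        "PK": { "S": "USER#alexdebrie" },
        "SK": { "S": "USER#alexdebrie" },
        "u": { "S": "alexdebrie" },   # Username
        "fn": { "S": "Alex" },        # FirstName
        "ln": { "S": "DeBrie" }       # LastName
    }
)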
Again, this is an advanced pattern. For the marginal application, the
additional attribute names won’t be a meaningful cost difference.
However, if you plan on storing billions and trillions of items in
DynamoDB, this can make a difference with storage.
9.7. Conclusion
When you’re working with DynamoDB, the lion’s share of the work
will be upfront before you ever write a line of code. You’ll build
your ERD and design your entities to handle your access patterns as
discussed in Chapter 7.
After you’ve done the hard design work, then you need to convert it
to your application code. In this chapter, we reviewed some tips for
handling that implementation. We saw how you should implement
your DynamoDB code at the very boundary of your application.
We also learned about conceptually separating your application
attributes from your indexing attributes. Then we saw a few tips
around saving your sanity later on by including a Type attribute on
all items and writing scripts to help explore your data. Finally, we
saw an advanced tip for saving costs on storage by abbreviating
your attribute names.
Chapter 10. The Importance of
Strategies
We’ve covered a lot of ground already about the basics of
DynamoDB, and the second half of this book contains a number of
data modeling examples. There’s no substitute for seeing a full
example of a DynamoDB table in action to really grok how the
modeling works.
With DynamoDB, on the other hand, there are multiple ways to
approach the problem, and you need to use judgment as to which
approach works best for your situation. DynamoDB modeling is
more art than science—two people modeling the same application
can have vastly different table designs.
In the next six chapters, we’ll look at some of the strategies you can
use in modeling with DynamoDB.
Chapter 11. Strategies for one-
to-many relationships
Chapter Summary
Your application objects will often have a parent-child
relationship with each other. In this chapter, we’ll see different
approaches for modeling one-to-many relationships with
DynamoDB.
Sections
1. Denormalization by using a complex attribute
2. Denormalization by duplicating data
3. Composite primary key + the Query API action
4. Secondary index + the Query API action
5. Composite sort keys with hierarchical data
With one-to-many relationships, there’s one core problem: how do
I fetch information about the parent entity when retrieving one or
more of the related entities?
The basics of normalization are discussed in Chapter 7, but there
are a number of areas where denormalization is helpful with
DynamoDB.
In a relational database, you would model this with two tables using
a foreign key to link the tables together, as follows:
Notice that each record in the Addresses table includes a
CustomerId, which identifies the Customer to which this Address
belongs. You can follow the pointer to the record to find
information about the Customer.
Because MailingAddresses contains multiple values, it is no longer
atomic and, thus, violates the principles of first normal form.
A single DynamoDB item cannot exceed 400KB of data. If the
amount of data that is contained in your complex attribute is
potentially unbounded, it won’t be a good fit for denormalizing
and keeping together on a single item.
Each record needs a unique identifier, or key. In a relational database, this might be an auto-incrementing primary key. In DynamoDB, this is the primary key that we discussed in previous chapters.
Note: In reality, a book can have multiple authors. For simplification of this
example, we’re assuming each book has exactly one author.
This works in a relational database as you can join those two tables
at query-time to include the author’s biographical information
when retrieving details about the book.
Notice that there are multiple Books that contain the biographical
information for the Author Stephen King. Because this information
won’t change, we can store it directly on the Book item itself.
Whenever we retrieve the Book, we will also get information about
the parent Author item.
There are two main questions you should ask when considering this strategy:
1. Is the duplicated information immutable?
2. If the information does change, how often does it change, and how many items include the duplicated information?
Even if the data you’re duplicating does change, you still may
decide to duplicate it. The big factors to consider are how often the
data changes and how many items include the duplicated
information.
If the data is read far more often than it changes, duplication saves you on those subsequent reads. When the duplicated data does change, you’ll need to work to ensure it’s changed in all those items.
Let’s use one of the examples from the beginning of this section. In
a SaaS application, Organizations will sign up for accounts. Then,
multiple Users will belong to an Organization and take advantage of
the subscription.
Entity PK SK
Organizations ORG#<OrgName> METADATA#<OrgName>
Users ORG#<OrgName> USER#<UserName>
Table 7. SaaS App entity chart
The example table includes Users such as Satya Nadella and Jeff Bezos.
Outlined in red is the item collection for items with the partition
key of ORG#MICROSOFT. Notice how there are two different item
types in that collection. In green is the Organization item type in
that item collection, and in blue is the User item type in that item
collection.
This primary key design makes it easy to solve four access patterns:
1. Retrieve an Organization.
2. Retrieve an Organization and all Users within the Organization.
3. Retrieve only the Users within an Organization.
4. Retrieve a specific User.
While all four of these access patterns can be useful, the second
access pattern—Retrieve an Organization and all Users within the
Organization—is most interesting for this discussion of one-to-
many relationships. Notice how we’re emulating a join operation in
SQL by locating the parent object (the Organization) in the same
item collection as the related objects (the Users). We are pre-joining
our data by arranging them together at write time.
This is a pretty common way to model one-to-many relationships
and will work for a number of situations. For examples of this
strategy in practice, check out the e-commerce example in Chapter
19 or the GitHub example in Chapter 21.
You may need to use this pattern instead of the previous pattern
because the primary keys in your table are reserved for another
purpose. It could be some write-specific purpose, such as to ensure
uniqueness on a particular property, or it could be because you
have hierarchical data with a number of levels.
For the latter situation, let’s go back to our most recent example.
Imagine that in your SaaS application, each User can create and
save various objects. If this were Google Drive, it might be a
Document. If this were Zendesk, it might be a Ticket. If it were
Typeform, it might be a Form.
Let’s use the Zendesk example and go with a Ticket. For our cases,
let’s say that each Ticket is identified by an ID that is a combination
of a timestamp plus a random hash suffix. Further, each ticket
belongs to a particular User in an Organization.
We could try to model the Tickets in the same item collection as the Users by reusing the previous strategy, as follows:
The problem with this is that it really jams up my prior use cases. If
I want to retrieve an Organization and all its Users, I’m also
retrieving a bunch of Tickets. And since Tickets are likely to vastly
exceed the number of Users, I’ll be fetching a lot of useless data and
making multiple pagination requests to handle our original use
case.
Instead, we can add GSI1PK and GSI1SK attributes to the User and Ticket items and group them in a secondary index, using a GSI1PK value of ORG#<OrgName>#USER#<UserName>.
Notice that our Ticket items are no longer interspersed with their
parent Users in the base table. Further, the User items now have
additional GSI1PK and GSI1SK attributes that will be used for
indexing.
This secondary index has an item collection with both the User
item and all of the user’s Ticket items. This enables the same access
patterns we discussed in the previous section.
In the last two strategies, we saw some data with a couple levels of
hierarchy—an Organization has Users, which create Tickets. But
what if you have more than two levels of hierarchy? You don’t want
to keep adding secondary indexes to enable arbitrary levels of
fetching throughout your hierarchy.
Instead, we can use a composite sort key on our table. The term composite sort key means that we’ll be smashing a bunch of properties together in our sort key to allow for different search granularity.
Let’s see how this looks in a table. Below are a few items:
In our table, the partition key is the country where the Starbucks is
located. For the sort key, we include the State, City, and ZipCode,
with each level separated by a #. With this pattern, we can search at
four levels of granularity using just our primary key!
1. Find all locations in a given country. Use a Query with a key condition expression of PK = <Country>.
2. Find all locations in a given country and state. Use a Query with a condition expression of PK = <Country> AND begins_with(SK, '<State>#').
3. Find all locations in a given country, state, and city. Use a Query with a condition expression of PK = <Country> AND begins_with(SK, '<State>#<City>').
4. Find all locations in a given country, state, city, and zip code. Use a Query with a condition expression of PK = <Country> AND begins_with(SK, '<State>#<City>#<ZipCode>').
A sketch of one of these queries is shown after this list.
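Here is a hedged sketch of the state-and-city query; the table name and attribute names (Country for the partition key, SK for the composite sort key) are assumptions based on the description above.

import boto3

dynamodb = boto3.client('dynamodb')

# Find all locations in a given country, state, and city.
result = dynamodb.query(
    TableName='StarbucksLocations',
    KeyConditionExpression="#country = :country AND begins_with(#sk, :prefix)",
    ExpressionAttributeNames={
        "#country": "Country",
        "#sk": "SK"
    },
    ExpressionAttributeValues={
        ":country": { "S": "USA" },
        ":prefix": { "S": "NE#Omaha" }
    }
)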
This composite sort key pattern won’t work for all scenarios, but it
can be great in the right situation. It works best when:
• You have many levels of hierarchy (>2), and you have access
patterns for different levels within the hierarchy.
• When searching at a particular level in the hierarchy, you want
all subitems in that level rather than just the items in that level.
Strategy | Notes | Relevant examples
Denormalize + complex attribute | Good when nested objects are bounded and are not accessed directly | Chapter 18
Denormalize + duplicate | Good when duplicated data is immutable or infrequently changing | Movie roles table
Primary key + Query API | Most common. Good for multiple access patterns on both the parent and related entities | Chapters 19 & 20
Secondary index + Query API | Similar to primary key strategy. Good when primary key is needed for something else | Chapters 19 & 20
Composite sort key | Good for deeply nested hierarchies where you need to search through multiple levels of the hierarchy | Chapter 20
Table 8. One-to-many relationship strategies
Chapter 12. Strategies for
many-to-many relationships
Chapter Summary
This chapter contains four strategies for modeling many-to-
many relationships in your DynamoDB table.
Sections
1. Shallow duplication
2. Adjacency list
3. Materialized graph
4. Normalization & multiple requests
The tricky part about many-to-many relationships is that you often need to query both sides of the relationship. If your table has students &
classes, you may have one access pattern where you want to fetch a
student and the student’s schedule, and you may have a different
access pattern where you want to fetch a class and all the students in
the class. This is the main challenge of many-to-many access
patterns.
In this chapter, we’ll cover four strategies for modeling many-to-many relationships:
• Shallow duplication
• Adjacency list
• Materialized graph
• Normalization & multiple requests
12.1. Shallow duplication
The first strategy we will use is one that I call "shallow duplication".
Let’s see an example where this would be helpful and then discuss
when to use it.
One of your access patterns is to fetch a class and all of the students
in the class. However, when fetching information about a class, you
don’t need detailed information about each student in the class. You
only need a subset of information, such as a name or an ID. Our
user interface will then provide a link to click on the student for
someone that wants more detailed information about the student.
In this example, there are five items. The top three items are
Student items that have records about the students enrolled at the
school. The bottom two items are two Class items that track
information about a particular class.
Note that this strategy only handles one side of the many-to-many
relationship, as you still need to model out how to fetch a student
and all classes in which the student is enrolled. However, by using
this pattern, you have taken out one side of the many-to-many
relationship. You can now handle the other side using one of the
one-to-many relationship strategies from the previous chapter.
In our example table, we’ll have three types of items with the
following primary key patterns:
Movies:
• PK: MOVIE#<MovieName>
• SK: MOVIE#<MovieName>
Actors:
• PK: ACTOR#<ActorName>
• SK: ACTOR#<ActorName>
Roles:
• PK: MOVIE#<MovieName>
• SK: ACTOR#<ActorName>
The Movie and Actor items are top-level items, and the Role item
represents the many-to-many relationship between Movies and
Actors.
With this base table configuration, notice that our Movie and Role
items are in the same item collection. This allows us to fetch a
movie and actor roles that played in the movie with a single request
by making a Query API call that uses PK = MOVIE#<MovieName> in
the key condition expression.
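A rough sketch of that Query, assuming a table named MoviesAndActors with generic PK and SK attributes:

import boto3

dynamodb = boto3.client('dynamodb')

# Fetch the Movie item plus all Role items for that movie.
result = dynamodb.query(
    TableName='MoviesAndActors',
    KeyConditionExpression="#pk = :pk",
    ExpressionAttributeNames={ "#pk": "PK" },
    ExpressionAttributeValues={ ":pk": { "S": "MOVIE#Toy Story" } }
)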
We can then add a global secondary index that flips the composite
key elements. In the secondary index, the partition key is SK and
the sort key is PK. Our index looks as follows:
Now our Actor item is in the same item collection as the actor’s
Role items, allowing us to fetch an Actor and all roles in a single
request.
The nice part about this strategy is that you can combine mutable
information with immutable information in both access patterns.
For example, the Movie item has some attributes that will change
over time, such as the total box office receipts or the IMDB score.
Likewise, the Actor item has a mutable attribute like the total
number of upvotes they have received. With this setup, we can edit
the mutable parts—the Movie and Actor items—without editing the
immutable Role items. The immutable items are copied into both
item collections, giving you a full look at the data while keeping
updates to a minimum.
This pattern works best when the information about the relationship
between the two is immutable. In this case, nothing changes about
the role that an actor played after the fact. This makes it an ideal fit
for this pattern.
Notice that the node with ID 156 is the Person node for Alex DeBrie.
Instead of including all attributes about me in a single item, I’ve
broken it up across multiple items. I have one item that includes
information about the day I was married and another item that
includes information about my job. I’ve done similar things with
my wife and with Atticus Finch.
Then, you can use a secondary index to reshuffle those items and
group them according to particular relationships, as shown below:
Notice that there are new groupings of nodes and edges. In the
partition for the date of May 28, 2011, both my wife and I have an
edge in there to represent our wedding. You could imagine other
items in there to represent births, deaths, or other important
events.
12.4. Normalization and multiple requests
The most common example I use for this one is in a social media
application like Twitter. A user can follow multiple users, and one
user can be followed by multiple users. Twitter has an access
pattern where they need to fetch all people that a given user is
following. For example, here is what I see when I look at people I’m
following:
Notice that the inimitable Paul Chin, Jr. shows up here along with
some other great AWS and Serverless folks. Importantly, it contains
information that will change over time. For each person I’m
following, it includes their display name (which can change) as well
as their profile description.
If we duplicated this information onto each relationship item, then to keep the information fresh, we would need to update a user’s follower items each time the user changed their display name or profile. This
could add a ton of write traffic as some users have thousands or
even millions of followers!
Rather than having all that write thrashing, we can do a little bit of
normalization (eek!) in our DynamoDB table. First, we’ll store two
types of items in our table: Users and Following items. The
structure will look as follows:
Users:
• PK: USER#<Username>
• SK: USER#<Username>
Following:
• PK: USER#<Username>
• SK: FOLLOWING#<Username>
Notice that we have a User item record for Alex DeBrie, Paul Chin
Jr., and Gillian Armstrong. We also have a Following item
indicating that Alex DeBrie is following Paul Chin Jr. and Gillian
Armstrong. However, that Following item is pretty sparse, as it
contains only the basics about the relationship between the two
users.
When Alex DeBrie wants to view all the users he’s following, the Twitter backend will do a two-step process (sketched in code after this list):
1. Use the Query API call to fetch the User item and the initial
Following items in Alex DeBrie’s item collection to find
information about the user and the first few users he’s following.
2. Use the BatchGetItem API call to fetch the detailed User items
for each Following item that was found for the user. This will
provide the authoritative information about the followed user,
such as the display name and profile.
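Here's a rough sketch of what that two-step process could look like with boto3. The table name ('SocialTable') is a placeholder since we haven't named the table in this example, and the page size is up to you:

import boto3

client = boto3.client('dynamodb')

# Step 1: Fetch the User item and the first few Following items.
# Reading the item collection in descending order returns the User item
# (SK of USER#<Username>) before the FOLLOWING# items.
resp = client.query(
    TableName='SocialTable',
    KeyConditionExpression='#pk = :pk',
    ExpressionAttributeNames={'#pk': 'PK'},
    ExpressionAttributeValues={':pk': {'S': 'USER#alexdebrie'}},
    ScanIndexForward=False,
    Limit=26  # the User item plus up to 25 Following items
)

followed_usernames = [
    item['SK']['S'].split('#')[1]
    for item in resp['Items']
    if item['SK']['S'].startswith('FOLLOWING#')
]

# Step 2: Batch-fetch the authoritative User item for each followed user.
followed_users = client.batch_get_item(
    RequestItems={
        'SocialTable': {
            'Keys': [
                {'PK': {'S': f'USER#{username}'}, 'SK': {'S': f'USER#{username}'}}
                for username in followed_usernames
            ]
        }
    }
)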
Social media is my go-to example for this one, but a second one is
in an online e-commerce store’s shopping cart. You have customers
that can purchase multiple products, and a product can be
purchased by multiple customers. As a customer adds an item to
the cart, you can duplicate some information about the item at that
time, such as the size, price, item number, etc. However, as the user
goes to check out, you need to go back to the authoritative source to
find the current price and whether it’s in stock. In this case, you’ve
done a shallow duplication of information to show users the
number of items in their cart and the estimated price, but you don’t
check the final price until they’re ready to check out.
12.5. Conclusion
Chapter 13. Strategies for
filtering
Chapter Summary
Filtering is one of the key properties of any database. This
chapter shows common filtering patterns in DynamoDB.
Sections
1. Filtering with the partition key
2. Filtering with the sort key
3. Composite sort key
4. Sparse indexes
5. Filter expressions
6. Client-side filtering
To filter effectively, you need to carefully choose and index your primary
keys in order to get the most out of DynamoDB. Note that in this sense, I use 'primary key' to include
both the primary key of your base table and of any secondary
indexes.
The first and easiest way to filter data in DynamoDB is with the
partition key of your primary key. Recall from Chapter 3 that the
partition key is responsible for determining which storage node will
store your item. When a request is received by DynamoDB, it will
hash the partition key to find which storage node should be used to
save or read the requested item.
This partitioning is done to keep data access fast, no matter the size
of your data. By relying heavily on this partition key, DynamoDB
starts every operation with an O(1) lookup that reduces your
dataset from terabytes or more down to a single storage node that
contains a maximum of 10GB. As your data continues to grow,
you’ll benefit from this constant-time operation.
Be sure to use this partition key as you think about data design and
filtering. In previous chapters, we’ve used the example of a table
that includes Movies and Actors. In this example table, the partition
key was the Actor name:
In our Query API requests, we can use the partition key to filter
down to the specific actor we want. This makes it easy to quickly
retrieve all of Natalie Portman’s movies, if needed.
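To make that concrete, the Query for this pattern could look something like the following. The table name 'MoviesAndActors' is assumed here for illustration:

import boto3

client = boto3.client('dynamodb')

result = client.query(
    TableName='MoviesAndActors',
    KeyConditionExpression='#actor = :actor',
    ExpressionAttributeNames={'#actor': 'Actor'},
    ExpressionAttributeValues={':actor': {'S': 'Natalie Portman'}}
)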
With this secondary index, we can use the Query API to find all
actors that have been in a particular movie by specifying the movie
as the partition key in our request.
The partition key has to be the starting point for your filtering.
Other than the costly Scan operation, all DynamoDB APIs require
the partition key in the request. Make sure you assemble items that
are retrieved together in the same partition key.
There are a couple of different ways that you can filter with the sort
key. Let’s look at two of them:
Notice that the orders use the CustomerId as the partition key,
which groups them by customer. You could then use the sort key to
find orders for a customer within a certain time range. For
example, if you wanted all orders for customer 36ab55a589e4 that
were placed between January 11 and February 1, 2020, you could
write the following Query:
result = dynamodb.query(
TableName='CustomerOrders',
KeyConditionExpression="#c = :c AND #ot BETWEEN :start and :end",
ExpressionAttributeNames={
"#c": "CustomerId",
"#ot": "OrderTime"
},
ExpressionAttributeValues={
":c": { "S": "36ab55a589e4" },
":start": { "S": "2020-01-11T00:00:00.000000" },
":end": { "S": "2020-02-01T00:00:00.000000" }
}
)
The key point with this strategy is that the sort key itself is
inherently meaningful. It shows a date, and you’re filtering within a
certain date range.
Notice that there is an item collection that uses
REPO#alexdebrie/dynamodb-book as the partition key. The first
three items in that table are Issue items, where the sort key pattern
is ISSUE#<IssueNumber>. Then there is a Repo item whose sort key
is the same as the partition key (REPO#alexdebrie/dynamodb-book).
Finally, there are two Star items whose sort key pattern is
STAR#<Username>.
When we want to fetch the Repo item and all its Issue items, we
need to add conditions on the sort key so that we are filtering out
the Star items. To do this, we’ll write a Query like the following:
result = dynamodb.query(
TableName='GitHubTable',
KeyConditionExpression="#pk = :pk AND #sk <= :sk",
ExpressionAttributeNames={
"#pk": "PK",
"#sk": "SK"
},
ExpressionAttributeValues={
":pk": { "S": "REPO#alexdebrie#dynamodb-book" },
":sk": { "S": "REPO#alexdebrie#dynamodb-book" }
},
ScanIndexForward=False
)
With the condition on the sort key value, we’re saying to only look
for items whose sort key is less than or equal to
REPO#alexdebrie#dynamodb-book. Further, we're using
ScanIndexForward=False to read the items in descending order.
This means we'll start at the Repo item and work backwards through the
Issue items, while the sort key condition keeps the Star items out of our
results.
When you want to fetch a Repo and all of its Stars, you would use
the opposite pattern: assert that the sort key is greater than or equal
to REPO#alexdebrie#dynamodb-book to start at the Repo and work
through the Star items.
The key difference between this pattern and the simple filtering
pattern with the sort key is that there’s no inherent meaning in the
sort key values. Rather, the way that I’m sorting is a function of how
I decided to arrange my items within a particular item collection.
The terminology can get confusing here. A composite primary key is the technical
term for a primary key that has two elements: a partition key and a sort key. A
composite sort key is a term of art to indicate a sort key value that contains two
or more data elements within it.
Let’s see an example that uses a composite sort key, then we’ll
discuss when it’s helpful.
In this example, we create a new OrderStatusDate attribute that is a
combination of the OrderStatus attribute and the OrderDate
attribute. The value for this OrderStatusDate attribute is equal to
those two attributes separated by a #, as shown below.
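For example, when writing an Order item you could build the OrderStatusDate value from the two underlying attributes. This is a sketch; the exact set of attributes on your Order items will differ:

import boto3

client = boto3.client('dynamodb')

order_status = 'CANCELLED'
order_time = '2019-07-17T12:43:31.000000'

client.put_item(
    TableName='CustomerOrders',
    Item={
        'CustomerId': {'S': '2b5a41c0'},
        'OrderTime': {'S': order_time},
        'OrderStatus': {'S': order_status},
        # Composite sort key for the secondary index: <OrderStatus>#<OrderDate>
        'OrderStatusDate': {'S': f'{order_status}#{order_time}'}
        # ... other order attributes ...
    }
)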
Now we can use the Query API to quickly find what we want. To
find all CANCELLED orders for customer 2b5a41c0 between July 1,
2019 and September 30, 2019, you would write the following
Query:
result = dynamodb.query(
TableName='CustomerOrders',
IndexName="OrderStatusDateGSI",
KeyConditionExpression="#c = :c AND #osd BETWEEN :start and :end",
ExpressionAttributeNames={
"#c": "CustomerId",
"#ot": "OrderStatusDate"
},
ExpressionAttributeValues={
":c": { "S": "2b5a41c0" },
":start": { "S": "CANCELLED#2019-07-01T00:00:00.000000" },
":end": { "S": "CANCELLED#2019-10-01T00:00:00.000000" },
}
)
This is much more efficient than looking through all of the order
items to find the proper ones.
The composite sort key pattern works well when the following
statements are true:
1. You want to filter on both attributes in a single access pattern.
2. One of the attributes is an enum-like value.
Notice how our items are sorted in our secondary index. They are
sorted first by the OrderStatus, then by the OrderDate. This means
we can do an exact match on that value and use more fine-grained
filtering on the second value.
When creating a secondary index, you will define a key schema for
the index. When you write an item into your base table,
DynamoDB will copy that item into your secondary index if it has
the elements of the key schema for your secondary index.
Crucially, if an item doesn’t have those elements, it won’t be copied
into the secondary index.
Note: if you’re using overloaded secondary indexes, many of your
secondary indexes might technically be sparse indexes. Imagine
you have a table that has three different entity types —
Organization, User, and Ticket. If two of those entities have two
access patterns while the third entity only has one, the two entities
will likely be projected into an overloaded secondary index.
Technically, this is a sparse index because it doesn’t include all
items in your base table. However, it’s a less-specific version of the
sparse index pattern.
Notice that our Users have different roles in our application. Some
are Admins and some are regular Members. Imagine that you had
an access pattern that wanted to fetch all Users that had
Administrator privileges within a particular Organization.
To handle this, we can add a secondary index that uses
ORG#<OrgName> as the GSI1PK, and we'll include Admin as the GSI1SK
only if the user is an administrator in the organization.
Notice that both Warren Buffett and Sheryl Sandberg have values
for GSI1SK but Charlie Munger does not, as he is not an admin.
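A sketch of how this could look when writing and querying the items is below. The table name, the key structure for User items, and the index name are assumptions for illustration:

import boto3

client = boto3.client('dynamodb')

# An admin user gets the GSI1PK and GSI1SK attributes; a regular member
# simply omits them and stays out of the sparse index.
client.put_item(
    TableName='SaaSTable',
    Item={
        'PK': {'S': 'ORG#BERKSHIRE'},
        'SK': {'S': 'USER#warrenbuffett'},
        'Name': {'S': 'Warren Buffett'},
        'Role': {'S': 'Admin'},
        'GSI1PK': {'S': 'ORG#BERKSHIRE'},
        'GSI1SK': {'S': 'Admin'}
    }
)

# Fetch all admins for an organization via the sparse index.
admins = client.query(
    TableName='SaaSTable',
    IndexName='GSI1',
    KeyConditionExpression='#gsi1pk = :org',
    ExpressionAttributeNames={'#gsi1pk': 'GSI1PK'},
    ExpressionAttributeValues={':org': {'S': 'ORG#BERKSHIRE'}}
)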
The key to note here is that we’re intentionally using a sparse index
strategy to filter out User items that are not Administrators. We can
still use non-sparse index patterns for other entity types.
Notice that the table includes Customers, Orders, and
InventoryItems, as discussed, and these items are interspersed
across the table.
Notice the Customer items now have an attribute named
CustomerIndexId as outlined in red.
Only the Customer items are projected into this table. Now when
the marketing department wants to find all Customers to send
marketing emails, they can run a Scan operation on the
CustomerIndex, which is much more targeted and efficient. By
isolating all items of a particular type in the index, our sparse index
makes finding all items of that type much faster.
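A sketch of that Scan is below, including pagination in case there are more Customers than fit in a single 1MB response. The table name is assumed:

import boto3

client = boto3.client('dynamodb')

customers = []
params = {
    'TableName': 'EcommerceTable',
    'IndexName': 'CustomerIndex'
}

while True:
    resp = client.scan(**params)
    customers.extend(resp['Items'])
    if 'LastEvaluatedKey' not in resp:
        break
    params['ExclusiveStartKey'] = resp['LastEvaluatedKey']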
Again, notice that this strategy does not work with index
overloading. With index overloading, we’re using a secondary index
to index different entity types in different ways. However, this
strategy relies on projecting only a single entity type into the
secondary index.
Both sparse index patterns are great for filtering out non-matching
items entirely.
2. Simpler application code. If your application receives items back from
DynamoDB and immediately runs a filter() method to throw
some away, it can be easier to handle that filtering in your code than in
your API request to DynamoDB. This is more of a personal preference.
3. Better validation around time-to-live (TTL) expiry. When using
DynamoDB TTL, AWS states that items are generally deleted
within 48 hours of their TTL expiry. This is a wide range! If
you’re counting on expiration as a part of your business logic,
you could get incorrect results. To help guard against this, you
could write a filter expression that removes all items that should
have been expired by now, even if they’re not quite expired yet.
To see an example of this in action, check out Chapter 18.
If you implement this filter via a filter expression, you don’t know
how many items you will need to fetch to ensure you get ten orders
to return to the client. Accordingly, you’ll likely need to vastly
overfetch your items or have cases where you make follow-up
requests to retrieve additional items.
If you have a table where the order status is built into the access
pattern, like the composite sort key example shown above, you
know that you can add a Limit=10 parameter into your request and
retrieve exactly ten items.
For the first example, imagine you have a calendar application. You
want to make it easy to find where certain gaps exist in the calendar
in order to schedule new appointments.
For the second factor, I often see people ask how to implement
many different filter conditions on a dataset. The canonical
example here is a table view in a web application. Perhaps the table
has 10 different columns, and users can specify conditions, filters,
or sort conditions for each of the columns.
If the data set that is being manipulated isn’t that large (sub-1MB),
just send the entire data set to the client and let all the manipulation
happen client-side. You’ll be able to handle any number of
conditions that are thrown at you, including full-text search, and
you won’t be pounding your database to handle it.
Again, this only works if the dataset is small to begin with. If you are
talking about Amazon.com’s inventory, it’s probably not a good fit.
Their inventory is so broad with so many filters (categories, price,
brand, size, etc.) that you cannot pull down all that data. But if we’re
talking about showing a table view of a particular user’s orders from
your e-commerce site, it’s much more manageable. At that point,
you’ve already done the work to narrow the vast dataset down to a
single customer, so you can let the client handle the rest.
13.7. Conclusion
Strategy | Notes | Relevant examples
Composite sort key | Good for filtering on multiple fields | E-commerce filtering on order status and time
Sparse index | Good for providing a global filter on a set of data | Deals example
Filter expressions | Good for removing a few extraneous items from a narrow result set | GitHub example
Client-side filtering | Good for flexible filtering and sorting on a small (<1MB) set of data | Table views

Table 10. Filtering strategies
Chapter 14. Strategies for
sorting
Chapter Summary
When considering your DynamoDB access patterns up front,
be sure to include any sorting requirements you have. In this
chapter, we’ll review different strategies for properly sorting
your data.
Sections
1. Basics of sorting
2. Sorting on changing attributes
3. Ascending vs. descending
4. Two relational access patterns in a single item collection
5. Zero-padding with numbers
6. Faking ascending order
In DynamoDB, sorting must be done with the sort key of a particular item collection.
In this chapter, we'll review some strategies for sorting.
As mentioned, sorting happens only on the sort key. You can only
use the scalar types of string, number, and binary for a sort key.
Thus, we don’t need to think about how DynamoDB would sort a
map attribute!
For sort keys of type number, the sorting is exactly as you would
expect—items are sorted according to the value of the number.
You might be surprised to see that DeBrie came before Dean! This
is due to the casing—uppercase before lowercase.
To avoid odd behavior around this, you should standardize your
sort keys in all uppercase or all lowercase values:
With all last names in uppercase, they are now sorted as we would
expect. You can then hold the properly-capitalized value in a
different attribute in your item.
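For example, when writing an item you could uppercase the value that goes into the sort key and keep the original casing in a separate attribute. The table and key names here are assumptions:

import boto3

client = boto3.client('dynamodb')

last_name = 'DeBrie'

client.put_item(
    TableName='Customers',
    Item={
        'PK': {'S': 'CUSTOMERS'},
        'LastName': {'S': last_name.upper()},   # sort key: uppercased for sorting
        'DisplayLastName': {'S': last_name}     # original casing kept for display
    }
)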
First off, your choice needs to be sortable. In this case, either epoch
timestamps or ISO-8601 will do. What you absolutely cannot do is
use something that’s not sortable, such as a display-friendly format
like "May 26, 1988". This won’t be sortable in DynamoDB, and you’ll
be in a world of hurt.
It can be tough to decipher items in the DynamoDB console if you have a
single-table design. As mentioned in Chapter 9, you should have
some scripts to aid in pulling items that you need for debugging.
There are a few options in this space, but I prefer the KSUID
implementation from the folks at Segment. A KSUID is a K-
Sortable Unique Identifier. Basically, it’s a unique identifier that is
prefixed with a timestamp but also contains enough randomness to
make collisions very unlikely. In total, you get a 27-character string
that is more unique than a UUIDv4 while still retaining
lexicographical sorting.
ksuid -f inspect
REPRESENTATION:
String: 1YnlHOfSSk3DhX4BR6lMAceAo1V
Raw: 0AF14D665D6068ACBE766CF717E210D69C94D115
COMPONENTS:
Time: 2020-03-07T13:02:30.000Z
Timestamp: 183586150
Payload: 5D6068ACBE766CF717E210D69C94D115
The components section of the output shows the time and random payload that were
used to generate the identifier.
Thanks to Rick Branson and the folks at Segment for the implementation and
description of KSUIDs. For more on this, check out Rick’s blog post on the
implementation of KSUIDs.
The sort key in DynamoDB is used for sorting items within a given
item collection. This can be great for a number of purposes,
including viewing the most recently updated items or a leaderboard
of top scores. However, it can be tricky if the value you are sorting
on is frequently changing. Let’s see this with an example.
In this table design, the organization name is the partition key,
which gives us the ‘group by’ functionality. Then the timestamp for
when the ticket was last updated is the sort key, which gives us
‘order by’ functionality. With this design, we could use
DynamoDB’s Query API to fetch the most recent tickets for an
organization.
Instead, let’s try a different approach. For our primary key, let’s use
two attributes that won’t change. We’ll keep the organization name
as the partition key but switch to using TicketId as the sort key.
Notice that this is precisely how our original table design used to
look. We can use the Query API against our secondary index to
satisfy our ‘Fetch most recently updated tickets’ access pattern.
More importantly, we don’t need to worry about complicated
delete + create logic when updating an item. We can rely on
DynamoDB to handle that logic when replicating the data into a
secondary index.
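For example, when a ticket is updated we just update the item in place, and DynamoDB copies the new value into the secondary index for us. The table, key, and attribute names below are assumptions for illustration:

import boto3

client = boto3.client('dynamodb')

client.update_item(
    TableName='TicketsTable',
    Key={
        'OrgName': {'S': 'Vandelay Industries'},
        'TicketId': {'S': '1YRfXS14inXwIJEf9tO5hWnL2pi'}
    },
    UpdateExpression='SET #updated = :updated',
    ExpressionAttributeNames={'#updated': 'UpdatedAt'},
    ExpressionAttributeValues={':updated': {'S': '2020-03-07T14:15:00.000000'}}
)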
Now that we’ve got the basics out of the way, let’s look at some
more advanced strategies.
If you’re working with timestamps, this means starting at the year
1900 and working toward the year 2020.
When you do this, you need to consider the common sort order
you’ll use in this access pattern to know where to place the parent
item.
For example, imagine you have an IoT device that is sending back
occasional sensor readings. One of your common access patterns is
to fetch the Device item and the most recent 10 Reading items for
the device.
Notice that the parent Device item is located before any of the
Reading items because "DEVICE" comes before "READING" in the
alphabet. Because of this, our Query to get the Device and the
Readings would retrieve the oldest items. If our item collection was
big, we might need to make multiple pagination requests to get the
most recent items.
Now we can use the Query API to fetch the Device item and the
most recent Reading items by starting at the end of our item
collection and using the ScanIndexForward=False property.
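A sketch of that Query is below, assuming a partition key of DEVICE#<DeviceId> and a sort key arrangement where the Device item sorts after its Reading items:

import boto3

client = boto3.client('dynamodb')

resp = client.query(
    TableName='IoTTable',
    KeyConditionExpression='#pk = :pk',
    ExpressionAttributeNames={'#pk': 'PK'},
    ExpressionAttributeValues={':pk': {'S': 'DEVICE#123'}},
    ScanIndexForward=False,  # start at the end of the item collection
    Limit=11  # the Device item plus the 10 most recent Readings
)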
We sure can! To do this, you’ll need to have one access pattern in
which you fetch the related items in ascending order and another
access pattern where you fetch the related items in descending
order. It also works if order doesn’t really matter for one of the two
access patterns.
In this table, we have our three types of items. All share the same PK
value of ORG#<OrgName>. The Team items have a SK of
#TEAM#<TeamName>. The Org items have an SK of ORG#<OrgName>.
The User items have an SK of USER#<UserName>.
Notice that the Org item is right between the Team items and the
User items. We had to specifically structure it this way using a #
prefix to put the Team item ahead of the Org item in our item
collection.
Now we could fetch the Org and all the Team items with the
following Query:
result = dynamodb.query(
TableName='SaaSTable',
KeyConditionExpression="#pk = :pk AND #sk <= :sk",
ExpressionAttributeNames={
"#pk": "PK",
"#sk": "SK"
},
ExpressionAttributeValues={
":pk": { "S": "ORG#MCDONALDS" },
":sk": { "S": "ORG#MCDONALDS" }
},
ScanIndexForward=False
)
This goes to our partition and finds all items less than or equal to
the sort key value for our Org item. Then it scans backward to
pick up all the Team items.
You can do the reverse to fetch the Org item and all User items:
look for all items greater than or equal to our Org item sort key,
then read forward to pick up all the User items.
Imagine that each device kept track of which reading number it was on and
sent that number up with the sensor's value.
Now Reading number 2 is placed before Reading number 10, as
expected.
The big factor here is to make sure your padding is big enough to
account for any growth. In our example, we used a fixed length of 5
digits which means we can go to Reading number 99999. If your
needs are bigger than that, make your fixed length longer.
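For example, when writing a Reading you could zero-pad the reading number before building the sort key. The table name and key patterns are assumptions:

import boto3

client = boto3.client('dynamodb')

reading_number = 2

# '00002' sorts before '00010', so lexicographic order matches numeric order.
padded = str(reading_number).zfill(5)

client.put_item(
    TableName='IoTTable',
    Item={
        'PK': {'S': 'DEVICE#123'},
        'SK': {'S': f'READING#{padded}'},
        'ReadingValue': {'N': '57'}
    }
)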
Imagine you have a parent item with two different
one-to-many relationships that you want to query. Further,
imagine that both of those one-to-many relationships use a number
for identification but that you want to fetch both relationships in
the same order (either descending or ascending).
Notice that I changed the SK structure of the Reading items so that
the parent Device item is now at the top of our item collection. Now
we can fetch the Device and the most recent Readings by starting at
Device and reading forward, even though we’re actually getting the
readings in descending order according to their ReadingId.
14.7. Conclusion
Strategy | Notes | Relevant examples
Sorting on changing attributes | Use secondary indexes to avoid delete + create workflow | Leaderboards; most recently updated
Ascending vs. descending | Consider the order in which you'll access items when modeling relations | All relational examples
Two relational access patterns in a single item collection | Save money on secondary indexes by reusing an item collection | GitHub migration
Zero-padding with numbers | If sorting with integers, make sure to zero-pad if numbers are represented in a string | GitHub example

Table 11. Sorting strategies
Chapter 15. Strategies for
Migrations
Chapter Summary
You need to model your DynamoDB table to match your access
patterns. So what do you do when your access patterns change?
This chapter contains strategies on adding new access patterns
and migrating your data.
Sections
1. Adding new attributes to an existing entity
2. Adding a new entity without relations
3. Adding a new entity type into an existing item collection
4. Adding a new entity type into a new item collection
5. Joining existing items into a new item collection
6. Using parallel scans
In the previous chapters, I’ve harped over and over about how you
need to know your access patterns before you model your table in
DynamoDB. And it’s true—DynamoDB is not flexible, but it’s
powerful. You can do almost anything with it, as long as you model
it that way up front.
In the process of writing this book, I’ve had a number of people ask
me about how to handle migrations. It’s just not realistic to assume
that your data model will be frozen in time. It’s going to evolve.
In this chapter, we’ll cover strategies for migrations as your data
model changes. Migrations are intimidating at first, but they’re
entirely manageable. Once you’ve done one or two migrations, they
will seem less scary.
When adding new attributes, you can simply add this in your
application code without doing any large scale ETL process on your
DynamoDB database. Recall from Chapter 9 that interacting with
DynamoDB should be at the very boundary of your application. As
you retrieve data from DynamoDB, you should transform the
returned items into objects in your application. As you transform
them, you can add defaults for new attributes.
For example, look at the following code that may involve retrieving
a User item in your application:
def get_user(username):
resp = client.get_item(
TableName='ApplicationTable',
Key={
'PK': f"USER#{username}"
}
)
item = resp['Item']
return User(
username=item['Username']['S'],
name=item['Name']['S'],
birthdate=item.get('Birthdate', {}).get('S') # Handles missing values
)
Notice that we fetch the User item from our DynamoDB table and
then return a User object for our application. When constructing
the User object, we are careful to handle the absence of a Birthdate
attribute for users that haven’t added it yet.
For more on this strategy, check out the Code of Conduct addition
in the GitHub Migration example in Chapter 22.
There are a few different scenarios you see when adding a new
entity type. The first one I want to discuss is when your new entity
has no relations with existing entity types.
Now, it’s rare that an entity won’t have any relations at all on your
ERD for your application. As Rick Houlihan says, all data is
relational. It doesn’t float around independently out in the ether.
This scenario is more about when you don’t have an access pattern
that needs to model a relation in DynamoDB.
Now imagine we have a new entity type: Projects. A Project belongs
to an Organization, and an Organization can have multiple Projects.
Thus, there is a one-to-many relationship between Organizations
and Projects.
We’ve added our Project items as outlined in red at the bottom.
Notice that all the Project items are in the same item collection as
they share the same partition key, but they aren’t in the same item
collection as any existing items. We didn’t have to make any
changes to existing items to handle this new entity and its access
patterns, so it’s a purely additive change. Like the addition of an
attribute, you just update your application code and move on.
In this scenario, you can try to reuse an existing item collection. This
is great if your parent entity is in an item collection that’s not being
used for an existing relationship.
As an example, think about an application like Facebook before
they introduced the "Like" button. Their DynamoDB table may
have had Post items that represented a particular post that a user
made.
Then someone has the brilliant idea to allow users to "like" posts via
the Like button. When we want to add this feature, we have an
access pattern of "Fetch post and likes for post".
In this situation, the Like entity is a completely new entity type, and
we want to fetch it at the same time as fetching the Post entity. If we
look at our base table, the item collection for the Post entity isn’t
being used for anything. We can add our Like items into that
collection by using the following primary key pattern for Likes:
• PK: POST#<PostId>
• SK: LIKE#<Username>
Outlined in red is an item collection with both a Post and the Likes
for that Post. Like our previous examples, this is a purely additive
change that didn’t require making changes to our existing Post
items. All that was required is that we modeled our data
intentionally to locate the Like items into the existing item
collection for Posts.
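Fetching a Post and its Likes is then a single Query on the base table. The PostId below is a placeholder:

import boto3

client = boto3.client('dynamodb')

resp = client.query(
    TableName='SocialNetwork',
    KeyConditionExpression='#pk = :pk',
    ExpressionAttributeNames={'#pk': 'PK'},
    ExpressionAttributeValues={':pk': {'S': 'POST#11d2dd1d'}}
)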
The easy stuff is over. Now it’s time to look at something harder.
This section is similar to the previous one. We have a new item type
and a relational access pattern with an existing type. However,
there’s a twist—we don’t have an existing item collection where we
can handle this relational access pattern.
Let’s continue to use the example from the last section. We have a
social application with Posts and Likes. Now we want to add
Comments. Users can comment on a Post to give encouragement or
argue about some pedantic political point.
The Post item collection is already being used on the base table. To
handle this access pattern, we’ll need to create a new item collection
in a global secondary index.
• GSI1PK: POST#<PostId>
• GSI1SK: POST#<PostId>
• PK: COMMENT#<CommentId>
• SK: COMMENT#<CommentId>
• GSI1PK: POST#<PostId>
• GSI1SK: COMMENT#<Timestamp>
Notice that we’ve added two Comment items at the bottom
outlined in blue. We've also added two attributes, GSI1PK and GSI1SK,
to our Post items, outlined in red.
Our Post item is in the same item collection as our Comment items,
allowing us to handle our relational use case.
This looks easy in the data model, but I’ve skipped over the hard
part. How do you decorate the existing Post items to add GSI1PK
and GSI1SK attributes?
You will need to run a table scan on your table and update each of
the Post items to add these new attributes. A simplified version of
the code is as follows:
last_evaluated = ''
params = {
    "TableName": "SocialNetwork",
    "FilterExpression": "#type = :type",
    "ExpressionAttributeNames": {
        "#type": "Type"
    },
    "ExpressionAttributeValues": {
        ":type": { "S": "Post" }
    }
}

while True:
    # 1. Scan to find our items
    if last_evaluated:
        params['ExclusiveStartKey'] = last_evaluated
    results = client.scan(**params)

    # 2. Add the GSI1PK & GSI1SK attributes to each Post item found.
    # For Post items, both values are POST#<PostId>, which matches the item's PK.
    for item in results['Items']:
        client.update_item(
            TableName='SocialNetwork',
            Key={ 'PK': item['PK'], 'SK': item['SK'] },
            UpdateExpression="SET GSI1PK = :gsi1pk, GSI1SK = :gsi1sk",
            ExpressionAttributeValues={
                ':gsi1pk': item['PK'],
                ':gsi1sk': item['PK']
            }
        )

    # 3. Continue scanning until all pages have been processed.
    last_evaluated = results.get('LastEvaluatedKey')
    if not last_evaluated:
        break

This script scans for the Post items and updates each one to add the new GSI1PK and GSI1SK
properties to our existing items.
This is the hardest part of a migration, and you’ll want to test your
code thoroughly and monitor the job carefully to ensure all goes
well. However, there’s really not that much going on. A lot of this
can be parameterized, such as the table name, the filter used to find your items, and the update applied to each one.
From there, you just need to take the time for the whole update
operation to run.
So far, all of the examples have involved adding a new item type
into our application. But what if we just have a new access pattern
on existing types? Perhaps we want to filter existing items in a
different way. Or maybe we want to use a different sorting
mechanism. We may even want to join two items that previously
were separate.
The pattern here is similar to the last section. Find the items you
want to update and design an item collection in a new or existing
secondary index. Then, run your script to add the new attributes so
they’ll be added to the secondary index.
I mentioned above that you may want to use parallel scans in your
ETL scripts. This might sound scary—how do I do this in parallel?
Am I manually maintaining state across multiple independent
scanners?
In the ETL examples shown above, you would update your Scan
parameters to have the following:
params = {
"TableName": "SocialNetwork",
"FilterExpression": "#type = :type",
"ExpressionAttributeNames": {
"#type": "Type"
},
"ExpressionAttributeValues": {
":type": "Post"
},
"TotalSegments": 10,
"Segment": 0
}
With the two new parameters at the bottom, I’m indicating that I
want to split my Scan across 10 workers and that this is worker 0
processing the table. I would have nine other workers processing
segments 1 - 9 as well.
This will greatly speed up your ETL processing without adding a lot
of complexity. DynamoDB will handle all of the state management
for you to ensure every item is handled.
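A rough sketch of spinning up those workers in a single Python process is below. You could just as easily run each segment as a separate process or job, and the update logic inside the loop is whatever your migration requires:

from concurrent.futures import ThreadPoolExecutor

import boto3

client = boto3.client('dynamodb')
TOTAL_SEGMENTS = 10

def process_segment(segment):
    params = {
        'TableName': 'SocialNetwork',
        'FilterExpression': '#type = :type',
        'ExpressionAttributeNames': {'#type': 'Type'},
        'ExpressionAttributeValues': {':type': {'S': 'Post'}},
        'TotalSegments': TOTAL_SEGMENTS,
        'Segment': segment
    }
    while True:
        resp = client.scan(**params)
        for item in resp['Items']:
            pass  # update each item here, as shown earlier in this chapter
        if 'LastEvaluatedKey' not in resp:
            break
        params['ExclusiveStartKey'] = resp['LastEvaluatedKey']

with ThreadPoolExecutor(max_workers=TOTAL_SEGMENTS) as executor:
    # list() forces the map to complete and surfaces any worker exceptions.
    list(executor.map(process_segment, range(TOTAL_SEGMENTS)))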
15.7. Conclusion
Chapter 16. Additional
strategies
Chapter Summary
This chapter contains additional strategies that are useful but
don’t fit into one of the previous categories.
Sections
1. Ensuring uniqueness on two or more attributes
2. Handling sequential IDs
3. Pagination
4. Singleton items
5. Reference counts
There are a few other strategies I want to cover that don’t fit neatly
into one of the five previous chapters. Hence, you get an "additional
strategies" chapter. They’re a bit of a grab bag but still useful.
In this chapter, we'll cover the following strategies:
• Ensuring uniqueness on two or more attributes
• Handling sequential IDs
• Pagination
• Singleton items
• Reference counts
In the table above, our PK and SK values both include the username
so that it will be unique. When creating a new user, I’ll use a
condition expression to ensure a user with the same username
doesn’t exist.
But what if you also want to ensure that a given email address is
unique across your system so that you don’t have people signing up
for multiple accounts to the same email address?
You might be tempted to add email into your primary key as well:
Now our table includes the username in the PK and the email in the
SK.
However, this won’t work. It’s the combination of a partition key and
sort key that makes an item unique within the table. Using this key
structure, you’re confirming that an email address will only be used
once for a given username. Now you’ve lost the original uniqueness
properties on the username, as someone else could sign up with the
same username and a different email address!
To handle this, we can create two items in a transaction: one item that tracks
the customer by username and another that tracks the email address, each with a
condition expression asserting the item doesn't already exist.
The code to write such a transaction would be as follows:
response = client.transact_write_items(
TransactItems=[
{
'Put': {
'TableName': 'UsersTable',
'Item': {
'PK': { 'S': 'USER#alexdebrie' },
'SK': { 'S': 'USER#alexdebrie' },
'Username': { 'S': 'alexdebrie' },
'FirstName': { 'S': 'Alex' },
...
},
'ConditionExpression': 'attribute_not_exists(PK)'
}
},
{
'Put': {
'TableName': 'UsersTable',
'Item': {
'PK': { 'S': 'USEREMAIL#alex@debrie.com' },
'SK': { 'S': 'USEREMAIL#alex@debrie.com' },
},
'ConditionExpression': 'attribute_not_exists(PK)'
}
}
]
)
Notice that the item that stores a user by email address doesn’t have
any of the user’s properties on it. You can do this if you will only
access a user by a username and never by an email address. The
email address item is essentially just a marker that tracks whether
the email has been used.
I’d avoid this if possible. Now every update to the user item needs
to be a transaction to update both items. It will increase the cost of
your writes and the latency on your requests.
You can see this pattern in action in the e-commerce example in Chapter 19.
Another need that comes up occasionally is a sequential, auto-incrementing ID.
DynamoDB doesn't provide these natively, but you can emulate them by keeping a
counter attribute on a parent item: first increment the counter, then use the
returned value as the ID for the new item.
The full operation looks as follows:
resp = client.update_item(
TableName='JiraTable',
Key={
'PK': { 'S': 'PROJECT#my-project' },
'SK': { 'S': 'PROJECT#my-project' }
},
UpdateExpression="SET #count = #count + :incr",
ExpressionAttributeNames={
"#count": "IssueCount",
},
ExpressionAttributeValues={
":incr": { "N": "1" }
},
ReturnValues='UPDATED_NEW'
)
current_count = resp['Attributes']['IssueCount']['N']
resp = client.put_item(
TableName='JiraTable',
Item={
'PK': { 'S': 'PROJECT#my-project' },
'SK': { 'S': f"ISSUE#{current_count}" },
'IssueTitle': { 'S': 'Build DynamoDB data model' }
... other attributes ...
}
)
This isn’t the best since you’re making two requests to DynamoDB
in a single access pattern. However, it can be a way to handle auto-
incrementing IDs when you need them.
16.3. Pagination
In a relational database, you may use a combination of OFFSET and
LIMIT to handle pagination. DynamoDB does pagination a little
differently, but it’s pretty straightforward.
A key access pattern is to fetch the most recent orders for a user.
We only return five items per request, which seems very small but
it makes this example much easier to display. On the first request,
we would write a Query as follows:
resp = client.query(
TableName='Ecommerce',
KeyConditionExpression='#pk = :pk AND #sk < :sk',
ExpressionAttributeNames={
'#pk': 'PK',
'#sk': 'SK'
},
ExpressionAttributeValues={
':pk': { 'S': 'USER#alexdebrie' },
':sk': { 'S': 'ORDER$' }
},
ScanIndexForward=False,
Limit=5
)
This request uses our partition key to find the proper partition, and
it looks for all values where the sort key is less than ORDER$. This
will come immediately after our Order items, which all start with
ORDER#. Then, we use ScanIndexForward=False to read items
backward (reverse chronological order) and set a limit of 5 items.
This works for the first page of items but what if a user makes
follow-up requests?
The URL for the next page might look something like this:
https://my-ecommerce-store.com/users/alexdebrie/orders?before=1YRfXS14inXwIJEf9tO5hWnL2pi.
Notice how it includes both the username (in the URL
path) as well as the last seen OrderId (in a query parameter).
resp = client.query(
TableName='Ecommerce',
KeyConditionExpression='#pk = :pk AND #sk < :sk',
ExpressionAttributeNames={
'#pk': 'PK',
'#sk': 'SK'
},
ExpressionAttributeValues={
':pk': { 'S': 'USER#alexdebrie' },
':sk': { 'S': 'ORDER#1YRfXS14inXwIJEf9tO5hWnL2pi' }
},
ScanIndexForward=False,
Limit=5
)
Notice how the SK value we’re comparing to is now the Order item
that we last saw. This will get us the next item in the item collection,
returning the following result:
By building these hints into your URL, you can discover where to
start for your next page through your item collection.
Most items in your table have a primary key that is specific to that particular
item. If you're creating a User item, the pattern might be:
• PK: USER#<Username>
• SK: USER#<Username>
If you’re creating an Order item for the user, the pattern might be:
• PK: USER#<Username>
• SK: ORDER#<OrderId>
Notice that each item is customized for the particular user or order
by virtue of the username and/or order id.
To track this, you could create a singleton item that is responsible for
tracking all jobs in progress across the application. Your table might
look as follows:
We have three Job items outlined in red. Further, there is a
singleton item with a PK and SK of JOBS that tracks the existing jobs
in progress via a JobsInProgress attribute of type string set. When
we want to start a new job, we would do a transaction with two write
operations: one to add the new job to the JobsInProgress set on the singleton
item, and one to create the Job item itself. A sketch of this transaction is
shown below.
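This is a sketch only: the table name, key structure, and Job attributes are assumptions, and you could add a condition expression on the singleton item if you need to cap the number of concurrent jobs.

import boto3

client = boto3.client('dynamodb')

job_id = 'JOB#2020-03-07-001'  # hypothetical job identifier

client.transact_write_items(
    TransactItems=[
        {
            # 1. Add the new job to the JobsInProgress set on the singleton item.
            'Update': {
                'TableName': 'JobsTable',
                'Key': {
                    'PK': {'S': 'JOBS'},
                    'SK': {'S': 'JOBS'}
                },
                'UpdateExpression': 'ADD JobsInProgress :job',
                'ExpressionAttributeValues': {':job': {'SS': [job_id]}}
            }
        },
        {
            # 2. Create the Job item itself.
            'Put': {
                'TableName': 'JobsTable',
                'Item': {
                    'PK': {'S': job_id},
                    'SK': {'S': job_id},
                    'Status': {'S': 'IN_PROGRESS'}
                },
                'ConditionExpression': 'attribute_not_exists(PK)'
            }
        }
    ]
)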
Another example for using singleton items is in the Big Time Deals
example in Chapter 20 where we use singleton items for the front
page of the application.
The final strategy to cover is reference counts, which come up in a number of access
patterns in your application.
Often you’ll have a parent item in your application that has a large
number of related items through a relationship. It could be the
number of retweets that a tweet receives or the number of stars for
a GitHub repo.
As you show that parent item in the UI, you may want to show a
reference count of the number of related items for the parent. We
want to show the total number of retweets and likes for the tweet
and the total number of stars for the GitHub repo.
We could query and count the related items each time, but that
would be highly inefficient. A parent could have thousands or more
related items, and this would burn a lot of read capacity units just to
receive an integer indicating the total.
Instead, we can keep a count of related items on the parent item and maintain
it as related items are added. Whenever we insert a related item, we'll use a
DynamoDB transaction to do two things:
1. Ensure the related item doesn't already exist (e.g. this particular
user hasn’t already starred this repo);
2. Increase the reference count on the parent item.
result = dynamodb.transact_write_items(
TransactItems=[
{
"Put": {
"Item": {
"PK": { "S": "REPO#alexdebrie#dynamodb-book" },
"SK": { "S": "STAR#danny-developer" }
...rest of attributes ...
},
"TableName": "GitHubModel",
"ConditionExpression": "attribute_not_exists(PK)"
}
},
{
"Update": {
"Key": {
"PK": { "S": "REPO#alexdebrie#dynamodb-book" },
"SK": { "S": "#REPO#alexdebrie#dynamodb-book" }
},
"TableName": "GitHubModel",
"ConditionExpression": "attribute_exists(PK)"
"UpdateExpression": "SET #count = #count + :incr",
"ExpressionAttributeNames": {
"#count": "StarCount"
},
"ExpressionAttributeValues": {
":incr": { "N": "1" }
}
}
}
]
)
16.6. Conclusion
Chapter 17. Data modeling
examples
In the remaining chapters, we’ll do deep dives into actual data
modeling examples. I think these are the most important chapters
in the book, and there are two things I want you to get from these
chapters.
The first takeaway is about process. Take the steps from Chapter 7
and apply them to an actual problem.
Do your best to try to build the data model with me. Your approach
might not look exactly like mine, but it’s worth going through the
exercise.
The second thing I want you to get from these examples is the
various strategies applied in the chapters. It is unlikely that any
example you find in this book or anywhere else will exactly match
the needs of your application. However, you can take the strategies
applied in the following chapters and apply them to sections of
your data model. Collect these strategies and learn how to apply
them in the proper situations.
17.1. Notes on the data modeling
examples
Below are a few notes on how I’ve organized the examples to help
orient you.
As you read through these examples, get used to listing out the access patterns
you'll have in your application. Try to understand the needs of your
users and meld them with the possibilities of DynamoDB.
As I list the access patterns out, I don’t always include every potential
access pattern. If there are basic Read / Write / Update access
patterns on individual items, I won’t list them out unless there’s
something interesting.
If you’ve worked with ERDs before, you may see ERDs that include
all attributes on each entity. I don’t do that in my ERDs for a few
reasons. First, it adds a lot of clutter to the ERDs. Some of these
diagrams are large, complicated models with more than ten entities
and over fifteen relationships. To add attributes for each of those
would be more trouble than it’s worth.
Likewise, when showing examples of items in DynamoDB, I don’t
include all the attributes on an item. Recall in Chapter 9 the
difference between application attributes and indexing attributes.
Application attributes are useful in your application, whereas
indexing attributes are added solely for data access in DynamoDB.
17.2. Conclusion
The data modeling examples are a lot of fun. You finally get to put
all your learning to work. Try your best, and refer back to the
strategies chapters where necessary to understand the examples
more deeply.
Chapter 18. Building a Session
Store
It’s time for our first data modeling example. We’ll start with
something simple and get progressively more complex in
subsequent examples.
18.1. Introduction
Imagine you are building the next hot social network application.
As part of your application, you need user accounts with
authentication and authorization. To handle this, you’ll use a
session token.
The basic flow is as follows. First, there is the token creation flow, as
shown below.
Users will login to your application by providing a username and
password. Your backend service will verify the username and
password. If the username and password are correct, the backend
will create a session token, store it in a database, and return it to the
user.
Your backend will check the session token in the database to verify
that the token exists and validate the user making the request.
You don’t want these tokens to live forever. The longer a token
lives, the higher chance that a malicious actor will get it. As such,
your application will invalidate a token in two instances: first, when a
certain amount of time has passed since the token was created; and second,
when a user manually revokes their existing tokens.
With these requirements in mind, let’s build our ERD and list our
access patterns.
18.2. ERD and Access Patterns
This is a pretty simple ERD. The core of the data model is going to
be the Session. A Session will be primarily identified and accessed
by its SessionToken. Additionally, a Session belongs to a User, and a
User can have multiple Sessions, such as if the User is using the
application on multiple devices.
Entity PK SK
Session
User
Table 14. Session store entity chart
Now that we have our ERD and entity chart, let’s think through our
access patterns. I came up with four:
• Create Session
• Get Session
• Delete Session (time-based)
• Delete Sessions for User (manual revocation)
With our ERD and access patterns in hand, it’s time to start
modeling!
In all but the most simple data models, you’ll use a composite
primary key. A composite primary key will allow you to group
items into an item collection and retrieve multiple items in a single
request with the Query operation. That said, our application is
pretty simple here, so we should keep the simple primary key as an
option.
We also need to ensure that each session token is unique within our
table. It would be a disaster if we gave the same token to multiple
users, as each one could impersonate the other.
In our entity chart and ERD, we have two entities listed: Users and
Sessions. However, note that User is not really present in our access
patterns. We have a need to delete all tokens for a user, but we’re
not storing and retrieving anything about the user itself. Rather, the
user is just a grouping mechanism for the tokens.
With that in mind, let’s start working through our interesting access
patterns.
Given all that, we'll use a simple primary key with the SessionToken as the
partition key, since each token is a unique value.
One thing to note is that we don’t have a generic name like PK for
our partition key. Because we’re only storing a single type of entity
in this table, we can use a meaningful name like SessionToken for
the partition key. Even when we get around to our User-based
access pattern, we won’t be storing information about the User—
that will be in a different part of your application. This data model
is solely for authentication.
When inserting items into the table, you’ll want to use a condition
expression that ensures there is not an existing item with the same
token.
created_at = datetime.datetime.now()
expires_at = created_at + datetime.timedelta(days=7)
result = dynamodb.put_item(
TableName='SessionStore',
Item={
"SessionToken": { "S": str(uuid.uuidv4()) },
"Username": { "S": "bountyhunter1" },
"CreatedAt": { "S": created_at.isoformat() },
"ExpiresAt": { "S": expires_at.isoformat() }
},
ConditionExpression: "attribute_not_exists(SessionToken)"
)
To handle the time-based deletion of expired sessions, we could run a scheduled
job that scans for and deletes expired tokens, but that's extra infrastructure
we'd prefer not to use. Instead, we can enable DynamoDB's Time To Live (TTL)
feature on the table, pointing it at a TTL attribute on our Session items, and
DynamoDB will remove expired sessions for us.
There is one more note around TTLs. DynamoDB states that items
with TTLs are generally removed within 48 hours of their given
expiration time. While I’ve found TTL deletion to be much faster
than that in practice, it does raise the possibility that you will allow
an expired token to work after it has expired.
epoch_seconds = int(time.time())
result = dynamodb.query(
TableName='SessionStore',
KeyConditionExpression="#token = :token",
FilterExpression="#ttl <= :epoch",
ExpressionAttributeNames={
"#token": "SessionToken",
"#ttl": "TTL"
},
ExpressionAttributeValues={
":token": { "S": "0bc6bdf8-6dac-4212-b11a-81f784297c78" },
":epoch": { "N": str(epoch_seconds) }
}
)
This removes any sessions that have expired but haven't yet been deleted by
DynamoDB, guarding against the deletion delay by using filter expressions.
We have a few issues to consider here. The first is that it’s possible
there are multiple tokens for a single user, such as if they’ve logged
in on both a cell phone and a laptop. However, there’s no "DELETE
WHERE" syntax for DynamoDB. When deleting items, you need to
provide the entire primary key of the items you want to delete.
Notice that the two tokens for the user alexdebrie are in the same
partition, and the only attribute that was copied over was the
SessionToken.
Now if a user wants to delete their tokens, we can use code similar
to the following:
results = dynamodb.query(
TableName='SessionStore',
IndexName='UserIndex',
KeyConditionExpression="#username = :username",
ExpressionAttributeNames={
"#username": "Username"
},
ExpressionAttributeValues={
":username": { "S": "alexdebrie" }
}
)
This script runs a Query operation against our secondary index. For
each session it finds for the requested user, it runs a DeleteItem
operation to remove it from the table. This handles deletion of all
tokens for a single user.
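A sketch of the deletion step could look like this, reusing the query results from above (for a large number of sessions you would also page through the query results):

for item in results['Items']:
    dynamodb.delete_item(
        TableName='SessionStore',
        Key={
            'SessionToken': item['SessionToken']
        }
    )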
18.4. Conclusion
Table Structure
The table also has a TTL property set on the attribute named TTL.
For our primary keys, we were able to use meaningful names like
SessionToken and Username rather than generic ones like PK and
SK.
Access Patterns
Access Pattern | Index | Parameters | Notes
Create Session | Main table | SessionToken | Add condition expression to ensure no duplicates.
Get Session | Main table | SessionToken | Add filter expression to ensure no expired tokens.
Delete Session (time-based) | N/A | N/A | Expiration handled by DynamoDB TTL
Delete Session (manual) | UserIndex, then Main table | Username, then SessionToken | 1. Find tokens for user. 2. Delete each token returned from step 1.

Table 15. Session store access patterns
Chapter 19. Building an e-
commerce application
With our first example out of the way, let’s work on something a
little more complex. In this example, we’re going to model the
ordering system for an e-commerce application. We’ll still have the
training wheels on, but we’ll start to look at important DynamoDB
concepts like primary key overloading and handling relationships
between entities.
19.1. Introduction
Notice that this page includes information about the user, such as
the user’s email address, as well as a paginated list of the most
recent orders from the user. The information about a particular
order is limited on this screen. It includes the order date, order
number, order status, number of items, and total cost, but it doesn’t
include information about the individual items themselves. Also
notice that there’s something mutable about orders—the order
status—meaning that we’ll need to change things about the order
over time.
The Order Detail page shows all information about the order,
including the summary information we already saw but also
additional information about each of the items in the order and the
payment and shipping information for the order.
A customer can save multiple addresses for use later on. Each
address has a name ("Home", "Parents' House") to identify it.
Let’s add a final requirement straight from our Product Manager.
While a customer is identified by their username, a customer is also
required to give an email address when signing up for an account.
We want both of these to be unique. There cannot be two customers
with the same username, and you cannot sign up for two accounts
with the same email address.
With these needs in mind, let’s build our ERD and list out our
access patterns.
In our ERD, there are Customers, as identified in the top lefthand corner. A
Customer may have multiple Addresses, so there is a one-to-many
relationship between Customers and Addresses as shown on the
lefthand side.
Moving across the top, a Customer may place multiple Orders over
time (indeed, our financial success depends on it!), so there is a one-
to-many relationship between Customers and Orders. Finally, an
Order can contain multiple OrderItems, as a customer may
purchase multiple books in a single order. Thus, there is a one-to-
many relationship between Orders and OrderItems.
Now that we have our ERD, let’s create our entity chart and list our
access patterns.
Entity PK SK
Customers
Addresses
Orders
OrderItems
Table 16. E-commerce entity chart
With these access patterns in mind, let’s start our data modeling.
19.3. Data modeling walkthrough
In this example, we’re beyond a simple model that just has one or
two entities. This points toward using a composite primary key.
Further, we have a few 'fetch many' access patterns, which strongly
points toward a composite primary key. We’ll go with that to start.
With that in mind, let’s model out our Customer items.
Customer:
• PK: CUSTOMER#<Username>
• SK: CUSTOMER#<Username>
CustomerEmail:
• PK: CUSTOMEREMAIL#<Email>
• SK: CUSTOMEREMAIL#<Email>
We can load our table with some items that look like the following:
So far, our service has two customers: Alex DeBrie and Vito
Corleone. For each customer, there are two items in DynamoDB.
One item tracks the customer by username and includes all
information about the customer. The other item tracks the
customer by email address and includes just a few attributes to
identify to whom the email belongs.
While this table shows the CustomerEmail items, I will hide them
when showing the table in subsequent views. They’re not critical to
the rest of the table design, so hiding them will de-clutter the table.
Entity PK SK
Customers CUSTOMER#<Username> CUSTOMER#<Username>
Addresses
Orders
OrderItems
Table 17. E-commerce entity chart
Finally, when creating a new customer, we’ll want to only create the
customer if there is not an existing customer with the same
username and if this email address has not been used for another
customer. We can handle that using a DynamoDB Transaction.
The code below shows the code to create a customer with proper
validation:
response = client.transact_write_items(
TransactItems=[
{
'Put': {
'TableName': 'EcommerceTable',
'Item': {
'PK': { 'S': 'CUSTOMER#alexdebrie' },
'SK': { 'S': 'CUSTOMER#alexdebrie' },
'Username': { 'S': 'alexdebrie' },
'Name': { 'S': 'Alex DeBrie' },
... other attributes ...
},
'ConditionExpression': 'attribute_not_exists(PK)'
}
},
{
'Put': {
'TableName': 'EcommerceTable',
'Item': {
'PK': { 'S': 'CUSTOMEREMAIL#alexdebrie1@gmail.com' },
'SK': { 'S': 'CUSTOMEREMAIL#alexdebrie1@gmail.com' },
},
'ConditionExpression': 'attribute_not_exists(PK)'
}
}
]
)
Now that we’ve handled our Customer item, let’s move on to one of
our relationships. I’ll go with Addresses next.
In this case, the answer to the first question is 'No'. We will show
customers their saved addresses, but it’s always in the context of the
customer’s account, whether on the Addresses page or the Order
Checkout page. We don’t have an access pattern like "Fetch
Customer by Address".
The answer to the second question is (or can be) 'No' as well. While
we may not have considered this limitation upfront, it won’t be a
burden on our customers to limit them to only 20 addresses. Notice
that data modeling can be a bit of a dance. You may not have
thought to limit the number of saved addresses during the initial
requirements design, but it’s easy to add on to make the data
modeling easier.
Notice that our Customer items from before have an Addresses
attribute outlined in red. The Addresses attribute is of the map
type and includes one or more named addresses for the customer.
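An update for the "Create / Update Address" access pattern might look like the sketch below. The address fields are placeholders, and this assumes the Addresses map is initialized (even if empty) when the Customer item is created:

import boto3

client = boto3.client('dynamodb')

client.update_item(
    TableName='EcommerceTable',
    Key={
        'PK': {'S': 'CUSTOMER#alexdebrie'},
        'SK': {'S': 'CUSTOMER#alexdebrie'}
    },
    UpdateExpression='SET #addresses.#name = :address',
    ExpressionAttributeNames={
        '#addresses': 'Addresses',
        '#name': 'Home'
    },
    ExpressionAttributeValues={
        ':address': {'M': {
            'StreetAddress': {'S': '123 1st Street'},
            'City': {'S': 'Omaha'},
            'State': {'S': 'NE'}
        }}
    }
)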
Entity PK SK
Customers CUSTOMER#<Username> CUSTOMER#<Username>
Next, let's handle the Orders. Recall that we can only
denormalize and store related items as a complex attribute if there is a limit to
the number of related items. However, we don’t want to limit the
number of orders that a customer can make with us—we would be
leaving money on the table! Because of that, we’ll have to find a
different strategy.
The Query API can only fetch items with the same partition key, so
we need to make sure our Order items have the same partition key
as the Customer items. Further, we want to retrieve our Orders by
the time they were placed, starting with the most recent.
• PK: CUSTOMER#<Username>
• SK: #ORDER#<OrderId>
For the OrderId, we'll use a KSUID. KSUIDs are unique identifiers
that include a timestamp in the beginning. This allows for
chronological ordering as well as uniqueness. You can read more
about KSUIDs in Chapter 14.
We can add a few Order items to our table to get the following:
We’ve added three Order items to our table, two for Alex DeBrie
and one for Vito Corleone. Notice that the Orders have the same
partition key and are thus in the same item collection as the
Customer. This means we can fetch both the Customer and the
most recent Orders in a single request.
To handle our pattern to retrieve the Customer and the most recent
Orders, we can write the following Query:
resp = client.query(
TableName='EcommerceTable',
KeyConditionExpression='#pk = :pk',
ExpressionAttributeNames={
'#pk': 'PK'
},
ExpressionAttributeValues={
':pk': { 'S': 'CUSTOMER#alexdebrie' }
},
ScanIndexForward=False,
Limit=11
)
We use a key expression that uses the proper PK to find the item
collection we want. Then we set ScanIndexForward=False so that it
will start at the end of our item collection and go in descending
order, which will return the Customer and the most recent Orders.
Finally, we set a limit of 11 so that we get the Customer item plus
the ten most recent orders.
Entity PK SK
Customers CUSTOMER#<Username> CUSTOMER#<Username>
OrderItems
Table 19. E-commerce entity chart
Like in the last pattern, we can’t denormalize these onto the Order
item as the number of items in an order is unbounded. We don’t
want to limit the number of items a customer can include in an
order.
Further, we can’t use the same strategy of a primary key plus Query
API to handle this one-to-many relationship. If we did that, our
OrderItems would be placed between Orders in the base table’s
item collections. This would significantly reduce the efficiency of
the "Fetch Customer and Most Recent Orders" access pattern we
handled in the last section as we would now be pulling back a ton of
extraneous OrderItems with our request.
That said, the principles we used in the last section are still valid.
We’ll just handle it in a secondary index.
First, let’s create the OrderItem entity in our base table. We’ll use
the following pattern for OrderItems:
• PK: ORDER#<OrderId>#ITEM#<ItemId>
• SK: ORDER#<OrderId>#ITEM#<ItemId>
We have added two OrderItems into our table. They are outlined in
red at the bottom. Notice that the OrderItems have the same
OrderId as an Order in our table but that the Order and OrderItems
are in different item collections.
To get them in the same item collection, we’ll add some additional
properties to both Order and OrderItems.
The GSI1 structure for Orders will be as follows:
• GSI1PK: ORDER#<OrderId>
• GSI1SK: ORDER#<OrderId>
• GSI1PK: ORDER#<OrderId>
• GSI1SK: ITEM#<ItemId>
Notice that our Order and OrderItems items have been decorated
with the GSI1PK and GSI1SK attributes.
Now our Orders and OrderItems have been re-arranged so they are
in the same item collection. As such, we can fetch an Order and all
of its OrderItems by using the Query API against our secondary
index.
resp = client.query(
TableName='EcommerceTable',
IndexName='GSI1',
KeyConditionExpression='#gsi1pk = :gsi1pk',
ExpressionAttributeNames={
'#gsi1pk': 'GSI1PK'
},
ExpressionAttributeValues={
':gsi1pk': { 'S': 'ORDER#1VrgXBQ0VCshuQUnh1HrDIHQNwY' }
}
)
Entity PK SK
Customers CUSTOMER#<Username> CUSTOMER#<Username>
Addresses N/A N/A
Orders CUSTOMER#<Username> #ORDER#<OrderId>
19.4. Conclusion
Table Structure
Our final entity chart for the main table is as follows:
Entity PK SK
Customers CUSTOMER#<Username> CUSTOMER#<Username>
And the final entity chart for the GSI1 index is as follows:
Notice a few divergences from our ERD to our entity chart. First,
we needed to add a special item type, 'CustomerEmails', that is
used solely for tracking the uniqueness of email addresses provided
by customers. Second, we don’t have a separate item for Addresses
as we denormalized it onto the Customer item.
After you make these entity charts, you should include them in the
documentation for your repository to assist others in knowing how
the table is configured. You don’t want to make them dig through
your data access layer to figure this stuff out.
Access Patterns
Access Pattern | Index | Parameters | Notes
Create Customer | N/A | N/A | Use TransactWriteItems to create Customer and CustomerEmail item with conditions to ensure uniqueness on each
Create / Update Address | N/A | N/A | Use UpdateItem to update the Addresses attribute on the Customer item
View Customer & Most Recent Orders | Main table | Username | Use ScanIndexForward=False to fetch in descending order.
Save Order | N/A | N/A | Use TransactWriteItems to create Order and OrderItems in one request
Update Order | N/A | N/A | Use UpdateItem to update the status of an Order
Just like your entity charts, this chart with your access pattern
should be included in the documentation for your repository so
that it’s easier to understand what’s happening in your application.
Chapter 20. Building Big Time
Deals
The first two examples were warmups with a small number of
access patterns. We got our feet wet, but they’re likely simpler than
most real-world applications.
In this example, we’re going to create Big Time Deals, a Deals app
that is similar to slickdeals.net or a variety of other similar sites.
This application will have a number of relations between entities,
and we’ll get to see how to model out some complex patterns.
One thing I really like about this example is that it’s built for two
kinds of users. First, there are the external users—people like you
and me that want to visit the site to find the latest deals.
But we also have to consider the needs of the editors of the site.
They are interacting with the site via an internal content
management system (CMS) to do things like add deals to the site,
set featured deals, and send notifications to users. It’s fun to think of
this problem from both sides to make a pattern that works for both
parties.
20.1. Introduction
To begin, let’s walk through our site and get a feel for the entities
and access patterns we need to support.
First, Big Time Deals is solely accessible via a mobile app. We want
to provide a fast, native way for users to view deals.
There are three main things to point out here. First, on the top row,
we have Featured Deals. These are 5-10 curated deals that have
been selected by the Big Time Deals editors to be interesting to our
users.
exactly one category, and you can view all deals in a particular
category by clicking the category icon.
The bottom menu for the application has four different pages.
We’ve discussed the Front Page, so let’s go to the Brands page.
The Brands page is very simple. It includes an alphabetical list of all
brands and an icon for each brand. There aren’t that many brands
in our system—only about 200 currently—and we expect to add
maybe 1-2 brands per month.
If you click on a brand from the Brands page, you will be directed
to the page for that brand. The Brand page looks like the following:
The Brand page includes the name of the brand and the icon. The
Brand page also shows the number of likes and watchers for the
given brand. A user may 'like' the brand to show support for it, and
a user may 'watch' the brand to receive notifications whenever
there is a new deal for the brand.
Finally, the Brand page shows the latest deals for the given brand.
In addition to the Brand page, there is also a Category page for each
of the eight categories. The Category page for the Tech category
looks as follows:
The Category page is very similar to the Brand page with likes,
watchers, and latest deals.
The Category page also includes featured deals, which are deals
within the category that are curated by our editors. We didn’t
include featured deals on each Brand page because it’s too much
work for our editors to handle featured deal curation across 200+
brands, but it’s not as big a task to handle it for the eight categories.
If you click on a specific deal, you’ll get routed to the following Deal
page:
The deal page is pretty simple. It includes a picture of the deal and
details about the deal. It also includes the price of the deal. There is
a button to get this deal (which includes an affiliate link to the
external retailer) as well as buttons to see additional deals from the
same brand or in the same category.
The next page is the Editor’s Choice page, which looks as follows:
The Editor’s Choice page is another curated portion of the site. This
includes more unique or eclectic deals that might be missed in
other areas of the site. Like the Featured sections, the Editor’s
Choice page is entirely managed by our editors.
Finally, there is a Settings page:
2. The Big Time Deals team has blasted all users with news of a
particular deal
Now that we know the entities that we will be modeling, take the
time to create your ERD and draw up the access patterns in your
application.
As always, we’ll start with the ERD. Here is the ERD I created for
our Deals application:
Let’s start on the left-hand side. There are Users who will create
accounts in our application. These Users will have multiple
Messages in their inbox, but a specific Message will belong to a
particular User. Thus, there is a one-to-many relationship between
Users and Messages.
Additionally, some Deals are Featured Deals. This means they have
been selected by editors to be featured somewhere on the site. This
is an odd relationship to model, but I’ve shown it as a one-to-one
relationship between Deals and Featured Deals.
Now that we have our ERD, let’s create our entity chart and list our
access patterns.
Entity PK SK
Deal
Entity PK SK
Brand
Category
FeaturedDeal
Page
User
Message
Table 25. Big Time Deals entity chart
• Create Deal
• Create Brand
• Create Category
Fetch page:
• Fetch Deal
• Create User
• Like Brand for User
• Watch Brand for User
• Like Category for User
• Watch Category for User
Messages:
This has twenty-three access patterns that we’re going to model out.
Let’s do this!
As usual, we’re going to start our data modeling by asking our three
questions:
If you can model twenty-three access patterns with a simple
primary key, my hat’s off to you. We’re going to go with a
composite primary key in this example.
We do have the Deals entity which is the related entity in a number
of one-to-many relationships, but its only relationship where it’s
the parent is in the one-to-one relationship with Featured Deals.
However, because it’s pretty important in our application and
because it needs to be seen in a number of different contexts, let’s
start there.
Let’s handle the basic write & read operations first. We’ll use the
following primary key pattern for our Deal entity:
• PK: DEAL#<DealId>
• SK: DEAL#<DealId>
The DealId will be a unique identifier generated for each Deal item.
For generating our DealId attribute, we’ll use a KSUID. A KSUID is
a K-Sortable Unique IDentifier. It is generated by an algorithm that
combines a current timestamp and some randomness. A KSUID is a
27-character string that provides the uniqueness properties of a
UUIDv4 while also retaining chronological ordering when using
lexicographical sorting.
For more on KSUIDs, check out Chapter 14.
With this pattern in mind, our table with a few Deal items looks as
follows:
The pattern we’ll use for all three will be similar, but we’ll focus on
the first one to start since we haven’t modeled out the Brand and
Category items yet.
Let’s talk about data size first. When modeling out a "fetch most
recent" pattern, the common approach is to put the relevant items
in a single item collection by sharing the partition key, then using
the sort key to sort the items by a timestamp or a KSUID that is
sortable by timestamp. This is something we saw in the previous
chapter with Users & Orders.
To handle this, we’ll add two attributes to our Deal items. Those
attributes will be as follows:
• GSI1PK: DEALS#<TruncatedTimestamp>
• GSI1SK: DEAL#<DealId>
For example, if your timestamp was 2020-02-14 12:52:14 and you
truncated to the day, your truncated timestamp would be 2020-02-
14 00:00:00. If you truncated to the month, your truncated
timestamp would be 2020-02-01 00:00:00.
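As a rough sketch, here is one way you might compute that truncated timestamp in Python. The function name and the choice of output format are illustrative, not something prescribed by the table design itself:

import datetime

def truncate_timestamp(ts, granularity='day'):
    # Truncate a datetime down to the start of its day (or month) before
    # formatting it into the GSI1PK value.
    if granularity == 'month':
        ts = ts.replace(day=1)
    ts = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return ts.strftime('%Y-%m-%d %H:%M:%S')

truncate_timestamp(datetime.datetime(2020, 2, 14, 12, 52, 14))  # '2020-02-14 00:00:00'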
In our example, imagine that we’ll have roughly 100 new deals per
day. We might choose to truncate down to the day as a partition
with 100 items should be manageable. Given that, our Deals items
will now look as follows:
Notice the two attributes added that are outlined in red. These will
be used for our global secondary index.
Notice that deals are now grouped together by the day they were
created.
Note that this could run into a pagination problem. Imagine our
Fetch Latest Deals endpoint promises to return 25 deals in each
request. If the user is making this request at the beginning of a day
where there aren’t many deals or if they’ve paginated near the end
of a day’s deals, there might not be 25 items within the given
partition.
We can handle that in our application logic. First, you would have a
fetch_items_for_date function that looks as follows:
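A minimal sketch of what that function might look like is below, assuming the boto3 client used elsewhere in this chapter. The exact GSI1PK timestamp format and the '$' sentinel (meaning "no deal seen yet") are illustrative assumptions that mirror the signature used in the logic that follows:

import datetime

def fetch_items_for_date(date, last_deal_seen='$'):
    # Query a single day's partition in GSI1, newest deals first.
    key_condition = '#gsi1pk = :gsi1pk'
    names = { '#gsi1pk': 'GSI1PK' }
    values = { ':gsi1pk': { 'S': f"DEALS#{date.strftime('%Y-%m-%d 00:00:00')}" } }
    if last_deal_seen != '$':
        # Page past the last deal the client has already seen.
        key_condition += ' AND #gsi1sk < :last_seen'
        names['#gsi1sk'] = 'GSI1SK'
        values[':last_seen'] = { 'S': f"DEAL#{last_deal_seen}" }
    resp = client.query(
        TableName='BigTimeDeals',
        IndexName='GSI1',
        KeyConditionExpression=key_condition,
        ExpressionAttributeNames=names,
        ExpressionAttributeValues=values,
        ScanIndexForward=False,
        Limit=25
    )
    return resp['Items']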
Then, when wanting to fetch some items, you could use the
following logic:
def get_deals(date, last_deal_seen='$', count=0):
    # Pull this day's deals, then walk back a day if we still need more (cap this in practice).
    deals = fetch_items_for_date(date, last_deal_seen)
    if count + len(deals) < 25:
        deals += get_deals(date - datetime.timedelta(days=1), count=count + len(deals))
    return deals[:25]
Now that we’ve covered the data size issue for Deals, let’s talk about
data speed. This is the issue I’m actually more worried about.
Because the most recent deals will be viewed by everyone that hits
our front page, we’ll be reading from that partition quite a bit. We
will very likely have a hot key issue where one partition in our
database is read significantly more frequently than others.
The good thing is that this access pattern is knowable in advance
and thus easily cachable. Whenever an editor in our internal CMS
adds a new deal, we can run an operation to fetch the most recent
two days worth of deals and cache them.
There are two ways we could do this. If we wanted to use some sort
of external cache like Redis, we could store our latest deals there.
That architecture might look as follows:
In the top left of the diagram, an internal editor adds a new Deal to
our DynamoDB table. This triggers an AWS Lambda function that
is processing the DynamoDB Stream for our table. Upon seeing a
new Deal has been added, it queries our table to find all Deals for
the past two days. Then it stores those Deals in Redis. Later, when
application users ask to view deals, the most recent two days will be
read from Redis.
If you didn’t want to add an additional system like Redis to the mix,
you could even cache this in DynamoDB. After retrieving the last
two days worth of deals, you could copy all that data across
multiple items that are serving as little caches.
To create these little caches, we’ll add some items to our tables that
look as follows:
After retrieving all deals for the last two days, we store them in the
Deals attribute on our cache items. Then we make N items in
DynamoDB, where N is the number of copies of this data that we
want. The PK and SK for this item will be
DEALSCACHE#<CacheNumber>.
Again, each of these cached items is exactly the same. It just helps
us to spread the reads across multiple partitions rather than hitting
the same item with a ton of traffic.
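Here is a rough sketch of the write side, assuming deals holds the list of Deal items (already in DynamoDB attribute-value format) retrieved for the last two days; the shard count of ten is just an example value:

CACHE_SHARDS = 10  # example value: the number of identical cache copies to write

for shard in range(1, CACHE_SHARDS + 1):
    # Each cache item is identical; only the shard number in the key changes.
    client.put_item(
        TableName='BigTimeDeals',
        Item={
            'PK': { 'S': f"DEALSCACHE#{shard}" },
            'SK': { 'S': f"DEALSCACHE#{shard}" },
            'Deals': { 'L': deals }
        }
    )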
When fetching those deals, your code could look like this:
import random

def get_latest_deals():
    # Pick one of the cache shards at random to spread reads across partitions.
    # The shard count here must match however many cache copies were written.
    shard = random.randint(1, 10)
    resp = client.get_item(
        TableName='BigTimeDeals',
        Key={
            'PK': { 'S': f"DEALSCACHE#{shard}" },
            'SK': { 'S': f"DEALSCACHE#{shard}" },
        }
    )
    return resp['Item']['Deals']['L']
Now when we get a bunch of users asking for the latest deals, our
flow looks like this:
The great thing about this pattern is that it’s easy to add on later if
we need it, and we can also increase the number of shards as
needed. To increase the shards, just update our write code to write
more items, then update the random integer generator in our read
path.
Let’s update our entity chart with our Deal item. Note that I’m not
going to include the DealsCache item as this is an optional addition
you could do.
Entity PK SK
Deal DEAL#<DealId> DEAL#<DealId>
Brand
Category
FeaturedDeal
Page
User
Message
Table 26. Big Time Deals entity chart
Entity GSI1PK GSI1SK
Deal DEALS#<TruncatedTimestamp> DEAL#<DealId>
Brand
Category
FeaturedDeal
Page
User
Message
Table 27. Big Time Deals GSI1 entity chart
Now that we’ve handled fetching the most recent deals overall, let’s
work on the similar pattern of fetching the most recent deals for a
particular brand.
Notice that our access pattern is "Fetch Brand and Latest Deals for
Brand". This implies a 'pre-join' access pattern where we place the
parent Brand item in the same item collection as the related Deal
items.
We have two options: pre-join the Brand and its Deals into a single
item collection, or keep the Brand item separate and make two requests.
I’m going to go with the latter for two reasons. First, I’m still
concerned about that partition getting too big, so I’d prefer to break
it up. Second, we’ll be able to make these two calls (one to fetch the
Brand item and one to fetch the latest Deals) in parallel, so we don’t
have sequential, waterfall requests that really slow down our access
patterns.
Given that, let’s add the following attributes to our Deal items:
• GSI2PK: BRAND#<Brand>#<TruncatedTimestamp>
• GSI2SK: DEAL#<DealId>
With this pattern, we can use the GSI2 index to handle the "Fetch
latest deals for brand" access pattern using similar code as the
"Fetch latest deals overall" that we saw above.
The principles here are exactly the same as the previous one of
fetching the latest deals by Brand. We can handle it by adding the
following attributes to our Deal items:
• GSI3PK: CATEGORY#<Category>#<TruncatedTimestamp>
• GSI3SK: DEAL#<DealId>
Then we would use the GSI3 index to fetch the most recent deals
for a category.
In addition to the basic Create & Read Brand access patterns, we
also have the following access patterns:
We also had the "Fetch Brand and latest Deals for Brand", but we
partially covered that before. Given that, it’s now turned into a
simple "Fetch Brand" access pattern with a second request to get the
latest deals for the brand.
Let’s model the Brand entity and handle the "Fetch all Brands"
access pattern first. We’ll use the following primary key pattern for our Brand items:
• PK: BRAND#<Brand>
• SK: BRAND#<Brand>
This will handle our basic Create & Read Brand access patterns.
To handle the "Fetch all Brands" access patterns, we will need to put
all of our brands in the same item collection. In general, I’m very
leery of putting all of one entity type into a single item collection.
These partitions can get fat and then you lose the partitioning
strategy that DynamoDB wants you to use.
However, let’s think more about our access pattern here. When
fetching all Brands, all we need is the Brand name. All Brands are
shown in a list view. Users can then see more about a particular
Brand by clicking on the Brand to go to its page.
Given that, we’ll create a singleton item. This singleton item is a container for a bunch
of data, and it doesn’t have any parameters in the primary key like
most of our entities do. You can read more about singleton items in
Chapter 16.
This singleton item will use BRANDS as the value for both the PK and
SK attributes. It will have a single attribute, Brands, which is an
attribute of type set containing all Brands in our application. Now
when fetching the list of Brands, we just need to fetch that item and
display the names to users.
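As a sketch, keeping that singleton item up to date when a new Brand is created could be a single UpdateItem call; the ADD action creates the string set if it doesn’t exist yet:

client.update_item(
    TableName='BigTimeDeals',
    Key={
        'PK': { 'S': 'BRANDS' },
        'SK': { 'S': 'BRANDS' }
    },
    # Add the new Brand's name to the Brands string set.
    UpdateExpression='ADD Brands :brand',
    ExpressionAttributeValues={
        ':brand': { 'SS': ['Apple'] }
    }
)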
The table with our Brand items and the Brands singleton item looks
as follows:
One last note: if you’re worried about hot key issues on the Brands
singleton item, you could copy the data across multiple items like
we did with our Deals cache items.
Handling Brand Likes
While we’re on Brands, let’s handle the access pattern for a user to
"Like" a Brand. There are two factors to consider here:
There are two ways for us to handle the first issue: by using a set
attribute on the Brand item or by creating a separate item to
indicate the 'Like'.
The first option of using a set attribute only works if the total
number of 'likers' would be small. Remember that we have a 400KB
limit on a DynamoDB item. In this case, the number of users that
like a particular Brand could be unbounded, so using a set attribute
is infeasible.
Given that, we’ll keep track of each Like in a separate item. The
primary key pattern of this separate item will be as follows:
• PK: BRANDLIKE#<Brand>#<Username>
• SK: BRANDLIKE#<Brand>#<Username>
Because the Brand Like item includes both the Brand name and the
User name, we can prevent multiple likes by including a condition
expression to assert that the Brand Like doesn’t currently exist.
We’ll also keep a LikesCount attribute on the Brand item itself and increment it whenever a new Brand Like item is created.
Notice that these two operations need to happen together. We want to
to increment the counter when a new like happens, but we only
want to increment and record the like if the user hasn’t already
liked it. This is a great use case for a DynamoDB transaction.
result = dynamodb.transact_write_items(
    TransactItems=[
        {
            "Put": {
                "Item": {
                    "PK": { "S": "BRANDLIKE#APPLE#alexdebrie" },
                    "SK": { "S": "BRANDLIKE#APPLE#alexdebrie" },
                    # ... rest of attributes ...
                },
                "TableName": "BigTimeDeals",
                "ConditionExpression": "attribute_not_exists(PK)"
            }
        },
        {
            "Update": {
                "Key": {
                    "PK": { "S": "BRAND#APPLE" },
                    "SK": { "S": "BRAND#APPLE" },
                },
                "TableName": "BigTimeDeals",
                "ConditionExpression": "attribute_exists(PK)",
                "UpdateExpression": "SET #likes = #likes + :incr",
                "ExpressionAttributeNames": {
                    "#likes": "LikesCount"
                },
                "ExpressionAttributeValues": {
                    ":incr": { "N": "1" }
                }
            }
        }
    ]
)
This transaction has two operations. The first tries to create a Brand
Like item for the User "alexdebrie" for the Brand "Apple". Notice
that it includes a condition expression to ensure the Brand Like
item doesn’t already exist, which would indicate the user already
liked this Brand.
The second operation increments the LikesCount attribute on the
Brand item by 1. It also includes a condition expression to ensure
the Brand exists.
The patterns here are pretty similar to the Brand Like discussion,
with one caveat: our internal system needs to be able to find all
watchers for a Brand so that we can notify them.
Because of that, we’ll put all Brand Watch items in the same item
collection so that we can run a Query operation on it.
• PK: BRANDWATCH#<Brand>
• SK: USER#<Username>
Notice that Username is not a part of the partition key in this one,
so all Brand Watch items will be in the same item collection.
One of our access patterns is to send a New Brand Deal Message to
all Users that are watching a Brand. Let’s discuss how to handle that
here.
resp = dynamodb.query(
    TableName='BigTimeDeals',
    KeyConditionExpression="#pk = :pk",
    ExpressionAttributeNames={
        '#pk': "PK"
    },
    ExpressionAttributeValues={
        ':pk': { 'S': "BRANDWATCH#APPLE" }
    }
)
First, we run a Query operation to fetch all the watchers for the
given Brand. Then we send a message to each watcher to alert them
of the new deal.
20.3.3. Modeling the Category item
Now that we’ve handled the Brand item, let’s handle the Category
item as well. Categories are very similar to Brands, so we won’t
cover it in quite as much detail.
There are two main differences between the Categories and Brands:
Each Category item includes information about the Featured Deals
on the item directly. Remember that setting Featured Deals is an
internal use case, so we can program our internal CMS such that it
includes all information about all Featured Deals whenever an
editor is setting the Featured Deals for a Category.
Thus, for modeling out the Category items and related entities, we
create the following item types:
Category
• PK: CATEGORY#<Category>
• SK: CATEGORY#<Category>
CategoryLike
• PK: CATEGORYLIKE#<Category>#<Username>
• SK: CATEGORYLIKE#<Category>#<Username>
CategoryWatch
• PK: CATEGORYWATCH#<Category>
• SK: USER#<Username>
The last thing we need to handle is the Featured Deals on the front
page of the application and the Editor’s Choice page.
For me, this problem is similar to the "Featured Deals for Category"
problem that we just addressed. We could add some attributes to a
Deal to indicate that it’s featured and shuttle them off into a
secondary index. Alternatively, we could go a much simpler route by
just duplicating some of that data elsewhere.
Notice the singleton items for the Front Page and the Editor’s
Choice page. Additionally, like we did with the Deals Cache items,
we could copy those across a number of partitions if needed. This is
a simple, effective way to handle these groupings of featured deals.
Let’s take a breath here and take a look at our updated entity chart
with all Deal-related items finished.
Entity PK SK
Deal DEAL#<DealId> DEAL#<DealId>
User
Message
Table 28. Big Time Deals entity chart
Entity | GSI3PK | GSI3SK
Deal | CATEGORY#<Category>#<TruncatedTimestamp> | DEAL#<DealId>
Table 31. Big Time Deals GSI3 entity chart
I’m going to start with the last access pattern—send new Hot Deal
Message to All Users—then focus on the access patterns around
fetching Messages.
When we think about the "Send new Hot Deal Message to all Users",
it’s really a two-step operation:
1. Find all Users in our application
2. For each User, send a Message
For the 'find all Users' portion of it, we might think to mimic what
we did for Brands: use a singleton item to hold all usernames in an
attribute. However, the number of Users we’ll have is unbounded. If
we want our application to be successful, we want as many Users as
possible, which means that attribute could easily exceed the 400KB item size limit.
Instead, let’s use one of our sparse indexing strategies from Chapter
13. Here, we want to use the second type of sparse index, which
projects only a single type of entity into an index.
• PK: USER#<Username>
• SK: USER#<Username>
• UserIndex: USER#<Username>
Notice that we have three User items in our table. I’ve also placed a
Deal item to help demonstrate how our sparse index works.
Notice that only our User items have been copied into this index.
Because other items won’t have that attribute, this is a sparse index
containing just Users.
resp = dynamodb.scan(
    TableName='BigTimeDeals',
    IndexName='UserIndex'
)
This is a similar pattern to what we did when sending messages to
Brand or Category Watchers. However, rather than using the Query
operation on a partition key, we’re using the Scan operation on an
index. We will scan our index, then message each User that we find
in our index.
The final access patterns we need to handle are with the Messages
for a particular User. There are three access patterns here:
Notice the first two are the same pattern with an additional filter
condition. Let’s think through our different filtering strategies from
Chapter 13 on how we handle the two different patterns.
'Find all Messages' access pattern, and it will be difficult to find the
exact number of Messages we want without overfetching.
Likewise, we can’t use the sort key to filter here, whether we use it
directly or as part of a composite sort key. The composite sort key
works best when you always want to filter on a particular value.
Here, we sometimes want to filter and sometimes don’t.
• PK: MESSAGES#<Username>
• SK: MESSAGE#<MessageId>
• GSI1PK: MESSAGES#<Username>
• GSI1SK: MESSAGE#<MessageId>
For the MessageId, we’ll stick with the KSUID that we used for
Deals and discussed in Chapter 14.
Note that the PK & SK patterns are the exact same as the GSI1PK
and GSI1SK patterns. The distinction is that the GSI1 attributes will
only be added for unread Messages. Thus, GSI1 will be a sparse index
for unread Messages.
We have four Messages in our table. They’re grouped according to
Username, which makes it easy to retrieve all Messages for a User.
Also notice that three of the Messages are unread. For those three
Messages, they have GSI1PK and GSI1SK values.
When we look at our GSI1 secondary index, we’ll see only unread
Messages for a User:
This lets us quickly retrieve unread Messages for a User.
The modeling part is important, but I don’t want to leave out how
we implement this in code either. Let’s walk through a few code
snippets.
def create_message(message):
    resp = client.put_item(
        TableName='BigTimeDeals',
        Item={
            'PK': { 'S': f"MESSAGES#{message.username}" },
            'SK': { 'S': f"MESSAGE#{message.created_at}" },
            'Subject': { 'S': message.subject },
            'Unread': { 'S': "True" },
            # New Messages are unread, so we also add the GSI1 attributes for the sparse index.
            'GSI1PK': { 'S': f"MESSAGES#{message.username}" },
            'GSI1SK': { 'S': f"MESSAGE#{message.created_at}" },
        }
    )
    return message
Notice that the caller of our function doesn’t need to add the GSI1
values or even think about whether the message is unread. Because
it’s unread by virtue of being new, we can set that property and
both of the GSI1 properties in the data access layer.
def mark_message_read(message):
    resp = client.update_item(
        TableName='BigTimeDeals',
        Key={
            'PK': { 'S': f"MESSAGES#{message.username}" },
            'SK': { 'S': f"MESSAGE#{message.created_at}" },
        },
        # Flip the Unread flag and remove the GSI1 attributes so the Message
        # drops out of the sparse unread-messages index.
        UpdateExpression="SET #unread = :false REMOVE #gsi1pk, #gsi1sk",
        ExpressionAttributeNames={
            '#unread': 'Unread',
            '#gsi1pk': 'GSI1PK',
            '#gsi1sk': 'GSI1SK'
        },
        ExpressionAttributeValues={
            ':false': { 'S': 'False' }
        }
    )
    return message
def get_messages_for_user(username, unread_only=False):
    args = {
        'TableName': 'BigTimeDeals',
        'KeyConditionExpression': '#pk = :pk',
        'ExpressionAttributeNames': {
            '#pk': 'PK'
        },
        'ExpressionAttributeValues': {
            ':pk': { 'S': f"MESSAGES#{username}" }
        },
        'ScanIndexForward': False
    }
    if unread_only:
        # Query the sparse GSI1 index instead, using its partition key attribute.
        args['IndexName'] = 'GSI1'
        args['ExpressionAttributeNames'] = { '#pk': 'GSI1PK' }
    resp = client.query(**args)
We can use the same method for fetching all Messages and fetching
unread Messages. With our unread_only argument, a caller can
specify whether they only want unread Messages. If that’s true, we’ll
add the IndexName property to our Query operation. Otherwise,
we’ll hit our base table.
With this sparse index pattern, we’re able to efficiently handle both
access patterns around Messages.
20.4. Conclusion
The final entity charts and access patterns are below, but I want to
close this out first.
Let’s look at our final entity charts. First, the entity chart for our
base table:
Entity PK SK
Deal DEAL#<DealId> DEAL#<DealId>
Entity GSI1PK GSI1SK
Deal DEALS#<TruncatedTimestamp> DEAL#<DealId>
Access Pattern | Index | Parameters | Notes
Fetch Front Page & Latest Deals | Main table | N/A | Fetch Front Page Item
Fetch Front Page & Latest Deals | GSI1 | • LastDealIdSeen | Query timestamp partitions for up to 25 deals
Fetch Category & Latest Deals | Main table | • CategoryName | Fetch Category Item
Create User | Main table | N/A | Condition expression to ensure uniqueness on username
Like Brand For User | Main table | N/A | Transaction to increment Brand LikeCount and ensure User hasn't liked
Watch Brand For User | Main table | N/A | Transaction to increment Brand WatchCount and ensure User hasn't watched
Like Category For User | Main table | N/A | Transaction to increment Category LikeCount and ensure User hasn't liked
Watch Category For User | Main table | N/A | Transaction to increment Category WatchCount and ensure User hasn't watched
Chapter 21. Recreating
GitHub’s Backend
The last example was a pretty big one, but we’re about to go even
bigger. In this example, we’re going to re-create the GitHub data
model around core metadata. We’ll model things like Repos, Users,
Organizations, and more. This model contains a number of closely-
related objects with a large number of access patterns. By the end of
this example, you should feel confident that DynamoDB can handle
highly-relational access patterns.
21.1. Introduction
• Repositories
• Users
• Organizations
• Payment plans
• Forks
• Issues
• Pull Requests
• Comments
• Reactions
• Stars
The deeper you get into git, the harder it is to keep track of the
data modeling principles we’re trying to learn. Further, much of
the API around files and branches is more search-related, which
would likely be done in an external system rather than in
DynamoDB. Given that, we will focus on the metadata elements
of GitHub rather than on the repository contents.
21.1.2. Walkthrough of the basics of GitHub
Now that we know the entities we’ll be covering, let’s walk through
what these entities are and how they interact with each other.
Repositories
Below is a screenshot of the GitHub repo for the AWS SDK for
JavaScript. We’ll walk through some of the key points below.
• Repo Name. Each repository has a name. For this example, you
can see the repository name in the top left corner. This repo’s
name is aws-sdk-js.
• Repo Owner. Each repository has a single owner. The owner can
be either a User or an Organization. For many purposes, there is
no difference as to whether a repo is owned by a User or an
Organization. The repository owner is shown in the top left
before the repo name. This repo’s owner is aws.
The combination of repo name and repo owner must be unique
across all of GitHub. You will often see a repository referred to
by the combination of its repo name and owner, such as
aws/aws-sdk-js. Note that there is a one-to-many relationship
between owners and repos. An owner may have many repos, but
a repo only has one owner.
• Issues & Pull Requests. Below the repo owner and name, notice the
tabs for "Issues" and "Pull Requests". These indicate the number
of open issues and pull requests, respectively, for the current
repo. Issues and pull requests are discussed in additional detail
below.
• Stars. On the top right, there is a box that indicates this repo has
5.7k stars. An individual User can choose to 'star' a repository,
indicating the user likes something about the repo. It’s
comparable to a 'like' in various social media applications. Each
user can star a repo only one time.
• Forks. On the top right next to Stars, you can see that this
repository has 1.1k forks. A user or organization can 'fork' a
repository to copy the repository’s existing code into a repository
owned by the forking user or organization. This can be done if a
user wants to make changes to submit back to the original repo,
or it can be done so that the user or org has a copy of the code
which they control entirely.
Users
A user is uniquely identified by a username. Users can own repos,
as discussed above. Users can also belong to one or more
organizations, as discussed below.
Organizations
Payment Plans
Both users and organizations can sign up for paid plans on GitHub.
These plans determine which features a user or organization is
allowed to use as well as the account-wide limits in place.
Forks
Forked repositories point back to their original repo, and you can
browse the forked repositories of a given repository in the GitHub
UI, as shown below.
Issues
receive an ID of 6.
Pull Requests
of combined issues and pull requests in the given repository.
Comments
Both issues and pull requests can have comments. This allows for
discussion between multiple users on a particular issue or pull
request. All comments are at the same level, with no threading, and
there is no limit to the number of comments in an issue or PR.
Reactions
Now that we know the entities that we will be modeling, take the
time to create your ERD and draw up the access patterns in your
application.
Let’s walk through what’s going on here.
Moving to the center of the ERD, we see the Repo entity. Both
Users and Organizations can own Repos. There is a one-to-many
relationship between Users or Organizations and Repos as a User or
Organization may own multiple Repos but a Repo only has one
owner.
Further, a User can also star a Repo to indicate interest in the Repo’s
contents. There is a many-to-many relationship between Users and
Repos via this Star relationship. Organizations may not add stars to
repositories.
Notice that a forked Repo is itself still a Repo as the forked version
of the Repo is owned by a User or Organization and will have Issues
and Pull Requests separate from the parent from which it was
forked.
The two other one-to-many relationships with Repos are Issues and
Pull Requests. A single Repo will have multiple of both Issues and
Pull Requests.
Issues and Pull Requests can each have Comments left by Users.
While technically Issues, Pull Requests, and Comments are all left
by Users, there are no access patterns related to fetching any of
these entities by the User that left them. Accordingly, we didn’t map
the relationship here.
Entity PK SK
Repo
Issue
Pull Request
Comment
Reaction
Fork
User
Organization
Payment Plan
Table 36. GitHub model entity chart
Finally, let’s list the access patterns for our application. Because we
have a bunch, I’ll group them into categories.
Repo basics:
• Get / Create / List Pull Request(s) for Repo
• Fork Repo
• Get Forks for Repo
Interactions:
User management:
• Create User
• Create Organization
• Add User to Organization
• Get Users for Organization
• Get Organizations for User
Now that we have our ERD and our access patterns, let’s get started
with the modeling.
As usual, we’ll start with the same three questions:
decide whether to store this on a single item or across eight
different items.
With our interesting requirements noted, it’s time to pick our first
entity. I like to start with a 'core' entity in a data model to get
started. For this model, the best entity to start with looks like the
Repo entity. Not only does it have the most access patterns, but also
it is the parent of a number of relationships. There are three one-
to-many relationships—Issues, Pull Requests, and Repos (via a
Fork)--as well as a many-to-many relationship with Users via Stars.
Given that, we’ll look at using the primary key + Query strategy for
our one-to-many relationships. At this point, I’m actually going to
remove Forks from consideration because Forks are themselves
Repos. This likely means we’ll need to do some grouping in a
secondary index rather than in the base table.
the most recent issues first.
• Pull Requests. Descending order by pull request number, as we’ll
want to fetch the most recent pull requests first.
Given this, it looks like we’ll only be able to model one of the one-
to-many relationships in the Repo’s item collection in the main
table. There’s really no difference when choosing between Issues &
Pull Requests, so I’ll just go with Issues as the one-to-many
relationship to model in our primary key.
Both the Repo and Issue entities will be in the same item collection
in the main table, meaning they need to have the same partition
key. Let’s use the following patterns for these entities:
Repos:
• PK: REPO#<Owner>#<RepoName>
• SK: REPO#<Owner>#<RepoName>
Issues:
• PK: REPO#<Owner>#<RepoName>
• SK: ISSUE#<ZeroPaddedIssueNumber>
One thing to note here: for the Issue item, the Issue Number in the
sort key is a zero-padded number. The reason for this is discussed in
further detail in Chapter 14 on sorting. At a high level, if you are
ordering numbers that are stored in a string attribute (as here,
because we have the ISSUE# prefix), then you need to make sure
your numbers are all the same length. String sorting is done from
left to right, one character at a time, so ISSUE#10 would come
before ISSUE#9 because the seventh character (1 vs 9) is lower for
the former.
Let’s take a look at how our table will look with some sample data
for Repos and Issues.
Notice that there are two different item collections in the table so
far. One is for the alexdebrie/dynamodb-book repository and
contains three Issue items and one Repo item. The other is for the
alexdebrie/graphql-demo repository and contains just the Repo
item.
If I want to fetch the Repo item and the most recent Issue items, I
would use the following Query API action:
result = dynamodb.query(
    TableName='GitHubTable',
    KeyConditionExpression="#pk = :pk",
    ExpressionAttributeNames={
        "#pk": "PK",
    },
    ExpressionAttributeValues={
        ":pk": { "S": "REPO#alexdebrie#dynamodb-book" },
    },
    ScanIndexForward=False
)
With this primary key design, we now have the basics around
creating Repos and Issues. We can also handle the "Fetch Repo &
Issues for Repo" access pattern.
Entity PK SK
Repo REPO#<Owner>#<RepoName> REPO#<Owner>#<RepoName>
Later on, we’ll cover two other tricky things about Issues.
Pull Requests
Now that we’ve handled the Repo and Issue items, let’s move on to
the Pull Requests item. Pull requests are very similar to issues. We
need to be able to access them individually, when a user is browsing
to a specific pull request. Further, we need to be able to fetch a
group of Issues items for a given repository, along with the Repo
item that has information about the parent repository. Finally, pull
requests are often retrieved in descending order according to their
pull request number.
To handle the Pull Request access patterns, let’s first add our Pull
Request items to the table. We already learned we can’t have them
in the same item collection as the Repo item in our main table, so
we’ll split them into a different collection.
We’ll use the following pattern for the PK and SK of Pull Request
items:
• PK: PR#<Owner>#<RepoName>#<ZeroPaddedPullRequestNumber>
• SK: PR#<Owner>#<RepoName>#<ZeroPaddedPullRequestNumber>
These key values are long, but we need to ensure uniqueness for
each pull request across all repositories.
Next, we’ll use a global secondary index to put Repos and Pull
Requests in the same item collection. To do this, we’ll add the
following attributes to Repos and Pull Requests:
Repos:
• GSI1PK: REPO#<Owner>#<RepoName>
• GSI1SK: REPO#<Owner>#<RepoName>
Pull Requests:
• GSI1PK: REPO#<Owner>#<RepoName>
• GSI1SK: PR#<ZeroPaddedPullRequestNumber>
Notice that we have added two Pull Request items at the bottom of
the table. Further, both Repo Items and Pull Request items have the
GSI1PK and GSI1SK values.
Let’s take a look at GSI1 to see our Repo items and Pull Request
items together:
Now that we have both issues and pull requests modeled, let’s
discuss our two additional quirks on those items.
Auto-incrementing integers are a common relational database
feature. They are often used as primary keys, and the database will
handle generating them for you. DynamoDB doesn’t give us an equivalent,
so we’ll handle it in two steps. First, we will make an UpdateItem call
to increment an IssuesAndPullRequestCount attribute on the parent Repo
item, returning the updated value.
Second, we will make a PutItem API call to create the Issue or Pull
Request item. We will use the value of the
IssuesAndPullRequestCount attribute in the previous call to know
which number we should use for our Issue or Pull Request item.
resp = client.update_item(
    TableName='GitHubTable',
    Key={
        'PK': { 'S': 'REPO#alexdebrie#dynamodb-book' },
        'SK': { 'S': 'REPO#alexdebrie#dynamodb-book' }
    },
    UpdateExpression="SET #count = #count + :incr",
    ExpressionAttributeNames={
        "#count": "IssuesAndPullRequestCount",
    },
    ExpressionAttributeValues={
        ":incr": { "N": "1" }
    },
    ReturnValues='UPDATED_NEW'
)

current_count = resp['Attributes']['IssuesAndPullRequestCount']['N']

resp = client.put_item(
    TableName='GitHubTable',
    Item={
        'PK': { 'S': 'REPO#alexdebrie#dynamodb-book' },
        # Zero-pad the number so the ISSUE# sort keys sort correctly as strings.
        'SK': { 'S': f"ISSUE#{int(current_count):08d}" },
        'RepoName': { 'S': 'dynamodb-book' },
        # ... other attributes ...
    }
)
Notice how we make the UpdateItem call, then use the returned
value for IssuesAndPullRequestCount to populate the primary key
for the PutItem request.
To handle this filtering, there are two approaches we could take:
1. Build the status (Open vs. Closed) into the primary key for either
the base table or a secondary index. Use the Query API to filter to
the type you want, or
2. Use a filter expression or client-side filtering to filter out the
items you don’t want.
We’re going to go with the filter expression approach here, for a couple of reasons. First, the values on which we’re filtering are limited. There are only
two options: Open or Closed. Because of this, we should have a
pretty high ratio of read items to returned items.
result = dynamodb.query(
    TableName='GitHubTable',
    KeyConditionExpression="#pk = :pk",
    FilterExpression="attribute_not_exists(#status) OR #status = :status",
    ExpressionAttributeNames={
        "#pk": "PK",
        "#status": "Status"
    },
    ExpressionAttributeValues={
        ":pk": { "S": "REPO#alexdebrie#dynamodb-book" },
        ":status": { "S": "Open" }
    },
    ScanIndexForward=False
)
The filter expression works by asserting that either the Status attribute
does not exist (indicating it is a Repo item) or that the value of Status
is Open, indicating it’s an open issue.
The biggest downside to this pattern is that it’s hard to know how to
limit the number of items you read from the table. You could find
yourself reading the full 1MB from a table even if you only needed
the first 25 items. You could do a mixed approach where you first
try to find 25 items that match by reading the first 75 items or so. If
you need to make a follow up query, then you could do an
unbounded one.
In the next chapter, we imagine that this pattern did not work as well as
desired once implemented in production, so we move away from filter
expressions. Check it out to see how you would model this filter into the
primary key directly.
Modeling Forks
If a Repo is an original Repo (not a Fork), then the pattern for these
attributes will be as follows:
• GSI2PK: REPO#<Owner>#<RepoName>
• GSI2SK: #REPO#<RepoName>
If a Repo is a Fork of another Repo, the pattern for these attributes will be as follows:
• GSI2PK: REPO#<OriginalOwner>#<RepoName>
• GSI2SK: FORK#<Owner>
For the Fork item, the OriginalOwner refers to the owner of the
Repo you are forking. This will group it in the same item collection
for the original Repo.
Here are four Repo items in our table. Other items have been
removed for clarity.
Notice that each Repo item has the standard Repo attributes, like
the RepoName and RepoOwner, as well as Fork attributes, like GSI2PK
and GSI2SK.
If we flip to looking at the GSI2 view, our index is as follows:
Notice how all Forks for a given repository are in the same item
collection. The original Repo item is at the top of the item
collection and all Forks are after it in ascending order by the
forking Owner’s name.
Modeling Stars
Stars:
• PK: REPO#<Owner>#<RepoName>
• SK: STAR#<UserName>
If we put a few Stars in our table, our base table looks as follows:
If I want to get the Repo and all Stars, I would use the following
Query:
result = dynamodb.query(
    TableName='GitHubTable',
    KeyConditionExpression="#pk = :pk AND #sk >= :sk",
    ExpressionAttributeNames={
        "#pk": "PK",
        "#sk": "SK"
    },
    ExpressionAttributeValues={
        ":pk": { "S": "REPO#alexdebrie#dynamodb-book" },
        ":sk": { "S": "REPO#alexdebrie#dynamodb-book" }
    }
)
This also means I’ll need to update my Query above for the Repo &
Issues access pattern. It should have a sort key condition as well.
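A sketch of that updated Query: the <= sort key condition keeps the Repo item and its Issues while excluding the Star items, which sort after the Repo item.

result = dynamodb.query(
    TableName='GitHubTable',
    KeyConditionExpression="#pk = :pk AND #sk <= :sk",
    ExpressionAttributeNames={
        "#pk": "PK",
        "#sk": "SK"
    },
    ExpressionAttributeValues={
        ":pk": { "S": "REPO#alexdebrie#dynamodb-book" },
        ":sk": { "S": "REPO#alexdebrie#dynamodb-book" }
    },
    ScanIndexForward=False
)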
We saw above that we’ll be storing a Star and Fork item for each
star and fork. But it would be inefficient to recount all those items
each time someone loaded the page. Instead, we’ll keep a running
count on the Repo item for fast display.
When a user stars a repo, we need to do two things:
1. Create the Star item, with a condition expression to ensure the item doesn’t already
exist (we don’t want to allow a user to star a repo multiple times),
and
2. Increment the StarCount or ForkCount attribute on the Repo
item.
Notably, the second part should not happen without the first. If a
user tries to star a repo multiple times, we will reject the entire
operation.
The code to handle this will look something like the following:
result = dynamodb.transact_write_items(
    TransactItems=[
        {
            "Put": {
                "Item": {
                    "PK": { "S": "REPO#alexdebrie#dynamodb-book" },
                    "SK": { "S": "STAR#danny-developer" },
                    # ... rest of attributes ...
                },
                "TableName": "GitHubModel",
                "ConditionExpression": "attribute_not_exists(PK)"
            }
        },
        {
            "Update": {
                "Key": {
                    "PK": { "S": "REPO#alexdebrie#dynamodb-book" },
                    "SK": { "S": "REPO#alexdebrie#dynamodb-book" }
                },
                "TableName": "GitHubModel",
                "ConditionExpression": "attribute_exists(PK)",
                "UpdateExpression": "SET #count = #count + :incr",
                "ExpressionAttributeNames": {
                    "#count": "StarCount"
                },
                "ExpressionAttributeValues": {
                    ":incr": { "N": "1" }
                }
            }
        }
    ]
)
There is one write request to create the new Star item, and one
write request to increment the StarCount attribute on the Repo
item. If either part fails, we reject the whole operation and don’t get
an incorrect count.
Let’s move on to Comments and Reactions next.
IssueComments:
• PK: ISSUECOMMENT#<Owner>#<RepoName>#<IssueNumber>
• SK: ISSUECOMMENT#<CommentId>
PRComments:
• PK: PRCOMMENT#<Owner>#<RepoName>#<PRNumber>
• SK: PRCOMMENT#<CommentId>
This strategy will ensure all Comments for an Issue or Pull Request
will be in the same item collection, making it easy to fetch all
Comments for a particular Issue or Pull Request.
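For example, fetching all Comments for Issue number 4 in the alexdebrie/dynamodb-book repo might look like this sketch:

resp = client.query(
    TableName='GitHubTable',
    KeyConditionExpression='#pk = :pk',
    ExpressionAttributeNames={ '#pk': 'PK' },
    ExpressionAttributeValues={
        ':pk': { 'S': 'ISSUECOMMENT#alexdebrie#dynamodb-book#4' }
    }
)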
Modeling Reactions
To handle Reactions, we could keep count
attributes on the relevant Issue, Pull Request, or Comment to track
the eight different counts. Further, we could use an item with eight
different attributes (or even eight different items!) in our table to
ensure a user didn’t have the same reaction to the same target
multiple times.
First, on each Issue, Pull Request, and Comment item, we’ll have a
ReactionCounts attribute that stores the counts for the eight
different reaction types. Each key in the map will be a reaction
type, and the value will be the number of counts.
Then, we’ll create a User Reaction Target item that will track a
single user’s reactions to a particular target (Issue, Pull Request, or
Comment). There will be a Reactions attribute on the item that is
of type string set to track the reactions that have been used.
Reaction:
• PK:
<TargetType>REACTION#<Owner>#<RepoName>#<TargetIdentifi
er>#<UserName>
• SK:
<TargetType>REACTION#<Owner>#<RepoName>#<TargetIdentifi
er>#<UserName>
First, the TargetType refers to the item to which we’re reacting. The
potential values here are ISSUE, PR, ISSUECOMMENT, or PRCOMMENT.
Second, the TargetIdentifier identifies the specific target. For an Issue
or Pull Request, it will be the Issue or Pull Request Number. For a
Comment, it will be the Issue or Pull Request Number plus the Comment Id.
result = dynamodb.transact_write_items(
    TransactItems=[
        {
            "Update": {
                "Key": {
                    "PK": { "S": "ISSUEREACTION#alexdebrie#dynamodb-book#4#happy-harry" },
                    "SK": { "S": "ISSUEREACTION#alexdebrie#dynamodb-book#4#happy-harry" },
                },
                "TableName": "GitHubModel",
                "UpdateExpression": "ADD #reactions :reaction",
                "ConditionExpression": "attribute_not_exists(#reactions) OR NOT contains(#reactions, :reaction)",
                "ExpressionAttributeNames": {
                    "#reactions": "Reactions"
                },
                "ExpressionAttributeValues": {
                    ":reaction": { "SS": [ "Heart" ] }
                }
            }
        },
        {
            "Update": {
                "Key": {
                    "PK": { "S": "REPO#alexdebrie#dynamodb-book" },
                    # The Issue item's sort key, zero-padded per our key design.
                    "SK": { "S": "ISSUE#00000004" }
                },
                "TableName": "GitHubModel",
                "UpdateExpression": "SET #reactions.#reaction = #reactions.#reaction + :incr",
                "ExpressionAttributeNames": {
                    "#reactions": "ReactionCounts",
                    "#reaction": "Heart"
                },
                "ExpressionAttributeValues": {
                    ":incr": { "N": "1" }
                }
            }
        }
    ]
)
In the first action, we’re adding Heart to the Reactions string set
attribute as long as Heart doesn’t already exist in the set. In the
second action, we’re incrementing the count of Heart reactions on
our Issue item.
Let’s check in on our updated entity charts before moving on to Users and
Organizations.
Entity PK SK
Repo REPO#<Owner>#<RepoName> REPO#<Owner>#<RepoName>
Entity | GSI2PK | GSI2SK
Fork | REPO#<OriginalOwner>#<RepoName> | FORK#<Owner>
Table 40. GitHub model GSI2 entity chart
User and Organization items
Let’s start with the User and Organization entities. With a many-to-
many relationship, the tricky part is how to query both sides of the
relationship with fresh data.
This gives us more flexibility to model the many-to-many
relationship. We’ll handle this in two similar but slightly different
ways.
User:
• PK: ACCOUNT#<AccountName>
• SK: ACCOUNT#<AccountName>
Organization:
• PK: ACCOUNT#<AccountName>
• SK: ACCOUNT#<AccountName>
Membership:
• PK: ACCOUNT#<OrganizationName>
• SK: MEMBERSHIP#<UserName>
Note that both Users and Organizations have the same primary key
structure. In fact, you can’t tell by the primary key alone whether
you’re working with a User or an Organization.
Further, both Users and Organizations own Repos and will have
access patterns around those. The only real difference is that an
Organization’s item collection can also include Membership items.
Notice that there are two Users, each of which has its own item
collection. The top User, alexdebrie, has an Organizations
attribute which contains details about the organization to which the
User belongs.
With this setup, we can handle both the "Get User" and "Get
Organization" access patterns using simple GetItem lookups.
Additionally, we can do the "Get Users in Organization" access
pattern by using a Query operation on the item collection for an
Organization.
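One way to write that "Get Users in Organization" Query is sketched below, using begins_with on the sort key to grab just the Membership items; megacorp is the example Organization used later in this section:

resp = client.query(
    TableName='GitHubTable',
    KeyConditionExpression='#pk = :pk AND begins_with(#sk, :memberships)',
    ExpressionAttributeNames={ '#pk': 'PK', '#sk': 'SK' },
    ExpressionAttributeValues={
        ':pk': { 'S': 'ACCOUNT#megacorp' },
        ':memberships': { 'S': 'MEMBERSHIP#' }
    }
)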
Next, let’s model out payment plans for users and organizations. As
a reminder, both users and organizations can sign up for payment
plans with GitHub that give them access to premium features.
There is a one-to-one relationship between a user or organization
and its payment plan.
The only time you’d want to split the payment plan into a separate item
is if it were quite large and you wanted to avoid accessing it each time
you grab the parent item.
That’s not the case here. Accordingly, we’ll just include payment
plan details as a complex attribute on the User or Organization item
itself.
Note that the alexdebrie User and the megacorp Organization both
have PaymentPlan attributes where payment information is listed in
a map.
The final access pattern to handle is fetching all Repos for a given
account, whether a User or an Organization. This is a one-to-many
relationship similar to the ones we’ve used in other areas.
Because an account may own an unbounded number of Repos, we can’t
denormalize the Repos onto the account item with a complex
attribute type. We’ll need to use the Query-based strategies using
either the primary key or a secondary index.
Our Repo item is already pretty busy on the primary key, as it’s
handling relationships for both Issues and Stars. As such, we’ll need
to use a secondary index. We have two secondary indexes already,
but the Repo item is already using both of them for other
relationships. It’s using GSI1 to handle Repo + Pull Requests, and it’s
using GSI2 to handle the hierarchical Fork relationship.
Accordingly, we’ll need to add a third secondary index.
Users and Organizations:
• GSI3PK: ACCOUNT#<AccountName>
• GSI3SK: ACCOUNT#<AccountName>
Repos:
• GSI3PK: ACCOUNT#<AccountName>
• GSI3SK: #<UpdatedAt>
In this secondary index, we have two item collections—one for the
alexdebrie User account and one for the megacorp Organization
account. Each collection has the Repos that belong to those
accounts sorted in order of the time in which they were last
updated.
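A sketch of that Query against GSI3: reading in descending order returns the account item first (its GSI3SK sorts after the '#'-prefixed Repo values), followed by the Repos from most to least recently updated.

resp = client.query(
    TableName='GitHubTable',
    IndexName='GSI3',
    KeyConditionExpression='#gsi3pk = :gsi3pk',
    ExpressionAttributeNames={ '#gsi3pk': 'GSI3PK' },
    ExpressionAttributeValues={
        ':gsi3pk': { 'S': 'ACCOUNT#alexdebrie' }
    },
    ScanIndexForward=False
)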
21.4. Conclusion
We did it! This was a beefy data model, but we managed to get it done.
Entity | PK | SK
Repo | REPO#<Owner>#<RepoName> | REPO#<Owner>#<RepoName>
Issue | REPO#<Owner>#<RepoName> | ISSUE#<ZeroPaddedIssueNumber>
Pull Request | PR#<Owner>#<RepoName>#<ZeroPaddedPRNumber> | PR#<Owner>#<RepoName>#<ZeroPaddedPRNumber>
IssueComment | ISSUECOMMENT#<Owner>#<RepoName>#<IssueNumber> | ISSUECOMMENT#<CommentId>
PRComment | PRCOMMENT#<Owner>#<RepoName>#<PullRequestNumber> | PRCOMMENT#<CommentId>
Reaction | <TargetType>REACTION#<Owner>#<RepoName>#<TargetIdentifier>#<UserName> | <TargetType>REACTION#<Owner>#<RepoName>#<TargetIdentifier>#<UserName>
User | ACCOUNT#<Username> | ACCOUNT#<Username>
Entity | GSI3PK | GSI3SK
Repo | ACCOUNT#<AccountName> | #<UpdatedAt>
Chapter 22. Handling
Migrations in our GitHub
example
In our last example, we walked through a complicated data model
to re-create the GitHub metadata API. If you’re like me, you felt
pretty good about how it all turned out.
Now let’s imagine it’s a year later. We have new access patterns to
deal with. We realized some of our initial assumptions were
incorrect. How will we modify our table design to handle this?
22.1. Introduction
To start off, let’s see the new patterns we need to handle. We’re
going to make four changes in this chapter. Three of these are
brand new entity types and one is a change in how we model an
existing pattern based on some incorrect assumptions.
22.1.1. Adding a Code of Conduct
Over the last few years, it has become more popular to include a
Code of Conduct in your GitHub repository. These Codes of
Conduct describe how one should act when using or contributing to
a particular piece of open source software. The goal is to make a
more welcoming and inclusive atmosphere for all developers.
On this screen, notice that it shows my most recent gists according
to the date they were created.
There are two main access patterns for GitHub Apps. First, a user or
an organization is the one that will create a GitHub App. In addition
to the create & update patterns around that, we will need "Fetch
account (User or Organization) and all GitHub Apps for account" to
handle the following screen:
This screen shows all of the GitHub Apps that I have created. There
is a similar screen for Organizations.
In practice, users almost always retrieve Issues and Pull Requests with a
status filter applied: they either retrieve
the "Open" ones or the "Closed" ones.
Given these problems, we’re going to switch our Issue & Pull
Request access patterns such that we can fetch open or closed items
via the primary key, rather than using a filter expression.
Now that we know our four changes, let’s follow our usual process.
First, we need to create our ERD.
The updated ERD is below, with the new entities outlined in red.
A single GitHub App may be installed into
multiple repositories, and a single repository may install multiple
GitHub Apps.
Note that we don’t need to make any changes to the ERD for the
updated approach to issues and pull requests, as nothing changed
about the data model, just about how we want to handle it.
Note that the last two aren’t new patterns, but they are ones we need
to handle.
Now let’s work through our access
patterns. I generally like to knock out the easiest ones first and then
work on the harder ones. When modeling in DynamoDB, I rank
them in the following order of difficulty:
In the GitHub API response, there are two attributes included for a
repository’s Code of Conduct:
A one-to-one relationship in DynamoDB is almost always modeled
as part of the same item. We’re big fans of denormalization with
DynamoDB, and a one-to-one relationship isn’t even
denormalization! The only time you may want to split this out is if
the related item is quite large and has different access patterns than
the underlying object. You may be able to save read capacity by
only fetching the related item when needed.
In this case, the name and url properties are likely to be quite small.
As such, we’ll just add the Code of Conduct entity as an attribute to
our Repo item.
The great thing about adding this new entity is that we can be lazy
about it. This isn’t used for indexing and querying in DynamoDB,
and the CodeOfConduct attribute is not a required property.
Accordingly, we don’t need to run a big ETL job to modify all of
our existing Repo items. As users take advantage of this feature by
registering a Code of Conduct in their repositories, we can add the
CodeOfConduct attribute to the corresponding Repo item.
The create and update functionality around a new item like Gists
should be pretty straightforward. We just need to determine the
primary key pattern for a Gist.
The tricky part with Gists is that we have an access pattern of "Fetch
User and Gists for User". This means we’ll need to ensure Gists are
in the same item collection as a User entity.
Let’s take a look at the existing User entity in our base table:
If we look at our User entity, there are no other entities in its item
collection in the main table at the moment. That means we can
reuse this item collection without having to decorate our User items
to create a new item collection in a secondary index.
To do this, we’ll use our old friend, the KSUID. Remember, the
KSUID is like a UUID but with a timestamp prefix to give it
chronological sorting. We’ve used KSUIDs in a number of previous
examples, and you can read more about it in Chapter 14.
When adding Gists to our table, we’ll use the following pattern:
Gists:
• PK: ACCOUNT#<Username>
• SK: #GIST#<KSUID>
Here’s a look at our table with a few Gists added to the existing
Users:
Further, we can use the Query API to retrieve a User and the User’s
most recent Gist items. That Query would be written as follows:
result = dynamodb.query(
    TableName='GitHubTable',
    KeyConditionExpression="#pk = :pk",
    ExpressionAttributeNames={
        "#pk": "PK",
    },
    ExpressionAttributeValues={
        ":pk": { "S": "ACCOUNT#alexdebrie" },
    },
    ScanIndexForward=False
)
Because we’re reading the item collection in descending order, this will
return our User and the most recent Gists.
GitHub Apps have a few things going on. First, there is a one-to-
many relationship between GitHub Apps and its owner (either a
User or an Organization). Second, a GitHub App can be installed
into multiple repositories, meaning there is a many-to-many
relationship between GitHub Apps and Repo items.
Let’s start with the one-to-many relationship first, then circle back
to the many-to-many relationships.
Let’s first take a look at what is in the Users and Orgs item
collections in our base table:
Notice that our User item collections contain Gist items which are
located before the User item, which gives us reverse-chronological
ordering. Our Organization item collections, on the other hand,
have Membership items after the Organization item because we’re
fetching user memberships in an organization in alphabetical order.
This means we can’t easily place the GitHub App items in the
relevant item collections in the base table. We could place them after
the User item in the User item collections but before the
Organization item in Organization item collections. However, our
client won’t know at query time whether they’re dealing with a User
or an Organization and thus which way to scan the index to find the
GitHub Apps.
GitHub App:
• PK: APP#<AccountName>#<AppName>
• SK: APP#<AccountName>#<AppName>
• GSI1PK: ACCOUNT#<AccountName>
• GSI1SK: APP#<AppName>
Users and Organizations:
• GSI1PK: ACCOUNT#<AccountName>
• GSI1SK: ACCOUNT#<AccountName>
With these updates in place, our base table would look as follows:
Notice that we’ve added two GitHub App items into our base table
(outlined in red). We’ve also added GSI1PK and GSI1SK attributes to
our existing User and Org items (outlined in blue).
Notice the three items outlined in red. We have created an item
collection in GSI1 that includes both an account (whether a User
item or an Org item) as well as all the GitHub Apps that belong to
that account.
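A sketch of the "Fetch account and all GitHub Apps for account" Query against GSI1; with the default ascending order, the account item comes first, followed by its Apps:

resp = client.query(
    TableName='GitHubTable',
    IndexName='GSI1',
    KeyConditionExpression='#gsi1pk = :gsi1pk',
    ExpressionAttributeNames={ '#gsi1pk': 'GSI1PK' },
    ExpressionAttributeValues={
        ':gsi1pk': { 'S': 'ACCOUNT#alexdebrie' }
    }
)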
On one hand, we could use the lazy method that we used above
with Codes of Conduct. Because a GitHub App is a new type of
entity, we would only need to add these attributes to Users or Orgs
that create a GitHub App. Whenever adding a GitHub App, we
could ensure that the User or Org item adding it has the relevant
attributes.
On the other hand, we could proactively update all of our existing
User and Organization items now. To do that, we would:
1. Scan our entire table;
2. Find the relevant items to be updated; and
3. Run an update operation to add the new properties.
The code below shows the Scan portion of this approach:
import boto3

client = boto3.client('dynamodb')

last_evaluated = None
while True:
    params = {
        "TableName": "GithubModel",
        # Only keep User and Organization items.
        "FilterExpression": "#type IN (:user, :org)",
        "ExpressionAttributeNames": {
            "#type": "Type"
        },
        "ExpressionAttributeValues": {
            ":user": { "S": "User" },
            ":org": { "S": "Organization" }
        }
    }
    if last_evaluated:
        params['ExclusiveStartKey'] = last_evaluated
    results = client.scan(**params)
    # Update each matching item here (step 3; see the sketch below).
    last_evaluated = results.get('LastEvaluatedKey')
    if not last_evaluated:
        break
Notice that we’re using a filter expression in our Scan to filter out
any item whose Type attribute doesn’t match User or Organization,
as we only care about those items.
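The snippet above handles only the Scan and pagination. As a
hedged sketch (not shown in the original text) of the update in step
3, we could call a helper like the following for each item in
results['Items'] inside the loop. It assumes the base table PK of
User and Organization items already follows the
ACCOUNT#<AccountName> pattern, so we can simply copy it into the
new attributes:

def add_gsi1_attributes(client, item):
    # Copy the item's PK (ACCOUNT#<AccountName>) into the new GSI1 attributes.
    client.update_item(
        TableName="GithubModel",
        Key={ "PK": item["PK"], "SK": item["SK"] },
        UpdateExpression="SET GSI1PK = :gsi1pk, GSI1SK = :gsi1sk",
        ExpressionAttributeValues={
            ":gsi1pk": item["PK"],
            ":gsi1sk": item["PK"],
        }
    )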
Now let’s circle back to the many-to-many relationship between
GitHub Apps and Repos. For both sides of this relationship, we want
to fetch the parent item and its related items. That is, we want to
fetch a GitHub App and all of its installations, and we want to fetch
a Repo and all of its installed apps.
Further, there is no limit to the number of installations a single
GitHub App may have.
Because of these patterns, we’ll go with the adjacency list pattern for
many-to-many relationships. We’ll create a separate
AppInstallation item that will track the installation of a particular
app into a particular repository. We will include this
AppInstallation item in an item collection for both the parent app
as well as the repository to which it was installed.
Our Repo item is already pretty busy in the base table. We’re
storing Issues in there as well as Stars. Accordingly, we’ll have to use
a secondary index for that item collection.
Our GitHub App item, however, is not as busy. We just created this
item in the last step, and there’s nothing in its item collection yet.
Let’s put our AppInstallation items in there.
Further, our Repo item only has one relationship modeled in the
GSI1 index. We can handle the other side of the many-to-many
relationship there.
AppInstallation:
• PK: APP#<AccountName>#<AppName>
• SK: REPO#<RepoOwner>#<RepoName>
• GSI1PK: REPO#<RepoOwner>#<RepoName>
• GSI1SK: REPOAPP#<AppOwner>#<AppName>
With these items added, our base table looks as follows:
Notice the AppInstallation item and GitHub App item outlined in
red. They are in the same item collection. Because the two items are
in the same item collection, we can use a Query API action to fetch
a GitHub App and all of its installations.
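As a rough sketch (with a hypothetical app name), that Query could
be written as follows:

import boto3

dynamodb = boto3.client('dynamodb')

result = dynamodb.query(
    TableName='GitHubTable',
    # The GitHub App item and its AppInstallation items share the same PK,
    # so a Query on the partition key alone returns the app and every installation.
    KeyConditionExpression="#pk = :pk",
    ExpressionAttributeNames={
        "#pk": "PK",
    },
    ExpressionAttributeValues={
        ":pk": { "S": "APP#alexdebrie#MyApp" },
    }
)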
Switching over to the GSI1 index, our Repo item and all of the
AppInstallations for the Repo are now located in the same item
collection (outlined in red).
Note that we didn’t need to add a new secondary index here; we were
able to reuse the existing GSI1 index to co-locate Repos with their
AppInstallations.
Next, let’s handle Issues and Pull Requests, where we want to fetch a
Repo together with its Issues or Pull Requests filtered by status
(open or closed). Filtering on two properties like this is a great use
case for the composite sort key pattern that we discussed in Chapter
13. We can have a sort key that
combines the Status and the Issue Number using a pattern of
ISSUE#<Status>#<IssueNumber>. For example, for Issue Number 15
that has a Status of "Closed", the sort key would be
ISSUE#CLOSED#00000015 (recall that we’re zero-padding the Issue
Number to 8 digits to allow for proper lexicographical sorting).
Likewise, if Issue Number 874 has a Status of "Open", the sort key
would be ISSUE#OPEN#00000874.
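As a tiny illustration (the helper function is ours, not from the
original text), building this composite sort key in Python might look
like:

def issue_sort_key(status, issue_number):
    # Zero-pad the issue number to 8 digits for proper lexicographic sorting.
    return f"ISSUE#{status.upper()}#{issue_number:08d}"

print(issue_sort_key("Closed", 15))   # ISSUE#CLOSED#00000015
print(issue_sort_key("Open", 874))    # ISSUE#OPEN#00000874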
With that pattern, the items in a Repo’s item collection would be
ordered as follows:
• Closed Issues
• Open Issues
• Repo
The problem is that we want to fetch the Repo together with either
its open Issues or its closed Issues, but with this ordering, the open
Issues sit between the Repo and the closed Issues. One way you
could solve this is to prefix the sort keys of the closed Issues and the
Repo with a #. Your pattern would be as follows:
SK values:
• Repo: #REPO#<RepoOwner>#<RepoName>
• Closed Issue: #ISSUE#CLOSED#<IssueNumber>
• Open Issue: ISSUE#OPEN#<IssueNumber>
Now our Repo item is between the closed Issue and open Issue
items. However, we have a new problem. When fetching our Repo
item and open Issues, we’ll be fetching the open Issues in ascending
order, which will give us the oldest items first. We don’t want that,
so let’s think of something else.
Instead, we can flip the sort order of the open Issues by storing an
"issue number difference" in the sort key rather than the Issue
Number itself. Let’s see how this works in practice. For open Issues,
we will create the sort key in the following manner:
1. Subtract the Issue Number from 99999999 to get the
<IssueNumberDifference>.
2. In the sort key, use the pattern of
ISSUE#OPEN#<IssueNumberDifference>.
If your issue number was 15, the issue number difference would be
99999984 (99999999 - 15). Thus, the sort key for that issue would be
ISSUE#OPEN#99999984.
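Again as a small sketch (helper name ours), computing that sort key
might look like:

MAX_PADDED_NUMBER = 99999999

def open_issue_sort_key(issue_number):
    # Subtracting from the maximum 8-digit value reverses the sort order,
    # so newer issues sort first when reading the item collection forward.
    difference = MAX_PADDED_NUMBER - issue_number
    return f"ISSUE#OPEN#{difference:08d}"

print(open_issue_sort_key(15))  # ISSUE#OPEN#99999984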
Great! Our open issues are now sorted in descending order even
while searching forward in the item collection. This allows us to use
one secondary index to handle both open and closed issue access
patterns.
The pattern is basically the same for Pull Requests. We’ll use the
pull request number difference to order open pull requests as needed.
Note that all secondary indexes are already being used by the Repo
item for other access patterns. Given that, we’ll need to create two
new secondary indexes: GSI4 and GSI5.
Repos:
• GSI4PK: REPO#<RepoOwner>#<RepoName>
• GSI4SK: #REPO#<RepoOwner>#<RepoName>
• GSI5PK: REPO#<RepoOwner>#<RepoName>
• GSI5SK: #REPO#<RepoOwner>#<RepoName>
Issues:
• GSI4PK: REPO#<RepoOwner>#<RepoName>
• GSI4SK: #ISSUE#CLOSED#<IssueNumber> (Closed Issue)
• GSI4SK: ISSUE#OPEN#<IssueNumberDifference> (Open Issue)
Pull Requests:
• GSI5PK: REPO#<RepoOwner>#<RepoName>
• GSI5SK: #PR#CLOSED#<PRNumber> (Closed Pull Request)
• GSI5SK: PR#OPEN#<PRNumberDifference> (Open Pull Request)
Note that all of these are existing items, so you’ll need to do the
Scan + UpdateItem pattern that was shown above in the GitHub
Apps section to decorate your items with the new attributes.
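To tie it together, here is a hedged sketch of the "Fetch Repo and
open Issues" Query against GSI4 (the repo name is illustrative):

import boto3

dynamodb = boto3.client('dynamodb')

result = dynamodb.query(
    TableName='GitHubTable',
    IndexName='GSI4',
    # Start at the Repo item (#REPO#...) and read forward: the Repo comes first,
    # followed by open Issues in descending issue-number order. Closed Issues
    # (#ISSUE#CLOSED#...) sort before the Repo and are excluded by the condition.
    KeyConditionExpression="#pk = :pk AND #sk >= :sk",
    ExpressionAttributeNames={
        "#pk": "GSI4PK",
        "#sk": "GSI4SK",
    },
    ExpressionAttributeValues={
        ":pk": { "S": "REPO#alexdebrie#example-repo" },
        ":sk": { "S": "#REPO#alexdebrie#example-repo" },
    }
)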
22.4. Conclusion
Our new entity charts look as follows. First, the base table:
Entity   PK                        SK
Repo     REPO#<Owner>#<RepoName>   REPO#<Owner>#<RepoName>
And GSI1:
Entity      GSI1PK                   GSI1SK
GitHubApp   ACCOUNT#<AccountName>    APP#<AppName>
And GSI4: