Dzone Refcard 335 Distributed SQL 2020
Dzone Refcard 335 Distributed SQL 2020
Essentials
• Fundamentals of Distributed SQL
Architecture
• Conclusion
ANDREW OLIVER
SENIOR DIRECTOR OF PRODUCT MARKETING, YUGABYTE
Distributed SQL databases combine the resilience and scalability Traditional relational databases were architected for transactional
of a NoSQL database with the full functionality of a relational integrity, low disk storage, and high performance, all on a single-
database. Since the 1980s, the relational model and SQL query server architecture. This requires a larger server or instance
language have been the dominant approach that businesses adopt whenever there is an increase in the amount of data, query volume,
to develop critical applications. While the 2010s saw the emergence or data complexity. Monolithic architectures do not support high
of databases from NoSQL to various data stores in the Hadoop availability and disaster recovery.
ecosystem, most mission-critical transactional applications have
Clustering technologies like Oracle RAC have addressed some of the
remained on traditional relational databases like Oracle and SQL
issues of high availability, and replication technologies like Oracle
Server.
Streams (Golden Gate) have allowed for forms of disaster recovery
WHAT IS DISTRIBUTED SQL? and cross-data center replication, they come at a heavy cost in both
Distributed SQL databases are becoming popular with organizations dollars and performance.
• A SQL API for querying and modeling data, with support for
traditional RDBMS features like foreign keys, partial indexes,
stored procedures, and triggers.
1
The SQL features you know and love, now
cloud native, geo-distributed,
and proven at massive scale
Learn more
REFCARD | DISTRIBUTED SQL ESSENTIALS
application complexity and do not deliver performance at “Internet Data in a distributed SQL database is exposed as tables but stored in
scale” without in-memory caching technologies. When combined, key object pairs. Tables in distributed SQL databases are divided into
caching, clustering, replication, and manual sharding strategies lead smaller units, which the Spanner paper refers to as “tablets.”
to an extremely brittle architecture.
Each tablet is replicated to a number of peers. Peers live on separate
NoSQL has not been a panacea, either. Transactional integrity, instances and generally separate availability zones and data centers.
flexible query, and join functionality are the primary reasons For instance, if the replication factor is 3, then there are three
organizations have not adopted NoSQL technologies outside of low tablet peers for each tablet. Data between tablet peers is strongly
risk applications and niche areas where there was no choice in order consistent.
to deal with the required high data volume and multi-data center
deployments. Traditional databases excel at joins and transactional
data integrity, but not when combined with the requirement to
handle large data volumes and global data replication.
functionality as distributed SQL databases. In some cases, they are GraphQL enables developers to think in terms of a natural
merely a sharding strategy built on existing monolithic databases like application structure. Visualize the data structures the application
MySQL and PostgreSQL. Some do not provide global consistency or uses as an object graph, rather than as queries and objects. Update
failover. sections of the object graph and allow GraphQL to take care of
remoting and updates. For instance, you might ask for “articles
Distributed SQL databases are a subset of NewSQL databases that
written by John Doe.”
scale, shard, failover, and provide global consistency. Because
there is no governing standard for distributed SQL databases, some
NewSQL vendors have begun calling their systems “distributed SQL,”
despite lacking distributed SQL features.
FUNDAMENTALS OF DISTRIBUTED
SQL ARCHITECTURE
Using or migrating to a distributed SQL database requires
considering your workload, application architecture, performance,
consistency requirements, geographic topography, and execution
environment. Match application requirements to the capabilities and There are other tools, like Gatsby, used with Hasura to facilitate rapid
configuration of a distributed SQL database to ensure your business application development. Gatsby provides an IDE that allows you to
and customer needs are met. compose GraphQL queries by exploring a Hasura object graph. At the
core of Hasura-powered GraphQL apps is a PostgreSQL-compatible
CONSIDER YOUR WORKLOAD
database used to store both data and metadata. When used with
Distributed SQL databases are good for moving mission-critical and PostgreSQL-compatible distributed SQL databases, Hasura apps are
system-of-record workloads to cloud computing environments while more resilient, scalable, and geo-distributed beyond the capabilities
Scalability requirements are expressed in terms of queries per Distributed SQL databases also allow you to configure certain tables
second, writes per second, reads per second, data volume, and to be stored together. In the classic case of an Order and Order
relative growth. This enables the cluster to be sized appropriately. Details table, which are frequently queried together, the tables can
be stored on the same instance. Alternatively, just the related rows
Latency requirements should be expressed as nominal and
can be stored together. This feature allows queries to be executed
maximum delay for writes and reads on the system. As the
locally rather than in a distributed fashion.
application matures, develop more specific requirements for
different components or services. For instance, a sale might take When configuring the database, keep in mind the requirements
10ms, but other requests might be allowed to take 3ms. for consistency, geographic location, and latency for both reads
and writes. Different availability or geographic zones can serve as
UNDERSTAND CONSISTENCY
replicas or leaders. Transactional guarantees can be strengthened
Business-critical applications rely upon ACID guarantees. When or weakened as appropriate for data integrity and performance
data is written in one row in one table on one server, it is easy for considerations.
Ensuring consistency requires some degree of locking or copying and Compared to monolithic databases, distributed databases of all
waiting on acknowledgment, so when seeking higher throughput or types are fundamentally more complex to operate. Distributed
lower latency, it may make sense to compromise read consistency databases offer horizontal scalability and the ability to scale
guarantees for select datasets or clients. individual resources, but there are a lot of pieces to deploy,
administer, and potentially repair.
Write consistency should never be compromised for system-of-record
applications. For anything from ever-growing data to informational Organizations are turning to Kubernetes to orchestrate and deploy
status, consistency may not be as important, or at least not equally applications, services, and databases. When deciding whether to use
as important to all applications or application clients. Kubernetes (K8s) to deploy a distributed SQL database, keep a few
things in mind: the Kubernetes implementation, overhead, and the
The database may offer different modes of isolation, for instance
database performance requirements.
REPEATABLE_READ and SERIALIZABLE. These offer different tradeoffs
in terms of consistency and performance under concurrent access. In order to simplify the container management in Kubernetes, the
Making these compromises in integrity depends on your database’s “Helm” package manager was created. Helm has quickly become the
capabilities, the nature of the data, and the needs of your application. standard way to manage Kubernetes but is less mature overall and
has some well-known weaknesses including security.
GEOGRAPHIC LOCALITY
While Kubernetes is open source and can be downloaded, installed,
Distributed SQL databases are able to assure consistency across
and used for free, each cloud service provider (CSP) also provides its
availability and geographic zones. Multi-leader (i.e. multi-
own version of Kubernetes. A CSP’s implementation of Kubernetes is
primary) configurations consistently replicate writes made in any
usually cheaper to operate because CSPs optimize it for their cloud.
zone to every other zone. “Read replica” and “leader-follower”
DevOps tooling to deploy and run distributed SQL databases is
configurations can be used where writes are fewer than reads and
specific to each public cloud. The steps for installing and managing
perform better than multi-leader configurations.
an application via Amazon’s Elastic Kubernetes Service (EKS), Google
Kubernetes Engine (GKE), and Azure Kubernetes Service (AKS) are all
different.
most developers, architects, and operations people, the tradeoff is AVERAGE LATENCY
worthwhile — however, not for all applications, services, or datasets.
The average latency of the system is a measure of performance.
CLOUD, HOME, OR HYBRID Latency is how long writes and reads are taking.
On any system under test or diagnosis, queries should be captured, and requires either extensive DevOps expertise and tooling or a
and the most common or longest running queries should be analyzed distributed SQL management platform that takes care of the details.
with the database’s equivalent of “explain plan,” which describes This is also true when monitoring key performance and cost metrics.
how the database executes the query. The rewards of distributed SQL technology including globally
consistent transactions, high availability, and scalability make these
In a distributed SQL database, some types of query problems are challenges worthwhile especially when deployed with the right
exacerbated by the distributed nature of the database. For example, DevOps tooling and architecture.
joining two full table scans would generate a lot of reads and merges
all over the network rather than just disk/CPU thrashing like in a FURTHER READING
monolithic database. • The Spanner Paper
(2 rows)
ANDREW OLIVER,
Explain plan on showing an index scan SENIOR DIRECTOR OF PRODUCT MARKETING, YUGABYTE