Dzone Refcard 335 Distributed SQL 2020


BROUGHT TO YOU IN PARTNERSHIP WITH

Distributed SQL Essentials

CONTENTS

• What Is Distributed SQL?

• Fundamentals of Distributed SQL Architecture

• Key Operational Metrics

• Conclusion

ANDREW OLIVER
SENIOR DIRECTOR OF PRODUCT MARKETING, YUGABYTE

Distributed SQL databases combine the resilience and scalability of a NoSQL database with the full functionality of a relational database. Since the 1980s, the relational model and the SQL query language have been the dominant approach that businesses adopt to develop critical applications. While the 2010s saw the emergence of databases from NoSQL to various data stores in the Hadoop ecosystem, most mission-critical transactional applications have remained on traditional relational databases like Oracle and SQL Server.

WHAT IS DISTRIBUTED SQL?

Distributed SQL databases are becoming popular with organizations interested in moving data infrastructure to the cloud and/or cloud-native environments in order to reduce TCO and move away from the horizontal scaling limitations of monolithic RDBMSs like Oracle, PostgreSQL, MySQL, and SQL Server. The basic characteristics of distributed SQL are:

• A SQL API for querying and modeling data, with support for traditional RDBMS features like foreign keys, partial indexes, stored procedures, and triggers.

• Automatic distributed query execution so that no single node becomes a bottleneck.

• Automatic and transparent distributed data storage, including indexes, sharded across multiple nodes of the cluster so that no single node becomes a bottleneck. Data distribution ensures high performance and high availability.

• Strongly consistent replication and distributed ACID transactions.

Traditional relational databases were architected for transactional integrity, low disk storage, and high performance, all on a single-server architecture. This requires a larger server or instance whenever there is an increase in the amount of data, query volume, or data complexity. Monolithic architectures do not support high availability and disaster recovery.

Although clustering technologies like Oracle RAC have addressed some of the issues of high availability, and replication technologies like Oracle Streams (GoldenGate) have allowed for forms of disaster recovery and cross-data center replication, they come at a heavy cost in both dollars and performance.

Traditional databases added forms of data partitioning to better use storage hardware, but these partitioning methods increase
application complexity and do not deliver performance at “Internet scale” without in-memory caching technologies. When combined, caching, clustering, replication, and manual sharding strategies lead to an extremely brittle architecture.

NoSQL has not been a panacea, either. The lack of transactional integrity, flexible queries, and join functionality is the primary reason organizations have not adopted NoSQL technologies outside of low-risk applications and niche areas where there was no other way to handle the required high data volume and multi-data center deployments. Traditional databases excel at joins and transactional data integrity, but not when combined with the requirement to handle large data volumes and global data replication.

The promise of distributed SQL databases is that they allow you to achieve low latency, global distribution, and high availability at Internet scale without sacrificing data integrity, full RDBMS features (such as joins), or a standard query language (SQL). These new distributed architectures embrace and take advantage of modern technologies like cloud computing and Kubernetes.

HOW DISTRIBUTED SQL DATABASES WORK

NoSQL databases dropped much of the functionality of SQL databases because that functionality is difficult to implement at scale. Features like transactional consistency were not as strictly important for the “big data” use cases these databases were created for. As Google faced increasingly critical workloads, it developed a technology called “Spanner,” which was outlined in a paper published in 2012. The basic structure outlined in the paper is common across all distributed SQL databases.

Distributed SQL databases include a SQL execution engine that distributes queries to multiple servers. They include a distributed storage engine that shards and replicates data.

Google Spanner and most distributed SQL databases follow the CAP theorem as “CP” databases, meaning they are consistent and partition tolerant. According to the CAP theorem, this means they sacrifice “100% availability.” However, this is a “theoretical” analysis of their storage and replication model. If the database achieves a high enough level of reliability through other means, then theoretical fault tolerance is less of a concern.

Data in a distributed SQL database is exposed as tables but stored as key-value pairs. Tables in distributed SQL databases are divided into smaller units, which the Spanner paper refers to as “tablets.”

Each tablet is replicated to a number of peers. Peers live on separate instances and generally in separate availability zones and data centers. For instance, if the replication factor is 3, then there are three tablet peers for each tablet. Data between tablet peers is strongly consistent.

One tablet peer is elected as the leader using a consensus protocol (Paxos in Spanner, Raft in most other distributed SQL databases). When an application writes to the database, a set of keys is locked and the state change is replicated among the peers in strict order. A log ensures ordered data replication regardless of failures or cluster changes.

Distributed SQL databases allow for multi-zone and multi-region deployments, where data for each tablet is replicated using the consensus protocol. Tablet leaders coordinate writes and are usually distributed evenly among multiple zones or regions. Alternatively, all tablet leaders can be pinned to a preferred zone or region when the application’s reads or writes predominantly originate from that geographic location.

In the event a zone or region becomes unavailable, a new tablet leader is elected in one of the remaining zones or regions.
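
The tablet, peer, and leader behavior described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular database implements it: hash sharding is assumed for mapping keys to tablets, the zone names, the `Tablet` class, and the function names are all hypothetical, and the "election" simply promotes the first surviving peer instead of running a real consensus protocol.

```python
import hashlib
from dataclasses import dataclass, field

ZONES = ["zone-a", "zone-b", "zone-c"]  # hypothetical zone names
REPLICATION_FACTOR = 3
NUM_TABLETS = 4


def tablet_for_key(key: str, num_tablets: int = NUM_TABLETS) -> int:
    """Map a row key to a tablet by hashing it (hash sharding is an assumption)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_tablets


@dataclass
class Tablet:
    tablet_id: int
    peers: list   # zones holding a copy of this tablet, one peer per zone
    leader: str   # zone of the current leader
    log: list = field(default_factory=list)  # ordered replication log

    def majority(self) -> int:
        """Smallest number of peers that forms a majority."""
        return len(self.peers) // 2 + 1

    def write(self, key: str, value: str, live_zones: set) -> bool:
        """Append to the log only if the leader and a majority of peers are reachable."""
        reachable = [zone for zone in self.peers if zone in live_zones]
        if self.leader not in live_zones or len(reachable) < self.majority():
            return False
        self.log.append((len(self.log) + 1, key, value))
        return True

    def fail_over(self, live_zones: set) -> bool:
        """Promote a surviving peer if the leader is gone and a majority remains."""
        survivors = [zone for zone in self.peers if zone in live_zones]
        if len(survivors) < self.majority():
            return False
        if self.leader not in live_zones:
            self.leader = survivors[0]  # stand-in for a real election
        return True


def build_cluster() -> list:
    """Create tablets with one peer per zone and leaders spread round-robin."""
    return [
        Tablet(
            tablet_id=i,
            peers=list(ZONES[:REPLICATION_FACTOR]),
            leader=ZONES[i % len(ZONES)],
        )
        for i in range(NUM_TABLETS)
    ]


if __name__ == "__main__":
    cluster = build_cluster()
    live = set(ZONES) - {"zone-a"}  # simulate losing one zone
    for tablet in cluster:
        tablet.fail_over(live)
        print(tablet.tablet_id, tablet.leader)
```

With a replication factor of 3, each tablet in this model keeps accepting writes after one zone is lost, because two of its three peers still form a majority. With only one zone left, writes are refused, which is the "CP" tradeoff in miniature.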

NEWSQL VS. DISTRIBUTED SQL

Before the Spanner paper, there were a number of “NewSQL” databases developed. These databases do not offer the same level of
functionality as distributed SQL databases. In some cases, they are merely a sharding strategy built on existing monolithic databases like MySQL and PostgreSQL. Some do not provide global consistency or failover.

Distributed SQL databases are a subset of NewSQL databases that scale, shard, fail over, and provide global consistency. Because there is no governing standard for distributed SQL databases, some NewSQL vendors have begun calling their systems “distributed SQL,” despite lacking distributed SQL features.

FUNDAMENTALS OF DISTRIBUTED SQL ARCHITECTURE

Using or migrating to a distributed SQL database requires considering your workload, application architecture, performance, consistency requirements, geographic topography, and execution environment. Match application requirements to the capabilities and configuration of a distributed SQL database to ensure your business and customer needs are met.

CONSIDER YOUR WORKLOAD

Distributed SQL databases are good for moving mission-critical and system-of-record workloads to cloud computing environments while providing resilience and consistency across geo-redundant zones.

Some distributed SQL databases also handle low-latency and/or time-series workloads, similar to NoSQL databases like Cassandra. Select a distributed SQL database capable of supporting all of your workloads.

To migrate an existing application, consider whether the distributed SQL database emulates your previous database’s APIs or reuses its front end (e.g., PostgreSQL). If an application is written using an object-relational mapping (ORM) tool, consider a distributed SQL database that is well supported by that ORM.

SELECT THE APPROPRIATE APPLICATION ARCHITECTURE

Modern applications use an object mapper, which maps language objects to the underlying database elements. For distributed SQL databases, this is a tool like Hibernate (Java), Sequelize (JavaScript), or Prisma (TypeScript). ORM tools allow developers to focus on application and business logic rather than data storage.

Many developers now use GraphQL tools like Hasura or Apollo. These tools help reduce communication from the client to the middle tier and finally the database. Using GraphQL, the overall system can be optimized to achieve better performance than database design or tuning alone.

GraphQL enables developers to think in terms of a natural application structure. Visualize the data structures the application uses as an object graph, rather than as queries and objects. Update sections of the object graph and allow GraphQL to take care of remoting and updates. For instance, you might ask for “articles written by John Doe.”

There are other tools, like Gatsby, used with Hasura to facilitate rapid application development. Gatsby provides an IDE that allows you to compose GraphQL queries by exploring a Hasura object graph. At the core of Hasura-powered GraphQL apps is a PostgreSQL-compatible database used to store both data and metadata. When used with PostgreSQL-compatible distributed SQL databases, Hasura apps are more resilient, scalable, and geo-distributed beyond the capabilities of traditional RDBMSs.

When selecting a distributed SQL database, understand what tools and technologies work with the database. Your application stack should have good examples and tutorials for the database that you choose. While distributed SQL databases support connectivity standards like JDBC or ODBC, this may not be sufficient to prove compatibility with your overall stack. Development blogs and documentation are good sources of compatibility information.

MAKE AND MEET PERFORMANCE REQUIREMENTS

Using your business requirements as a foundation, establish technical requirements for scalability and latency.


Scalability requirements are expressed in terms of queries per second, writes per second, reads per second, data volume, and relative growth. This enables the cluster to be sized appropriately.

Latency requirements should be expressed as nominal and maximum delay for writes and reads on the system. As the application matures, develop more specific requirements for different components or services. For instance, a sale might take 10ms, but other requests might be allowed to take 3ms.

UNDERSTAND CONSISTENCY

Business-critical applications rely upon ACID guarantees. When data is written in one row in one table on one server, it is easy for databases to be “ACID.” When multiple rows or multiple tables are written, it is important to understand the database’s consistency guarantees and their failure modes.

Ensuring consistency requires some degree of locking or copying and waiting on acknowledgment, so when seeking higher throughput or lower latency, it may make sense to compromise read consistency guarantees for select datasets or clients.

Write consistency should never be compromised for system-of-record applications. For anything from ever-growing data to informational status, consistency may not be as important, or at least not equally important to all applications or application clients.

The database may offer different modes of isolation, for instance REPEATABLE READ and SERIALIZABLE. These offer different tradeoffs in terms of consistency and performance under concurrent access. Making these compromises in integrity depends on your database’s capabilities, the nature of the data, and the needs of your application.

GEOGRAPHIC LOCALITY

Distributed SQL databases are able to assure consistency across availability and geographic zones. Multi-leader (i.e., multi-primary) configurations consistently replicate writes made in any zone to every other zone. “Read replica” and “leader-follower” configurations can be used where writes are fewer than reads, and they perform better than multi-leader configurations.

Distributed SQL databases also allow you to configure certain tables to be stored together. In the classic case of an Order and Order Details table, which are frequently queried together, the tables can be stored on the same instance. Alternatively, just the related rows can be stored together. This feature allows queries to be executed locally rather than in a distributed fashion.

When configuring the database, keep in mind the requirements for consistency, geographic location, and latency for both reads and writes. Different availability or geographic zones can serve as replicas or leaders. Transactional guarantees can be strengthened or weakened as appropriate for data integrity and performance considerations.

EXECUTION ENVIRONMENT

TO K8S OR NOT TO K8S

Compared to monolithic databases, distributed databases of all types are fundamentally more complex to operate. Distributed databases offer horizontal scalability and the ability to scale individual resources, but there are a lot of pieces to deploy, administer, and potentially repair.

Organizations are turning to Kubernetes to orchestrate and deploy applications, services, and databases. When deciding whether to use Kubernetes (K8s) to deploy a distributed SQL database, keep a few things in mind: the Kubernetes implementation, overhead, and the database performance requirements.

In order to simplify container management in Kubernetes, the “Helm” package manager was created. Helm has quickly become the standard way to manage Kubernetes applications but is less mature overall and has some well-known weaknesses, including security.

While Kubernetes is open source and can be downloaded, installed, and used for free, each cloud service provider (CSP) also provides its own version of Kubernetes. A CSP’s implementation of Kubernetes is usually cheaper to operate because CSPs optimize it for their cloud. DevOps tooling to deploy and run distributed SQL databases is specific to each public cloud. The steps for installing and managing an application via Amazon’s Elastic Kubernetes Service (EKS), Google Kubernetes Engine (GKE), and Azure Kubernetes Service (AKS) are all different.

Kubernetes does involve a considerable amount of overhead. It abstracts from bare metal disk I/O and performs a considerable amount of network communication. The choice is between greater manageability and performance optimization. While the answer from a capacity standpoint may be a few more nodes, from a latency perspective, there will always be a cost to Kubernetes. For
most developers, architects, and operations people, the tradeoff is worthwhile; however, not for all applications, services, or datasets.

CLOUD, HOME, OR HYBRID

Many applications and services have moved to public clouds. Some distributed SQL databases are offered as a vendor-hosted database as a service. Although it may be more costly, the distributed SQL database provider has the most expertise on running that type of database and can apply the best practices learned across multiple customer installs. The cost may also balance out when labor, hardware, and risk are accounted for.

The downside of a managed offering is the loss of control. A vendor may not support a really old version of their database because of costs. An organization may not want to assume the risk of an upgrade until their application or service is revised.

Some applications, datasets, or industries cannot consider the public cloud due to cost, sensitivity of the data, or specific security requirements. Running locally also may achieve lower latency with data at the edge of the network.

Many organizations have a combination of sensitive and less-sensitive data and may want a hybrid cloud consisting of local instances as well as public cloud instances. Data may even be replicated between private and public instances in various configurations.

While there are many potential advantages to a hybrid cloud approach to database hosting, it is more complex to operate and requires a platform or tool to manage the database cost effectively.

KEY OPERATIONAL METRICS

When operating your distributed SQL database, some key metrics should be monitored. These metrics help determine whether your system is healthy and your business requirements are being met.

OPERATIONS PER SECOND

Operations per second are how many reads and writes are performed per second. This is an overall measurement of how much capacity is being used. Upon deployment, test the initial capacity of the overall system. During operation, monitor the reads/writes to determine how much capacity the system has remaining.

Figure: 3-4k reads per second with 44 writes per second

AVERAGE LATENCY

The average latency of the system is a measure of performance. Latency is how long writes and reads are taking.

Figure: Sub-millisecond reads and writes

DISK USAGE

Even in 2020, people run out of disk space! It is critical to monitor how much space is left in the database. With a distributed SQL database, disk space can be grown by adding more nodes, but running out of disk space while at full capacity and with clients waiting is best avoided. Even in a distributed SQL database, you need to monitor to ensure you do not run out of resources.

CPU USAGE

CPU utilization is a good measure of the execution capacity remaining in the system.

COST

In a cloud environment, waiting for the bill at the end of the month or quarter can be a nasty surprise. Monitor the cost per day and cost per month of running a system such as a distributed SQL database. Major changes in cost without matching increases in utilization and revenue require investigation.
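
The checks described in this section can be expressed as a small script. The sketch below is illustrative only: the `Sample` fields, the thresholds, and the function names are assumptions, not part of any database's monitoring API, and a real deployment would pull these numbers from its own metrics system.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Sample:
    """One monitoring interval. All field names are hypothetical."""
    seconds: float         # length of the interval
    reads: int             # reads completed in the interval
    writes: int            # writes completed in the interval
    latencies_ms: list     # observed request latencies in milliseconds
    disk_used_gb: float
    disk_total_gb: float
    cpu_percent: float     # average CPU utilization over the interval
    cost_usd: float        # spend attributed to the interval


def ops_per_second(s: Sample) -> float:
    """Reads plus writes, divided by the interval length."""
    return (s.reads + s.writes) / s.seconds


def average_latency_ms(s: Sample) -> float:
    return mean(s.latencies_ms)


def disk_free_fraction(s: Sample) -> float:
    return 1 - s.disk_used_gb / s.disk_total_gb


def cost_needs_investigation(prev: Sample, curr: Sample, tolerance: float = 0.25) -> bool:
    """Flag a cost increase that outpaces the growth in operations per second."""
    cost_growth = curr.cost_usd / prev.cost_usd - 1
    load_growth = ops_per_second(curr) / ops_per_second(prev) - 1
    return cost_growth - load_growth > tolerance


def health_report(
    s: Sample,
    max_avg_latency_ms: float = 10.0,   # assumed latency requirement
    min_disk_free: float = 0.20,        # assumed headroom before adding nodes
    max_cpu_percent: float = 80.0,      # assumed CPU ceiling
) -> dict:
    """Summarize one interval against the assumed thresholds."""
    return {
        "ops_per_second": ops_per_second(s),
        "latency_ok": average_latency_ms(s) <= max_avg_latency_ms,
        "disk_ok": disk_free_fraction(s) >= min_disk_free,
        "cpu_ok": s.cpu_percent <= max_cpu_percent,
    }
```

Comparing cost growth against load growth, rather than looking at cost alone, follows the advice above: a bill that rises with utilization is expected, while one that rises on its own warrants a closer look.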


QUERY PLANS

In architectures based on tools like Hasura or Hibernate, queries may be machine-generated. With any database, queries may not always be the most efficient, or efficient enough, and may require indexes to be created.

On any system under test or diagnosis, queries should be captured, and the most common or longest-running queries should be analyzed with the database’s equivalent of “explain plan,” which describes how the database executes the query.

In a distributed SQL database, some types of query problems are exacerbated by the distributed nature of the database. For example, joining two full table scans would generate a lot of reads and merges all over the network rather than just disk/CPU thrashing like in a monolithic database.

EXPLAIN SELECT * FROM employees WHERE salary=2000;

QUERY PLAN
---------------------------------------------------------------------------
Index Scan using idx_emp_salary on employees (cost=... rows=... width=...)
  Index Cond: (salary = 2000)

(2 rows)

Explain plan showing an index scan

CONCLUSION

Distributed SQL databases are critical infrastructure for taking systems of record to the cloud or operating them at Internet scale. These systems match the scale and resilience of a NoSQL database with the full-featured performance of a relational database. Whether the consideration is for an existing workload or a new system of record, any application that requires scale, low latency, transactional integrity, or general resilience is a good candidate for a distributed SQL database.

Anyone architecting a modern application that takes advantage of a distributed database should investigate tools like Hasura or Prisma as well as object-relational mapping tools like Hibernate. These tools may allow for overall distributed application optimization, including optimizing database operations. Deploying and managing a distributed database is more complex than a monolithic database and requires either extensive DevOps expertise and tooling or a distributed SQL management platform that takes care of the details. This is also true when monitoring key performance and cost metrics. The rewards of distributed SQL technology, including globally consistent transactions, high availability, and scalability, make these challenges worthwhile, especially when deployed with the right DevOps tooling and architecture.

FURTHER READING

• The Spanner Paper

• What is Distributed SQL

• Distributed SQL vs NewSQL

• Beyond SQL, the Case for Distributed SQL

ANDREW OLIVER, SENIOR DIRECTOR OF PRODUCT MARKETING, YUGABYTE

Andrew C. Oliver is the Senior Director of Product Marketing for Yugabyte. He is a prolific writer about technology, particularly open source and distributed database technologies. In the past, he’s served on the board of the Open Source Initiative, founded Apache POI, and was an early part of JBoss, Inc. before its acquisition by Red Hat. Find him over on Twitter @acoliver.

DZone, a Devada Media Property, is the resource software developers, engineers, and architects turn to time and again to learn new skills, solve software development problems, and share their expertise. Every day, hundreds of thousands of developers come to DZone to read about the latest technologies, methodologies, and best practices. That makes DZone the ideal place for developer marketers to build product and brand awareness and drive sales. DZone clients include some of the most innovative technology and tech-enabled companies in the world including Red Hat, Cloud Elements, Sensu, and Sauce Labs.

Devada, Inc.
600 Park Offices Drive
Suite 150
Research Triangle Park, NC 27709
888.678.0399 919.678.0300

Copyright © 2020 Devada, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means of electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.
