Distributed Databases: SQL vs NoSQL
Seda Unal, Yuchen Zheng
April 23, 2017
1 Introduction
Distributed databases have become increasingly popular in the era of big
data because of their advantages over traditional centralized databases. This
project examines distributed databases from a relational versus non-relational
perspective, introducing SQL and NoSQL and discussing their advantages and
disadvantages in a distributed setting.
A distributed database is a database in which portions of the data are
stored in multiple physical locations and processing is distributed among
multiple database nodes. SQL (Structured Query Language) is the standard
language for relational database management systems. In a relational database,
data is organized into relations (tables), and the definition of those tables
and the relationships between them is called the schema. The need to store,
process and analyze unstructured data led to the development of schema-less
alternatives to SQL databases, collectively known as NoSQL ("not only SQL").
This project compares SQL and NoSQL for distributed databases on several
aspects: model features, data integrity, flexibility and scalability.
2 Distributed Databases
A distributed database is a collection of multiple, logically related databases
distributed over a computer network [1]. Compared with a traditional centralized
database system, it offers advantages such as improved performance, speed and
resource efficiency. However, it also has disadvantages, such as increased
complexity and the difficulty of maintaining data integrity.
A distributed database system allows applications to access data from local
and remote databases. With the data stored on multiple computers, users
across the network can access and modify the data simultaneously. A DDBMS
(distributed database management system) controls all database servers and
maintains the consistency of the global database. Distributed databases use a
client-server-node architecture to process information requests: a client is
an application that requests information from a server, a server is the
software that manages a database, and a node in a distributed database system
can be either a client or a server.
In recent years, preference has gradually moved to distributed database
systems, motivated by the distributed nature of organizational units, support
for both OLTP (online transaction processing) and OLAP (online analytical
processing), and database recovery [2, 3]. Compared with centralized databases,
distributed databases offer higher reliability, higher speed and lower
communication cost. A failure disrupts only part of a distributed database, so
the system is more reliable: it does not come to a halt because of a single
fault. Because data is stored on multiple computers and does not all have to be
sent to a central computer for processing, users typically get faster
responses. Lastly, data in a distributed database is located close to where it
is mostly used, which minimizes the communication cost of data manipulation
[3, 4].
3 SQL vs NoSQL in Distributed Databases
SQL and NoSQL databases both aim to keep data storage and retrieval efficient,
and each has advantages over the other in certain scenarios [5, 7, 8].
Consider a social platform where users engage with each other through
posts associated with images, audio, video, comments, links to websites, etc.
With SQL, a query with many joins is necessary to retrieve the content of a
post. In addition to the complexity of the data, consider a stream of posts
that loads dynamically: every newly loaded batch of posts requires further
queries and joins.
One solution to this problem is to store each post in JSON, a flexible data
format that is widely supported. Another approach, from a non-relational
perspective, is to use a NoSQL database, which simplifies this specific
scenario.
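The contrast in the social platform scenario can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table names, column names and sample post are hypothetical, the standard-library sqlite3 module stands in for a relational database, and a plain dictionary of JSON strings stands in for a document store.

```python
import json
import sqlite3

# Relational layout (hypothetical schema): one table per kind of content
# attached to a post.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts    (id INTEGER PRIMARY KEY, author TEXT, body TEXT);
    CREATE TABLE images   (post_id INTEGER, url TEXT);
    CREATE TABLE comments (post_id INTEGER, author TEXT, body TEXT);
    INSERT INTO posts    VALUES (1, 'alice', 'Hello from the lake');
    INSERT INTO images   VALUES (1, 'lake.jpg');
    INSERT INTO comments VALUES (1, 'bob', 'Nice view');
    INSERT INTO comments VALUES (1, 'carol', 'Where is this?');
""")

# Retrieving one post with its content needs one join per attached table.
rows = conn.execute("""
    SELECT p.author, p.body, i.url, c.author, c.body
    FROM posts p
    LEFT JOIN images   i ON i.post_id = p.id
    LEFT JOIN comments c ON c.post_id = p.id
    WHERE p.id = 1
    ORDER BY c.author
""").fetchall()

# Document layout: the same post stored as one nested JSON document.
post_document = {
    "author": "alice",
    "body": "Hello from the lake",
    "images": ["lake.jpg"],
    "comments": [
        {"author": "bob", "body": "Nice view"},
        {"author": "carol", "body": "Where is this?"},
    ],
}
store = {1: json.dumps(post_document)}  # stand-in for a document database

# Retrieving the post is a single lookup by key, with no joins.
loaded = json.loads(store[1])
```

In the relational version every additional kind of attachment (audio, video, links) adds another table and another join, while in the document version it adds another field to the same document.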
Because of this simplicity, the use of NoSQL has grown along with social media
platforms and with the growing needs of the IoT (Internet of Things).
3.1 SQL
SQL (Structured Query Language) is a language widely used to manage data in
relational databases. A relational database is a set of tables containing data
fitted into predefined categories. Users can access data from the database
without knowing how the tables are physically stored.
Scalability is an important issue for relational SQL databases, since the
database has to be distributed onto multiple servers and handling tables across
different servers can be problematic. However, Google has announced F1, a SQL
database designed to scale that runs at the core of its AdWords business, which
indicates that SQL remains a viable option for distributed databases [8].
3.2 NoSQL
Non-relational databases have been around since the late 1960s, but the
major shift to NoSQL databases happened only in the last decade, when Amazon
introduced its Dynamo distributed NoSQL system [2].
A NoSQL database provides a mechanism for storing and retrieving data that
is modeled by means other than the tabular relations used in relational
databases.
3.2.1 NoSQL Databases
The volume of data of all kinds has boomed in recent years, and a large
portion of it is no longer the structured data that SQL handles well. NoSQL is
a tool designed for such unstructured and semi-structured data. A NoSQL
database environment is, simply put, a non-relational and largely distributed
database system that enables rapid, ad-hoc organization and analysis of
extremely high-volume, disparate data types [9].
Schema-less NoSQL databases share most of the advantages of distributed
databases and have become an alternative to relational databases. What does
it mean to say that NoSQL is schema-less? It means that for any data we put
into a NoSQL database, we do not need to predefine a rigid schema like the one
required by a relational database. The format of the data can be changed at
any time without disrupting the application, which provides great flexibility
and, in many workloads, higher speed than relational databases.
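A minimal sketch of what "schema-less" means in practice is given below. It is an illustration only: the standard-library sqlite3 module stands in for a relational database, a Python list of dictionaries stands in for a NoSQL collection, and the table and field names are hypothetical.

```python
import sqlite3

# Relational side: the schema is fixed when the table is created.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO posts (id, body) VALUES (1, 'first post')")

# A field that is not part of the schema is rejected...
try:
    conn.execute(
        "INSERT INTO posts (id, body, video_url) VALUES (2, 'clip', 'clip.mp4')"
    )
    rejected = False
except sqlite3.OperationalError:
    rejected = True

# ...until the schema itself is changed.
conn.execute("ALTER TABLE posts ADD COLUMN video_url TEXT")
conn.execute(
    "INSERT INTO posts (id, body, video_url) VALUES (2, 'clip', 'clip.mp4')"
)
row_count = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]

# Schema-less side: each document carries its own fields, so a new field
# appears simply by storing a document that contains it.
collection = []
collection.append({"id": 1, "body": "first post"})
collection.append({"id": 2, "body": "clip", "video_url": "clip.mp4"})
fields_per_document = [sorted(doc) for doc in collection]
```

The flexibility has a cost that the sketch also makes visible: since nothing enforces a common shape, code that reads the collection has to cope with documents that lack a given field.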
NoSQL also has great horizontal scalability. It automatically spreads data
over many servers, which can be added to or removed from the data layer. One
reason this is possible is that NoSQL databases typically do not enforce the
full ACID guarantees (atomicity, consistency, isolation, durability) that
relational databases provide; enforcing ACID across many servers makes good
horizontal scalability harder to achieve.
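One common way to spread data over servers that can be added or removed is consistent hashing. The sketch below is an assumption made for illustration, not the mechanism of any particular NoSQL product: the HashRing class, the server names and the number of virtual points per server are all hypothetical.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a position on the ring (a large integer)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring with several virtual points per server."""

    def __init__(self, servers, replicas=100):
        self.replicas = replicas
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> server name
        for server in servers:
            self.add_server(server)

    def add_server(self, server):
        for i in range(self.replicas):
            point = _hash(f"{server}#{i}")
            self._owner[point] = server
            bisect.insort(self._points, point)

    def remove_server(self, server):
        self._points = [p for p in self._points if self._owner[p] != server]
        self._owner = {p: s for p, s in self._owner.items() if s != server}

    def server_for(self, key):
        # The first ring position clockwise from the key's hash owns the key.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


keys = [f"user:{n}" for n in range(1000)]
ring = HashRing(["server-a", "server-b", "server-c"])
before = {k: ring.server_for(k) for k in keys}

ring.add_server("server-d")
after = {k: ring.server_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

When server-d joins, the only keys that change owner are the ones that now belong to server-d; all other keys stay where they were, which is what makes adding or removing servers cheap compared with rehashing every key.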
Other features like replication (which maintains availability) and integrated
caching (which improves read performance) also make NoSQL databases distinct
from relational databases.
3.2.2 Types of NoSQL Databases
Among the many types of NoSQL databases, four are most common: document
databases, which store document-oriented (semi-structured) information;
column stores, which store data tables as sections of columns of data rather
than as rows of data; key-value stores, which store data as values looked up
by an indexed key; and graph databases, which store data whose relations are
represented as a graph [9]. Figure 1 gives a brief comparison of these
databases in terms of their performance, scalability, flexibility, complexity
and functionality.
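To make the four data models more concrete, the sketch below lays out the same tiny data set in each shape using plain Python structures. The data, the key naming scheme and the helper followers_of are hypothetical, and real systems of each type add indexing, persistence and distribution on top of these shapes.

```python
# The same small data set (two users, one follows the other) in four shapes.

# Document database: one self-contained, semi-structured record per entity.
document_store = {
    "user:1": {"name": "Ada", "city": "London", "follows": ["user:2"]},
    "user:2": {"name": "Lin", "city": "Taipei"},
}

# Column store: the values of each column are kept together, not row by row.
column_store = {
    "name": ["Ada", "Lin"],
    "city": ["London", "Taipei"],
}

# Key-value store: an opaque value looked up by its key.
key_value_store = {
    "user:1": b'{"name": "Ada", "city": "London"}',
    "user:2": b'{"name": "Lin", "city": "Taipei"}',
}

# Graph database: nodes plus explicit, typed edges between them.
graph_nodes = {"user:1": {"name": "Ada"}, "user:2": {"name": "Lin"}}
graph_edges = [("user:1", "FOLLOWS", "user:2")]


def followers_of(user_id):
    """Return the ids of users with a FOLLOWS edge pointing at user_id."""
    return [src for src, rel, dst in graph_edges
            if rel == "FOLLOWS" and dst == user_id]


# Reading one column touches only that column's values.
cities = column_store["city"]
# A relationship question is a direct edge traversal in the graph shape.
lin_followers = followers_of("user:2")
```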
As an example, we choose MongoDB (document), Cassandra (column), Riak
(key-value) and Neo4j (graph) from the types just mentioned and compare, in
Table 1, their data distribution mechanisms and their support for distributed
data processing.
Figure 1: Comparison among NoSQL Databases [9]
                     MongoDB               Cassandra             Riak           Neo4j
Data distribution    Key-range and hash    Key-range and hash    Hash based     Master-slave
mechanism            based distribution    based distribution    distribution   cluster
Distributed data     Aggregation           Hadoop, MapReduce     MapReduce      Hadoop
processing support   pipeline, MapReduce
Table 1: Comparison of Four Representative NoSQL Databases [10]
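The two data distribution mechanisms named in Table 1 can be contrasted with a short sketch. It is a simplified illustration and not the implementation used by any of the four systems: the node names, the split points and the helper functions range_node and hash_node are hypothetical.

```python
import bisect
import hashlib

NODES = ["node-0", "node-1", "node-2"]

# Key-range distribution: each node owns a contiguous, sorted slice of keys.
# Hypothetical split points: node-0 holds keys < "h", node-1 holds
# "h" <= key < "q", and node-2 holds the rest.
SPLIT_POINTS = ["h", "q"]


def range_node(key: str) -> str:
    """Pick the node whose key range contains the key."""
    return NODES[bisect.bisect_right(SPLIT_POINTS, key)]


def hash_node(key: str) -> str:
    """Pick the node from a hash of the key, ignoring key order."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


keys = ["alice", "bob", "carol", "dave"]
range_placement = {k: range_node(k) for k in keys}
hash_placement = {k: hash_node(k) for k in keys}
```

With key-range distribution, neighboring keys land on the same node, so a range scan such as "all keys from a to e" can be answered by one node, at the risk of hot spots when many keys share a prefix. Hash-based distribution places keys independently of their order, which tends to even out the load but means a range scan may have to contact every node.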
3.3 NoSQL vs SQL
NoSQL, which supports non-relational, flexible, dynamic and horizontally
scalable databases, is frequently selected for distributed database systems.
SQL, on the other hand, enforces a strict schema, structured data and strong
consistency, and traditionally scales vertically. The best fit between SQL and
NoSQL depends on the needs of the application. In distributed systems, NoSQL is
mostly preferred because of the scalability problems of SQL. However, there is
evidence that scalability in SQL can be achieved with clustered hierarchical
distributed databases [8].
4 Conclusion
In the last decade, big data, the IoT and increasing user activity on social
platforms have forced the backbone of database technology to become
distributed. Distributed databases offer better reliability, higher speed and
lower communication cost than traditional centralized databases.
SQL and NoSQL both aim to keep data storage and retrieval efficient. Each
approach has advantages and disadvantages in different applications, so the
better choice depends on the situation and the needs of the application.
References
[1] M. T. Özsu and P. Valduriez, Principles of distributed databases, 2011.
[2] S. Venkatraman et al., SQL versus NoSQL movement with big data analytics,
International Journal of Information Technology and Computer Science
8(12), pp. 59-66. 2016.
[3] Distributed DBMS - Distributed Databases:
https://www.tutorialspoint.com/distributed_dbms/distributed_dbms_databases.htm
[4] N. Leavitt, Will NoSQL databases live up to their promise? Computer
43(2), pp. 12-14. 2010.
[5] MongoDB White Paper: Top 5 Considerations When Evaluating NoSQL
Databases
[6] Oracle9i Database Administrator’s Guide Release 2 (9.2) Part Number
A96521-01
[7] NoSQL vs. SQL, Microsoft Azure:
https://docs.microsoft.com/en-us/azure/documentdb/documentdb-nosql-vs-sql
[8] Ian Rae et al., Proceedings of the VLDB Endowment, Vol. 6, No. 11,
August 26th 2013.
[9] What is NoSQL?:
https://academy.datastax.com/planet-cassandra/what-is-nosql
[10] Xiaoming Gao, Investigation and Comparison of Distributed NoSQL
Database Systems, Indiana University