What Is A Distributed Database

Download as docx, pdf, or txt
Download as docx, pdf, or txt
You are on page 1of 8

What is a distributed database?

In the most basic terms, a distributed database is a database that stores data in multiple locations
instead of one location. This means that rather than putting all data on one server or on one
computer, data is placed on multiple servers or in a cluster of computers consisting of individual
nodes. These nodes are oftentimes geographically separate and may be physical computers or virtual
machines within a cloud database.

Illustration of cluster and nodes

The MongoDB cluster example above is one of many configuration possibilities available when
creating a distributed database. However, unlike traditional centralized databases, all distributed
databases share the common characteristic of spreading data across multiple locations (physical
and/or virtual) which improves data resiliency and availability. Sharding the data across multiple
locations allows for horizontal scaling as well.

Distributed database types

There are two distinct types of distributed databases: homogeneous databases and heterogeneous
databases.
Homogeneous distributed databases

In a homogeneous distributed database, the machines, nodes, servers, or sites store the same data,
use the same data model, work with the same operating system, and share the same distributed
database management system (DDBMS) or occasionally multiple types of DDBMS from the same
vendor.

Within homogenous distributed databases, there are two subsets: autonomous and non-
autonomous.

 Autonomous distributed databases: In an autonomous distributed database, nodes work on


their own with their own complete set of data, only requiring an application to facilitate
universal updates across all nodes or messaging between nodes.

 Non-autonomous distributed databases: In non-autonomous distributed databases, nodes


rely on a centralized database management system (DBMS) to coordinate data distribution,
communications, and all updates.

As a rule, homogeneous distributed databases offer significant data protection through redundancy
and simplified management due to the similarity of all nodes.

Heterogeneous distributed databases

In a heterogeneous distributed database, different machines or sites may house different data sets,
use different operating systems, contain different data schemas, and require software to facilitate
communication between machines. Further, different sites may not even be aware of the existence
of other sites.

Within heterogeneous distributed databases, there are two subsets: federated and unfederated.

 Federated distributed databases: In a federated distributed database, multiple nodes —


which are able to function completely on their own and may contain different data — can
work together and function as one entity. This means that when a query occurs, the system
determines which node is best equipped to respond and passes the query appropriately. This
process is sometimes referred to as data virtualization.

 Unfederated distributed databases: In an unfederated distributed database, each node


operates individually and there is a central application that manages the access to each
database in each node.

While more complex to manage, heterogeneous distributed databases offer more flexibility in terms
of data models, schema choices, and the types of data that can be stored than homogeneous
distributed databases.

How do distributed databases work?

As previously discussed, nodes are individual servers or computers that reside within a distributed
database system (e.g., computers, virtual machines, servers that share no physical components).
Each node stores a set of data and runs on distributed database management system software
(DDBMS). To determine which data will be stored amongst which nodes, the concept of data
distribution must be considered.
Data distribution

Proper data distribution is critical to the efficiency, security, and optimal user access in a distributed
database. This process, sometimes referred to as data partitioning, can be accomplished using two
different methods.

 Horizontal partitioning: Horizontal partitioning involves splitting data tables into rows across
multiple nodes.

 Vertical partitioning: Vertical partitioning splits tables into columns across multiple nodes.

(Source: Hazelcast.com, 2023)

The resulting data sets from horizontal or vertical partitioning of the original table are sometimes
referred to as shards.

Distributed database system communication

While nodes are able to fully function on their own, it is necessary for them to communicate with
other nodes as well since, unlike centralized databases, they do not share the same physical
components or even the same data sets. There are three types of distributed database
communication:
 Broadcast communication: One message is sent to all other nodes within the distributed
database system.

 Multicast communication: One message is sent to some but not all other nodes within the
distributed database system.

 Unicast communication: A message is sent from an individual node to one other individual
node.

Transaction management

Distributed databases must often support distributed transactions, where one transaction can
involve more than one node. This support methodology is highlighted in the ACID properties
(atomicity, consistency, isolation, durability) of transactions across distributed database systems. Key
elements of ACID properties include:

(Source: Dev.to, 2020)

 Atomicity means that a transaction is treated as a single unit. This also means that either a
complete transaction is available for storage or it's rejected as an error which ensures data
integrity.

 Consistency is maintained in distributed database systems by enforcing predefined rules and


data constraints. If the state, nature, or content of a transaction violates these rules, the
transaction will not be ingested and stored in the distributed system.

 Isolation involves the separation of each transaction from the other transactions to prevent
data conflicts and maintain data integrity. In addition, this benefits operations when
managing multiple distributed data records that may exist across local data stores, virtual
machines via cloud computing, and multiple database nodes which may be located across
multiple sites.

 Durability ensures that stored data is preserved in the event of a system failure. There are a
variety of ways that a transactional distributed database management system accomplishes
this task, including:

Fault tolerance

Because distributed database systems are more likely to experience failures or operations
interruptions than centralized databases (e.g., due to multiple sites or a suboptimal file system),
strong fault tolerance processes are essential to maintain access reliability and effective database
operations. With that said, the number of individual components that distributed systems are able to
preserve removes the risk of a single point of failure.

Some common fault tolerance processes include data replication, backup protocols, continuous
failure detection, data checksums, load balancing, and query optimization.

Data replication

Data replication is the process by which multiple copies of data are maintained across different
nodes, servers, or sites. There are different types of database replication schema to choose from,
including:

 Full replication: In full replication, a complete, functional copy of the entire database is sent
to all sites within the distributed database system. Database copy updates are provided on a
routine schedule. There are two subtypes of full replication, as well.

 Transactional replication: In transactional replication, a full and complete database


copy is provided to each node, and then data changes are updated to that copy as
transaction processing occurs, often in real-time.

 Snapshot replication: Using snapshot replication, a copy of the database at a specific


point in time is captured. This snapshot is then distributed across nodes and the user
base as needed but does not consistently monitor for data changes. For this reason,
snapshot database replication is only recommended for infrequently changing
content.

 Partial replication: In some cases, certain nodes only require specific portions of the
database, so a defined portion of the database is replicated to a select group. In this type of
data replication, any number of nodes or sites can receive the replication.

 Merge replication: As its name indicates, merge database replication is the merging of two
databases into one. This is the most complex of the database replication types.
Backup protocols

Through a consistent program of automated data backups, data integrity and database systems
availability can be maintained without overburdening organizational employees. Some of the most
common solutions in the marketplace include backup software from Veeam, Druva, and Commvault.

Three of the most used types of backup for distributed databases include:

 Full backup: The entire database is copied and stored every time a database backup is
executed.

 Differential backup: Only the changes made since the last full backup are copied and stored.

 Incremental backup: Incremental backups do not require a previous full backup — they can
save changes since the previous differential or incremental backup.

Continuous failure detection

As with any system, it is critical for distributed database systems to be continuously monitored for
system failures — whether they be technical issues, natural disasters, or cyberattacks. Just a few of
the ways this monitoring is accomplished include:

 Heartbeating: In heartbeating, each node sends out a signal (heartbeat) to other nodes to
verify it's operational. If that signal isn't received, a failure message is created and further
investigation of that node's operations by system administration is undertaken.

 Watchdog timers: Individual nodes will have watchdog timers that are focused on a specific
activity or process. If the timer expires without the activity or process being completed, a
failure message is generated indicating further investigation is required.

 Data checksums: In order to identify data tampering or other issues with data transmission,
when a data transmission is sent, it is assigned a certain value (or checksum). When that
transmission is received, it is also assigned a checksum. By using software to verify that both
the sender and receiver have equivalent checksums for that transmission, issues with data
transmission integrity can be quickly identified.

Load Balancing

Load balancing techniques distribute user requests and queries evenly across database nodes. This
not only improves performance but also ensures that the failure of one node does not cause an
overload on others.

Usually, load balancing software is deployed as the intermediary between the applications or
database users. When a query is received, the load balancer will evaluate the request and determine
which node(s) are best equipped to respond. During this evaluation, such factors as proximity,
current load, and other predetermined system rules will be considered. This evaluation and
assignment helps the system avoid system overload and system inefficiency which can result in long
wait times for users.
Query optimization

Distributed databases use query optimization techniques to distribute queries efficiently across
nodes while minimizing data transfer traffic between nodes. One of the ways this is accomplished is
through cost-based query optimization. This form of query optimization considers the most efficient
execution for the query, with such factors as query complexity, available data, and the location of the
site containing that data.

Benefits and challenges distributed databases offer

As with any type of database solution, there are both benefits and challenges. Here is a brief
summary to consider when researching distributed databases for your organization.

Distributed database benefits

 Flexibility: Flexibility of data structures and schemas used within a distributed database (e.g.,
heterogeneous) are a significant benefit for organizations with a variety of data asset types
and processing requirements.

 Resiliency: Because distributed databases locate data across multiple nodes in the
distributed system, the risk of a single point of failure is significantly reduced.

 Scalability: Distributed databases can easily scale up (or down) by simply adjusting the
number of nodes in the database, making them ideal for growing organizations.

 Improved performance: Distributed databases are able to use load balancing and query
optimization to improve overall database performance while reducing user wait times.

 High availability: Fault tolerance (e.g., data replication, continuous failure detection) provide
high system availability for users.

Distributed database challenges

 Complexity: Because there are more moving parts to distributed databases vs. centralized
databases, they can be more complex to both design and manage. The Atlas developer data
platform simplifies this dramatically by providing a single UI/API to control and manage
secure MongoDB distributed systems at scale.

 Latency: If not managed properly, latency can occur when users query data from multiple
nodes.
 Data consistency: Since distributed databases are able to employ multiple data schemas and
structures, maintaining data consistency requiresmore effort than traditional databases. In
addition, if there is a hardware or network failure, data restoration can be more complex.

 Cost: Distributed databases can be more expensive due to the added complexity that their
greater flexibility brings. In addition, there may be additional networking costs since they
tend to have more sites and hardware than traditional databases.

You might also like