
NOSQL DATABASES

VISWANADH. MNV
UNIT-2 Distribution Models, Consistency & Version Stamps
2.1 Distribution Models
2.1.1 Introduction
● The main reason for interest in NoSQL databases is their ability to run on large
clusters of servers. As data grows, it's harder and more costly to upgrade to a
bigger server. Instead, it's better to spread the database across multiple servers.
This approach works well because data can be easily divided and distributed.
● To handle larger amounts of data and increase processing power, there are two
main methods: replication and sharding.
○ Replication: involves copying the same data to multiple servers, which
can improve availability and fault tolerance.
○ Sharding involves splitting different pieces of data across different
servers, which can handle more data and traffic.
● These methods can be used separately or together. Replication comes in two
types:
○ Master-slave replication, where one server is the master and others are
backups.
○ Peer-to-peer replication, where all servers are equal and share data.
● The process starts from the simplest setup (a single server) and progresses to
more complex arrangements:
1. Single server
2. Sharding
3. Master-slave replication
4. Peer-to-peer replication

2.1.2 Single Server


● Simplest Distribution Option:
○ Recommended to avoid distribution complexities.
○ Easier for operations and application development.
● NoSQL on a Single Server:
○ Suitable for applications where NoSQL data model fits well.
○ Graph databases perform best in single-server setups.
○ Single-server document or key-value stores are beneficial for aggregate
processing.

Fig: Single Server Database or Centralized Database
● Advantages:
○ Eliminates complexities of distributed systems.
○ Simplifies management and reasoning for developers.

2.1.3 Sharding
Database sharding is splitting a large database into smaller, manageable pieces called
shards, which are stored on different servers. This approach helps overcome the
limitations of a single server's storage and processing capacity by distributing data
across multiple servers that work together (see Figure 1.1).

Why Database Sharding is Important


As applications grow, more users and data can slow down a single database, creating a
bottleneck and affecting performance. Sharding solves this by allowing parallel
processing of smaller data sets, improving speed and efficiency.

Benefits of Database Sharding


1. Improves Response Time: Smaller data sets mean faster data retrieval.
2. Prevents Service Outages: Distributes data across multiple servers, so if one
fails, others can continue to operate.
3. Enables Efficient Scaling: Add more shards to accommodate growing data
without downtime.

Figure 1.1. Sharding puts different data on separate nodes, each of which does its own reads
and writes.
How Database Sharding Works
Sharding splits a database table into smaller pieces, with each piece (or shard) stored on
a different server (or node). All shards follow the original database’s schema but contain
unique rows of data.

Example:
● Original Database:
- Customer ID 1: John (California)
- Customer ID 2: Jane (Washington)
- Customer ID 3: Paulo (Arizona)
- Customer ID 4: Wang (Georgia)

● Sharded Database:
Server A: John, Jane
Server B: Paulo, Wang

Key Terms
● Shards: Partitions of data stored on different servers.
● Shard Key: A column used to determine data distribution across shards.
● Shared-nothing Architecture: Each shard operates independently, unaware of
others.

Sharding Methods
1. Range-based Sharding: Divides data based on a range of values.
Example: Names starting with A-I on Shard A, J-S on Shard B, T-Z on Shard C.
Pros: Simple to implement.
Cons: Risk of uneven data distribution.
2. Hashed Sharding: Uses a hash function to distribute data evenly (see the sketch after this list).
Pros: Ensures even data distribution.
Cons: Complex to reassign data when adding new shards.
3. Directory Sharding: Uses a lookup table to assign data to shards.
Pros: Flexible and meaningful data distribution.
Cons: Lookup table errors can cause issues.
4. Geo Sharding: Stores data based on geographical location.
Pros: Faster data retrieval due to proximity.
Cons: Potential for uneven data distribution.
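
To make hashed sharding concrete, here is a minimal sketch in plain JavaScript, runnable in the mongo shell (illustrative only: the toy hash function, the three-shard cluster, and the customer IDs are assumptions; real systems use stronger hash functions):

// Choose a shard for a given shard key by hashing it
function chooseShard(shardKey, numShards) {
  var key = String(shardKey);
  var hash = 0;
  for (var i = 0; i < key.length; i++) {
    hash = (hash * 31 + key.charCodeAt(i)) % 1000000007;  // simple rolling hash
  }
  return hash % numShards;  // shard index in the range 0 .. numShards - 1
}

// Customers 1-4 from the earlier example, spread over 3 shards
[1, 2, 3, 4].forEach(function (customerId) {
  print("Customer " + customerId + " -> shard " + chooseShard(customerId, 3));
});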

Optimizing Shard Key Selection


To avoid database hotspots (overloaded shards), choose shard keys that balance the
data load:
● Cardinality: Ensure a wide range of possible shard keys.
● Frequency: Consider how often certain values will occur.
● Monotonic Change: Avoid keys that increase or decrease predictably, causing
imbalance.
By carefully selecting and managing shard keys, you can ensure even data distribution
and efficient database performance.

2.1.4 Master-Slave Replication
Master-Slave Setup:
● Master Node: The primary, authoritative source for data, responsible for
processing updates.
● Slave Nodes: Secondary nodes synchronized with the master, primarily handling
read requests.

Figure 1.2. Data is replicated from master to slaves. The master services all writes; reads may
come from either master or slaves.
Benefits:
● Scalability for Read-Intensive Data:
○ Adding more slave nodes allows horizontal scaling to handle more read
requests.
○ All read requests are directed to the slaves.
○ Not ideal for heavy write traffic due to the master’s update processing
limits.
● Read Resilience:
○ If the master fails, slaves can still handle read requests.
○ Writes are paused until the master is restored or a new master is appointed.
○ Quick recovery, as a slave can become the new master.
● Hot Backup:
○ Slaves act as a hot backup for the master.
○ Provides resilience, allowing graceful handling of server failures.
○ Think of the system as a single-server store with an immediate backup.

Master Appointment:
● Manual Appointment: Configure one node as the master during setup.
● Automatic Appointment: Nodes in the cluster elect a master automatically.
○ Simplifies configuration.

○ Reduces downtime by quickly appointing a new master if the current one
fails.

Read Resilience Considerations:


● Ensure separate read and write paths in your application.
● Use different database connections for reads and writes (as sketched below).
● Test thoroughly to confirm reads still work if writes fail.
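
A minimal sketch of separate read and write paths, assuming a MongoDB replica set and an "orders" collection (both assumptions for illustration): writes go to the master (primary), while reads are routed to slaves (secondaries) via a read preference.

// Writes always go to the primary (master)
db.orders.insertOne({ orderId: 101, status: "NEW" });

// Reads are routed to a secondary (slave) when one is available,
// so they keep working even if the master is unavailable for writes
db.orders.find({ orderId: 101 }).readPref("secondaryPreferred");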

Potential Issues: Inconsistency due to read-write (R-W) conflicts:


● Different slaves might show different data if updates haven’t propagated.
● A client might not see its own recent writes.
● Updates may be lost if the master fails before they propagate to slaves.

2.1.5 Peer-to-Peer Replication


● No Master Node: All replicas have equal status.
● Scalability and Resilience:
○ All nodes can accept writes.
○ Loss of any node doesn’t prevent access to the data store.

Figure 1.3. Peer-to-peer replication has all nodes applying reads and writes to all the data.

Benefits:
● Enhanced Scalability: Easily add nodes to improve performance.

● Improved Resilience: Continue accessing data despite node failures.

Challenges:
● Consistency issues due to write-write (W-W) conflicts:
○ Risk of write-write conflicts when two nodes update the same record
simultaneously.
○ Inconsistent reads are temporary, but inconsistent writes can cause
long-term issues.
● Trade-off: Balance between consistency and availability.

Handling Write Inconsistencies:


● Coordination Approach:
○ Replicas coordinate to avoid conflicts during writes.
○ Requires network traffic to synchronize writes.
○ Not all replicas need to agree, just a majority, allowing for some node
failures.
● Merging Approach:
○ Accept inconsistent writes and merge them later based on a predefined
policy.
○ Maximizes performance by allowing writes to any replica.

2.1.6 Combining Sharding and Replication


There are two strategies for combining sharding and replication:
● Master-Slave Replication with Sharding
● Peer-to-Peer Replication with Sharding

1. Master-Slave Replication with Sharding:


● Multiple masters, each responsible for different data.
● Nodes can act as masters for some data and slaves for others.
● Alternatively, nodes can be dedicated solely as masters or slaves.

Configuration Options for Master-Slave Configuration:


● Assign specific nodes as masters or slaves.
● Allows for a clear distribution of roles and responsibilities.

Figure 1.4. Using master-slave replication together with sharding

2. Peer-to-Peer Replication with Sharding:


● Common in column-family databases.
● Clusters can have tens or hundreds of nodes with data sharded across them.

Peer-to-Peer Replication Setup using Replication Factor:


● A good starting point is a replication factor of 3.
● Each shard is present on three nodes.
● If a node fails, other nodes will rebuild the shards from the failed node.

Configuration Options: Peer-to-Peer Configuration:


● Nodes have equal weight and can handle any operation.
● Flexibility in handling node failures and scaling.

Benefits:
● Scalability:
○ Both read and write operations can be distributed across multiple nodes.
○ Enhances performance by balancing the load.
● Resilience:
○ Ensures data availability even if some nodes fail.

○ Increases fault tolerance and speeds up recovery.

Figure 1.5. Using peer-to-peer replication together with sharding

Key Points
● There are two styles of distributing data:
○ Sharding distributes different data across multiple servers, so each server
acts as the single source for a subset of data.
○ Replication copies data across multiple servers, so each bit of data can be
found in multiple places. A system may use either or both techniques.
● Replication comes in two forms:
○ Master-slave replication makes one node the authoritative copy that
handles writes while slaves synchronize with the master and may handle
reads.
○ Peer-to-peer replication allows writes to any node; the nodes coordinate
to synchronize their copies of the data.
● Master-slave replication reduces the chance of update conflicts but peer-to-peer
replication avoids loading all writes onto a single point of failure.

2.2 Consistency
2.2.1 Introduction
● Switching from a centralized relational database to a cluster-oriented NoSQL
database involves a big shift in how you think about consistency.
● Relational databases aim for strong consistency by avoiding many types of
inconsistencies.
● In the NoSQL world, terms like "CAP theorem" and "eventual consistency"
become important.
● When building with NoSQL, you need to decide what level of consistency your
system requires.
● Consistency has different forms, and each form can introduce different types of
errors.

2.2.2 Update Consistency

Definition 1: Update consistency in NoSQL databases refers to the challenge of ensuring that data remains consistent across multiple replicas or nodes after an update, particularly in distributed systems where updates may not be immediately reflected, leading to temporary inconsistencies.

Definition 2: In NoSQL databases, update consistency refers to the guarantee that a write operation will be reflected in all subsequent reads of the data item. However, NoSQL databases often prioritize availability and performance over strong consistency, leading to potential update consistency issues.

Scenario:
The diagram depicts a scenario with two nodes: Node 1 and Node 2. Both nodes are
interacting with a shared data item, "A." The timeline shows various operations
performed on "A" at different time points (t1, t2, ..., t8).
The diagram illustrates this problem through the following steps:
● t1: Node 1 reads the value of A.
● t2: Node 1 updates the value of A by adding 100 to it.
● t3: Node 2 reads the value of A. This read operation might return the old value of
A, as the update might not have been propagated to Node 2 yet.
● t4: Node 2 updates the value of A by adding 200 to it.
● t5: Node 1 writes the updated value of A back.
● t7: Node 2 writes the updated value of A back.

Consequences of Update Consistency Issues


The potential for updates to be lost or overwritten can lead to several problems in
NoSQL applications:
● Write-Write Conflict: This occurs when two or more users attempt to update the same data simultaneously, for example when two users (say, Martin and Pramod) submit updates to the same record and both reach the server at nearly the same time.
● Data Loss: If an update is lost due to a consistency issue, the data might become
incomplete or inaccurate.
● Data Corruption: If multiple updates are made to the same data item without
proper synchronization, it can lead to data corruption.
● Unexpected Application Behavior: If an application relies on consistent data,
inconsistent updates can lead to unexpected behavior or errors.

Mitigating (Solving) Update Consistency Issues

While NoSQL databases often prioritize availability and performance over strong consistency, there are strategies to mitigate update consistency problems:
● Serialization: The server decides the order in which these updates are applied.
Without any control, Martin’s update might be overwritten by Pramod’s, leading
to a lost update where Pramod’s change disregards Martin’s earlier update.
● Consistency Approaches: There are two main approaches:
○ Pessimistic: Prevents conflicts by using write locks, allowing only one
user to update data at a time. In our example, Martin would lock the data
first, ensuring Pramod sees Martin’s update before deciding on his own.

○ Optimistic: Allows conflicts but detects and resolves them (see the sketch after this list). For instance, if Martin updates first, Pramod’s attempt would recognize the change and adjust accordingly.
● Handling Distributed Updates: In distributed systems with multiple servers,
maintaining consistent updates becomes more complex. Methods like sequential
consistency ensure all servers apply updates in the same order.
● Conflict Resolution: Sometimes, conflicting updates are saved separately and
marked for resolution. This approach, akin to version control systems, requires
merging conflicting changes intelligently, possibly involving user intervention.
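
As a sketch of the optimistic approach, assuming a MongoDB "accounts" collection with a numeric "version" field (both assumptions for illustration), the update below only succeeds if the record still has the version the client originally read, so a concurrent update is detected rather than silently overwritten:

// The client read the document earlier and saw version 7
var expectedVersion = 7;

// Conditional update: matches only if nobody has changed the version since the read
var result = db.accounts.updateOne(
  { _id: "savings-1", version: expectedVersion },
  { $inc: { balance: 100, version: 1 } }
);

if (result.modifiedCount === 0) {
  // Write-write conflict detected: re-read the document and retry or merge
  print("Conflict: the record was updated by someone else since it was read");
}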

2.2.3 Read Consistency


● Definition 1: The read consistency problem in NoSQL databases refers to the
challenge of ensuring that read operations return the most up-to-date and
accurate data, particularly when data may be replicated across multiple nodes,
resulting in potential discrepancies or stale data due to replication delays or
eventual consistency models.
● Definition 2: In NoSQL databases, read consistency refers to the guarantee that
a read operation will return the most recent value written to a data item.
However, NoSQL databases often prioritize availability and performance over
strong consistency, leading to potential read consistency issues.

Scenario:
The diagram illustrates this problem through the following steps:
● t1: Master Node reads the value of A.
● t2: Master Node updates the value of A by subtracting 500.
● t3: Master Node writes the updated value of A back.
● t4: Client Node reads the value of A. This read operation might return the old
value of A, as the update might not have been propagated to the Client Node yet.
● t5: Master Node updates the value of A again by subtracting 500.
● t6: Master Node writes the updated value of A back.
● t7: Client Node reads the value of A again. It might still read the old value of A,
depending on the replication and consistency settings of the NoSQL database.

Consequences of Read Consistency Issues


The potential for reading stale or inconsistent data can lead to several problems in
NoSQL applications:
● Write-Read (WR) conflicts: Inconsistent reads can lead to write-read conflicts,
where a write operation is performed based on stale data, leading to unexpected
results.
● Incorrect data display: If a user interface displays data based on a stale read, it
might show outdated or incorrect information.
● Unexpected application behavior: If an application makes decisions based on
inconsistent data, it might behave unexpectedly or produce erroneous results.
● Data integrity issues: Inconsistent reads can compromise data integrity and
lead to data loss or corruption.

Mitigating (Solving) Read Consistency Issues

While NoSQL databases often prioritize availability and performance over strong
consistency, there are strategies to mitigate read consistency problems:

1. Inconsistency Window:
● The time during which clients might read inconsistent data.
● Can be very short; for example, Amazon's SimpleDB service often has an
inconsistency window of less than a second.
2. Replication Consistency:
● Ensures the same data item has the same value when read from different
replicas.

● Example: A hotel room booking may show different availability on
different nodes due to replication delays.
3. Eventual Consistency:
● Over time, all nodes will update to the same value if there are no further
updates.
● Data that hasn't been updated yet is considered stale.
4. Sequential Consistency:
● Ensures operations are applied in the same order across all nodes.
5. Consistency Levels:
● Applications can often specify the consistency level for individual
requests.
● Use weak consistency when acceptable, and strong consistency when
necessary.
6. Read-Your-Writes Consistency:
● Ensures once a user makes an update, they see it in subsequent reads.
● Important for scenarios like posting comments on a blog where users
expect to see their updates immediately.
7. Session Consistency:
● Provides read-your-writes consistency within a user’s session.
● Can be implemented using sticky sessions (tying a session to one node) or version stamps (including the latest version stamp in each interaction); a sketch using causally consistent sessions appears after this list.
8. Handling Read Consistency with Replication:
● Using sticky sessions might reduce load balancing effectiveness.
● Another approach is to switch the session to the master for writes and
keep reads consistent until slaves catch up.
9. Application Design Considerations:
● Maintaining replication consistency is crucial for overall application
design.
● Avoid keeping transactions open during user interactions to prevent
conflicts. Use approaches like offline locks instead.
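
As a sketch of read-your-writes/session consistency, assuming a MongoDB replica set and a "blog.comments" collection (assumptions for illustration), a causally consistent session guarantees that reads within the session observe that session's earlier writes:

// Start a causally consistent session: reads in this session see its own writes
var session = db.getMongo().startSession({ causalConsistency: true });
var comments = session.getDatabase("blog").comments;

// Write the comment, waiting for a majority of replicas
comments.insertOne({ postId: 42, text: "Nice article!" },
                   { writeConcern: { w: "majority" } });

// Read it back within the same session: the new comment is guaranteed to be visible
comments.find({ postId: 42 }).readConcern("majority");

session.endSession();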

2.2.4 Relaxing Consistency


Definition
Relaxing consistency in NoSQL means allowing some temporary inconsistencies in the data to improve other system characteristics, such as performance and availability. This is a trade-off where strict data accuracy is sometimes sacrificed to make the system faster and more responsive.

Key Points
1. Trade-offs in System Design:
● Always maintaining perfect consistency can slow down the system.
● Designers often need to balance consistency with other factors like speed
and availability.
● Different applications can tolerate different levels of inconsistency,
depending on their needs.
Example from Relational Databases:
● Traditional databases use transactions to ensure strong consistency.
● However, to boost performance, databases can relax these consistency
rules (e.g., allowing some uncommitted data to be read).
● This is commonly done using a "read-committed" level, which prevents
some conflicts but not all.
2. Impact on Performance:
● Transactions can slow down the system, so some systems avoid using
them entirely.
● For example, early versions of MySQL were popular because they were
fast, even though they didn’t support transactions.
● Large websites like eBay sometimes skip transactions to handle large
volumes of data efficiently, especially when data is split across multiple
servers (sharding).
3. Handling Remote Systems:
● Many applications interact with external systems that can't be included in
a transaction.
● This means updates often happen outside of strict transaction
boundaries, which is common in enterprise applications.

2.2.5 The CAP Theorem

Overview

The CAP Theorem is often used in the NoSQL world to explain why maintaining consistency in a distributed system can be challenging. Proposed by Eric Brewer in 2000, and later formally proven by Seth Gilbert and Nancy Lynch, it helps in understanding the trade-offs in distributed systems.

Understanding the CAP Theorem helps developers choose the right database for their
specific use case, balancing the trade-offs based on the requirements of their
application.

Theorem statement

The CAP Theorem can be stated as follows:

In a distributed data store, it is impossible to simultaneously guarantee all three of the following properties:

1. Consistency (C): All nodes see the same data at the same time.
2. Availability (A): Every request (read or write) receives a response, regardless of whether the data is up-to-date.
3. Partition Tolerance (P): The system continues to operate despite network partitions.

You can have at most two of these three guarantees at any given time.

According to the CAP Theorem, a distributed system can achieve at most two of these
three properties at the same time. This leads to different types of NoSQL databases
being classified based on their design choices:

● CP (Consistency and Partition Tolerance): These systems prioritize consistency and partition tolerance but may sacrifice availability during network partitions. An example is HBase.

● AP (Availability and Partition Tolerance): These systems focus on availability
and partition tolerance, potentially allowing for temporary inconsistencies. An
example is Cassandra.
● CA (Consistency and Availability): This is typically not achievable in a
distributed system because network partitions are a reality in any distributed
setup. Most systems cannot guarantee both consistency and availability in the
presence of network failures.

Example scenarios

Here are example scenarios for each combination of the CAP Theorem (Consistency,
Availability, and Partition Tolerance):

1. CA (Consistency and Availability)

● Example Scenario: Banking Transaction System


● Description: A banking system that operates on a single-node architecture.
● Behavior: When a user makes a deposit or withdrawal, the transaction is
processed immediately, ensuring that all subsequent reads return the most
recent account balance (consistency). Since there's only one node, the system is
always available as long as that node is operational.
● Limitation: If the server goes down (e.g., due to hardware failure), the system
becomes unavailable, losing availability.

2. AP (Availability and Partition Tolerance)

● Example Scenario: Social Media Feed


● Description: A social media platform that allows users to post updates and
comments.
● Behavior: Users can post updates and interact with the platform even if some
nodes are unreachable due to network issues (availability). If a network partition
occurs, different nodes might have slightly different views of the feed, allowing
users to continue posting and commenting without waiting for synchronization.
● Limitation: Because of the partition tolerance, there may be inconsistencies
between users’ feeds, as updates made on one side of the partition may not be
reflected on the other until the partition is resolved.

3. CP (Consistency and Partition Tolerance)

● Example Scenario: Distributed Database for E-Commerce

● Description: An e-commerce platform that needs to maintain accurate inventory
counts across multiple geographic locations.
● Behavior: When a user tries to purchase a product, the system ensures that the
inventory count is updated consistently across all nodes before completing the
transaction (consistency). If there’s a network partition, the system may
temporarily reject orders or limit access to ensure that inventory counts remain
consistent.
● Limitation: During a partition, users may experience delays or rejections when
trying to place orders, sacrificing availability to maintain consistency in inventory
management.

2.2.6 Relaxing Durability


Definition: In database systems, relaxing durability means allowing some updates to be
temporarily held in memory and periodically saved to disk instead of being immediately
written to persistent storage. This can improve performance but risks data loss if the
system crashes before the data is saved.

Key Points
1. Durability: Ensures that once data is written, it won't be lost, even if the system crashes.

2. When to Relax Durability:


● Higher Performance: Running a database mostly in memory and occasionally
flushing data to disk can make the system more responsive.
● Risk of Data Loss: If the server crashes, updates since the last flush may be lost.

3. Examples of Relaxing Durability:


● User-Session State:
○ Websites store temporary user activity.
○ Losing session data might be less problematic than slowing down the
website.
○ Suitable for nondurable writes.
● Telemetric Data:
○ Capturing data from physical devices.
○ Prioritizing faster data capture over the risk of losing recent updates if the
server crashes.

4. Issues
Replication Durability:
● Replication Failure: When an update is processed but not replicated before a
node fails.

● Master-Slave Model: If the master fails, unreplicated updates are lost.


● Auto-Failover: Appointing a new master can lead to update conflicts.
● Improving Replication Durability (see the write-concern sketch after this list):
○ Master waits for replicas to acknowledge updates before confirming to the client.
○ This can slow down updates and affect availability if replicas fail.
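
A sketch of per-update durability choices using MongoDB write concerns (the collection names are assumptions for illustration): w: 0 returns immediately without any acknowledgement, while w: "majority" with journaling waits until a majority of replicas have durably applied the update.

// Nondurable, fire-and-forget write: fastest, but lost if the node crashes
db.sessions.insertOne({ userId: 7, cart: ["book"] },
                      { writeConcern: { w: 0 } });

// Durable write: wait for a majority of replicas and the on-disk journal
db.payments.insertOne({ userId: 7, amount: 49.99 },
                      { writeConcern: { w: "majority", j: true } });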

5. Benefits:
● Durability Trade-offs:
○ Choose based on how critical data retention is.
○ Allow individual updates to specify their durability needs.
● Balancing Performance and Durability:
○ Assess the impact of potential data loss against the performance benefits.
○ Use nondurable writes for less critical data to improve system
responsiveness.

2.2.7 Quorums
Definition: In NoSQL databases, quorums are rules that specify how many nodes must participate in a read or write operation, relative to the replication factor, in order to balance consistency and availability.

Quorums example:

1. Write Quorum:
● Definition: The minimum number of nodes that must confirm a write operation
to ensure strong consistency. The rule for the write quorum is as follows:
W > N/2
W: number of nodes participating in the write
N: number of nodes involved in replication
Meaning: the number of nodes participating in the write must be more than half the number of nodes involved in replication.
Replication Factor: The number of replicas is often called the replication factor.

● Example: With data replicated across three nodes, a write quorum might require
at least two nodes to acknowledge the write.

2. Read Quorum:

● Definition: The minimum number of nodes that need to be contacted to ensure
reading the most up-to-date data.
● Example: Depending on the write quorum (e.g., if two nodes confirm writes), you
may need to contact at least two nodes for a read.

3. Relationship Between Quorums:


● Definition: The relationship between the number of nodes involved in reading
(R), confirming writes (W), and the total replication factor (N).
● Example: To ensure strongly consistent reads, the inequality R + W > N must be satisfied (see the sketch below).
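
The quorum rules above can be expressed as a small check in plain JavaScript (illustrative only; the node counts are assumptions):

// W > N/2 prevents two conflicting write quorums;
// R + W > N guarantees every read quorum overlaps the latest write quorum
function isValidWriteQuorum(W, N) { return W > N / 2; }
function isStronglyConsistentRead(R, W, N) { return R + W > N; }

var N = 3;                                    // replication factor
print(isValidWriteQuorum(2, N));              // true: 2 of 3 nodes confirm each write
print(isStronglyConsistentRead(2, 2, N));     // true: reading 2 nodes always sees the latest write
print(isStronglyConsistentRead(1, 2, N));     // false: a single-node read may be stale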

4. Replication Factor and Cluster Size:


● Definition: The number of copies of data maintained across different nodes
(replication factor) versus the total number of nodes in the cluster.
● Example: A cluster may have 100 nodes but a replication factor of 3, with data
distributed through sharding.

5. Benefits
Resilience and Automatic Rebalancing:
● Definition: Ensuring data availability and redundancy by maintaining replicas
and automatically redistributing data if a node fails.
● Example: A replication factor of 3 allows for continued operation even if one
node fails.

Flexibility in Operations:
● Definition: The ability to adjust the number of nodes involved in operations
based on the desired balance between consistency and availability.
● Example: Some operations might require full quorum for strong consistency,
while others may prioritize speed over freshness of data.

6. Tradeoffs in NoSQL:
● Definition: Balancing consistency (data accuracy) and availability (system uptime)
based on specific operational needs and constraints.
● Example: Choosing whether to prioritize fast reads or strong consistency
depending on the application requirements.

Key Points
● Write-write conflicts occur when two clients try to write the same data at the
same time.
● Read-write conflicts occur when one client reads inconsistent data in the middle
of another client's write.
● Pessimistic approaches lock data records to prevent conflicts. Optimistic
approaches detect conflicts and fix them.
● Distributed systems see read-write conflicts due to some nodes having received updates while other nodes have not. Eventual consistency means that at some point the system will become consistent once all the writes have propagated to all the nodes.
● Clients usually want read-your-writes consistency, which means a client can write
and then immediately read the new value. This can be difficult if the read and the
write happen on different nodes.
● To get good consistency, you need to involve many nodes in data operations, but
this increases latency. So you often have to trade off consistency versus latency.
● The CAP theorem states that if you get a network partition, you have to trade off
availability of data versus consistency.
● Durability can also be traded off against latency, particularly if you want to
survive failures with replicated data.
● You do not need to contact all replicas to preserve strong consistency with
replication; you just need a large enough quorum.

2.3 Version Stamps


2.3.1 Introduction
● Some people criticize NoSQL databases because they don't support transactions,
which are important for maintaining data consistency. However, many NoSQL
databases do support atomic updates within an aggregate, which helps with
consistency. It's important to consider your transactional needs when choosing a
database.

● Transactions have limitations. Sometimes updates need human intervention and
can't be handled within a transaction because it would take too long. Version
stamps can help in these cases and are useful in distributed systems as well.

2.3.2 Business and System Transactions

● Ensuring update consistency without transactions is common, even with transactional databases. When people think of transactions, they often think of business transactions, like shopping online. These actions usually don't lock the database the entire time because that would cause delays and inefficiencies.
● Instead, the system usually locks the database only at the end of the interaction.
However, this can lead to problems if the data changes during the process.
Techniques like Optimistic Offline Lock can help by checking if the data has
changed before completing the update.
● A version stamp, a field in the database that changes with every update, helps
track these changes. If the version stamp is different when you try to update, you
know the data has changed.
● Some databases and HTTP resources use mechanisms like etags, which serve a
similar purpose. You compare the etag you received with the one on the server
before updating to ensure data consistency.
● There are different ways to create version stamps:
1. Counters: Increment with each update. Easy to compare but need a
central authority to avoid duplication.

Example:

{ "_id": ObjectId("60e7c9c2eab8f44813c0494f"),

"name": "Laptop",

"price": 1200,

"version": 1

// Increment the version counter for a specific document

db.products.updateOne(

{ _id: ObjectId("60e7c9c2eab8f44813c0494f") },

24
{ $inc: { version: 1 } }

);

2. GUIDs: Unique identifiers that can be generated by anyone. They are large and can't be compared directly for recentness.

Example:

// Inserting a document with a GUID in MongoDB
db.collection.insertOne({
  _id: ObjectId(),                     // MongoDB generates a unique ObjectId
  name: "John Doe",
  guid: BinData(3, UUID().base64())    // Generates a GUID and stores it as BinData
});

3. Content Hashes: Based on the data content, unique, and deterministic, but not comparable for recentness.

Example (a sketch: MongoDB has no built-in content-hash operator, so the hash is computed client-side here; hex_md5 is available in the legacy mongo shell):

// Existing document whose content we want to stamp
{
  "_id": ObjectId("60e7c9c2eab8f44813c0494f"),
  "name": "Jane Doe",
  "age": 30,
  "city": "New York"
}

// Compute a hash of the document's content and store it as the version stamp
var doc = db.documents.findOne({ _id: ObjectId("60e7c9c2eab8f44813c0494f") });
var hash = hex_md5(JSON.stringify({ name: doc.name, age: doc.age, city: doc.city }));
db.documents.updateOne({ _id: doc._id }, { $set: { contentHash: hash } });

4. Timestamps: Comparable for recentness but require synchronized clocks to avoid issues.

Example

// Inserting a document with a timestamp version stamp in MongoDB
db.collection.insertOne({
  _id: ObjectId(),
  name: "Alice Smith",
  createdAt: new Date()
});

● You can combine these methods for better results. For example, CouchDB uses
both counters and content hashes to manage version stamps.

2.3.3 Version Stamps on Multiple Nodes

● Simple version stamps work well with a single data source or master-slave
replication, where the master controls the stamps. However, in a peer-to-peer
system, things get more complex because there isn't a single authority.
● A counter-based version stamp is simple: each node increments its counter with
each update. If two nodes have different counters, the higher one is more
recent.
● In multiple-master systems, you need more sophisticated methods, like keeping
a history of version stamps to track changes and detect inconsistencies.
Timestamps can be problematic due to clock synchronization issues and can't
detect write-write conflicts.
● A common approach in peer-to-peer NoSQL systems is the vector stamp. This
uses a set of counters, one for each node. Each node updates its counter with
every change, and nodes synchronize their vector stamps when they
communicate.

● By comparing vector stamps, you can detect which version is newer or if there are conflicts (a small comparison sketch appears at the end of this section). Missing values in vectors are treated as zeros, allowing for flexible node addition without invalidating stamps.
● Vector stamps are good for detecting inconsistencies but don't resolve them.
Resolving conflicts depends on the specific use case and the trade-off between
consistency and latency.
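
A small comparison sketch in plain JavaScript (illustrative only; the node names blue and green are assumptions): it treats missing counters as zero and reports whether one stamp descends from the other or the two conflict.

// Compare two vector stamps such as { blue: 2, green: 3 } and { blue: 1, green: 4 }
function compareVectorStamps(a, b) {
  var nodes = Object.keys(a).concat(Object.keys(b));
  var aAhead = false, bAhead = false;
  nodes.forEach(function (node) {
    var av = a[node] || 0, bv = b[node] || 0;   // missing counters count as zero
    if (av > bv) aAhead = true;
    if (bv > av) bAhead = true;
  });
  if (aAhead && bAhead) return "conflict";      // concurrent updates: must be resolved
  if (aAhead) return "first is newer";
  if (bAhead) return "second is newer";
  return "equal";
}

print(compareVectorStamps({ blue: 2, green: 3 }, { blue: 1, green: 3 }));  // first is newer
print(compareVectorStamps({ blue: 2, green: 3 }, { blue: 1, green: 4 }));  // conflict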

Key Points
● Version stamps help you detect concurrency conflicts. When you read data, then
update it, you can check the version stamp to ensure nobody updated the data
between your read and write.
● Version stamps can be implemented using counters, GUIDs, content hashes,
timestamps, or a combination of these.
● With distributed systems, a vector of version stamps allows you to detect when
different nodes have conflicting updates.

