
NoSQL Databases Principles
Ecole d’Ingénierie Digitale et d’Intelligence Artificielle (EIDIA)
Preparatory Cycle, Engineering Program
Khawla TADIST
Academic Year 2023-2024
Outline
Different aspects of data distribution
 Scaling
 Vertical vs. horizontal
 Distribution models
 Sharding
 Replication
 Master-slave vs. peer-to-peer architectures
 CAP properties
 Consistency, Availability and Partition tolerance
 ACID vs. BASE
Scalability

What is scalability?
 Capability of a system to handle growing amounts of data and/or queries without
losing performance,
 Or its potential to be enlarged in order to accommodate such growth
 Two general approaches
 Vertical scaling
 Horizontal scaling
Scalability - Vertical Scalability

Vertical scaling (scaling up/down)


 Adding resources to a single node in a system
 Increasing the number of CPUs,
 Extending system memory,
 Using larger disk arrays,
 …
 i.e. larger and more powerful machines are involved
Scalability - Vertical Scalability

Vertical scaling (scaling up/down)


 Traditional choice
 In favor of strong consistency
 Easy to implement and deploy
 No issues caused by data distribution
 …
 Works well in many cases but …
Scalability - Vertical Scalability Drawbacks

Performance limits
 Even the most powerful machine has a limit
 Moreover, everything works well… unless we start approaching such limits
Higher costs
 The cost of expansion increases exponentially
 In particular, it is higher than the sum of costs of equivalent commodity
hardware
Scalability - Vertical Scalability Drawbacks

Proactive provisioning
 New projects/applications might evolve rapidly
 Upfront budget is needed when deploying new machines
 And so flexibility is seriously limited
Scalability - Vertical Scalability Drawbacks

Vendor lock-in
 There are only a few manufacturers of large machines
 Customer is made dependent on a single vendor
 Their products, services, but also implementation details, proprietary formats,
interfaces, …
 i.e. it is difficult or impossible to switch to another vendor
Deployment downtime
 Inevitable downtime is often required when scaling up
Scalability - Horizontal Scalability

Horizontal scaling (scaling out/in)


 Adding more nodes to a system
 i.e. the system is distributed across multiple nodes in a cluster
 Choice of many NoSQL systems
Scalability - Horizontal Scalability

Horizontal scaling (scaling out/in)


 Advantages
 Commodity hardware, cost effective
 Flexible deployment and maintenance
 Often outperforms vertical scaling
 …
 Unfortunately, there are also plenty of false assumptions…
Scalability - Horizontal Scalability Drawbacks
False assumptions
 Network is reliable
 Network is secure
 Latency is zero
 Bandwidth is infinite
 Topology does not change
 There is one administrator
 Transport cost is zero
Scalability - Horizontal Scalability Consequences
Significantly increases complexity
 Complexity of management,
 Programming model, …
 Introduces new issues and problems
Synchronization of nodes
 Data distribution
 Data consistency
 Recovery from failures
 …
Scalability - Horizontal Scalability

A standalone node might still be a better option in certain cases


 e.g. for graph databases
 Simply because it is difficult to split and distribute graphs
 In other words
 It can make sense to run even a NoSQL database system on a single node
 No distribution at all is the simplest (and often preferred) scenario
But in general, horizontal scaling does open new possibilities
Scalability - Horizontal Scalability Architecture
What is a cluster?
 A collection of mutually interconnected commodity nodes
 Based on the shared-nothing architecture
 Nodes do not share their CPUs, memory, hard drives,…
 Each node runs its own operating system instance
 Nodes send messages to interact with each other
 Nodes of a cluster can be heterogeneous
 Data, queries, computation, workload, …
 This is all distributed among the nodes within a cluster
Distribution Models

Generic techniques of data distribution


 Sharding
 Different data on different nodes
 Motivation: increasing volume of data, increasing performance
 Replication
 Copies of the same data on different nodes
 Motivation: increasing performance, increasing fault tolerance
Distribution Models

The two techniques are orthogonal to each other


 i.e. we can use either of them, or combine them both
NoSQL systems often offer automatic sharding and replication
Distribution Models - Sharding

Sharding (horizontal partitioning)


 Placement of different data on different nodes
 What does different data mean? Different aggregates
– E.g. key-value pairs, documents, …
Distribution Models - Sharding

Sharding (horizontal partitioning)


 Placement of different data on different nodes
 Related pieces of data that are accessed together should also be kept together
– Specifically, operations involving data on multiple shards should be
avoided
Distribution Models - Sharding

Sharding (horizontal partitioning)


 The questions are…
 How to design aggregate structures?
 How to actually distribute these aggregates?
Distribution Models - Sharding
Sharding (horizontal partitioning)
Distribution Models - Sharding

Objectives
 Uniformly distributed data (volume of data)
 Balanced workload (read and write requests)
 Respecting physical locations
 e.g. different data centers for users around the world
 …
Unfortunately, these objectives…
 May contradict each other
 May change in time
Distribution Models - Sharding
Sharding (horizontal partitioning)

Source: Sadalage, Pramod J. - Fowler, Martin: NoSQL Distilled. Pearson Education, Inc., 2013.
Distribution Models - Sharding

How to actually determine shards for aggregates?


 We not only need to be able to place new data when handling write requests,
 But also to find the data when handling read requests,
 i.e. when a given search criterion is provided (e.g. key, id, …)
Distribution Models - Sharding

How to actually determine shards for aggregates?


 We must be able to determine the shard corresponding to the given key
 So that the requested data can be accessed and returned,
 Or failure can be correctly detected when the data is missing
Distribution Models - Sharding

Sharding strategies
 Based on mapping structures
 Placing data on shards in a random fashion (e.g. round-robin) is not suitable
 Based on general rules:
 Hash partitioning,
 Range partitioning
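To make the two rule-based strategies concrete, here is a minimal Python sketch; the shard count, the hash function and the range boundaries are illustrative assumptions, not a prescribed design.

import hashlib

NUM_SHARDS = 4
# Range partitioning: hypothetical upper bounds of the key range owned by each shard
RANGE_BOUNDS = ["g", "n", "t", "~"]

def hash_shard(key: str) -> int:
    # Hash partitioning: hash the key and take the result modulo the number of shards
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def range_shard(key: str) -> int:
    # Range partitioning: the first range whose upper bound is above the key wins
    for shard, upper in enumerate(RANGE_BOUNDS):
        if key < upper:
            return shard
    return len(RANGE_BOUNDS) - 1

# The same function is used both to place data (writes) and to locate it (reads)
print(hash_shard("user:42"), range_shard("martin"))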
Distribution Models - Replication

Replication
 Placement of multiple copies – replicas – of the same data on different nodes
 Replication factor = the number of copies
 Two approaches:
 Master-slave architecture
 Peer-to-peer architecture
Distribution Models - Replication - Master-Slave
Master-Slave Architecture

Source: Sadalage, Pramod J. - Fowler, Martin: NoSQL Distilled. Pearson Education, Inc., 2013.
Distribution Models - Replication - Master-Slave

Architecture
 One node is primary (master), all the others are secondary (slaves)
 The master node bears all the management responsibility
 All the nodes contain identical data
Distribution Models - Replication - Master-Slave

Architecture
 Read requests can be handled by both the master and the slaves
 Suitable for read-intensive applications
 More read requests to deal with → more slaves to deploy
 When the master fails, read operations can still be handled
Distribution Models - Replication - Master-Slave

Write requests can only be handled by the master


Newly written data is propagated to all the slaves
Consistency issue
 Luckily enough, at most one write request is handled at a time
 But the propagation still takes some time during which obsolete reads might happen
 Hence certain synchronization is required to avoid conflicts
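A minimal, illustrative Python sketch of the routing this implies (the in-memory Node and MasterSlaveCluster classes are hypothetical): writes are accepted only by the master and then propagated to the slaves, while reads may be served by any node.

import random

class Node:
    # A toy replica holding its data in an in-memory dictionary
    def __init__(self, name):
        self.name = name
        self.data = {}

class MasterSlaveCluster:
    def __init__(self, num_slaves=2):
        self.master = Node("master")
        self.slaves = [Node(f"slave-{i}") for i in range(num_slaves)]

    def write(self, key, value):
        # Only the master accepts writes ...
        self.master.data[key] = value
        # ... and propagates them to the slaves (synchronously here for simplicity;
        # asynchronous propagation is what opens the window for obsolete reads)
        for slave in self.slaves:
            slave.data[key] = value

    def read(self, key):
        # Reads may be served by the master or by any slave
        node = random.choice([self.master] + self.slaves)
        return node.data.get(key)

cluster = MasterSlaveCluster()
cluster.write("room:101", "booked")
print(cluster.read("room:101"))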
Distribution Models - Replication - Master-Slave

In case of master failure, a new one needs to be appointed


 Manually (user-defined)
 Automatically (cluster-elected)
 Since the nodes are identical, appointment can be fast
The master might therefore represent a bottleneck (because of performance limits or failures)
Distribution Models - Replication - Peer-to-Peer
Peer-to-Peer Architecture

Source: Sadalage, Pramod J. - Fowler, Martin: NoSQL Distilled. Pearson Education, Inc., 2013.
Distribution Models - Replication

Architecture
 All the nodes have equal roles and responsibilities
 All the nodes again contain identical data
Distribution Models - Replication

Both read and write requests can be handled by any node


 No bottleneck, no single point of failure
 More requests to deal with → more nodes to deploy
Distribution Models - Replication

Both read and write requests can be handled by any node


 Consistency issues
 Unfortunately, multiple write requests can be initiated independently and handled at
the same time
 Hence synchronization is required to avoid conflicts
Distribution Models - Sharding and Replication

Observations with respect to replication:


 Does the replication factor really need to correspond to the number of nodes?
 No, a replication factor of 3 is often the right choice
 Consequences
– Nodes will no longer contain identical data
– A replica placement strategy will be needed (see the sketch after this list)
 Sharding and replication can be combined… but how?
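As one possible illustration of combining sharding with a replication factor of 3, the sketch below assumes a simple ring-style placement strategy (the node list, the hash function and the strategy itself are assumptions, not the only option): each aggregate is hashed to a home node and also stored on the next two nodes.

import hashlib

NODES = ["node-0", "node-1", "node-2", "node-3", "node-4"]
REPLICATION_FACTOR = 3  # each aggregate is stored on 3 of the 5 nodes

def replica_nodes(key: str) -> list:
    # Hash the key to pick a 'home' node, then also use the next two nodes on the ring
    home = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % len(NODES)
    return [NODES[(home + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]

# Three consecutive nodes, starting at the key's home node
print(replica_nodes("user:42"))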
Distribution Models - Sharding and Replication

Combinations of sharding and replication


 Sharding + master-slave replication
 Multiple masters, each for different data
 Roles of the nodes can overlap
– Each node can be master for some data and/or slave for other data
Distribution Models - Sharding and Replication

Combinations of sharding and replication


 Sharding + peer-to-peer replication
 Placement of anything anywhere
CAP Theorem

Assumptions
 System with sharding and replication
 Read and write operations on a single aggregate
CAP properties = properties of a distributed system
 Consistency
 Availability
 Partition tolerance
CAP Theorem

CAP theorem
 It is not possible to have a distributed system that would guarantee consistency,
availability, and partition tolerance at the same time.
 Only 2 of these 3 properties can be enforced.
 But, what do these properties actually mean?
CAP Theorem - Properties

Consistency
 Read and write operations must be executed atomically
 There must exist a total order on all operations such that each operation looks as if it
was completed at a single instant,
 i.e. as if all the operations were executed one by one on a single standalone node
CAP Theorem - Properties

Consistency
 Practical consequence: After a write operation, all readers see the same data
 Since any node can be used for handling read requests, atomicity of write operations means that changes must be propagated to all the replicas
CAP Theorem - Properties

Availability
 If a node is working, it must respond to user requests
 Every read or write request received by a non-failing node in the system must result in a
response
CAP Theorem - Properties

Partition tolerance
 System continues to operate even when two or more sets of nodes get isolated
 i.e. a connection failure MUST NOT shut the whole system down
CAP Theorem - Consequences

At most two properties can be guaranteed


 CA = Consistency + Availability
 CP = Consistency + Partition tolerance
 AP = Availability + Partition tolerance
CAP Theorem - Consequences

If at most two properties can be guaranteed…


 CA = Consistency + Availability
 Traditional ACID properties are easy to achieve
 Examples: RDBMS, Google BigTable
 Any single-node system
 However, should a network partition happen, all the nodes must be forced to stop accepting user requests
CAP Theorem - Consequences

If at most two properties can be guaranteed…


 CP = Consistency + Partition tolerance
 Examples: MongoDB, HBase
CAP Theorem - Consequences

If at most two properties can be guaranteed…


 AP = Availability + Partition tolerance
 New concept of BASE properties
 Examples: Apache Cassandra, Apache CouchDB
 Other examples: web caching, DNS
CAP Theorem - Consequences

Partition tolerance is necessary in clusters


 Why?
 Because network failures in a cluster are unavoidable and difficult even to detect
 Does it mean that only purely CP and AP systems are possible?
 No…
CAP Theorem - Consequences

The real meaning of the CAP theorem:


 Partition tolerance is a MUST,
 But we can trade off consistency versus availability
 Relaxing consistency just a little can bring a lot of availability
 Such trade-offs are not only possible, but often work very well in practice
ACID Properties
Traditional ACID properties
 Atomicity
 Partial execution of transactions is not allowed (all or nothing)
 Consistency
 Transactions bring the database from one consistent (valid) state to another
 Isolation
 Although multiple transactions may execute in parallel, each transaction must appear to execute in isolation, without interfering with the others
 Durability
 Effects of committed transactions must remain durable
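To make atomicity and durability concrete, here is a minimal Python sketch using the standard sqlite3 module (the accounts table and the transferred amount are hypothetical): either both updates of the transfer are committed together, or the rollback undoes them.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    # Transfer 30 from alice to bob as a single transaction: all or nothing
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 'bob'")
    conn.commit()    # durability: once committed, the effects remain
except sqlite3.Error:
    conn.rollback()  # atomicity: partial execution is undone

print(conn.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall())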
BASE Properties

New concept of BASE properties


 Basically Available
 The system works basically all the time
 Partial failures can occur, but without total system failure
 Soft State
 The state of the system is in flux (it may change even without new input)
 Changes occur all the time
 Eventual Consistency
 Sooner or later, the system will reach a consistent state
ACID and BASE
ACID
 Choose consistency over availability
 Pessimistic approach
 Implemented by traditional relational databases
BASE
 Choose availability over consistency
 Optimistic approach
 Common in NoSQL databases
 Allows levels of scalability that cannot be achieved with ACID
Current trend in NoSQL:
 Strong consistency → eventual consistency
Consistency

Consistency in general…
 Consistency means the absence of contradictions in the database
 Strong consistency is achievable even in clusters, but eventual consistency might
often be sufficient
 Even when an already booked (unavailable) hotel room is booked again, the situation can be resolved in the real world
 …
Consistency

Write consistency (update consistency)


 Problem: write-write conflict
 Two or more write requests on the same aggregate are initiated concurrently
 Issue: lost update
 Question: Do we need to solve the problem in the first place?
Consistency

Write consistency (update consistency)


 Question: Do we need to solve the problem in the first place?
 If yes, then there are two general solutions
 Pessimistic approaches
 Preventing conflicts from occurring
 Techniques: write locks, …
 Optimistic approaches
 Conflicts may occur, but are detected and resolved later on
 Techniques: version stamps, …
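A minimal sketch of the optimistic approach based on version stamps, using a hypothetical in-memory store: a write succeeds only if the version it originally read is still the current one; otherwise the write-write conflict is detected and can be resolved later (e.g. by retrying).

class VersionedStore:
    # Toy key-value store in which every value carries a version stamp
    def __init__(self):
        self._data = {}  # key -> (value, version)

    def read(self, key):
        return self._data.get(key, (None, 0))

    def write(self, key, value, expected_version):
        _, current_version = self._data.get(key, (None, 0))
        if current_version != expected_version:
            return False  # conflict detected: the lost update is avoided
        self._data[key] = (value, current_version + 1)
        return True

store = VersionedStore()
_, version = store.read("room:101")
print(store.write("room:101", "booked by A", version))  # True: first writer wins
print(store.write("room:101", "booked by B", version))  # False: stale version, conflict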
Consistency

Read consistency (replication consistency)


 Problem: read-write conflict
 Write and read requests on the same aggregate are initiated concurrently
 Issue: inconsistent read
 When not treated, an inconsistency window will exist
 Propagation of changes to all the replicas takes some time
 Until this process is finished, inconsistent reads may happen; even the initiator of the write request may read stale data!
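An illustrative toy simulation of the inconsistency window (propagation is modelled by an explicit sync() call; all names are hypothetical): until the replica has been synchronized, readers see stale data; once propagation finishes, the replicas converge.

class Replica:
    def __init__(self):
        self.data = {}

primary, replica = Replica(), Replica()

def write(key, value):
    primary.data[key] = value          # the write is acknowledged immediately ...

def sync():
    replica.data.update(primary.data)  # ... but reaches the replica only later

write("room:101", "booked")
print(replica.data.get("room:101"))    # None: inconsistent (stale) read
sync()
print(replica.data.get("room:101"))    # 'booked': the replicas have converged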
Conclusion
There is a wide range of options influencing…
 Scalability
– how well does the system scale (in data volume and requests)?
 Availability
– when may nodes refuse to handle user requests?
 Consistency
– what level of consistency is required?
 Latency
– how complicated is it to handle user requests?
 Durability
– is committed data written reliably?
 Resilience
– can the data be recovered in case of failures?
It’s good to know these properties and choose the right trade-off
