NOSQL Databases: Introduction to NOSQL Systems, CAP Theorem, Document-Based NOSQL
Systems and MongoDB, NOSQL Key-Value Stores, Column-Based or Wide Column NOSQL
Systems, NOSQL Graph Databases and Neo4j
Need for NoSQL Databases:
Traditional relational databases (RDBMS) are powerful but not always ideal for today’s data
needs. The need for NoSQL arises due to the following reasons:
1. Handling Big Data
• NoSQL databases are designed to handle massive volumes of structured, semi-
structured, and unstructured data.
• RDBMS struggles with scalability and flexibility in such cases.
2. Horizontal Scalability
• NoSQL systems support horizontal scaling by distributing data across multiple
servers.
• RDBMS typically supports vertical scaling (adding more power to a single machine),
which is costlier and limited.
3. Schema Flexibility
• NoSQL databases are schema-less or schema-flexible, allowing developers to evolve
the structure over time.
• Useful in agile development and when dealing with dynamic or unknown data models.
4. High Throughput and Low Latency
• Optimized for fast read/write operations, even with high traffic and massive
workloads.
• Often used in real-time web applications, analytics, and IoT.
5. Cloud-native and Distributed Systems
• Built with distributed architecture in mind.
• Suitable for cloud computing, where distributed storage and compute are common.
6. Variety of Data Models
• Supports diverse models such as document, key-value, column, and graph to match
various use cases.
Introduction to NoSQL Systems
NoSQL, or "Not Only SQL," refers to a class of database management systems (DBMS) designed
to handle large volumes of unstructured and semi-structured data. Unlike traditional relational
databases that use tables and pre-defined schemas, NoSQL databases provide flexible data models
and support horizontal scalability, making them well suited to modern applications that require
real-time data processing.
Features of NoSQL Databases:
Unlike relational databases, which use Structured Query Language (SQL), NoSQL databases don't
have a universal query language. Instead, each NoSQL system typically has its own query
language or API. Traditional relational databases follow ACID (Atomicity, Consistency,
Isolation, Durability) principles, ensuring strong consistency and structured relationships
between data.
However, as applications evolved to handle big data, real-time analytics, and distributed
environments, NoSQL emerged as a solution with:
• Schema-less: Flexible data models (no fixed schema).
• Horizontal Scalability: Easy to scale across multiple servers.
• High Performance: Optimized for high-speed reads/writes.
• Distributed Architecture: Built to support large-scale, distributed systems.
Types of NoSQL Databases:
1. Document-Based
2. Key-Value Stores
3. Column-Based (Wide Column Stores)
4. Graph Databases
1. Document-oriented databases:
• A document-oriented database stores data in documents similar to JSON (JavaScript
Object Notation) objects or BSON format.
• Each document contains pairs of fields and values.
• The values can typically be a variety of types, including things like strings, numbers,
booleans, arrays, or even other objects.
• A document database offers a flexible data model, well suited to semi-structured and
unstructured data sets.
• Documents support nested structures, making it easy to represent complex relationships
or hierarchical data.
• Each document is a self-contained unit.
Advantages:
• Schema flexibility
• Good for semi-structured data
• Easy to map to programming language objects
Examples for Document-oriented databases:
MongoDB, CouchDB
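As a rough illustration of the points above, a document can be pictured as a plain JavaScript object with nested fields and arrays. The field names and values below (address, courses, and so on) are invented for this sketch and do not come from any particular database.

```javascript
// A hypothetical student document: values can be strings, numbers,
// booleans, arrays, or nested objects.
const student = {
  name: "Alice",
  age: 20,
  enrolled: true,
  courses: ["Databases", "Networks"],        // array value
  address: { city: "Pune", pin: "411001" }   // nested object
};

// A second document in the same collection may have a different shape.
// This is what schema flexibility means in practice.
const visitor = { name: "Bob", email: "bob@example.com" };

// Documents convert directly to JSON text and back.
const asJson = JSON.stringify(student);
const roundTrip = JSON.parse(asJson);

console.log(roundTrip.address.city);   // nested field access
console.log(Object.keys(visitor));     // this document has no "age" field
```

Because the object and the stored document have the same shape, no separate mapping layer is needed to move between the two.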
2. Key-value stores
A key-value store is a simpler type of database where each item consists of a key and a value.
Each key is unique and associated with a single value.
• Data is stored as key-value pairs, making retrieval by key extremely fast.
• They are commonly used for caching and session management, and they often provide high
read and write performance because many of them keep data in memory.
• Examples: Redis, Memcached, Amazon DynamoDB
EXAMPLE:
Key: user:12345
Value: {"name": "foo bar", "email": "foo@bar.com", "designation": "software developer"}
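The get/set behaviour described above can be sketched with an in-memory map. The KeyValueStore class, its optional time-to-live argument, and the session key are all hypothetical; real systems such as Redis expose their own commands and clients.

```javascript
// Minimal in-memory key-value store sketch (hypothetical class, not a real client).
class KeyValueStore {
  constructor() {
    this.data = new Map();   // key -> { value, expiresAt }
  }
  // Store a value; ttlMs is optional and mimics cache expiry.
  set(key, value, ttlMs = null, now = Date.now()) {
    const expiresAt = ttlMs === null ? null : now + ttlMs;
    this.data.set(key, { value, expiresAt });
  }
  // Return the value, or undefined if the key is missing or expired.
  get(key, now = Date.now()) {
    const entry = this.data.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt !== null && now >= entry.expiresAt) {
      this.data.delete(key);   // drop the expired entry
      return undefined;
    }
    return entry.value;
  }
  delete(key) {
    return this.data.delete(key);
  }
}

const store = new KeyValueStore();
store.set("user:12345", { name: "foo bar", email: "foo@bar.com" });
store.set("session:abc", "token-1", 1000, 0);   // expires at time 1000

console.log(store.get("user:12345").name);     // lookup by unique key
console.log(store.get("session:abc", 500));    // still valid at time 500
console.log(store.get("session:abc", 1500));   // expired, so undefined
```

The store never inspects the value; it only looks up by key, which is why this model suits caching and session data.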
3. Column-Based (Wide Column) NoSQL Systems
• Wide-column stores store data in tables, rows, and dynamic columns.
• Unlike traditional SQL databases, however, wide-column stores are flexible: different
rows can have different sets of columns.
• These databases can employ column compression techniques to reduce the storage space and
enhance performance.
• The wide rows and columns enable efficient retrieval of sparse and wide data.
• Great for time-series data, IoT applications, and big data analytics.
Some examples of wide-column stores are:
Google BigTable:
• Proprietary system used in Gmail and other Google services.
• Uses Google File System (GFS) for distributed storage.
Apache HBase (open-source):
• Inspired by BigTable.
• Uses Hadoop Distributed File System (HDFS) or Amazon S3 for storage.
Cassandra:
• Shares characteristics of both column-store and key-value systems.
Key Characteristics:
• Keys in column-based systems are multi-dimensional:
Typically include: Table name, Row key, Column (family + qualifier), and Timestamp.
• Columns are grouped into column families, and each family contains column qualifiers.
• Data is stored in rows, but only values for defined columns are stored (supports sparsity).
• Each cell can hold multiple versions of data (tracked using timestamps).
EXAMPLE:
UserID   Name       Email
101      John Doe   john@example.com
102      Jane       jane@example.com
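The multi-dimensional key from the Key Characteristics list (row key, column family plus qualifier, timestamp) can be sketched as nested maps. The WideColumnTable class, the "info" and "contact" family names, and the extra phone and email values are assumptions made for illustration only.

```javascript
// Sketch of a sparse, versioned wide-column table (hypothetical structure).
// A cell is addressed by (row key, "family:qualifier", timestamp).
class WideColumnTable {
  constructor() {
    this.rows = new Map();   // rowKey -> Map(column -> [{ ts, value }, ...])
  }
  put(rowKey, column, value, ts) {
    if (!this.rows.has(rowKey)) this.rows.set(rowKey, new Map());
    const row = this.rows.get(rowKey);
    if (!row.has(column)) row.set(column, []);
    const versions = row.get(column);
    versions.push({ ts, value });
    versions.sort((a, b) => b.ts - a.ts);   // newest version first
  }
  // Latest value at or before ts (default: newest); undefined if the cell is empty.
  get(rowKey, column, ts = Infinity) {
    const versions = this.rows.get(rowKey)?.get(column) ?? [];
    const hit = versions.find(v => v.ts <= ts);
    return hit === undefined ? undefined : hit.value;
  }
}

const users = new WideColumnTable();
users.put("101", "info:name", "John Doe", 1);
users.put("101", "info:email", "john@example.com", 1);
users.put("102", "info:name", "Jane", 1);
users.put("102", "info:email", "jane@example.com", 1);
users.put("102", "info:email", "jane@new.example.com", 2);   // newer version of the cell
users.put("102", "contact:phone", "555-0100", 2);            // only row 102 has this column

console.log(users.get("102", "info:email"));      // newest version
console.log(users.get("102", "info:email", 1));   // version as of timestamp 1
console.log(users.get("101", "contact:phone"));   // undefined: nothing stored for this cell
```

Row 101 stores nothing at all for contact:phone, which is the sparsity property: absent cells take no space.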
4. NoSQL Graph Databases and Neo4j
• Data is stored as nodes and edges, enabling complex relationship management.
• Nodes typically store information about people, places, and things (like nouns), while
edges store information about the relationships between the nodes.
• They work well for highly connected data, where the relationships or patterns may not be
very obvious initially.
• Useful for applications requiring relationship-based queries such as fraud detection and
social network analysis.
Examples of graph databases are Neo4j and Amazon Neptune. MongoDB also provides graph
traversal capabilities using the $graphLookup stage of the aggregation pipeline.
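A relationship-based query such as "who is within two friendships of Alice" can be sketched as a breadth-first traversal over adjacency lists. The people and FRIENDS_WITH edges below are invented, and the code is a simplified illustration rather than how any graph database executes queries.

```javascript
// Graph as adjacency lists: node -> list of { type, to } edges (hypothetical data).
const edges = {
  alice: [{ type: "FRIENDS_WITH", to: "bob" }],
  bob:   [{ type: "FRIENDS_WITH", to: "carol" }],
  carol: [{ type: "FRIENDS_WITH", to: "dave" }],
  dave:  []
};

// Breadth-first traversal: everyone reachable within maxHops friendships.
function reachable(start, maxHops) {
  const seen = new Set([start]);
  let frontier = [start];
  for (let hop = 0; hop < maxHops; hop++) {
    const next = [];
    for (const node of frontier) {
      for (const edge of edges[node] ?? []) {
        if (edge.type === "FRIENDS_WITH" && !seen.has(edge.to)) {
          seen.add(edge.to);
          next.push(edge.to);
        }
      }
    }
    frontier = next;   // move one hop outward
  }
  seen.delete(start);  // do not report the starting node itself
  return [...seen];
}

console.log(reachable("alice", 2));   // bob (1 hop) and carol (2 hops)
```

Each hop only follows the edges of the nodes already reached, so the cost depends on the neighbourhood visited rather than on the size of the whole data set.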
CRUD Operations of MongoDB:
MongoDB stores data in collections as documents using BSON (Binary JSON) format.
CRUD Operations are:
• Create Operation
• Read Operation
• Update Operation
• Delete Operation
CREATE Operation:
Used to insert new documents into a collection.
Syntax for single document insertion:
db.collectionName.insertOne({ key1: value1, key2: value2 })
Syntax for multiple document insertion (the documents are passed as an array):
db.collectionName.insertMany([
{ key1: value1, key2: value2 },
{ key1: value1, key2: value2 }
])
Example:
To create one new document with fields name, age, and course in the students collection:
db.students.insertOne({
name: "Alice",
age: 20,
course: "Computer Science"
})
To add two documents at once to the students collection, each with fields name, age, and
course:
db.students.insertMany([
{ name: "Bob", age: 21, course: "Electronics" },
{ name: "Charlie", age: 22, course: "Mechanical" }])
Read Operation:
Used to retrieve documents from a MongoDB collection using various queries:
i. Find All Documents (find())
ii. Find with Condition (find({ key: value }))
iii. Find with Projection (select specific fields)
iv. Find One Document (findOne())
v. Using Comparison Operators
find()
• Returns a cursor to all matching documents in the collection.
1. Example (All records):
db.students.find()
With no filter, every document in the collection matches.
2. Example (With condition):
db.students.find({ age: 21 })
Fetches documents where age is exactly 21.
Projection
Selects specific fields to be returned.
db.students.find({ age: 21 }, { name: 1, _id: 0 })
Returns only the name field (not _id) for students aged 21.
Comparison Operators
Operator   Meaning        Example
$gt        Greater than   { age: { $gt: 20 } }
$lt        Less than      { age: { $lt: 25 } }
$eq        Equal          { course: { $eq: "CS" } }
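To make the meaning of the comparison operators listed above concrete, the sketch below evaluates a filter against plain objects. The matches helper is hypothetical and covers only $gt, $lt, $eq, and plain equality; it is not MongoDB's implementation, which runs on the server.

```javascript
// Hypothetical helper: does a document satisfy a simple query filter?
// Supports plain equality plus the $gt, $lt, and $eq operators.
function matches(doc, filter) {
  return Object.entries(filter).every(([field, cond]) => {
    const value = doc[field];
    if (cond !== null && typeof cond === "object") {
      // Operator form, e.g. { age: { $gt: 20, $lt: 25 } }: all operators must hold.
      return Object.entries(cond).every(([op, operand]) => {
        switch (op) {
          case "$gt": return value > operand;
          case "$lt": return value < operand;
          case "$eq": return value === operand;
          default: throw new Error(`unsupported operator: ${op}`);
        }
      });
    }
    return value === cond;   // plain { field: value } means equality
  });
}

const students = [
  { name: "Alice", age: 20, course: "Computer Science" },
  { name: "Bob", age: 21, course: "Electronics" },
  { name: "Charlie", age: 22, course: "Mechanical" }
];

const olderThan20 = students.filter(d => matches(d, { age: { $gt: 20 } }));
const inRange = students.filter(d => matches(d, { age: { $gt: 20, $lt: 22 } }));

console.log(olderThan20.map(d => d.name));   // Bob and Charlie
console.log(inRange.map(d => d.name));       // Bob only
```

Several operators on the same field combine with a logical AND, as the second filter shows.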
findOne()
Returns the first matching document.
db.students.findOne({ name: "Alice" })
Useful when only one document is expected or needed.
Update Operation
Used to modify existing documents in a collection.
updateOne()
Updates the first document that matches the filter.
db.students.updateOne(
{ name: "Alice" },
{ $set: { age: 21 } }
)
Finds the document with name: "Alice" and sets her age to 21.
updateMany()
Updates all documents matching the filter.
db.students.updateMany(
{ course: "ECE" },
{ $set: { course: "Electronics" } }
)
All students with course: "ECE" will have their course changed to "Electronics".
replaceOne()
Replaces an entire document.
db.students.replaceOne(
{ name: "Alice" },
{ name: "Alice", age: 21, course: "AI" }
)
Completely replaces the old document with a new one (fields not mentioned will be removed;
the _id stays the same).
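The difference between a $set update and a full replacement can be sketched on plain objects. The applySet and applyReplace helpers and the email field are hypothetical; they only mimic the effect on a single document.

```javascript
// Hypothetical helpers contrasting $set with whole-document replacement.
function applySet(doc, fields) {
  return { ...doc, ...fields };              // untouched fields are kept
}
function applyReplace(doc, replacement) {
  return { _id: doc._id, ...replacement };   // only the _id carries over
}

const original = {
  _id: 1,
  name: "Alice",
  age: 20,
  course: "Computer Science",
  email: "alice@example.com"
};

const afterSet = applySet(original, { age: 21 });
const afterReplace = applyReplace(original, { name: "Alice", age: 21, course: "AI" });

console.log(afterSet);       // email is still present, age is now 21
console.log(afterReplace);   // email is gone because the replacement omitted it
```

A replacement is the right tool only when the new document is complete; otherwise $set avoids losing fields by accident.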
Delete Operation
Used to remove documents from a collection.
deleteOne()
Deletes the first matching document.
db.students.deleteOne({ name: "Alice" })
Only the first document with name: "Alice" is deleted, even if more exist.
deleteMany()
Deletes all documents that match the condition.
db.students.deleteMany({ course: "Mechanical" })
Removes all students who are in the Mechanical course.
Delete All Documents
db.students.deleteMany({})
Deletes every document in the students collection — use with caution!
MongoDB Distributed Systems Characteristics:
1. Transactions in MongoDB
2. Replication in MongoDB
3. Sharding in MongoDB
4. Replication vs Sharding
Transactions in MongoDB
• Atomicity: MongoDB supports atomic operations within a single document by default.
• Multi-document Transactions: Supported natively since MongoDB 4.0 (replica sets) and 4.2
(sharded clusters) to ensure atomicity and consistency across multiple documents; earlier
versions relied on a two-phase commit pattern.
• Useful in distributed environments where multiple documents/nodes are involved.
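The all-or-nothing idea behind a multi-document transaction can be sketched in memory: changes are made on a private copy and published only if every step succeeds. The accounts data and the runTransaction and transfer functions are hypothetical; this is not MongoDB's session API or its commit protocol.

```javascript
// Hypothetical in-memory "collection": _id -> document.
const accounts = new Map([
  ["A", { _id: "A", balance: 100 }],
  ["B", { _id: "B", balance: 50 }]
]);

// Run fn against a private copy; publish the copy only if fn does not throw.
function runTransaction(collection, fn) {
  const working = new Map(
    [...collection].map(([id, doc]) => [id, { ...doc }])
  );
  fn(working);   // if this throws, nothing below runs and nothing is published
  for (const [id, doc] of working) collection.set(id, doc);
}

// Move money between two documents as one atomic unit.
function transfer(from, to, amount) {
  runTransaction(accounts, docs => {
    docs.get(from).balance -= amount;
    if (docs.get(from).balance < 0) throw new Error("insufficient funds");
    docs.get(to).balance += amount;
  });
}

transfer("A", "B", 30);   // succeeds: both documents change
try {
  transfer("A", "B", 500);   // fails midway: neither document changes
} catch (e) {
  console.log("aborted:", e.message);
}
console.log(accounts.get("A").balance, accounts.get("B").balance);
```

Without the transaction wrapper, the failed transfer would have left account A debited while B was never credited.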
Replication in MongoDB
• Replica Set: A group of MongoDB servers maintaining identical data copies to ensure
high availability.
• Primary Node (N1): Handles all write operations and, by default, read operations.
• Secondary Nodes (N2, N3, ...): Hold replicas of the data; updated asynchronously
from the primary.
• Arbiter: Participates in elections to choose a new primary but does not store data.
• The number of voting members in a replica set should be odd (to avoid tied elections).
Read Preferences:
• Default: Reads from the primary only.
• Optional: Reads can be allowed from secondaries (not guaranteed to be most recent).
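The behaviour described above (writes go to the primary, secondaries catch up later, so secondary reads may be stale) can be sketched as follows. The ReplicaSet class and its method names are hypothetical; only the idea of an ordered operation log replayed by secondaries is taken from how replica sets work.

```javascript
// Sketch of a replica set: one primary, secondaries that lag behind.
class ReplicaSet {
  constructor(secondaryCount) {
    this.primary = new Map();
    this.secondaries = Array.from({ length: secondaryCount }, () => new Map());
    this.oplog = [];                                    // ordered log of accepted writes
    this.applied = new Array(secondaryCount).fill(0);   // log position per secondary
  }
  write(key, value) {   // all writes go to the primary
    this.primary.set(key, value);
    this.oplog.push({ key, value });
  }
  replicate(i) {        // secondary i catches up asynchronously
    for (; this.applied[i] < this.oplog.length; this.applied[i]++) {
      const { key, value } = this.oplog[this.applied[i]];
      this.secondaries[i].set(key, value);
    }
  }
  read(key, from = "primary") {
    return from === "primary" ? this.primary.get(key) : this.secondaries[from].get(key);
  }
}

// Electing a new primary needs votes from a strict majority of members.
const majority = members => Math.floor(members / 2) + 1;

const rs = new ReplicaSet(2);
rs.write("x", 1);
console.log(rs.read("x"));      // the primary sees the write at once
console.log(rs.read("x", 0));   // undefined: secondary 0 has not replicated yet
rs.replicate(0);
console.log(rs.read("x", 0));   // now up to date
console.log(majority(3), majority(4));
```

A set of 4 needs 3 votes and a set of 3 needs 2, so both survive only one failed member; this is why an even member count adds cost without adding fault tolerance.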
Sharding in MongoDB
• Purpose: Distribute a large dataset across multiple machines for performance and
scalability.
• Sharding = Horizontal Partitioning of a collection into disjoint sets (called shards).
• Shard Key: The field used for partitioning. It must:
o Exist in every document.
o Have an index.
• Partitioning Methods:
o Range Partitioning: Shard key values split into continuous ranges.
o Hash Partitioning: Applies a hash function to the shard key so that documents are
spread evenly (in a pseudo-random but repeatable way) across shards.
• Query Router:
o Forwards CRUD operations to relevant shards.
o If the target shard can't be identified, the query is broadcast to all shards.
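The two partitioning methods and the router's targeted versus broadcast behaviour can be sketched in a few lines. The shard count, the age ranges, the string hash, and the route function are all assumptions for illustration; MongoDB's own chunk ranges and hash function differ.

```javascript
// Sketch of range vs. hash partitioning and a query router (hypothetical).
const SHARDS = 3;

// Range partitioning on age: each shard owns a continuous key range.
function rangeShard(age) {
  if (age < 20) return 0;
  if (age < 30) return 1;
  return 2;
}

// Hash partitioning: a simple deterministic string hash, modulo the shard count.
function hashShard(key) {
  let h = 0;
  for (const ch of String(key)) {
    h = (h * 31 + ch.charCodeAt(0)) >>> 0;   // keep h an unsigned 32-bit integer
  }
  return h % SHARDS;
}

// Router: target one shard if the filter contains the shard key, else broadcast.
function route(filter, shardKey, shardOf) {
  if (shardKey in filter) return [shardOf(filter[shardKey])];
  return [...Array(SHARDS).keys()];
}

console.log(route({ age: 21 }, "age", rangeShard));       // targeted: one shard
console.log(route({ name: "Bob" }, "age", rangeShard));   // broadcast: all shards
console.log(hashShard("Alice") === hashShard("Alice"));   // same key, same shard
```

Range partitioning keeps neighbouring keys together, which helps range queries; hashing scatters them, which spreads load more evenly but turns range queries into broadcasts.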
Replication vs Sharding
Aspect              Replication                          Sharding
Purpose             High availability, failover          Scalability, load balancing
Data Copies         Multiple copies of same data         Partitions of data across nodes
Failure Tolerance   Yes, via secondaries and elections   Not the main focus
Write Target        Always to primary                    Based on shard key and query router
Examples of Other Key-Value Stores:
• Oracle key-value store. Oracle has one of the well-known SQL relational database
systems, and Oracle also offers a system based on the key-value store concept; this
system is called the Oracle NoSQL Database.
• Redis key-value cache and store. Redis differs from the other systems discussed here
because it keeps its data in main memory to further improve performance. It offers
master-slave replication and high availability, and it also offers persistence by backing
up the cache to disk.
• Apache Cassandra. Cassandra is a NOSQL system that is not easily categorized into
one category; it is sometimes listed in the column-based NOSQL category (covered above
under Column-Based NoSQL Systems) or in the key-value category. It offers features from
several NOSQL categories and is used by Facebook as well as many other customers.
Neo4j Data Model:
Neo4j is a graph database in the NoSQL family. Unlike traditional relational databases (which
use tables), Neo4j represents data as a graph structure of nodes, relationships, and
properties.
Core Components:
• Nodes
• Relationships
• Properties
• Labels
1. Nodes
• Represent entities or objects (e.g., people, products, cities).
• Analogous to rows in a table in relational databases.
• Can have one or more labels that define their role or type.
• Can store properties as key-value pairs.
2. Relationships
• Represent connections between nodes.
• Are directed (have a start and end node).
• Have a type and can also have properties.
• Relationships are first-class citizens in Neo4j.
3. Properties
• Key-value pairs attached to nodes and relationships.
• Can store data like strings, numbers, arrays, etc.
4. Labels
• Labels are used to group nodes into sets (e.g., :Person, :Movie).
• A node can have multiple labels.
5. Relationship Types
• Describe the semantic meaning of the connection.
• Example: :FRIENDS_WITH, :ACTED_IN, :WORKS_FOR.
Advantages of Neo4j’s Data Model:
• Intuitive: Naturally represents complex, highly connected domains.
• Flexible: Schema-optional; nodes can have different sets of properties.
• Efficient Traversals: Relationship pointers make graph traversals very fast.
• Powerful Queries: Cypher makes pattern-matching easy and expressive.
Example: Movie Database
Let's build a mini movie graph:
Entities:
• Persons: Alice, Bob
• Movies: The Matrix, Inception
Relationships:
• Alice and Bob are actors.
• Alice acted in The Matrix.
• Bob acted in Inception.
Cypher Code (Neo4j Query Language):
CREATE
(alice:Person {name: "Alice", born: 1985}),
(bob:Person {name: "Bob", born: 1990}),
(matrix:Movie {title: "The Matrix", released: 1999}),
(inception:Movie {title: "Inception", released: 2010}),
(alice)-[:ACTED_IN {role: "Neo"}]->(matrix),
(bob)-[:ACTED_IN {role: "Cobb"}]->(inception)
Visual Representation (Property Graph):
(:Person {name: "Alice", born: 1985})
└──[:ACTED_IN {role: "Neo"}]──> (:Movie {title: "The Matrix", released: 1999})
(:Person {name: "Bob", born: 1990})
└──[:ACTED_IN {role: "Cobb"}]──> (:Movie {title: "Inception", released: 2010})
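The same property graph can be written out as plain data to show where labels, properties, relationship types, and direction each live. The nodes and relationships layout and the actedIn function are a hypothetical sketch, not how Neo4j stores or queries data; the Cypher pattern in the comment is the kind of query the function imitates.

```javascript
// The movie graph as plain data (a sketch, not Neo4j's storage format).
const nodes = {
  alice:     { labels: ["Person"], props: { name: "Alice", born: 1985 } },
  bob:       { labels: ["Person"], props: { name: "Bob", born: 1990 } },
  matrix:    { labels: ["Movie"],  props: { title: "The Matrix", released: 1999 } },
  inception: { labels: ["Movie"],  props: { title: "Inception", released: 2010 } }
};

// Relationships are directed (from -> to), typed, and carry their own properties.
const relationships = [
  { from: "alice", type: "ACTED_IN", to: "matrix",    props: { role: "Neo" } },
  { from: "bob",   type: "ACTED_IN", to: "inception", props: { role: "Cobb" } }
];

// Roughly: MATCH (p:Person)-[r:ACTED_IN]->(m:Movie) RETURN p.name, r.role, m.title
function actedIn() {
  return relationships
    .filter(r => r.type === "ACTED_IN"
      && nodes[r.from].labels.includes("Person")
      && nodes[r.to].labels.includes("Movie"))
    .map(r => ({
      name: nodes[r.from].props.name,
      role: r.props.role,
      title: nodes[r.to].props.title
    }));
}

console.log(actedIn());
```

Note that the role belongs to the relationship, not to the person or the movie: the same actor could have a different role in another film.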
Use Case                 Example
Social Networks          Users, friendships, likes
Recommendation Systems   Products, ratings, purchases
Fraud Detection          Accounts, transactions, devices
Knowledge Graphs         Concepts, definitions, relationships
Network/IT Management    Devices, connections, dependencies
Neo4j Interfaces and Distributed System Characteristics:
Besides Cypher, Neo4j has other interfaces that can be used to create, retrieve, and update
nodes and relationships in a graph database. It also has two main versions: the enterprise
edition, which comes with additional capabilities, and the community edition.
■ Enterprise edition vs. community edition. Both editions support the Neo4j graph data
model and storage system, as well as the Cypher graph query language and several other
interfaces, including a high-performance native API, language drivers for several popular
programming languages (such as Java, Python, and PHP), and a REST (Representational State
Transfer) API. In addition, both editions support ACID properties. The enterprise edition
supports additional features for enhancing performance, such as caching, clustering of data,
and locking.
■ Graph visualization interface.
Neo4j has a graph visualization interface, so that a subset of the nodes and edges in a database
graph can be displayed as a graph. This tool can be used to visualize query results in a graph
representation.
■ Master-slave replication:
Neo4j can be configured on a cluster of distributed system nodes (computers), where one node
is designated the master node. The data and indexes are fully replicated on each node in the
cluster. Various ways of synchronizing the data between master and slave nodes can be
configured in the distributed cluster.
■ Caching.
A main memory cache can be configured to store the graph data for improved performance.
■ Logical logs. Logs can be maintained to recover from failures.
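How a log supports recovery can be sketched by recording each operation before applying it and replaying the log after a failure. The log format, the operation kinds, and the function names below are hypothetical and are not Neo4j's actual log layout.

```javascript
// Sketch of log-based recovery (hypothetical format, not Neo4j's log layout).
const log = [];   // stands in for a durable, append-only log on disk

// Apply one logged operation to an in-memory graph state.
function apply(state, op) {
  if (op.kind === "createNode") {
    state.nodes[op.id] = op.props;
  } else if (op.kind === "createRel") {
    state.rels.push({ from: op.from, to: op.to, type: op.type });
  } else {
    throw new Error(`unknown operation: ${op.kind}`);
  }
}

// Record the operation first, then change the in-memory state.
function logAndApply(state, op) {
  log.push(JSON.stringify(op));
  apply(state, op);
}

// After a crash, replay the log from the start to rebuild the state.
function recover() {
  const state = { nodes: {}, rels: [] };
  for (const line of log) apply(state, JSON.parse(line));
  return state;
}

const live = { nodes: {}, rels: [] };
logAndApply(live, { kind: "createNode", id: "alice", props: { name: "Alice" } });
logAndApply(live, { kind: "createNode", id: "matrix", props: { title: "The Matrix" } });
logAndApply(live, { kind: "createRel", from: "alice", to: "matrix", type: "ACTED_IN" });

const rebuilt = recover();   // pretend the in-memory state was lost
console.log(JSON.stringify(rebuilt) === JSON.stringify(live));
```

Writing to the log before touching the state is what makes replay safe: any change visible in memory is guaranteed to be in the log as well.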