NoSql Mod 1 C
i) Persistence
Persistent Data Storage: Databases store large amounts of data persistently, ensuring data
is retained even after power loss or system failures, unlike volatile main memory.
Main Memory Limitations: Main memory is limited in space and loses data during
power outages or operating system issues, so data is stored on a backing store for persistence.
Backing Store Organization: Backing stores can be organized as files in a file system for
productivity applications or as databases in enterprise applications, which offer more
flexibility in managing large datasets.
ii) Concurrency
Coordinating Concurrent Access: Many users and applications read and modify the same
data at the same time; databases coordinate this concurrent access through transactions so
that conflicting updates do not corrupt the data.
Error Handling with Transactions: Transactions also aid in error handling, allowing
changes to be rolled back if an error occurs during the process, ensuring data integrity and
consistency.
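A minimal sketch of this rollback behavior, using Python's standard sqlite3 module; the
accounts table and the transfer scenario are purely illustrative:

import sqlite3

# Illustrative schema; the table and its contents are not from the text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 500), (2, 500)")
conn.commit()

try:
    # Two updates that must succeed or fail together.
    conn.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 2")
    conn.commit()      # make both changes durable at once
except sqlite3.Error:
    conn.rollback()    # on error, undo every change made in the transaction
    raise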
iii) Integration
Inter-application Collaboration: Enterprise applications often need to collaborate and
share data across different teams, requiring updates made by one application to be visible to
others, which can be challenging across organizational boundaries.
Relational databases have succeeded because they provide the core benefits we outlined
earlier in a (mostly) standard way. As a result, developers and database professionals can
learn the basic relational model and apply it in many projects. Although there are differences
between different relational databases, the core mechanisms remain the same: different
vendors’ SQL dialects are similar, and transactions operate in mostly the same way.
1. Impedance Mismatch
For application developers, the biggest frustration has been the impedance mismatch: the
difference between the relational model and the in-memory data structures. The
relational data model organizes data into a structure of tables and rows, or more
properly, relations and tuples. In the relational model, a tuple is a set of name-value pairs
and a relation is a set of tuples.
Relational Model Limitations: The relational data model organizes data into relations
and tuples, which must consist of simple values without nested records or lists, limiting
its expressiveness compared to in-memory data structures that can contain more complex
and nested structures.
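To make the mismatch concrete, here is a small sketch (the field and table names are invented
for the example) of the same order held as a nested in-memory structure and as the flat rows a
relational schema would require:

# In memory: one nested structure, which is natural for application code.
order = {
    "id": 99,
    "customer": "Martin",
    "lineItems": [
        {"product": "NoSQL Distilled", "price": 32.45},
        {"product": "Refactoring", "price": 41.00},
    ],
}

# In a relational store: the nesting must be flattened into separate tables
# of simple values, linked back together by foreign keys.
order_rows = [(99, "Martin")]                      # orders(id, customer)
line_item_rows = [(99, "NoSQL Distilled", 32.45),  # lineItems(orderId, product, price)
                  (99, "Refactoring", 41.00)]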
Object-Oriented Databases: In the 1990s, many believed that relational databases would
be replaced by object-oriented databases that mirrored in-memory data structures to disk,
driven by the rise of object-oriented programming languages. However, object-oriented
databases did not gain long-term success.
Relational Databases' Strength: Relational databases continued to thrive due to their role
as an integration mechanism, supported by the standard SQL language and a growing
professional divide between application developers and database administrators.
Object-Relational Mapping (ORM): ORM frameworks like Hibernate and iBATIS have
helped bridge the gap between in-memory data and relational databases by automating
the translation between the two, though this approach can sometimes lead to
performance issues if developers ignore database-specific considerations.
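As a rough illustration of what an ORM does, here is a minimal sketch in Python using
SQLAlchemy (assuming version 1.4 or later); Hibernate and iBATIS play the analogous role in
Java, and the class and table names below are invented for the example:

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Customer(Base):
    __tablename__ = "customers"          # the ORM maps this class to a table
    id = Column(Integer, primary_key=True)
    name = Column(String)

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)         # generate the relational schema

with Session(engine) as session:
    session.add(Customer(name="Martin")) # work with objects in memory...
    session.commit()                     # ...and let the ORM emit the SQL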
Application and Integration Databases
1. Relational Databases and Integration: The primary reason relational databases
triumphed over object-oriented databases is that SQL served as an effective
integration mechanism between applications, allowing multiple teams to work on a
consistent set of data within a shared database.
2. Downsides of Shared Database Integration: While shared databases improve
communication, they result in complex structures that serve multiple applications,
making it difficult to change data storage without affecting other systems, leading to
performance and coordination issues between teams.
3. Application Database Approach: In contrast to shared databases, an application
database is accessed by a single application and its team, simplifying database schema
management and allowing the application code to maintain data integrity, without
needing to coordinate with other applications.
4. Shift to Web Services: The 2000s saw a move toward web services (such as HTTP-
based communication) for integration, allowing applications to exchange richer data
structures such as nested records in XML or JSON, reducing the reliance on the flat
relational structures enforced by SQL (see the sketch after this list).
5. Flexibility with Application Databases: Using application databases decouples
internal data storage from external services, enabling teams to explore non-relational
databases without affecting external communication, while allowing application code
to handle security and integrity.
6. Text vs Binary Protocols: Text protocols over HTTP (like web services) are easier to
work with and are recommended unless high performance necessitates using binary
protocols, which should be carefully considered before use.
7. Relational Databases Still Prevail: Despite the flexibility offered by application
databases, most teams still preferred relational databases due to their familiarity,
reliability, and adequacy for most enterprise needs, although alternative databases
might have become more prevalent with time.
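A small sketch of the nested records mentioned in point 4; the fields are illustrative, and in
practice the JSON body would travel over HTTP between services:

import json

# A nested record of the kind a web service might return; nothing has to be
# flattened into relational rows on the wire.
response_body = json.dumps({
    "id": 99,
    "customer": {"id": 1, "name": "Martin"},
    "orderItems": [{"productId": 27, "price": 32.45}],
})

order = json.loads(response_body)       # the consumer gets the structure back intact
print(order["orderItems"][0]["price"])  # 32.45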
2. Scaling Up vs. Scaling Out: To cope with growing volumes of data and traffic, companies
faced two scaling options: scaling up by using larger, more powerful machines, or
scaling out by using many smaller, commodity machines in a cluster. Scaling up was
costly and limited, whereas scaling out was cheaper and provided better resilience, as
clusters could handle individual machine failures without compromising overall
system reliability.
4. Challenges with Sharding: One workaround was sharding, where data is split across
multiple database servers. However, this method required the application to track
which server held which data. Sharding also broke key relational database features
such as cross-shard querying, referential integrity, transactions, and consistency,
making data management cumbersome and prone to errors, often described by
practitioners as performing "unnatural acts" (a routing sketch follows this list).
5. Licensing and Cost Issues for Relational Databases: Another complication with
scaling out relational databases was the cost. Most relational databases were licensed
on a per-server basis, so running large clusters drastically increased licensing costs.
This led to complicated negotiations and frustrations within organizations trying to
scale using traditional relational database solutions.
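A sketch of the routing logic that manual sharding pushes into the application itself; the server
names and hashing scheme are illustrative:

import hashlib

# The application, not the database, has to know which server holds which data.
SHARDS = ["db-server-1", "db-server-2", "db-server-3"]   # hypothetical hosts

def shard_for(customer_id: int) -> str:
    digest = hashlib.md5(str(customer_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Every query must be routed by hand, and anything that spans two shards
# (joins, referential integrity, transactions) is no longer the database's job.
print(shard_for(1))   # one of the hosts above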
1. Origins of "NoSQL": The term "NoSQL" was initially used in the late
90s to refer to an open-source relational database developed by Carlo
Strozzi that didn’t use SQL. The database stored tables as ASCII files and
was manipulated through shell scripts rather than SQL queries. This early
NoSQL database had no direct influence on the NoSQL databases we
know today.
2. Rebirth of "NoSQL" in 2009: The modern use of "NoSQL" began at a
meetup in San Francisco in June 2009, organized by Johan Oskarsson.
Developers interested in alternative data storage systems, inspired by
Google's BigTable and Amazon's Dynamo, came together to share their
work. The name "NoSQL" was chosen as a hashtag-friendly, memorable
term by Eric Evans, though it was not intended to describe the entire
technology trend.
3. Lack of a Strong Definition for NoSQL: "NoSQL" quickly gained
popularity, but it has never had a universally accepted definition. Initially
described as open-source, distributed, non-relational databases, the term
now encompasses a broad range of databases with varying characteristics.
NoSQL databases are not strictly defined, but they share some common
traits.
4. Characteristics of NoSQL Databases: While NoSQL databases generally
don't use SQL, some offer SQL-like query languages (e.g., Cassandra's
CQL). Most are open-source and designed to run on clusters, addressing
the limitations of relational databases in distributed environments.
However, not all NoSQL databases are built for clusters, as some, like
graph databases, focus on handling complex data relationships.
5. Schema-less and Flexible Data Models: NoSQL databases operate
without a fixed schema, allowing for the easy addition of fields to records
without modifying the structure of the database. This flexibility is
particularly beneficial for handling non-uniform data and custom fields,
which are cumbersome to manage in relational databases.
6. Polyglot Persistence: The rise of NoSQL has led to the concept of
polyglot persistence, where organizations use different data storage
technologies based on their specific needs. Instead of relying solely on
relational databases, teams can choose the most appropriate database type
depending on the nature of the data and how it needs to be accessed and
manipulated.
7. Shift to Application Databases: NoSQL databases are seen primarily as
application databases, rather than integration databases, which are
commonly used in traditional systems. This shift toward encapsulating
data in services is viewed as beneficial for modern application
development, even when NoSQL isn't the chosen data storage option.
8. Addressing Big Data and Clustered Environments: The primary
motivation for NoSQL's growth was the need to handle large-scale data in
distributed, clustered environments. Relational databases, which rely on
ACID transactions for consistency, struggle to scale efficiently in such
settings, whereas NoSQL databases offer more flexible approaches to data
consistency and distribution.
9. Improving Developer Productivity: Another major reason for adopting
NoSQL databases is to enhance developer productivity. By simplifying
data interaction and eliminating the "impedance mismatch" between
object-oriented programming and relational databases, NoSQL can
streamline application development, even in cases where the need to scale
is not a factor.
Relational Data Model: The relational data model, dominant in recent decades, organizes
data into tables (or relations), with rows (tuples) representing entities and columns containing
single values. Relationships between entities are established through references in these
columns.
NoSQL Data Models: NoSQL databases move away from the relational model and instead
use various data models, categorized into four types: key-value, document, column-family,
and graph. The first three share a characteristic known as aggregate orientation, which
influences how data is organized and stored.
Aggregates
1. Relational Model Simplicity: The relational model organizes data into tuples (rows),
which are simple structures that contain sets of values. It does not support nested
records or lists, and all operations are based on these tuples, maintaining a simple and
uniform approach to data manipulation.
At this point, an example may help explain what we’re talking about. Let’s assume we have
to build an e-commerce website; we are going to be selling items directly to customers over
the web, and we will have to store information about users, our product catalog, orders,
shipping addresses, billing addresses, and payment data. We can use this scenario to model
the data using a relational data store as well as NoSQL data stores and discuss their pros and
cons. For a relational database, we might start with the data model shown in Figure 2.1.
Viewed as aggregates instead, the same customer and order data might be represented like
this:
// in customers
{
  "id": 1,
  "name": "Martin",
  "billingAddress": [{"city": "Chicago"}]
}
// in orders
{
  "id": 99,
  "customerId": 1,
  "orderItems": [
    {
      "productId": 27,
      "price": 32.45,
      "productName": "NoSQL Distilled"
    }
  ],
  "shippingAddress": [{"city": "Chicago"}],
  "orderPayment": [
    {
      "ccinfo": "1000-1000-1000-1000",
      "txnId": "abelif879rft",
      "billingAddress": {"city": "Chicago"}
    }
  ]
}
In this model, we have two main aggregates: customer and order. We’ve used the black-
diamond composition marker in UML to show how data fits into the aggregation structure.
The customer contains a list of billing addresses; the order contains a list of order items, a
shipping address, and payments. The payment itself contains a billing address for that
payment.
A single logical address record appears three times in the example data, but instead of using
IDs it’s treated as a value and copied each time. This fits the domain where we would not
want the shipping address, nor the payment’s billing address, to change. In a relational
database, we would ensure that the address rows aren’t updated for this case, making a new
row instead. With aggregates, we can copy the whole address structure into the aggregate as
we need to.
The link between the customer and the order isn’t within either aggregate—it’s a
relationship between aggregates. Similarly, the link from an order item would cross into a
separate aggregate structure for products, which we haven’t gone into. We’ve shown the
product name as part of the order item here—this kind of denormalization is similar to the
tradeoffs with relational databases, but is more common with aggregates because we want to
minimize the number of aggregates we access during a data interaction.
The important thing to notice here isn’t the particular way we’ve drawn the aggregate
boundary so much as the fact that you have to think about accessing that data—and make that
part of your thinking when developing the application data model. Indeed we could draw our
aggregate boundaries differently, putting all the orders for a customer into the customer
aggregate (Figure 2.4 ).
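Since Figure 2.4 is not reproduced here, a sketch of what that alternative boundary might look
like, with the orders folded into the customer aggregate (the layout simply mirrors the earlier
JSON example):

// in customers
{
  "id": 1,
  "name": "Martin",
  "billingAddress": [{"city": "Chicago"}],
  "orders": [
    {
      "id": 99,
      "orderItems": [
        {"productId": 27, "price": 32.45, "productName": "NoSQL Distilled"}
      ],
      "shippingAddress": [{"city": "Chicago"}]
    }
  ]
}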
Like most things in modelling, there’s no universal answer for how to draw your aggregate
boundaries. It depends entirely on how you tend to manipulate your data. If you tend to
access a customer together with all of that customer’s orders at once, then you would prefer a
single aggregate. However, if you tend to focus on accessing a single order at a time, then
you should prefer having separate aggregates for each order. Naturally, this is very context-
specific; some applications will prefer one or the other, even within a single system, which is
exactly why many people prefer aggregate ignorance.
Key-Value and Document Data Models
3. Blurring of Lines: The distinction between key-value and document databases often
blurs in practice. Document databases may include ID fields for key-value lookups,
while key-value stores can incorporate metadata for indexing, illustrating the evolving
nature of these technologies.
5. Integration of Search Tools: Both key-value and document databases can enhance
querying by integrating search tools. For example, Riak features Solr-like search on
JSON or XML aggregates, showing how key-value stores can adopt document-like
capabilities for improved data retrieval.
Column-Family Stores
BigTable Influence: Google’s BigTable introduced a tabular structure with sparse
columns and no schema, which influenced databases like HBase and Cassandra.
However, it's more accurate to think of it as a two-level map rather than a traditional
table.
Column Stores vs. Column-Family: Pre-NoSQL column stores, like C-Store, used the
relational model but optimized for reading specific columns across many rows. In
contrast, BigTable-style databases abandon the relational model while still grouping
columns into "column families."
Wide vs. Skinny Rows: In Cassandra, rows can vary in column count, with "skinny"
rows having few columns (uniform across rows) and "wide" rows containing many
columns, which are often used to model lists. Each column can represent an element in a
wide row.
Sort Ordering in Columns: Column families can define a sort order for columns,
allowing efficient access to data ranges. This is particularly useful when constructing
composite keys, such as combining date and ID for ordered data retrieval.
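A sketch of the two-level map view of a column-family store, written as plain nested Python
structures; the row key, column families, and dates are illustrative:

# First level: row key -> row. Second level: column family -> columns -> values.
customers = {
    "customer:1": {                          # row key
        "profile": {                         # "skinny" columns, uniform across rows
            "name": "Martin",
            "city": "Chicago",
        },
        "orders": {                          # wide row: one column per order
            "2024-01-15:order-99": 32.45,    # composite column key (date:id) keeps columns sorted
            "2024-03-02:order-142": 18.00,
        },
    }
}

# Because column keys sort, a date range read is cheap, e.g. all January orders:
january = {col: val for col, val in customers["customer:1"]["orders"].items()
           if col.startswith("2024-01")}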
Aggregates and Access Patterns: Aggregates group data that is frequently accessed
together, such as a customer and their order history. However, different applications may
require separate access to related data, like processing orders independently, requiring
separate customer and order aggregates.
Relational vs. NoSQL for Complex Relationships: While relational databases are
useful for handling multiple relationships, they can become cumbersome and slow with
complex joins. Aggregate-oriented databases struggle with cross-aggregate operations,
suggesting that relational databases are better suited for highly relational data despite
performance challenges.
Graph Databases
Graph databases are an odd fish in the NoSQL pond. Most NoSQL databases were
inspired by the need to run on clusters, which led to aggregate-oriented data models of
large records with simple connections. Graph databases are motivated by a different
frustration with relational databases and thus have an opposite model—small records with
complex interconnections.
1. Graph Databases Structure: Graph databases store data as nodes connected by
edges, capturing complex relationships such as social networks or product
preferences. Here "graph" refers to this node-and-edge structure, not to charts or
histograms; it enables rich, interconnected data models.
2. Variation in Graph Data Models: Different graph databases handle node and edge
data differently. For instance, FlockDB stores only nodes and edges without attributes,
while Neo4J and Infinite Graph allow attaching Java objects or properties to nodes
and edges, offering more flexibility.
3. Efficient Relationship Navigation: Graph databases excel at efficiently traversing
relationships, especially for highly connected data models. Unlike relational
databases, where foreign key joins can be slow, graph databases optimize traversal
during data insertion, improving query performance.
4. Querying in Graph Databases: Queries in graph databases typically start with a node
lookup (e.g., using an ID) and then traverse edges to explore relationships. These
queries focus on navigating the graph rather than performing complex joins as in
relational databases (see the sketch after this list).
5. Focus on Relationships: Graph databases differ from aggregate-oriented databases by
emphasizing relationships over aggregates. Their data model, which supports ACID
transactions across nodes and edges, often leads to single-server architectures rather
than distributed clusters.
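A toy sketch of the query style described in point 4: look up a starting node, then walk its
edges. The graph, edge labels, and names are invented for the example:

# An adjacency structure standing in for a graph store: node -> list of (edge label, node).
graph = {
    "martin": [("FRIEND", "pramod"), ("LIKES", "nosql_distilled")],
    "pramod": [("LIKES", "nosql_distilled"), ("LIKES", "refactoring")],
}

def traverse(start, edge_label):
    """Start from a known node and follow edges of one type."""
    return [node for label, node in graph.get(start, []) if label == edge_label]

# "What do Martin's friends like?" is two hops along edges, not a multi-table join.
for friend in traverse("martin", "FRIEND"):
    print(friend, "likes", traverse(friend, "LIKES"))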
Schemaless Databases
1. Schemaless Data Storage: Unlike relational databases that require a predefined
schema, NoSQL databases allow for storing data without a strict structure. This
enables flexibility in storing and changing data without predefined rules.
4. Adapting to Change: Schemaless databases make it easier to adapt the data model as
requirements evolve, allowing developers to add new data or remove unnecessary
fields without affecting the existing data.
5. Implicit Schema in Code: While NoSQL databases may not enforce a schema,
application code typically assumes a certain structure, such as field names and data
types. This creates an implicit schema within the code (see the sketch after this list).
8. Relational Database Flexibility: Despite the criticism, relational databases are not
entirely inflexible. Their schemas can be changed using SQL commands, and
nonuniform data can be stored by creating new columns on the fly when necessary.
9. Controlled Data Changes: Both relational and schemaless databases require careful
planning when making changes to the data structure over time. Managing data
migration in NoSQL databases can be just as complex as in relational systems.
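A small sketch of the implicit schema mentioned in point 5: the store accepts records of
different shapes, but the application code still assumes certain fields exist (the fields are
illustrative):

# Two "schemaless" records in the same collection; the store accepts both.
docs = [
    {"id": 1, "name": "Martin", "city": "Chicago"},
    {"id": 2, "name": "Pramod", "preferredShipping": "express"},  # extra field, no migration
]

# The application code carries an implicit schema: it assumes 'name' always
# exists and treats 'city' as optional.
for doc in docs:
    label = doc["name"]                 # breaks if a record ever lacks 'name'
    city = doc.get("city", "unknown")   # defensive read of an optional field
    print(label, city)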
Materialized Views
Aggregate-Oriented Disadvantage: While aggregate-oriented models work well for
accessing all data within a single aggregate, they can be inefficient for queries that require
data from across multiple aggregates, such as calculating total sales of a product over time.
Views and Materialized Views: Views in relational databases are computed over base tables
and can simplify complex queries. Materialized views are precomputed and cached versions
of views, useful for read-heavy scenarios where some data staleness is acceptable.
NoSQL and Materialized Views: NoSQL databases do not have views but can implement
precomputed and cached queries, often referred to as materialized views. These are crucial
for queries that don’t align well with the aggregate structure, often built using map-reduce
techniques.
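A sketch of that idea in plain Python: a map step emits (product, revenue) pairs from each
order aggregate, and a reduce step folds them into a precomputed total-sales view; the data is
illustrative:

from collections import defaultdict

orders = [
    {"id": 99, "orderItems": [{"productName": "NoSQL Distilled", "price": 32.45}]},
    {"id": 100, "orderItems": [{"productName": "NoSQL Distilled", "price": 32.45},
                               {"productName": "Refactoring", "price": 41.00}]},
]

def map_order(order):
    # Emit one (product, revenue) pair per line item.
    for item in order["orderItems"]:
        yield item["productName"], item["price"]

def reduce_sales(pairs):
    totals = defaultdict(float)
    for product, price in pairs:
        totals[product] += price
    return dict(totals)

# The result is saved as a precomputed view and refreshed eagerly or in a batch job.
sales_view = reduce_sales(pair for order in orders for pair in map_order(order))
print(sales_view)   # e.g. {'NoSQL Distilled': 64.9, 'Refactoring': 41.0}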
Eager vs. Batch Updates: Materialized views can be updated eagerly (as soon as base data
is updated) or in batch jobs at regular intervals. Eager updates are preferable when fresh data
is needed frequently, while batch updates work when some staleness is acceptable.
Building Materialized Views: Materialized views can be computed within the database or
outside by running jobs to compute and save them back. Databases often support these views
by allowing you to define computations that are triggered based on certain parameters.