NoSql Mod 1 C


NoSql

The Value of Relational Databases

i) Getting at Persistent Data

• Persistent Data Storage: Databases store large amounts of data persistently, ensuring data
is retained even after power loss or system failures, unlike volatile main memory.

• Memory Structure: Computer systems typically have two types of memory: fast,
volatile “main memory”, and slower, larger “backing stores” (such as disks or persistent
memory) that retain data over time.

• Main Memory Limitations: Main memory is limited in space and loses data during
power outages or operating system issues, so data is stored on a backing store for persistence.

• Backing Store Organization: Backing stores can be organized as files in a file system for
productivity applications or as databases in enterprise applications, which offer more
flexibility in managing large datasets.

• Database Advantages: Databases allow efficient storage, retrieval, and management of
data, enabling applications to quickly access small portions of the overall dataset when
needed.

ii) Concurrency

• Concurrency Management: In enterprise applications, multiple users may access and
modify the same data simultaneously, leading to potential conflicts like double booking.
Managing concurrency is complex, but relational databases help by controlling access
through transactions.

• Transactional Control: Transactions in relational databases ensure coordinated
interactions with data, containing the complexity of concurrency and reducing errors that
arise from simultaneous data modifications.

• Error Handling with Transactions: Transactions also aid in error handling, allowing
changes to be rolled back if an error occurs during the process, ensuring data integrity and
consistency.
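
The double-booking and rollback points above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the `seats` and `booking_log` tables and the `book_seat` function are hypothetical names, not part of any particular system.

```python
import sqlite3

def make_db():
    """Create an in-memory database with one free seat (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE seats (id INTEGER PRIMARY KEY, passenger TEXT)")
    conn.execute("CREATE TABLE booking_log (seat_id INTEGER, passenger TEXT)")
    conn.execute("INSERT INTO seats (id, passenger) VALUES (1, NULL)")
    conn.commit()
    return conn

def book_seat(conn, seat_id, passenger):
    """Try to book a seat; return True on success, False if it is taken."""
    try:
        # "with conn" commits if the block succeeds and rolls back if it raises.
        with conn:
            conn.execute(
                "INSERT INTO booking_log (seat_id, passenger) VALUES (?, ?)",
                (seat_id, passenger),
            )
            cur = conn.execute(
                "UPDATE seats SET passenger = ? WHERE id = ? AND passenger IS NULL",
                (passenger, seat_id),
            )
            if cur.rowcount != 1:
                # Seat already taken: raising also undoes the log insert above.
                raise RuntimeError("seat already booked")
        return True
    except RuntimeError:
        return False

conn = make_db()
first = book_seat(conn, 1, "Ann")    # succeeds
second = book_seat(conn, 1, "Bob")   # rejected, and its log row is rolled back
log_rows = conn.execute("SELECT COUNT(*) FROM booking_log").fetchone()[0]
owner = conn.execute("SELECT passenger FROM seats WHERE id = 1").fetchone()[0]
```

Both statements sit in one transaction, so the failed second booking leaves no partial change behind.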

iii) Integration

• Inter-application Collaboration: Enterprise applications often need to collaborate and
share data across different teams, requiring updates made by one application to be visible to
others, which can be challenging across organizational boundaries.

• Shared Database Integration: A common solution is using a shared database, where
multiple applications store and access data in a single database, benefiting from easy data
sharing and the database's concurrency control to manage simultaneous access.

iv) A (Mostly) Standard Model

Relational databases have succeeded because they provide the core benefits we outlined
earlier in a (mostly) standard way. As a result, developers and database professionals can
learn the basic relational model and apply it in many projects. Although there are differences
between different relational databases, the core mechanisms remain the same: different
vendors’ SQL dialects are similar, and transactions operate in mostly the same way.

Impedance Mismatch

For application developers, the biggest frustration has been the impedance mismatch: the
difference between the relational model and the in-memory data structures. The
relational data model organizes data into a structure of tables and rows, or more
properly, relations and tuples. In the relational model, a tuple is a set of name-value pairs
and a relation is a set of tuples.

Relational Model Limitations: The relational data model organizes data into relations
and tuples, which must consist of simple values without nested records or lists, limiting
its expressiveness compared to in-memory data structures that can contain more complex
and nested structures.

Object-Oriented Databases: In the 1990s, many believed that relational databases would
be replaced by object-oriented databases that mirrored in-memory data structures to disk,
driven by the rise of object-oriented programming languages. However, object-oriented
databases did not gain long-term success.

Relational Databases' Strength: Relational databases continued to thrive due to their role
as an integration mechanism, supported by the standard SQL language and a growing
professional divide between application developers and database administrators.

Object-Relational Mapping (ORM): ORM frameworks like Hibernate and iBATIS have
helped bridge the gap between in-memory data and relational databases by automating
the translation between the two, though this approach can sometimes lead to
performance issues if developers ignore database-specific considerations.
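
The mismatch described above can be made concrete with a small sketch. It flattens a nested in-memory order into flat tuples for two hypothetical tables and rebuilds it again, which is roughly the translation an ORM automates. The function names, the table layout and the second product are assumptions made for illustration only.

```python
def to_rows(order):
    """Flatten one nested order into flat tuples for two hypothetical tables."""
    order_row = (order["id"], order["customer"])
    item_rows = [
        (order["id"], item["product"], item["price"])
        for item in order["items"]
    ]
    return order_row, item_rows

def from_rows(order_row, item_rows):
    """Rebuild the nested structure from the flat rows."""
    order_id, customer = order_row
    return {
        "id": order_id,
        "customer": customer,
        "items": [
            {"product": product, "price": price}
            for (oid, product, price) in item_rows
            if oid == order_id
        ],
    }

# In memory the order is one nested structure (values are illustrative).
order = {
    "id": 99,
    "customer": "Martin",
    "items": [
        {"product": "NoSQL Distilled", "price": 32.45},
        {"product": "Refactoring", "price": 40.0},
    ],
}

# In the relational model it becomes several simple tuples with no nesting.
order_row, item_rows = to_rows(order)
rebuilt = from_rows(order_row, item_rows)
```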
Application and Integration Databases
1. Relational Databases and Integration: The primary reason relational databases
triumphed over object-oriented databases is that SQL served as an effective
integration mechanism between applications, allowing multiple teams to work on a
consistent set of data within a shared database.
2. Downsides of Shared Database Integration: While shared databases improve
communication, they result in complex structures that serve multiple applications,
making it difficult to change data storage without affecting other systems, leading to
performance and coordination issues between teams.
3. Application Database Approach: In contrast to shared databases, an application
database is accessed by a single application and its team, simplifying database schema
management and allowing the application code to maintain data integrity, without
needing to coordinate with other applications.
4. Shift to Web Services: The 2000s saw a move toward web services (such as HTTP-
based communication) for integration, allowing applications to exchange richer data
structures like nested records in XML or JSON, reducing the reliance on relational
structures enforced by SQL.
5. Flexibility with Application Databases: Using application databases decouples
internal data storage from external services, enabling teams to explore non-relational
databases without affecting external communication, while allowing application code
to handle security and integrity.
6. Text vs Binary Protocols: Text protocols over HTTP (like web services) are easier to
work with and are recommended unless high performance necessitates using binary
protocols, which should be carefully considered before use.
7. Relational Databases Still Prevail: Despite the flexibility offered by application
databases, most teams continued to choose relational databases for their familiarity,
reliability, and adequacy for most enterprise needs, although alternative databases
might have become more prevalent given time.
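
As a rough illustration of the text versus binary trade-off in item 6, the sketch below encodes the same small record both ways using Python's standard json and struct modules. The record and its binary layout are assumptions chosen for the example.

```python
import json
import struct

record = {"id": 99, "price": 32.45}

# Text protocol: human-readable and easy to debug.
text_payload = json.dumps(record).encode("utf-8")

# Binary protocol: "<id" means little-endian 4-byte int followed by 8-byte float.
binary_payload = struct.pack("<id", record["id"], record["price"])

decoded_text = json.loads(text_payload.decode("utf-8"))
decoded_id, decoded_price = struct.unpack("<id", binary_payload)
```

The binary form is smaller, but both sides must agree on the exact layout, which is part of why the text form is the usual default.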

Attack of the Clusters


1. Dot-Com Bubble Burst and Web Scale Growth: The early 2000s saw the collapse
of the dot-com bubble, leading to skepticism about the future of the internet. However,
this period also witnessed the dramatic growth of large web properties, which began
tracking detailed data such as social networks, user activity logs, and mapping data.
Alongside this, the number of users on these platforms surged, creating a need for
scalable solutions to handle increasing data and traffic.

2. Scaling Up vs. Scaling Out: To cope with the influx of data and users, companies
faced two scaling options: scaling up by using larger, more powerful machines, or
scaling out by using many smaller, commodity machines in a cluster. Scaling up was
costly and limited, whereas scaling out was cheaper and provided better resilience, as
clusters could handle individual machine failures without compromising overall
system reliability.

3. Relational Databases Struggle with Clusters: Traditional relational databases were
not designed to operate efficiently on clusters. Solutions like Oracle RAC or
Microsoft SQL Server used shared disk subsystems, which, while enabling some
cluster functionality, introduced a single point of failure. Clustering relational
databases was complex, and managing data across multiple machines created
significant technical challenges.

4. Challenges with Sharding: One workaround was sharding, where data is split across
multiple database servers. However, this method required the application to track
which server held which data. Sharding also broke key relational database features
such as cross-shard querying, referential integrity, transactions, and consistency,
making data management cumbersome and prone to errors, often described by
practitioners as performing "unnatural acts."

5. Licensing and Cost Issues for Relational Databases: Another complication with
scaling out relational databases was the cost. Most relational databases were licensed
on a per-server basis, so running large clusters drastically increased licensing costs.
This led to complicated negotiations and frustrations within organizations trying to
scale using traditional relational database solutions.

6. Google and Amazon's Move to Cluster-Friendly Databases: Companies like
Google and Amazon, faced with large-scale data needs and operating vast clusters,
began exploring alternatives to traditional relational databases. Both organizations
developed innovative systems (Google's BigTable and Amazon's Dynamo)
explicitly designed to work efficiently in clustered environments. Their technical
needs, combined with their success, spurred them to move away from the limitations
of relational databases.
7. Emergence of Cluster-Optimized Databases: Although Google and Amazon operate
at an extreme scale, the problems they faced began to resonate with a broader range of
organizations as more businesses started capturing and processing larger amounts of
data. As information about systems like BigTable and Dynamo leaked out, interest in
cluster-friendly, non-relational databases grew. This marked the beginning of a serious
challenge to the dominance of relational databases, as companies sought to adopt
similar architectures that better suited a clustered, data-intensive world.
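
The sharding workaround in item 4 puts the routing burden on the application. A minimal sketch of that idea follows, assuming a simple hash-based placement; the server names and key format are hypothetical.

```python
import hashlib

SHARDS = ["db-server-0", "db-server-1", "db-server-2"]  # hypothetical server names

def shard_for(key):
    """Map a record key to one shard using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]

def group_by_shard(keys):
    """A query touching many keys must be split by the application, per shard."""
    groups = {}
    for key in keys:
        groups.setdefault(shard_for(key), []).append(key)
    return groups

placement = group_by_shard(["customer:1", "customer:2", "customer:3", "customer:4"])
```

Nothing here gives back cross-shard queries, referential integrity or transactions; the application has to handle each of those itself.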

The Emergence of NoSQL

1. Origins of "NoSQL": The term "NoSQL" was initially used in the late
90s to refer to an open-source relational database developed by Carlo
Strozzi that didn’t use SQL. The database stored tables as ASCII files and
was manipulated through shell scripts rather than SQL queries. This early
NoSQL database had no direct influence on the NoSQL databases we
know today.
2. Rebirth of "NoSQL" in 2009: The modern use of "NoSQL" began at a
meetup in San Francisco in June 2009, organized by Johan Oskarsson.
Developers interested in alternative data storage systems, inspired by
Google's BigTable and Amazon's Dynamo, came together to share their
work. The name "NoSQL" was chosen as a hashtag-friendly, memorable
term by Eric Evans, though it was not intended to describe the entire
technology trend.
3. Lack of a Strong Definition for NoSQL: "NoSQL" quickly gained
popularity, but it has never had a universally accepted definition. Initially
described as open-source, distributed, non-relational databases, the term
now encompasses a broad range of databases with varying characteristics.
NoSQL databases are not strictly defined, but they share some common
traits.
4. Characteristics of NoSQL Databases: While NoSQL databases generally
don't use SQL, some offer SQL-like query languages (e.g., Cassandra's
CQL). Most are open-source and designed to run on clusters, addressing
the limitations of relational databases in distributed environments.
However, not all NoSQL databases are built for clusters, as some, like
graph databases, focus on handling complex data relationships.
5. Schema-less and Flexible Data Models: NoSQL databases operate
without a fixed schema, allowing for the easy addition of fields to records
without modifying the structure of the database. This flexibility is
particularly beneficial for handling non-uniform data and custom fields,
which are cumbersome to manage in relational databases.
6. Polyglot Persistence: The rise of NoSQL has led to the concept of
polyglot persistence, where organizations use different data storage
technologies based on their specific needs. Instead of relying solely on
relational databases, teams can choose the most appropriate database type
depending on the nature of the data and how it needs to be accessed and
manipulated.
7. Shift to Application Databases: NoSQL databases are seen primarily as
application databases, rather than integration databases, which are
commonly used in traditional systems. This shift toward encapsulating
data in services is viewed as beneficial for modern application
development, even when NoSQL isn't the chosen data storage option.
8. Addressing Big Data and Clustered Environments: The primary
motivation for NoSQL's growth was the need to handle large-scale data in
distributed, clustered environments. Relational databases, which rely on
ACID transactions for consistency, struggle to scale efficiently in such
settings, whereas NoSQL databases offer more flexible approaches to data
consistency and distribution.
9. Improving Developer Productivity: Another major reason for adopting
NoSQL databases is to enhance developer productivity. By simplifying
data interaction and eliminating the "impedance mismatch" between
object-oriented programming and relational databases, NoSQL can
streamline application development, even in cases where the need to scale
is not a factor.

Chapter 2. Aggregate Data Models


Data Model vs. Storage Model: A data model defines how we perceive and interact with data
in a database, while a storage model describes how the database internally stores and
manipulates that data. Ideally, users interact only with the data model, but some knowledge
of the storage model is necessary to ensure good performance.

Relational Data Model: The relational data model, dominant in recent decades, organizes
data into tables (or relations), with rows (tuples) representing entities and columns containing
single values. Relationships between entities are established through references in these
columns.

NoSQL Data Models: NoSQL databases move away from the relational model and instead
use various data models, categorized into four types: key-value, document, column-family,
and graph. The first three share a characteristic known as aggregate orientation, which
influences how data is organized and stored.

Aggregates
1. Relational Model Simplicity: The relational model organizes data into tuples (rows),
which are simple structures that contain sets of values. It does not support nested
records or lists, and all operations are based on these tuples, maintaining a simple and
uniform approach to data manipulation.

2. Aggregate Orientation: Aggregate orientation recognizes the need to operate on more
complex data structures than just tuples. These structures can include lists or nested
records, and they allow for more flexible data manipulation compared to the relational
model.

3. Definition of Aggregates: In Domain-Driven Design, an aggregate is a collection of
related objects treated as a single unit. This concept aligns with key-value, document,
and column-family databases, which store data in aggregate structures, allowing for
easier management of consistency and updates.

4. Benefits for Clusters and Programmers: Aggregates simplify operations in distributed
systems by providing a natural unit for replication and sharding. They also make it
easier for application programmers to work with data, as they often manipulate
complex data through these aggregate structures.

Example of Relations and Aggregates

At this point, an example may help explain what we’re talking about. Let’s assume we have
to build an e-commerce website; we are going to be selling items directly to customers over
the web, and we will have to store information about users, our product catalog, orders,
shipping addresses, billing addresses, and payment data. We can use this scenario to model
the data using a relational data store as well as NoSQL data stores and talk about their pros and
cons. For a relational database, we might start with a data model shown in Figure 2.1.

In aggregate-oriented terms, sample data for the same scenario might look like this:

// in customers
{
  "id": 1,
  "name": "Martin",
  "billingAddress": [{"city": "Chicago"}]
}

// in orders
{
  "id": 99,
  "customerId": 1,
  "orderItems": [
    {
      "productId": 27,
      "price": 32.45,
      "productName": "NoSQL Distilled"
    }
  ],
  "shippingAddress": [{"city": "Chicago"}],
  "orderPayment": [
    {
      "ccinfo": "1000-1000-1000-1000",
      "txnId": "abelif879rft",
      "billingAddress": {"city": "Chicago"}
    }
  ]
}

In this model, we have two main aggregates: customer and order. We’ve used the black-
diamond composition marker in UML to show how data fits into the aggregation structure.
The customer contains a list of billing addresses; the order contains a list of order items, a
shipping address, and payments. The payment itself contains a billing address for that
payment.

A single logical address record appears three times in the example data, but instead of using
IDs it’s treated as a value and copied each time. This fits the domain where we would not
want the shipping address, nor the payment’s billing address, to change. In a relational
database, we would ensure that the address rows aren’t updated for this case, making a new
row instead. With aggregates, we can copy the whole address structure into the aggregate as
we need to.

The link between the customer and the order isn’t within either aggregate—it’s a
relationship between aggregates. Similarly, the link from an order item would cross into a
separate aggregate structure for products, which we haven’t gone into. We’ve shown the
product name as part of the order item here—this kind of denormalization is similar to the
tradeoffs with relational databases, but is more common with aggregates because we want to
minimize the number of aggregates we access during a data interaction.

The important thing to notice here isn’t the particular way we’ve drawn the aggregate
boundary so much as the fact that you have to think about accessing that data—and make that
part of your thinking when developing the application data model. Indeed we could draw our
aggregate boundaries differently, putting all the orders for a customer into the customer
aggregate (Figure 2.4).

Like most things in modelling, there’s no universal answer for how to draw your aggregate
boundaries. It depends entirely on how you tend to manipulate your data. If you tend to
access a customer together with all of that customer’s orders at once, then you would prefer a
single aggregate. However, if you tend to focus on accessing a single order at a time, then
you should prefer having separate aggregates for each order. Naturally, this is very context-
specific; some applications will prefer one or the other, even within a single system, which is
exactly why many people prefer aggregate ignorance.
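
The two ways of drawing the boundary can be compared with plain Python dictionaries standing in for a data store. This is only a sketch: the dictionaries and function names are hypothetical, and the sample values follow the example data above.

```python
# Option A: customer and order are separate aggregates linked by customerId.
customers = {1: {"id": 1, "name": "Martin"}}
orders = {
    99: {"id": 99, "customerId": 1, "orderItems": [{"productId": 27, "price": 32.45}]}
}

def customer_with_orders_separate(customer_id):
    """Needs the customer lookup plus a scan (or index) over the orders."""
    customer = dict(customers[customer_id])
    customer["orders"] = [
        o for o in orders.values() if o["customerId"] == customer_id
    ]
    return customer

# Option B: all of a customer's orders live inside the customer aggregate.
customers_embedded = {
    1: {
        "id": 1,
        "name": "Martin",
        "orders": [{"id": 99, "orderItems": [{"productId": 27, "price": 32.45}]}],
    }
}

def customer_with_orders_embedded(customer_id):
    """One lookup returns the customer together with every order."""
    return customers_embedded[customer_id]

def find_order_embedded(order_id):
    """A single order now requires searching inside customer aggregates."""
    for customer in customers_embedded.values():
        for order in customer["orders"]:
            if order["id"] == order_id:
                return order
    return None
```

Option B makes "customer with all orders" a single read, while Option A makes "one order at a time" the cheap operation.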

Consequences of Aggregate Orientation

1. Aggregate Ignorance in Relational and Graph Databases: While relational
databases can model data relationships, they do not recognize aggregate entities, such
as an order comprising order items, a shipping address, and a payment. Because the
database has no knowledge of aggregates, it cannot optimize data storage and
distribution based on aggregate structures, which is why such databases are called
"aggregate-ignorant."

2. Modelling Techniques and Semantic Variability: Various data modelling
techniques have attempted to represent aggregate or composite structures. However,
modelers often fail to provide a clear semantic distinction for aggregates, leading to
varied interpretations of what constitutes an aggregate relationship.
3. Aggregate Structure Utility: Aggregate-oriented databases provide a clearer
semantic focus on how data is used in applications, emphasizing the unit of interaction
with data storage. However, the lack of aggregate awareness in relational and graph
databases allows for greater flexibility in data analysis and manipulation, as different
relationships can be explored without being constrained by aggregate definitions.

4. Benefits for Cluster Performance: Aggregate orientation is particularly beneficial
for cluster performance, a key advantage of NoSQL systems. By explicitly defining
aggregates, databases can better optimize data distribution and minimize the number
of nodes queried when retrieving related data, enhancing efficiency.

5. Transaction Handling Differences: Relational databases support ACID transactions
across multiple tables, allowing complex operations that maintain atomicity,
consistency, isolation, and durability. In contrast, aggregate-oriented databases
typically focus on atomic manipulation of individual aggregates, requiring
application-level management for transactions spanning multiple aggregates.

6. Complexity of Consistency: The assertion that NoSQL databases sacrifice
consistency due to a lack of ACID support is an oversimplification. While aggregate-
oriented databases may not maintain ACID transactions across aggregates, they often
allow for atomic operations within single aggregates. Additionally, graph and other
aggregate-ignorant databases can support ACID transactions similar to relational
databases, making the topic of consistency more nuanced than just ACID compliance.
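
Items 5 and 6 can be illustrated with a toy in-memory store in which each write replaces one whole aggregate, so a change spanning two aggregates has to be compensated by the application. `AggregateStore` and `move_item` are hypothetical names, and the failure is simulated.

```python
import copy

class AggregateStore:
    """Toy in-memory store: each put replaces one whole aggregate at once."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return copy.deepcopy(self._data[key])

    def put(self, key, aggregate):
        self._data[key] = copy.deepcopy(aggregate)

def move_item(store, from_key, to_key, item, fail_midway=False):
    """Update two aggregates; undo the first write by hand if the second fails."""
    source = store.get(from_key)
    original = copy.deepcopy(source)
    source["items"].remove(item)
    store.put(from_key, source)          # first aggregate updated on its own
    try:
        if fail_midway:
            raise RuntimeError("simulated failure before the second write")
        target = store.get(to_key)
        target["items"].append(item)
        store.put(to_key, target)        # second aggregate updated on its own
        return True
    except RuntimeError:
        store.put(from_key, original)    # application-level compensation
        return False

store = AggregateStore()
store.put("order:1", {"items": ["book"]})
store.put("order:2", {"items": []})
failed = move_item(store, "order:1", "order:2", "book", fail_midway=True)
after_failure = store.get("order:1")["items"]
succeeded = move_item(store, "order:1", "order:2", "book")
```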

Key-Value and Document Data Models


1. Key-Value vs. Document Databases: Key-value and document databases are
aggregate-oriented but handle data differently. Key-value databases treat aggregates as
opaque blobs, lacking inherent structure, while document databases utilize the internal
structure, facilitating more complex interactions.
2. Access and Flexibility: Key-value stores depend on direct key lookups, limiting
querying to simple data retrievals. Document databases enable complex queries based
on internal fields, allowing for specific data retrieval and the creation of indexes for
efficient searching.

3. Blurring of Lines: The distinction between key-value and document databases often
blurs in practice. Document databases may include ID fields for key-value lookups,
while key-value stores can incorporate metadata for indexing, illustrating the evolving
nature of these technologies.

4. Enhanced Query Capabilities: Document databases provide advanced querying
capabilities compared to key-value stores, allowing searches that involve multiple
fields beyond just the aggregate's key. This functionality makes them more suitable
for applications needing intricate data interactions.

5. Integration of Search Tools: Both key-value and document databases can enhance
querying by integrating search tools. For example, Riak features Solr-like search on
JSON or XML aggregates, showing how key-value stores can adopt document-like
capabilities for improved data retrieval.
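
The difference between an opaque aggregate and a visible one can be sketched with two toy classes. Both are hypothetical in-memory stand-ins rather than any real product's API, and the second order is made up for illustration.

```python
import json

class KeyValueStore:
    """Stores each aggregate as an opaque blob; only lookup by key is possible."""

    def __init__(self):
        self._blobs = {}

    def put(self, key, aggregate):
        self._blobs[key] = json.dumps(aggregate).encode("utf-8")

    def get(self, key):
        return json.loads(self._blobs[key].decode("utf-8"))

class DocumentStore:
    """Sees inside each aggregate, so it can also query on fields."""

    def __init__(self):
        self._docs = {}

    def put(self, key, document):
        self._docs[key] = document

    def get(self, key):
        return self._docs[key]

    def find(self, field, value):
        return [d for d in self._docs.values() if d.get(field) == value]

orders = [
    {"id": 99, "customerId": 1, "city": "Chicago"},
    {"id": 100, "customerId": 2, "city": "Boston"},
]
kv = KeyValueStore()
docs = DocumentStore()
for order in orders:
    kv.put(str(order["id"]), order)
    docs.put(order["id"], order)

chicago_orders = docs.find("city", "Chicago")  # no equivalent on KeyValueStore
```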
Column-Family Stores
BigTable Influence: Google’s BigTable introduced a tabular structure with sparse
columns and no schema, which influenced databases like HBase and Cassandra.
However, it's more accurate to think of it as a two-level map rather than a traditional
table.

Column Stores vs. Column-Family: Pre-NoSQL column stores, like C-Store, used the
relational model but optimized for reading specific columns across many rows. In
contrast, BigTable-style databases abandon the relational model while still grouping
columns into "column families."

Column-Family Model: Column-family databases are a two-level aggregate structure.
The first key, often a row identifier, retrieves a map of detailed values called columns,
allowing access to specific fields within the aggregate.

Efficient Access: Column-family databases enable efficient querying by allowing
retrieval of specific columns within a row, making them useful for scenarios where
detailed, field-level access is necessary.

Column Families Organization: Column-family databases organize columns into
column families, with the assumption that data within the same family will be accessed
together. Each column belongs to a single family, and these families represent chunks of
data within a row or record.

Row-Oriented vs. Column-Oriented: Column-family databases can be thought of in
two ways: row-oriented (rows as aggregates, with column families representing data
chunks) or column-oriented (column families defining record types with rows joining
across families).

Wide vs. Skinny Rows: In Cassandra, rows can vary in column count, with "skinny"
rows having few columns (uniform across rows) and "wide" rows containing many
columns, which are often used to model lists. Each column can represent an element in a
wide row.

Sort Ordering in Columns: Column families can define a sort order for columns,
allowing efficient access to data ranges. This is particularly useful when constructing
composite keys, such as combining date and ID for ordered data retrieval.
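
The two-level map and the sorted composite column names described above can be sketched with nested dictionaries. The row key, family names, dates and payloads below are illustrative assumptions, not data from any real system.

```python
# row key -> column family -> {column name: value}
rows = {
    "1234": {
        "profile": {"name": "Martin", "billingAddress": "Chicago"},
        # A "wide" family: one column per order, named date + order ID.
        "orders": {
            "20240105-OD99": "payload-99",
            "20240120-OD100": "payload-100",
            "20240302-OD101": "payload-101",
        },
    }
}

def get_column(row_key, family, column):
    """First key selects the row aggregate, then family and column select a field."""
    return rows[row_key][family][column]

def column_range(row_key, family, start, end):
    """Return columns whose names fall in [start, end), in sorted order."""
    columns = rows[row_key][family]
    return [(name, columns[name]) for name in sorted(columns) if start <= name < end]

january_orders = column_range("1234", "orders", "20240101", "20240201")
```

Because the date leads the column name, a range over names is also a range over time.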

Chapter 3. More Details on Data Model


Relationships

Aggregates and Access Patterns: Aggregates group data that is frequently accessed
together, such as a customer and their order history. However, some applications need
to access related data separately, for example processing orders independently of
customers, which calls for separate customer and order aggregates.

Linking Aggregates: To maintain relationships between separate aggregates (e.g.,
customer and order), databases often embed one aggregate's ID in another, like placing
the customer ID in the order data. This method allows for retrieval of related aggregates
through multiple database calls, though the database remains unaware of these
relationships.

Database-Supported Links: Some databases, even key-value stores, provide ways to
make these relationships visible. Document stores allow indexing and querying based on
aggregate content, while Riak, a key-value store, lets you put link information in
metadata, supporting partial retrieval and link walking.

Atomicity and Updates: Aggregate-oriented databases treat each aggregate as a unit of
data retrieval and only support atomic operations within a single aggregate. Updating
multiple aggregates requires managing failures yourself, unlike relational databases that
offer ACID guarantees for multi-record transactions.

Relational vs. NoSQL for Complex Relationships: While relational databases are
useful for handling multiple relationships, they can become cumbersome and slow with
complex joins. Aggregate-oriented databases struggle with cross-aggregate operations,
suggesting that relational databases are better suited for highly relational data despite
performance challenges.
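
The ID-based linking described under "Linking Aggregates" can be sketched as two separate lookups. The dictionaries stand in for a store that knows nothing about the link, so a dangling reference goes unnoticed unless the application checks for it. All names and the second order are hypothetical.

```python
customers = {1: {"id": 1, "name": "Martin"}}
orders = {
    99: {"id": 99, "customerId": 1},
    100: {"id": 100, "customerId": 7},  # refers to a customer that does not exist
}

def get_order_with_customer(order_id):
    """Two separate calls; the store itself does not know the two are related."""
    order = orders[order_id]                        # first database call
    customer = customers.get(order["customerId"])   # second database call
    return order, customer

def dangling_orders():
    """Without referential integrity, the application must find broken links."""
    return sorted(
        oid for oid, o in orders.items() if o["customerId"] not in customers
    )
```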

Graph Databases
Graph databases are an odd fish in the NoSQL pond. Most NoSQL databases were
inspired by the need to run on clusters, which led to aggregate-oriented data models of
large records with simple connections. Graph databases are motivated by a different
frustration with relational databases and thus have an opposite model: small records with
complex interconnections.
1. Graph Databases Structure: Graph databases focus on storing data as nodes
connected by edges, capturing complex relationships such as social networks or
product preferences. Unlike charts or histograms, graph databases enable rich,
interconnected data models.
2. Variation in Graph Data Models: Different graph databases handle node and edge
data differently. For instance, FlockDB stores only nodes and edges without attributes,
while Neo4J and Infinite Graph allow attaching Java objects or properties to nodes
and edges, offering more flexibility.
3. Efficient Relationship Navigation: Graph databases excel at efficiently traversing
relationships, especially for highly connected data models. Unlike relational
databases, where foreign key joins can be slow, graph databases shift most of the work
of navigating relationships from query time to insert time, improving query performance.
4. Querying in Graph Databases: Queries in graph databases typically start with a node
lookup (e.g., using an ID) and then traverse edges to explore relationships. These
queries focus on navigating the graph rather than performing complex joins like in
relational databases.
5. Focus on Relationships: Graph databases differ from aggregate-oriented databases by
emphasizing relationships over aggregates. Their data model, which supports ACID
transactions across nodes and edges, often leads to single-server architectures rather
than distributed clusters.
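
The node-lookup-then-traverse style of query in item 4 can be sketched with a small adjacency map. The people, books and edge labels are made up for illustration, and the traversal is a plain breadth-first search rather than any graph database's query language.

```python
from collections import deque

# Hypothetical graph: node -> list of (edge label, neighbor)
edges = {
    "Anna": [("friend", "Barbara"), ("likes", "NoSQL Distilled")],
    "Barbara": [("friend", "Carol"), ("likes", "Refactoring")],
    "Carol": [("likes", "NoSQL Distilled")],
}

def neighbors(node, label):
    """Follow edges with the given label out of one node."""
    return [target for (edge_label, target) in edges.get(node, []) if edge_label == label]

def friends_within(start, max_hops):
    """Breadth-first traversal along 'friend' edges, up to max_hops away."""
    seen = {start}
    found = []
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue
        for friend in neighbors(node, "friend"):
            if friend not in seen:
                seen.add(friend)
                found.append(friend)
                queue.append((friend, hops + 1))
    return found

def liked_by_friends(start):
    """Start at one node, hop to friends, then follow their 'likes' edges."""
    return sorted(
        {book for friend in neighbors(start, "friend") for book in neighbors(friend, "likes")}
    )
```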
Schemaless Databases
1. Schemaless Data Storage: Unlike relational databases that require a predefined
schema, NoSQL databases allow for storing data without a strict structure. This
enables flexibility in storing and changing data without predefined rules.

2. Flexible Data Structure: In NoSQL databases, key-value stores, document databases,
column-family databases, and graph databases allow for dynamic data structures,
letting you freely add new columns, edges, or properties without schema constraints.

3. Advantages of Schemalessness: Schemaless databases provide freedom to store
nonuniform data, allowing different records to have different fields without predefined
constraints. This avoids issues like sparse tables or unnecessary columns.

4. Adapting to Change: Schemaless databases make it easier to adapt the data model as
requirements evolve, allowing developers to add new data or remove unnecessary
fields without affecting the existing data.

5. Implicit Schema in Code: While NoSQL databases may not enforce a schema,
application code typically assumes a certain structure, such as field names and data
types. This creates an implicit schema within the code.

6. Challenges of Implicit Schemas: Having an implicit schema in application code can
lead to problems. Understanding the data structure requires reading the application
code, and the database cannot optimize data storage or retrieval based on schema
knowledge.

7. Inconsistent Data Management: Without a formal schema, there is a risk of
inconsistency when multiple applications access the same database. The lack of
schema validation by the database can result in different applications handling the
same data in conflicting ways.

8. Relational Database Flexibility: Despite the criticism, relational databases are not
entirely inflexible. Their schemas can be changed using SQL commands, and
nonuniform data can be stored by creating new columns on the fly when necessary.

9. Controlled Data Changes: Both relational and schemaless databases require careful
planning when making changes to the data structure over time. Managing data
migration in NoSQL databases can be just as complex as in relational systems.

10. Scope of Schemaless Flexibility: The flexibility of schemaless databases mainly
applies within the boundaries of an aggregate. Changing the boundaries of aggregates
is a complex process, similar to schema migrations in relational databases.

//pseudo code: generic access that does not depend on any field names
foreach (Record r in records) {
  foreach (Field f in r.fields) {
    print(f.name, f.value)
  }
}
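
The contrast between generic access and an implicit schema (items 5 to 7) can be sketched in Python. The records, the `loyaltyPoints` field and the `REQUIRED_FIELDS` set are assumptions made up for this example.

```python
records = [
    {"id": 1, "name": "Martin", "city": "Chicago"},
    {"id": 2, "name": "Pramod", "loyaltyPoints": 120},  # extra field, no "city"
]

def dump_generic(records):
    """Generic access: works for any fields and assumes no schema at all."""
    return [(name, value) for r in records for name, value in r.items()]

def city_of(record):
    """Implicit schema: this code assumes a 'city' field may be present."""
    return record.get("city", "unknown")

# The implicit schema, written down so the application can check it itself.
REQUIRED_FIELDS = {"id", "name"}

def missing_fields(record):
    """The database will not validate records, so the application has to."""
    return sorted(REQUIRED_FIELDS - set(record))
```

Only `dump_generic` is truly schemaless; `city_of` and `missing_fields` both encode assumptions that live in the code rather than in the database.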

Materialized Views
Aggregate-Oriented Disadvantage: While aggregate-oriented models work well for
accessing all data within a single aggregate, they can be inefficient for queries that require
data from across multiple aggregates, such as calculating total sales of a product over time.

Relational Database Advantage: Relational databases, lacking aggregate structures, allow
for more flexible data access and can support views, which provide an encapsulated way to
look at data differently than how it's stored.

Views and Materialized Views: Views in relational databases are computed over base tables
and can simplify complex queries. Materialized views are precomputed and cached versions
of views, useful for read-heavy scenarios where some data staleness is acceptable.

NoSQL and Materialized Views: NoSQL databases do not have views but can implement
precomputed and cached queries, often referred to as materialized views. These are crucial
for queries that don’t align well with the aggregate structure, often built using map-reduce
techniques.

Eager vs. Batch Updates: Materialized views can be updated eagerly (as soon as base data
is updated) or in batch jobs at regular intervals. Eager updates are preferable when fresh data
is needed frequently, while batch updates work when some staleness is acceptable.

Building Materialized Views: Materialized views can be computed within the database or
outside by running jobs to compute and save them back. Databases often support these views
by allowing you to define computations that are triggered based on certain parameters.

Materialized Views in Aggregates: In aggregate-oriented databases, materialized views can
be used within the same aggregate, such as an order summary in an order document.
Column-family databases often use separate column families for materialized views,
enabling updates within a single atomic operation.
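
The eager and batch strategies can be sketched for the "total sales of a product" query mentioned above. The list and dictionary stand in for a data store, the function names are hypothetical, and the second order is illustrative.

```python
from collections import defaultdict

orders = []                             # base data: one aggregate per order
sales_by_product = defaultdict(float)   # the materialized view

def add_order_eager(order):
    """Eager: update the view in the same step as the base data."""
    orders.append(order)
    for item in order["items"]:
        sales_by_product[item["productId"]] += item["price"]

def rebuild_view_batch(all_orders):
    """Batch: recompute the whole view from the base data in one pass."""
    totals = defaultdict(float)
    for order in all_orders:
        for item in order["items"]:      # "map": one (productId, price) per item
            totals[item["productId"]] += item["price"]  # "reduce": sum per product
    return dict(totals)

add_order_eager({"id": 99, "items": [{"productId": 27, "price": 32.45}]})
add_order_eager({
    "id": 100,
    "items": [{"productId": 27, "price": 32.45}, {"productId": 5, "price": 10.0}],
})
batch_view = rebuild_view_batch(orders)
```

The eager view is always current but adds work to every write; the batch view is cheaper per write but can be stale until the next run.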

Modeling for Data Access
