Recent Trends - NoSQL Database Management
DATABASE MANAGEMENT
• GROUP 19
• 18BCI0162 - SHUBHAM SAREEN
• 18BCE0754 - RAJ ADROJA
• 18BCI0111 - VIBHUTI SRIVASTAVA
INTRODUCTION
• Computing systems (web and business applications) generate enormous amounts of data
every day. A large share of this data is handled by relational database management
systems (RDBMS). The relational model originated in E. F. Codd's 1970 paper
"A relational model of data for large shared data banks", which made data modeling and
application programming much easier. Beyond its intended benefits, the relational model
is well suited to client-server programming, and today it is the predominant technology
for storing structured data in web and business applications.
WHAT IS NOSQL?
• Today, data is becoming easier to access and capture through third parties such as Facebook, Google+
and others. Personal user information, social graphs, geolocation data, user-generated content and machine
logging data are just a few examples of data that has been growing exponentially. Providing these services
properly requires processing huge amounts of data, which SQL databases were never designed for. NoSQL
databases evolved to handle data at this scale.
• Social-network graph:
• Each record: UserID1, UserID2
• Separate records: UserID, first_name, last_name, age, gender, ...
• Task: Find all friends of friends of friends of ... friends of a given user.
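The friends-of-friends task above can be sketched as a breadth-first traversal over the (UserID1, UserID2) records. This is a minimal illustration, not a production design: the function name `friends_within`, the `max_hops` parameter, and the sample data are assumptions made for the example. In a relational store, each additional hop would typically cost another self-join on the friendship table, which is why this kind of query is often used to motivate graph-oriented stores.

```python
from collections import deque


def friends_within(edges, start, max_hops):
    """Return {user_id: hop_count} for users reachable from `start`
    in at most `max_hops` friendship hops (hypothetical helper)."""
    # Build an undirected adjacency map from (UserID1, UserID2) records.
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    distance = {start: 0}
    queue = deque([start])
    while queue:
        user = queue.popleft()
        if distance[user] == max_hops:
            continue  # do not expand beyond the hop limit
        for friend in adjacency.get(user, ()):
            if friend not in distance:
                distance[friend] = distance[user] + 1
                queue.append(friend)

    del distance[start]  # the user is not their own friend
    return distance


# Illustrative friendship records: each tuple is (UserID1, UserID2).
friendships = [(1, 2), (2, 3), (3, 4), (4, 5), (1, 6)]
reachable = friends_within(friendships, start=1, max_hops=3)
```

With the sample records, user 5 is four hops from user 1 and is therefore excluded when `max_hops=3`.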
BRIEF HISTORY OF NOSQL
• The term NoSQL was coined by Carlo Strozzi in 1998. He used it to name his open-source,
lightweight database, which did not have an SQL interface.
• In early 2009, when last.fm wanted to organize an event on open-source distributed
databases, Eric Evans, a Rackspace employee, reused the term to refer to databases that are
non-relational, distributed, and do not conform to atomicity, consistency, isolation, and
durability (ACID), the four characteristic features of traditional relational database systems.
• In the same year, NoSQL was discussed and debated at length at the "no:sql(east)"
conference held in Atlanta, USA.
• After that, discussion and practice of NoSQL gained momentum, and NoSQL saw
unprecedented growth.
RDBMS VS NOSQL
• RDBMS
• Structured and organized data
• Structured query language (SQL)
• Data and its relationships are stored in separate tables.
• Data Manipulation Language, Data Definition Language
• Tight Consistency
NO SQL
• Explosion of social media sites (Facebook, Twitter) with large data needs
• Rise of cloud-based solutions such as Amazon S3 (Simple Storage Service)
• Alongside the move to dynamically typed languages (Ruby/Groovy), a shift to dynamically typed
data with frequent schema changes
• Open-source community
DYNAMO AND BIGTABLE
• Large datasets, acceptance of alternatives, and dynamically typed data have come together
in a perfect storm
• Not a backlash/rebellion against RDBMS
• SQL is a rich query language that cannot be rivaled by the current list of NoSQL
offerings
NO SQL PROS/CONS
• Advantages :
• High scalability
• Distributed Computing
• Lower cost
• Schema flexibility, semi-structured data
• No complicated Relationships
• Disadvantages
• No standardization
• Limited query capabilities (so far)
• Eventual consistency is not intuitive to program for
TYPES OF NOSQL DATABASES
There are four general types (the most common categories) of NoSQL databases. Each of these
categories has its own specific attributes and limitations. No single solution is better than
all the others; however, some databases are better suited to specific problems. To clarify the
NoSQL landscape, let's look at the most common categories:
• Key-value stores
• Column-oriented
• Graph
• Document oriented
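As a rough illustration of how the four categories differ, the sketch below expresses the same hypothetical user in each model using plain Python structures. The keys, field names, and values are assumptions invented for the example; real products in each category add their own storage layout and query interfaces on top of shapes like these.

```python
import json

# Key-value store: an opaque value looked up by a single key.
# The store itself does not interpret the value.
key_value = {"user:42": json.dumps({"first_name": "Ada", "age": 36})}

# Document store: a structured, possibly nested document whose
# individual fields can be queried.
document = {
    "_id": 42,
    "first_name": "Ada",
    "age": 36,
    "interests": ["databases", "graphs"],
}

# Column-oriented store: row key -> column family -> column -> value.
# Rows may carry different columns within the same family.
column_family = {
    42: {
        "profile": {"first_name": "Ada", "age": 36},
        "activity": {"last_login": "2020-01-01"},
    }
}

# Graph store: nodes with properties plus explicitly stored relationships.
nodes = {42: {"first_name": "Ada"}, 7: {"first_name": "Bob"}}
edges = [(42, "FRIEND_OF", 7)]

# Reading the first name back from each model.
name_from_kv = json.loads(key_value["user:42"])["first_name"]
name_from_document = document["first_name"]
name_from_columns = column_family[42]["profile"]["first_name"]
friends_of_42 = [dst for src, rel, dst in edges if src == 42 and rel == "FRIEND_OF"]
```

Note how the key-value model needs the application to decode the value, while the graph model makes the relationship itself a first-class record.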
CAP THEOREM
The CAP theorem is a tool that makes system designers aware of the trade-offs involved in
designing networked shared-data systems, and it has influenced the design of many
distributed data systems. Over the years, the CAP theorem has also been widely
misunderstood as a tool for categorizing databases. There is much misinformation
about CAP in circulation, and many blog posts on the topic are dated and possibly incorrect.
• The CAP theorem applies to distributed systems that store state. Eric Brewer, at the
2000 Symposium on Principles of Distributed Computing (PODC), conjectured that in
any networked shared-data system there is a fundamental trade-off between consistency,
availability, and partition tolerance.
• In 2002, Seth Gilbert and Nancy Lynch of MIT published a formal proof of Brewer's
conjecture. The theorem states that networked shared-data systems can only
guarantee/strongly support two of the following three properties:
• Consistency: A guarantee that every node in a distributed cluster returns the same, most recent,
successful write. Consistency refers to every client having the same view of the data. There are various
types of consistency models; consistency in CAP (as used to prove the theorem) refers to linearizability
(atomic consistency), a very strong form of consistency.
• Availability: Every non-failing node returns a response for all read and write requests in a reasonable
amount of time. The key word here is every. To be available, every node (on either side of a network
partition) must be able to respond in a reasonable amount of time.
• Partition tolerance: The system continues to function and upholds its consistency guarantees in spite
of network partitions. Network partitions are a fact of life. Distributed systems guaranteeing partition
tolerance can gracefully recover once the partition heals.
• The CAP theorem categorizes systems into three categories:
• CP (Consistent and Partition Tolerant): At first glance the CP category is confusing, since it
seems to describe a system that is consistent and partition tolerant but never available. In fact,
CP refers to a category of systems where availability is sacrificed only in the case of a network
partition.
• CA (Consistent and Available): CA systems are consistent and available in the absence of any
network partition. Single-node DB servers are often categorized as CA systems, because they do
not need to deal with partition tolerance. The only hole in this theory is that a single-node DB
system is not a network of shared-data systems and thus does not fall under the purview of
CAP. [^11]
• AP (Available and Partition Tolerant): AP systems keep responding during a network partition
but cannot guarantee consistency while the partition lasts.
• In the common Venn-diagram depiction of CAP, the region where all three properties intersect
is left white because it is impossible to have all three properties in networked shared-data
systems. Any such visualization of the CAP theorem, whether a triangle or a Venn diagram, is
misleading. The correct way to think about CAP is that in case of a network partition (a rare
occurrence) one needs to choose between availability and consistency.
• In any networked shared-data system, partition tolerance is a must. Network partitions
and dropped messages are a fact of life and must be handled appropriately. Consequently,
system designers must choose between consistency and availability.
• Simplistically speaking, a network partition forces designers to choose either perfect
consistency or perfect availability. Picking consistency means not being able to answer a
client's query, as the system cannot guarantee to return the most recent write. This
sacrifices availability.
• When consistency is chosen, a network partition forces nonfailing nodes to reject clients'
requests, as these nodes cannot guarantee consistent data. At the opposite end of the
spectrum, being available means being able to respond to a client's request without being
able to guarantee consistency, i.e., the most recent value written. Available systems
provide the best possible answer under the given circumstances.
• During normal operation (lack of network partition) the CAP theorem does not
impose constraints on availability or consistency.
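The choice between consistency and availability during a partition can be sketched with a toy two-replica cluster. Everything here is an assumption made for illustration: the `Cluster` and `Replica` classes, the `"CP"`/`"AP"` mode flag, and the naive "highest version wins" reconciliation on heal are hypothetical and do not model any particular database.

```python
class Replica:
    def __init__(self):
        self.value = None
        self.version = 0


class Cluster:
    """Toy model: two replicas joined by a link that can be partitioned."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = [Replica(), Replica()]
        self.partitioned = False

    def _check_available(self):
        # A CP system refuses requests it cannot serve consistently.
        if self.partitioned and self.mode == "CP":
            raise RuntimeError("unavailable: peer replica unreachable")

    def write(self, node, value):
        self._check_available()
        target = self.replicas[node]
        target.version += 1
        target.value = value
        if not self.partitioned:
            # Normal operation: replicate to the peer immediately.
            peer = self.replicas[1 - node]
            peer.value, peer.version = target.value, target.version

    def read(self, node):
        self._check_available()
        # An AP system answers with whatever this replica has, even if stale.
        return self.replicas[node].value

    def heal(self):
        # Naive reconciliation (an assumption): highest version wins.
        # Concurrent writes on both sides are not resolved properly here.
        self.partitioned = False
        newest = max(self.replicas, key=lambda r: r.version)
        for replica in self.replicas:
            replica.value, replica.version = newest.value, newest.version


ap = Cluster("AP")
ap.write(0, "v1")
ap.partitioned = True
ap.write(0, "v2")              # accepted, but cannot reach replica 1
stale_read = ap.read(1)        # replica 1 still holds the older value
ap.heal()
read_after_heal = ap.read(1)   # replicas agree again

cp = Cluster("CP")
cp.write(0, "v1")
cp.partitioned = True
try:
    cp.read(1)
    cp_rejected = False
except RuntimeError:
    cp_rejected = True         # consistency kept by giving up availability
```

Without a partition both modes behave identically, which mirrors the point that CAP imposes no constraint during normal operation.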
• The CAP theorem is responsible for instigating the discussion about the various tradeoffs
in a distributed shared data system. It has played a pivotal role in increasing our
understanding of shared data systems. Nonetheless, the CAP theorem is criticized for
being too simplistic and often misleading. Over a decade after introducing the CAP
theorem, Brewer acknowledged that it oversimplified the choices
available in the event of a network partition.
• According to Brewer, the CAP theorem prohibits only a “tiny part of the design space:
perfect availability and consistency in the presence of partitions, which are rare.” System
designers have a broad range of options for dealing with and recovering from network
partitions. The goal of every system must be to “maximize combinations of consistency
and availability that make sense for the specific application.”