Unit II - NoSQL Databases
What is NoSQL?
NoSQL Database is a non-relational data management system that does
not require a fixed schema. It avoids joins, and is easy to scale. The major
purpose of using a NoSQL database is for distributed data stores with
humongous data storage needs. NoSQL is used for Big data and real-time
web apps. For example, companies like Twitter, Facebook and Google
collect terabytes of user data every single day.
NoSQL database stands for “Not Only SQL” or “Not SQL.” Though a better
term would be “NoREL”, NoSQL caught on. Carlo Strozzi introduced the
NoSQL concept in 1998.
Traditional RDBMS uses SQL syntax to store and retrieve data for further
insights. Instead, a NoSQL database system encompasses a wide range of
database technologies that can store structured, semi-structured,
unstructured, and polymorphic data.
Why NoSQL?
The concept of NoSQL databases became popular with Internet giants like
Google, Facebook, Amazon, etc. who deal with huge volumes of data. The
system response time becomes slow when you use RDBMS for massive
volumes of data.
To resolve this problem, we could “scale up” our systems by upgrading our
existing hardware. This process is expensive.
The alternative for this issue is to distribute database load on multiple hosts
whenever the load increases. This method is known as “scaling out.”
Features of NoSQL
Non-relational
NoSQL databases never follow the relational model
Never provide tables with flat fixed-column records
Work with self-contained aggregates or BLOBs
Doesn’t require object-relational mapping and data normalization
No complex features like query languages, query planners, referential
integrity, joins, or ACID
Schema-free
NoSQL is Schema-Free
Simple API
Offers easy-to-use interfaces for storing and querying data
APIs allow low-level data manipulation and selection methods
Text-based protocols, mostly HTTP REST with JSON
Mostly no standards-based NoSQL query language
Web-enabled databases running as internet-facing services
Distributed
Multiple NoSQL databases can be executed in a distributed fashion, offering auto-scaling and fail-over capabilities.
Types of NoSQL Databases
There are mainly four types of NoSQL databases: key-value pair, column-oriented, document-based, and graph-based.
Key-Value Pair Based
Data is stored in key/value pairs. This design is intended to handle large data volumes and heavy load.
For example, a key-value pair may contain a key like “Website” associated
with a value like “Guru99”.
It is one of the most basic types of NoSQL database. This kind of NoSQL
database is used as a collection, dictionary, associative array, etc. Key-value
stores help the developer store schema-less data. They work best
for shopping cart contents; Redis and Riak are well-known examples.
Column-based
Column-oriented databases work on columns and are based on BigTable
paper by Google. Every column is treated separately. Values of single
column databases are stored contiguously.
HBase, Cassandra, and Hypertable are examples of column-based NoSQL databases.
Document-Oriented:
Document-Oriented NoSQL DB stores and retrieves data as a key value
pair but the value part is stored as a document. The document is stored in
JSON or XML formats. The value is understood by the DB and can be
queried.
The document type is mostly used for CMS systems, blogging platforms,
real-time analytics, and e-commerce applications. It should not be used for
complex transactions that require multiple operations, or for queries against
varying aggregate structures.
Graph-Based
A graph-type database stores entities as well as the relations amongst those
entities. The entity is stored as a node with the relationship as edges. An
edge gives a relationship between nodes. Every node and edge has a
unique identifier.
CAP Theorem
According to the CAP theorem, a distributed data store can provide only two of the following three guarantees at the same time:
1. Consistency
2. Availability
3. Partition Tolerance
Consistency:
Every read receives the most recent write; the data should remain consistent even after the execution of an operation.
Availability:
Every request receives a response; the database should always be available and responsive, without downtime.
Partition Tolerance:
Partition Tolerance means that the system should continue to function even
if the communication among the servers is not stable. For example, the
servers can be partitioned into multiple groups which may not communicate
with each other. Here, even if part of the database is unavailable, the other
parts remain unaffected.
Eventual Consistency
The term “eventual consistency” means keeping copies of data on multiple
machines to get high availability and scalability. Changes made to
any data item on one machine therefore have to be propagated to the other replicas.
Advantages of NoSQL
Can be used as Primary or Analytic Data Source
Big Data Capability
No Single Point of Failure
Easy Replication
No Need for Separate Caching Layer
It provides fast performance and horizontal scalability.
Can handle structured, semi-structured, and unstructured data with
equal effect
Object-oriented programming model that is easy to use and flexible
NoSQL databases don’t need a dedicated high-performance server
Support Key Developer Languages and Platforms
Simpler to implement than an RDBMS
It can serve as the primary data source for online applications.
Handles big data, managing data velocity, variety, volume, and
complexity
Excels at distributed database and multi-data center operations
Eliminates the need for a specific caching layer to store data
Offers a flexible schema design which can easily be altered without
downtime or service disruption
Disadvantages of NoSQL
No standardization rules
Limited query capabilities
RDBMS databases and tools are comparatively mature
It does not offer traditional database capabilities, like consistency
when multiple transactions are performed simultaneously
As the volume of data increases, it becomes difficult to maintain unique keys
Doesn’t work as well with relational data
The learning curve is steep for new developers
Open-source options are still not very popular with enterprises
SHARDING
Sharding is a partitioning pattern for the NoSQL age.
It places each partition on a potentially separate server, potentially all over the world.
This scale-out works well for supporting people all over the world
accessing different parts of the data set with good performance.
DOCUMENT BASED
What is MongoDB
MongoDB is an open-source document database that provides high
performance, high availability, and automatic scaling.
MongoDB is available for free under the GNU Affero General Public License, and it is
also available under a commercial license from the manufacturer.
History of MongoDB
The initial development of MongoDB began in 2007 when the company
was building a platform as a service similar to Windows Azure.
(Windows Azure is a cloud computing platform and infrastructure, created by Microsoft, to build,
deploy and manage applications and services through a global network.)
The first production-ready version of MongoDB is considered to be version
1.4, which was released in March 2010.
MongoDB 2.4.9, released on January 10, 2014, was the latest stable version
at the time these notes were compiled.
Modern applications require big data handling, fast feature development, and
flexible deployment; the older database systems were not competent
enough, so MongoDB was needed. MongoDB offers:
o Scalability
o Performance
o High Availability
o Scaling from single server deployments to large, complex multi-site
architectures.
Key points of MongoDB
o Develop Faster
o Deploy Easier
o Scale Bigger
For example, two documents in the same collection need not share the same structure:
FirstName = "John", Address = "Detroit"
FirstName = "John", Address = "Wick"
Features of MongoDB
These are some important features of MongoDB:
1. Ad-hoc queries
In MongoDB, you can search by field or range query, and it also supports
regular-expression searches.
2. Indexing
Any field in a MongoDB document can be indexed.
3. Replication
MongoDB supports master-slave replication. A master can perform reads and writes,
while a slave copies data from the master and can only be used for reads or backup (not writes).
4. Duplication of data
MongoDB can run over multiple servers. The data is duplicated to keep
the system up and also keep it in running condition in case of hardware failure.
5. Load balancing
It has an automatic load-balancing configuration because of data placed in shards.
6. Supports map-reduce and aggregation tools.
7. Uses JavaScript instead of stored procedures.
8. It is a schema-less database written in C++.
9. Provides high performance.
10. Stores files of any size easily without complicating your stack.
SHARDING IN MONGODB
Sharding is a methodology to distribute data across multiple machines.
Sharding is basically used for deployment with a large dataset and high
throughput operations. A single server cannot handle a large dataset,
as it requires larger storage, and bulk query operations can
use most of the CPU cycles, which slows down processing. For such
scenarios, we need more powerful systems.
Sharded clusters
MongoDB’s sharding consists of the following components:
Shard: Each shard stores a subset of sharded data. Also, each shard
can be deployed as a replica set.
Mongos: Mongos provide an interface between a client application
and sharded cluster to route the query.
Config server: The configuration server stores the metadata and
configuration settings for the cluster. The MongoDB data is sharded
at the collection level and distributed across sharded clusters.
Shard keys: To distribute documents in collections, MongoDB
partitions the collection using the shard key. MongoDB shards data
into chunks. These chunks are distributed across shards in sharded
clusters.
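As an illustrative sketch (the database, collection, and shard key names here are assumptions, not part of these notes), enabling sharding from a mongos shell looks like this:
mongos> sh.enableSharding("school")
mongos> sh.shardCollection("school.student", { "student_id": 1 })
MongoDB then splits school.student into chunks by ranges of student_id and balances those chunks across the shards.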
Advantages of sharding
Here are some of the advantages of sharding:
Reads and writes are distributed across shards, increasing throughput.
Overall storage capacity grows as shards are added.
Parts of the cluster remain available even if an individual shard goes down.
MONGODB CRUD OPERATIONS
Create Operations –
The create or insert operations are used to insert or add new
documents in the collection. If a collection does not exist, then it will
create a new collection in the database. You can perform create
operations using the following methods provided by MongoDB:
Method – Description
db.collection.insertOne() – It is used to insert a single document in the collection.
db.collection.insertMany() – It is used to insert multiple documents in the collection.
Example 1:
In this example, we are inserting details of a single student in the
form of document in the student collection
using db.collection.insertOne() method.
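A minimal sketch of such a call (the student fields are assumptions):
> db.student.insertOne({ name: "Akshay", age: 18, branch: "CSE" })
On success, the shell returns an acknowledgement along with the generated _id of the inserted document.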
Example 2:
In this example, we are inserting details of the multiple students in
the form of documents in the student collection
using db.collection.insertMany() method.
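A matching sketch for the bulk insert (field values again assumed):
> db.student.insertMany([
{ name: "Vishal", age: 19, branch: "CSE" },
{ name: "Sumit", age: 20, branch: "ECE" }
])
insertMany() takes an array of documents and returns the _ids of all the inserted documents.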
Read Operations –
The Read operations are used to retrieve documents from the
collection, or in other words, read operations are used to query a
collection for a document. You can perform read operations using the
following method provided by MongoDB:
Method – Description
db.collection.find() – It is used to retrieve documents from the collection.
Example :
In this example, we are retrieving the details of students from the
student collection using db.collection.find() method.
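A sketch of both forms of the query (the filter is an assumption):
> db.student.find()
> db.student.find({ branch: "CSE" }).pretty()
Calling find() with no argument returns every document; passing a filter document restricts the result, and pretty() formats the output.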
Update Operations –
The update operations are used to update or modify the existing
document in the collection. You can perform update operations
using the following methods provided by MongoDB:
Method – Description
db.collection.updateOne() – It is used to update a single document in the collection.
db.collection.updateMany() – It is used to update multiple documents in the collection.
Example 1:
In this example, we are updating the age of Sumit in the student
collection using db.collection.updateOne() method.
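A sketch of this update (the new age value is an assumption):
> db.student.updateOne({ name: "Sumit" }, { $set: { age: 21 } })
The first argument is the filter that selects the document; the second applies the change through the $set operator.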
Example 2:
In this example, we are updating the age of multiple students in the student
collection using db.collection.updateMany() method, sketched below.
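A sketch, assuming we want to set a common age for all CSE students:
> db.student.updateMany({ branch: "CSE" }, { $set: { age: 20 } })
updateMany() applies the same modification to every document that matches the filter.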
Delete Operations –
The delete operations are used to delete or remove documents
from a collection. You can perform delete operations using the
following methods provided by MongoDB:
Method – Description
db.collection.deleteOne() – It is used to delete a single document from the collection.
db.collection.deleteMany() – It is used to delete multiple documents from the collection.
Example 1:
In this example, we are deleting a document from the student
collection using db.collection.deleteOne() method.
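A sketch of the call (the filter is an assumption):
> db.student.deleteOne({ name: "Akshay" })
Only the first document matching the filter is removed.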
Example 2:
In this example, we are deleting all the documents from the student
collection using db.collection.deleteMany() method.
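Passing an empty filter matches every document:
> db.student.deleteMany({})
This empties the student collection while leaving the collection itself in place.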
MongoDB Queries
The find() Method
To query data from MongoDB collection, you need to use
MongoDB's find() method.
Syntax
The basic syntax of find() method is as follows −
>db.COLLECTION_NAME.find()
The find() method will display all the documents in a non-structured way. To display the results in a formatted way, you can use the pretty() method.
Example
Assume we have created a collection named mycol as −
> use sampleDB
switched to db sampleDB
> db.createCollection("mycol")
{ "ok" : 1 }
>
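To have something for find() to return, we can first insert a document (the field values are illustrative) and then query:
> db.mycol.insert({ title: "MongoDB Overview", by: "tutorials point" })
> db.mycol.find()
> db.mycol.find().pretty()
find() with no arguments returns all documents in mycol; appending pretty() formats the result.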
The findOne() Method
Apart from find(), there is a findOne() method that returns only one document.
Syntax
>db.COLLECTIONNAME.findOne()
Example
Following example retrieves the document with title MongoDB Overview.
> db.mycol.findOne({title: "MongoDB Overview"})
{
"_id" : ObjectId("5dd6542170fb13eec3963bf0"),
"title" : "MongoDB Overview",
"description" : "MongoDB is no SQL database",
"by" : "tutorials point",
"url" : "http://www.tutorialspoint.com",
"tags" : [
"mongodb",
"database",
"NoSQL"
],
"likes" : 100
}
AND in MongoDB
Syntax
To query documents based on the AND condition, you need to use $and keyword.
Following is the basic syntax of AND −
>db.mycol.find({ $and: [ {<key1>:<value1>},
{ <key2>:<value2>} ] })
> db.mycol.find({$and:[{"by":"tutorials point"},{"title":
"MongoDB Overview"}]}).pretty()
{
"_id" : ObjectId("5dd4e2cc0821d3b44607534c"),
"title" : "MongoDB Overview",
"description" : "MongoDB is no SQL database",
"by" : "tutorials point",
"url" : "http://www.tutorialspoint.com",
"tags" : [
"mongodb",
"database",
"NoSQL"
],
"likes" : 100
}
>
OR in MongoDB
Syntax
To query documents based on the OR condition, you need to use $or keyword.
Following is the basic syntax of OR −
>db.mycol.find(
{
$or: [
{key1: value1}, {key2:value2}
]
}
).pretty()
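Mirroring the earlier AND example on the same collection, an OR query looks like this:
> db.mycol.find({$or:[{"by":"tutorials point"},{"title":
"MongoDB Overview"}]}).pretty()
This returns every document that was either written by tutorials point or is titled MongoDB Overview.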
NOR in MongoDB
Syntax
To query documents based on the NOR condition, you need to use the $nor keyword.
Following is the basic syntax of NOR −
>db.COLLECTION_NAME.find(
{
$nor: [
{key1: value1}, {key2:value2}
]
}
)
Example
Assume we have inserted 3 documents in the collection empDetails as shown
below −
db.empDetails.insertMany(
[
{
First_Name: "Radhika",
Last_Name: "Sharma",
Age: "26",
e_mail: "radhika_sharma.123@gmail.com",
phone: "9000012345"
},
{
First_Name: "Rachel",
Last_Name: "Christopher",
Age: "27",
e_mail: "Rachel_Christopher.123@gmail.com",
phone: "9000054321"
},
{
First_Name: "Fathima",
Last_Name: "Sheik",
Age: "24",
e_mail: "Fathima_Sheik.123@gmail.com",
phone: "9000054321"
}
]
)
Following example will retrieve the document(s) whose first name is not "Radhika"
and last name is not "Christopher"
> db.empDetails.find(
{
$nor:[
{"First_Name": "Radhika"},
{"Last_Name": "Christopher"}
]
}
).pretty()
{
"_id" : ObjectId("5dd631f270fb13eec3963bef"),
"First_Name" : "Fathima",
"Last_Name" : "Sheik",
"Age" : "24",
"e_mail" : "Fathima_Sheik.123@gmail.com",
"phone" : "9000054321"
}
NOT in MongoDB
Syntax
To query documents based on the NOT condition, you need to use the $not keyword.
$not is applied to the operator expression of a field. Following is the basic syntax of NOT −
>db.COLLECTION_NAME.find(
{
key: { $not: { <operator-expression> } }
}
).pretty()
Example
Following example will retrieve the document(s) whose age is not greater than 25. (Age is stored as a string in this collection, so the comparison is lexicographic.)
> db.empDetails.find( { "Age": { $not: { $gt: "25" } } } )
{
"_id" : ObjectId("5dd6636870fb13eec3963bf7"),
"First_Name" : "Fathima",
"Last_Name" : "Sheik",
"Age" : "24",
"e_mail" : "Fathima_Sheik.123@gmail.com",
"phone" : "9000054321"
}
REPLICATION
Replication is the process of synchronizing data across multiple servers. Replication
provides redundancy and increases data availability with multiple copies of data on
different database servers. Replication protects a database from the loss of a single
server. Replication also allows you to recover from hardware failure and service
interruptions. With additional copies of the data, you can dedicate one to disaster
recovery, reporting, or backup.
Why Replication?
To keep your data safe
High (24*7) availability of data
Disaster recovery
No downtime for maintenance (like backups, index rebuilds, compaction)
Read scaling (extra copies to read from)
Replica set is transparent to the application
Example
mongod --port 27017 --dbpath "D:\set up\mongodb\data" --replSet
rs0
It will start a mongod instance on port 27017 with the replica set name rs0.
Now start the command prompt and connect to this mongod instance.
In Mongo client, issue the command rs.initiate() to initiate a new replica set.
To check the replica set configuration, issue the command rs.conf(). To
check the status of replica set issue the command rs.status().
Syntax
The basic syntax of rs.add() command is as follows −
>rs.add(HOST_NAME:PORT)
Example
Suppose your mongod instance name is mongod1.net and it is running on
port 27017. To add this instance to replica set, issue the command rs.add() in
Mongo client.
>rs.add("mongod1.net:27017")
>
You can add mongod instance to replica set only when you are connected to
primary node. To check whether you are connected to primary or not, issue the
command db.isMaster() in mongo client.
SHARDING
Sharding is the process of storing data records across multiple machines and it is
MongoDB's approach to meeting the demands of data growth. As the size of the
data increases, a single machine may not be sufficient to store the data nor provide
an acceptable read and write throughput. Sharding solves the problem with
horizontal scaling. With sharding, you add more machines to support data growth
and the demands of read and write operations.
Why Sharding?
In replication, all writes go to master node
Latency sensitive queries still go to master
Single replica set has limitation of 12 nodes
Memory can't be large enough when active dataset is big
Local disk is not big enough
Vertical scaling is too expensive
Sharding in MongoDB
Sharding in MongoDB is implemented with a sharded cluster, which has three main components −
Shards − Shards are used to store data. They provide high availability and
data consistency. In production environment, each shard is a separate
replica set.
Config Servers − Config servers store the cluster's metadata. This data
contains a mapping of the cluster's data set to the shards. The query router
uses this metadata to target operations to specific shards. In production
environment, sharded clusters have exactly 3 config servers.
Query Routers − Query routers are basically mongos instances, which interface with
client applications and direct operations to the appropriate shard. The query
router processes and targets the operations to shards and then returns
results to the clients. A sharded cluster can contain more than one query
router to divide the client request load; a client sends requests to one query
router. Generally, a sharded cluster has many query routers.
MongoDB - Deployment
When you are preparing a MongoDB deployment, you should try to understand how
your application is going to hold up in production. It’s a good idea to develop a
consistent, repeatable approach to managing your deployment environment so that
you can minimize any surprises once you’re in production.
The best approach incorporates prototyping your set up, conducting load testing,
monitoring key metrics, and using that information to scale your set up. The key part
of the approach is to proactively monitor your entire system - this will help you
understand how your production system will hold up before deploying, and
determine where you will need to add capacity. Having insight into potential spikes
in your memory usage, for example, could help put out a write-lock fire before it
starts.
To monitor your deployment, MongoDB provides some of the following commands
−
mongostat
This command checks the status of all running mongod instances and returns
counters of database operations. These counters include inserts, queries, updates,
deletes, and cursors. The command also shows when you’re hitting page faults, and
showcases your lock percentage. High values can mean that you're running low on memory,
hitting write capacity, or have some performance issue.
To run the command, start your mongod instance. In another command prompt, go
to bin directory of your mongodb installation and type mongostat.
D:\set up\mongodb\bin>mongostat
Following is the output of the command −
mongotop
This command tracks and reports the read and write activity of MongoDB instance
on a collection basis. By default, mongotop returns information every second;
you can change this interval. You should check that this read and write
activity matches your application intention, and you’re not firing too many writes to
the database at a time, reading too frequently from a disk, or are exceeding your
working set size.
To run the command, start your mongod instance. In another command prompt, go
to bin directory of your mongodb installation and type mongotop.
D:\set up\mongodb\bin>mongotop
Following is the output of the command −
To change mongotop command to return information less frequently, specify a
specific number after the mongotop command.
D:\set up\mongodb\bin>mongotop 30
The above example will return values every 30 seconds.
Apart from the MongoDB tools, 10gen provides a free, hosted monitoring service,
MongoDB Management Service (MMS), that provides a dashboard and gives you a
view of the metrics from your entire cluster.
MONGODB-JAVA
Installation
Before you start using MongoDB in your Java programs, you need to make sure
that you have MongoDB CLIENT and Java set up on the machine. You can check
Java tutorial for Java installation on your machine. Now, let us check how to set up
MongoDB CLIENT.
You need to download the jar mongodb-driver-3.11.2.jar and its
dependency mongodb-driver-core-3.11.2.jar. Make sure to download the
latest release of these jar files.
You need to include the downloaded jar files into your classpath.
Connect to Database
To connect to the database, you need to specify the database name. If the database
doesn't exist, MongoDB creates it automatically.
Following is the code snippet to connect to the database −
import com.mongodb.client.MongoDatabase;
import com.mongodb.MongoClient;
import com.mongodb.MongoCredential;
public class ConnectToDB {
   public static void main(String args[]) {
      // Creating a Mongo client (assumes a mongod running on localhost)
      MongoClient mongo = new MongoClient("localhost", 27017);
      // Creating Credentials
      MongoCredential credential = MongoCredential.createCredential("sampleUser", "myDb", "password".toCharArray());
      System.out.println("Connected to the database successfully");
      // Accessing the database
      MongoDatabase database = mongo.getDatabase("myDb");
      System.out.println("Credentials ::" + credential);
   }
}
Now, let's compile and run the above program to create our database myDb as
shown below.
$javac ConnectToDB.java
$java ConnectToDB
On executing, the above program gives you the following output.
Connected to the database successfully
Credentials ::MongoCredential{
mechanism = null,
userName = 'sampleUser',
source = 'myDb',
password = <hidden>,
mechanismProperties = {}
}
Create a Collection
To create a collection, createCollection() method
of com.mongodb.client.MongoDatabase class is used.
Following is the code snippet to create a collection −
import com.mongodb.client.MongoDatabase;
import com.mongodb.MongoClient;
import com.mongodb.MongoCredential;
public class CreatingCollection {
   public static void main(String args[]) {
      // Creating a Mongo client
      MongoClient mongo = new MongoClient("localhost", 27017);
      // Creating Credentials
      MongoCredential credential = MongoCredential.createCredential("sampleUser", "myDb", "password".toCharArray());
      System.out.println("Connected to the database successfully");
      // Accessing the database
      MongoDatabase database = mongo.getDatabase("myDb");
      // Creating a collection
      database.createCollection("sampleCollection");
      System.out.println("Collection created successfully");
   }
}
Getting/Selecting a Collection
To get/select a collection from the database, getCollection() method
of com.mongodb.client.MongoDatabase class is used.
Following is the program to get/select a collection −
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;
import com.mongodb.MongoClient;
import com.mongodb.MongoCredential;
public class selectingCollection {
   public static void main(String args[]) {
      // Creating a Mongo client
      MongoClient mongo = new MongoClient("localhost", 27017);
      // Creating Credentials
      MongoCredential credential = MongoCredential.createCredential("sampleUser", "myDb", "password".toCharArray());
      System.out.println("Connected to the database successfully");
      // Accessing the database
      MongoDatabase database = mongo.getDatabase("myDb");
      // Creating a collection
      database.createCollection("myCollection");
      System.out.println("Collection created successfully");
      // Retrieving a collection
      MongoCollection<Document> collection = database.getCollection("myCollection");
      System.out.println("Collection myCollection selected successfully");
   }
}
Insert a Document
To insert a document into MongoDB, the insertOne() or insertMany() method
of the com.mongodb.client.MongoCollection class is used.
Following is the code snippet to insert a document −
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.bson.Document;
import com.mongodb.MongoClient;
import com.mongodb.client.FindIterable;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
public class InsertingDocument {
   public static void main( String args[] ) {
      // Creating a Mongo client and accessing the database
      MongoClient mongo = new MongoClient("localhost", 27017);
      MongoDatabase database = mongo.getDatabase("myDb");
      // Creating a collection
      database.createCollection("sampleCollection");
      System.out.println("Collection created successfully");
      // Retrieving the collection
      MongoCollection<Document> collection = database.getCollection("sampleCollection");
      System.out.println("Collection sampleCollection selected successfully");
      // Inserting a single document
      Document document = new Document("title", "MongoDB")
         .append("description", "database")
         .append("likes", 100)
         .append("url", "http://www.tutorialspoint.com/mongodb/")
         .append("by", "tutorials point");
      collection.insertOne(document);
      // Inserting multiple documents at once
      Document document1 = new Document("title", "MongoDB")
         .append("description", "database")
         .append("likes", 100)
         .append("url", "http://www.tutorialspoint.com/mongodb/")
         .append("by", "tutorials point");
      Document document2 = new Document("title", "RethinkDB")
         .append("description", "database")
         .append("likes", 200)
         .append("url", "http://www.tutorialspoint.com/rethinkdb/")
         .append("by", "tutorials point");
      List<Document> list = new ArrayList<Document>();
      list.add(document1);
      list.add(document2);
      collection.insertMany(list);
      // Printing every document in the collection
      FindIterable<Document> iterDoc = collection.find();
      Iterator<Document> it = iterDoc.iterator();
      while (it.hasNext()) {
         System.out.println(it.next());
      }
   }
}
Update Document
To update a document from the collection, updateOne() method
of com.mongodb.client.MongoCollection class is used.
Following is the program to update a document −
import com.mongodb.client.FindIterable;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import java.util.Iterator;
import org.bson.Document;
import com.mongodb.MongoClient;
import com.mongodb.MongoCredential;
public class UpdatingDocuments {
   public static void main(String args[]) {
      // Creating a Mongo client and accessing the database
      MongoClient mongo = new MongoClient("localhost", 27017);
      MongoDatabase database = mongo.getDatabase("myDb");
      // Retrieving a collection
      MongoCollection<Document> collection = database.getCollection("sampleCollection");
      // Updating the likes of the document whose title is "MongoDB"
      collection.updateOne(Filters.eq("title", "MongoDB"), Updates.set("likes", 150));
      System.out.println("Document updated successfully");
   }
}
Delete a Document
To delete a document from the collection, you need to use the deleteOne() method
of the com.mongodb.client.MongoCollection class.
Following is the program to delete a document −
import com.mongodb.client.FindIterable;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Filters;
import java.util.Iterator;
import org.bson.Document;
import com.mongodb.MongoClient;
import com.mongodb.MongoCredential;
public class DeletingDocuments {
   public static void main(String args[]) {
      // Creating a Mongo client and accessing the database
      MongoClient mongo = new MongoClient("localhost", 27017);
      MongoDatabase database = mongo.getDatabase("myDb");
      // Retrieving a collection
      MongoCollection<Document> collection = database.getCollection("sampleCollection");
      // Deleting the document whose title is "MongoDB"
      collection.deleteOne(Filters.eq("title", "MongoDB"));
      System.out.println("Document deleted successfully");
   }
}
Dropping a Collection
To drop a collection from a database, you need to use the drop() method of
the com.mongodb.client.MongoCollection class.
Following is the program to delete a collection −
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;
import com.mongodb.MongoClient;
import com.mongodb.MongoCredential;
public class DropingCollection {
   public static void main(String args[]) {
      // Creating a Mongo client and accessing the database
      MongoClient mongo = new MongoClient("localhost", 27017);
      MongoDatabase database = mongo.getDatabase("myDb");
      // Retrieving a collection
      MongoCollection<Document> collection = database.getCollection("sampleCollection");
      // Dropping the collection
      collection.drop();
      System.out.println("Collection dropped successfully");
   }
}
MONGODB-PHP
Create a Collection
To create a collection in PHP, the createCollection() method is used.
Following is the code snippet to create a collection −
<?php
// connect to mongodb
$m = new MongoClient();
echo "Connection to database successfully";
// select a database
$db = $m->mydb;
echo "Database mydb selected";
$collection = $db->createCollection("mycol");
echo "Collection created succsessfully";
?>
Insert a Document
To insert a document into MongoDB, the insert() method is used.
Following is the code snippet to insert a document −
<?php
// connect to mongodb
$m = new MongoClient();
echo "Connection to database successfully";
// select a database
$db = $m->mydb;
echo "Database mydb selected";
$collection = $db->mycol;
echo "Collection selected succsessfully";
$document = array(
"title" => "MongoDB",
"description" => "database",
"likes" => 100,
"url" => "http://www.tutorialspoint.com/mongodb/",
"by" => "tutorials point"
);
$collection->insert($document);
echo "Document inserted successfully";
?>
Update a Document
To update a document, you need to use the update() method.
In the following example, we will update the title of inserted document to MongoDB
Tutorial. Following is the code snippet to update a document −
<?php
// connect to mongodb
$m = new MongoClient();
echo "Connection to database successfully";
// select a database
$db = $m->mydb;
echo "Database mydb selected";
$collection = $db->mycol;
echo "Collection selected succsessfully";
// now update the document
$collection->update(array("title"=>"MongoDB"),
array('$set'=>array("title"=>"MongoDB Tutorial")));
echo "Document updated successfully";
Delete a Document
To delete a document, you need to use remove() method.
In the following example, we will remove the documents that have the title MongoDB
Tutorial. Following is the code snippet to delete a document −
<?php
// connect to mongodb
$m = new MongoClient();
echo "Connection to database successfully";
// select a database
$db = $m->mydb;
echo "Database mydb selected";
$collection = $db->mycol;
echo "Collection selected succsessfully";
Cassandra:
Cassandra is a distributed, column-oriented database: data is
grouped by column instead of by row, for fast retrieval. MongoDB, by contrast, stores records as
documents in JSON format; it has a JavaScript shell and a rich set of functions,
which makes it easy to work with.
Data Model
Cassandra - Data Model
The data model of Cassandra is significantly different from what we normally see in
an RDBMS.
Cluster
Cassandra database is distributed over several machines that operate together.
The outermost container is known as the Cluster. For failure handling, every node
contains a replica, and in case of a failure, the replica takes charge. Cassandra
arranges the nodes in a cluster, in a ring format, and assigns data to them.
Keyspace
Keyspace is the outermost container for data in Cassandra. The basic attributes of
a Keyspace in Cassandra are −
Replication factor − It is the number of machines in the cluster that will
receive copies of the same data.
Replica placement strategy − It is nothing but the strategy to place replicas
in the ring. We have strategies such as simple strategy (rack-aware
strategy), old network topology strategy (rack-aware strategy),
and network topology strategy (datacenter-shared strategy).
Column families − Keyspace is a container for a list of one or more column
families. A column family, in turn, is a container of a collection of rows. Each
row contains ordered columns. Column families represent the structure of
your data. Each keyspace has at least one and often many column families.
The syntax of creating a Keyspace is as follows −
CREATE KEYSPACE Keyspace name
WITH replication = {'class': 'SimpleStrategy',
'replication_factor' : 3};
The following illustration shows a schematic view of a Keyspace.
Column Family
A column family is a container for an ordered collection of rows. Each row, in turn, is
an ordered collection of columns. The following table lists the points that
differentiate a column family from a table of relational databases.
Relational Table − A schema in a relational model is fixed. Once we define certain columns for a table, while inserting data, in every row all the columns must be filled at least with a null value.
Cassandra Column Family − Although the column families are defined, the columns are not. You can freely add any column to any column family at any time.
Relational Table − Relational tables define only columns and the user fills in the table with values.
Cassandra Column Family − In Cassandra, a table contains columns, or can be defined as a super column family.
RDBMS − RDBMS deals with structured data.
Cassandra − Cassandra deals with unstructured data.
RDBMS − Database is the outermost container that contains data corresponding to an application.
Cassandra − Keyspace is the outermost container that contains data corresponding to an application.
RDBMS − Tables are the entities of a database.
Cassandra − Tables or column families are the entity of a keyspace.
RDBMS − RDBMS supports the concepts of foreign keys and joins.
Cassandra − Relationships are represented using collections.
KEY SPACE
Syntax
CREATE KEYSPACE <identifier> WITH <properties>
i.e.
CREATE KEYSPACE "KeySpace Name"
WITH replication = {'class': 'Strategy name',
'replication_factor' : 'No. of replicas'};
Replication
The replication option is to specify the Replica Placement strategy and the
number of replicas wanted. The following table lists all the replica placement
strategies.
Simple Strategy − Specifies a simple replication factor for the cluster.
Network Topology Strategy − Using this option, you can set the replication factor for each data center independently.
Old Network Topology Strategy − This is a legacy replication strategy.
Durable_writes
Using this option, you can instruct Cassandra whether to use the commitlog for
updates on the current KeySpace. This option is not mandatory and by default, it is
set to true.
Example
Given below is an example of creating a KeySpace.
Here we are creating a KeySpace named TutorialsPoint.
We are using the first replica placement strategy, i.e., Simple Strategy.
And we are setting the replication factor to 3 replicas.
cqlsh.> CREATE KEYSPACE tutorialspoint
WITH replication = {'class':'SimpleStrategy',
'replication_factor' : 3};
Verification
You can verify whether the keyspace is created or not using the command Describe. If
you use this command over keyspaces, it will display all the keyspaces created as
shown below.
cqlsh> DESCRIBE keyspaces;
Durable_writes
By default, the durable_writes property of a keyspace is set to true; however, it can be
set to false. You cannot set this property with SimpleStrategy.
Example
Given below is the example demonstrating the usage of durable writes property.
cqlsh> CREATE KEYSPACE test
... WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy',
'datacenter1' : 3 }
... AND DURABLE_WRITES = false;
Verification
You can verify whether the durable_writes property of test KeySpace was set to
false by querying the System Keyspace. This query gives you all the KeySpaces
along with their properties.
cqlsh> SELECT * FROM system_schema.keyspaces;
keyspace_name  | durable_writes | strategy_class                                       | strategy_options
---------------+----------------+------------------------------------------------------+----------------------------
test           | False          | org.apache.cassandra.locator.NetworkTopologyStrategy | {"datacenter1" : "3"}
tutorialspoint | True           | org.apache.cassandra.locator.SimpleStrategy          | {"replication_factor" : "4"}
system         | True           | org.apache.cassandra.locator.LocalStrategy           | { }
system_traces  | True           | org.apache.cassandra.locator.SimpleStrategy          | {"replication_factor" : "2"}
(4 rows)
Here you can observe the durable_writes property of test KeySpace was set to
false.
Using a Keyspace
You can use a created KeySpace using the keyword USE. Its syntax is as follows −
Syntax:USE <identifier>
Example
In the following example, we are using the KeySpace tutorialspoint.
cqlsh> USE tutorialspoint;
cqlsh:tutorialspoint>
Creating a Keyspace using Java API
You can create a keyspace using the execute() method of the Session class. Store the query in a string variable and pass it to the execute() method as shown below.
//Query
String query = "CREATE KEYSPACE tp WITH replication "
   + "= {'class':'SimpleStrategy', 'replication_factor':1};";
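A minimal complete sketch of this program, assuming a Cassandra node on localhost and the DataStax Java driver used elsewhere in this unit:
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
public class Create_KeySpace {
   public static void main(String args[]) {
      //Query
      String query = "CREATE KEYSPACE tp WITH replication "
         + "= {'class':'SimpleStrategy', 'replication_factor':1};";
      // Creating a Cluster object pointing at the local node
      Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
      // Creating a Session object and executing the query
      Session session = cluster.connect();
      session.execute(query);
      System.out.println("Keyspace created");
      cluster.close();
   }
}
The same Cluster/Session skeleton applies to the ALTER KEYSPACE and DROP KEYSPACE programs referenced later in this section.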
Save the above program with the class name followed by .java, browse to the
location where it is saved. Compile and execute the program as shown below.
$javac Create_KeySpace.java
$java Create_KeySpace
Under normal conditions, it will produce the following output −
Keyspace created
Altering a KeySpace
ALTER KEYSPACE can be used to alter properties such as the number of replicas
and the durable_writes of a KeySpace. Given below is the syntax of this command.
Syntax
ALTER KEYSPACE <identifier> WITH <properties>
i.e.
ALTER KEYSPACE “KeySpace Name”
WITH replication = {'class': ‘Strategy name’,
'replication_factor' : ‘No.Of replicas’};
The properties of ALTER KEYSPACE are the same as those of CREATE KEYSPACE. It has
two properties: replication and durable_writes.
Replication
The replication option specifies the replica placement strategy and the number of
replicas wanted.
Durable_writes
Using this option, you can instruct Cassandra whether to use commitlog for updates
on the current KeySpace. This option is not mandatory and by default, it is set to
true.
Example
Given below is an example of altering a KeySpace.
Here we are altering a KeySpace named TutorialsPoint.
We are changing the replication factor from 1 to 3.
cqlsh.> ALTER KEYSPACE tutorialspoint
WITH replication = {'class':'SimpleStrategy',
'replication_factor' : 3};
Altering Durable_writes
You can also alter the durable_writes property of a KeySpace. Given below is the
durable_writes property of the test KeySpace.
SELECT * FROM system.schema_keyspaces;
keyspace_name  | durable_writes | strategy_class                                       | strategy_options
---------------+----------------+------------------------------------------------------+----------------------------
test           | False          | org.apache.cassandra.locator.NetworkTopologyStrategy | {"datacenter1":"3"}
tutorialspoint | True           | org.apache.cassandra.locator.SimpleStrategy          | {"replication_factor":"4"}
system         | True           | org.apache.cassandra.locator.LocalStrategy           | { }
system_traces  | True           | org.apache.cassandra.locator.SimpleStrategy          | {"replication_factor":"2"}
(4 rows)
ALTER KEYSPACE test
WITH REPLICATION = {'class' : 'NetworkTopologyStrategy',
'datacenter1' : 3}
AND DURABLE_WRITES = true;
Once again, if you verify the properties of KeySpaces, it will produce the following
output.
SELECT * FROM system.schema_keyspaces;
keyspace_name  | durable_writes | strategy_class                                       | strategy_options
---------------+----------------+------------------------------------------------------+----------------------------
test           | True           | org.apache.cassandra.locator.NetworkTopologyStrategy | {"datacenter1":"3"}
tutorialspoint | True           | org.apache.cassandra.locator.SimpleStrategy          | {"replication_factor":"4"}
system         | True           | org.apache.cassandra.locator.LocalStrategy           | { }
system_traces  | True           | org.apache.cassandra.locator.SimpleStrategy          | {"replication_factor":"2"}
(4 rows)
You have to store the query in a string variable and pass it to the execute() method
as shown below.
//Query
String query = "ALTER KEYSPACE tp WITH replication " + "=
{'class':'NetworkTopologyStrategy', 'datacenter1':3}" +" AND
DURABLE_WRITES = false;";
session.execute(query);
Given below is the complete program to alter a keyspace in Cassandra
using Java API.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
//Query
String query = "ALTER KEYSPACE tp WITH replication " + "=
{'class':'NetworkTopologyStrategy', 'datacenter1':3}"
+ "AND DURABLE_WRITES = false;";
System.out.println("Keyspace altered");
}
}
Save the above program with the class name followed by .java, browse to the
location where it is saved. Compile and execute the program as shown below.
$javac Alter_KeySpace.java
$java Alter_KeySpace
Under normal conditions, it produces the following output −
Keyspace Altered
Dropping a Keyspace
You can drop a KeySpace using the command DROP KEYSPACE. Given below is
the syntax for dropping a KeySpace.
Syntax
DROP KEYSPACE <identifier>
i.e.
DROP KEYSPACE “KeySpace name”
Example
The following code deletes the keyspace tutorialspoint.
cqlsh> DROP KEYSPACE tutorialspoint;
Verification
Verify the keyspaces using the command Describe and check whether the table is
dropped as shown below.
cqlsh> DESCRIBE keyspaces;
system system_traces
Since we have deleted the keyspace tutorialspoint, you will not find it in the
keyspaces list.
Dropping a Keyspace using Java API
You can drop a keyspace using the execute() method of the Session class. Follow the
steps given below to drop a keyspace using Java API.
session.execute(query);
Given below is the complete program to drop a keyspace in Cassandra
using Java API.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
//Query
String query = "Drop KEYSPACE tp";
Save the above program with the class name followed by .java, browse to the
location where it is saved. Compile and execute the program as shown below.
$javac Delete_KeySpace.java
$java Delete_KeySpace
Under normal conditions, it should produce the following output −
Keyspace deleted
TABLE OPERATION
Creating a Table
You can create a table using the command CREATE TABLE. Given below is the
syntax for creating a table.
Syntax
CREATE (TABLE | COLUMNFAMILY) <tablename>
('<column-definition>' , '<column-definition>')
(WITH <option> AND <option>)
Defining a Column
You can define a column as shown below.
column name1 data type,
column name2 data type,
example:
age int,
name text
Primary Key
The primary key is a column that is used to uniquely identify a row.
Therefore, defining a primary key is mandatory while creating a table. A primary key
is made of one or more columns of a table. You can define a primary key of a table
as shown below.
CREATE TABLE tablename(
column1_name datatype PRIMARY KEY,
column2_name datatype,
column3_name datatype
)
or
CREATE TABLE tablename(
column1_name datatype,
column2_name datatype,
column3_name datatype,
PRIMARY KEY (column1_name)
)
Example
Given below is an example to create a table in Cassandra using cqlsh. Here we are
−
Using the keyspace tutorialspoint
Creating a table named emp
It will have details such as employee name, id, city, salary, and phone number.
Employee id is the primary key.
cqlsh> USE tutorialspoint;
cqlsh:tutorialspoint> CREATE TABLE emp(
emp_id int PRIMARY KEY,
emp_name text,
emp_city text,
emp_sal varint,
emp_phone varint
);
Verification
The select statement will give you the schema. Verify the table using the select
statement as shown below.
cqlsh:tutorialspoint> select * from emp;
emp_id | emp_city | emp_name | emp_phone | emp_sal
--------+----------+----------+-----------+---------
(0 rows)
Here you can observe the table created with the given columns.
//Query
String query = "CREATE TABLE emp(emp_id int PRIMARY KEY, "
+ "emp_name text, "
+ "emp_city text, "
+ "emp_sal varint, "
+ "emp_phone varint );";
System.out.println("Table created");
}
}
Save the above program with the class name followed by .java, browse to the
location where it is saved. Compile and execute the program as shown below.
$javac Create_Table.java
$java Create_Table
Under normal conditions, it should produce the following output −
Table created
Altering a Table
You can alter a table using the command ALTER TABLE. Given below is the
syntax for altering a table.
Syntax
ALTER (TABLE | COLUMNFAMILY) <tablename> <instruction>
Using ALTER command, you can perform the following operations −
Add a column
Drop a column
Adding a Column
Using ALTER command, you can add a column to a table. While adding columns,
you have to take care that the column name is not conflicting with the existing
column names and that the table is not defined with compact storage option. Given
below is the syntax to add a column to a table.
ALTER TABLE <table_name>
ADD <new_column> <datatype>;
Example
Given below is an example to add a column to an existing table. Here we are
adding a column called emp_email of text datatype to the table named emp.
cqlsh:tutorialspoint> ALTER TABLE emp
... ADD emp_email text;
Verification
Use the SELECT statement to verify whether the column is added or not. Here you
can observe the newly added column emp_email.
cqlsh:tutorialspoint> select * from emp;
Dropping a Column
Using ALTER command, you can delete a column from a table. Before dropping a
column from a table, check that the table is not defined with compact storage
option. Given below is the syntax to delete a column from a table using ALTER
command.
ALTER TABLE <table_name>
DROP <column_name>;
Example
Given below is an example to drop a column from a table. Here we are deleting the
column named emp_email.
cqlsh:tutorialspoint> ALTER TABLE emp DROP emp_email;
Verification
Verify whether the column is deleted using the select statement, as shown below.
cqlsh:tutorialspoint> select * from emp;
//Query
String query = "ALTER TABLE emp ADD emp_email text";
System.out.println("Column added");
}
}
Dropping a Table
You can drop a table using the command Drop Table. Its syntax is as follows −
Syntax
DROP TABLE <tablename>
Example
The following code drops an existing table from a KeySpace.
cqlsh:tutorialspoint> DROP TABLE emp;
Verification
Use the Describe command to verify whether the table is deleted or not. Since the
emp table has been deleted, you will not find it in the column families list.
cqlsh:tutorialspoint> DESCRIBE COLUMNFAMILIES;
employee
//Query
String query = "DROP TABLE emp1;";
Cluster cluster =
Cluster.builder().addContactPoint("127.0.0.1").build();
System.out.println("Table dropped");
}
}
Save the above program with the class name followed by .java, browse to the
location where it is saved. Compile and execute the program as shown below.
$javac Drop_Table.java
$java Drop_Table
Under normal conditions, it should produce the following output −
Table dropped
CRUD OPERATIONS
Creating Data in a Table
You can insert data into the columns of a row in a table using the
command INSERT. Given below is the syntax for creating data in a table.
INSERT INTO <tablename>
(<column1 name>, <column2 name>....)
VALUES (<value1>, <value2>....)
USING <option>
Example
Let us assume there is a table called emp with columns (emp_id, emp_name,
emp_city, emp_phone, emp_sal) and you have to insert the following data into
the emp table.
Use the commands given below to fill the table with the required data.
cqlsh:tutorialspoint> INSERT INTO emp (emp_id, emp_name, emp_city,
emp_phone, emp_sal) VALUES(1,'ram', 'Hyderabad', 9848022338, 50000);
cqlsh:tutorialspoint> INSERT INTO emp (emp_id, emp_name, emp_city,
emp_phone, emp_sal) VALUES(2,'robin', 'Hyderabad', 9848022339, 40000);
cqlsh:tutorialspoint> INSERT INTO emp (emp_id, emp_name, emp_city,
emp_phone, emp_sal) VALUES(3,'rahman', 'Chennai', 9848022330, 45000);
Verification
After inserting data, use SELECT statement to verify whether the data has been
inserted or not. If you verify the emp table using SELECT statement, it will give you
the following output.
cqlsh:tutorialspoint> SELECT * FROM emp;
emp_id | emp_city | emp_name | emp_phone | emp_sal
--------+-----------+----------+------------+---------
1 | Hyderabad | ram | 9848022338 | 50000
2 | Hyderabad | robin | 9848022339 | 40000
3 | Chennai | rahman | 9848022330 | 45000
(3 rows)
Here you can observe the table has been populated with the data we inserted.
session.execute(query1);
session.execute(query2);
session.execute(query3);
Given below is the complete program to insert data into a table in Cassandra using
Java API.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
//queries
String query1 = "INSERT INTO emp (emp_id, emp_name, emp_city, emp_phone, emp_sal)"
   + " VALUES(1,'ram', 'Hyderabad', 9848022338, 50000);";
String query2 = "INSERT INTO emp (emp_id, emp_name, emp_city, emp_phone, emp_sal)"
   + " VALUES(2,'robin', 'Hyderabad', 9848022339, 40000);";
String query3 = "INSERT INTO emp (emp_id, emp_name, emp_city, emp_phone, emp_sal)"
   + " VALUES(3,'rahman', 'Chennai', 9848022330, 45000);";
session.execute(query1);
session.execute(query2);
session.execute(query3);
System.out.println("Data created");
}
}
Save the above program with the class name followed by .java, browse to the
location where it is saved. Compile and execute the program as shown below.
$javac Create_Data.java
$java Create_Data
Under normal conditions, it should produce the following output −
Data created
UPDATE
Updating Data in a Table
UPDATE is the command used to update data in a table. The following keywords
are used while updating data in a table −
Where − This clause is used to select the row to be updated.
Set − Set the value using this keyword.
Note − The WHERE clause must include all the columns composing the primary key.
While updating rows, if a given row is unavailable, then UPDATE creates a fresh
row. Given below is the syntax of UPDATE command −
UPDATE <tablename>
SET <column name> = <new value>
<column name> = <value>....
WHERE <condition>
Example
Assume there is a table named emp. This table stores the details of employees of a
certain company, and it has the following details −
Let us now update emp_city of robin to Delhi, and his salary to 50000. Given below
is the query to perform the required updates.
cqlsh:tutorialspoint> UPDATE emp SET
emp_city='Delhi',emp_sal=50000
WHERE emp_id=2;
Verification
Use SELECT statement to verify whether the data has been updated or not. If you
verify the emp table using SELECT statement, it will produce the following output.
cqlsh:tutorialspoint> select * from emp;
(3 rows)
Here you can observe the table data has got updated.
//query
String query = "UPDATE emp SET emp_city='Delhi', emp_sal=50000"
   + " WHERE emp_id=2;";
session.execute(query);
System.out.println("Data updated");
}
}
Save the above program with the class name followed by .java, browse to the
location where it is saved. Compile and execute the program as shown below.
$javac Update_Data.java
$java Update_Data
Under normal conditions, it should produce the following output −
Data updated
READ DATA
Reading Data using Select Clause
SELECT clause is used to read data from a table in Cassandra. Using this clause,
you can read a whole table, a single column, or a particular cell. Given below is the
syntax of SELECT clause.
SELECT <column list> | * FROM <tablename>;
Example
Assume there is a table in the keyspace named emp with the following details −
The following example shows how to read a whole table using SELECT clause.
Here we are reading a table called emp.
cqlsh:tutorialspoint> select * from emp;
(4 rows)
The following example shows how to read particular columns of a table. Here we are reading the emp_name and emp_sal columns.
cqlsh:tutorialspoint> SELECT emp_name, emp_sal FROM emp;
emp_name | emp_sal
----------+---------
ram | 50000
robin | 50000
rajeev | 30000
rahman | 50000
(4 rows)
Where Clause
Using WHERE clause, you can put a constraint on the required columns. Its syntax
is as follows −
SELECT * FROM <table name> WHERE <condition>;
Note − A WHERE clause can be used only on the columns that are a part of
primary key or have a secondary index on them.
In the following example, we are reading the details of an employee whose salary is
50000. First of all, set secondary index to the column emp_sal.
cqlsh:tutorialspoint> CREATE INDEX ON emp(emp_sal);
cqlsh:tutorialspoint> SELECT * FROM emp WHERE emp_sal=50000;
//queries
String query = "SELECT * FROM emp;";
ResultSet result = session.execute(query);
System.out.println(result.all());
}
}
Save the above program with the class name followed by .java, browse to the
location where it is saved. Compile and execute the program as shown below.
$javac Read_Data.java
$java Read_Data
Under normal conditions, it should produce the following output −
[Row[1, Hyderabad, ram, 9848022338, 50000], Row[2, Delhi, robin,
9848022339, 50000], Row[4, Pune, rajeev, 9848022331, 30000],
Row[3,
Chennai, rahman, 9848022330, 50000]]
DELETE DATA
Deleting Data from a Table
You can delete data from a table using the command DELETE. Its syntax is as
follows −
DELETE FROM <identifier> WHERE <condition>;
Example
Let us assume there is a table in Cassandra called emp having the following data. The statement below deletes the row whose emp_id is 3 −
cqlsh:tutorialspoint> DELETE FROM emp WHERE emp_id=3;
Verification
Use SELECT statement to verify whether the data has been deleted or not. If you
verify the emp table using SELECT, it will produce the following output −
cqlsh:tutorialspoint> select * from emp;
//query
String query = "DELETE FROM emp WHERE emp_id=3;";
System.out.println("Data deleted");
}
}
Save the above program with the class name followed by .java, browse to the
location where it is saved. Compile and execute the program as shown below.
$javac Delete_Data.java
$java Delete_Data
Under normal conditions, it should produce the following output −
Data deleted
Cassandra CQL Types
Query languages are computer languages used to make queries in databases and
information systems. Cassandra Query Language is the primary query language for
communicating with the Apache Cassandra database.
CQL provides a rich set of built-in data types, including collection types. Along with
these data types, users can also create their own custom data types. The following
table provides a list of built-in data types available in CQL.
Data Type | Constants | Description
ascii | strings | Represents ASCII character strings
bigint | bigint | Represents 64-bit signed longs
boolean | booleans | Represents true or false
int | integers | Represents 32-bit signed integers
text | strings | Represents UTF-8 encoded strings
timestamp | integers, strings | Represents a timestamp
uuid | uuids | Represents a universally unique identifier (UUID)
varchar | strings | Represents UTF-8 encoded strings
Collection Types
Cassandra Query Language also provides collection data types. The following
table provides a list of Collections available in CQL.
Collection | Description
list | A collection of one or more ordered elements
map | A collection of key-value pairs
set | A collection of one or more elements
User-defined datatypes
Cqlsh provides users a facility of creating their own data types. Given below are the
commands used while dealing with user defined datatypes.
CREATE TYPE − Creates a user-defined datatype.
ALTER TYPE − Modifies a user-defined datatype.
DROP TYPE − Drops a user-defined datatype.
DESCRIBE TYPE − Describes a user-defined datatype.
DESCRIBE TYPES − Describes user-defined datatypes.
CQL COLLECTIONS
CQL provides the facility of using Collection data types. Using these Collection
types, you can store multiple values in a single variable. This chapter explains how
to use Collections in Cassandra.
List
List is used in the cases where
the order of the elements is to be maintained, and
a value is to be stored multiple times.
You can get the values of a list data type using the index of the elements in the list.
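For context, the table used in the following example could have been created and populated as follows (a sketch):
cqlsh:tutorialspoint> CREATE TABLE data(name text PRIMARY KEY, email list<text>);
cqlsh:tutorialspoint> INSERT INTO data(name, email) VALUES ('ramu', ['abc@gmail.com', 'cba@yahoo.com']);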
Updating a List
Given below is an example to update the list data type in a table called data. Here
we are adding another email to the list.
cqlsh:tutorialspoint> UPDATE data
... SET email = email +['xyz@tutorialspoint.com']
... where name = 'ramu';
Verification
If you verify the table using SELECT statement, you will get the following result −
cqlsh:tutorialspoint> SELECT * FROM data;
name | email
------
+--------------------------------------------------------------
ramu | ['abc@gmail.com', 'cba@yahoo.com',
'xyz@tutorialspoint.com']
(1 rows)
HIVE
Apache Hive is a data warehouse system that allows users to read, write, and manage
petabytes of data using SQL.
Hive is built on top of Apache Hadoop, an open-source framework used to
efficiently store and process large datasets.
As a result, Hive is closely integrated with Hadoop, and is designed to work
quickly on petabytes of data.
(A separate, unrelated product also called Hive is a lightweight NoSQL database
for Dart/Flutter: it is written in pure Dart, benchmarks well on devices, and
stores data in "boxes" that must be opened before use. The remainder of this
section concerns Apache Hive.)
DATA TYPES
This chapter takes you through the different data types in Hive, which are involved
in the table creation. All the data types in Hive are classified into four types, given
as follows:
Column Types
Literals
Null Values
Complex Types
Column Types
Column types are used as column data types of Hive. They are as follows:
Integral Types
Integer type data can be specified using integral data types, INT. When the data
range exceeds the range of INT, you need to use BIGINT and if the data range is
smaller than the INT, you use SMALLINT. TINYINT is smaller than SMALLINT.
The following table depicts various INT data types:
Data Type | Postfix | Example
TINYINT | Y | 10Y
SMALLINT | S | 10S
INT | – | 10
BIGINT | L | 10L
String Types
String type data types can be specified using single quotes (' ') or double quotes ("
"). It contains two data types: VARCHAR and CHAR. Hive follows C-types escape
characters.
The following table depicts various CHAR data types:
Data Type | Length
VARCHAR | 1 to 65535
CHAR | 255
Timestamp
It supports traditional UNIX timestamp with optional nanosecond precision. It
supports java.sql.Timestamp format “YYYY-MM-DD HH:MM:SS.fffffffff” and format
“yyyy-mm-dd hh:mm:ss.ffffffffff”.
Dates
DATE values are described in year/month/day format in the form {{YYYY-MM-DD}}.
Decimals
The DECIMAL type in Hive is as same as Big Decimal format of Java. It is used for
representing immutable arbitrary precision. The syntax and example is as follows:
DECIMAL(precision, scale)
decimal(10,0)
Union Types
Union is a collection of heterogeneous data types. You can create an instance
using create union. The syntax and example is as follows:
UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>
{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}
Literals
The following literals are used in Hive:
Decimal Type
Decimal type data is nothing but a floating point value with a higher range than the DOUBLE data
type. The range of the decimal type is approximately -10^308 to 10^308.
Null Value
Missing values are represented by the special value NULL.
Complex Types
The Hive complex data types are as follows:
Arrays
Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>
Maps
Maps in Hive are similar to Java Maps.
Syntax: MAP<primitive_type, data_type>
Structs
Structs in Hive are collections of named fields, similar to C structs; each field can carry a comment.
Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
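Putting the complex types together, a table definition might look like this (the table and column names are illustrative):
hive> CREATE TABLE employee_details (
   name STRING,
   skills ARRAY<STRING>,
   contact MAP<STRING, STRING>,
   address STRUCT<city:STRING, pin:INT>
);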
DATABASE OPERATIONS
CREATE DATABASE
Hive is a database technology that can define databases and tables to analyze
structured data. The theme for structured data analysis is to store the data in a
tabular manner, and pass queries to analyze it. This chapter explains how to create
Hive database. Hive contains a default database named default.
Here, IF NOT EXISTS is an optional clause that prevents the command from
failing if a database with the same name already exists. We can use SCHEMA
in place of DATABASE in this command. The following query is executed to
create a database named userdb:
hive> CREATE DATABASE [IF NOT EXISTS] userdb;
or
hive> CREATE SCHEMA userdb;
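You can verify that the database was created by listing all databases; the default database should appear alongside userdb:
hive> SHOW DATABASES;
default
userdb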
JDBC Program
The JDBC program to create a database is given below.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveCreateDb {
   private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

   public static void main(String[] args) throws Exception {
      // Register the Hive JDBC driver
      Class.forName(driverName);
      // get connection
      Connection con = DriverManager.getConnection(
            "jdbc:hive://localhost:10000/default", "", "");
      Statement stmt = con.createStatement();
      // create the database
      stmt.executeQuery("CREATE DATABASE userdb");
      System.out.println("Database userdb created successfully.");
      con.close();
   }
}
Save the program in a file named HiveCreateDb.java. The following commands are
used to compile and execute this program.
$ javac HiveCreateDb.java
$ java HiveCreateDb
Output:
Database userdb created successfully.
The following queries are used to drop a database. Let us assume that the
database name is userdb.
hive> DROP DATABASE IF EXISTS userdb;
The following query drops the database using CASCADE. It means dropping
respective tables before dropping the database.
hive> DROP DATABASE IF EXISTS userdb CASCADE;
JDBC Program
The JDBC program to drop a database is given below.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveDropDb {
   private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

   public static void main(String[] args) throws Exception {
      // Register the Hive JDBC driver
      Class.forName(driverName);
      // get connection
      Connection con = DriverManager.getConnection(
            "jdbc:hive://localhost:10000/default", "", "");
      Statement stmt = con.createStatement();
      // drop the database
      stmt.executeQuery("DROP DATABASE userdb");
      System.out.println("Drop userdb database successful.");
      con.close();
   }
}
Save the program in a file named HiveDropDb.java. Given below are the commands
to compile and execute this program.
$ javac HiveDropDb.java
$ java HiveDropDb
Output:
Drop userdb database successful.
PARTITIONING
Hive organizes tables into partitions. It is a way of dividing a table into related parts
based on the values of partitioned columns such as date, city, and department.
Using partition, it is easy to query a portion of the data.
Tables or partitions are sub-divided into buckets to provide extra structure
that can be used for more efficient querying. Bucketing works on the value of
a hash function of some column of the table, as sketched below.
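A sketch of a bucketed table declaration (the bucketing column and bucket count are illustrative):
hive> CREATE TABLE Tab1 (id INT, name STRING, dept STRING, yoj INT)
      CLUSTERED BY (id) INTO 4 BUCKETS;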
For example, a table named Tab1 contains employee data such as id, name, dept,
and yoj (i.e., year of joining). Suppose you need to retrieve the details of all
employees who joined in 2012. A query searches the whole table for the required
information. However, if you partition the employee data with the year and store it in
a separate file, it reduces the query processing time. The following example shows
how to partition a file and its data:
The following file contains employeedata table.
/tab1/employeedata/file1
id, name, dept, yoj
1, gopal, TP, 2012
2, kiran, HR, 2012
3, kaleel, SC, 2013
4, Prasanth, SC, 2013
The above data is partitioned into two files using year.
/tab1/employeedata/2012/file2
1, gopal, TP, 2012
2, kiran, HR, 2012
/tab1/employeedata/2013/file3
3, kaleel, SC, 2013
4, Prasanth, SC, 2013
Adding a Partition
We can add partitions to a table by altering the table. Let us assume we have a
table called employee with fields such as Id, Name, Salary, Designation, Dept, and
yoj.
Syntax:
ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION
partition_spec
[LOCATION 'location1'] partition_spec [LOCATION 'location2'] ...;
partition_spec:
: (p_column = p_col_value, p_column = p_col_value, ...)
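For example, assuming the employee table described above, a partition for employees who joined in 2012 could be added as follows (the location path is illustrative):
hive> ALTER TABLE employee
      ADD PARTITION (yoj='2012') LOCATION '/tab1/employeedata/2012';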
Renaming a Partition
The syntax of this command is as follows.
ALTER TABLE table_name PARTITION partition_spec RENAME TO
PARTITION partition_spec;
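For example, to rename the partition added above (the new value is illustrative):
hive> ALTER TABLE employee PARTITION (yoj='2012')
      RENAME TO PARTITION (yoj='2012_archive');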
Dropping a Partition
The following syntax is used to drop a partition:
ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec,
PARTITION partition_spec,...;
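For example:
hive> ALTER TABLE employee DROP IF EXISTS PARTITION (yoj='2012_archive');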
HIVEQL
The Hive Query Language (HiveQL) is a query language for Hive to process and
analyze structured data in a Metastore. The sections below cover its most
commonly used clauses.
HiveQL - Select-Where
The Hive Query Language (HiveQL) is a query language for Hive to process and
analyze structured data in a Metastore. This chapter explains how to use the
SELECT statement with WHERE clause.
SELECT statement is used to retrieve the data from a table. WHERE clause works
similar to a condition. It filters the data using the condition and gives you a finite
result. The built-in operators and functions generate an expression, which fulfils the
condition.
Syntax
Given below is the syntax of the SELECT query:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];
Example
Let us take an example for SELECT…WHERE clause. Assume we have the
employee table as given below, with fields named Id, Name, Salary, Designation,
and Dept. Generate a query to retrieve the employee details who earn a salary of
more than Rs 30000.
+------+--------------+-------------+-------------------+--------+
| ID   | Name         | Salary      | Designation       | Dept   |
+------+--------------+-------------+-------------------+--------+
| 1201 | Gopal        | 45000       | Technical manager | TP     |
| 1202 | Manisha      | 45000       | Proofreader       | PR     |
| 1203 | Masthanvali  | 40000       | Technical writer  | TP     |
| 1204 | Krian        | 40000       | Hr Admin          | HR     |
| 1205 | Kranthi      | 30000       | Op Admin          | Admin  |
+------+--------------+-------------+-------------------+--------+
The following query retrieves the employee details using the above scenario:
hive> SELECT * FROM employee WHERE salary>30000;
On successful execution of the query, you get to see the following response:
+------+--------------+-------------+-------------------+--------+
| ID   | Name         | Salary      | Designation       | Dept   |
+------+--------------+-------------+-------------------+--------+
| 1201 | Gopal        | 45000       | Technical manager | TP     |
| 1202 | Manisha      | 45000       | Proofreader       | PR     |
| 1203 | Masthanvali  | 40000       | Technical writer  | TP     |
| 1204 | Krian        | 40000       | Hr Admin          | HR     |
+------+--------------+-------------+-------------------+--------+
JDBC Program
The JDBC program to apply where clause for the given example is as follows.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQLWhere {
   private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

   public static void main(String[] args) throws Exception {
      // Register the Hive JDBC driver
      Class.forName(driverName);
      // get connection
      Connection con = DriverManager.getConnection(
            "jdbc:hive://localhost:10000/userdb", "", "");
      // create statement
      Statement stmt = con.createStatement();
      // execute statement
      ResultSet res = stmt.executeQuery("SELECT * FROM employee WHERE salary>30000");
      System.out.println("Result:");
      System.out.println(" ID \t Name \t Salary \t Designation \t Dept ");
      while (res.next()) {
         System.out.println(res.getInt(1) + " " + res.getString(2) + " "
               + res.getDouble(3) + " " + res.getString(4) + " " + res.getString(5));
      }
      con.close();
   }
}
Save the program in a file named HiveQLWhere.java. Use the following commands
to compile and execute this program.
$ javac HiveQLWhere.java
$ java HiveQLWhere
Output:
ID Name Salary Designation Dept
1201 Gopal 45000 Technical manager TP
1202 Manisha 45000 Proofreader PR
1203 Masthanvali 40000 Technical writer TP
1204 Krian 40000 Hr Admin HR
HiveQL - Select-Order By
This chapter explains how to use the ORDER BY clause in a SELECT statement.
The ORDER BY clause is used to retrieve the details based on one column and sort
the result set by ascending or descending order.
Syntax
Given below is the syntax of the ORDER BY clause:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[LIMIT number];
Example
Let us take an example for the SELECT...ORDER BY clause. Assume the employee
table as given below, with the fields named Id, Name, Salary, Designation,
and Dept. Generate a query to retrieve the employee details ordered by
department name.
+------+--------------+-------------+-------------------+--------+
| ID   | Name         | Salary      | Designation       | Dept   |
+------+--------------+-------------+-------------------+--------+
| 1201 | Gopal        | 45000       | Technical manager | TP     |
| 1202 | Manisha      | 45000       | Proofreader       | PR     |
| 1203 | Masthanvali  | 40000       | Technical writer  | TP     |
| 1204 | Krian        | 40000       | Hr Admin          | HR     |
| 1205 | Kranthi      | 30000       | Op Admin          | Admin  |
+------+--------------+-------------+-------------------+--------+
The following query retrieves the employee details using the above scenario:
hive> SELECT * FROM employee ORDER BY DEPT;
On successful execution of the query, you get to see the following response:
+------+--------------+-------------+-------------------+--------+
| ID   | Name         | Salary      | Designation       | Dept   |
+------+--------------+-------------+-------------------+--------+
| 1205 | Kranthi      | 30000       | Op Admin          | Admin  |
| 1204 | Krian        | 40000       | Hr Admin          | HR     |
| 1202 | Manisha      | 45000       | Proofreader       | PR     |
| 1201 | Gopal        | 45000       | Technical manager | TP     |
| 1203 | Masthanvali  | 40000       | Technical writer  | TP     |
+------+--------------+-------------+-------------------+--------+
JDBC Program
Here is the JDBC program to apply Order By clause for the given example.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQLOrderBy {
   private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

   public static void main(String[] args) throws Exception {
      // Register the Hive JDBC driver
      Class.forName(driverName);
      // get connection
      Connection con = DriverManager.getConnection(
            "jdbc:hive://localhost:10000/userdb", "", "");
      // create statement
      Statement stmt = con.createStatement();
      // execute statement
      ResultSet res = stmt.executeQuery("SELECT * FROM employee ORDER BY DEPT");
      System.out.println(" ID \t Name \t Salary \t Designation \t Dept ");
      while (res.next()) {
         System.out.println(res.getInt(1) + " " + res.getString(2) + " "
               + res.getDouble(3) + " " + res.getString(4) + " " + res.getString(5));
      }
      con.close();
   }
}
Save the program in a file named HiveQLOrderBy.java. Use the following commands
to compile and execute this program.
$ javac HiveQLOrderBy.java
$ java HiveQLOrderBy
Output:
ID    Name         Salary   Designation        Dept
1205  Kranthi      30000    Op Admin           Admin
1204  Krian        40000    Hr Admin           HR
1202  Manisha      45000    Proofreader        PR
1201  Gopal        45000    Technical manager  TP
1203  Masthanvali  40000    Technical writer   TP
HiveQL - Select-Group By
This chapter explains the details of GROUP BY clause in a SELECT statement. The
GROUP BY clause is used to group all the records in a result set using a particular
collection column. It is used to query a group of records.
Syntax
The syntax of GROUP BY clause is as follows:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[LIMIT number];
Example
Let us take an example of SELECT…GROUP BY clause. Assume employee table
as given below, with Id, Name, Salary, Designation, and Dept fields. Generate a
query to retrieve the number of employees in each department.
+------+--------------+-------------+-------------------+--------+
| ID   | Name         | Salary      | Designation       | Dept   |
+------+--------------+-------------+-------------------+--------+
| 1201 | Gopal        | 45000       | Technical manager | TP     |
| 1202 | Manisha      | 45000       | Proofreader       | PR     |
| 1203 | Masthanvali  | 40000       | Technical writer  | TP     |
| 1204 | Krian        | 45000       | Proofreader       | PR     |
| 1205 | Kranthi      | 30000       | Op Admin          | Admin  |
+------+--------------+-------------+-------------------+--------+
The following query retrieves the employee details using the above scenario.
hive> SELECT Dept, count(*) FROM employee GROUP BY DEPT;
On successful execution of the query, you get to see the following response:
+-------+----------+
| Dept  | Count(*) |
+-------+----------+
| Admin | 1        |
| PR    | 2        |
| TP    | 2        |
+-------+----------+
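The HAVING clause shown in the syntax above filters groups after aggregation. A small sketch on the same table, keeping only departments with at least two employees:
hive> SELECT Dept, count(*) FROM employee GROUP BY Dept HAVING count(*) >= 2;
This would return only the PR and TP rows.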
JDBC Program
Given below is the JDBC program to apply the Group By clause for the given
example.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQLGroupBy {
   private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";

   public static void main(String[] args) throws Exception {
      // Register the Hive JDBC driver
      Class.forName(driverName);
      // get connection
      Connection con = DriverManager.getConnection(
            "jdbc:hive://localhost:10000/userdb", "", "");
      // create statement
      Statement stmt = con.createStatement();
      // execute statement
      ResultSet res = stmt.executeQuery("SELECT Dept, count(*) FROM employee GROUP BY Dept");
      System.out.println(" Dept \t count(*)");
      while (res.next()) {
         System.out.println(res.getString(1) + " " + res.getInt(2));
      }
      con.close();
   }
}
Save the program in a file named HiveQLGroupBy.java. Use the following commands
to compile and execute this program.
$ javac HiveQLGroupBy.java
$ java HiveQLGroupBy
Output:
Dept   Count(*)
Admin  1
PR     2
TP     2
HiveQL - Select-Joins
JOIN is a clause that is used for combining specific fields from two tables by using
values common to each one. It is used to combine records from two or more tables
in the database.
Syntax
join_table:
   table_reference JOIN table_factor [join_condition]
   | table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
   | table_reference LEFT SEMI JOIN table_reference join_condition
   | table_reference CROSS JOIN table_reference [join_condition]
Example
We will use the following two tables in this chapter. Consider the following table
named CUSTOMERS:
+----+----------+-----+-----------+----------+
| ID | NAME | AGE | ADDRESS | SALARY |
+----+----------+-----+-----------+----------+
| 1 | Ramesh | 32 | Ahmedabad | 2000.00 |
| 2 | Khilan | 25 | Delhi | 1500.00 |
| 3 | kaushik | 23 | Kota | 2000.00 |
| 4 | Chaitali | 25 | Mumbai | 6500.00 |
| 5 | Hardik | 27 | Bhopal | 8500.00 |
| 6 | Komal | 22 | MP | 4500.00 |
| 7 | Muffy | 24 | Indore | 10000.00 |
+----+----------+-----+-----------+----------+
Consider another table ORDERS as follows:
+-----+---------------------+-------------+--------+
|OID | DATE | CUSTOMER_ID | AMOUNT |
+-----+---------------------+-------------+--------+
| 102 | 2009-10-08 00:00:00 | 3 | 3000 |
| 100 | 2009-10-08 00:00:00 | 3 | 1500 |
| 101 | 2009-11-20 00:00:00 | 2 | 1560 |
| 103 | 2008-05-20 00:00:00 | 4 | 2060 |
+-----+---------------------+-------------+--------+
There are different types of joins given as follows:
JOIN
LEFT OUTER JOIN
RIGHT OUTER JOIN
FULL OUTER JOIN
JOIN
The JOIN clause is used to combine and retrieve records from multiple tables.
JOIN in HiveQL is the same as INNER JOIN in SQL. A JOIN condition is specified
using the primary and foreign keys of the tables.
The following query executes JOIN on the CUSTOMERS and ORDERS tables, and
retrieves the records:
hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT
FROM CUSTOMERS c JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+----+----------+-----+--------+
| ID | NAME | AGE | AMOUNT |
+----+----------+-----+--------+
| 3 | kaushik | 23 | 3000 |
| 3 | kaushik | 23 | 1500 |
| 2 | Khilan | 25 | 1560 |
| 4 | Chaitali | 25 | 2060 |
+----+----------+-----+--------+
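LEFT OUTER JOIN
The HiveQL LEFT OUTER JOIN returns all the rows from the left table, even if there are no matches in the right table; unmatched rows carry NULL in the columns from the right table. As a sketch modeled on the JOIN example above, the following query and the result it would produce on the CUSTOMERS and ORDERS tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
LEFT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
+------+----------+--------+---------------------+
| ID   | NAME     | AMOUNT | DATE                |
+------+----------+--------+---------------------+
| 1    | Ramesh   | NULL   | NULL                |
| 2    | Khilan   | 1560   | 2009-11-20 00:00:00 |
| 3    | kaushik  | 3000   | 2009-10-08 00:00:00 |
| 3    | kaushik  | 1500   | 2009-10-08 00:00:00 |
| 4    | Chaitali | 2060   | 2008-05-20 00:00:00 |
| 5    | Hardik   | NULL   | NULL                |
| 6    | Komal    | NULL   | NULL                |
| 7    | Muffy    | NULL   | NULL                |
+------+----------+--------+---------------------+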
RIGHT OUTER JOIN
The HiveQL RIGHT OUTER JOIN returns all the rows from the right table, even if
there are no matches in the left table. The following query demonstrates RIGHT
OUTER JOIN between the CUSTOMERS and ORDERS tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
RIGHT OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+------+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+------+----------+--------+---------------------+
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
+------+----------+--------+---------------------+
FULL OUTER JOIN
The HiveQL FULL OUTER JOIN combines the records of both the left and the right
outer tables that fulfil the JOIN condition. The joined table contains either all the
records from both the tables, or fills in NULL values for missing matches on either
side.
The following query demonstrates FULL OUTER JOIN between the CUSTOMERS and
ORDERS tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE
FROM CUSTOMERS c
FULL OUTER JOIN ORDERS o
ON (c.ID = o.CUSTOMER_ID);
On successful execution of the query, you get to see the following response:
+------+----------+--------+---------------------+
| ID | NAME | AMOUNT | DATE |
+------+----------+--------+---------------------+
| 1 | Ramesh | NULL | NULL |
| 2 | Khilan | 1560 | 2009-11-20 00:00:00 |
| 3 | kaushik | 3000 | 2009-10-08 00:00:00 |
| 3 | kaushik | 1500 | 2009-10-08 00:00:00 |
| 4 | Chaitali | 2060 | 2008-05-20 00:00:00 |
| 5 | Hardik | NULL | NULL |
| 6 | Komal | NULL | NULL |
| 7 | Muffy | NULL | NULL |
+------+----------+--------+---------------------+
ORIENTDB
OrientDB ships with a generic vertex persistent class, called V, as well as a class
for edges, called E. As an example, you can create a new vertex using
the INSERT command with V.
orientdb> INSERT INTO V SET name='Jay'
While you have the option of working with vertexes and edges in your database as
they are, you can also extend the standard V and E classes to suit the particular
needs of your application. The advantage of this approach is that records are
grouped into classes on which you can define properties, constraints, and
polymorphic queries.
For example, consider a social network based on restaurants. You need to start
with a class for individual customers and another for the restaurants they
patronize. Create these classes to extend the V class.
orientdb> CREATE CLASS Person EXTENDS V
orientdb> CREATE CLASS Restaurant EXTENDS V
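With the classes in place, create the vertices themselves. A sketch consistent with the Record IDs used later in this section (#11:x for Person, #12:x for Restaurant; the type values follow the restaurant descriptions below):
orientdb> CREATE VERTEX Person SET name='Luca'
orientdb> CREATE VERTEX Person SET name='Bill'
orientdb> CREATE VERTEX Person SET name='Jay'
orientdb> CREATE VERTEX Restaurant SET name='Dante', type='Pizza'
orientdb> CREATE VERTEX Restaurant SET name='Charlie', type='French'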
Creating Edges
For the moment, these vertices are independent of one another, tied together only
by the classes to which they belong. That is, they are not yet connected by edges.
Before you can make these connections, you first need to create a class that
extends E.
orientdb> CREATE CLASS Eat EXTENDS E
This creates the class Eat, which extends the class E. Eat represents the
relationship between the vertex Person and the vertex Restaurant.
When you create the edge from this class, note that the orientation of the vertices
is important, because it gives the relationship its meaning. For instance, creating
an edge in the opposite direction, (from Restaurant to Person), would call for a
separate class, such as Attendee.
The user Luca eats at the pizza joint Dante. Create an edge that represents this
connection:
orientdb> CREATE EDGE Eat FROM ( SELECT FROM Person WHERE name='Luca' )
TO ( SELECT FROM Restaurant WHERE name='Dante' )
In the event that you know the Record ID of the vertices, you can connect them
directly with a shorter and faster command. For example, the person Bill also eats
at the restaurant Dante and the person Jay eats at the restaurant Charlie. Create
edges in the class Eat to represent these connections.
orientdb> CREATE EDGE Eat FROM #11:1 TO #12:0
orientdb> CREATE EDGE Eat FROM #11:2 TO #12:1
Querying Graphs
In the above example you created and populated a small graph of a social network
of individual users and the restaurants at which they eat. You can now begin to
experiment with queries on a graph database.
To cross edges, you can use special graph functions, such as:
OUT(), which returns the adjacent outgoing vertices
IN(), which returns the adjacent incoming vertices
BOTH(), which returns the adjacent vertices in both directions
For example, to know all of the people who eat in the restaurant Dante, which has
a Record ID of #12:0, you can access the record for that restaurant and traverse
the incoming edges to discover which entries in the Person class connect to it.
orientdb> SELECT IN() FROM Restaurant WHERE name='Dante'
-------+----------------+
@RID | in |
-------+----------------+
#-2:1 | [#11:0, #11:1] |
-------+----------------+
This query displays the Record IDs from the Person class that connect to the
restaurant Dante. In cases such as this, you can use the EXPAND() special function
to transform the vertex collection in the result-set by expanding it.
orientdb> SELECT EXPAND( IN() ) FROM Restaurant WHERE name='Dante'
-------+-------------+-------------+---------+
@RID | @CLASS | Name | out_Eat |
-------+-------------+-------------+---------+
#11:0 | Person | Luca | #12:0 |
#11:1 | Person | Bill | #12:0 |
-------+-------------+-------------+---------+
Your application at this point shows connections between individual users and the
restaurants they patronize. While this is interesting, it does not yet function as a
social network. To do so, you need to establish edges that connect the users to
one another. For example, create a Friend edge class, connect Luca (#11:0) to
Jay (#11:2), and then query Luca's friends:
orientdb> CREATE CLASS Friend EXTENDS E
orientdb> CREATE EDGE Friend FROM #11:0 TO #11:2
orientdb> SELECT EXPAND( BOTH('Friend') ) FROM Person WHERE name='Luca'
-------+-------------+-------------+---------+-----------+
@RID | @CLASS | Name | out_Eat | in_Friend |
-------+-------------+-------------+---------+-----------+
#11:2 | Person | Jay | #12:1 | #11:0 |
-------+-------------+-------------+---------+-----------+
Here, the BOTH() function takes the edge class Friend as an argument, crossing
only relationships of the Friend kind (that is, it skips the Eat class for now).
Note in the result-set that Jay's record carries the relationship with Luca:
Luca's Record ID, #11:0, appears in the in_Friend field.
You can also now view all the restaurants patronized by friends of Luca.
orientdb> SELECT EXPAND( BOTH('Friend').out('Eat') ) FROM Person
WHERE name='Luca'
-------+-------------+-------------+-------------+--------+
@RID | @CLASS | Name | Type | in_Eat |
-------+-------------+-------------+-------------+--------+
#12:1 | Restaurant | Charlie | French | #11:2 |
-------+-------------+-------------+-------------+--------+
Lightweight Edges
Beginning with version 1.4.x, OrientDB manages some edges as Lightweight Edges.
Lightweight Edges do not have Record IDs, but are physically stored as links
within vertices. Note that OrientDB uses a Lightweight Edge only when the
edge has no properties; otherwise it uses the standard Edge.
From a logical point of view, Lightweight Edges are Edges in all effects, so all
graph functions work with them. They exist to improve performance and reduce
disk space.
Because Lightweight Edges don't exist as separate records in the database, some
queries won't work as expected. For instance,
orientdb> SELECT FROM E
In most cases edges are used to connect vertices, so this query would not
cause any problems in particular; but it would not return Lightweight Edges in the
result-set. In the event that you need to query edges directly, including those with
no properties, disable the Lightweight Edge feature.
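A sketch of turning the feature off at the database level (the custom attribute name follows the OrientDB documentation):
orientdb> ALTER DATABASE CUSTOM useLightweightEdges=false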