BDA Lab
Acropolis Institute of
Technology and
Research, Indore
Department of CSE
Submitted To: Prof. Sumit Jain
(Artificial Intelligence &
Machine Learning)
Submitted By:
Tushar Padihar
0827AL201060
AL F-1/4th Yr/ 8th Sem
CERTIFICATE
This is to certify that the experimental work entered in this journal, as per the B.Tech. IV year syllabus prescribed by the RGPV, was done by Mr. Tushar Padihar (Enrollment No. 0827AL201060) in the Big Data Analytics Laboratory of this institute during the academic year 2023-2024.
In this lab, students will be able to learn and develop applications using Big Data Analytics concepts. Students can expand their skill set by deriving practical solutions using data analytics. Moreover, this lab provides an understanding of the importance of various algorithms in Data Science. The primary objective of this lab is to optimize business decisions and create a competitive advantage with Big Data analytics. This lab will introduce the basics required to develop MapReduce programs and derive business benefit from unstructured data. It will also give an overview of the architectural concepts of Hadoop and introduce the MapReduce paradigm. Another objective of this lab is to introduce the programming tools PIG & HIVE in the Hadoop ecosystem.
GENERAL INSTRUCTIONS FOR LABORATORY CLASSES
DO’S
While entering the lab, students should wear their ID cards.
Students should bring their observation and record notebooks to the laboratory.
DON'TS
Module 1: Introduction to Big Data, Big Data characteristics, types of Big Data, traditional versus Big Data, evolution of Big Data, challenges with Big Data, technologies available for Big Data, infrastructure for Big Data, use of data analytics, desired properties of a Big Data system.
Module 3: Introduction to Hive, Hive architecture, Hive data types, Hive Query Language; introduction to Pig, anatomy of Pig, Pig on Hadoop, use case for Pig, ETL processing, data types in Pig, running Pig, execution model of Pig, operators, functions, data types of Pig.
Course Objectives
The student should be made to:
1. Analyze big data using machine learning techniques such as Decision Tree classification and clustering.
2. Realize storage of big data using MongoDB.
3. Implement MapReduce programs for processing big data.
4. Explain structured and unstructured data by using NoSQL commands.
5. Develop problem-solving and critical thinking skills in fundamental enabling techniques like Hadoop & MapReduce.
Course Outcomes
At the end of the course the student will be able to:
1. Understand the concepts and challenges of Big Data.
2. Demonstrate knowledge of big data analytics.
3. Develop Big Data solutions using the Hadoop ecosystem.
4. Gain hands-on experience with large-scale analytics tools.
5. Analyze social network graphs.
Index
S.No. | Name of the Experiment | Date of Exp. | Page No. | Date of Submission | Grade & Sign of the Faculty
1 | Perform setting up and installing single node Hadoop in windows environment. | | 09 | |
3.2 Program
Install Hadoop
Step 1: Download the Java 8 package and save the file in your home directory.
Step 2: Extract the Java tar file.
Step 3: Download the Hadoop 2.7.3 package.
Step 4: Extract the Hadoop tar file.
Step 5: Add the Hadoop and Java paths in the bash file (.bashrc).
Open the .bashrc file and add the Hadoop and Java paths as shown below.
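The exact entries depend on where Java and Hadoop were extracted; a typical set of lines (the install paths below are assumptions for illustration) is:
# .bashrc additions: paths assume Java and Hadoop were extracted into the home directory
export JAVA_HOME=$HOME/jdk1.8.0_101
export HADOOP_HOME=$HOME/hadoop-2.7.3
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
After saving the file, run source ~/.bashrc so the changes take effect in the current terminal.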
To make sure that Java and Hadoop have been properly installed on your system and can be
accessed through the Terminal, execute the java -version and hadoop version commands.
Step 6: Edit the Hadoop Configuration files.
Step 7: Open core-site.xml and edit the property mentioned below inside configuration tag:
core-site.xml informs Hadoop daemon where NameNode runs in the cluster. It contains
configuration settings of Hadoop core such as I/O settings that are common to HDFS &
MapReduce.
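The property edited in this step is typically the default filesystem URI pointing at the NameNode; a minimal single-node entry (the hostname and port are assumptions) looks like:
<?xml version="1.0"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>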
Step 8: Edit the hdfs-site.xml file and edit the properties mentioned below inside the configuration tag:
hdfs-site.xml contains configuration settings of the HDFS daemons (NameNode, DataNode), such as the block replication factor and permission checking. A typical single-node configuration (the replication value is an assumed default) is:
<?xml version="1.0"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permission</name>
<value>false</value>
</property>
</configuration>
Step 9: Edit the mapred-site.xml file and edit the property mentioned below inside
configuration tag:
mapred-site.xml contains configuration settings of MapReduce application like number of
JVM that can run in parallel, the size of the mapper and the reducer process, CPU cores
available for a process, etc.
In some cases, the mapred-site.xml file is not available, so we have to create it from the mapred-site.xml.template file.
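A minimal sketch of this step, assuming the Hadoop 2.x configuration directory and YARN as the MapReduce framework:
cp mapred-site.xml.template mapred-site.xml
<?xml version="1.0"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>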
Step 10: Edit the yarn-site.xml file and edit the property mentioned below inside the configuration tag. yarn-site.xml contains configuration settings of the ResourceManager and NodeManager, such as the auxiliary shuffle service required by MapReduce.
<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
Step 11: Edit hadoop-env.sh and add the Java Path as mentioned below:
hadoop-env.sh contains the environment variables that are used in the script to run Hadoop
like Java home path, etc.
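A typical entry (the JDK path is an assumption and should match the JAVA_HOME set in .bashrc) is:
export JAVA_HOME=$HOME/jdk1.8.0_101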
Step 12: Go to the Hadoop home directory and format the NameNode (the commands below assume the hadoop-2.7.3 directory used in the next step):
Command: cd hadoop-2.7.3
Command: bin/hadoop namenode -format
This formats the HDFS via the NameNode. This command is only executed the first time. Formatting the file system means initializing the directory specified by the dfs.name.dir variable. Never format an up-and-running Hadoop filesystem; you will lose all the data stored in HDFS.
Step 13: Once the NameNode is formatted, go to hadoop-2.7.3/sbin directory and start all
the daemons.
Command: cd hadoop-2.7.3/sbin
Either you can start all daemons with a single command or do it individually.
Command: ./start-all.sh
The above command is a combination of start-dfs.sh, start-yarn.sh & mr-jobhistory-daemon.sh.
Step 14: To check that all the Hadoop services are up and running, run the below command.
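The command generally used for this check is jps, which lists the running Java daemons (on a healthy single-node setup: NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager and JobHistoryServer):
Command: jps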
Step 15: Now open the Mozilla browser and go to localhost:50070/dfshealth.html to check
the NameNode interface.
4 Results: Congratulations, you have successfully installed a single-node Hadoop cluster in one go.
Acropolis Institute of Technology and Research, Indore
Department of CSE (Artificial Intelligence & Machine Learning)
Lab: Big Data Analytics (AL802-C) Title: Java Program
EVALUATION RECORD Type/ Lab Session:
Name Tushar Padihar Enrollment No. 0827AL201060
Performing on First submission Second submission
Extra Regular
A B C D F Tutor
Operations commonly supported by queues include peek (to view the front element without removing it) and isEmpty (to check whether the queue is empty).
Here's a simple algorithm for a queue:
Initialize: Create an empty queue.
Enqueue: Add an element to the rear end of the queue.
Dequeue: Remove and return the element from the front end of the queue.
Peek: Return the element at the front end of the queue without removing it.
isEmpty: Check if the queue is empty.
Size: Return the number of elements currently in the queue.
Map:
Insertion: When inserting an element into a map, the algorithm typically involves hashing
the key to determine the index or position where the key-value pair should be stored. If
there's a collision (i.e., two keys hash to the same index), most implementations use
techniques like chaining or open addressing to handle it.
Retrieval: To retrieve a value from a map given a key, the key is hashed to find its
position in the underlying data structure. If chaining is used to handle collisions, the
algorithm may iterate through the chain at that position to find the key. If open
addressing is used, it may involve probing until the key is found.
Deletion: Deleting an element from a map typically involves finding the key in the
underlying data structure and removing it. Again, this may require handling collisions
appropriately.
Set:
Insertion: When inserting an element into a set, the algorithm typically involves
hashing the element to determine its position in the underlying data structure, similar to
how it's done in maps. However, since sets only store unique elements, implementations
need to check for duplicates before inserting.
Search: Searching for an element in a set also involves hashing the element to find its
position in the data structure. If the element is found at that position, it exists in the set;
otherwise, it doesn't.
Deletion: Deleting an element from a set is similar to deletion in a map. The element is
located in the underlying data structure, and if found, it's removed.
3.2 Program
Linked List
import java.io.*;
public class LinkedList {
    Node head;

    static class Node {
        int data;
        Node next;

        Node(int d) {
            data = d;
            next = null;
        }
    }

    // Insert a new node at the end of the list
    public static LinkedList insert(LinkedList list, int data) {
        Node new_node = new Node(data);
        if (list.head == null) {
            list.head = new_node;
        } else {
            Node last = list.head;
            while (last.next != null) {
                last = last.next;
            }
            last.next = new_node;
        }
        return list;
    }

    // Traverse from the head and print every element
    public static void printList(LinkedList list) {
        Node currNode = list.head;
        System.out.print("LinkedList: ");
        while (currNode != null) {
            System.out.print(currNode.data + " ");
            currNode = currNode.next;
        }
        System.out.println();
    }

    public static void main(String[] args) {
        LinkedList list = new LinkedList();
        for (int i = 1; i <= 8; i++) {
            list = insert(list, i);
        }
        printList(list);
    }
}
Queue:
import java.util.LinkedList;
public class Queue<T> {
    private LinkedList<T> elements;

    public Queue() {
        elements = new LinkedList<>();
    }

    public void enqueue(T item) {
        elements.addLast(item);
    }

    public T dequeue() {
        if (isEmpty()) {
            throw new IllegalStateException("Queue is empty");
        }
        return elements.removeFirst();
    }

    public T peek() {
        if (isEmpty()) {
            throw new IllegalStateException("Queue is empty");
        }
        return elements.getFirst();
    }

    public boolean isEmpty() {
        return elements.isEmpty();
    }

    public int size() {
        return elements.size();
    }
}
Map:
import java.util.HashMap;
public class Main {
    public static void main(String[] args) {
        HashMap<String, Integer> map = new HashMap<>();
        map.put("John", 25);
        map.put("Alice", 30);
        map.put("Bob", 20);
        System.out.println("Age of John: " + map.get("John"));
        System.out.println("Is Bob present? " + map.containsKey("Bob"));
        System.out.println("Is age 30 present? " + map.containsValue(30));
        map.remove("Alice");
        System.out.println("Size of map after removing Alice: " + map.size());
    }
}
Set:
import java.util.HashSet;
public class Main {
    public static void main(String[] args) {
        HashSet<Integer> set = new HashSet<>();
        set.add(5);
        set.add(10);
        set.add(15);
        set.add(20);
        set.add(25);
        System.out.println("Is 10 present? " + set.contains(10));
        System.out.println("Is 30 present? " + set.contains(30));
        set.remove(20);
        System.out.println("Size of set after removing 20: " + set.size());
    }
}
4 Tabulation Sheet
INPUT OUTPUT
1, 2, 3, 4, 5, 6, 7, 8
3
10
10
2
20
20,30,40
map.put("John", 25);
map.put("Alice", 30);
map.put("Bob", 20);
5 Results
Acropolis Institute of Technology and Research, Indore
Department of CSE (Artificial Intelligence & Machine Learning)
Lab: Big Data Analytics (AL802-C) Title: File Management
EVALUATION RECORD Type/ Lab Session:
Name Tushar Padihar Enrollment No. 0827AL201060
Performing on First submission Second submission
Extra Regular
A B C D F Tutor
Adding files:
Connect to the Hadoop cluster.
Specify the local file or directory you want to add to HDFS.
Use the Hadoop command line interface or an API (such as the Hadoop Java API) to add the file or directory to HDFS.
If adding a directory, recursively add all files and subdirectories within it.
Check for errors and handle them appropriately.
Retrieving files:
Connect to the Hadoop cluster.
Specify the file or directory you want to retrieve from HDFS.
Use the Hadoop command line interface or an API to retrieve the file or directory from
HDFS.
Transfer the file or directory from HDFS to your local file system or another location as
needed.
Check for errors and handle them appropriately.
Deleting files:
Connect to the Hadoop cluster.
Identify the file or directory you want to delete from HDFS.
Use the Hadoop command line interface or an API to delete the file or directory from
HDFS.
Optionally, recursively delete all files and subdirectories within the specified directory.
Check for errors and handle them appropriately.
3.2 Program
Step-1 Adding Files and Directories to HDFS
Before you can run Hadoop programs on data stored in HDFS, you'll need to put the data into HDFS first. Let's create a directory and put a file in it. HDFS has a default working directory of /user/$USER, where $USER is your login user name. This directory isn't automatically created for you, though, so let's create it with the mkdir command. For the purpose of illustration, we use chuck. You should substitute your user name in the example commands.
hadoop fs -mkdir /user/chuck
hadoop fs -put example.txt
hadoop fs -put example.txt /user/chuck
Step-2 Retrieving Files from HDFS
The Hadoop command get copies files from HDFS back to the local filesystem, and cat prints their contents to the terminal. To retrieve example.txt, we can run the following commands:
hadoop fs -get example.txt .
hadoop fs -cat example.txt
Step-3 Deleting Files from HDFS
hadoop fs -rm example.txt
The command for creating a directory in HDFS is "hdfs dfs -mkdir /lendicse".
Adding a directory is done through the command "hdfs dfs -put lendi_english /".
4 Results
Acropolis Institute of Technology and Research, Indore
Department of CSE (Artificial Intelligence & Machine Learning)
Lab: Big Data Analytics (AL802-C) Title: NoSQL Database Operations
EVALUATION RECORD Type/ Lab Session:
Name Tushar Padihar Enrollment No. 0827AL201060
Performing on First submission Second submission
Extra Regular
A B C D F Tutor
name: "Test Student",
course: { courseName: "MCA", duration: "3 Years" },
address: { city: "Bangalore", state: "KA", country: "India" }
})
An entry has been made into the collection called student.
Querying a document from a collection (Read) to retrieve (Select) the inserted document,
run the below command. The find() command will retrieve all the documents of the given
collection.
db.collection_name.find()
If a record is to be retrieved based on some criteria, the find() method should be called
passing parameters, then the record will be retrieved based on the attributes specified.
db.collection_name.find({"fieldname":"value"})
For Example: Let us retrieve the record from the student collection where the attribute
regNo is 3014, and the query for the same is as shown below:
db.students.find({"regNo":"3014"})
Removing an entry from the collection (Delete)
Let us now look into deleting an entry from a collection. In order to delete an entry from a collection, run the command shown below:
db.collection_name.remove({"fieldname":"value"})
For Example : db.student.remove({"regNo":"3014"})
ISODate("2020-05-20T16:35:57.461Z")
}
]
}
2. Create and Query a Document
Let's create a blog post document. We will use a database called blogs and a collection called posts. The code is written in the mongo shell (an interactive JavaScript interface to MongoDB). The mongo shell is started from the command line and is connected to the MongoDB server. From the shell: use blogs
NEW_POST = {
  name: "Working with Arrays",
  user: "Database Rebel",
  desc: "Maintaining an array of objects in a document",
  content: "some content...",
  created: ISODate(),
  updated: ISODate(),
  tags: [ "mongodb", "arrays" ]
}
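The document defined above can then be inserted into, and read back from, the posts collection; a minimal sketch of those two shell commands is:
db.posts.insertOne(NEW_POST)
db.posts.findOne({ name: "Working with Arrays" })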
3. Update an Array Element
Let's update the comment posted by "Database Rebel" with a modified text field:
NEW_CONTENT = "Thank you, please look for updates - updated the post"
db.posts.updateOne(
  { _id: ObjectId("5ec55af811ac5e2e2aafb2b9"), "comments.user": "Database Rebel" },
  { $set: { "comments.$.text": NEW_CONTENT } }
)
The $set update operator is used to change a field's value. The positional $ operator
identifies an element in an array to update without explicitly specifying the position of
the element in the array. The first matching element is updated. The updated comment
object:
"comments" : [
{
"user" : "Database Rebel",
"text" : "Thank you, please look for updates - updated",
"updated" : ISODate("2020-05-20T16:48:25.506Z")
}
]
Acropolis Institute of Technology and Research, Indore
Department of CSE (Artificial Intelligence & Machine Learning)
Lab: Big Data Analytics (AL802-C) Title: Implement Functions
EVALUATION RECORD Type/ Lab Session:
Name Tushar Padihar Enrollment No. 0827AL201060
Performing on First submission Second submission
Extra Regular
A B C D F Tutor
count : {$sum : 1}
And the result is
{ "_id" : "C", "count" : 3 }
{ "_id" : "D", "count" : 5 }
SORT
Sorts all input documents and returns them to the pipeline in sorted order. You can sort on a
maximum of 32 keys. MongoDB does not store documents in a collection in a particular
order. When sorting on a field which contains duplicate values, documents containing those
values may be returned in any order. If consistent sort order is desired, include at least one
field in your sort that contains unique values. The easiest way to guarantee this is to include
the _id field in your sort query.
Consider the following restaurant collection:
db.restaurants.insertMany( [
{ "_id" : 1, "name" : "Central Park Cafe", "borough" : "Manhattan"},
{ "_id" : 2, "name" : "Rock A Feller Bar and Grill", "borough" : "Queens"},
{ "_id" : 3, "name" : "Empire State Pub", "borough" : "Brooklyn"},
{ "_id" : 4, "name" : "Stan's Pizzaria", "borough" : "Manhattan"},
{ "_id" : 5, "name" : "Jane's Deli", "borough" : "Brooklyn"},
])
The following command uses the $sort stage to sort on the borough field:
db.restaurants.aggregate(
[ { $sort : { borough : 1 } }
])
Examples Ascending/Descending Sort
For the field or fields to sort by, set the sort order to 1 or -1 to specify an ascending or
descending sort respectively, as in the following example:
db.users.aggregate( [
{ $sort : { age : -1, posts: 1 } }
])
This operation sorts the documents in the users collection in descending order according to the age field and then in ascending order according to the value in the posts field.
LIMIT
Limits the number of documents passed to the next stage in the pipeline. The $limit stage has the following prototype form:
{ $limit: <positive 64-bit integer> }
For example, the following operation sorts the restaurants collection shown above and passes only the first two documents to the rest of the pipeline:
db.restaurants.aggregate( [
{ $sort : { borough : 1 } },
{ $limit : 2 }
] )
SKIP
Skips over the specified number of documents that pass into the stage and passes the
remaining documents to the next stage in the pipeline. The $skip stage has the following
prototype form:
{ $skip: <positive 64-bit integer>}
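For example, the following operation (reusing the restaurants collection above) skips the first two documents and passes the remaining ones to the next stage:
db.restaurants.aggregate( [
{ $sort : { borough : 1 } },
{ $skip : 2 }
] )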
4 Tabulation Sheet
INPUT OUTPUT
Acropolis Institute of Technology and Research, Indore
Department of CSE (Artificial Intelligence & Machine Learning)
Lab: Big Data Analytics (AL802-C) Title: Creating a database
EVALUATION RECORD Type/ Lab Session:
Name Tushar Padihar Enrollment No. 0827AL201060
Performing on First submission Second submission
Extra Regular
A B C D F Tutor
Connect to MongoDB: Use a MongoDB client, such as the MongoDB shell or a graphical
user interface (GUI) tool like MongoDB Compass, to connect to the MongoDB server.
Switch to the Desired Database (or Create If Not Exist): Use the use command in the
MongoDB shell to switch to the desired database. If the database doesn't exist, MongoDB
will create it when you first insert data into it.
3.2 Program
Let's create a MongoDB database named 'STD' and a collection named 'students' with the
specified fields using the MongoDB shell.
use STD
db.createCollection("students")
Inserting Documents: Insert additional documents into the collection.
// Insert another student document
db.students.insertOne({
"No.": 1,
"Stu_Name": "Jane Smith",
"Enrol.": "2024002",
"Branch": "Electrical Engineering",
"Contact": "9876543210",
"e-mail": "jane.smith@example.com",
"Score": 78
})
Querying Documents: Retrieve documents from the collection.
// Find all students
db.students.find().pretty()
// Find a specific student by their enrollment number
db.students.find({"Enrol.": "2024001"}).pretty()
// Find students with a score greater than 80
db.students.find({"Score": {"$gt": 80}}).pretty()
Updating Documents: Update existing documents in the collection.
// Update the score of a student
db.students.updateOne({"No.": 1}, {"$set": {"Score": 90}})
4. Results
Acropolis Institute of Technology and Research, Indore
Department of CSE (Artificial Intelligence & Machine Learning)
Lab: Big Data Analytics (AL802-C) Title: Clustering techniques using SPARK
EVALUATION RECORD Type/ Lab Session:
Name Tushar Padihar Enrollment No. 0827AL201060
Performing on First submission Second submission
Extra Regular
A B C D F Tutor
1 Title
2 Neatly Drawn and labeled experimental setup
3 Theoretical solution of the instant problem
k-means clustering is a method of vector quantization, originally from signal processing, that is
popular for cluster analysis in data mining. The approach k-means follows to solve the problem
is called Expectation-Maximization. It can be described as follows:
Assign some cluster centers.
Repeat until converged:
o E-Step: assign points to the nearest center
o M-step: set the cluster center to the mean
3.1 Algorithm
K-Means Clustering:
K-Means is one of the simplest and most widely used clustering algorithms. It partitions the
data into k clusters, where each cluster is represented by its centroid. The algorithm works as
follows:
Initialize k centroids randomly.
Assign each data point to the nearest centroid, forming k clusters.
Recalculate the centroids of the clusters based on the mean of the data points assigned to
each cluster.
Repeat steps 2 and 3 until convergence (when the centroids no longer change
significantly or a maximum number of iterations is reached).
3.2 Program
Set up spark context and SparkSession
Load dataset
You can also get the statistical results from the data frame (unfortunately, it only works for numerical columns).
Elbow method to determine the optimal number of clusters for k-means clustering
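A minimal PySpark sketch of these steps is given below; the dataset path, the feature column names and the range of k values are assumptions for illustration, and the silhouette score from ClusteringEvaluator is used as a simple stand-in for the classic WSSSE elbow curve:
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

# Set up the SparkSession (the SparkContext is available as spark.sparkContext)
spark = SparkSession.builder.appName("KMeansClustering").getOrCreate()

# Load the dataset (path and schema inference are assumed for illustration)
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Statistical summary of the data frame (numerical columns only)
df.describe().show()

# Assemble the numerical feature columns into a single vector column
assembler = VectorAssembler(inputCols=["feature1", "feature2"], outputCol="features")
data = assembler.transform(df)

# Fit k-means for several values of k and print a quality score for each,
# so the best k can be read off the printed values
evaluator = ClusteringEvaluator(featuresCol="features")
for k in range(2, 7):
    model = KMeans(k=k, seed=1, featuresCol="features").fit(data)
    predictions = model.transform(data)
    print(k, evaluator.evaluate(predictions))

spark.stop()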
4 Results
Acropolis Institute of Technology and Research, Indore
Department of CSE (Artificial Intelligence & Machine Learning)
Lab: Big Data Analytics (AL802-C) Title: MongoDB / Pig using Hadoop / R.
EVALUATION RECORD Type/ Lab Session:
Name Tushar Padihar Enrollment No. 0827AL201060
Performing on First submission Second submission
Extra Regular
A B C D F Tutor
4 Results
Data stored in MongoDB
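A minimal sketch of storing records in MongoDB from Python, assuming the pymongo driver and a local server (the database name bda_lab, the collection name records and the sample documents are illustrative):
from pymongo import MongoClient

# Connect to a local MongoDB server (host, port and names are assumptions)
client = MongoClient("mongodb://localhost:27017/")
db = client["bda_lab"]
collection = db["records"]

# Store a few sample documents
collection.insert_many([
    {"name": "John", "score": 25},
    {"name": "Alice", "score": 30},
    {"name": "Bob", "score": 20},
])

# Confirm that the data has been stored
print(collection.count_documents({}))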
Acropolis Institute of Technology and Research, Indore
Department of CSE (Artificial Intelligence & Machine Learning)
Lab: Big Data Analytics (AL802-C) Title: Graph of 50 nodes and edges
EVALUATION RECORD Type/ Lab Session:
Name Tushar Padihar Enrollment No. 0827AL201060
Performing on First submission Second submission
Extra Regular
A B C D F Tutor
4 Tabulation Sheet
INPUT: 1. Number of nodes (50)  2. Number of edges
OUTPUT: 1. Empty graph with nodes  2. Graph with edges
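A minimal sketch of such a program, assuming Python with the networkx and matplotlib libraries (the number of random edges and the drawing options are illustrative):
import random

import matplotlib.pyplot as plt
import networkx as nx

# Create an empty graph and add 50 nodes
G = nx.Graph()
G.add_nodes_from(range(50))
print("Nodes:", G.number_of_nodes(), "Edges:", G.number_of_edges())

# Add random edges between the nodes
for _ in range(80):
    u, v = random.sample(range(50), 2)
    G.add_edge(u, v)
print("Nodes:", G.number_of_nodes(), "Edges:", G.number_of_edges())

# Draw and display the graph with its edges
nx.draw(G, with_labels=True, node_size=100)
plt.show()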
Acropolis Institute of Technology and Research, Indore
Department of CSE (Artificial Intelligence & Machine Learning)
Lab: Big Data Analytics (AL802-C) Title: Betweenness Measure
EVALUATION RECORD Type/ Lab Session:
Name Tushar Padihar Enrollment No. 0827AL201060
Performing on First submission Second submission
Extra Regular
A B C D F Tutor
3.2 Program
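A minimal sketch of the program, assuming Python with the networkx library (the sample social-network edges are illustrative; any graph object built as in the previous experiment can be used instead):
import networkx as nx

# Build a small sample social-network graph from an edge list
G = nx.Graph()
G.add_edges_from([(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)])

# Compute the betweenness centrality score of every node
scores = nx.betweenness_centrality(G)

# Print the nodes from most to least central
for node, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(node, round(score, 3))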
4 Tabulation Sheet
INPUT: 1. Graph data representing the social network (nodes and edges)  2. Graph object
OUTPUT: 1. Betweenness centrality scores