
2023-24

Acropolis Institute of Technology and Research, Indore
Department of CSE (Artificial Intelligence & Machine Learning)
Submitted To: Prof. Sumit Jain

Big Data Analytics (AL802-C)

Submitted By:
Tushar Padihar
0827AL201060
AL F-1/4th Yr/ 8th Sem

ACROPOLIS INSTITUTE OF TECHNOLOGY & RESEARCH, INDORE


[LAB ASSIGNMENT BIG DATA ANALYTICS (AL-802-C)]

The objective of this laboratory work is to enlighten the student with a knowledge base in Big Data Analytics and its applications, and to teach how to determine specific data needs and to select the tools and techniques to be used in the investigation.
ACROPOLIS INSTITUTE OF TECHNOLOGY & RESEARCH,
INDORE

Department of CSE (Artificial Intelligence & Machine Learning)

CERTIFICATE

This is to certify that the experimental work entered in this journal as per the

B. TECH. IV year syllabus prescribed by the RGPV was done by Mr./ Ms.

Tushar Padihar B.TECH IV year VIII semester in the Big Data

Analytics Laboratory of this institute during the academic year 2023- 2024.

Signature of the Faculty


ABOUT THE LABORATORY

In this lab, students will learn to develop applications using Big Data Analytics concepts. Students can expand their skill set by deriving practical solutions using data analytics. Moreover, this lab provides an understanding of the importance of various algorithms in Data Science. The primary objective of this lab is to optimize business decisions and create a competitive advantage with Big Data analytics. This lab introduces the basics required to develop MapReduce programs and to derive business benefit from unstructured data. It also gives an overview of the architectural concepts of Hadoop and introduces the MapReduce paradigm. Another objective of this lab is to introduce the programming tools Pig & Hive in the Hadoop ecosystem.
GENERAL INSTRUCTIONS FOR LABORATORY CLASSES

 DO’S

 Enter the laboratory only with prior permission.
 While entering the lab, students should wear their ID cards.
 Students should come in proper uniform.
 Students should sign the LOGIN REGISTER before entering the laboratory.
 Students should bring their observation and record note books to the laboratory.
 Students should maintain silence inside the laboratory.
 After completing the laboratory exercise, make sure to shut down the system properly.

 DON’TS

 Do not bring bags inside the laboratory.
 Do not use the computers in an improper way.
 Do not scribble on the desks or mishandle the chairs.
 Do not use mobile phones inside the laboratory.
 Do not make noise inside the laboratory.


SYLLABUS
Course: AL802-C (Big Data Analytics)
Branch/Year/Sem: Artificial Intelligence & Machine Learning / IV / VIII

Module 1: Introduction to Big Data, Big Data characteristics, Types of Big Data, Traditional versus Big Data, Evolution of Big Data, Challenges with Big Data, Technologies available for Big Data, Infrastructure for Big Data, Use of Data Analytics, Desired properties of a Big Data system.

Module 2: Introduction to Hadoop, Core Hadoop components, Hadoop ecosystem, Hive physical architecture, Hadoop limitations, RDBMS versus Hadoop, Hadoop Distributed File System, Processing data with Hadoop, Managing resources and applications with Hadoop YARN, MapReduce programming.

Module 3: Introduction to Hive, Hive architecture, Hive data types, Hive Query Language, Introduction to Pig, Anatomy of Pig, Pig on Hadoop, Use case for Pig, ETL processing, Data types in Pig, Running Pig, Execution model of Pig, Operators, Functions, Data types of Pig.

Module 4: Introduction to NoSQL, NoSQL business drivers, NoSQL data architectural patterns, Variations of NoSQL architectural patterns, Using NoSQL to manage Big Data, Introduction to MongoDB.

Module 5: Mining social network graphs: Introduction, Applications of social network mining, Social networks as a graph, Types of social networks, Clustering of social graphs, Direct discovery of communities in a social graph, Introduction to recommender systems.

HARDWARE AND SOFTWARE REQUIREMENTS:

S. No.  Name of Item        Specification
1       Computer System     Hard drive: 500 GB minimum (SSD preferred); RAM: 16 GB; CPU: Intel i5 minimum (Intel i7 or i9 preferred)
2       Operating system    Windows 10 or 11 (Home or Pro); Mac: High Sierra or later
3       Editor              Hadoop or Atlas.ti or HPCC or Storm
PREREQUISITE: -

To work in Big Data analytics, it is helpful to have knowledge of Hadoop, SQL, R, Python, and other programming languages.

COURSE OBJECTIVES AND OUTCOMES

 Course Objectives
The student should be made to:

1. To analyze big data using machine learning techniques such as Decision Tree classification and clustering.
2. To realize storage of big data using MongoDB.
3. To implement MapReduce programs for processing big data.
4. To explain structured and unstructured data by using NoSQL commands.
5. To develop problem-solving and critical thinking skills in fundamental enabling techniques like Hadoop & MapReduce.

 Course Outcomes
At the end of the course student will be able to:

1. Students should be able to understand the concepts and challenges of Big Data.
2. Students should be able to demonstrate knowledge of big data analytics.
3. Students should be able to develop Big Data solutions using the Hadoop ecosystem.
4. Students should be able to gain hands-on experience with large-scale analytics tools.
5. Students should be able to analyze social network graphs.
Index
(Columns: S.No., Name of the Experiment, Date of Exp., Page No., Date of Submission, Grade & Sign of the Faculty)

1. Perform setting up and installing single node Hadoop in Windows environment. (Page 09)
2. Write a Java program to implement Linked List, Queue, Set and Map data structures. (Page 15)
3. To implement the following file management tasks in Hadoop (HDFS): Adding files and directories, Retrieving files, Deleting files. (Page 21)
4. Implement NoSQL Database Operations: CRUD operations, Arrays using MongoDB. (Page 24)
5. Implement Functions: Count – Sort – Limit – Skip – Aggregate using MongoDB. (Page 29)
6. Create a database ‘STD’ and make a collection (e.g. "student" with fields 'No., Stu_Name, Enrol., Branch, Contact, e-mail, Score') using MongoDB; perform various operations in the following experiments. (Page 33)
7. Implement clustering techniques using SPARK. (Page 36)
8. Implement an application that stores big data in MongoDB / Pig using Hadoop / R. (Page 40)
9. To implement a graph of 50 nodes and edges between nodes using the networkx library in Python. (Page 42)
10. Write and implement betweenness measure between nodes across the social network. (Page 44)
Program Outcome (PO)

The engineering graduate of this institute will demonstrate:


a) Apply knowledge of mathematics, science, computing and engineering fundamentals to computer
science engineering problems.
b) Able to identify, formulate, and demonstrate with excellent programming, and problem solving skills.
c) Design solutions for engineering problems including design of experiment and processes to meet
desired needs within reasonable constraints of manufacturability, sustainability, ecological,
intellectual and health and safety considerations.
d) Propose and develop effective investigational solutions to complex problems using research methodology, including design of experiments, analysis and interpretation of data, and synthesis of information to provide suitable conclusions.
e) Ability to create, select and use the modern techniques and various tools to solve engineering
problems and to evaluate solutions with an understanding of the limitations.
f) Ability to acquire knowledge of contemporary issues to assess societal, health and safety, legal and
cultural issues.
g) Ability to evaluate the impact of engineering solutions on individual as well as organization in a
societal and environmental context, and recognize sustainable development, and will be aware of
emerging technologies and current professional issues.
h) Capability to possess leadership and managerial skills, and understand and commit to professional
ethics and responsibilities.
i) Ability to demonstrate the team work and function effectively as an individual, with an ability to
design, develop, test and debug the project, and will be able to work with a multi-disciplinary team.
j) Ability to communicate effectively on engineering problems with the community, such as being able
to write effective reports and design documentation.
k) Flexibility to feel the recognition of the need for, and have the ability to engage in independent and
life- long learning by professional development and quality enhancement programs in context of
technological change.
l) Practice engineering and management principles and apply these to one’s own work, as a member and leader in a team, to manage projects and entrepreneurship.
Acropolis Institute of Technology and Research, Indore
Department of CSE (Artificial Intelligence & Machine Learning)
Lab: Big Data Analytics (AL802-C)
Title: Single Node Hadoop Cluster
EVALUATION RECORD Type/ Lab Session:
Name Tushar Padihar Enrollment No. 0827AL201060
Performing on First submission Second submission
Extra Regular

Grade and Remarks by the Tutor


1. Clarity about the objective of experiment
2. Clarity about the Outcome
3. Submitted the work in desired format
4. Shown capability to solve the problem
5. Contribution to the team work

Additional remarks Grade: Cross the grade.

A B C D F Tutor

1 Title: Single Node Hadoop Cluster


2 Neatly Drawn and labeled experimental setup: Not Applicable
3 Theoretical solution of the instant problem
3.1 Algorithm
In a single-node Hadoop cluster, all the Hadoop daemons run on one machine, while in a multi-node cluster more than one DataNode is running and each DataNode runs on a different machine. The multi-node cluster is practically used in organizations for analyzing Big Data: in real time, when we deal with petabytes of data, the data needs to be distributed across hundreds of machines to be processed. For this experiment, however, a single-node cluster is set up.
Prerequisites
 VIRTUAL BOX: used for installing the guest operating system.
 OPERATING SYSTEM: You can install Hadoop on Linux-based operating systems. Ubuntu and CentOS are very commonly used; this experiment uses CentOS.
 JAVA: You need to install the Java 8 package on your system.
 HADOOP: You require the Hadoop 2.7.3 package.

3.2 Program
Install Hadoop
Step 1: Download the Java 8 package and save the file in your home directory.
Step 2: Extract the Java Tar File.

Step 3: Download the Hadoop 2.7.3 Package.

Step 4: Extract the Hadoop tar File.

Step 5: Add the Hadoop and Java paths in the bash file (.bashrc).
Open the .bashrc file and add the Hadoop and Java paths.

Then, save the bash file and close it.


To apply all these changes to the current terminal, execute the source command on the .bashrc file.

To make sure that Java and Hadoop have been properly installed on your system and can be
accessed through the Terminal, execute the java -version and hadoop version commands.

Step 6: Edit the Hadoop Configuration files.

Step 7: Open core-site.xml and edit the property mentioned below inside configuration tag:
core-site.xml informs Hadoop daemon where NameNode runs in the cluster. It contains
configuration settings of Hadoop core such as I/O settings that are common to HDFS &
MapReduce.

<?xml version="1.0" encoding="UTF-8"?>


<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Step 8: Edit hdfs-site.xml and edit the property mentioned below inside configuration tag:
hdfs-site.xml contains configuration settings of HDFS daemons (i.e. NameNode, DataNode,
Secondary NameNode). It also includes the replication factor and block size of HDFS.

<?xml version="1.0" encoding="UTF-8"?>


<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>

</property>
<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
</property>
</configuration>
Step 9: Edit the mapred-site.xml file and edit the property mentioned below inside
configuration tag:
mapred-site.xml contains configuration settings of MapReduce application like number of
JVM that can run in parallel, the size of the mapper and the reducer process, CPU cores
available for a process, etc.
In some cases, mapred-site.xml file is not available. So, we have to create the mapred-
site.xml file using mapred-site.xml template.

<?xml version="1.0" encoding="UTF-8"?>


<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Step 10: Edit yarn-site.xml and edit the property mentioned below inside configuration tag:
yarn-site.xml contains configuration settings of ResourceManager and NodeManager like
application memory management size, the operation needed on program & algorithm, etc.

<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
Step 11: Edit hadoop-env.sh and add the Java Path as mentioned below:
hadoop-env.sh contains the environment variables that are used in the script to run Hadoop
like Java home path, etc.

Step 12: Go to Hadoop home directory and format the NameNode.

This formats the HDFS via the NameNode. This command is only executed the first time. Formatting the file system means initializing the directory specified by the dfs.name.dir variable. Never format a running Hadoop filesystem; you will lose all the data stored in HDFS.
Step 13: Once the NameNode is formatted, go to hadoop-2.7.3/sbin directory and start all
the daemons.
Command: cd hadoop-2.7.3/sbin
Either you can start all daemons with a single command or do it individually.
Command: ./start-all.sh
The above command is a combination of start-dfs.sh, start-yarn.sh & mr-jobhistory-daemon.sh.

Step 14: To check that all the Hadoop services are up and running, run the jps command.

Step 15: Now open a web browser and go to localhost:50070/dfshealth.html to check the NameNode interface.

4 Results: Congratulations, you have successfully installed a single-node Hadoop cluster in one go.

Acropolis Institute of Technology and Research, Indore
Department of CSE (Artificial Intelligence & Machine Learning)
Lab: Big Data Analytics (AL802-C)
Title: Java Program
EVALUATION RECORD Type/ Lab Session:
Name Tushar Padihar Enrollment No. 0827AL201060
Performing on First submission Second submission
Extra Regular

Grade and Remarks by the Tutor


1. Clarity about the objective of experiment
2. Clarity about the Outcome
3. Submitted the work in desired format
4. Shown capability to solve the problem
5. Contribution to the team work

Additional remarks Grade: Cross the grade.

A B C D F Tutor

1 Title: Java Program


2 Neatly Drawn and labeled experimental setup: Not Applicable
3 Theoretical solution of the instant problem
3.1 Algorithm
The linked list is a frequently asked topic in interview preparation, and a solid grip on it gives you the upper hand during an interview. Linked lists can be implemented using languages like C++ and Java.
A linked list consists of a node, which contains 2 parts:
 Data → Contains data to be stored in that particular node.
 Pointer to the next node → Contains address to the next node.
A linked list is a collection of such nodes. The first node is termed the HEAD node, and the
last node is called the TAIL node of the linked list.
A queue is a linear data structure that follows the First In, First Out (FIFO) principle,
meaning that the element inserted first will be the one removed first. It supports two main
operations: enqueue, which adds an element to the rear end of the queue, and dequeue,
which removes an element from the front end of the queue. Some additional operations

commonly supported by queues include peek (to view the front element without removing
it) and isEmpty (to check if the queue is empty).
Here's a simple algorithm for a queue:
 Initialize: Create an empty queue.
 Enqueue: Add an element to the rear end of the queue.
 Dequeue: Remove and return the element from the front end of the queue.
 Peek: Return the element at the front end of the queue without removing it.
 isEmpty: Check if the queue is empty.
 Size: Return the number of elements currently in the queue.
Map:
 Insertion: When inserting an element into a map, the algorithm typically involves hashing
the key to determine the index or position where the key-value pair should be stored. If
there's a collision (i.e., two keys hash to the same index), most implementations use
techniques like chaining or open addressing to handle it.
 Retrieval: To retrieve a value from a map given a key, the key is hashed to find its
position in the underlying data structure. If chaining is used to handle collisions, the
algorithm may iterate through the chain at that position to find the key. If open
addressing is used, it may involve probing until the key is found.
 Deletion: Deleting an element from a map typically involves finding the key in the
underlying data structure and removing it. Again, this may require handling collisions
appropriately.
Set:
 Insertion: When inserting an element into a set, the algorithm typically involves
hashing the element to determine its position in the underlying data structure, similar to
how it's done in maps. However, since sets only store unique elements, implementations
need to check for duplicates before inserting.
 Search: Searching for an element in a set also involves hashing the element to find its
position in the data structure. If the element is found at that position, it exists in the set;
otherwise, it doesn't.
 Deletion: Deleting an element from a set is similar to deletion in a map. The element is
located in the underlying data structure, and if found, it's removed.

3.2 Program
Linked List
import java.io.*;
public class LinkedList {
    Node head;

    static class Node {
        int data;
        Node next;
        Node(int d) {
            data = d;
            next = null;
        }
    }

    public static LinkedList insert(LinkedList list, int data) {
        Node new_node = new Node(data);
        if (list.head == null) {
            list.head = new_node;
        } else {
            Node last = list.head;
            while (last.next != null) {
                last = last.next;
            }
            last.next = new_node;
        }
        return list;
    }

    public static void printList(LinkedList list) {
        Node currNode = list.head;
        System.out.print("LinkedList: ");
        while (currNode != null) {
            System.out.print(currNode.data + " ");
            currNode = currNode.next;
        }
    }
}
Queue:
import java.util.LinkedList;
public class Queue<T> {
    private LinkedList<T> elements;

    public Queue() {
        elements = new LinkedList<>();
    }
    public void enqueue(T item) {
        elements.addLast(item);
    }
    public T dequeue() {
        if (isEmpty()) {
            throw new IllegalStateException("Queue is empty");
        }
        return elements.removeFirst();
    }
    public T peek() {
        if (isEmpty()) {
            throw new IllegalStateException("Queue is empty");
        }
        return elements.getFirst();
    }
    public boolean isEmpty() {
        return elements.isEmpty();
    }
    public int size() {
        return elements.size();
    }
}
Map:
import java.util.HashMap;
public class Main {
    public static void main(String[] args) {
        HashMap<String, Integer> map = new HashMap<>();
        map.put("John", 25);
        map.put("Alice", 30);
        map.put("Bob", 20);
        System.out.println("Age of John: " + map.get("John"));
        System.out.println("Is Bob present? " + map.containsKey("Bob"));
        System.out.println("Is age 30 present? " + map.containsValue(30));
        map.remove("Alice");
        System.out.println("Size of map after removing Alice: " + map.size());
    }
}
Set:
import java.util.HashSet;
public class Main {
    public static void main(String[] args) {
        HashSet<Integer> set = new HashSet<>();
        set.add(5);
        set.add(10);
        set.add(15);
        set.add(20);
        set.add(25);
        System.out.println("Is 10 present? " + set.contains(10));
        System.out.println("Is 30 present? " + set.contains(30));
        set.remove(20);
        System.out.println("Size of set after removing 20: " + set.size());
    }
}
4 Tabulation Sheet

INPUT OUTPUT
1, 2, 3, 4, 5, 6, 7, 8
3
10
10
2
20
20,30,40
map.put("John", 25);
map.put("Alice", 30);
map.put("Bob", 20);
5 Results

Acropolis Institute of Technology and Research, Indore
Department of CSE (Artificial Intelligence & Machine Learning)
Lab: Big Data Analytics (AL802-C)
Title: File Management
EVALUATION RECORD Type/ Lab Session:
Name Tushar Padihar Enrollment No. 0827AL201060
Performing on First submission Second submission
Extra Regular

Grade and Remarks by the Tutor


1. Clarity about the objective of experiment
2. Clarity about the Outcome
3. Submitted the work in desired format
4. Shown capability to solve the problem
5. Contribution to the team work

Additional remarks Grade: Cross the grade.

A B C D F Tutor

1 Title: File Management


2 Neatly Drawn and labeled experimental setup: Not Applicable
3 Theoretical solution of the instant problem
HDFS is a scalable distributed filesystem designed to scale to petabytes of data while running
on top of the underlying filesystem of the operating system. HDFS keeps track of where the
data resides in a network by associating the name of its rack (or network switch) with the
dataset. This allows Hadoop to efficiently schedule tasks to those nodes that contain data, or
which are nearest to it, optimizing bandwidth utilization. Hadoop provides a set of command
line utilities that work similarly to the Linux file commands, and serve as your primary
interface with HDFS. We're going to have a look into HDFS by interacting with it from the command line.
3.1 Algorithm
Adding files and directories:
 Connect to the Hadoop cluster.
 Determine the file or directory you want to add to HDFS.
 Use the Hadoop command line interface (hadoop fs command) or an API (such as

Hadoop Java API) to add the file or directory to HDFS.
 If adding a directory, recursively add all files and subdirectories within it.
 Check for errors and handle them appropriately.
Retrieving files:
 Connect to the Hadoop cluster.
 Specify the file or directory you want to retrieve from HDFS.
 Use the Hadoop command line interface or an API to retrieve the file or directory from
HDFS.
 Transfer the file or directory from HDFS to your local file system or another location as
needed.
 Check for errors and handle them appropriately.
Deleting files:
 Connect to the Hadoop cluster.
 Identify the file or directory you want to delete from HDFS.
 Use the Hadoop command line interface or an API to delete the file or directory from
HDFS.
 Optionally, recursively delete all files and subdirectories within the specified directory.
 Check for errors and handle them appropriately.
3.2 Program
Step-1 Adding Files and Directories to HDFS
Before you can run Hadoop programs on data stored in HDFS, you'll need to put the data into HDFS first. Let's create a directory and put a file in it. HDFS has a default working directory of /user/$USER, where $USER is your login user name. This directory isn't automatically created for you, though, so let's create it with the mkdir command. For the purpose of illustration, we use chuck. You should substitute your user name in the example commands.
 hadoop fs -mkdir /user/chuck
 hadoop fs -put example.txt
 hadoop fs -put example.txt /user/chuck
Step-2 Retrieving Files from HDFS
The Hadoop command get copies files from HDFS back to the local filesystem, while cat prints a file's contents to the terminal. To view example.txt, we can run the following command:
 hadoop fs -cat example.txt
Step-3 Deleting Files from HDFS
 hadoop fs -rm example.txt
The command for creating a directory in HDFS is “hdfs dfs -mkdir /lendicse”.
Adding a directory is done through the command “hdfs dfs -put lendi_english /”.
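Where these file-management tasks need to be scripted rather than typed by hand, the same shell commands can be driven from Python. The following is a minimal sketch using the standard subprocess module; the local file example.txt and the HDFS path /user/chuck are taken from the steps above, while the helper name hdfs and the retrieved file name are assumptions for illustration, and a working hadoop installation on the PATH is assumed.

import subprocess

def hdfs(*args):
    # Run an "hadoop fs" sub-command and raise an error if it fails
    subprocess.run(["hadoop", "fs", *args], check=True)

# Adding files and directories
hdfs("-mkdir", "-p", "/user/chuck")
hdfs("-put", "example.txt", "/user/chuck")

# Retrieving files (copy from HDFS back to the local filesystem)
hdfs("-get", "/user/chuck/example.txt", "example_copy.txt")

# Deleting files
hdfs("-rm", "/user/chuck/example.txt")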
4 Results

Acropolis Institute of Technology and Research, Indore
Department of CSE (Artificial Intelligence & Machine Learning)
Lab: Big Data Analytics (AL802-C)
Title: NoSQL Database Operations
EVALUATION RECORD Type/ Lab Session:
Name Tushar Padihar Enrollment No. 0827AL201060
Performing on First submission Second submission
Extra Regular

Grade and Remarks by the Tutor


1. Clarity about the objective of experiment
2. Clarity about the Outcome
3. Submitted the work in desired format
4. Shown capability to solve the problem
5. Contribution to the team work

Additional remarks Grade: Cross the grade.

A B C D F Tutor

1 Title: NoSQL Database Operations


2 Neatly Drawn and labeled experimental setup: Not Applicable
3 Theoretical solution of the instant problem
3.1 Algorithm
MongoDB is a document-oriented NoSQL database system that provides high scalability,
flexibility, and performance. Unlike standard relational databases, MongoDB stores data in
a JSON document structure form. This makes it easy to operate with dynamic and
unstructured data and MongoDB is an open-source and cross-platform database System.
3.2 Program
CRUD operations refer to the basic Insert, Read, Update and Delete operations. Inserting a
document into a collection (Create) the command db.collection.insert()will perform an
insert operation into a collection of a document.
Let us insert a document to a student collection. You must be connected to a database for
doing any insert. It is done as follows:
db.student.insert({
regNo: "3014",

name: "Test Student",
course: { courseName: "MCA", duration: "3 Years" },
address: { city: "Bangalore", state: "KA", country: "India" }
})
An entry has been made into the collection called student.

Querying a document from a collection (Read) to retrieve (Select) the inserted document,
run the below command. The find() command will retrieve all the documents of the given
collection.
db.collection_name.find()
If a record is to be retrieved based on some criteria, the find() method should be called
passing parameters, then the record will be retrieved based on the attributes specified.
db.collection_name.find({"fieldname":"value"})
For Example: Let us retrieve the record from the student collection where the attribute regNo is 3014, and the query for the same is as shown below:
db.student.find({"regNo":"3014"})

Updating a document in a collection (Update) In order to update specific field values of a


collection in MongoDB, run the below query. db.collection_name.update()
update() method specified above will take the fieldname and the new value as argument to
update a document.
Let us update the attribute name of the collection student for the document with regNo 3014.
db.student.update(
{ "regNo": "3014" },
{ $set: { "name": "Viraj" } }
)
Removing an entry from the collection (Delete)
Let us now look into the deleting an entry from a collection. In order to delete an entry from
a collection, run the command as shown below :
db.collection_name.remove({"fieldname":"value"})
For Example : db.student.remove({"regNo":"3014"})

Working with Arrays in MongoDB


1. Introduction
In a MongoDB database, data is stored in collections and a collection has documents. A
document has fields and values, like in a JSON. The field types include scalar types
(string, number, date, etc.) and composite types (arrays and objects). In this article we
will look at an example of using the array field type.
The example is an application where users create blog posts and write comments for the
posts. The relationship between the posts and comments is One-to-Many; i.e., a post can
have many comments. We will consider a collection of blog posts with their comments.
That is a post document will also store the related comments. In MongoDB's document
model, a 1:N relationship data can be stored within a collection; this is a de-normalized
form of data. The related data is stored together and can be accessed (and updated)
together. The comments are stored as an array; an array of comment objects.
A sample document of the blog posts with comments:
{
"_id" : ObjectId("5ec55af811ac5e2e2aafb2b9"),
"name" : "Working with Arrays",
"user" : "Database Rebel",
"desc" : "Maintaining an array of objects in a document",
"content" : "some content ...",
"created" : ISODate("2020-05-20T16:28:55.468Z"),
"updated" : ISODate("2020-05-20T16:28:55.468Z"),
"tags" : [ "mongodb", "arrays" ],
"comments" : [ { "user" : "DB Learner", "content" : "Nice post.", "updated" :

Page 26
ISODate("2020-05-20T16:35:57.461Z")
}
]
}
2. Create and Query a Document
Let's create a blog post document. We will use a database called blogs and a collection called posts. The code is written in the mongo shell (an interactive JavaScript interface to MongoDB). The mongo shell is started from the command line and is connected to the MongoDB server. From the shell:
use blogs
NEW_POST = {
name: "Working with Arrays",
user: "Database Rebel",
desc: "Maintaining an array of objects in a document",
content: "some content...",
created: ISODate(),
updated: ISODate(),
tags: [ "mongodb", "arrays" ]
}
The document is then inserted into the posts collection with db.posts.insertOne(NEW_POST).
3. Update an Array Element
Let's update the comment posted by "Database Rebel" with modified text field :
NEW_CONTENT = "Thank you, please look for updates - updated the post".
db.posts.updateOne(
{ _id : ObjectId("5ec55af811ac5e2e2aafb2b9"), "comments.user": "Database
Rebel" }, { $set: { "comments.$.text": NEW_CONTENT } }
)
The $set update operator is used to change a field's value. The positional $ operator
identifies an element in an array to update without explicitly specifying the position of
the element in the array. The first matching element is updated. The updated comment
object:

"comments" : [
{
"user" : "Database Rebel",
"text" : "Thank you, please look for updates - updated",
"updated" : ISODate("2020-05-20T16:48:25.506Z")
}
]

Acropolis Institute of Technology and Research, Indore
Department of CSE (Artificial Intelligence & Machine Learning)
Lab: Big Data Analytics (AL802-C)
Title: Implement Functions
EVALUATION RECORD Type/ Lab Session:
Name Tushar Padihar Enrollment No. 0827AL201060
Performing on First submission Second submission
Extra Regular

Grade and Remarks by the Tutor


1. Clarity about the objective of experiment
2. Clarity about the Outcome
3. Submitted the work in desired format
4. Shown capability to solve the problem
5. Contribution to the team work

Additional remarks Grade: Cross the grade.

A B C D F Tutor

1 Title: Implement Functions


2 Neatly Drawn and labeled experimental setup: Not Applicable
3 Theoretical solution of the instant problem
3.1 Algorithm
3.2 Program
COUNT
How do you get the number of Debit and Credit transactions? One way to do it is by using the count() function as below:
> db.transactions.count({cr_dr : "D"}); or
> db.transactions.find({cr_dr : "D"}).length();
But what if you do not know the possible values of cr_dr upfront? Here the aggregation framework comes into play. See the aggregate query below.
> db.transactions.aggregate( [
    {
        $group : {
            _id : '$cr_dr',
            count : { $sum : 1 }
        }
    }
] );
And the result is
{ "_id" : "C", "count" : 3 }
{ "_id" : "D", "count" : 5 }
SORT
Sorts all input documents and returns them to the pipeline in sorted order. You can sort on a
maximum of 32 keys. MongoDB does not store documents in a collection in a particular
order. When sorting on a field which contains duplicate values, documents containing those
values may be returned in any order. If consistent sort order is desired, include at least one
field in your sort that contains unique values. The easiest way to guarantee this is to include
the _id field in your sort query.
Consider the following restaurants collection:
db.restaurants.insertMany( [
{ "_id" : 1, "name" : "Central Park Cafe", "borough" : "Manhattan"},
{ "_id" : 2, "name" : "Rock A Feller Bar and Grill", "borough" : "Queens"},
{ "_id" : 3, "name" : "Empire State Pub", "borough" : "Brooklyn"},
{ "_id" : 4, "name" : "Stan's Pizzaria", "borough" : "Manhattan"},
{ "_id" : 5, "name" : "Jane's Deli", "borough" : "Brooklyn"},
])
The following command uses the $sort stage to sort on the borough field:
db.restaurants.aggregate(
[ { $sort : { borough : 1 } }
])
Examples Ascending/Descending Sort
For the field or fields to sort by, set the sort order to 1 or -1 to specify an ascending or
descending sort respectively, as in the following example:
db.users.aggregate( [
{ $sort : { age : -1, posts: 1 } }
])
This operation sorts the documents in the users collection, in descending order according to the age field and then in ascending order according to the value in the posts field.
LIMIT
Limits the number of documents passed to the next stage in the pipeline. $limit takes a positive integer that specifies the maximum number of documents to pass along. For example, the following command sorts the restaurants collection shown above and passes only the first two documents to the output:
db.restaurants.aggregate( [
{ $sort : { borough : 1 } },
{ $limit : 2 }
] )
SKIP
Skips over the specified number of documents that pass into the stage and passes the
remaining documents to the next stage in the pipeline. The $skip stage has the following
prototype form:
{ $skip: <positive 64-bit integer>}
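The stages above can also be combined into one aggregation pipeline. The sketch below does this from Python with the pymongo driver, purely for illustration (the experiment itself uses the mongo shell); it assumes a MongoDB server on the default local port and the transactions collection from the count example, with the database name test being an assumption.

from pymongo import MongoClient

# Connect to a local MongoDB server (default host and port assumed)
client = MongoClient("mongodb://localhost:27017/")
db = client["test"]  # database name assumed for illustration

# Count: number of debit transactions
print(db.transactions.count_documents({"cr_dr": "D"}))

# Aggregate: group by cr_dr, sort the groups by count (descending),
# skip the first group and limit the output to one document
pipeline = [
    {"$group": {"_id": "$cr_dr", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$skip": 1},
    {"$limit": 1},
]
for doc in db.transactions.aggregate(pipeline):
    print(doc)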

4 Tabulation Sheet

INPUT OUTPUT

Acropolis Institute of Technology and Research, Indore
Department of CSE (Artificial Intelligence & Machine Learning)
Lab: Big Data Analytics (AL802-C)
Title: Creating a database
EVALUATION RECORD Type/ Lab Session:
Name Tushar Padihar Enrollment No. 0827AL201060
Performing on First submission Second submission
Extra Regular

Grade and Remarks by the Tutor


1. Clarity about the objective of experiment
2. Clarity about the Outcome
3. Submitted the work in desired format
4. Shown capability to solve the problem
5. Contribution to the team work

Additional remarks Grade: Cross the grade.

A B C D F Tutor

1 Title: Creating a database


2 Neatly Drawn and labeled experimental setup: Not Applicable
3 Theoretical solution of the instant problem
3.1 Algorithm
MongoDB is a popular, open-source, NoSQL database program. It's classified as a
document-oriented database, belonging to the family of NoSQL databases. MongoDB stores
data in flexible, JSON-like documents, meaning fields can vary from document to document
and data structure can be changed over time.
MongoDB uses a flexible and dynamic schema, allowing for easier integration of data in
certain applications compared to traditional relational databases. It is designed to scale out
horizontally by using sharding, allowing for high availability and scalability of both read
and write operations.
Here's a simplified algorithm to create a database in MongoDB:
Start MongoDB: Ensure the MongoDB service is running on your system. You can start
MongoDB using the appropriate command for your operating system (e.g., mongod for
Unix-like systems).

Connect to MongoDB: Use a MongoDB client, such as the MongoDB shell or a graphical
user interface (GUI) tool like MongoDB Compass, to connect to the MongoDB server.
Switch to the Desired Database (or Create It If It Does Not Exist): Use the use command in the MongoDB shell to switch to the desired database. If the database doesn't exist, MongoDB will create it when you first insert data into it.
3.2 Program
Let's create a MongoDB database named 'STD' and a collection named 'students' with the
specified fields using the MongoDB shell.
use STD
db.createCollection("students")
Inserting Documents: Insert additional documents into the collection.
// Insert another student document
db.students.insertOne({
"No.": 1,
"Stu_Name": "Jane Smith",
"Enrol.": "2024002",
"Branch": "Electrical Engineering",
"Contact": "9876543210",
"e-mail": "jane.smith@example.com",
"Score": 78
})
Querying Documents: Retrieve documents from the collection.
// Find all students
db.students.find().pretty()
// Find a specific student by their enrollment number
db.students.find({"Enrol.": "2024001"}).pretty()
// Find students with a score greater than 80
db.students.find({"Score": {"$gt": 80}}).pretty()
Updating Documents: Update existing documents in the collection.
// Update the score of a student
db.students.updateOne({"No.": 1}, {"$set": {"Score": 90}})

4. Results

Acropolis Institute of Technology and Research, Indore
Department of CSE (Artificial Intelligence & Machine Learning)
Lab: Big Data Analytics (AL802-C)
Title: Clustering techniques using SPARK
EVALUATION RECORD Type/ Lab Session:
Name Tushar Padihar Enrollment No. 0827AL201060
Performing on First submission Second submission
Extra Regular

Grade and Remarks by the Tutor


1. Clarity about the objective of experiment
2. Clarity about the Outcome
3. Submitted the work in desired format
4. Shown capability to solve the problem
5. Contribution to the team work

Additional remarks Grade: Cross the grade.

A B C D F Tutor

1 Title: Clustering techniques using SPARK
2 Neatly Drawn and labeled experimental setup
3 Theoretical solution of the instant problem
k-means clustering is a method of vector quantization, originally from signal processing, that is
popular for cluster analysis in data mining. The approach k-means follows to solve the problem
is called Expectation-Maximization. It can be described as follows:
 Assign some cluster centers.
 Repeat until converged:
o E-step: assign points to the nearest cluster center.
o M-step: set the cluster centers to the mean of the assigned points.
3.1 Algorithm
K-Means Clustering:
K-Means is one of the simplest and most widely used clustering algorithms. It partitions the
data into k clusters, where each cluster is represented by its centroid. The algorithm works as
follows:
 Initialize k centroids randomly.

 Assign each data point to the nearest centroid, forming k clusters.
 Recalculate the centroids of the clusters based on the mean of the data points assigned to
each cluster.
 Repeat steps 2 and 3 until convergence (when the centroids no longer change
significantly or a maximum number of iterations is reached).

3.2 Program
The program proceeds through the following steps:
 Set up the Spark context and SparkSession.
 Load the dataset and check it.
 Obtain the statistical results from the data frame (this works only for numerical columns).
 Convert the data to a dense vector of features.
 Transform the dataset to a DataFrame.
 Deal with categorical variables.
 Check the resulting dataset.
 Use the elbow method to determine the optimal number of clusters for k-means clustering.
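A minimal PySpark sketch of these steps is given below, under stated assumptions: the input is a hypothetical numeric CSV file data.csv with feature columns f1, f2 and f3, candidate values of k run from 2 to 6, the elbow is read off the printed training cost (WSSSE), and the final k = 3 is chosen purely for illustration.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

# Set up the SparkSession (entry point to the DataFrame API)
spark = SparkSession.builder.appName("KMeansClustering").getOrCreate()

# Load the dataset and check it
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.printSchema()
df.describe().show()  # statistical summary (numerical columns only)

# Convert the feature columns to a dense vector column named "features".
# Categorical columns, if any, would first be indexed/encoded (e.g. with
# StringIndexer / OneHotEncoder); this sketch assumes all-numeric features.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
data = assembler.transform(df)

# Elbow method: train k-means for several values of k and print the
# within-set sum of squared errors (training cost, available in Spark 2.4+)
for k in range(2, 7):
    model = KMeans(k=k, seed=1, featuresCol="features").fit(data)
    print(k, model.summary.trainingCost)

# Fit the final model with the chosen k (assumed k = 3 here) and
# show the cluster assignment of a few rows
final_model = KMeans(k=3, seed=1, featuresCol="features").fit(data)
final_model.transform(data).select("features", "prediction").show(5)

spark.stop()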

4 Results

Acropolis Institute of Technology and Research, Indore
Department of CSE (Artificial Intelligence & Machine Learning)
Lab: Big Data Analytics (AL802-C)
Title: MongoDB / Pig using Hadoop / R.
EVALUATION RECORD Type/ Lab Session:
Name Tushar Padihar Enrollment No. 0827AL201060
Performing on First submission Second submission
Extra Regular

Grade and Remarks by the Tutor


1. Clarity about the objective of experiment
2. Clarity about the Outcome
3. Submitted the work in desired format
4. Shown capability to solve the problem
5. Contribution to the team work

Additional remarks Grade: Cross the grade.

A B C D F Tutor

1 Title: MongoDB / Pig using Hadoop / R.


2 Neatly Drawn and labeled experimental setup
3 Theoretical solution of the instant problem
3.1 Algorithm
 Collect raw big data from the source(s).
 Preprocess the data if necessary (e.g., handle missing values, remove duplicates,
transform data format).
 Connect to MongoDB database.
 Create a collection or collections in MongoDB to store the data.
 Insert the preprocessed data into MongoDB collections.
3.2 Program
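The following is a minimal Python sketch of the algorithm above using the pymongo driver. All names are assumptions for illustration: a MongoDB server on the default local port, a raw CSV source sales.csv with a numeric amount column, a database bigdata and a collection sales.

import csv
from pymongo import MongoClient

# Connect to the MongoDB database (local server on the default port assumed)
client = MongoClient("mongodb://localhost:27017/")
collection = client["bigdata"]["sales"]  # database and collection names assumed

# Collect the raw data from the source and preprocess it:
# drop rows with missing values and convert the numeric column
documents = []
with open("sales.csv", newline="") as source:
    for row in csv.DictReader(source):
        if all(value != "" for value in row.values()):
            row["amount"] = float(row["amount"])  # assumed numeric field
            documents.append(row)

# Insert the preprocessed data into the MongoDB collection
if documents:
    result = collection.insert_many(documents)
    print("Inserted", len(result.inserted_ids), "documents")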

4 Results
Data stored in MongoDB

Acropolis Institute of Technology and Research, Indore
Department of CSE (Artificial Intelligence & Machine Learning)
Lab: Big Data Analytics (AL802-C)
Title: Graph of 50 nodes and edges
EVALUATION RECORD Type/ Lab Session:
Name Tushar Padihar Enrollment No. 0827AL201060
Performing on First submission Second submission
Extra Regular

Grade and Remarks by the Tutor


1. Clarity about the objective of experiment
2. Clarity about the Outcome
3. Submitted the work in desired format
4. Shown capability to solve the problem
5. Contribution to the team work

Additional remarks Grade: Cross the grade.

A B C D F Tutor

1 Title: Graph of 50 nodes and edges


2 Neatly Drawn and labeled experimental setup
3 Theoretical solution of the instant problem
3.1 Algorithm
 Import the networkx library for graph operations.
 Create an empty graph using networkx
 Add 50 nodes to the graph.
 Add edges between nodes based on the defined criteria.
 Ensure that the graph remains connected or meets specific connectivity requirements if
necessary.
3.2 Program
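The following is a minimal Python sketch of these steps using the networkx library. The edge-creation criterion (random pairs until the graph has 100 edges and is connected) and the matplotlib drawing call are assumptions for illustration.

import random
import networkx as nx
import matplotlib.pyplot as plt

# Create an empty graph and add 50 nodes
G = nx.Graph()
G.add_nodes_from(range(50))

# Add edges between randomly chosen pairs of nodes until the graph
# has 100 edges and is connected (criterion assumed for illustration)
random.seed(42)
while G.number_of_edges() < 100 or not nx.is_connected(G):
    u, v = random.sample(range(50), 2)
    G.add_edge(u, v)

print("Nodes:", G.number_of_nodes())
print("Edges:", G.number_of_edges())
print("Connected:", nx.is_connected(G))

# Draw the graph
nx.draw(G, with_labels=True, node_size=200)
plt.show()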

4 Tabulation Sheet

INPUT OUTPUT
1. Number of nodes (50) 1. Empty graph with nodes
2. Number of edges 2. Graph with edges

Acropolis Institute of Technology and Research, Indore
Department of CSE (Artificial Intelligence & Machine Learning)
Lab: Big Data Analytics (AL802-C)
Title: Betweenness Measure
EVALUATION RECORD Type/ Lab Session:
Name Tushar Padihar Enrollment No. 0827AL201060
Performing on First submission Second submission
Extra Regular

Grade and Remarks by the Tutor


1. Clarity about the objective of experiment
2. Clarity about the Outcome
3. Submitted the work in desired format
4. Shown capability to solve the problem
5. Contribution to the team work

Additional remarks Grade: Cross the grade.

A B C D F Tutor

1 Title : Betweenness Measure


2 Neatly Drawn and labeled experimental setup
3 Theoretical solution of the instant problem
3.1 Algorithm
 Input the graph data representing the social network, including nodes and edges.
 Create a graph object using the network representation
 For each node in the graph, compute its betweenness centrality score.
 The betweenness centrality of a node is defined as the fraction of shortest paths between
all node pairs that pass through that node.
 Sum the fractions for each node to obtain its betweenness centrality score.
 Output the betweenness centrality scores for each node in the social network.

3.2 Program
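A minimal Python sketch using networkx is given below. The small edge list standing in for a social network is an assumption for illustration; in practice the graph from the previous experiment (or real network data) would be used.

import networkx as nx

# Build a small graph standing in for a social network (edges assumed)
G = nx.Graph()
G.add_edges_from([
    ("A", "B"), ("A", "C"), ("B", "C"),
    ("C", "D"), ("D", "E"), ("E", "F"), ("D", "F"),
])

# Compute betweenness centrality: for each node, the fraction of all-pairs
# shortest paths that pass through that node (normalized by default)
scores = nx.betweenness_centrality(G)

# Output the nodes ranked by their betweenness centrality score
for node, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(node, round(score, 3))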

4 Tabulation Sheet

INPUT                                                              OUTPUT
1. Graph data representing the social network (nodes and edges)    1. Betweenness centrality scores
2. Graph object
