Bda Lab Manual
Lab Manual
Final year
Computer Engineering
Practical Objectives:
The Apache Hadoop software library is a framework that allows for the distributed
processing of large data sets across clusters of computers using simple programming
models. It is designed to scale up from single servers to thousands of machines, each
offering local computation and storage. Rather than rely on hardware to deliver high-
availability, the library itself is designed to detect and handle failures at the application
layer, so delivering a highly available service on top of a cluster of computers, each of
which may be prone to failures.
Hadoop Ecosystem:
Hadoop has gained popularity because of its ability to store, analyze and access large
amounts of data quickly and cost-effectively on clusters of commodity hardware. It
would not be wrong to say that Apache Hadoop is actually a collection of several components
and not just a single product.
The Hadoop ecosystem includes several commercial as well as open source products
that are broadly used to make Hadoop more usable and more accessible to non-experts.
The following sections provide additional information on the individual components:
Hadoop MapReduce is a software framework for easily writing applications that process
large amounts of data in parallel on large clusters of commodity hardware in a reliable,
fault-tolerant manner. In terms of programming, two functions are central to MapReduce.
• The Map Task: The master node takes the input, divides it into smaller parts and
distributes them to the worker nodes. Each worker node solves its own small problem
and returns its answer to the master node.
• The Reduce Task: The master node combines the answers coming from the worker nodes
into the output, which is the answer to the original large distributed problem.
Generally both the input and the output are stored in a file system. The framework is
responsible for scheduling tasks, monitoring them and re-executing failed tasks.
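The split, solve and combine flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Hadoop code: the function names (split_input, map_task, reduce_task, run_job) are hypothetical, threads on one machine stand in for worker nodes, and the "small problem" is simply a partial sum.

```python
from concurrent.futures import ThreadPoolExecutor


def split_input(data, parts):
    """Master step: divide the input into roughly equal chunks."""
    size = max(1, -(-len(data) // parts))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def map_task(chunk):
    """Worker step: solve one small problem (here, a partial sum)."""
    return sum(chunk)


def reduce_task(partials):
    """Master step: combine the workers' answers into the final result."""
    return sum(partials)


def run_job(data, workers=4):
    """Run the whole job: split, map in parallel, then reduce."""
    chunks = split_input(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_task, chunks))
    return reduce_task(partials)
```

For example, run_job(list(range(1, 101))) splits the numbers 1 to 100 into four chunks, sums each chunk separately and then adds the four partial sums.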
Hive is part of the Hadoop ecosystem and provides an SQL-like interface to Hadoop. It is
a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc
queries, and the analysis of large datasets stored in Hadoop compatible file systems.
It provides a mechanism to project structure onto this data and query the data using a SQL-
like language called HiveQL. Hive also allows traditional map/reduce programmers to plug
in their custom mappers and reducers when it is inconvenient or inefficient to express this
logic in HiveQL.
The main building blocks of Hive are –
1. Metastore – To store metadata about columns, partitions and the system catalogue.
2. Driver – To manage the lifecycle of a HiveQL statement.
3. Query Compiler – To compile HiveQL into a directed acyclic graph of tasks.
4. Execution Engine – To execute, in the proper order, the tasks produced by the
compiler.
5. HiveServer – To provide a Thrift interface and a JDBC / ODBC server.
ZooKeeper is a centralized service for maintaining configuration information, naming,
providing distributed synchronization and providing group services which are very useful
for a variety of distributed systems. HBase is not operational without ZooKeeper.
Mahout is a scalable machine learning library that implements various approaches to
machine learning. At present Mahout contains four main groups of algorithms:
• Recommendations, also known as collaborative filtering
• Classifications, also known as categorization
• Clustering
• Frequent itemset mining, also known as parallel frequent pattern mining
Algorithms in the Mahout library belong to the subset that can be executed in a distributed
fashion and have been written to be executable in MapReduce. Mahout is designed to be
scalable: it scales to reasonably large data sets by leveraging algorithm properties
or by implementing versions based on Apache Hadoop.
Apache Spark:
Apache Spark is a general compute engine that offers fast data analysis on a large scale.
Spark can run on top of HDFS but bypasses MapReduce and instead uses its own data processing
framework. Common use cases for Apache Spark include real-time queries, event stream
processing, iterative algorithms, complex operations and machine learning.
Pig is a platform for analyzing and querying huge data sets. It consists of a high-level
language for expressing data analysis programs, coupled with infrastructure for evaluating
these programs. Pig’s built-in operations can make sense of semi-structured data, such as
log files, and the language is extensible using Java to add support for custom data types
and transformations.
Pig has three key properties:
• Extensibility
• Optimization opportunities
• Ease of programming
The salient property of Pig programs is that their structure is amenable to substantial
parallelization, which in turn enables them to handle very large data sets. At present,
Pig’s infrastructure layer consists of a compiler that produces sequences of
MapReduce programs.
Apache Oozie is a workflow/coordination system to manage Hadoop jobs.
Flume is a framework for harvesting, aggregating and moving huge amounts of log data or
text files in and out of Hadoop. Agents are deployed throughout one's IT infrastructure,
inside web servers, application servers and mobile devices. Flume itself has a query
processing engine, so it is easy to transform each new batch of data before it is shuttled to
the intended sink.
Ambari was created to help manage Hadoop. It offers support for many of the tools in the
Hadoop ecosystem including Hive, HBase, Pig, Sqoop and ZooKeeper. The tool features a
management dashboard that keeps track of cluster health and can help diagnose
performance issues.
Hadoop is powerful because it is extensible and easy to integrate with other components.
Its popularity is due in part to its ability to store, analyze and access large amounts of data
quickly and cost-effectively across clusters of commodity hardware. Apache Hadoop is not
actually a single product but a collection of several components. When all these
components are combined, they make Hadoop very user friendly.
Experiment No. 02
Practical Objectives:
Hadoop, in short, is an open source software framework for storing and processing big
data in a distributed way on large clusters of commodity hardware. Basically it
accomplishes the following two tasks.
2. Faster processing
1. Scalable
2. Fault tolerance
3. Economical
4. Handle hardware failure
The steps needed to install a Hadoop core cluster are:
a) Install Java on the computer
b) Install VMware
c) Download the VM file
d) Load it into VMware and start it
The steps to be followed for installing Hadoop using IBM InfoSphere BigInsights are:
The Hadoop Distributed File System (HDFS) was developed using a distributed file system
design and runs on commodity hardware. Unlike other distributed systems, HDFS is highly
fault tolerant and designed for low-cost hardware.
HDFS holds very large amounts of data and provides easy access to it. To store such huge data,
the files are stored across multiple machines. These files are stored in a redundant fashion
to protect the system from possible data loss in case of failure. HDFS also makes
the data available to applications for parallel processing.
Features of HDFS
• The built-in servers of namenode and datanode help users to easily check the
status of cluster.
HDFS Architecture
Given below is the architecture of a Hadoop File System.
HDFS follows the master-slave architecture and it has the following elements.
• Open the web client.
• From here you can also upload a file into Hadoop.
• Have a look at all the components of WebSphere:
• Dashboard, Cluster Status, Files, Applications, Application Status, BigSheets.
Step 3: From here you can also upload a file into Hadoop.
HDFS is a distributed file system intended for very large data sets. Here, we have
studied HDFS and executed basic commands and file operations for Hadoop.
NoSQL databases have grown in popularity with the rise of Big Data applications. In
comparison to relational databases, NoSQL databases are much cheaper to scale, capable
of handling unstructured data, and better suited to current agile development approaches.
The advantages of NoSQL technology are compelling but the thought of replacing a
legacy relational system can be daunting. To explore the possibilities of NoSQL in your
enterprise, consider a small-scale trial of a NoSQL database like MongoDB. NoSQL
databases are typically open source so you can download the software and try it out for
free. From this trial, you can assess the technology without great risk or cost to your
Commands of Neo4j:
1. Create database
2. Insert data
3. Display data
MATCH (dept:Dept) RETURN dept.deptno, dept.dname
4. Create node
MATCH (dept:Dept)
RETURN dept
5. Movie graph
Post Lab Assignment : Design and generate a dependency graph for an existing
project in Neo4j.
Experiment No. 05
Map reduce
MapReduce is a programming model introduced by Google; Hadoop's implementation of it is
Java based and efficiently processes the data stored in HDFS. MapReduce breaks a big
data processing job down into smaller tasks and analyzes large datasets in parallel
before reducing the intermediate results to the final result.
Pig is one of the data access components in the Hadoop ecosystem. It is a convenient tool
developed by Yahoo for analyzing huge data sets efficiently. It is a high-level data flow
language that is optimized, extensible and easy to use.
Load: to load the file
Dump: to display the data
Limit: to limit the number of tuples returned
Abc = LIMIT abcd 2;
Describe: to show the schema definition
Group: to group tuples by a field
GROUP abcd BY id;
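To make the meaning of these operators concrete, the following Python sketch mimics LIMIT, GROUP and DUMP on an in-memory list. It is an analogy only, not Pig: the sample rows in abcd and the helper names limit, group_by and dump are assumptions made for illustration. Note that without an ordering step Pig does not promise which tuples LIMIT returns, whereas this sketch simply takes the first n.

```python
from collections import defaultdict

# Hypothetical relation "abcd": rows of (id, name), standing in for a loaded file.
abcd = [(1, "asha"), (2, "ravi"), (1, "meena"), (3, "john")]


def limit(relation, n):
    """Like LIMIT: keep at most n tuples."""
    return relation[:n]


def group_by(relation, field_index):
    """Like GROUP ... BY: map each key to the bag of tuples that share it."""
    groups = defaultdict(list)
    for row in relation:
        groups[row[field_index]].append(row)
    return dict(groups)


def dump(relation):
    """Like DUMP: print every tuple of the relation."""
    for row in relation:
        print(row)
```

Here limit(abcd, 2) plays the role of the LIMIT statement above, and group_by(abcd, 0) groups the rows by their id field.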
Conclusion : Hence we have implemented and run Hello world program successfully.
Post Lab Assignment :
Practical Objectives:
For the implementation of frequent itemset mining using Pig we used the Apriori algorithm. The
Apriori algorithm for finding frequent pairs is a two-pass algorithm that limits the amount
of main memory needed by using the downward-closure property of support to avoid
counting pairs that will turn out to be infrequent at the end.
Let s be the minimum support required and let n be the number of items. In the
first pass, we read the baskets and count in main memory the occurrences of each item.
We then remove all items whose frequency is less than s to get the set of
frequent items. This requires memory proportional to n.
In the second pass, we read the baskets again and count in main memory only those pairs
where both items are frequent items. This pass requires memory proportional to the
square of the number of frequent items (for the counts) plus a list of the frequent items
(so we know what must be counted).
Figure: Main memory use in the two passes of Apriori.
Apriori Algorithm:
1. Load text
2. Tokenize text
3. Retain first letter
4. Group by letter
5. Count occurrences
6. Grab first element
7. Display/store results
The Apriori algorithm uses the monotonicity property to reduce the number of
pairs that must be counted, at the expense of performing two passes over the data rather
than one pass.
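The two passes can be sketched in plain Python as follows. This is a minimal single-machine illustration, not the Pig script used in the lab; the function name frequent_pairs and the representation of baskets as lists of strings are assumptions.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, s):
    """Return {pair: count} for every pair of items with support >= s."""
    # Pass 1: count the occurrences of each single item.
    item_counts = Counter(item for basket in baskets for item in set(basket))
    frequent_items = {item for item, count in item_counts.items() if count >= s}

    # Pass 2: count only the pairs whose two items are both frequent.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        pair_counts.update(combinations(kept, 2))

    return {pair: count for pair, count in pair_counts.items() if count >= s}
```

Because a pair can be frequent only if both of its items are frequent, the second pass never allocates a counter for a pair that contains an infrequent item, which is where the memory saving comes from.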
Conclusion: Hence we have implemented the frequent itemset algorithm.
Practical Objectives:
To implement a word count program using MapReduce, execute the following steps:
Open Eclipse
• Name: MapperAnalysis
• Input key: LongWritable
• Input value: Text
• Output key: Text
• Output value: IntWritable
• Click on Next
• Name: ReducerAnalysis
• Output key: Text
• Output value: IntWritable
• Click on Next
• Name: DriverAnalysis
• Click on Finish
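The roles of the three classes created above can be illustrated with a small Python simulation. It is not the Java code generated in Eclipse; the functions mapper, reducer and driver are hypothetical stand-ins for MapperAnalysis, ReducerAnalysis and DriverAnalysis, with a Python int for LongWritable/IntWritable and a str for Text. Lower-casing the words is an extra assumption of this sketch.

```python
from collections import defaultdict


def mapper(offset, line):
    """Input: (byte offset, line of text). Output: (word, 1) pairs."""
    for word in line.split():
        yield word.lower(), 1


def reducer(word, counts):
    """Input: (word, list of counts). Output: (word, total count)."""
    return word, sum(counts)


def driver(lines):
    """Wire the mapper and reducer together, as the driver class would."""
    grouped = defaultdict(list)
    offset = 0
    for line in lines:
        for word, one in mapper(offset, line):
            grouped[word].append(one)
        offset += len(line.encode("utf-8")) + 1  # +1 for the newline
    return dict(reducer(word, counts) for word, counts in sorted(grouped.items()))
```

Calling driver(["the cat", "The dog the"]) counts "the" three times and "cat" and "dog" once each.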
Practical Objectives:
Map reduce
MapReduce is a style of computing that has been implemented in several systems,
including Google’s internal implementation (simply called MapReduce) and the popular
open-source implementation Hadoop which can be obtained, along with the HDFS file
system from the Apache Foundation. You can use an implementation of MapReduce to
manage many large-scale computations in a way that is tolerant of hardware faults. All
you need to write are two functions, called Map and Reduce, while the system manages
the parallel execution, coordination of tasks that execute Map or Reduce, and also deals
with the possibility that one of these tasks will fail to execute. In brief, a MapReduce
computation executes as follows:
1. Some number of Map tasks each are given one or more chunks from a distributed file
system. These Map tasks turn the chunk into a sequence of key-value pairs. The way key-
value pairs are produced from the input data is determined by the code written by the user
for the Map function.
2. The key-value pairs from each Map task are collected by a master controller and sorted
by key. The keys are divided among all the Reduce tasks, so all key-value pairs with the
same key wind up at the same Reduce task.
3. The Reduce tasks work on one key at a time, and combine all the values associated
with that key in some way. The manner of combination of values is determined by the
code written by the user for the Reduce function.
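The three steps above can be sketched as a tiny in-memory pipeline. This is an illustration under assumptions, not how Hadoop is implemented: run_mapreduce, map_readings and max_reading are hypothetical names, keys are assigned to Reduce tasks by a simple hash, and the sample problem (highest reading per city) is invented for the example.

```python
from collections import defaultdict


def run_mapreduce(chunks, map_fn, reduce_fn, num_reducers=2):
    """Run map_fn over the chunks, shuffle by key, then apply reduce_fn."""
    # Step 1: each Map task turns its chunk into key-value pairs.
    mapped = [pair for chunk in chunks for pair in map_fn(chunk)]

    # Step 2: every key is assigned to exactly one Reduce task, so all
    # pairs with the same key end up in the same partition.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in mapped:
        partitions[hash(key) % num_reducers][key].append(value)

    # Step 3: each Reduce task handles its keys one at a time, in sorted order.
    output = {}
    for partition in partitions:
        for key in sorted(partition):
            output[key] = reduce_fn(key, partition[key])
    return output


def map_readings(chunk):
    """Map: emit one (city, reading) pair per input record."""
    for city, reading in chunk:
        yield city, reading


def max_reading(city, readings):
    """Reduce: combine all readings of one city into their maximum."""
    return max(readings)
```

Only map_readings and max_reading are specific to the problem; run_mapreduce stays the same for any other pair of Map and Reduce functions.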
Matrix Multiplication
Suppose we have an n x n matrix M, whose element in row i and column j is denoted
by m_ij. Suppose we also have a vector v of length n, whose jth element is v_j. Then the
matrix-vector product is the vector x of length n, whose ith element x_i is given by
$x_i = \sum_{j=1}^{n} m_{ij} v_j$.
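One common way to express this product in the MapReduce style is sketched below, assuming that the whole vector v fits in the memory of every Map task and that the matrix is given as (i, j, value) triples with zero-based indices. The names map_matrix_entry and matrix_vector_product are hypothetical.

```python
from collections import defaultdict


def map_matrix_entry(entry, v):
    """Map: for a matrix entry (i, j, m_ij), emit the pair (i, m_ij * v_j)."""
    i, j, m_ij = entry
    return i, m_ij * v[j]


def matrix_vector_product(entries, v):
    """Compute x = M v, where M is given as a list of (i, j, m_ij) triples."""
    n = len(v)
    grouped = defaultdict(list)
    for entry in entries:
        i, term = map_matrix_entry(entry, v)
        grouped[i].append(term)  # shuffle: group the terms by row index i
    # Reduce: sum all the terms that share the same row index i.
    return [sum(grouped[i]) for i in range(n)]
```

For the 2 x 2 matrix with rows (1, 2) and (3, 4) and v = (5, 6), the entries are [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)], and the Reduce step adds 1*5 + 2*6 for row 0 and 3*5 + 4*6 for row 1.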
Aim: To analyze and summarize large data with Graphical Representation Using
Practical Objectives:
IBM technologies enrich this open source framework with analytical software, enterprise
software integration, platform extensions, and tools. BigSheets is a browser-based analytic
tool initially developed by IBM's Emerging Technologies group. Today, BigSheets is
included with BigInsights to enable business users and non-programmers to explore and
analyze data in distributed file systems. BigSheets presents a spreadsheet-like interface so
users can model, filter, combine, explore, and chart data collected from various sources.
The BigInsights web console includes a tab at top to access BigSheets.
Figure 1 depicts a sample data workbook in BigSheets. While it looks like a typical
spreadsheet, this workbook contains data from blogs posted to public websites, and
analysts can even click on links included in the workbook to visit the site that published
the source content.
Figure 1 - BigSheets workbook based on social media data, with links to source content
After defining a BigSheets workbook, an analyst can filter or transform its data as desired.
Behind the scenes, BigSheets translates user commands, expressed through a graphical
interface, into Pig scripts executed against a subset of the underlying data. In this manner,
an analyst can iteratively explore various transformations efficiently. When satisfied, the
user can save and run the workbook, which causes BigSheets to initiate MapReduce jobs
over the full set of data, write the results to the distributed
file system, and display the contents of the new workbook. Analysts can page through or
manipulate the full set of data as desired.
Figure 1 : Extract data to BigSheets
Eg :
2. Text Analysis