Hadoop_Lab program
AIM:-
To study and use the classes and interfaces of the Java Collection framework.
DESCRIPTION:
The java.util package contains all the classes and interfaces for the Collection framework.
There are many methods declared in the Collection interface. They are as follows:
5 public boolean retainAll(Collection c) - deletes all the elements of the invoking collection except those in the specified collection.
6 public int size() - returns the total number of elements in the collection.
7 public void clear() - removes all the elements from the collection.
8 public boolean contains(Object element) - is used to search for an element.
9 public boolean containsAll(Collection c) - is used to search for the specified collection in this collection.
10 public Iterator iterator() - returns an iterator.
11 public Object[] toArray() - converts the collection into an array.
12 public boolean isEmpty() - checks if the collection is empty.
public interface Collection<E> extends Iterable<E>
{
    int size();
    boolean isEmpty();
    Iterator<E> iterator();
    Object[] toArray();
    void clear();
    int hashCode();
    // ... remaining methods omitted
}
boolean add(E e)
boolean addAll(Collection<? extends E> c)
boolean removeAll(Collection<?> c)
boolean retainAll(Collection<?> c)
ListIterator<E> listIterator()
10. END
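As a quick illustration of the methods listed above (the class name and values below are only an example, not the lab's prescribed program):

import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;

public class CollectionDemo {
    public static void main(String[] args) {
        Collection<Integer> c = new ArrayList<>();
        // add() inserts elements; the primitive ints are autoboxed into Integer objects.
        c.add(10);
        c.add(20);
        c.add(30);

        System.out.println("size = " + c.size());               // 3
        System.out.println("contains 20? " + c.contains(20));   // true

        // iterator() walks the collection; toArray() copies it into an array.
        Iterator<Integer> it = c.iterator();
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        Object[] arr = c.toArray();

        c.clear();                                                  // removes every element
        System.out.println("isEmpty after clear? " + c.isEmpty()); // true
        System.out.println("copied array length = " + arr.length); // 3, copy was taken before clear
    }
}

The autoboxing on the add() calls is also the usual answer to the viva question below: a primitive value is stored in a Vector or any other collection by wrapping it in its wrapper class (Integer, Double, and so on), which the compiler has done automatically since Java 5.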
SAMPLE INPUT: (employee.txt)
e100,james,asst.prof,cse,8000,16000,4000,8.7
e101,jack,asst.prof,cse,8350,17000,4500,9.2
e102,jane,assoc.prof,cse,15000,30000,8000,7.8
e104,john,prof,cse,30000,60000,15000,8.8
e105,peter,assoc.prof,cse,16500,33000,8600,6.9
e106,david,assoc.prof,cse,18000,36000,9500,8.3
e107,daniel,asst.prof,cse,9400,19000,5000,7.9
e108,ramu,assoc.prof,cse,17000,34000,9000,6.8
e109,rani,asst.prof,cse,10000,21500,4800,6.4
e110,murthy,prof,cse,35000,71500,15000,9.3
EXPECTED OUTPUT:-
2) How do you store a primitive data type within a Vector or other collections class?
AIM:-
To set up and install Hadoop in its three operating modes:
Standalone
Pseudo-Distributed
Fully Distributed
DESCRIPTION:
Hadoop is written in Java, so you will need to have Java installed on your machine,
version 6 or later. Sun's JDK is the one most widely used with Hadoop, although others have
been reported to work.
Hadoop runs on Unix and on Windows. Linux is the only supported production platform,
but other flavors of Unix (including Mac OS X) can be used to run Hadoop for development.
Windows is only supported as a development platform, and additionally requires Cygwin to run.
During the Cygwin installation process, you should include the openssh package if you plan to
run Hadoop in pseudo-distributed mode.
ALGORITHM
$ stop-all.sh
3. Copy the public key to all three hosts to get password-less SSH access (see the sketch after this list)
$ cd $HADOOP_HOME/etc/hadoop
$ nano core-site.xml
$ nano hdfs-site.xml
$ nano slaves
6. Configure yarn-site.xml
$ nano yarn-site.xml
7. Do in the Master Node
$ start-dfs.sh
$ start-yarn.sh
8. Format the NameNode
10. END
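As an illustrative sketch of steps 3-5 (the hostnames master, slave1 and slave2, the hadoop user, the port 9000 and the replication factor are assumptions, not the lab's exact values):

Step 3 - password-less SSH:
$ ssh-keygen -t rsa -P ""
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@master
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@slave1
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@slave2

Steps 4-5 - minimal configuration entries:
core-site.xml:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://master:9000</value>
</property>
hdfs-site.xml:
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
slaves (one worker hostname per line):
slave1
slave2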
INPUT
OUTPUT:
2) How do you restart the NameNode?
DESCRIPTION
A Hadoop set-up can be managed by different web-based tools, which make it easy for the
user to identify the running daemons. A few of the tools used in the real world are:
a) Apache Ambari
b) HortonWorks
c) Apache Spark
Horton Works Tool for Managing Map Reduce Jobs in Apache Pig
Running Map Reduce Jobs in Horton Works for Pig Latin Script
AIM:-
To implement the following file management tasks in Hadoop HDFS: adding files and directories, retrieving files, and deleting files.
DESCRIPTION:-
HDFS is a scalable distributed filesystem designed to scale to petabytes of data while running on top of the underlying filesystem of the host operating system.
ALGORITHM:-
SYNTAX AND COMMANDS TO ADD, RETRIEVE AND DELETE DATA FROM HDFS
Step-1
Step-2
The Hadoop command get copies files from HDFS back to the local filesystem. To retrieve
example.txt, we can run the following command:
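$ hadoop fs -get example.txt /home/hadoop/example.txt
(the destination path above is only a placeholder; any local directory can be used)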
Step-3
Step-4
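For reference, the commands these steps typically use have the following form (the paths shown are placeholders):
$ hadoop fs -mkdir /user/hadoop/dir1             # Step-1: create a directory in HDFS
$ hadoop fs -put example.txt /user/hadoop/dir1   # Step-1: copy a local file into HDFS
$ hadoop fs -ls /user/hadoop/dir1                # list the directory contents
$ hadoop fs -rm /user/hadoop/dir1/example.txt    # Step-3: delete a file from HDFS
$ hadoop fs -rm -r /user/hadoop/dir1             # Step-4: delete a directory recursively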
SAMPLE INPUT:
EXPECTED OUTPUT:
VIVA-VOCE Questions
1) What is the command used to copy data from the local filesystem to HDFS?
4) What is the command used to list out the directories of the DataNode through the web tool?
AIM:-
Run a basic WordCount MapReduce program to understand the MapReduce paradigm.
DESCRIPTION:-
ALGORITHM
MAPREDUCE PROGRAM
WordCount is a simple program which counts the number of occurrences of each word in a given
text input data set. WordCount fits very well with the MapReduce programming model making it
a great example to understand the Hadoop Map/Reduce programming style. Our implementation
consists of three main parts:
1. Mapper
2. Reducer
3. Driver
A Mapper overrides the "map" function from the class "org.apache.hadoop.mapreduce.Mapper"
which provides <key, value> pairs as the input. A Mapper implementation may output
<key, value> pairs using the provided Context.
Input value of the WordCount Map task will be a line of text from the input data file and the key
would be the line number <line_number, line_of_text> . Map task outputs <word, one> for each
word in the line of text.
Pseudo-code
output.collect(x, 1);
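A Java Mapper along these lines, shown as a sketch with illustrative class names rather than the lab's prescribed code, could be:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line of text into words and emit <word, 1> for each word.
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}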
A Reducer collects the intermediate <key,value> output from multiple map tasks and assembles a
single result. Here, the WordCount program will sum up the occurrence of each word to pairs as
<word, occurrence>.
Pseudo-code
sum+=x;
final_output.collect(keyword, sum);
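A matching Java Reducer (again a sketch with illustrative names) could be:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the counts emitted by the mappers for this word.
        int sum = 0;
        for (IntWritable x : values) {
            sum += x.get();
        }
        context.write(key, new IntWritable(sum));
    }
}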
The Driver program configures and runs the MapReduce job. We use the main program to
perform basic configurations such as:
Executable (Jar) Class: the main executable class; here, WordCount.
Mapper Class: the class which overrides the "map" function; here, Map.
Reducer Class: the class which overrides the "reduce" function; here, Reduce.
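Putting these settings together, a driver might look like the following sketch (the class names follow the Mapper and Reducer sketches above; the input and output paths come from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);            // executable (jar) class
        job.setMapperClass(WordCountMapper.class);     // class overriding map()
        job.setReducerClass(WordCountReducer.class);   // class overriding reduce()
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The job is then packaged into a jar and run with something like: $ hadoop jar wordcount.jar WordCount /input /output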
INPUT:-
3) Explain what the JobTracker is in Hadoop. What are the actions followed by Hadoop?
5) Explain what a combiner is and when you should use a combiner in a MapReduce job.
AIM:-
To write a MapReduce program that mines NCDC weather data and finds the maximum and minimum temperature of a year.
DESCRIPTION:
Climate change has been attracting a lot of attention for a long time. The adverse
effects of this changing climate are being felt in every part of the earth. There are many examples of this,
such as rising sea levels, less rainfall and an increase in humidity. The proposed system overcomes
some of the issues that occur with other techniques. In this project we use the Big Data
concepts of Hadoop. In the proposed architecture we are able to process offline data, which is
stored in the National Climatic Data Centre (NCDC). Through this we are able to find out the
maximum temperature and minimum temperature of a year, and are able to predict the future weather
forecast. Finally, we plot a graph of the obtained MAX and MIN temperatures for each month
of the particular year to visualize the temperature. Based on the previous years' weather data,
the weather of the coming year is predicted.
ALGORITHM:-
MAPREDUCE PROGRAM
WordCount is a simple program which counts the number of occurrences of each word in a given
text input data set. WordCount fits very well with the MapReduce programming model making it
a great example to understand the Hadoop Map/Reduce programming style. Our implementation
consists of three main parts:
1. Mapper
2. Reducer
3. Main program
A Mapper overrides the "map" function from the class "org.apache.hadoop.mapreduce.Mapper"
which provides <key, value> pairs as the input. A Mapper implementation may output
<key, value> pairs using the provided Context.
Input value of the WordCount Map task will be a line of text from the input data file and the key
would be the line number <line_number, line_of_text> . Map task outputs <word, one> for each
word in the line of text.
Pseudo-code
output.collect(x, 1);
output.collect(x, 1);
A Reducer collects the intermediate <key,value> output from multiple map tasks and assembles a
single result. Here, the WordCount program will sum up the occurrence of each word to pairs as
<word, occurrence>.
if (x > max) max = x;
final_output.collect(max_temp, max);
if (x < min) min = x;
final_output.collect(min_temp, min);
3. Write Driver
The Driver program configures and runs the MapReduce job. We use the main program to
perform basic configurations such as:
Executable (Jar) Class: the main executable class; here, WordCount.
Mapper Class: the class which overrides the "map" function; here, Map.
Reducer Class: the class which overrides the "reduce" function; here, Reduce.
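A compact sketch of the Mapper and Reducer for this job is shown below. For simplicity it assumes each input record is a comma-separated "year,temperature" pair; the real NCDC records require the year and temperature to be extracted from fixed character offsets, so only the parsing line would change. The driver is configured as in the WordCount program, with these classes substituted and with job.setMapOutputValueClass(IntWritable.class) and job.setOutputValueClass(Text.class) added.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxMinTemperature {

    public static class TempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumed record layout: "year,temperature"
            String[] parts = value.toString().split(",");
            context.write(new Text(parts[0].trim()),
                          new IntWritable(Integer.parseInt(parts[1].trim())));
        }
    }

    public static class TempReducer extends Reducer<Text, IntWritable, Text, Text> {
        @Override
        protected void reduce(Text year, Iterable<IntWritable> temps, Context context)
                throws IOException, InterruptedException {
            int max = Integer.MIN_VALUE;
            int min = Integer.MAX_VALUE;
            for (IntWritable t : temps) {
                max = Math.max(max, t.get());
                min = Math.min(min, t.get());
            }
            // Emit both the maximum and minimum temperature for the year.
            context.write(year, new Text("max=" + max + ", min=" + min));
        }
    }
}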
INPUT:-
OUTPUT:-
AIM:-
To write a MapReduce program that implements matrix multiplication.
DESCRIPTION:
We can represent a matrix as a relation (table) in an RDBMS where each cell in the matrix
can be represented as a record (i, j, value). As an example let us consider the following matrix
and its representation. It is important to understand that this relation is a very inefficient relation
if the matrix is dense. Let us say we have 5 rows and 6 columns; then we need to store only 30
values. But if you consider the above relation, we are storing 30 row_ids, 30 col_ids and 30 values; in
other words, we are tripling the data. So a natural question arises: why do we need to store it in this
format? In practice most matrices are sparse matrices. In sparse matrices not all cells
have values, so we do not have to store those cells in the DB. So this turns out to be very
efficient for storing such matrices.
MapReduce Logic
Logic is to send the calculation part of each output cell of the result matrix to a reducer.
So in matrix multiplication the first cell of the output,
(0,0), has the multiplication and summation of
elements from row 0 of matrix A and elements from column 0 of matrix B. To do the
computation of the value in the output cell (0,0) of the resultant matrix in a separate reducer, we need to
use (0,0) as the output key of the map phase, and the value should have the array of values from row 0 of matrix
A and column 0 of matrix B. Hopefully this picture will explain the point. So in this algorithm the
output from the map phase should be a <key, value> pair, where the key represents the output cell
location, (0,0), (0,1) etc., and the value will be the list of all values required for the reducer to do the
computation. Let us take an example of calculating the value at output cell (0,0). Here we need to
collect values from row 0 of matrix A and column 0 of matrix B in the map phase and pass (0,0) as the
key, so that a single reducer can do the calculation.
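A sketch of this per-cell logic in Java is given below (the blocked strategy in the ALGORITHM section that follows is a more efficient variant). It assumes each input line has the form "A,i,k,value" or "B,k,j,value" and that the dimensions m (rows of A) and n (columns of B) are passed through the job Configuration; these names are assumptions for illustration only.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MatrixMultiply {

    // Emits every output cell (i,j) that the current element contributes to.
    public static class MatMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            Configuration conf = context.getConfiguration();
            int m = conf.getInt("m", 0);   // rows of A
            int n = conf.getInt("n", 0);   // columns of B
            String[] t = value.toString().split(",");   // "A,i,k,a(i,k)" or "B,k,j,b(k,j)"
            if (t[0].equals("A")) {
                for (int j = 0; j < n; j++) {
                    context.write(new Text(t[1] + "," + j), new Text("A," + t[2] + "," + t[3]));
                }
            } else {
                for (int i = 0; i < m; i++) {
                    context.write(new Text(i + "," + t[2]), new Text("B," + t[1] + "," + t[3]));
                }
            }
        }
    }

    // For one output cell (i,j), pairs up a(i,k) with b(k,j) and sums the products.
    public static class MatReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            Map<Integer, Double> a = new HashMap<>();
            Map<Integer, Double> b = new HashMap<>();
            for (Text v : values) {
                String[] t = v.toString().split(",");
                if (t[0].equals("A")) a.put(Integer.parseInt(t[1]), Double.parseDouble(t[2]));
                else                  b.put(Integer.parseInt(t[1]), Double.parseDouble(t[2]));
            }
            double sum = 0;
            for (Map.Entry<Integer, Double> e : a.entrySet()) {
                sum += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            }
            context.write(key, new Text(Double.toString(sum)));
        }
    }
}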
ALGORITHM
We assume that the input files for A and B are streams of (key,value) pairs in sparse
matrix format, where each key is a pair of indices (i,j) and each value is the corresponding matrix
element value. The output files for matrix C=A*B are in the same format.
In the pseudo-code for the individual strategies below, we have intentionally avoided
factoring common code for the purposes of clarity.
Note that in all the strategies the memory footprint of both the mappers and the reducers is flat at
scale.
Note that the strategies all work reasonably well with both dense and sparse matrices. For sparse
matrices we do not emit zero elements. That said, the simple pseudo-code for multiplying the
individual blocks shown here is certainly not optimal for sparse matrices. As a learning exercise,
our focus here is on mastering the MapReduce complexities, not on optimizing the sequential
matrix multiplication algorithm for the individual blocks.
1. setup()
2. var NIB = (I-1)/IB + 1
3. var NKB = (K-1)/KB + 1
4. var NJB = (J-1)/JB + 1
5. map(key, value)
6. if from matrix A with key=(i,k) and value=a(i,k)
7. for 0 <= jb < NJB
8. emit (i/IB, k/KB, jb, 0), (i mod IB, k mod KB, a(i,k))
9. if from matrix B with key=(k,j) and value=b(k,j)
10. for 0 <= ib < NIB
11. emit (ib, k/KB, j/JB, 1), (k mod KB, j mod JB, b(k,j))
Intermediate keys (ib, kb, jb, m) sort in increasing order first by ib, then by kb, then by jb,
then by m. Note that m = 0 for A data and m = 1 for B data.
The partitioner maps intermediate key (ib, kb, jb, m) to a reducer r as follows:
12. These definitions for the sorting order and partitioner guarantee that each reducer
R[ib,kb,jb] receives the data it needs for blocks A[ib,kb] and B[kb,jb], with the data for
the A block immediately preceding the data for the B block.
13. var A = new matrix of dimension IB x KB
14. var B = new matrix of dimension KB x JB
15. var sib = -1
16. var skb = -1
INPUT:-
Sets of data over different clusters are taken as the rows and columns of the input matrices.
OUTPUT:-
AIM:-
Install and run Pig, then write Pig Latin scripts to sort, group, join, project and filter the data.
DESCRIPTION
Pig Latin is procedural and fits very naturally in the pipeline paradigm while SQL is
instead declarative. In SQL users can specify that data from two tables must be joined, but not
what join implementation to use (You can specify the implementation of JOIN in SQL, thus "...
for many SQL applications the query writer may not have enough knowledge of the data or
enough expertise to specify an appropriate join algorithm."). Pig Latin allows users to specify an
implementation or aspects of an implementation to be used in executing a script in several
ways. In effect, Pig Latin programming is similar to specifying a query execution plan, making it
easier for programmers to explicitly control the flow of their data processing task.
SQL is oriented around queries that produce a single result. SQL handles trees naturally, but has
no built-in mechanism for splitting a data processing stream and applying different operators to
each sub-stream. A Pig Latin script describes a directed acyclic graph (DAG) rather than a pipeline.
Pig Latin's ability to include user code at any point in the pipeline is useful for pipeline
development. If SQL is used, data must first be imported into the database, and then the
cleansing and transformation process can begin.
6) Describe Data
Describe DATA;
7) DUMP Data
Dump DATA;
8) FILTER Data
9) GROUP Data
10) Iterating Data
12) LIMIT Data
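Since only the DESCRIBE and DUMP statements survive above, hedged examples of the remaining operations are given here; the relation DATA and its fields (id, name, salary) are assumed for illustration:

FDATA = FILTER DATA BY salary > 10000;               -- 8) FILTER
GDATA = GROUP DATA BY name;                          -- 9) GROUP
IDATA = FOREACH GDATA GENERATE group, COUNT(DATA);   -- 10) iterating over the grouped data
SDATA = ORDER DATA BY salary DESC;                   -- sorting
LDATA = LIMIT SDATA 5;                               -- 12) LIMIT
DUMP LDATA;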
INPUT:
OUTPUT:
VIVA-VOCE Questions
AIM:-
Install and run Hive, then use Hive to create, alter and drop databases, tables, views,
functions and indexes.
DESCRIPTION
Hive allows SQL developers to write Hive Query Language (HQL) statements that are
similar to standard SQL statements; now you should be aware that HQL is limited in the
commands it understands, but it is still pretty useful. HQL statements are broken down by the
Hive service into MapReduce jobs and executed across a Hadoop cluster. Hive looks very much
like traditional database code with SQL access. However, because Hive is based on Hadoop and
MapReduce operations, there are several key differences. The first is that Hadoop is intended for
long sequential scans, and because Hive is based on Hadoop, you can expect queries to have a
very high latency (many minutes). This means that Hive would not be appropriate for
applications that need very fast response times, as you would expect with a database such as
DB2. Finally, Hive is read-based and therefore not appropriate for transaction processing that
typically involves a high percentage of write operations.
ALGORITHM:
1) Installing MySQL Server
sudo apt-get install mysql-server
2) Configuring the MySQL username and password
3) Creating a user and granting all privileges (the grant statement is sketched after this list)
mysql -uroot -proot
create user '<USER_NAME>' identified by '<PASSWORD>';
4) Extracting and configuring Apache Hive
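The grant statement for step 3 usually looks like the following (the user name is the placeholder from above):
grant all privileges on *.* to '<USER_NAME>'@'localhost';
flush privileges;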
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hadoop</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hadoop</value>
</property>
DATABASE Creation
Syntax:
LOAD DATA LOCAL INPATH '<path>/u.data' OVERWRITE INTO TABLE u_data;
Syntax
Dropping View
Syntax:
Functions in HIVE
INDEXES
SET hive.index.compact.file=/home/administrator/Desktop/big/metastore_db/tmp/index_ipaddress_result;
SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;
Dropping Index
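For reference, hedged examples of the operations named above; the database, table, view and index names are placeholders, the u_data schema is the usual MovieLens layout (which may differ from the lab's file), and Hive indexes are available only up to Hive 2.x:

CREATE DATABASE IF NOT EXISTS labdb;
USE labdb;
CREATE TABLE u_data (userid INT, movieid INT, rating INT, unixtime STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
CREATE VIEW high_ratings AS SELECT * FROM u_data WHERE rating >= 4;
DROP VIEW IF EXISTS high_ratings;
SELECT userid, count(*), avg(rating) FROM u_data GROUP BY userid;   -- built-in functions
CREATE INDEX index_rating ON TABLE u_data (rating)
    AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED REBUILD;
ALTER INDEX index_rating ON u_data REBUILD;
DROP INDEX IF EXISTS index_rating ON u_data;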
INPUT
1) I do not need the index created in the first question anymore. How can I delete the above
index named index_bonuspay?
2) What is the use of HCatalog?
3) Write a query to rename a table Student to Student_New.
4) Is it possible to overwrite Hadoop MapReduce configuration in Hive?
5) What is the use of explode in Hive?
AIM:- Write a program to analyze web server log stream data using the Apache
Flume framework.
DESCRIPTION:
Apache Flume is a distributed, reliable, and available system for efficiently collecting,
aggregating and moving large amounts of log data from many different sources to a centralized
data store. Flume is currently undergoing incubation at The Apache Software Foundation. At a
high level, Flume NG uses single-hop message delivery guarantee semantics to provide end-to-end
reliability for the system.
The purpose of Flume is to provide a distributed, reliable, and available system for efficiently
collecting, aggregating and moving large amounts of log data from many different sources to a
centralized data store. The architecture of Flume NG is based on a few concepts that together
help achieve this objective. Some of these concepts have existed in the past implementation, but
have changed drastically. Here is a summary of concepts that Flume NG introduces, redefines, or
reuses from the earlier implementation:
Event: A byte payload with optional string headers that represents the unit of data that
Flume can transport from its point of origination to its final destination.
Flow: Movement of events from the point of origin to their final destination is considered
a data flow, or simply flow. This is not a rigorous definition and is used only at a high
level for description purposes.
Client: An interface implementation that operates at the point of origin of events and
delivers them to a Flume agent. Clients typically operate in the process space of the
application they are consuming data from. For example, Flume Log4j Appender is a
client.
ALGORITHM:
agent.channels.memory-channel.type = memory
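The single property above is one line of a larger agent configuration; a minimal sketch of such a configuration (the agent, source and sink names, the log path and the HDFS URL are assumptions) might be:

agent.sources = tail-source
agent.channels = memory-channel
agent.sinks = hdfs-sink

# Source: tail the web server access log
agent.sources.tail-source.type = exec
agent.sources.tail-source.command = tail -F /var/log/apache2/access.log
agent.sources.tail-source.channels = memory-channel

# Channel: buffer events in memory
agent.channels.memory-channel.type = memory
agent.channels.memory-channel.capacity = 1000

# Sink: write the events into HDFS
agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.hdfs.path = hdfs://localhost:9000/flume/weblogs
agent.sinks.hdfs-sink.channel = memory-channel

The agent can then be started with: flume-ng agent --conf conf --conf-file weblog.conf --name agent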
INPUT:-
VIVA-VOCE QUESTIONS
1) What is a Flume Agent?
2) What is a Flume Sink?
3) What is a Flume Channel?
DESCRIPTION:
COMBINERS:
Many MapReduce jobs are limited by the bandwidth available on the cluster, so it
pays to minimize the data transferred between map and reduce tasks. Hadoop allows the
user to specify a combiner function to be run on the map output—the combiner function‘s
output forms the input to the reduce function. Since the combiner function is an
optimization, Hadoop does not provide a guarantee of how many times it will call it for a
particular map output record, if at all. In other words, calling the combiner function zero,
one, or many times should produce the same output from the reducer. One can think of
combiners as "mini-reducers" that take place on the output of the mappers, prior to the
shuffle and sort phase. Each combiner operates in isolation and therefore does not have
access to intermediate output from other mappers. The combiner is provided keys and
values associated with each key (the same types as the mapper output keys and values).
Critically, one cannot assume that a combiner will have the opportunity to process all
values associated with the same key. The combiner can emit any number of key-value
pairs, but the keys and values must be of the same type as the mapper output (same as the
reducer input). In cases where an operation is both associative and commutative (e.g.,
addition or multiplication), reducers can directly serve as combiners.
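As an illustration of that last point, the WordCount job from the earlier program can reuse its reducer as the combiner with one extra line in the driver (the class names follow the earlier sketches):

job.setMapperClass(WordCountMapper.class);
job.setCombinerClass(WordCountReducer.class);   // same class: addition is associative and commutative
job.setReducerClass(WordCountReducer.class);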
PARTITIONERS
COMBINING
1) Divide the data source (the data files) into fragments or blocks which are sent to a
mapper. These are called splits.
2) These splits are further divided into records and these records are provided one at a time
to the mapper for processing. This is achieved through a class called RecordReader.
Create a class that extends the TextInputFormat class to create our own NLinesInputFormat.
Then create our own RecordReader class called NLinesRecordReader where we will implement the logic
of reading three lines at a time.
Make a change in the driver program to use the new NLinesInputFormat class.
To prove that we are really getting 3 lines at a time, instead of actually counting words (which we
already know how to do), the mapper can simply emit the lines it receives in each call.
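Note that Hadoop also ships with org.apache.hadoop.mapreduce.lib.input.NLineInputFormat. It controls how many lines each map task (split) receives rather than how many lines each map() call receives, so it is not an exact substitute for the custom NLinesInputFormat described above, but a driver can use it like this:

job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job, 3);   // three input lines per map task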
PARTITIONING
3. Now suppose your hashCode() method does not uniformly distribute the other keys' data
over the partition range. Then the data is not evenly distributed across partitions as well as
reducers. Since each partition is equivalent to a reducer, some reducers will have
more data than other reducers, so the other reducers will wait for the one reducer (the one with the user-
defined keys) due to the workload it carries.
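A minimal custom Partitioner illustrating the idea (the key and value types are assumed to match the earlier WordCount sketches):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // A uniform, non-negative hash keeps the load balanced across the reducers.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

It is registered in the driver with: job.setPartitionerClass(WordPartitioner.class);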
INPUT:
OUTPUT: