
HADOOP AND BIG DATA


EXERCISE-1:-

AIM:-

Implement the following Data Structures in Java

a) Linked Lists  b) Stacks  c) Queues  d) Set  e) Map

DESCRIPTION:

The java.util package contains all the classes and interfaces for the Collection framework.

Methods of Collection interface

There are many methods declared in the Collection interface. They are as follows:

No.  Method                                     Description

1    public boolean add(Object element)         is used to insert an element in this collection.
2    public boolean addAll(Collection c)        is used to insert the specified collection's elements in the invoking collection.
3    public boolean remove(Object element)      is used to delete an element from this collection.
4    public boolean removeAll(Collection c)     is used to delete all the elements of the specified collection from the invoking collection.
5    public boolean retainAll(Collection c)     is used to delete all the elements of the invoking collection except the specified collection.
6    public int size()                          returns the total number of elements in the collection.
7    public void clear()                        removes all elements from the collection.
8    public boolean contains(Object element)    is used to search for an element.
9    public boolean containsAll(Collection c)   is used to search for the specified collection in this collection.
10   public Iterator iterator()                 returns an iterator.
11   public Object[] toArray()                  converts the collection into an array.
12   public boolean isEmpty()                   checks if the collection is empty.


SKELETON OF JAVA.UTIL.COLLECTION INTERFACE

public interface Collection<E> extends Iterable<E>

{
int size();

boolean isEmpty();

boolean contains(Object o);

Iterator<E> iterator();

Object[] toArray();

<T> T[] toArray(T[] a);

boolean add(E e);

boolean remove(Object o);

boolean addAll(Collection<? extends E> c);

boolean removeAll(Collection<?> c);

boolean retainAll(Collection<?> c);

void clear();

boolean equals(Object o);

int hashCode();

}


ALGORITHM for All Collection Data Structures:-

Steps of Creation of Collection

1. Create an object of a generic type E, T, K or V.

2. Create a Model class or Plain Old Java Object (POJO) of that type.

3. Generate Setters and Getters.

4. Create a Collection object of type Set, List, Map or Queue.

5. Add objects to the collection:

boolean add(E e)

6. Add a collection to the collection:

boolean addAll(Collection c)

7. Remove or retain data from the collection:

removeAll(Collection c), retainAll(Collection c)

8. Iterate objects using Enumeration, Iterator or ListIterator:

Iterator iterator(), ListIterator listIterator()

9. Display objects from the collection.

10. END
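
A minimal sketch of these steps is shown below. It assumes a simplified Employee POJO with only a few of the attributes from the sample data set (id, name, designation, salary); the class and variable names are illustrative rather than a prescribed solution.

import java.util.*;

class Employee {
    private String id;
    private String name;
    private String designation;
    private double salary;

    Employee(String id, String name, String designation, double salary) {
        this.id = id; this.name = name; this.designation = designation; this.salary = salary;
    }
    // Getters (setters omitted for brevity)
    public String getId() { return id; }
    public String getName() { return name; }
    public String getDesignation() { return designation; }
    public double getSalary() { return salary; }
}

public class CollectionDemo {
    public static void main(String[] args) {
        Employee e1 = new Employee("e100", "james", "asst.prof", 8000);
        Employee e2 = new Employee("e101", "jack", "asst.prof", 8350);

        // List keeps insertion order and allows duplicates
        List<Employee> list = new LinkedList<>();
        list.add(e1);
        list.add(e2);

        // Set rejects duplicate employee ids
        Set<String> ids = new HashSet<>();
        for (Employee e : list) ids.add(e.getId());

        // Queue and Stack (a Deque serves both roles)
        Deque<Employee> queue = new ArrayDeque<>(list);   // FIFO via addLast/pollFirst
        Deque<Employee> stack = new ArrayDeque<>();       // LIFO via push/pop
        stack.push(e1);

        // Map from employee id to Employee object
        Map<String, Employee> map = new HashMap<>();
        for (Employee e : list) map.put(e.getId(), e);

        // Iterate and display objects from the collection
        Iterator<Employee> it = list.iterator();
        while (it.hasNext()) {
            Employee e = it.next();
            System.out.println(e.getId() + " " + e.getName() + " "
                    + e.getDesignation() + " " + e.getSalary());
        }
    }
}

The same Employee objects can be built from employee.txt by splitting each line on commas before adding them to the collections.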


SAMPLE INPUT:

Sample Employee Data Set:

(employee.txt)

e100,james,asst.prof,cse,8000,16000,4000,8.7

e101,jack,asst.prof,cse,8350,17000,4500,9.2

e102,jane,assoc.prof,cse,15000,30000,8000,7.8

e104,john,prof,cse,30000,60000,15000,8.8

e105,peter,assoc.prof,cse,16500,33000,8600,6.9

e106,david,assoc.prof,cse,18000,36000,9500,8.3

e107,daniel,asst.prof,cse,9400,19000,5000,7.9

e108,ramu,assoc.prof,cse,17000,34000,9000,6.8

e109,rani,asst.prof,cse,10000,21500,4800,6.4

e110,murthy,prof,cse,35000,71500,15000,9.3

EXPECTED OUTPUT:-

Prints the information of employee with all its attributes




VIVA VOCE QUESTIONS (LIST)

1) What is the difference between ArrayList and Vector?


2) What is the difference between ArrayList and LinkedList?
3) What is the root interface in the collection hierarchy?

4) What is the difference between Collection and Collections?

5) How to reverse a List in Collections?

6) What is the difference between the Enumeration and Iterator interfaces?

VIVA VOCE QUESTIONS (QUEUE)

1) What is the difference between Queue and Stack?

2) What collection will you use to implement a queue?

VIVA VOCE QUESTIONS (MAP)

1) What is the difference between Hashtable and HashMap?

2) How do you store a primitive data type within a Vector or other collections class?

3) What is the difference between HashMap and TreeMap?

4) What are the two types of Map implementations available in the Collections?




EXERCISE-2:-

AIM:-

i) Perform setting up and installing Hadoop in its three operating modes:

- Standalone
- Pseudo-Distributed
- Fully Distributed

DESCRIPTION:

Hadoop is written in Java, so you will need to have Java installed on your machine,
version 6 or later. Sun's JDK is the one most widely used with Hadoop, although others have
been reported to work.

Hadoop runs on Unix and on Windows. Linux is the only supported production platform,
but other flavors of Unix (including Mac OS X) can be used to run Hadoop for development.
Windows is only supported as a development platform, and additionally requires Cygwin to run.
During the Cygwin installation process, you should include the openssh package if you plan to
run Hadoop in pseudo-distributed mode.

ALGORITHM

STEPS INVOLVED IN INSTALLING HADOOP IN STANDALONE MODE:-

1. Command for installing ssh is "sudo apt-get install ssh".

2. Command for key generation is ssh-keygen -t rsa -P "".
3. Store the key into rsa.pub by using the command cat $HOME/.ssh/id_rsa.pub >>
$HOME/.ssh/authorized_keys
4. Extract the java by using the command tar xvfz jdk-8u60-linux-i586.tar.gz.
5. Extract the eclipse by using the command tar xvfz eclipse-jee-mars-R-linux-gtk.tar.gz
6. Extract the hadoop by using the command tar xvfz hadoop-2.7.1.tar.gz


7. Move the java to /usr/lib/jvm/ and eclipse to /opt/ paths. Configure the java path in the
eclipse.ini file.
8. Export the java path and hadoop path in .bashrc.
9. Check whether the installation is successful or not by checking the java version and hadoop version.
10. Check whether the hadoop instance in standalone mode is working correctly or not by using an
implicit hadoop jar file named as wordcount.
11. If the word count is displayed correctly in the part-r-00000 file, it means that standalone mode
is installed successfully.

ALGORITHM

STEPS INVOLVED IN INSTALLING HADOOP IN PSEUDO DISTRIBUTED MODE:-

1. In order to install pseudo distributed mode we need to configure the hadoop
configuration files residing in the directory /home/lendi/hadoop-2.7.1/etc/hadoop.
2. First configure the hadoop-env.sh file by changing the java path.
3. Configure the core-site.xml, which contains a property tag with a name and a
value: name as fs.defaultFS and value as hdfs://localhost:9000.
4. Configure hdfs-site.xml.
5. Configure yarn-site.xml.
6. Configure mapred-site.xml; before configuring it, copy mapred-site.xml.template to
mapred-site.xml.
7. Now format the name node by using the command hdfs namenode -format.
8. Type the commands start-dfs.sh and start-yarn.sh, which start the daemons:
NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager.
9. Run jps, which shows all running daemons. Create a directory in hadoop by using the
command hdfs dfs -mkdir /csedir, enter some data into lendi.txt using the command
nano lendi.txt, copy it from the local directory to hadoop using the command hdfs dfs
-copyFromLocal lendi.txt /csedir/, and run the sample jar file wordcount to check whether
pseudo distributed mode is working or not.


10. Display the contents of the file by using the command hdfs dfs -cat /newdir/part-r-00000.

FULLY DISTRIBUTED MODE INSTALLATION:

ALGORITHM

1. Stop all single node clusters

$ stop-all.sh

2. Decide one as NameNode (Master) and the remaining as DataNodes (Slaves).

3. Copy the public key to all three hosts to get password-less SSH access

$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub lendi@l5sys24

4. Configure all configuration files, to name Master and Slave Nodes.

$ cd $HADOOP_HOME/etc/hadoop

$ nano core-site.xml

$ nano hdfs-site.xml

5. Add hostnames to the file slaves and save it.

$ nano slaves

6. Configure yarn-site.xml

$ nano yarn-site.xml

7. Do in Master Node

$ hdfs namenode -format

$ start-dfs.sh

$ start-yarn.sh

8. Format NameNode


9. Daemons starting in Master and Slave Nodes

10. END

INPUT

ubuntu@localhost> jps

OUTPUT:

DataNode, NameNode, SecondaryNameNode,

NodeManager, ResourceManager

VIVA VOCE QUESTIONS:-

1) What does the 'jps' command do?

2) How to restart Namenode?

3) Differentiate between Structured and Unstructured data?

4) What are the main components of a Hadoop Application?

5) Explain the difference between NameNode, Backup Node and Checkpoint NameNode.




II) Using Web Based Tools to Manage Hadoop Set-up

DESCRIPTION

Hadoop set up can be managed by different web based tools, which can be easy for the
user to identify the running daemons. Few of the tools used in the real world are:

a) Apache Ambari

b) HortonWorks

c) Apache Spark

LIST OF CLUSTERS IN HADOOP

Apache Hadoop Running at Local Host



AMBARI Admin Page for Managing Hadoop Clusters




AMBARI Admin Page for Viewing Hadoop Map Reduce Jobs

Horton Works Tool for Managing Map Reduce Jobs in Apache Pig

Running Map Reduce Jobs in Horton Works for Pig Latin Script


EXERCISE-3:-

AIM:-

Implement the following file management tasks in Hadoop:

- Adding files and directories

- Retrieving files
- Deleting files

DESCRIPTION:-

HDFS is a scalable distributed filesystem designed to scale to petabytes of data while running on top of the underlying filesystem of the operating system. The basic file management tasks covered here are:

- Adding files and directories to HDFS

- Retrieving files from HDFS to the local filesystem
- Deleting files from HDFS

ALGORITHM:-

SYNTAX AND COMMANDS TO ADD, RETRIEVE AND DELETE DATA FROM HDFS

Step-1

Adding Files and Directories to HDFS




Before you can run Hadoop programs on data stored in HDFS, you'll need to put the data into
HDFS first. Let's create a directory and put a file in it. HDFS has a default working directory of
/user/$USER, where $USER is your login user name. This directory isn't automatically created
for you, though, so let's create it with the mkdir command. For the purpose of illustration, we
use chuck. You should substitute your user name in the example commands.

hadoop fs -mkdir /user/chuck


hadoop fs -put example.txt .
hadoop fs -put example.txt /user/chuck

Step-2

Retrieving Files from HDFS

The Hadoop command get copies files from HDFS back to the local filesystem, while cat displays
the contents of a file directly. To view example.txt, we can run the following command:

hadoop fs -cat example.txt

Step-3

Deleting Files from HDFS

hadoop fs -rm example.txt


- Command for creating a directory in hdfs is "hdfs dfs -mkdir /lendicse".
- Adding a directory is done through the command "hdfs dfs -put lendi_english /".

Step-4

Copying Data from NFS to HDFS



Copying from a directory, the command is "hdfs dfs -copyFromLocal

/home/lendi/Desktop/shakes/glossary /lendicse/"
- View the file by using the command "hdfs dfs -cat /lendi_english/glossary"
- Command for listing of items in Hadoop is "hdfs dfs -ls hdfs://localhost:9000/".
- Command for deleting files is "hdfs dfs -rm -r /kartheek".
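
The same add, retrieve and delete operations can also be performed programmatically through the org.apache.hadoop.fs.FileSystem API. The sketch below is only illustrative; the paths and the fs.defaultFS value are assumptions matching the pseudo-distributed set-up of Exercise 2.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");   // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);

        // Adding a directory and a file (equivalent to -mkdir and -put/-copyFromLocal)
        fs.mkdirs(new Path("/lendicse"));
        fs.copyFromLocalFile(new Path("/home/lendi/Desktop/shakes/glossary"),
                             new Path("/lendicse/glossary"));

        // Retrieving a file back to the local filesystem (equivalent to -get)
        fs.copyToLocalFile(new Path("/lendicse/glossary"),
                           new Path("/home/lendi/Desktop/glossary_copy"));

        // Deleting a file or directory recursively (equivalent to -rm -r)
        fs.delete(new Path("/lendicse"), true);

        fs.close();
    }
}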

SAMPLE INPUT:

Input as any data of type structured, unstructured or semi-structured

EXPECTED OUTPUT:

VIVA-VOCE Questions

1) What is the command used to copy the data from local to hdfs?

2) What is the command used to run the hadoop jar file?

3) What command is used to remove a directory from Hadoop recursively?

4) What is the command used to list out directories of a Data Node through the web tool?


EXERCISE-4

AIM:-

Run a basic Word Count Map Reduce Program to understand Map Reduce Paradigm

DESCRIPTION:--

MapReduce is the heart of Hadoop. It is this programming paradigm that allows for
massive scalability across hundreds or thousands of servers in a Hadoop cluster.
The MapReduce concept is fairly simple to understand for those who are familiar with clustered
scale-out data processing solutions. The term MapReduce actually refers to two separate and
distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of data
and converts it into another set of data, where individual elements are broken down into tuples
(key/value pairs). The reduce job takes the output from a map as input and combines those data
tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce
job is always performed after the map job.

ALGORITHM

MAPREDUCE PROGRAM

WordCount is a simple program which counts the number of occurrences of each word in a given
text input data set. WordCount fits very well with the MapReduce programming model making it
a great example to understand the Hadoop Map/Reduce programming style. Our implementation
consists of three main parts:

1. Mapper

2. Reducer

3. Driver


Step-1. Write a Mapper

A Mapper overrides the "map" function from the class "org.apache.hadoop.mapreduce.Mapper",
which provides <key, value> pairs as the input. A Mapper implementation may output
<key, value> pairs using the provided Context.

The input value of the WordCount map task will be a line of text from the input data file and the key
would be the line number <line_number, line_of_text>. The map task outputs <word, one> for each
word in the line of text.

Pseudo-code

void Map (key, value){

for each word x in value:

output.collect(x, 1);

}

Step-2. Write a Reducer

A Reducer collects the intermediate <key, value> output from multiple map tasks and assembles a
single result. Here, the WordCount program will sum up the occurrences of each word into pairs as
<word, occurrence>.

Pseudo-code

void Reduce (keyword, <list of value>){

for each x in <list of value>:

sum+=x;

final_output.collect(keyword, sum);


}

Step-3. Write Driver

The Driver program configures and runs the MapReduce job. We use the main program to
perform basic configurations such as:

- Job Name : name of this Job

- Executable (Jar) Class: the main executable class. For here, WordCount.
- Mapper Class: class which overrides the "map" function. For here, Map.
- Reducer: class which overrides the "reduce" function. For here, Reduce.
- Output Key: type of output key. For here, Text.
- Output Value: type of output value. For here, IntWritable.
- File Input Path
- File Output Path
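
Putting the three parts together, a compact WordCount along the lines of the pseudo-code above could look as follows (a sketch, not the only possible implementation; class names such as Map and Reduce are one choice):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits <word, 1> for every word in the input line
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts for each word
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: configures job name, classes, output types and the input/output paths
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The job is packaged into a jar and run with, for example, hadoop jar wordcount.jar WordCount <input path> <output path>.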

INPUT:-

Set of data related to Shakespeare's comedies, glossary and poems




OUTPUT:-


VIVA VOCE QUESTIONS:

1) What is Hadoop Map Reduce?

2) Explain what is shuffling in MapReduce?

3) Explain what is JobTracker in Hadoop? What are the actions followed by Hadoop?

4) Explain what is heartbeat in HDFS?

5) Explain what a combiner is and when you should use a combiner in a MapReduce Job?

6) What happens when a datanode fails?




EXERCISE-5:-

AIM:-

Write a Map Reduce Program that mines Weather Data.

DESCRIPTION:

Climate change has been receiving a lot of attention for a long time. The antagonistic
effect of this climate is being felt in every part of the earth. There are many examples of this,
such as rising sea levels, less rainfall and an increase in humidity. The proposed system overcomes
some issues that occur with other techniques. In this project we use the concept of
Big Data and Hadoop. In the proposed architecture we are able to process offline data, which is
stored in the National Climatic Data Centre (NCDC). Through this we are able to find out the
maximum temperature and minimum temperature of a year, and are able to predict the future weather
forecast. Finally, we plot a graph of the obtained MAX and MIN temperature for each month
of the particular year to visualize the temperature. Based on the previous years' data, the weather
of the coming year is predicted.

ALGORITHM:-

MAPREDUCE PROGRAM

The weather-mining job uses the same MapReduce structure as the WordCount program from
Exercise 4: the mapper extracts temperature readings from each input record and the reducer
aggregates them. Our implementation consists of three main parts:

1. Mapper

2. Reducer

3. Main program

Step-1. Write a Mapper

A Mapper overrides the "map" function from the class "org.apache.hadoop.mapreduce.Mapper",
which provides <key, value> pairs as the input. A Mapper implementation may output
<key, value> pairs using the provided Context.

The input value of the map task will be a line of text from the weather data file and the key
would be the line number <line_number, line_of_text>. The map task outputs <temperature, one>
for each temperature reading in the line of text.

Pseudo-code

void Map (key, value){

for each max_temp x in value:

output.collect(x, 1);

}

void Map (key, value){

for each min_temp x in value:

output.collect(x, 1);

}

Step-2 Write a Reducer

A Reducer collects the intermediate <key, value> output from multiple map tasks and assembles a
single result. Here, the program will aggregate the values for each temperature key into pairs as
<temperature, occurrence>.


Pseudo-code

void Reduce (max_temp, <list of value>){

for each x in <list of value>:

sum+=x;

final_output.collect(max_temp, sum);

}

void Reduce (min_temp, <list of value>){

for each x in <list of value>:

sum+=x;

final_output.collect(min_temp, sum);

}

Step-3. Write Driver

The Driver program configures and runs the MapReduce job. We use the main program to
perform basic configurations such as:

Job Name : name of this Job

Executable (Jar) Class: the main executable class. For here, WordCount.

Mapper Class: class which overrides the "map" function. For here, Map.

Reducer: class which override the "reduce" function. For here , Reduce.

Output Key: type of output key. For here, Text.

Output Value: type of output value. For here, IntWritable.




File Input Path

File Output Path
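
A minimal sketch of the maximum-temperature job is given below. It assumes a simplified input of comma-separated year,temperature records with integer temperatures, rather than the raw fixed-width NCDC format; the parsing in the mapper has to be adapted to the actual record layout. The minimum-temperature job is identical except that the reducer keeps Math.min.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

    // Mapper: emits <year, temperature> for every valid record
    public static class TempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");   // assumed year,temperature layout
            if (fields.length == 2) {
                context.write(new Text(fields[0].trim()),
                              new IntWritable(Integer.parseInt(fields[1].trim())));
            }
        }
    }

    // Reducer: keeps the maximum temperature seen for each year
    public static class MaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text year, Iterable<IntWritable> temps, Context context)
                throws IOException, InterruptedException {
            int max = Integer.MIN_VALUE;
            for (IntWritable t : temps) max = Math.max(max, t.get());
            context.write(year, new IntWritable(max));
        }
    }

    // Driver: wired up exactly like the WordCount driver of Exercise 4
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "max temperature");
        job.setJarByClass(MaxTemperature.class);
        job.setMapperClass(TempMapper.class);
        job.setReducerClass(MaxTempReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}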

INPUT:-

Set of Weather Data over the years

OUTPUT:-


VIVA VOCE QUESTIONS:

1) Explain what is the function of the MapReduce partitioner?

2) Explain what is the difference between an Input Split and an HDFS Block?

3) Explain what is SequenceFileInputFormat?

4) In Hadoop what is InputSplit?

5) Explain what is a sequence file in Hadoop?




EXERCISE-6:-

AIM:-

Write a Map Reduce Program that implements Matrix Multiplication.

DESCRIPTION:

We can represent a matrix as a relation (table) in an RDBMS where each cell in the matrix
can be represented as a record (i, j, value). As an example let us consider the following matrix
and its representation. It is important to understand that this relation is a very inefficient relation
if the matrix is dense. Let us say we have 5 rows and 6 columns; then we need to store only 30
values. But if you consider the above relation we are storing 30 row_ids, 30 col_ids and 30 values; in
other words we are tripling the data. So a natural question arises: why do we need to store it in this
format? In practice most of the matrices are sparse matrices. In sparse matrices not all cells
have values, so we don't have to store those cells in the DB. So this turns out to be very
efficient for storing such matrices.

MapReduce Logic

The logic is to send the calculation part of each output cell of the result matrix to a reducer.
So in matrix multiplication the first cell of the output, (0,0), has the multiplication and summation of
elements from row 0 of matrix A and elements from column 0 of matrix B. To do the
computation of the value in output cell (0,0) of the resultant matrix in a separate reducer, we need to
use (0,0) as the output key of the map phase, and the value should carry the array of values from row 0
of matrix A and column 0 of matrix B. So in this algorithm the
output from the map phase should be a <key, value> pair, where the key represents the output cell
location (0,0), (0,1), etc. and the value is the list of all values required for the reducer to do the
computation. Let us take an example of calculating the value at output cell (0,0). Here we need to
collect values from row 0 of matrix A and column 0 of matrix B in the map phase and pass (0,0) as the
key, so that a single reducer can do the calculation.
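
A sketch of this one-step approach is given below. It assumes the input lines are already tagged with the matrix name, e.g. A,i,k,value and B,k,j,value, and that the matrix dimensions are passed to the mappers through the job Configuration (the keys rows.A and cols.B are made-up names); these conventions are illustrative and differ from the block-based strategy described in the algorithm that follows.

import java.io.IOException;
import java.util.HashMap;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MatrixMultiply {

    // Mapper: routes each element of A and B to every output cell it contributes to
    public static class MatMapper extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            int I = context.getConfiguration().getInt("rows.A", 0);   // rows of A (and C)
            int J = context.getConfiguration().getInt("cols.B", 0);   // columns of B (and C)
            String[] t = value.toString().split(",");                 // e.g. "A,0,1,5.0"
            if (t[0].equals("A")) {
                int i = Integer.parseInt(t[1]);
                int k = Integer.parseInt(t[2]);
                for (int j = 0; j < J; j++)
                    context.write(new Text(i + "," + j), new Text("A," + k + "," + t[3]));
            } else {
                int k = Integer.parseInt(t[1]);
                int j = Integer.parseInt(t[2]);
                for (int i = 0; i < I; i++)
                    context.write(new Text(i + "," + j), new Text("B," + k + "," + t[3]));
            }
        }
    }

    // Reducer: for output cell (i,j), computes sum over k of A(i,k)*B(k,j)
    public static class MatReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text cell, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            HashMap<Integer, Double> a = new HashMap<>();
            HashMap<Integer, Double> b = new HashMap<>();
            for (Text v : values) {
                String[] t = v.toString().split(",");
                if (t[0].equals("A")) a.put(Integer.parseInt(t[1]), Double.parseDouble(t[2]));
                else                  b.put(Integer.parseInt(t[1]), Double.parseDouble(t[2]));
            }
            double sum = 0;
            for (Integer k : a.keySet())
                if (b.containsKey(k)) sum += a.get(k) * b.get(k);
            if (sum != 0) context.write(cell, new Text(Double.toString(sum)));
        }
    }
}

The driver is wired up like the WordCount driver of Exercise 4, with conf.setInt("rows.A", I) and conf.setInt("cols.B", J) set before the job is submitted.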

ALGORITHM

We assume that the input files for A and B are streams of (key,value) pairs in sparse
matrix format, where each key is a pair of indices (i,j) and each value is the corresponding matrix
element value. The output files for matrix C=A*B are in the same format.

We have the following input parameters:

The path of the input file or directory for matrix A.


The path of the input file or directory for matrix B.
The path of the directory for the output files for matrix C.
strategy = 1, 2, 3 or 4.
R = the number of reducers.
I = the number of rows in A and C.
K = the number of columns in A and rows in B.
J = the number of columns in B and C.
IB = the number of rows per A block and C block.
KB = the number of columns per A block and rows per B block.
JB = the number of columns per B block and C block.

In the pseudo-code for the individual strategies below, we have intentionally avoided
factoring common code for the purposes of clarity.

Note that in all the strategies the memory footprint of both the mappers and the reducers is flat at
scale.

Note that the strategies all work reasonably well with both dense and sparse matrices. For sparse
matrices we do not emit zero elements. That said, the simple pseudo-code for multiplying the
individual blocks shown here is certainly not optimal for sparse matrices. As a learning exercise,
our focus here is on mastering the MapReduce complexities, not on optimizing the sequential
matrix multiplication algorithm for the individual blocks.


Steps

1. setup ()
2. var NIB = (I-1)/IB+1
3. var NKB = (K-1)/KB+1
4. var NJB = (J-1)/JB+1
5. map (key, value)
6. if from matrix A with key=(i,k) and value=a(i,k)
7. for 0 <= jb < NJB
8. emit (i/IB, k/KB, jb, 0), (i mod IB, k mod KB, a(i,k))
9. if from matrix B with key=(k,j) and value=b(k,j)
10. for 0 <= ib < NIB
emit (ib, k/KB, j/JB, 1), (k mod KB, j mod JB, b(k,j))

Intermediate keys (ib, kb, jb, m) sort in increasing order first by ib, then by kb, then by jb,
then by m. Note that m = 0 for A data and m = 1 for B data.

The partitioner maps intermediate key (ib, kb, jb, m) to a reducer r as follows:

11. r = ((ib*JB + jb)*KB + kb) mod R

12. These definitions for the sorting order and partitioner guarantee that each reducer
R[ib,kb,jb] receives the data it needs for blocks A[ib,kb] and B[kb,jb], with the data for
the A block immediately preceding the data for the B block.
13. var A = new matrix of dimension IBxKB
14. var B = new matrix of dimension KBxJB
15. var sib = -1
16. var skb = -1


Reduce (key, valueList)

17. if key is (ib, kb, jb, 0)

18. // Save the A block.
19. sib = ib
20. skb = kb
21. Zero matrix A
22. for each value = (i, k, v) in valueList: A(i,k) = v
23. if key is (ib, kb, jb, 1)
24. if ib != sib or kb != skb return // A[ib,kb] must be zero!
25. // Build the B block.
26. Zero matrix B
27. for each value = (k, j, v) in valueList: B(k,j) = v
28. // Multiply the blocks and emit the result.
29. ibase = ib*IB
30. jbase = jb*JB
31. for 0 <= i < row dimension of A
32. for 0 <= j < column dimension of B
33. sum = 0
34. for 0 <= k < column dimension of A = row dimension of B
a. sum += A(i,k)*B(k,j)
35. if sum != 0 emit (ibase+i, jbase+j), sum

INPUT:-

Set of Data sets over different Clusters are taken as Rows and Columns

OUTPUT:-

VIVA VOCE QUESTIONS:

1) Explain what is "map" and what is "reducer" in Hadoop?

2) Mention what daemons run on a master node and slave nodes?

3) Mention what is the use of Context Object?

4) What is a partitioner in Hadoop?

5) Explain what is the purpose of RecordReader in Hadoop?




EXERCISE-7:-

AIM:-

Install and Run Pig then write Pig Latin scripts to sort, group, join, project and filter the
data.

DESCRIPTION

Apache Pig is a high-level platform for creating programs that run on Apache Hadoop.

The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in
MapReduce, Apache Tez, or Apache Spark. Pig Latin abstracts the programming from
the MapReduce idiom into a notation which makes MapReduce programming high level,
similar to that of SQL for relational database management systems. Pig Latin can be extended using
user-defined functions (UDFs), which the user can write in Java, Python, JavaScript, Ruby or Groovy and then
call directly from the language.
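
As an illustration of the UDF extension point mentioned above, a minimal Java UDF might look like the sketch below (the class name and behaviour are invented for this example; it simply upper-cases its first input field):

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// A trivial Pig UDF that upper-cases the first field of its input tuple.
public class ToUpper extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null)
            return null;
        return ((String) input.get(0)).toUpperCase();   // assumes the field is a chararray
    }
}

After packaging it into a jar, it is registered with REGISTER and then called from a Pig Latin script like any built-in function.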

Pig Latin is procedural and fits very naturally in the pipeline paradigm while SQL is
instead declarative. In SQL users can specify that data from two tables must be joined, but not
what join implementation to use (You can specify the implementation of JOIN in SQL, thus "...
for many SQL applications the query writer may not have enough knowledge of the data or
enough expertise to specify an appropriate join algorithm."). Pig Latin allows users to specify an
implementation or aspects of an implementation to be used in executing a script in several
ways. In effect, Pig Latin programming is similar to specifying a query execution plan, making it
easier for programmers to explicitly control the flow of their data processing task.

SQL is oriented around queries that produce a single result. SQL handles trees naturally, but has
no built in mechanism for splitting a data processing stream and applying different operators to
each sub-stream. A Pig Latin script describes a directed acyclic graph (DAG) rather than a pipeline.

Pig Latin's ability to include user code at any point in the pipeline is useful for pipeline
development. If SQL is used, data must first be imported into the database, and then the
cleansing and transformation process can begin.


ALGORITHM

STEPS FOR INSTALLING APACHE PIG

1) Extract pig-0.15.0.tar.gz and move it to the home directory.

2) Set the environment of PIG in the bashrc file.

3) Pig can run in two modes:

Local Mode and Hadoop Mode
pig -x local and pig
4) Grunt Shell
Grunt >
5) LOADING Data into the Grunt Shell
DATA = LOAD <CLASSPATH> USING PigStorage(DELIMITER) as (ATTRIBUTE :
DataType1, ATTRIBUTE : DataType2, ...)

6) Describe Data

Describe DATA;

7) DUMP Data

Dump DATA;

8) FILTER Data

FDATA = FILTER DATA by ATTRIBUTE = VALUE;

9) GROUP Data

GDATA = GROUP DATA by ATTRIBUTE;

10) Iterating Data

FOR_DATA = FOREACH DATA GENERATE GROUP AS GROUP_FUN,

ATTRIBUTE = <VALUE>


11) Sorting Data

SORT_DATA = ORDER DATA BY ATTRIBUTE WITH CONDITION;

12) LIMIT Data

LIMIT_DATA = LIMIT DATA COUNT;

13) JOIN Data

JOIN_DATA = JOIN DATA1 BY (ATTRIBUTE1, ATTRIBUTE2, ...), DATA2 BY

(ATTRIBUTE3, ..., ATTRIBUTEN);

INPUT:

Input as Website Click Count Data

OUTPUT:

VIVA-VOCE Questions

1) What do you mean by a bag in Pig?


2) Differentiate between Pig Latin and HiveQL.
3) How will you merge the contents of two or more relations and divide a single relation
into two or more relations?
4) What is the usage of the foreach operation in Pig scripts?
5) What does Flatten do in Pig?


EXERCISE-8:-

AIM:-

Install and Run Hive then use Hive to Create, alter and drop databases, tables, views,
functions and Indexes.

DESCRIPTION

Hive allows SQL developers to write Hive Query Language (HQL) statements that are
similar to standard SQL statements; now you should be aware that HQL is limited in the
commands it understands, but it is still pretty useful. HQL statements are broken down by the
Hive service into MapReduce jobs and executed across a Hadoop cluster. Hive looks very much
like traditional database code with SQL access. However, because Hive is based on Hadoop and
MapReduce operations, there are several key differences. The first is that Hadoop is intended for
long sequential scans, and because Hive is based on Hadoop, you can expect queries to have a
very high latency (many minutes). This means that Hive would not be appropriate for
applications that need very fast response times, as you would expect with a database such as
DB2. Finally, Hive is read-based and therefore not appropriate for transaction processing that
typically involves a high percentage of write operations.

ALGORITHM:

Apache HIVE INSTALLATION STEPS

1) Install MySQL Server
sudo apt-get install mysql-server
2) Configure the MySQL Username and Password
3) Create a User and grant all Privileges
mysql -uroot -proot
CREATE USER <USER_NAME> IDENTIFIED BY '<PASSWORD>';
4) Extract and Configure Apache Hive


tar xvfz apache-hive-1.0.1.bin.tar.gz
5) Move Apache Hive from the Local directory to the Home directory
6) Set CLASSPATH in bashrc
export HIVE_HOME=/home/apache-hive
export PATH=$PATH:$HIVE_HOME/bin
7) Configure hive-default.xml by adding the MySQL Server Credentials

<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hadoop</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hadoop</value>
</property>

8) Copy mysql-java-connector.jar to the hive/lib directory.




SYNTAX for HIVE Database Operations

DATABASE Creation

CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>

Drop Database Statement


DROP (DATABASE|SCHEMA) [IF EXISTS]
database_name [RESTRICT|CASCADE];

Creating and Dropping Table in HIVE

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.] table_name


[(col_name data_type [COMMENT col_comment], ...)]

[COMMENT table_comment] [ROW FORMAT row_format] [STORED AS file_format]

Loading Data into table log_data

Syntax:
LOAD DATA LOCAL INPATH '<path>/u.data' OVERWRITE INTO TABLE u_data;

Alter Table in HIVE

Syntax

ALTER TABLE name RENAME TO new_name


ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
ALTER TABLE name DROP [COLUMN] column_name
ALTER TABLE name CHANGE column_name new_name new_type
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])

Creating and Dropping View



CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT


column_comment], ...) ] [COMMENT table_comment] AS SELECT ...

Dropping View

Syntax:

DROP VIEW view_name

Functions in HIVE

String and Math Functions:- substr(), upper(), round(), ceil(), regexp_replace() etc

Date and Time Functions:- year(), month(), day(), to_date() etc

Aggregate Functions :- sum(), min(), max(), count(), avg() etc
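
Besides the built-in functions listed above, Hive can also be extended with user-defined functions written in Java. The sketch below uses the classic UDF base class; the class name and behaviour are illustrative only:

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// A simple Hive UDF that trims and lower-cases a string column.
public class LowerTrim extends UDF {
    public Text evaluate(Text input) {
        if (input == null) return null;
        return new Text(input.toString().trim().toLowerCase());
    }
}

After ADD JAR and CREATE TEMPORARY FUNCTION, it can be called from HQL like any built-in function.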

INDEXES

CREATE INDEX index_name ON TABLE base_table_name (col_name, ...)


AS 'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)]
[
[ ROW FORMAT ...] STORED AS ...
| STORED BY ...
]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]


Creating Index

CREATE INDEX index_ip ON TABLE log_data(ip_address) AS


'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED
REBUILD;
Altering and Inserting Index

ALTER INDEX index_ip_address ON log_data REBUILD;


Storing Index Data in Metastore

SET
hive.index.compact.file=/home/administrator/Desktop/big/metastore_db/tmp/index_ipaddress_re
sult;

SET
hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;

Dropping Index

DROP INDEX INDEX_NAME on TABLE_NAME;

INPUT

Input as Web Server Log Data




OUTPUT




VIVA-VOCE Questions

1) I do not need the index created in the first question anymore. How can I delete the above
index named index_bonuspay?
2) What is the use of HCatalog?
3) Write a query to rename a table Student to Student_New.
4) Is it possible to overwrite the Hadoop MapReduce configuration in Hive?
5) What is the use of explode in Hive?


EXERCISE-10:-

AIM : Write a program to analyze the web server log stream data using the Apache
Flume Framework.

DESCRIPTION:

Apache Flume is a distributed, reliable, and available system for efficiently collecting,
aggregating and moving large amounts of log data from many different sources to a centralized
data store. Flume is currently undergoing incubation at The Apache Software Foundation. At a
high level, Flume NG uses single-hop message delivery guarantee semantics to provide end-to-end
reliability for the system.

The purpose of Flume is to provide a distributed, reliable, and available system for efficiently
collecting, aggregating and moving large amounts of log data from many different sources to a
centralized data store. The architecture of Flume NG is based on a few concepts that together
help achieve this objective. Some of these concepts have existed in the past implementation, but
have changed drastically. Here is a summary of concepts that Flume NG introduces, redefines, or
reuses from the earlier implementation:

- Event: A byte payload with optional string headers that represents the unit of data that
Flume can transport from its point of origination to its final destination.
- Flow: Movement of events from the point of origin to their final destination is considered
a data flow, or simply flow. This is not a rigorous definition and is used only at a high
level for description purposes.
- Client: An interface implementation that operates at the point of origin of events and
delivers them to a Flume agent. Clients typically operate in the process space of the
application they are consuming data from. For example, the Flume Log4j Appender is a
client (a minimal Java client sketch is given after this list).


- Agent: An independent process that hosts Flume components such as sources, channels
and sinks, and thus has the ability to receive, store and forward events to their next-hop
destination.
- Source: An interface implementation that can consume events delivered to it via a
specific mechanism. For example, an Avro source is a source implementation that can be
used to receive Avro events from clients or other agents in the flow. When a source
receives an event, it hands it over to one or more channels.
- Channel: A transient store for events, where events are delivered to the channel via
sources operating within the agent. An event put in a channel stays in that channel until a
sink removes it for further transport. An example of a channel is the JDBC channel that
uses a file-system backed embedded database to persist the events until they are removed
by a sink. Channels play an important role in ensuring durability of the flows.
- Sink: An interface implementation that can remove events from a channel and transmit
them to the next agent in the flow, or to the event's final destination. Sinks that transmit
the event to its final destination are also known as terminal sinks. The Flume HDFS sink
is an example of a terminal sink, whereas the Flume Avro sink is an example of a regular
sink that can transmit messages to other agents that are running an Avro source.
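
For completeness, a client as described above can be written against the Flume SDK RPC client API. The sketch below assumes an agent whose Avro source listens on localhost:41414; the host, port and event body are placeholders.

import java.nio.charset.StandardCharsets;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeLogClient {
    public static void main(String[] args) throws EventDeliveryException {
        // Connect to an agent whose Avro source listens on localhost:41414 (assumed)
        RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 41414);
        try {
            // Wrap one log line into a Flume event and deliver it to the agent
            Event event = EventBuilder.withBody("sample web server log line",
                                                StandardCharsets.UTF_8);
            client.append(event);
        } finally {
            client.close();
        }
    }
}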

ALGORITHM:

Steps to Configure Apache Flume to Web Server

a) Define a memory channel on the agent called memory-channel.

agent.channels.memory-channel.type = memory

b) Define a source on the agent and connect it to the channel memory-channel.


agent.sources.tail-source.type = exec
agent.sources.tail-source.command = tail -F /var/log/system.log
agent.sources.tail-source.channels = memory-channel


c) Define a sink that outputs to the logger.
agent.sinks.log-sink.channel = memory-channel
agent.sinks.log-sink.type = logger
d) Define a sink that outputs to hdfs.
agent.sinks.hdfs-sink.channel = memory-channel
agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.hdfs.path = hdfs://localhost:54310/tmp/system.log/
agent.sinks.hdfs-sink.hdfs.fileType = DataStream
e) Finally, activate.
agent.channels = memory-channel
agent.sources = tail-source
agent.sinks = log-sink hdfs-sink
f) Run flume-ng, with log messages to the console.
$ bin/flume-ng agent --conf ./conf/ -f conf/flume.conf \
-Dflume.root.logger=DEBUG,console -n agent

INPUT:-

Huge amount of streaming data from the server as input




OUTPUT:-

Flume Data Streamed from Server



VIVA-VOCE QUESTIONS

1) What is a Flume Agent?
2) What is a Flume Sink?
3) What is a Flume Channel?


EXERCISE-10:-

AIM : Write a program to implement combining and partitioning in Hadoop by implementing


a custom Partitioner and Combiner.

DESCRIPTION :

COMBINERS:

Many MapReduce jobs are limited by the bandwidth available on the cluster, so it
pays to minimize the data transferred between map and reduce tasks. Hadoop allows the
user to specify a combiner function to be run on the map output; the combiner function's
output forms the input to the reduce function. Since the combiner function is an
optimization, Hadoop does not provide a guarantee of how many times it will call it for a
particular map output record, if at all. In other words, calling the combiner function zero,
one, or many times should produce the same output from the reducer. One can think of
combiners as "mini-reducers" that take place on the output of the mappers, prior to the
shuffle and sort phase. Each combiner operates in isolation and therefore does not have
access to intermediate output from other mappers. The combiner is provided keys and
values associated with each key (the same types as the mapper output keys and values).
Critically, one cannot assume that a combiner will have the opportunity to process all
values associated with the same key. The combiner can emit any number of key-value
pairs, but the keys and values must be of the same type as the mapper output (same as the
reducer input). In cases where an operation is both associative and commutative (e.g.,
addition or multiplication), reducers can directly serve as combiners.

PARTITIONERS

A common misconception for first-time MapReduce programmers is to use only

a single reducer. It is easy to understand that such a constraint is nonsense and that


using more than one reducer is most of the time necessary, else the map/reduce concept
would not be very useful. With multiple reducers, we need some way to determine the
appropriate one to send a (key/value) pair outputted by a mapper. The default behavior is
to hash the key to determine the reducer. The partitioning phase takes place after the map
phase and before the reduce phase. The number of partitions is equal to the number of
reducers. The data gets partitioned across the reducers according to the partitioning
function. This approach improves the overall performance and allows mappers to operate
completely independently.

For all of its output key/value pairs, each mapper determines
which reducer will receive them. Because all the mappers are using the same partitioning
for any key, regardless of which mapper instance generated it, the destination partition is
the same. Hadoop uses an interface called Partitioner to determine which partition a
key/value pair will go to. A single partition refers to all key/value pairs that will be sent
to a single reduce task. You can configure the number of reducers in a job driver by
setting the number of reducers on the job object (job.setNumReduceTasks). Hadoop comes
with a default partitioner implementation, i.e. HashPartitioner, which hashes a record's
key to determine which partition the record belongs in. Each partition is processed by a
reduce task, so the number of partitions is equal to the number of reduce tasks for the job.

When the map function starts producing output, it is not simply written to disk. Each map
task has a circular memory buffer that it writes the output to. When the contents of the
buffer reach a certain threshold size, a background thread will start to spill the contents to
disk. Map outputs will continue to be written to the buffer while the spill takes place, but
if the buffer fills up during this time, the map will block until the spill is complete. Before
it writes to disk, the thread first divides the data into partitions corresponding to the
reducers that they will ultimately be sent to. Within each partition, the background thread
performs an in-memory sort by key, and if there is a combiner function, it is run on the
output of the sort.


ALGORITHM:

COMBINING

1) Divide the data source (the data files) into fragments or blocks which are sent to a
mapper. These are called splits.
2) These splits are further divided into records and these records are provided one at a time
to the mapper for processing. This is achieved through a class called RecordReader.
3) Create a class that extends the TextInputFormat class to create our own NLinesInputFormat.
4) Then create our own RecordReader class called NLinesRecordReader, where we will implement the logic
of reading N lines at a time.
5) Make a change in the driver program to use the new NLinesInputFormat class.
6) To prove that we are really getting 3 lines at a time, instead of actually counting words (which we
already know how to do), emit the number of lines received in each call to map.

PARTITIONING

1. First, the key appearing more often will be sent to one partition.

2. Second, all other keys will be sent to partitions according to their hashCode().

3. Now suppose your hashCode() method does not uniformly distribute the other keys' data
over the partition range. Then the data is not evenly distributed across partitions and
reducers, since each partition is equivalent to a reducer. Some reducers will have
more data than others, so the other reducers will wait for the one reducer (the one with the user-
defined keys) due to the work load it carries.
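
A minimal sketch of a custom partitioner for a word-count style job, together with the driver wiring for the combiner, is shown below; the partitioning rule (a dedicated partition for one assumed frequent key, hash for everything else) is just an example of the idea in the steps above.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Custom partitioner: one dedicated partition for a known heavy key,
// all other keys spread over the remaining reducers by hash code.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions <= 1) return 0;
        if (key.toString().equals("hadoop"))            // assumed frequent key
            return 0;
        return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
    }
}

// In the driver (same Job object as in Exercise 4):
// job.setCombinerClass(Reduce.class);            // reuse the reducer as combiner (sum is associative and commutative)
// job.setPartitionerClass(WordPartitioner.class);
// job.setNumReduceTasks(3);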

INPUT:

Data sets from different sources as Input

OUTPUT:

hduser@lendi-3446:/usr/local/hadoop/bin$ hadoop fs -ls /partitionerOutput/

14/12/01 17:50:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable
Found 4 items
-rw-r--r--   1 hduser supergroup    0 2014-12-01 17:49 /partitionerOutput/_SUCCESS
-rw-r--r--   1 hduser supergroup   10 2014-12-01 17:48 /partitionerOutput/part-r-00000
-rw-r--r--   1 hduser supergroup   10 2014-12-01 17:48 /partitionerOutput/part-r-00001
-rw-r--r--   1 hduser supergroup    9 2014-12-01 17:49 /partitionerOutput/part-r-00002

VIVA VOCE Questions

1) Where is the Mapper output (intermediate key-value data) stored?

2) When are the reducers started in a MapReduce job?

3) Mention what is the default partitioner in Hadoop?


4) Mention how many InputSplits are made by a Hadoop Framework?
