Activity 2
Eclipse Project Creation Instructions and Executing Activity 2
Using Hadoop on Cloudera and AWS
Hadoop on Cloudera
Purpose of Activity 2
In this activity we will learn how to use the Hadoop MapReduce model, store data in HDFS, and visualize that data using Hue, both on Cloudera and on AWS.
QUESTION: What are the Map and Reduce phases of the MapReduce model?
Map Phase:
Input Splitting: The input data is divided into smaller chunks called input splits.
Map Function Execution: The input splits are processed in parallel by the worker
nodes. The Map function processes each split and produces intermediate
key-value pairs.
Intermediate Key-Value Pairs: The Map function generates these intermediate
pairs, which are then grouped by key.
Reduce Phase:
Shuffle and Sort: The intermediate key-value pairs are partitioned by key,
transferred to the reducer nodes, and sorted so that all values for a given key
appear together.
Reduce Function Execution: Each reducer processes one partition of the sorted
intermediate data. The Reduce function takes the sorted key-value pairs and
performs computations on them.
Output Generation: The Reduce function generates the final output key-value pairs,
which are typically aggregated results or summaries.
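The two phases above can be mimicked with a plain Unix pipeline, purely as an illustration (this is not Hadoop code): `tr` plays the role of the map step by emitting one word per line, `sort` plays the role of shuffle-and-sort, and `uniq -c` plays the role of the reduce step by aggregating a count per key.

```shell
# Sketch only: mimic MapReduce word count with a Unix pipeline.
# "map"          -> emit one word (key) per line
# "shuffle/sort" -> sort brings equal keys together
# "reduce"       -> uniq -c aggregates a count per key
printf 'deer bear river\ncar car river\ndeer car bear\n' \
  | tr ' ' '\n' \
  | sort \
  | uniq -c
```

This prints counts such as `2 bear`, `3 car`, `2 deer`, `2 river`, the same result a word-count MapReduce job would produce for that input.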
Activity Overview
WORD COUNT: (VOWELS and CONSONANTS)
Task 1: Use code block 1 and count the frequency of words that start with the
letter ‘a’.
Task 2: Use code block 2 and count the frequency of words that have an odd count.
Eclipse Project creation steps
Click "Add External JARs...", then navigate to File System > usr > lib > hadoop:
Select all the JARs and click OK.
We need to add more external libraries. Click "Add External JARs..." again, select
all the JARs in the "client" directory, then click "OK".
CODE BLOCK 1:
Code block 1 counts the frequency of words in the given text file that start
with any two chosen vowels followed by any two chosen consonants.
Example: words starting with A or E (vowels) followed by S or R (consonants)
Task 1
Use code block 1 and count the frequency of words that start with the
letter ‘a’.
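Task 1's filter can be sketched with the same Unix-pipeline analogy (an illustration only, not the Java source of code block 1): keep only the words beginning with 'a' before counting.

```shell
# Sketch: count words starting with the letter 'a' (case-insensitive),
# mirroring what a modified mapper for Task 1 would emit.
printf 'apple ant banana Apple river ant\n' \
  | tr ' ' '\n' \
  | grep -i '^a' \
  | tr 'A-Z' 'a-z' \
  | sort \
  | uniq -c
```

For this sample input the pipeline prints `2 ant` and `2 apple`; `banana` and `river` are filtered out.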
Command Explanation
/home/cloudera/Ex2.jar — the JAR is located at this path
pravallika/sample.txt — the input file is in the pravallika directory
Ex2_output — the name of the output directory
Ex2_output/part-r-00000 — the output file, viewed from the command prompt
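Putting those pieces together, the commands being explained are presumably of the following form; the driver class name (here WordCount) is an assumption, and the class-name argument may be omitted if the JAR's manifest already specifies a main class.

```shell
# Assumed shape of the Task 1 commands; "WordCount" is a placeholder
# for your actual driver class name.
hadoop jar /home/cloudera/Ex2.jar WordCount pravallika/sample.txt Ex2_output

# View the result from the command prompt:
hdfs dfs -cat Ex2_output/part-r-00000
```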
Code block 2 counts the frequency of words that have an even count.
Task 2
Use code block 2 and count the frequency of words that have an odd count.
Once the class is created, copy the code from the file even.java, which is
available on Canvas under Activity 2.
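Task 2's odd-count filter can likewise be sketched as a Unix pipeline (an illustration only, not the even.java code): count every word, then keep only the keys whose count is odd, mirroring a reducer that filters on count % 2 != 0.

```shell
# Sketch: keep only words whose total count is odd (Task 2).
printf 'car car river deer deer deer bear\n' \
  | tr ' ' '\n' \
  | sort \
  | uniq -c \
  | awk '$1 % 2 == 1 {print $1, $2}'
```

Here `car` appears twice and is dropped, while `bear` (1), `deer` (3), and `river` (1) survive the odd-count filter.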
Command Explanation
/home/cloudera/even.jar — the JAR is located at this path
pravallika/sample.txt — the input file is in the pravallika directory
even_output — the name of the output directory
Visualize the Output in Hue
even_output/part-r-00000 — the output file to open in Hue
Create Directory:
Both commands will create a new directory named data within the
/user/myuser directory in HDFS.
Copy File:
Both commands will copy the local file localfile.txt to the data directory
in HDFS.
Move File:
Both commands will move the file file.txt from the data directory to the
archive directory in HDFS.
Delete File:
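The pairs of commands behind each step are presumably of the following form; `hadoop fs` and `hdfs dfs` are interchangeable here, and the delete example is illustrative since the original handout does not name the file being deleted.

```shell
# Create directory (both forms are equivalent):
hadoop fs -mkdir /user/myuser/data
hdfs dfs -mkdir /user/myuser/data

# Copy a local file into HDFS:
hadoop fs -put localfile.txt /user/myuser/data
hdfs dfs -put localfile.txt /user/myuser/data

# Move a file within HDFS:
hadoop fs -mv /user/myuser/data/file.txt /user/myuser/archive/file.txt
hdfs dfs -mv /user/myuser/data/file.txt /user/myuser/archive/file.txt

# Delete a file (illustrative file name):
hadoop fs -rm /user/myuser/data/file.txt
hdfs dfs -rm /user/myuser/data/file.txt
```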
Step 2: Choose a unique bucket name, and at the bottom click the Create bucket option.
Step 3: After creating the bucket successfully, click on the bucket name (to enter the bucket).
Step 4: After entering the bucket, we have to do the following steps:
Step 4-a: Click the Upload button, choose Add file, select our JAR file, and click Upload at the
bottom of the page.
After uploading the file, the bucket dashboard looks like below:
Step 4-b: Click the Create folder icon and name the folder inputFile.
Then go inside the inputFile folder and upload your input file, named sample.txt, which is given on
Canvas. After uploading the file, the dashboard looks like below:
Step 4-c: Navigate to the JAR file's S3 URI:
Click "Switch to the old console", which is more user-friendly for creating clusters.
After switching to the old console, click Create cluster.
After clicking Create cluster, enter a unique cluster name and then choose "Go to advanced options".
On the next page, scroll down, choose the step type "Custom JAR", and then click Add step.
The Add step screen looks as follows:
In the Name column we have to give the Java class name in which we wrote our code. In my case
the name is WordCount (make sure you give your own class name; only then will the cluster run
successfully).
For JAR location we have to take the URI from the S3 bucket; we already noted down the location in Step 4-c.
For Arguments we have to give both the input and output files, separated by a space.
Giving the input file: We already noted down the input file location at Step 4-d. In my case my input file S3
URI is: s3://bigdata5300activity2/inputFile/sample.txt
Giving the output file: For the output file there is no need to create anything manually; whenever we run
our cluster, the AWS S3 bucket will create the output files in the path specified in our arguments. So I
just give the name of the output file:
s3://bigdata5300activity2/inputFile/output118
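Based on the values above, the completed Add step form would look roughly like the sketch below; the JAR file name in the bucket is an assumption, so use the S3 URI you noted in Step 4-c.

```
Name:          WordCount    (your own driver class name)
JAR location:  s3://bigdata5300activity2/Ex2.jar    (assumption: the URI noted in Step 4-c)
Arguments:     s3://bigdata5300activity2/inputFile/sample.txt s3://bigdata5300activity2/inputFile/output118
```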
After adding all the arguments, it should look like below:
Then click Add, then Next, and Next again. In step 3 (General Cluster Settings), make sure our cluster
name is reflected; if not, rename it.
After checking the name, click Next and select Create cluster.
The cluster runs internally; it takes some time to execute the steps and produce our output.
Finally, our cluster will execute all the steps successfully.
NOTE: For both Task 1 and Task 2 the steps are the same (1. creating a JAR and adding that JAR to our
cluster, 2. passing the input file to the cluster), so we do not explain Task 2 separately.