
CSCE5300 Introduction to Big Data and Data Science

Activity 2
Eclipse project creation instructions and execution of Activity 2
Using Hadoop on Cloudera and AWS

Hadoop on Cloudera

Purpose of Activity 2

In this activity we will learn how to use the Hadoop MapReduce model, store data in HDFS, and visualize data using Hue, both on Cloudera and on AWS.

QUESTION:

Provide a detailed understanding of the activity at the end and perform the tasks wherever mentioned.

Use case description:

Map Phase:
Input Splitting: The input data is divided into smaller chunks called input splits.
Map Function Execution: The input splits are processed in parallel across worker nodes. The Map function processes the input data and produces intermediate key-value pairs.
Intermediate Key-Value Pairs: The Map function generates these intermediate pairs, which are then grouped by key.

Shuffle and Sort Phase:
Partitioning: The intermediate key-value pairs are partitioned based on the key's hash value. Each partition is assigned to a reducer.
Sorting: Within each partition, the intermediate pairs are sorted by their keys. This step is crucial for efficient grouping and reducing.

Reduce Phase:
Reduce Function Execution: Each reducer processes one partition of the sorted
intermediate data. The Reduce function takes the sorted key-value pairs and
performs computations on them.
Output Generation: The Reduce function generates the final output key-value pairs,
which are typically aggregated results or summaries.
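
To make the three phases concrete, below is a minimal sketch of the classic word-count job using the standard Hadoop MapReduce API. This is the textbook example rather than the exact code from Canvas: the map function emits a (word, 1) pair for every word, the framework shuffles and sorts the pairs by key, and the reduce function sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // intermediate key-value pair
            }
        }
    }

    // Reduce phase: the framework has already grouped and sorted by key,
    // so the reducer only has to sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum)); // final output pair
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}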

Activity Overview
WORD COUNT: (VOWELS and CONSONANTS)

• Task 1: Use code block 1 and count the frequency of words that start with the letter 'a'.

WORD COUNT: (EVEN and ODD)

• Task 2: Use code block 2 and count the frequency of words that have an odd count.

Eclipse Project creation steps

Step by step instructions:

File > New > Java Project > Next.

Enter "WordCount" as the project name and click "Finish".
Getting references to the Hadoop libraries
Right-click on the WordCount project and select "Properties".

Hit "Add External JARs...", then navigate to File System > usr > lib > hadoop.
Select all the JARs and click "OK".
We need to add more external libraries: go to "Add External JARs..." again, grab all the libraries in "client", then hit "OK".
CODE BLOCK 1:

WORD COUNT: (VOWELS and CONSONANTS)

Code block 1 counts the frequency of words in the given text file whose first letter is one of two chosen vowels and whose second letter is one of two chosen consonants (a sketch of the mapper logic appears below).
Example: starts with A, E (vowels) followed by S, R (consonants).

Vowels: words that start with the letters 'a', 'e', 'i', 'o', 'u'.
Consonants: all letters other than the vowels.

The code is available in the Ex2.java file on Canvas under Activity 2.
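
The exact Ex2.java may differ; purely as an illustration of the filtering idea, a mapper along these lines would emit a word only when its first letter is one of the chosen vowels and its second letter is one of the chosen consonants ('a'/'e' followed by 's'/'r' here, matching the example above). The reducer can remain the summing reducer from the word-count sketch earlier.

// Hypothetical mapper for code block 1 (the actual Ex2.java on Canvas may differ).
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
        String w = itr.nextToken().toLowerCase();
        // Keep only words starting with a vowel ('a' or 'e')
        // followed by a consonant ('s' or 'r').
        if (w.length() >= 2
                && (w.charAt(0) == 'a' || w.charAt(0) == 'e')
                && (w.charAt(1) == 's' || w.charAt(1) == 'r')) {
            context.write(new Text(w), new IntWritable(1));
        }
    }
}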

Task 1

Use code block 1 and count the frequency of words that start with the letter 'a'.

Please perform Task 1 using code block 1 as a reference.
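
For Task 1 the filter simplifies to checking only the first letter. Assuming the same mapper structure as the sketch above, the condition becomes:

// Task 1 variant (assumed structure): keep only words starting with 'a'.
if (!w.isEmpty() && w.charAt(0) == 'a') {
    context.write(new Text(w), new IntWritable(1));
}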

Implementation for Code Block 1

Step 1: Initially, place sample.txt in the Downloads folder and display its contents in the command prompt using the command below.
Command: cat /home/cloudera/Downloads/sample.txt
Here cat is used to display the file contents in the command prompt.
Step 2: Create a directory named pravallika and place sample.txt in that folder using the commands below.
Commands: hadoop fs -mkdir pravallika
hadoop fs -put /home/cloudera/Downloads/sample.txt pravallika/
Here the mkdir command creates a directory in HDFS, and the put command copies the file from the local file system (/Downloads/sample.txt) to HDFS (pravallika/).
Visualize sample.txt in Hue

Create a new class named Ex2 under the WordCount project.

Once the class is created, copy the code from the Ex2.java file, which is available on Canvas under Activity 2.

Once the code is saved, right-click on the project -> Export -> select JAR file under Java -> name the JAR file -> Next -> Finish.

Using the command below we can run the JAR file in the command prompt and visualize the output in Hue.

Command: hadoop jar /home/cloudera/Ex2.jar Ex2 pravallika/sample.txt Ex2_output

Command explanation:
/home/cloudera/Ex2.jar: the JAR is located at this path
pravallika/sample.txt: the input file is in the pravallika directory
Ex2_output: the name of the output directory

Visualize the Output in Hue

To display the output in the command prompt we use the command below.
Command: hadoop fs -cat Ex2_output/part-r-00000
Here cat is used to display the content in the command prompt.

CODE BLOCK 2: (EVEN AND ODD COUNTS)

Code block 2 counts the frequency of words that have an even count (a sketch of the reducer logic appears below).

The code is available in the even.java file on Canvas under Activity 2.
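
The exact even.java may differ; conceptually, the mapper can stay the plain word-count mapper, and the even/odd test happens in the reducer after the counts have been summed. A minimal sketch, assuming that structure:

// Hypothetical reducer for code block 2 (the actual even.java on Canvas may differ):
// emit only words whose total count is even.
public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    if (sum % 2 == 0) { // keep only even counts
        context.write(key, new IntWritable(sum));
    }
}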

Task 2

Use code block 2 and count the frequency of words that have an odd count.

Please perform Task 2 using code block 2 as a reference.
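
For Task 2, assuming the reducer sketch above, the only change is flipping the test so that odd counts are kept:

// Task 2 variant (assumed structure): keep only words with an odd count.
if (sum % 2 == 1) {
    context.write(key, new IntWritable(sum));
}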


Implementation for Code Block 2

Create a new class named even under the WordCount project.

Once the class is created, copy the code from the even.java file, which is available on Canvas under Activity 2.

Once the code is saved, right-click on the project -> Export -> select JAR file under Java -> name the JAR file -> Next -> Finish.

Using the command below we can run the JAR file in the command prompt and visualize the output in Hue.

Command: hadoop jar /home/cloudera/even.jar even pravallika/sample.txt even_output

Command explanation:
/home/cloudera/even.jar: the JAR is located at this path
pravallika/sample.txt: the input file is in the pravallika directory
even_output: the name of the output directory

Visualize the Output in Hue

To display the output in the command prompt we use the command below.
Command: hadoop fs -cat even_output/part-r-00000
Here cat is used to display the content in the command prompt.


List of Commands using hadoop fs and hdfs dfs

List Files and Directories:

Using hadoop fs:

hadoop fs -ls /user/myuser

Using hdfs dfs:

hdfs dfs -ls /user/myuser

Both commands will list the contents of the /user/myuser directory in HDFS.

Create Directory:

Using hadoop fs:

hadoop fs -mkdir /user/myuser/data

Using hdfs dfs:


hdfs dfs -mkdir /user/myuser/data

Both commands will create a new directory named data within the
/user/myuser directory in HDFS.

Copy File from Local to HDFS:

Using hadoop fs:

hadoop fs -copyFromLocal localfile.txt hdfs://namenode:8020/user/myuser/data/

Using hdfs dfs:

hdfs dfs -copyFromLocal localfile.txt hdfs://namenode:8020/user/myuser/data/

Both commands will copy the local file localfile.txt to the data directory
in HDFS.

Move File:

Using hadoop fs:

hadoop fs -mv /user/myuser/data/file.txt /user/myuser/archive/

Using hdfs dfs:

hdfs dfs -mv /user/myuser/data/file.txt /user/myuser/archive/

Both commands will move the file file.txt from the data directory to the
archive directory in HDFS.

Delete File:

Using hadoop fs:

hadoop fs -rm /user/myuser/archive/file.txt

Using hdfs dfs:

hdfs dfs -rm /user/myuser/archive/file.txt

Both commands will delete the file file.txt from the archive directory in HDFS.

Difference between hadoop fs and hdfs dfs

• hadoop fs is a more generic command-line utility that can interact with various file systems, while hdfs dfs is specifically designed for HDFS operations.
• The syntax of the commands is almost identical between the two utilities for HDFS operations.
• The main difference lies in their scope: hadoop fs can interact with other file systems (such as the local file system, S3, and others), while hdfs dfs is limited to HDFS operations.
Using AWS

Step 1: Log in to the AWS account and create an S3 bucket.

Step 2: Choose a unique bucket name and, at the end, click the "Create bucket" option.

Step 3: After the bucket is created successfully, click on the bucket name (to enter the bucket).
Step 4: After entering the bucket, we have to do the following:

a. Upload the JAR file that was created in Cloudera to our bucket.
b. Create an input folder in the AWS bucket and, inside the input folder, upload sample.txt, which contains the input data.
c. Note down the JAR file's S3 URI, which will be needed when adding execution steps to the cluster.
d. Note down the sample file's S3 URI, which will be needed when adding execution steps to the cluster.

Step 4-a: Click the upload button, choose "Add files", upload the JAR file, and click "Upload" at the bottom of the page.

After uploading the file, the bucket dashboard looks like below:
Step 4-b: Click the "Create folder" icon and name it inputFile.

Then go inside the inputFile folder and upload the input file named sample.txt, which is given on Canvas. After uploading the file, the dashboard looks like below:
Step 4-c: Navigating to the JAR file's S3 URI:

Click on the JAR file.

Note down the S3 URI.

Step 4-d: Navigating to the input file's S3 URI:

Click on the inputFile folder and select the input file.

Note down the S3 URI.

Now search for EMR and navigate to the EMR page.

Click on "Switch to the old console", which is more user friendly for creating clusters.
After switching to the old console, click on "Create cluster".

After clicking "Create cluster", enter a unique cluster name and then choose "Go to advanced options".

On the next page, scroll down, choose the step type "Custom JAR", and then click "Add step".
The "Add step" screen looks as follows:

In the Name column we have to give the Java class name where the code is written. In this case the name is WordCount (make sure to give your own class name, otherwise the cluster step will not run successfully).
The JAR location is taken from the S3 bucket; we already noted it down in Step 4-c.

For Arguments, we have to give both the input and output paths separated by a space.

Giving the input file: the input file location was already noted down in Step 4-d. In this case the input file's S3 URI is: s3://bigdata5300activity2/inputFile/sample.txt

Giving the output file: there is no need to create the output folder manually; whenever the cluster runs, it will create the output files at the path specified in the arguments. So we just give the name of the output path:

s3://bigdata5300activity2/inputFile/output118
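
These two arguments arrive in the driver as args[0] and args[1]; assuming the standard driver pattern from the word-count sketch earlier, they are wired into the job like this:

// In the driver's main() (standard pattern; the actual class may differ):
FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. s3://bigdata5300activity2/inputFile/sample.txt
FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. s3://bigdata5300activity2/inputFile/output118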
After adding all the arguments, it should look like below:
Then click "Add", then "Next" through the following pages. In step 3 (General Cluster Settings), make sure the cluster name is reflected correctly; if not, rename it.

After changing the name, click "Next" and select "Create cluster".

The cluster runs internally; it takes some time to execute the steps and produce the output.
Finally, the cluster will have executed all the steps successfully.

For the output files we have to go back to the S3 bucket.

If we go inside the inputFile folder, we can see the output118 folder, which contains the output files generated by the cluster.

NOTE: The steps are the same for both Task 1 and Task 2 (1. creating the JAR and adding it to the cluster; 2. passing the input file to the cluster), so Task 2 is not explained separately.
