
CSCE5300 Introduction to Big Data and Data Science

Activity 2
Eclipse project creation instructions and execution of Activity 2
Using Hadoop on Cloudera and AWS

Hadoop on Cloudera

Purpose of Activity 2

In this activity we will learn how to use the Hadoop MapReduce model, store data in HDFS, and visualize data using Hue, both on Cloudera and on AWS.

QUESTION:

Provide a detailed understanding of the activity at the end and perform the tasks wherever mentioned.

Use case description:

Map Phase:
Input Splitting: The input data is divided into smaller chunks called input splits.
Map Function Execution: The input splits are processed in parallel across worker nodes. The Map function processes the input data and produces intermediate key-value pairs.
Intermediate Key-Value Pairs: The Map function generates these intermediate pairs, which are then grouped by key.

Shuffle and Sort Phase:
Partitioning: The intermediate key-value pairs are partitioned based on the key's hash value. Each partition is assigned to a reducer.
Sorting: Within each partition, the intermediate pairs are sorted by their keys. This step is crucial for efficient grouping and reducing.

Reduce Phase:
Reduce Function Execution: Each reducer processes one partition of the sorted
intermediate data. The Reduce function takes the sorted key-value pairs and
performs computations on them.
Output Generation: The Reduce function generates the final output key-value pairs,
which are typically aggregated results or summaries.
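
To make the three phases concrete, below is a minimal sketch of the classic word-count job using the standard Hadoop MapReduce API. This is the textbook example rather than the exact code from Canvas: the map function emits a (word, 1) pair for every word, the framework shuffles and sorts the pairs by key, and the reduce function sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // intermediate key-value pair
            }
        }
    }

    // Reduce phase: the framework has already grouped and sorted by key,
    // so the reducer only has to sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum)); // final output pair
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}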

Activity Overview
WORD COUNT: (VOWELS and CONSONANTS)

• Task 1: Use code block 1 and count the frequency of words that start with the letter 'a'.

WORD COUNT: (EVEN and ODD)

• Task 2: Use code block 2 and count the frequency of words that have an odd count.

Eclipse Project creation steps

Step by step instructions:

File > New > Java Project > Next.

Enter "WordCount" as the project name and click "Finish".
Getting references to the Hadoop libraries
Right-click on the WordCount project and select "Properties".

Hit "Add External JARs...", then navigate to File System > usr > lib > hadoop.
Select all the JARs and click "OK".
We need to add more external libraries: go to "Add External JARs..." again, grab all the libraries in "client", then hit "OK".
CODE BLOCK 1:

WORD COUNT: (VOWELS and CONSONANTS)

Code block 1 counts the frequency of words in the given text file whose first letter is one of two chosen vowels and whose second letter is one of two chosen consonants (a sketch of the mapper logic appears below).
Example: starts with A, E (vowels) followed by S, R (consonants).

Vowels: words that start with the letters 'a', 'e', 'i', 'o', 'u'.
Consonants: all letters other than the vowels.

The code is available in the Ex2.java file on Canvas under Activity 2.
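
The exact Ex2.java may differ; purely as an illustration of the filtering idea, a mapper along these lines would emit a word only when its first letter is one of the chosen vowels and its second letter is one of the chosen consonants ('a'/'e' followed by 's'/'r' here, matching the example above). The reducer can remain the summing reducer from the word-count sketch earlier.

// Hypothetical mapper for code block 1 (the actual Ex2.java on Canvas may differ).
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
        String w = itr.nextToken().toLowerCase();
        // Keep only words starting with a vowel ('a' or 'e')
        // followed by a consonant ('s' or 'r').
        if (w.length() >= 2
                && (w.charAt(0) == 'a' || w.charAt(0) == 'e')
                && (w.charAt(1) == 's' || w.charAt(1) == 'r')) {
            context.write(new Text(w), new IntWritable(1));
        }
    }
}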

Task 1

Use code block 1 and count the frequency of words that start with the letter 'a'.

Please perform Task 1 using code block 1 as a reference.
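
For Task 1 the filter simplifies to checking only the first letter. Assuming the same mapper structure as the sketch above, the condition becomes:

// Task 1 variant (assumed structure): keep only words starting with 'a'.
if (!w.isEmpty() && w.charAt(0) == 'a') {
    context.write(new Text(w), new IntWritable(1));
}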

Implementation for Code Block 1

Step 1: Initially, place sample.txt in the Downloads folder and display its contents in the command prompt using the command below.
Command: cat /home/cloudera/Downloads/sample.txt
Here cat is used to display the file contents in the command prompt.
Step 2: Create a directory named pravallika and place sample.txt in that folder using the commands below.
Commands: hadoop fs -mkdir pravallika
hadoop fs -put /home/cloudera/Downloads/sample.txt pravallika/
Here the mkdir command creates a directory in HDFS, and the put command copies the file from the local file system (/Downloads/sample.txt) to HDFS (pravallika/).
Visualize sample.txt in Hue

Create a new class named Ex2 under the WordCount project.

Once the class is created, copy the code from the Ex2.java file, which is available on Canvas under Activity 2.

Once the code is saved, right-click on the project -> Export -> select JAR file under Java -> name the JAR file -> Next -> Finish.

Using the command below we can run the JAR file in the command prompt and visualize the output in Hue.

Command: hadoop jar /home/cloudera/Ex2.jar Ex2 pravallika/sample.txt Ex2_output

Command explanation:
/home/cloudera/Ex2.jar: the JAR is located at this path
pravallika/sample.txt: the input file is in the pravallika directory
Ex2_output: the name of the output directory

Visualize the Output in Hue

To display the output in the command prompt we use the command below.
Command: hadoop fs -cat Ex2_output/part-r-00000
Here cat is used to display the content in the command prompt.

CODE BLOCK 2: (EVEN AND ODD COUNTS)

Code block 2 counts the frequency of words that have an even count (a sketch of the reducer logic appears below).

The code is available in the even.java file on Canvas under Activity 2.
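
The exact even.java may differ; conceptually, the mapper can stay the plain word-count mapper, and the even/odd test happens in the reducer after the counts have been summed. A minimal sketch, assuming that structure:

// Hypothetical reducer for code block 2 (the actual even.java on Canvas may differ):
// emit only words whose total count is even.
public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    if (sum % 2 == 0) { // keep only even counts
        context.write(key, new IntWritable(sum));
    }
}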

Task 2

Use code block 2 and count the frequency of words that have an odd count.

Please perform Task 2 using code block 2 as a reference.
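
For Task 2, assuming the reducer sketch above, the only change is flipping the test so that odd counts are kept:

// Task 2 variant (assumed structure): keep only words with an odd count.
if (sum % 2 == 1) {
    context.write(key, new IntWritable(sum));
}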


Implementation for Code Block 2

Create a new class named even under the WordCount project.

Once the class is created, copy the code from the even.java file, which is available on Canvas under Activity 2.

Once the code is saved, right-click on the project -> Export -> select JAR file under Java -> name the JAR file -> Next -> Finish.

Using the command below we can run the JAR file in the command prompt and visualize the output in Hue.

Command: hadoop jar /home/cloudera/even.jar even pravallika/sample.txt even_output

Command explanation:
/home/cloudera/even.jar: the JAR is located at this path
pravallika/sample.txt: the input file is in the pravallika directory
even_output: the name of the output directory

Visualize the Output in Hue

To display the output in the command prompt we use the command below.
Command: hadoop fs -cat even_output/part-r-00000
Here cat is used to display the content in the command prompt.


List of Commands using hadoop fs and hdfs dfs

List Files and Directories:

Using hadoop fs:

hadoop fs -ls /user/myuser

Using hdfs dfs:

hdfs dfs -ls /user/myuser

Both commands will list the contents of the /user/myuser directory in HDFS.

Create Directory:

Using hadoop fs:

hadoop fs -mkdir /user/myuser/data

Using hdfs dfs:


hdfs dfs -mkdir /user/myuser/data

Both commands will create a new directory named data within the
/user/myuser directory in HDFS.

Copy File from Local to HDFS:

Using hadoop fs:

hadoop fs -copyFromLocal localfile.txt hdfs://namenode:8020/user/myuser/data/

Using hdfs dfs:

hdfs dfs -copyFromLocal localfile.txt hdfs://namenode:8020/user/myuser/data/

Both commands will copy the local file localfile.txt to the data directory
in HDFS.

Move File:

Using hadoop fs:

hadoop fs -mv /user/myuser/data/file.txt /user/myuser/archive/

Using hdfs dfs:

hdfs dfs -mv /user/myuser/data/file.txt /user/myuser/archive/

Both commands will move the file file.txt from the data directory to the
archive directory in HDFS.

Delete File:

Using hadoop fs:

hadoop fs -rm /user/myuser/archive/file.txt

Using hdfs dfs:

hdfs dfs -rm /user/myuser/archive/file.txt

Both commands will delete the file file.txt from the archive directory in HDFS.

Difference between hadoop fs and hdfs dfs

• hadoop fs is a more generic command-line utility that can interact with various file systems, while hdfs dfs is specifically designed for HDFS operations.
• The syntax of the commands is almost identical between the two utilities for HDFS operations.
• The main difference lies in their scope: hadoop fs can interact with other file systems (such as the local file system, S3, and others), while hdfs dfs is limited to HDFS operations.
Using AWS

Step 1: Log in to the AWS account and create an S3 bucket.

Step 2: Choose a unique bucket name and, at the end, click the "Create bucket" option.

Step 3: After the bucket is created successfully, click on the bucket name (to enter the bucket).
Step 4: After entering the bucket, we have to do the following:

a. Upload the JAR file that was created in Cloudera to our bucket.
b. Create an input folder in the AWS bucket and, inside the input folder, upload sample.txt, which contains the input data.
c. Note down the JAR file's S3 URI, which will be needed when adding execution steps to the cluster.
d. Note down the sample file's S3 URI, which will be needed when adding execution steps to the cluster.

Step 4-a: Click the upload button, choose "Add files", upload the JAR file, and click "Upload" at the bottom of the page.

After uploading the file, the bucket dashboard looks like below:
Step 4-b: Click the "Create folder" icon and name it inputFile.

Then go inside the inputFile folder and upload the input file named sample.txt, which is given on Canvas. After uploading the file, the dashboard looks like below:
Step 4-c: Navigating to the JAR file's S3 URI:

Click on the JAR file.

Note down the S3 URI.

Step 4-d: Navigating to the input file's S3 URI:

Click on the inputFile folder and select the input file.

Note down the S3 URI.

Now search for EMR and navigate to the EMR page.

Click on "Switch to the old console", which is more user friendly for creating clusters.
After switching to the old console, click on "Create cluster".

After clicking "Create cluster", enter a unique cluster name and then choose "Go to advanced options".

On the next page, scroll down, choose the step type "Custom JAR", and then click "Add step".
The "Add step" screen looks as follows:

In the Name column we have to give the Java class name where the code is written. In this case the name is WordCount (make sure to give your own class name, otherwise the cluster step will not run successfully).
The JAR location is taken from the S3 bucket; we already noted it down in Step 4-c.

For Arguments, we have to give both the input and output paths separated by a space.

Giving the input file: the input file location was already noted down in Step 4-d. In this case the input file's S3 URI is: s3://bigdata5300activity2/inputFile/sample.txt

Giving the output file: there is no need to create the output folder manually; whenever the cluster runs, it will create the output files at the path specified in the arguments. So we just give the name of the output path:

s3://bigdata5300activity2/inputFile/output118
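
These two arguments arrive in the driver as args[0] and args[1]; assuming the standard driver pattern from the word-count sketch earlier, they are wired into the job like this:

// In the driver's main() (standard pattern; the actual class may differ):
FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. s3://bigdata5300activity2/inputFile/sample.txt
FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. s3://bigdata5300activity2/inputFile/output118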
After adding all the arguments, it should look like below:
Then click "Add", then "Next" through the following pages. In step 3 (General Cluster Settings), make sure the cluster name is reflected correctly; if not, rename it.

After changing the name, click "Next" and select "Create cluster".

The cluster runs internally; it takes some time to execute the steps and produce the output.
Finally, the cluster will have executed all the steps successfully.

For the output files we have to go back to the S3 bucket.

If we go inside the inputFile folder, we can see the output118 folder, which contains the output files generated by the cluster.

NOTE: The steps are the same for both Task 1 and Task 2 (1. creating the JAR and adding it to the cluster; 2. passing the input file to the cluster), so Task 2 is not explained separately.
