
Exploring Hadoop Ecosystem with Simple Linux Commands

Overview: This assignment is intended to familiarize you with the Hadoop
Ecosystem using simple Linux commands

Prerequisites:
1. Google account OR Google Gmail account
• Before proceeding, the user should have a Google account or a Gmail account at hand
• If not, the user should create a new one first

2. Access Google Cloud Platform (GCP) console


• The user should be able to access his/her Google Cloud Platform (GCP) console

3. An existing project to host the Hadoop-Spark cluster


• The user has an existing project under this account to host the cluster that
will be created

4. GCP storage bucket ready for use


• The user has created a GCP storage bucket and has it ready for use

5. A GCP Hadoop and Spark cluster was created with 1 Manager Node and 2 Worker Nodes.
The nodes must be turned on for this assignment

NOTES: Please see the following documents if you need a refresher:


• How to Setup a GCP Account with Free Credit
• How to Create Projects in GCP
• How to Create New Storage Buckets in GCP
• How to Create a Hadoop and Spark Cluster in GCP

VERY IMPORTANT: Be sure all nodes are running in GCP
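
If you have the gcloud CLI installed, you can verify this from your own machine before
opening any SSH windows. This is only a sketch: the region below is illustrative, and the
cluster name is taken from the node prompt shown later in this document; substitute your
own values.

    # List Dataproc clusters and their status in your region
    gcloud dataproc clusters list --region=us-central1

    # Show details, including the running status, for one cluster
    gcloud dataproc clusters describe hadoop-spark-2-cluster --region=us-central1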

Step One: Start all 3 nodes in the cluster you have already created
• Then click on the chevron next to the SSH button
• Click on “Open in browser window”
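
As an alternative to the browser window, if you have the gcloud CLI installed locally you
can SSH into the manager node directly. The instance name matches the prompt shown later
in this document; the zone below is an illustrative assumption, so use your own:

    gcloud compute ssh hadoop-spark-2-cluster-m --zone=us-central1-a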
Step Two: Explore the Cluster in Hadoop
• Open terminal via SSH in GCP
• See all the services of Hadoop in our cluster
• Use the commands
o whoami
o pwd
o These commands show you your username and your home directory (see the
sample session after this list)

• Try other commands from the lecture notes, but be careful when you are deleting
something.
• Enter the command
o ps -ef | grep -i hadoop
o This will list all the processes currently running (sample output shown below)
o Remember that all of these services were set up when we created the Hadoop
and Spark cluster with Dataproc
• How to scroll up and down in the terminal window
o Click on the Settings icon in the upper right-hand side of the terminal
o Click on “Show Scrollbar” to see the scrollbar

• You can scroll the terminal using your mouse wheel or trackpad. Alternatively, the
Ctrl+Shift+PageUp/Ctrl+Shift+PageDn keyboard shortcuts scroll the terminal on
Windows and Linux, and Fn+Shift+Up/Fn+Shift+Down scroll the terminal on macOS
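
For reference, a quick session on the manager node might look like the following sketch
(the username and home directory are illustrative and will reflect your own account):

    zorhan@hadoop-spark-2-cluster-m:~$ whoami
    zorhan
    zorhan@hadoop-spark-2-cluster-m:~$ pwd
    /home/zorhan

The output of ps -ef | grep -i hadoop on the manager node is shown below: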
zorhan@hadoop-spark-2-cluster-m:~$ ps -ef | grep -i hadoop
hive 769 1 5 00:28 ? 00:00:16 /usr/lib/jvm/temurin-8-jdk-
amd64/bin/java -Xmx256m -Dhive.log.dir=/var/log/hive -Dhive.log.file=hive-
metastore.log -Dhive.log.threshold=INFO -
Dhadoop.log.dir=/usr/lib/hadoop/logs -Dhadoop.log.file=hadoop.log -
Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str= -
Dhadoop.root.logger=INFO,console -
Djava.library.path=/usr/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-
policy.xml -Xmx8027m -Dproc_metastore -Dlog4j2.formatMsgNoLookups=true -
Dlog4j.configurationFile=hive-log4j2.properties -
Djava.util.logging.config.file=/usr/lib/hive/conf/parquet-
logging.properties -Dhadoop.security.logger=INFO,NullAppender
org.apache.hadoop.util.RunJar /usr/lib/hive/lib/hive-metastore-2.3.7.jar
org.apache.hadoop.hive.metastore.HiveMetaStore
hive 771 1 5 00:28 ? 00:00:18 /usr/lib/jvm/temurin-8-jdk-
amd64/bin/java -Xmx256m -Dhive.log.dir=/var/log/hive -Dhive.log.file=hive-
server2.log -Dhive.log.threshold=INFO -
Dhadoop.log.dir=/usr/lib/hadoop/logs -Dhadoop.log.file=hadoop.log -
Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str= -
Dhadoop.root.logger=INFO,console -
Djava.library.path=/usr/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-
policy.xml -Xmx8027m -Dproc_hiveserver2 -Dlog4j2.formatMsgNoLookups=true -
XX:+UseConcMarkSweepGC -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -
XX:+PrintGCDetails -Dlog4j.configurationFile=hive-log4j2.properties -
Djava.util.logging.config.file=/usr/lib/hive/conf/parquet-
logging.properties -Djline.terminal=jline.UnsupportedTerminal -
Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar
/usr/lib/hive/lib/hive-service-2.3.7.jar
org.apache.hive.service.server.HiveServer2
mapred 885 1 7 00:28 ? 00:00:22 /usr/lib/jvm/temurin-8-jdk-
amd64/bin/java -Dproc_historyserver -Xmx4000m -
Dhadoop.log.dir=/usr/lib/hadoop/logs -Dhadoop.log.file=hadoop.log -
Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str= -
Dhadoop.root.logger=INFO,console -
Djava.library.path=/usr/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-
policy.xml -Dhadoop.log.dir=/var/log/hadoop-mapreduce -
Dhadoop.log.file=hadoop.log -Dhadoop.root.logger=INFO,console -
Dhadoop.id.str=mapred -Dhadoop.log.dir=/usr/lib/hadoop/logs -
Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop -
Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -
Djava.library.path=/usr/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-
policy.xml -Dhadoop.log.dir=/var/log/hadoop-mapreduce -
Dhadoop.log.file=mapred-mapred-historyserver-hadoop-spark-2-cluster-m.log
-Dhadoop.root.logger=INFO,RFA -Dmapred.jobsummary.logger=INFO,JSA -
XX:+UseConcMarkSweepGC -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -
XX:+PrintGCDetails -Dhadoop.security.logger=INFO,NullAppender
org.apache.hadoop.mapreduce.v2.hs.JobHistoryServer
hdfs 897 1 3 00:28 ? 00:00:09 /usr/lib/jvm/temurin-8-jdk-
amd64/bin/java -Dproc_secondarynamenode -Xmx1000m -
Dhadoop.log.dir=/var/log/hadoop-hdfs -Dhadoop.log.file=hadoop-hdfs-
secondarynamenode-hadoop-spark-2-cluster-m.log -
Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str=hdfs -
Dhadoop.root.logger=INFO,RFA -
Djava.library.path=/usr/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-
policy.xml -Xmx6422m -XX:+UseConcMarkSweepGC -XX:+PrintGCTimeStamps -
XX:+PrintGCDateStamps -XX:+PrintGCDetails -
Dhadoop.security.logger=INFO,RFAS
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode
yarn 899 1 6 00:28 ? 00:00:20 /usr/lib/jvm/temurin-8-jdk-
amd64/bin/java -Dproc_resourcemanager -Xmx4000m -
Dhadoop.log.dir=/var/log/hadoop-yarn -Dyarn.log.dir=/var/log/hadoop-yarn -
Dhadoop.log.file=yarn-yarn-resourcemanager-hadoop-spark-2-cluster-m.log -
Dyarn.log.file=yarn-yarn-resourcemanager-hadoop-spark-2-cluster-m.log -
Dyarn.home.dir= -Dyarn.id.str=yarn -Dhadoop.root.logger=INFO,RFA -
Dyarn.root.logger=INFO,RFA -Djava.library.path=/usr/lib/hadoop/lib/native
-Dyarn.policy.file=hadoop-policy.xml -Xmx12844m -
Dhadoop.log.dir=/var/log/hadoop-yarn -Dyarn.log.dir=/var/log/hadoop-yarn -
Dhadoop.log.file=yarn-yarn-resourcemanager-hadoop-spark-2-cluster-m.log -
Dyarn.log.file=yarn-yarn-resourcemanager-hadoop-spark-2-cluster-m.log -
Dyarn.home.dir=/usr/lib/hadoop-yarn -Dhadoop.home.dir=/usr/lib/hadoop -
Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA -
Djava.library.path=/usr/lib/hadoop/lib/native -classpath
/etc/hadoop/conf:/etc/hadoop/conf:/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/
usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-
hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-
yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-
mapreduce/lib/*:/usr/lib/hadoop-
mapreduce/.//*:/usr/lib/spark/yarn/*::/usr/local/share/google/dataproc/lib
/*:/usr/local/share/google/dataproc/lib/*:/usr/local/share/google/dataproc
/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-
yarn/lib/*:/etc/hadoop/conf/rm-config/log4j.properties:/usr/lib/hadoop-
yarn/.//timelineservice/*:/usr/lib/hadoop-yarn/.//timelineservice/lib/*
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager
hdfs 901 1 5 00:28 ? 00:00:15 /usr/lib/jvm/temurin-8-jdk-
amd64/bin/java -Dproc_namenode -Xmx1000m -Dhadoop.log.dir=/var/log/hadoop-
hdfs -Dhadoop.log.file=hadoop-hdfs-namenode-hadoop-spark-2-cluster-m.log -
Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str=hdfs -
Dhadoop.root.logger=INFO,RFA -
Djava.library.path=/usr/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-
policy.xml -Xmx6422m -XX:+UseConcMarkSweepGC -XX:+PrintGCTimeStamps -
XX:+PrintGCDateStamps -XX:+PrintGCDetails -
Dhadoop.security.logger=INFO,RFAS
org.apache.hadoop.hdfs.server.namenode.NameNode
yarn 903 1 4 00:28 ? 00:00:14 /usr/lib/jvm/temurin-8-jdk-
amd64/bin/java -Dproc_timelineserver -Xmx4000m -
Dhadoop.log.dir=/var/log/hadoop-yarn -Dyarn.log.dir=/var/log/hadoop-yarn -
Dhadoop.log.file=yarn-yarn-timelineserver-hadoop-spark-2-cluster-m.log -
Dyarn.log.file=yarn-yarn-timelineserver-hadoop-spark-2-cluster-m.log -
Dyarn.home.dir= -Dyarn.id.str=yarn -Dhadoop.root.logger=INFO,RFA -
Dyarn.root.logger=INFO,RFA -Djava.library.path=/usr/lib/hadoop/lib/native
-Dyarn.policy.file=hadoop-policy.xml -XX:+UseConcMarkSweepGC -
XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCDetails -
XX:+UseConcMarkSweepGC -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -
XX:+PrintGCDetails -Djava.util.logging.config.file=/etc/hadoop/conf/yarn-
timelineserver.logging.properties -
Djava.util.logging.config.file=/etc/hadoop/conf/yarn-
timelineserver.logging.properties -Dhadoop.log.dir=/var/log/hadoop-yarn -
Dyarn.log.dir=/var/log/hadoop-yarn -Dhadoop.log.file=yarn-yarn-
timelineserver-hadoop-spark-2-cluster-m.log -Dyarn.log.file=yarn-yarn-
timelineserver-hadoop-spark-2-cluster-m.log -
Dyarn.home.dir=/usr/lib/hadoop-yarn -Dhadoop.home.dir=/usr/lib/hadoop -
Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA -
Djava.library.path=/usr/lib/hadoop/lib/native -classpath
/etc/hadoop/conf:/etc/hadoop/conf:/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/
usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-
hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-
yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-
mapreduce/lib/*:/usr/lib/hadoop-
mapreduce/.//*:/usr/lib/spark/yarn/*::/usr/local/share/google/dataproc/lib
/*:/usr/local/share/google/dataproc/lib/*:/usr/local/share/google/dataproc
/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-
yarn/lib/*:/etc/hadoop/conf/timelineserver-config/log4j.properties
org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistory
Server
root 1328 1 5 00:28 ? 00:00:17 /usr/bin/java -
XX:+AlwaysPreTouch -Xms1605m -Xmx1605m -XX:+CrashOnOutOfMemoryError -
XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/crash/google-
dataproc-agent.hprof -Djava.util.logging.config.file=/etc/google-
dataproc/logging.properties -cp /usr/local/share/google/dataproc/dataproc-
agent.jar:/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr
/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-
hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-
yarn/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-
mapreduce/.//*:/usr/local/share/google/dataproc/lib/*
com.google.cloud.hadoop.services.agent.AgentMain
/usr/local/share/google/dataproc/startup-script.sh
/usr/local/share/google/dataproc/post-hdfs-startup-script.sh
spark 1560 1 3 00:28 ? 00:00:11 /usr/lib/jvm/temurin-8-jdk-
amd64/bin/java -cp
/usr/lib/spark/conf/:/usr/lib/spark/jars/*:/etc/hadoop/conf/:/etc/hive/con
f/:/usr/local/share/google/dataproc/lib/*:/usr/share/java/mysql.jar -
Xmx4000m org.apache.spark.deploy.history.HistoryServer
zorhan 2737 2513 0 00:33 pts/0 00:00:00 grep -i hadoop

• So what does this all mean? These are all the components of the Hadoop Ecosystem.
o You see root with process number 1328. Every running program, including each
Hadoop component, gets a process number: an ID you can use to refer to that
program. It is very important in the Ecosystem because you can shut down a
process with a command that takes the process ID number (see the example
after this list)
o You also see mapred with process number 885, which is running the JobHistoryServer
o You see yarn with process number 899, which is running the ResourceManager
o Another yarn with process number 903, which is running the ApplicationHistoryServer
o You see hive with process number 769, which is running the Hive Metastore (see
below)
o You see hdfs with process number 901, which is running the NameNode
o Another hive with process number 771, which is running HiveServer2
o Lastly, you see spark with process number 1560, which is running the HistoryServer
o Is this sounding familiar?
o Take note of each service, the process ID of each service, and what each is running
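
As an example of using a process ID, the kill command takes a PID and stops that process.
This is only a sketch; do not actually run it on your cluster, and note that the PID used
here (885, the JobHistoryServer above) will differ on your own nodes:

    # Politely ask the process with PID 885 to shut down (sends SIGTERM)
    sudo kill 885

    # Force-kill it if it will not exit (sends SIGKILL); use only as a last resort
    sudo kill -9 885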

• Let’s look at Hive (details will be next week)

o The Metastore and Hive Server (engine) are critical to running Hive
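
You can confirm that both Hive processes are up from the same terminal. The PIDs will
differ on your cluster:

    # Should show both the HiveMetaStore and HiveServer2 processes
    ps -ef | grep -i hive | grep -v grep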


• Let’s look at the YARN Architecture
o The ResourceManager (on the Manager Node) is a major component of YARN
o This is so that it can work with the ApplicationMaster and the NodeManagers on
the worker nodes
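
A quick way to see the ResourceManager talking to the worker nodes is the yarn CLI. The
node names in the output will reflect your own cluster:

    # Ask the ResourceManager to list the NodeManagers it knows about
    yarn node -list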
• Let’s once again look at the HDFS Architecture
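
From the command line, you can see the HDFS architecture in action by asking the NameNode
to report on its DataNodes. On this cluster the report should list the 2 worker nodes:

    # Summarize HDFS capacity and list the live DataNodes
    # (if your user lacks HDFS superuser rights, run it as the hdfs user instead:
    #  sudo -u hdfs hdfs dfsadmin -report)
    hdfs dfsadmin -report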
