Hadoop Imp Commands
HADOOP COMMANDS
To Start Hadoop (Present Working Directory should be Hadoop Folder)
bin/start-all.sh
Note:
HDFS Daemons: NN, SNN, DN
MR Daemons: JT, TT
Note: This command should be EXECUTED only when Hadoop is being installed. It is
advised not to use this command once data has been inserted into HDFS.
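To confirm which daemons came up, the jps command can be used (a sketch; the exact process list depends on your setup):
jps
A healthy single-node setup typically lists NameNode, SecondaryNameNode, DataNode, JobTracker and TaskTracker (plus Jps itself).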
To Create Directory
hadoop dfs -mkdir /NewDirectory/Dir1
To Copy File
From Local System to HDFS
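A typical invocation using the standard -copyFromLocal option (the local path is illustrative):
bin/hadoop dfs -copyFromLocal /home/administrator/localfile.txt /NewDirectory/Dir1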
To Move: The file will be copied to HDFS and then deleted from the source path.
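A sketch using the standard -moveFromLocal option (paths are illustrative):
bin/hadoop dfs -moveFromLocal /home/administrator/localfile.txt /NewDirectory/Dir1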
To Delete Directory
bin/hadoop dfs -rm /ClientJavaFolder
To Delete Files which start with the same name and end with different characters
bin/hadoop dfs -rm /ClientJavaFolder/File*
Ex: If the directory ClientJavaFolder has 3 files and you want to delete all the files whose names start
with File, then you need to use the above command, i.e. the * wildcard.
File1.txt
File2.txt
HelloFile.txt
Ex:
bin/hadoop dfs -du /
Found 2 items
238925722 hdfs://localhost:50000/ClientJavaFolder
119462861 hdfs://localhost:50000/NewFile
Note: Only directories are given execute permission; files can be given only READ and WRITE
permissions.
Format:
bin/hadoop jar [jarpath] [classPath] [inputpath] [outputpath]
Use the command below if the JAR was EXPORTED along with the class name (i.e. the main class is set in the JAR, so it does not need to be passed)
bin/hadoop jar /home/administrator/Print.jar
/NewJavaFolder/NewRenameFile.txt /hello.txt
Format:
bin/hadoop jar [streaming jar path]
\-file [mapper file path]
-mapper [mapper file]
\-file [reducer file path]
-reducer [reducer file]
-input [input file path]
-output [output file path]
Note: Spaces are not allowed after the BACKSLASH in the above command.
Ex: \ -file is wrong
\-file is correct
The whole command should be typed as ONE LINE.
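A one-line sketch, assuming hypothetical mapper.py and reducer.py scripts and the streaming JAR shipped under contrib (the JAR name depends on the Hadoop version):
bin/hadoop jar contrib/streaming/hadoop-streaming.jar \-file /home/administrator/mapper.py -mapper mapper.py \-file /home/administrator/reducer.py -reducer reducer.py -input /data10 -output /streamOut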
Generic Parser Used via PRE-DEFINED Command (Using LOCAL HOST method)
HIVE
Without Partition:
create table patient(pid INT,pname STRING,drug STRING,gender STRING,tot_amt INT)
row format delimited fields terminated by ',' stored as textfile;
With Partition:
create table country(cid INT, name STRING,cntry STRING, joindate STRING) partitioned
by(country STRING, jdate STRING) row format delimited fields terminated by ',' stored as
textfile;
From HDFS:
load data inpath '/my30Mb_file' into table patient;
From Local
load data local inpath '/home/administrator/Desktop/US' into table patient;
Partitioned by:
load data local inpath '/home/administrator/Desktop/US' into table country partition
(country='US',jdate='01-05-2010');
Note: All partition variables must be used when loading data. Ex: country and jdate are
partition variables. If we try to use country alone, it will not work; we need to use all the variables.
To Local
insert overwrite local directory '/home/administrator/h' select * from patient;
(Use the above query when only one column needs to be stored. For storing more columns,
see the note section below.)
Note: A formatting problem occurs when we store to the local drive, i.e. data is written
without spaces or any readable delimiters. We can use the query below to set our own delimiter.
This delimiter can be anything, irrespective of the default delimiter given in the CREATE
statement.
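A minimal sketch of such a query, assuming a Hive version that supports ROW FORMAT in INSERT OVERWRITE DIRECTORY (0.11 or later) and a ',' delimiter:
insert overwrite local directory '/home/administrator/h' row format delimited fields terminated by ',' select * from patient;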
Partitions in HIVE
From HDFS:
load data inpath '/home/administrator/Desktop/US' into table country partition
(country='US',jdate='01-05-2010');
From Local
load data local inpath '/home/administrator/Desktop/US' into table country partition
(country='US',jdate='01-05-2010');
To Add Jar:
add jar file:///home/administrator/hive.jar
To Create Temporary Function:
create temporary function myfn as 'org.samples.hive.training.hiveUDAF';
Note: Hive will allow only one instance of HWI; if you try to execute this in multiple terminals,
you will end up with an error.
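Once registered, the temporary function can be called like a built-in function; a hypothetical call (assuming myfn accepts a single column):
select myfn(pname) from patient;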
Buckets:
create table bucket (cid INT, cname STRING) clustered by (cid) into 3 buckets row format
delimited fields terminated by ',' stored as textfile;
Note: When you load the same data again into the same directory, a separate copy is created.
In the normal method without bucketing, the existing data is replaced by the newly loaded data.
HBASE
To Start Hbase (Present Working Directory should be Hbase Folder)
bin/start-hbase.sh
To create table
create 'mercury','Personal','Medical'
To insert in table
put 'mercury','001','Personal:Name','Bala'
Note:
1. By default, we will have 3 versions of data. This is configurable at table creation. The
example below shows how the number of versions can be changed.
create 'sam1', {NAME => 'f1'},{NAME =>'f2',VERSIONS => 5}
2. If the table is disabled, the put statement will not work. To find out whether the table is
enabled or disabled, use the command below:
is_disabled 'table_name'
To retrieve values
Using GET Method
get 'mercury','001'
Note: This will display latest version. To display all versions of data, use below command.
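A sketch using the standard shell syntax (column and version count are illustrative):
get 'mercury','001',{COLUMN => 'Personal:Name', VERSIONS => 3}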
To delete values
delete 'dummy','002','f1:name' (specific family wise)
delete 'dummy','002' (specific row key wise)
To drop table
drop 'mercury'
Note: To drop a table we need to disable it first. Use disable 'mercury' to disable the table.
Enable/Disable/Drop/Exists
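For reference, the corresponding shell commands (table name is illustrative):
enable 'mercury'
disable 'mercury'
drop 'mercury'
exists 'mercury'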
PIG
To Start PIG Service
To load data from HDFS (MapReduce mode)
bin/pig
To load data from Local System (local mode)
bin/pig -x local
Note:
1. If the mode is HDFS, the output will be available in the /tmp directory of HDFS.
2. If the mode is LOCAL, the output will be available in the local /tmp folder (click the
COMPUTER link in the left pane to see this folder).
To Load Data
A = load '/data10' using PigStorage(',');
A = load '/data10' using PigStorage(',') as (pid:int, pname:chararray, drug:chararray,
gender:chararray, amt:int);
To Filter Data
B = filter A by $2 =='avil';
Login as
bin/pig
To Load Data
C= load '/data10' using PigStorage(',') as (pid:int, pname:chararray,
drug:chararray, gender:chararray, tot_amt:int);
To Filter Data
D = filter C by drug=='avil'; (we use column names rather than $ positional references.)
X = filter C by (f1==8) OR (NOT (f2+f3 > f1));
To Display Data
dump A
To Describe Bag
describe A
To Illustrate Data
illustrate A
To Use Group Fn
G = GROUP C by drug;
sm = foreach G generate group,SUM(C.tot_amt) as S;
To Use CoGroup Fn
A = LOAD '/data10' using PigStorage(',');
B = LOAD '/drug' using PigStorage(',');
C = COGROUP A by $2, B by $1;
To Use Distinct
A = LOAD '/tuple2' using PigStorage(',') as (f1:int, f2:int, f3:int);
B = GROUP A by f1;
C = foreach B generate group,SUM(A.f2) as S;
dump C
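DISTINCT itself removes duplicate tuples from a relation; a minimal sketch using the same relation A loaded above:
D = DISTINCT A;
dump D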
Split Fn
A = LOAD '/patientHR-10' using PigStorage(',') ;
SPLIT A into x if $0%2==0, z if $0%2==1;
dump x
dump z
Note: This can be used to do multiple operations in a single query, e.g. Male records in one
variable and Female records in another variable.
To Use MACRO
DEFINE myMacro1(fTable,fCol,fValue) returns returnVariable
{ $returnVariable = FILTER $fTable BY $fCol == '$fValue';};
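A hypothetical invocation of the macro defined above (relation, column and value are illustrative; quoted string arguments are assumed to be substituted without their quotes during expansion):
E = myMacro1(C, 'drug', 'avil');
dump E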
To Use Joins
A = load '/data10' using PigStorage(',');
B = load '/drug' using PigStorage(',');
C = join A by $2, B by $1;
To Sort Data
A = LOAD '/data10' using PigStorage(',');
B = foreach A generate $0..$4;
C = order B by $4;
C = order B by $4..$6;
C = order B by $4..$6,$7;
Tuples/Bag/Maps
Tuple:
A = LOAD '/tuple' using PigStorage(' ') AS (t1:tuple(t1a:int,
t1b:int,t1c:int),t2:tuple(t2a:int,t2b:int,t2c:int));
(Refer PIG.txt, search the above query for more explanation)
To Use Tuple
B = FOREACH A GENERATE t1.t1a,t2.t2a;
Bag:
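A minimal sketch of loading and using a field typed as a bag, assuming a hypothetical input file '/bagdata' with lines such as 1 {(2,3),(4,5)}:
B = LOAD '/bagdata' using PigStorage(' ') AS (f1:int, f2:bag{t:tuple(b1:int,b2:int)});
C = FOREACH B GENERATE f1, FLATTEN(f2);
dump C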
Conventions of TUPLE/BAG/MAP:
(1,2,3) is a tuple
{(1,2)} is a bag (a bag is a collection of tuples)
{ (1,3),(3,4) } is tuples inside a bag
(123,10,{ (1,3),(3,4) } ) is a bag inside a tuple
[key#value] is MAP
Ex: [ 'name' # 'John', 'ext' # 5555 ]
SQOOP
To check SQOOP & DB Connectivity
bin/sqoop eval --connect jdbc:mysql://localhost/test -username root -password root
--query "select * from patient";
To List Databases
bin/sqoop list-databases --connect jdbc:mysql://localhost/information_schema -username
root -password root
Commands Descriptions
-m <n> Number of mappers to run.
--query This command is used to accept the SQL query. If you use this command, --table
should not be used.
--columns <col> If you need to fetch specific columns of a table, use this command.
Ex: --columns col1,col2,col3
--num-mappers <n> It is the same as the -m option; it specifies how many mappers will be running.
--warehouse-dir Same as --target-dir, but the difference is that it will create a directory with the
TABLE name and will store all the details inside the newly created dir.
Ex: Consider PatientMR is the table which we are going to store in HDFS;
a directory named PatientMR is created and the data is stored inside it.
-conf <configuration file> Specify an application configuration file.
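A sketch combining several of the options above (database, table, column and directory names are illustrative):
bin/sqoop import --connect jdbc:mysql://localhost/test --username root --password root --table patient --columns pid,pname,drug -m 2 --warehouse-dir /sqoopImport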
Export Commands
Commands Descriptions
--input-escaped-by <char> Sets the input escape character
--input-fields-terminated-by <char> Sets the input field separator
--input-lines-terminated-by <char> Sets the input end-of-line character
--input-optionally-enclosed-by <char> Sets a field enclosing character
Commands Descriptions
--update-mode allowinsert By adding this command, the user is allowed to insert new records which
are not already available in the target table.
--update-key <column name> This is used to update the table based on the column name that is given.
--verbose Print more information while working
Commands Descriptions
bin/sqoop job --list To display all the Sqoop JOB that are created.
bin/sqoop job --exec <jobname> To execute the job created.
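Jobs must first be created before they can be listed or executed; a sketch (job and table names are illustrative):
bin/sqoop job --create patientjob -- import --connect jdbc:mysql://localhost/test --username root --password root --table patient -m 1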
FLUME
To RUN Flume
bin/flume-ng agent --conf-file netcat_flume.conf --name a1 -Dflume.root.logger=INFO,console
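A minimal sketch of what netcat_flume.conf might contain (a netcat source, memory channel and logger sink; names and port are illustrative):
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.channels.c1.type = memory
a1.sinks.k1.type = logger
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1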
Sources
Attributes covered below: type*, bind*, port*, interceptors (* = mandatory)
Attributes Description
type*
What type of transfer method Flume uses. Ex:
a1.sources.r1.type = netcat (Terminal based)
a1.sources.r1.type = exec (Log File based)
To point at a log file, the lines below are used. A batch size also needs to be given.
a1.sources.r1.command = tail -F /home/administration/data.log
a1.sources.r1.batchSize = 2
(Batch size is the max number of lines to read and send to the channel at a time)
a1.sources.r1.type = spooldir (Directory based)
bind*
IP address of the source machine. If the local machine is used, the form below is
followed.
a1.sources.r1.bind = localhost
port* Port number is mentioned here.
Interceptors
Interceptors act as intermediaries between the Source and the Channels. We need to specify the type
and header of the interceptor.
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = host
a1.sources.r1.interceptors.i1.hostHeader = hostname
Port Numbers: TBD
If you have any concerns or need some changes in this doc, please contact us via below mail address.
ballicon@gmail.com
rbharath4u@gmail.com
saikiran.isk@gmail.com
email2anandraj@gmail.com
silambarasan909@gmail.com