Homework Labs Lecture01
Getting Started
1. The VM is set to automatically log in as the user training. Should you log out at any time, you can log back in as the user training with the password training.
2. In some command-line steps in the labs, you will see lines like this:
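$ gunzip -c access_log.gz \
| hadoop fs -put - weblog/access_log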
The dollar sign ($) at the beginning of each line indicates the Linux shell prompt. The actual prompt will include additional information (e.g., [training@localhost workspace]$), but this is omitted from these instructions for brevity.
The backslash (\) at the end of the first line signifies that the command is not completed, and continues on the next line. You can enter the code exactly as shown (on two lines), or you can enter it on a single line; if you do, omit the trailing backslash.
3. Although many students are comfortable using UNIX text editors like vi or emacs, some might prefer a graphical text editor. To invoke the graphical editor from the command line, type gedit followed by the path of the file you wish to edit. Appending & to the command allows you to type additional commands while the editor is still open. Here is an example of how to edit a file named myfile.txt:
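$ gedit myfile.txt &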
In this lab you will begin to get acquainted with the Hadoop tools. You will manipulate files in HDFS, the Hadoop Distributed File System.
$ ~/scripts/developer/training_setup_dev.sh
Hadoop
Hadoop is already installed, configured, and running on your virtual machine. Most of your interaction with the system will be through a command-line wrapper called hadoop. If you run this program with no arguments, it prints a help message. To try this, run the following command in a terminal window:
$ hadoop
The hadoop command is subdivided into several subsystems. For example, there is a subsystem for working with files in HDFS and another for launching and managing MapReduce processing jobs.
1. Open a terminal window (if one is not already open) by double-clicking the Terminal icon on the desktop.
2. Enter:
$ hadoop fs
You see a help message describing all the commands associated with the FsShell subsystem.
3. Enter:
$ hadoop fs -ls /
This shows you the contents of the root directory in HDFS. There will be multiple entries, one of which is /user. Individual users have a “home” directory under this directory, named after their username; your username in this course is training, therefore your home directory is /user/training.
4. Try viewing the contents of the /user directory by running:
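$ hadoop fs -ls /user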
You will see your home directory in the directory listing.
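5. List the contents of your home directory by running:
$ hadoop fs -ls /user/training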
There are no files yet, so the command silently exits. This is different from running hadoop fs -ls /foo, which refers to a directory that doesn't exist. In that case, an error message would be displayed.
1. Change directories to the local filesystem directory containing the sample data we will be using in the homework labs.
$ cd ~/training_materials/developer/data
If you perform a regular Linux ls command in this directory, you will see a few files, including two named shakespeare.tar.gz and shakespeare-stream.tar.gz. Both contain the complete works of Shakespeare in text format, but they are packaged and organized differently. For now we will work with shakespeare.tar.gz.
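2. Unzip shakespeare.tar.gz; a standard tar command such as the following works:
$ tar zxvf shakespeare.tar.gz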
This creates a directory named shakespeare/ containing several files on your local filesystem.
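3. Copy this directory into HDFS; a hadoop fs -put command along these lines does it:
$ hadoop fs -put shakespeare /user/training/shakespeare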
This copies the local shakespeare directory and its contents into a remote HDFS directory named /user/training/shakespeare.
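4. List the contents of your HDFS home directory, which (as noted above) is /user/training:
$ hadoop fs -ls /user/training
5. Now try the same fs -ls command, but without specifying a path: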
$ hadoop fs -ls
You should see the same results. If you don't pass a directory name to the -ls command, it assumes you mean your home directory, i.e. /user/training.
Relative paths
If you pass any relative (non-absolute) paths to FsShell commands (or use
relative paths in MapReduce programs), they are considered relative to your
home directory.
6. We will also need a sample web server log file, which we will put into HDFS for use in future labs. This file is currently compressed using GZip. Rather than extract the file to the local disk and then upload it, we will extract and upload in one step. First, create a directory in HDFS in which to store it:
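$ hadoop fs -mkdir weblog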
7. Now, extract and upload the file in one step. The -c option to gunzip uncompresses to standard output, and the dash (-) in the hadoop fs -put command takes whatever is being sent to its standard input and places that data in HDFS.
$ gunzip -c access_log.gz \
| hadoop fs -put - weblog/access_log
8. Run the hadoop fs -ls command to verify that the log file is in your HDFS home directory.
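For example, listing the weblog directory should show the uploaded access_log file:
$ hadoop fs -ls weblog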
9. The access log file is quite large – around 500 MB. Create a smaller version of this file, consisting only of its first 5000 lines, and store the smaller version in HDFS. You can use the smaller version for testing in subsequent labs.
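One possible approach, reusing the gunzip and fs -put technique from step 7 (the testlog directory and file names here are just example names):
$ hadoop fs -mkdir testlog
$ gunzip -c access_log.gz | head -n 5000 \
| hadoop fs -put - testlog/test_access_log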
1. Enter:
2. The glossary file included in the compressed file you began with is not strictly a work of Shakespeare, so let's remove it:
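$ hadoop fs -rm shakespeare/glossary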
Note that you could leave this file in place if you so wished. If you did, then it would be included in subsequent computations across the works of Shakespeare, and would skew your results slightly. As with many real-world big data problems, you make trade-offs between the labor to purify your input data and the precision of your results.
3. Enter the following. (The file name used here is an assumption; substitute whichever file in your shakespeare directory contains Henry IV, Part 1.)
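$ hadoop fs -cat shakespeare/histories | tail -n 50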
This prints the last 50 lines of Henry IV, Part 1 to your terminal. This command is handy for viewing the output of MapReduce programs. Very often, an individual output file of a MapReduce program is very large, making it inconvenient to view the entire file in the terminal. For this reason, it's often a good idea to view just a portion of the output, for example the last few lines, as you did here.
4. To download a file to work with on the local filesystem, use the fs -get command. This command takes two arguments: an HDFS path and a local path. It copies the HDFS contents into the local filesystem:
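For example (the poems file and the local destination are only illustrative names; substitute any file that exists in your HDFS home directory):
$ hadoop fs -get shakespeare/poems ~/poems.txt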
Other Commands
There are several other operations available with the hadoop fs command to perform most common filesystem manipulations: mv, cp, mkdir, etc.
1. Enter:
$ hadoop fs
This displays a brief usage report of the commands available within FsShell. Try playing around with a few of these commands if you like.