12/19/2015
Python
www.Bigdatainpractice.com
Introduction to Python and Python Streaming
Introduction to Python
Python Data Types
Working with Files
Conditions, Loops, etc.
Data Structures
Python Streaming on Hadoop
Python Features
1. Open-source, general-purpose programming language
2. Developed by Guido van Rossum (first released in 1991)
3. Guido wanted to bridge the gap between C and the shell
4. Rapid development (Python is an interpreted language, so there is no separate compilation step)
5. Very handy for professionals who don't have Java, C++, or C skills
6. A lot of heavy lifting (e.g. working with Hadoop) can be done very easily using Python
7. Supports object-oriented as well as procedural code
8. Very high-level dynamic data types such as lists and dictionaries
9. Extensive standard library and add-on modules (such as SciPy and NumPy)
10. Versions available for most operating systems
[Figure: Comparison of programming languages (popularity)]
Working with Python
1. Working with the Python shell
2. Environment setup: add Python to the PATH
3. Python scripting:
   To run a Python script: python hello.py
   Blocks are marked by indentation only; there are no braces
4. Python variables:
   Created when first assigned a value
   Can hold data of any type
   Variable names can be of any length and are case-sensitive
   The type is determined by the assigned value (int, float, str)
5. Working with files
   Use the with open(...) statement
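The points on variables and files can be tried out with a short sketch like the one below. The variable names and the file name hello.txt are made up for illustration, and the file is written to a temporary directory so nothing else is touched.

```python
import os
import tempfile

# A variable is created on first assignment; its type follows the value.
count = 10          # int
price = 2.5         # float
label = "sales"     # str
type_names = [type(v).__name__ for v in (count, price, label)]

# The same name can later be rebound to a value of another type.
count = "ten"

# Names are case-sensitive: 'Label' and 'label' are different variables.
Label = "SALES"

# 'with open' closes the file automatically when the block ends,
# even if an error occurs inside it.
path = os.path.join(tempfile.mkdtemp(), "hello.txt")  # hypothetical file name
with open(path, "w") as out_file:
    out_file.write("hello\nworld\n")

with open(path) as in_file:
    lines = [line.strip() for line in in_file]

print(type_names)
print(lines)
```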
Working with Python
6. Decision controls and loops
   for loop (when the number of iterations is known)
   while loop (loops for as long as a condition holds)
7. Data structures: lists, tuples
   Lists are mutable sequences, typically used to store multiple values of a similar type
   Tuples are immutable sequences, often used to store multiple values of different types
   Indexing and slicing work the same way for lists, tuples and strings
8. Data structures: dictionaries, sets
9. Working with the file system
10. Functions
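A minimal sketch of the loop, data-structure and function items above. All names and sample values are illustrative and not taken from the course material.

```python
# for loop: the number of iterations is known up front.
squares = []
for n in range(5):
    squares.append(n * n)

# while loop: repeat for as long as the condition holds.
total = 0
i = 1
while total < 10:
    total += i
    i += 1

# Lists are mutable; tuples are immutable.
sales = [120.0, 75.5, 300.25]
sales.append(42.0)
record = ("M1", "C1", 2015)  # mixed types in a fixed-size record

# Indexing and slicing look the same on lists, tuples and strings.
first_two_sales = sales[:2]
first_two_fields = record[:2]
first_two_chars = "Python"[:2]

# Dictionaries map keys to values; sets keep unique items only.
sales_by_cat = {"C1": 120.0, "C2": 75.5}
sales_by_cat["C1"] += 10.0
categories = set(["C1", "C2", "C1"])


def average(values):
    """Return the arithmetic mean of a non-empty sequence."""
    return sum(values) / float(len(values))
```

Calling average(sales) then returns the mean of the four list entries.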
Python Streaming on Hadoop - MapReduce
******************mapper.py******************
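#!/usr/bin/env python
import sys

# Read CSV records from stdin and emit "category<TAB>sales" pairs.
for line in sys.stdin:
    line = line.strip()
    memtype, cat, year, month, day, qty, sales = line.split(",")
    # Parenthesized print works in both Python 2 and Python 3.
    print('%s\t%s' % (cat, sales))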
*********************************************
******************reducer.py******************
Look at script
*********************************************
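The slides do not reproduce reducer.py, so the following is only a sketch of what a matching reducer might look like. It assumes the goal is to total sales per category, relying on the fact that Hadoop Streaming delivers the mapper output to the reducer sorted by key. The function name reduce_sales and the sample lines are hypothetical.

```python
def reduce_sales(lines):
    """Sum sales per category from key-sorted 'cat<TAB>sales' lines."""
    current_cat = None
    total = 0.0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        cat, sales = line.split("\t", 1)
        if cat != current_cat:
            # Key changed: emit the finished category, then start a new one.
            if current_cat is not None:
                yield "%s\t%s" % (current_cat, total)
            current_cat = cat
            total = 0.0
        total += float(sales)
    # Emit the last category, if any input was seen.
    if current_cat is not None:
        yield "%s\t%s" % (current_cat, total)


# Assumed usage inside a real reducer script:
#   import sys
#   for out in reduce_sales(sys.stdin):
#       print(out)
sample = ["C1\t10.5\n", "C1\t4.5\n", "C2\t7.0\n"]  # made-up mapper output
results = list(reduce_sales(sample))
print(results)
```

Keeping the logic in a generator function makes it easy to check on a small list before submitting the job.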
hadoop jar <streaming.jar> \
    -file /user/cloudera/mapper.py \
    -file /user/cloudera/reducer.py \
    -mapper /user/cloudera/mapper.py \
    -reducer /user/cloudera/reducer.py \
    -input /user/cloudera/INPUT1/SalesData.csv \
    -output /user/cloudera/OUT_PY
Python Streaming on Hadoop - PIG
******************PIG SCRIPT******************
DEFINE streampy `/usr/bin/python pigPython.py`
    INPUT (stdin USING PigStreaming(','))
    OUTPUT (stdout USING PigStreaming(','))
    SHIP ('pigPython.py');

salesData = LOAD '/user/cloudera/INPUT1/SalesData.csv' USING PigStorage(',')
    AS (member_type:chararray, cat:chararray, year:int, month:int,
        day:chararray, quantity:int, sales:float);
salesData2 = FILTER salesData BY cat == 'C1';
salesData3 = STREAM salesData2 THROUGH streampy
    AS (member_type:chararray, cat:chararray, year:int, month:int,
        day:chararray, qty:int, sales:float);
DUMP salesData3;
Python Streaming on Hadoop - PIG
******************pigPython.py******************
#!/usr/bin/python
import sys

# Prefix the member type and category of each comma-separated record.
for line in sys.stdin:
    line = line.strip()
    member_type, cat, year, month, day, qty, sales = line.split(",")
    member_type = 'MEMBER_' + member_type
    cat = 'CAT_' + cat
    print(",".join([member_type, cat, year, month, day, qty, sales]))
Python Streaming on Hadoop - HIVE
******************HIVE SCRIPT*****************
CREATE TABLE employee_python (
    empid string,
    name string,
    assist string,
    salary float,
    country string,
    state string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

ADD FILE hiveTran.py;

INSERT OVERWRITE TABLE employee_python
SELECT
    TRANSFORM (empid, name, assist, salary, country, state)
    USING 'python hiveTran.py'
    AS (empid, name, assist, salary, country, state)
FROM employee;
Python Streaming on Hadoop - HIVE
******************hiveTran.py******************
#!/usr/bin/env python
import sys

# Upper-case the name, country and state; express the salary in thousands.
for line in sys.stdin:
    line = line.strip()
    empid, name, assist, salary, country, state = line.split("\t")
    name = name.upper()
    salary = float(salary) / 1000
    country = country.upper()
    state = state.upper()
    print('\t'.join([empid, name, assist, str(salary), country, state]))
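Before wiring the script into Hive, the per-row logic can be checked locally. The sketch below wraps the same transformation in a function; the name transform_row and the sample employee row are invented for illustration and do not come from the employee table.

```python
def transform_row(line):
    """Apply the hiveTran.py transformation to one tab-separated row."""
    empid, name, assist, salary, country, state = line.strip().split("\t")
    salary = float(salary) / 1000  # salary expressed in thousands
    return "\t".join([empid, name.upper(), assist, str(salary),
                      country.upper(), state.upper()])


sample_row = "E01\tasha\tyes\t52000\tin\tka\n"  # made-up input row
transformed = transform_row(sample_row)
print(transformed)
```

The same check can be done from the shell by piping a small tab-separated file into python hiveTran.py.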
Thank You