The UOB Python Lectures:
Part 3 - Python for Data Analysis
Hesham al-Ammal
University of Bahrain
Thursday, April 4, 2013
Small Data
BIG Data
Data Scientist’s Tasks
Interacting with the outside world: Reading and writing with a variety of file formats and databases.
Preparation: Cleaning, munging, combining, normalizing, reshaping, slicing and dicing, and transforming data for analysis.
Transformation: Applying mathematical and statistical operations to groups of data sets to derive new data sets. For example, aggregating a large table by group variables.
Modeling and computation: Connecting your data to statistical models, machine learning algorithms, or other computational tools.
Presentation: Creating interactive or static graphical visualizations or textual summaries.
Example 1: .usa.gov data from bit.ly
JSON: JavaScript Object Notation
Python has many JSON libraries
In [15]: path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'
In [16]: open(path).readline()
We’ll use a list comprehension to load the data into a list of dictionaries
import json
path = 'usagov_bitly_data2012-03-16-1331923249.txt'
records = [json.loads(line) for line in open(path)]
records[0]
Note the Unicode strings, and notice how dictionaries work:

In [19]: records[0]['tz']
Out[19]: u'America/New_York'
In [20]: print records[0]['tz']
America/New_York
Counting timezones in Python
Let’s start by using Python only and a list comprehension
In [6]: time_zones = [rec['tz'] for rec in records]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-6-db4fbd348da9> in <module>()
----> 1 time_zones = [rec['tz'] for rec in records]
KeyError: 'tz'
You’ll get an error because not all
records have time zones
Counting timezones in Python
Solve the problem of missing tz by adding an if clause to the comprehension
In [26]: time_zones = [rec['tz'] for rec in records if 'tz' in rec]
In [27]: time_zones[:10]
Out[27]:
[u'America/New_York',
 u'America/Denver',
 u'America/New_York',
 u'America/Sao_Paulo',
 u'America/New_York',
 u'America/New_York',
 u'Europe/Warsaw',
 u'',
 u'',
 u'']

Define a function to count occurrences in a sequence using a dictionary:

def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts

Then just pass the time zones list:

In [31]: counts = get_counts(time_zones)
In [32]: counts['America/New_York']
Out[32]: 1251
In [33]: len(time_zones)
Out[33]: 3440
Finding the top 10 timezones
To rank them, we turn the dictionary into (count, time zone) pairs and sort
def top_counts(count_dict, n=10):
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()
    return value_key_pairs[-n:]
Then we will have:

In [35]: top_counts(counts)
Out[35]:
[(33, u'America/Sao_Paulo'),
 (35, u'Europe/Madrid'),
 (36, u'Pacific/Honolulu'),
 (37, u'Asia/Tokyo'),
 (74, u'Europe/London'),
 (191, u'America/Denver'),
 (382, u'America/Los_Angeles'),
 (400, u'America/Chicago'),
 (521, u''),
 (1251, u'America/New_York')]
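As an aside not shown in the slides, the standard library's collections.Counter could do the same counting and top-10 in fewer lines (a minimal sketch, reusing the time_zones list from above):

from collections import Counter

counts = Counter(time_zones)   # same tallies as get_counts(time_zones)
counts.most_common(10)         # ten most frequent time zones; returns (tz, count) pairs, largest first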
Let’s do the same thing in pandas

In [289]: from pandas import DataFrame, Series
In [290]: import pandas as pd
In [291]: frame = DataFrame(records)
In [292]: frame

In [293]: frame['tz'][:10]
Out[293]:
0     America/New_York
1       America/Denver
2     America/New_York
3    America/Sao_Paulo
4     America/New_York
5     America/New_York
6        Europe/Warsaw
7
8
9
Name: tz
What is pandas?
pandas : Python Data Analysis Library
An open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
Features:
Efficient DataFrame data structure
Tools for data reading, munging, cleaning, etc.
To get the counts
In [294]: tz_counts = frame['tz'].value_counts()
In [295]: tz_counts[:10]
Out[295]:
America/New_York       1251
                        521
America/Chicago         400
America/Los_Angeles     382
America/Denver          191
Europe/London            74
Asia/Tokyo               37
Pacific/Honolulu         36
Europe/Madrid            35
America/Sao_Paulo        33

To clean missing values (remember “data cleaning”):

In [296]: clean_tz = frame['tz'].fillna('Missing')
In [297]: clean_tz[clean_tz == ''] = 'Unknown'
In [298]: tz_counts = clean_tz.value_counts()
In [299]: tz_counts[:10]
Out[299]:
America/New_York       1251
Unknown                 521
America/Chicago         400
America/Los_Angeles     382
America/Denver          191
Missing                 120
To plot the results (presentation)
In [301]: tz_counts[:10].plot(kind='barh', rot=0)
import matplotlib.pyplot as plt
plt.show()
Example 2: MovieLens 1M Dataset

GroupLens Research (http://www.grouplens.org/node/73)
Ratings for movies from the 1990s and 2000s
Three tables: 1 million ratings, 6,000 users, 4,000 movies
Tasks: Interacting with the outside world; Preparation
Extract the data from a zip file and load it into pandas DataFrames:
import pandas as pd
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('ml-1m/users.dat', sep='::', header=None,names=unames)
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None,names=rnames)
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None,names=mnames)
Verify
Task: Preparation
In [334]: users[:5]
In [335]: ratings[:5]
In [336]: movies[:5]
Merge
Task: Preparation
Using pandas’s merge function, we first merge ratings with users, and then merge that result with the movies data. pandas infers which columns to use as the merge (or join) keys based on overlapping names.
Merge results
In [338]: data = pd.merge(pd.merge(ratings, users), movies)
In [339]: data
Out[339]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000209 entries, 0 to 1000208
Data columns:
user_id 1000209 non-null values
movie_id 1000209 non-null values
rating 1000209 non-null values
timestamp 1000209 non-null values
gender 1000209 non-null values
age 1000209 non-null values
occupation 1000209 non-null values
zip 1000209 non-null values
title 1000209 non-null values
genres 1000209 non-null values
dtypes: int64(6), object(4)
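As a quick illustration of the Transformation task on this merged table (a sketch that goes one step beyond the slides; the column names come from the merge output above):

# Sketch only: mean rating grouped by gender, using the merged DataFrame from In [338]
mean_by_gender = data.groupby('gender')['rating'].mean()
print(mean_by_gender)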
Example 3: US Baby Names 1880-2010

United States Social Security Administration (SSA): http://www.ssa.gov/oact/babynames/limits.html
Visualize the proportion of babies given a particular name
Determine the most popular names in each year, or the names with the largest increases or decreases
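A minimal sketch of how the names DataFrame used below might be assembled, assuming the SSA download contains one yobYYYY.txt file per year with name, sex, births columns (the 'names/' directory is an assumption):

import pandas as pd

# Assumption: files named names/yob1880.txt ... names/yob2010.txt, columns name,sex,births
pieces = []
for year in range(1880, 2011):
    frame = pd.read_csv('names/yob%d.txt' % year, names=['name', 'sex', 'births'])
    frame['year'] = year              # tag each yearly file with its year
    pieces.append(frame)

names = pd.concat(pieces, ignore_index=True)   # one DataFrame covering 1880-2010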
Let’s do some more investigations
names[names.name=='Mohammad']
names[names.name=='Fatima']
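As a further sketch (beyond what the slides show), one way to visualize how often one of these names appears over time, using the names DataFrame assumed above:

# Sketch only: total births per year for the name 'Mohammad'
mohammad = names[names.name == 'Mohammad']
births_by_year = mohammad.groupby('year')['births'].sum()
births_by_year.plot(title='Mohammad')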