Uob Python Lecture2p
Thursday, April 4, 13
Small Data
BIG Data
Data Scientist’s Tasks
Interacting with the outside world
Reading and writing with a variety of file formats and databases.
Preparation
Cleaning, munging, combining, normalizing, reshaping, slicing and dicing, and transforming data for analysis.
Presentation
Creating interactive or static graphical visualizations or textual summaries.
Example 1: .usa.gov data from bit.ly
JSON: JavaScript Object Notation
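A minimal sketch of turning JSON text into Python dictionaries, assuming the data arrives as one JSON object per line. The sample lines and the helper name load_records are made up for illustration; they are not the real bit.ly file.

```python
import json

# Hypothetical sample lines standing in for the data file:
# one JSON object per line, not every object has a 'tz' field.
sample_lines = [
    '{"tz": "America/New_York", "a": "Mozilla/5.0"}',
    '{"tz": "", "a": "GoogleMaps/RochesterNY"}',
    '{"a": "Mozilla/4.0"}',
]

def load_records(lines):
    """Parse an iterable of JSON text lines into a list of dictionaries."""
    return [json.loads(line) for line in lines if line.strip()]

records = load_records(sample_lines)
print(records[0]['tz'])
```

With a real file, the same helper could be given an open file object, since iterating over a file yields its lines.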
In [19]: records[0]['tz']
Out[19]: u'America/New_York'
Note the Unicode strings and how dictionaries work.
KeyError: 'tz'
You’ll get this error for some records because not all records have time zones.
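A small sketch of the failure and one way around it, assuming records is a list of dictionaries. The two sample records and the helper get_tz are hypothetical.

```python
# Hypothetical records: the second one has no 'tz' key.
records = [
    {'tz': 'America/New_York'},
    {'a': 'Mozilla/4.0'},
]

def get_tz(rec, default=None):
    """Return the record's time zone, or a default when the key is absent."""
    return rec.get('tz', default)

# Indexing with [] raises KeyError when the key is missing.
try:
    records[1]['tz']
    raised = False
except KeyError:
    raised = True

print(raised, get_tz(records[0]), get_tz(records[1]))
```

dict.get never raises for a missing key, which makes it convenient when a field is optional.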
Counting timezones in Python
Solve the problem of missing tz by using an if
In [26]: time_zones = [rec['tz'] for rec in records if 'tz' in rec]
In [27]: time_zones[:10]
Out[27]:
[u'America/New_York',
 u'America/Denver',
 u'America/New_York',
 u'America/Sao_Paulo',
 u'America/New_York',
 u'America/New_York',
 u'Europe/Warsaw',
 u'',
 u'',
 u'']
Define a function to count occurrences in a sequence using a dictionary
def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts
Then just pass the time zones list
In [31]: counts = get_counts(time_zones)
In [32]: counts['America/New_York']
Out[32]: 1251
In [33]: len(time_zones)
Out[33]: 3440
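The same counting can be sketched with collections.Counter from the standard library. This is an alternative to get_counts, not the approach on the slide, and the short time_zones list here is made up.

```python
from collections import Counter

# Made-up sample; the real list has 3440 entries.
time_zones = ['America/New_York', 'America/Denver', 'America/New_York', '']

# Counter builds the value -> count dictionary in one step.
counts = Counter(time_zones)

print(counts['America/New_York'])
```

Unlike a plain dictionary, a Counter returns 0 for a value it has never seen instead of raising KeyError.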
Finding the top 10 timezones
We have to manipulate the dictionary by sorting
def top_counts(count_dict, n=10):
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()
    return value_key_pairs[-n:]
Then we will have
In [35]: top_counts(counts)
Out[35]:
[(33, u'America/Sao_Paulo'),
 (35, u'Europe/Madrid'),
 (36, u'Pacific/Honolulu'),
 (37, u'Asia/Tokyo'),
 (74, u'Europe/London'),
 (191, u'America/Denver'),
 (382, u'America/Los_Angeles'),
 (400, u'America/Chicago'),
 (521, u''),
 (1251, u'America/New_York')]
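top_counts returns the pairs in ascending order, largest last. As a sketch of an alternative, Counter.most_common returns (value, count) pairs with the largest first. The helper name top_counts_desc is hypothetical, and the dictionary holds only four of the counts shown above.

```python
from collections import Counter

def top_counts_desc(count_dict, n=10):
    """Return the n most frequent (value, count) pairs, largest count first."""
    return Counter(count_dict).most_common(n)

# A few of the counts from the output above.
counts = {
    'America/New_York': 1251,
    '': 521,
    'America/Chicago': 400,
    'Europe/London': 74,
}

top = top_counts_desc(counts, n=3)
print(top)
```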
Let’s do the same thing in pandas
In [289]: from pandas import DataFrame, Series
In [290]: import pandas as pd
In [291]: frame = DataFrame(records)
In [292]: frame
In [293]: frame['tz'][:10]
Out[293]:
0     America/New_York
1       America/Denver
2     America/New_York
3    America/Sao_Paulo
4     America/New_York
5     America/New_York
6        Europe/Warsaw
7
8
9
Name: tz
What is pandas?
Features:
Efficient DataFrame data structure
To get the counts
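One possible way to get the counts in pandas, sketched on a made-up records list: value_counts tallies each distinct value. The labels 'Missing' and 'Unknown' are chosen here for illustration to separate absent fields from empty strings.

```python
import pandas as pd

# Made-up records: one has an empty 'tz', one has no 'tz' at all.
records = [
    {'tz': 'America/New_York'},
    {'tz': 'America/New_York'},
    {'tz': ''},
    {'a': 'no tz here'},
    {'tz': 'Europe/London'},
]
frame = pd.DataFrame(records)

# Records without a 'tz' key become NaN in the column; label them explicitly.
clean_tz = frame['tz'].fillna('Missing')
# Keep non-empty values and replace empty strings with a readable label.
clean_tz = clean_tz.where(clean_tz != '', 'Unknown')

# value_counts returns a Series sorted by count, largest first.
tz_counts = clean_tz.value_counts()
print(tz_counts)
```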
To plot the results (presentation)
Example 2: MovieLens 1M Dataset
Interacting with the outside world
Reading and writing with a variety of file formats and databases.
Preparation
Cleaning, munging, combining, normalizing, reshaping, slicing and dicing, and transforming data for analysis.
import pandas as pd
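A sketch of reading '::'-separated tables into DataFrames, assuming the three tables use that separator and have no header row. The sample rows are illustrative, read_dat is a hypothetical helper, and the column names follow the merged output shown later in this example.

```python
import io
import pandas as pd

# Illustrative rows in a '::'-separated layout (in-memory text, not the real files).
users_text = "1::F::1::10::48067\n2::M::56::16::70072\n"
ratings_text = "1::1193::5::978300760\n2::1193::4::978298413\n"
movies_text = "1193::One Flew Over the Cuckoo's Nest (1975)::Drama\n"

def read_dat(source, names):
    """Read a '::'-separated table with no header row."""
    # A multi-character separator needs the Python parsing engine.
    return pd.read_csv(source, sep='::', header=None, names=names, engine='python')

users = read_dat(io.StringIO(users_text),
                 ['user_id', 'gender', 'age', 'occupation', 'zip'])
ratings = read_dat(io.StringIO(ratings_text),
                   ['user_id', 'movie_id', 'rating', 'timestamp'])
movies = read_dat(io.StringIO(movies_text),
                  ['movie_id', 'title', 'genres'])

print(users[:5])
```

With files on disk, a path string can be passed in place of the io.StringIO object.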
Verify
Preparation
Cleaning, munging, combining, normalizing,
reshaping, slicing and dicing, and
transforming data for analysis.
In [334]: users[:5]
In [335]: ratings[:5]
In [336]: movies[:5]
Merge
Preparation
Cleaning, munging, combining, normalizing,
reshaping, slicing and dicing, and
transforming data for analysis.
Merge results
In [338]: data = pd.merge(pd.merge(ratings, users), movies)
In [339]: data
Out[339]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000209 entries, 0 to 1000208
Data columns:
user_id 1000209 non-null values
movie_id 1000209 non-null values
rating 1000209 non-null values
timestamp 1000209 non-null values
gender 1000209 non-null values
age 1000209 non-null values
occupation 1000209 non-null values
zip 1000209 non-null values
title 1000209 non-null values
genres 1000209 non-null values
dtypes: int64(6), object(4)
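When no key is given, pd.merge joins on the columns the two frames have in common by name. The sketch below spells the keys out with on= to make that visible; the three tiny frames are made up and carry only a few of the columns.

```python
import pandas as pd

# Tiny made-up tables with the same key columns as the real ones.
ratings = pd.DataFrame({'user_id': [1, 2, 1],
                        'movie_id': [10, 10, 20],
                        'rating': [5, 3, 4]})
users = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
movies = pd.DataFrame({'movie_id': [10, 20], 'title': ['Movie A', 'Movie B']})

# First attach user attributes to each rating, then attach movie attributes.
data = pd.merge(pd.merge(ratings, users, on='user_id'), movies, on='movie_id')
print(data)
```

Each rating row survives the merge once, so the result has as many rows as ratings when every key has a match.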
Example 3: US Baby Names 1880-2010
United States Social Security Administration (SSA)
http://www.ssa.gov/oact/babynames/limits.html
Let’s do some more investigations
names[names.name=='Mohammad']
names[names.name=='Fatima']
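A sketch of what these selections do, assuming names is a DataFrame with one row per name, sex and year. The columns sex, births and year and all the numbers below are made up for illustration.

```python
import pandas as pd

# Made-up rows; column names other than 'name' are assumptions.
names = pd.DataFrame({
    'name': ['Mary', 'Fatima', 'Fatima', 'John'],
    'sex': ['F', 'F', 'F', 'M'],
    'births': [100, 12, 15, 90],
    'year': [1880, 1990, 1991, 1880],
})

# The comparison yields a boolean Series; indexing with it keeps the True rows.
fatima = names[names['name'] == 'Fatima']

# One possible follow-up: total births for that name in each year.
births_by_year = fatima.groupby('year')['births'].sum()
print(births_by_year)
```

names['name'] and names.name select the same column here; the bracket form also works for column names that are not valid Python identifiers.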