
The UOB Python Lectures:

Part 3 - Python for Data


Analysis
Hesham al-Ammal
University of Bahrain

Thursday, April 4, 13
Small Data
BIG Data

Data Scientist’s Tasks
Interacting with the outside world
Reading and writing with a variety of file formats and databases.

Preparation
Cleaning, munging, combining, normalizing, reshaping, slicing and dicing, and transforming data for analysis.

Modeling and computation
Connecting your data to statistical models, machine learning algorithms, or other computational tools.

Transformation
Applying mathematical and statistical operations to groups of data sets to derive new data sets. For example, aggregating a large table by group variables.

Presentation
Creating interactive or static graphical visualizations or textual summaries.

Example 1: .usa.gov data from bit.ly

JSON: JavaScript Object Notation

Python has many JSON libraries.

In [15]: path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'
In [16]: open(path).readline()

We'll use a list comprehension to load the data into a list of dictionaries:

import json
path = 'usagov_bitly_data2012-03-16-1331923249.txt'
records = [json.loads(line) for line in open(path)]
records[0]
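The dataset file itself isn't reproduced here, so this is a minimal runnable sketch of the same pattern with two invented sample lines standing in for the file contents (the `tz` field matches the lecture; the `a` field is illustrative):

```python
import json

# Each line of the .usa.gov/bit.ly file is one JSON object;
# these sample lines are invented for illustration.
sample_lines = [
    '{"tz": "America/New_York", "a": "Mozilla/5.0"}',
    '{"tz": "Europe/Warsaw"}',
]

# Same list comprehension as in the lecture, over in-memory lines
# instead of open(path).
records = [json.loads(line) for line in sample_lines]
print(records[0]['tz'])  # America/New_York
```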

In [19]: records[0]['tz']
Out[19]: u'America/New_York'

In [20]: print records[0]['tz']
America/New_York

Note the Unicode strings, and notice how dictionaries work.

Counting timezones in Python

Let's start by using Python only and a list comprehension:
In [6]: time_zones = [rec['tz'] for rec in records]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-6-db4fbd348da9> in <module>()
----> 1 time_zones = [rec['tz'] for rec in records]

KeyError: 'tz'
You’ll get an error because not all
records have time zones
Counting timezones in Python
Solve the problem of missing tz by using an if:

In [26]: time_zones = [rec['tz'] for rec in records if 'tz' in rec]
In [27]: time_zones[:10]
Out[27]:
[u'America/New_York',
 u'America/Denver',
 u'America/New_York',
 u'America/Sao_Paulo',
 u'America/New_York',
 u'America/New_York',
 u'Europe/Warsaw',
 u'',
 u'',
 u'']

Define a function to count occurrences in a sequence using a dictionary:

def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts

Then just pass the time zones list:

In [31]: counts = get_counts(time_zones)
In [32]: counts['America/New_York']
Out[32]: 1251
In [33]: len(time_zones)
Out[33]: 3440
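The hand-written get_counts is equivalent to the standard library's collections.Counter; a sketch using a few invented timezone values in place of the real list:

```python
from collections import Counter

# Invented sample standing in for the full time_zones list.
time_zones = ['America/New_York', 'America/Denver',
              'America/New_York', '', '']

# Counter is a dict subclass that tallies occurrences in one call.
counts = Counter(time_zones)
print(counts['America/New_York'])  # 2
```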

Finding the top 10 timezones
We have to manipulate the dictionary by sorting:

def top_counts(count_dict, n=10):
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()
    return value_key_pairs[-n:]

Then we will have:

In [35]: top_counts(counts)
Out[35]:
[(33, u'America/Sao_Paulo'),
 (35, u'Europe/Madrid'),
 (36, u'Pacific/Honolulu'),
 (37, u'Asia/Tokyo'),
 (74, u'Europe/London'),
 (191, u'America/Denver'),
 (382, u'America/Los_Angeles'),
 (400, u'America/Chicago'),
 (521, u''),
 (1251, u'America/New_York')]
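Counter.most_common(n) replaces the sort-and-slice in top_counts; a sketch with invented counts (note it returns (value, count) pairs in descending order, the reverse of top_counts):

```python
from collections import Counter

# Invented sample standing in for the full time_zones list.
time_zones = ['America/New_York'] * 3 + ['Europe/London'] * 2 + ['Asia/Tokyo']

# most_common(n) returns the n highest-count (value, count) pairs.
top = Counter(time_zones).most_common(2)
print(top)  # [('America/New_York', 3), ('Europe/London', 2)]
```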

Let’s do the same thing in pandas

In [289]: from pandas import DataFrame, Series
In [290]: import pandas as pd
In [291]: frame = DataFrame(records)
In [292]: frame

In [293]: frame['tz'][:10]
Out[293]:
0     America/New_York
1       America/Denver
2     America/New_York
3    America/Sao_Paulo
4     America/New_York
5     America/New_York
6        Europe/Warsaw
7
8
9
Name: tz
What is pandas?

pandas: Python Data Analysis Library

An open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

Features:
Efficient DataFrame data structure

Tools for data reading, munging, cleaning, etc.

To get the counts

In [294]: tz_counts = frame['tz'].value_counts()
In [295]: tz_counts[:10]
Out[295]:
America/New_York       1251
                        521
America/Chicago         400
America/Los_Angeles     382
America/Denver          191
Europe/London            74
Asia/Tokyo               37
Pacific/Honolulu         36
Europe/Madrid            35
America/Sao_Paulo        33

To clean missing values (remember “data cleaning”):

In [296]: clean_tz = frame['tz'].fillna('Missing')
In [297]: clean_tz[clean_tz == ''] = 'Unknown'
In [298]: tz_counts = clean_tz.value_counts()
In [299]: tz_counts[:10]
Out[299]:
America/New_York       1251
Unknown                 521
America/Chicago         400
America/Los_Angeles     382
America/Denver          191
Missing                 120
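The same fillna/value_counts pattern, runnable on a tiny invented Series standing in for frame['tz'] (one None for a record missing the key, one empty string for an unknown timezone):

```python
import pandas as pd

# Tiny stand-in for frame['tz']: a missing value and an empty string.
tz = pd.Series(['America/New_York', None, '', 'America/New_York'])

clean_tz = tz.fillna('Missing')        # missing values -> 'Missing'
clean_tz[clean_tz == ''] = 'Unknown'   # empty strings  -> 'Unknown'
print(clean_tz.value_counts())
```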

To plot the results (presentation)

import matplotlib.pyplot as plt

In [301]: tz_counts[:10].plot(kind='barh', rot=0)

plt.show()

Example 2: MovieLens 1M Dataset

GroupLens Research (http://www.grouplens.org/node/73)

Ratings for movies, 1990s + 2000s

Three tables: 1 million ratings, 6,000 users, 4,000 movies

Interacting with the outside world
Reading and writing with a variety of file formats and databases.

Preparation
Cleaning, munging, combining, normalizing, reshaping, slicing and dicing, and transforming data for analysis.

Extract the data from a zip file and load it into pandas DataFrames:

import pandas as pd

unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('ml-1m/users.dat', sep='::', header=None, names=unames)

rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None, names=rnames)

mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None, names=mnames)
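The .dat files aren't included here, so this sketch parses two invented rows in the same '::'-separated format from an in-memory buffer. Note that in current pandas a multi-character separator needs the pure-Python parsing engine:

```python
import pandas as pd
from io import StringIO

# Two invented rows in the same '::'-separated format as users.dat.
sample = "1::F::1::10::48067\n2::M::56::16::70072\n"
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

# Multi-character separators like '::' require engine='python'.
users = pd.read_csv(StringIO(sample), sep='::', header=None,
                    names=unames, engine='python')
print(users['gender'].tolist())  # ['F', 'M']
```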

Verify
Preparation
Cleaning, munging, combining, normalizing,
reshaping, slicing and dicing, and
transforming data for analysis.

In [334]: users[:5]

In [335]: ratings[:5]

In [336]: movies[:5]

Merge
Preparation
Cleaning, munging, combining, normalizing, reshaping, slicing and dicing, and transforming data for analysis.

Using pandas’s merge function, we first merge ratings with users, then merge that result with the movies data. pandas infers which columns to use as the merge (or join) keys based on overlapping names.
Merge results
In [338]: data = pd.merge(pd.merge(ratings, users), movies)
In [339]: data
Out[339]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000209 entries, 0 to 1000208
Data columns:
user_id 1000209 non-null values
movie_id 1000209 non-null values
rating 1000209 non-null values
timestamp 1000209 non-null values
gender 1000209 non-null values
age 1000209 non-null values
occupation 1000209 non-null values
zip 1000209 non-null values
title 1000209 non-null values
genres 1000209 non-null values
dtypes: int64(6), object(4)

Example 3: US Baby Names 1880-2010

United States Social Security Administration (SSA): http://www.ssa.gov/oact/babynames/limits.html

Visualize the proportion of babies given a particular name

Determine the most popular names in each year, or the names with the largest increases or decreases
Let’s do some more investigations
names[names.name=='Mohammad']

names[names.name=='Fatima']
