The UOB Python Lectures:
Part 3 - Python for Data Analysis
Hesham al-Ammal
University of Bahrain
Thursday, April 4, 2013
Small Data
BIG Data
Data Scientist’s Tasks
Interacting with the outside world: Reading and writing with a variety of file formats and databases.
Preparation: Cleaning, munging, combining, normalizing, reshaping, slicing and dicing, and transforming data for analysis.
Transformation: Applying mathematical and statistical operations to groups of data sets to derive new data sets. For example, aggregating a large table by group variables.
Modeling and computation: Connecting your data to statistical models, machine learning algorithms, or other computational tools.
Presentation: Creating interactive or static graphical visualizations or textual summaries.
Example 1: .usa.gov data from bit.ly
JSON: JavaScript Object Notation
Python has many JSON libraries
In [15]: path = 'ch02/usagov_bitly_data2012-03-16-1331923249.txt'
In [16]: open(path).readline()
We’ll use a list comprehension to load the data into a list of dictionaries
import json
path = 'usagov_bitly_data2012-03-16-1331923249.txt'
records = [json.loads(line) for line in open(path)]
records[0]
Note the Unicode strings, and notice how dictionaries work:

In [19]: records[0]['tz']
Out[19]: u'America/New_York'
In [20]: print records[0]['tz']
America/New_York
Counting timezones in Python
Let’s start by using Python only and a list comprehension
In [6]: time_zones = [rec['tz'] for rec in records]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-6-db4fbd348da9> in <module>()
----> 1 time_zones = [rec['tz'] for rec in records]
KeyError: 'tz'
You’ll get an error because not all
records have time zones
Counting timezones in Python
Solve the problem of missing tz by adding an if clause to the comprehension
In [26]: time_zones = [rec['tz'] for rec in records if 'tz' in rec]
In [27]: time_zones[:10]
Out[27]:
[u'America/New_York',
 u'America/Denver',
 u'America/New_York',
 u'America/Sao_Paulo',
 u'America/New_York',
 u'America/New_York',
 u'Europe/Warsaw',
 u'',
 u'',
 u'']

Define a function to count occurrences in a sequence using a dictionary:

def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts

Then just pass the time zones list:

In [31]: counts = get_counts(time_zones)
In [32]: counts['America/New_York']
Out[32]: 1251
In [33]: len(time_zones)
Out[33]: 3440
Finding the top 10 timezones
To rank them, we turn the dictionary into (count, time zone) pairs and sort
def top_counts(count_dict, n=10):
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()
    return value_key_pairs[-n:]
Then we will have:

In [35]: top_counts(counts)
Out[35]:
[(33, u'America/Sao_Paulo'),
 (35, u'Europe/Madrid'),
 (36, u'Pacific/Honolulu'),
 (37, u'Asia/Tokyo'),
 (74, u'Europe/London'),
 (191, u'America/Denver'),
 (382, u'America/Los_Angeles'),
 (400, u'America/Chicago'),
 (521, u''),
 (1251, u'America/New_York')]
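As an aside not shown in the slides, the standard library's collections.Counter could do the same counting and top-10 in fewer lines (a minimal sketch, reusing the time_zones list from above):

from collections import Counter

counts = Counter(time_zones)   # same tallies as get_counts(time_zones)
counts.most_common(10)         # ten most frequent time zones; returns (tz, count) pairs, largest first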
Let’s do the same thing in pandas

In [289]: from pandas import DataFrame, Series
In [290]: import pandas as pd
In [291]: frame = DataFrame(records)
In [292]: frame

In [293]: frame['tz'][:10]
Out[293]:
0     America/New_York
1       America/Denver
2     America/New_York
3    America/Sao_Paulo
4     America/New_York
5     America/New_York
6        Europe/Warsaw
7
8
9
Name: tz
What is pandas?
pandas : Python Data Analysis Library
An open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
Features:
Efficient DataFrame data structure
Tools for data reading, munging, cleaning, etc.
To get the counts
In [294]: tz_counts = frame['tz'].value_counts()
In [295]: tz_counts[:10]
Out[295]:
America/New_York       1251
                        521
America/Chicago         400
America/Los_Angeles     382
America/Denver          191
Europe/London            74
Asia/Tokyo               37
Pacific/Honolulu         36
Europe/Madrid            35
America/Sao_Paulo        33

To clean missing values (remember “data cleaning”):

In [296]: clean_tz = frame['tz'].fillna('Missing')
In [297]: clean_tz[clean_tz == ''] = 'Unknown'
In [298]: tz_counts = clean_tz.value_counts()
In [299]: tz_counts[:10]
Out[299]:
America/New_York       1251
Unknown                 521
America/Chicago         400
America/Los_Angeles     382
America/Denver          191
Missing                 120
To plot the results (presentation)
In [301]: tz_counts[:10].plot(kind='barh', rot=0)
import matplotlib.pyplot as plt
plt.show()
Example 2: MovieLens 1M Dataset

GroupLens Research (http://www.grouplens.org/node/73)
Ratings for movies from the 1990s and 2000s
Three tables: 1 million ratings, 6,000 users, 4,000 movies
Tasks: Interacting with the outside world; Preparation
Extract the data from a zip file and load it into pandas DataFrames:
import pandas as pd
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('ml-1m/users.dat', sep='::', header=None,names=unames)
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None,names=rnames)
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None,names=mnames)
Verify
Task: Preparation
In [334]: users[:5]
In [335]: ratings[:5]
In [336]: movies[:5]
Merge
Task: Preparation
Using pandas’s merge function, we first merge ratings with users, and then merge that result with the movies data. pandas infers which columns to use as the merge (or join) keys based on overlapping names.
Merge results
In [338]: data = pd.merge(pd.merge(ratings, users), movies)
In [339]: data
Out[339]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000209 entries, 0 to 1000208
Data columns:
user_id 1000209 non-null values
movie_id 1000209 non-null values
rating 1000209 non-null values
timestamp 1000209 non-null values
gender 1000209 non-null values
age 1000209 non-null values
occupation 1000209 non-null values
zip 1000209 non-null values
title 1000209 non-null values
genres 1000209 non-null values
dtypes: int64(6), object(4)
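As a quick illustration of the Transformation task on this merged table (a sketch that goes one step beyond the slides; the column names come from the merge output above):

# Sketch only: mean rating grouped by gender, using the merged DataFrame from In [338]
mean_by_gender = data.groupby('gender')['rating'].mean()
print(mean_by_gender)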
Example 3: US Baby Names 1880-2010

United States Social Security Administration (SSA): http://www.ssa.gov/oact/babynames/limits.html
Visualize the proportion of babies given a particular name
Determine the most popular names in each year, or the names with the largest increases or decreases
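A minimal sketch of how the names DataFrame used below might be assembled, assuming the SSA download contains one yobYYYY.txt file per year with name, sex, births columns (the 'names/' directory is an assumption):

import pandas as pd

# Assumption: files named names/yob1880.txt ... names/yob2010.txt, columns name,sex,births
pieces = []
for year in range(1880, 2011):
    frame = pd.read_csv('names/yob%d.txt' % year, names=['name', 'sex', 'births'])
    frame['year'] = year              # tag each yearly file with its year
    pieces.append(frame)

names = pd.concat(pieces, ignore_index=True)   # one DataFrame covering 1880-2010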
Let’s do some more investigations
names[names.name=='Mohammad']
names[names.name=='Fatima']
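As a further sketch (beyond what the slides show), one way to visualize how often one of these names appears over time, using the names DataFrame assumed above:

# Sketch only: total births per year for the name 'Mohammad'
mohammad = names[names.name == 'Mohammad']
births_by_year = mohammad.groupby('year')['births'].sum()
births_by_year.plot(title='Mohammad')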