Data Analysis with Pandas
Contents
• Intro to Pandas
• A first taste of Pandas
• Pandas Data Structures
  • Series, DataFrames
• Loading and Saving Data
  • Read/Write CSV, Write Excel, SQL
• Retrieving Information
• Selections
  • Subsetting rows, Subsetting columns
• Reshaping Data
  • Dropping values, Merging, pivot, unpivot, sorting, reindexing
• Handling Missing Data
  • Dropping Missing Data, Filling Missing Data
• Group Data
  • Groupby
Intro to Pandas
What is Pandas?
• Pandas is a high-performance open source library for data analysis in Python
• The de facto standard library for data analysis using Python
A first taste of Pandas
Objectives
This section walks you through a simple Pandas program from which you can learn techniques to:
1. Load a simple CSV file
2. Show a preview of the first n rows and last n rows of the loaded dataset
3. Display information about the loaded dataset, such as:
   1. The count of how many rows and columns were loaded
   2. The column names and their data types
These steps are sketched in the example below.
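A minimal sketch of these steps, assuming the euro_winners.csv file used later in this section:

import pandas as pd

df = pd.read_csv('data/euro_winners.csv')

print(df.head(5))   # preview the first 5 rows
print(df.tail(5))   # preview the last 5 rows
print(df.shape)     # (number of rows, number of columns)
print(df.dtypes)    # column names and their data types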
import pandas as pd
df = pd.read_csv('data/euro_winners.csv', sep=',')
print(df.shape)
print(df.dtypes)
# df2 is assumed to be a subset of df created earlier (e.g. the matches with good attendance)
df2.to_csv('goodattendance.csv')
Loading / Saving Data
Loading Data
• Pandas provides methods to load data from a variety of sources
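For example, a sketch of a few common readers (the file names are placeholders):

import pandas as pd

df_csv = pd.read_csv('data/input.csv')            # comma-separated text
df_tsv = pd.read_csv('data/input.tsv', sep='\t')  # tab-separated text
df_xls = pd.read_excel('data/input.xlsx')         # Excel workbook (needs an engine such as openpyxl)
# pd.read_sql(query, connection) reads the result of a SQL query into a DataFrame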
Saving Data
• Similarly, pandas provides methods to save data to a variety of sources
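And a sketch of the corresponding writers (the file names are placeholders):

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})  # any DataFrame

df.to_csv('output.csv', index=False)     # write to a CSV file without the index column
df.to_excel('output.xlsx', index=False)  # write to an Excel file (needs an engine such as openpyxl)
# df.to_sql('table_name', connection) would write to a database table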
Pandas Data Structures
Overview
• There are three main data structures in pandas
• In this module, we will cover Series and DataFrame only
Series
• A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating-point numbers, Python objects, etc.)
• The axis labels are collectively referred to as the index
• The basic method to create a Series is as follows:
import pandas as pd
ser = pd.Series(data, index=idx)
• data can be many different things: a Python dict, an ndarray, a scalar value (like 5)
• index is a list of axis labels; it is initialized differently depending on the nature of data (see the examples that follow)
import pandas as pd

currencyDict = {
    'US': 'dollar', 'UK': 'pound',
    'Mexico': 'peso', 'China': 'yuan'}
currencySeries = pd.Series(currencyDict)  # the dict keys become the index
import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(5),
index=['a', 'b', 'c', 'd', 'e'])
import pandas as pd
pd.Series(5,
index=['a', 'b', 'c', 'd', 'e'])
DataFrame
• DataFrame is the most commonly used data structure in pandas
• A DataFrame is a 2-dimensional labeled data structure
• You can think of it like a spreadsheet or SQL table, or a dict of Series objects
• The constructor accepts many different types of arguments:
• 2D NumPy array
• Dictionary of 1D NumPy arrays, lists, dictionaries, or Series structures
• Structured or record ndarray
• Series structures
• Another DataFrame structure
• The basic method to create a DataFrame is as follows:
import pandas as pd
df = pd.DataFrame(data, index=index, columns=columns)
# data and df here refer to the objects created in the examples above
print(type(data))
print(type(df))
print(df)
import numpy as np
import pandas as pd

np1 = np.array([1, 2, 3])
np2 = np.array([4, 5, 6])
np3 = np.array([7, 8, 9])
d = {'one': np1,
     'two': np2,
     'three': np3}
df1 = pd.DataFrame(d)
df2 = pd.DataFrame(d,
                   index=['a', 'b', 'c'])

data = np.random.randn(5, 3)
df = pd.DataFrame(data,
                  columns=['i', 'ii', 'iii'],
                  index=['a', 'b', 'c', 'd', 'e'])
d = {'weights': [50, 70.5, 85.3, 43.1],
     'heights': [1.54, 1.73, 1.82, 1.6]}
df1 = pd.DataFrame(d)
df2 = pd.DataFrame(d,
                   index=['Mary', 'John',
                          'Robert', 'Christine'])
d = {
    'one': pd.Series([1, 2, 3],
                     index=['a', 'b', 'c'])}
df = pd.DataFrame(d)
In this case, the column names are derived automatically from the dict keys
stockSummaries = {
    'AMZN': pd.Series([346.15, 589.8, 158.88],
                      index=['Closing price', 'P/E', 'Market Cap(B)']),
    'GOOG': pd.Series([1133.43, 380.64],
                      index=['Closing price', 'Market Cap(B)']),
    'FB': pd.Series([61.48, 150.92],
                    index=['Closing price', 'Market Cap(B)'])}
stockDF = pd.DataFrame(stockSummaries)

# data refers to a collection of records defined on the original slide (e.g. a list of dicts)
df = pd.DataFrame(data)
df2 = pd.DataFrame(data,
                   index=['first', 'second'],
                   columns=['a', 'b', 'c'])
df2
Retrieving Information
Basic Information
Method Description
df.shape Returns (rows, columns)
df.index Describes the index
df.columns Describes the DataFrame columns
df.count() Number of non-NA values
df.info() Info on the DataFrame
df.dtypes Data types of the columns
df.shape
df.index
df.columns
df.count()
df.info()
df.dtypes
Summary
Method Description
df.sum() Sum of values
df.cumsum() Cumulative sum of values
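A minimal illustration of these summary methods on a tiny hypothetical DataFrame:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

print(df.sum())       # column totals
print(df.cumsum())    # running totals down each column
print(df.describe())  # count, mean, std, min, quartiles and max per column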
[The slides that follow apply these summary methods to a sample dataset df.]
df.sum()
df.cumsum()
df.min()
df.max()
df.idxmin()
df.idxmax()
df.describe()
df.mean()
df.median()
Subsetting columns
Method Description
df['width'] or df.width Subset a single column by column name
df[['width','length','species']] Subset multiple columns by column names
df.loc[:, 'A':'C'] Subset a range of columns by column names using loc
df.iloc[:, 2] Subset a single column by its index using iloc
df.iloc[:, [0, -1]] Subset multiple columns by their indices using iloc
df.iloc[:, 0:2] Subset a range of columns by index using iloc
df.loc[:, 'pop'] > 100000 Build a Boolean Series from a column, e.g. for filtering rows
df.filter(regex='^customer') Subset columns whose names match a regular expression, e.g. names starting with 'customer'
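A self-contained sketch of these selections on a hypothetical DataFrame:

import pandas as pd

df = pd.DataFrame({'width': [1.2, 3.4], 'length': [5.0, 6.1],
                   'species': ['setosa', 'virginica'],
                   'customer_id': [101, 102]})

print(df['width'])                   # single column -> Series
print(df[['width', 'length']])       # list of names -> DataFrame
print(df.iloc[:, 0:2])               # first two columns by position
print(df.loc[:, 'width':'species'])  # a label range of columns
print(df.filter(regex='^customer'))  # names starting with 'customer'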
df = pd.read_csv('data/euro_winners.csv', sep=',')
print(df.columns)
df = pd.read_csv('gapminder.tsv', sep='\t')

df = pd.read_csv('classdata/euro_winners.csv', sep=',')
# re1 is a regular expression defined on the original slide
df2 = df.filter(regex=re1)
Subsetting rows
Method Description
df.loc['a':'c'] Select rows by label
df.iloc[10:20, :] Select rows by position
df[df.Length > 7] Select rows by Boolean logic
df.head(n) Select first n rows
df.tail(n) Select last n rows
df.sample(frac=0.5) Randomly select a fraction of the rows
df.sample(n=10) Randomly select n rows
df.nlargest(n, 'value') Select and order top n entries
df.nsmallest(n, 'value') Select and order bottom n entries
df.drop_duplicates() Select unique rows only (duplicates removed)
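A self-contained sketch of a few of these row selections on a hypothetical DataFrame:

import pandas as pd

df = pd.DataFrame({'Length': [3, 8, 9, 8], 'value': [10, 40, 20, 40]})

print(df.head(2))               # first 2 rows
print(df[df.Length > 7])        # rows where Length is greater than 7
print(df.sample(n=2))           # 2 randomly chosen rows
print(df.nlargest(2, 'value'))  # top 2 rows by 'value'
print(df.drop_duplicates())     # duplicate rows removed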
# df here is a dataset with numeric row labels, shown on the original slide
df1 = df.loc[1516045]
df2 = df.loc[[1516045, 1532537]]
df2
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(6, 4),
                  index=list('abcdef'),
                  columns=list('ABCD'))
df1 = df.loc['c':'f']     # rows 'c' through 'f' by label
df1

df1 = df.iloc[2]          # a single row by position
df2 = df.iloc[[2, 4, 5]]  # several rows by position (valid positions here are 0-5)
df2
df = pd.read_csv('studentsdataset.csv')
df
df = pd.read_csv('rainfall.csv')
ST1510 PDAS Topic 3 Part 2: Data Analysis with Pandas
Reshaping Data
Change the layout of a data set
import pandas as pd

df = pd.read_csv("data/euro_winners.csv")
df2 = df.drop(['Runners-up',
               'Runner-UpNation',
               'Venue', 'Attendance'], axis=1)
df2
result = pd.concat([df1, df2])
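A self-contained sketch of concat with two hypothetical DataFrames:

import pandas as pd

df1 = pd.DataFrame({'name': ['Mary', 'John'], 'score': [80, 75]})
df2 = pd.DataFrame({'name': ['Ann', 'Tom'], 'score': [90, 60]})

result = pd.concat([df1, df2], ignore_index=True)  # stack the rows of df2 beneath df1
print(result)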
result = pd.concat([df1, df3])  # df3 is a further DataFrame shown on the original slide
pivot
[The slide shows the DataFrame after pivot.]
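A minimal sketch of pivot on hypothetical long-format data (the column names below are illustrative):

import pandas as pd

long_df = pd.DataFrame({
    'date': ['2024-01', '2024-01', '2024-02', '2024-02'],
    'city': ['Paris', 'Rome', 'Paris', 'Rome'],
    'sales': [10, 20, 30, 40]})

# one row per date, one column per city
wide_df = long_df.pivot(index='date', columns='city', values='sales')
print(wide_df)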
melt (unpivot)
[The slide shows the DataFrame after melt.]
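A minimal sketch of melt on hypothetical wide-format data (the column names below are illustrative):

import pandas as pd

wide_df = pd.DataFrame({
    'student': ['Mary', 'John'],
    'math': [80, 75],
    'science': [90, 85]})

# gather the subject columns into (variable, value) pairs
long_df = wide_df.melt(id_vars='student', var_name='subject', value_name='score')
print(long_df)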
import pandas as pd
df = pd.read_csv("data/Weights.csv")
df2 = df.sort_values(by="weight")
Before sort
After sort
Reindex a DataFrame
import pandas as pd

index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror']
df = pd.DataFrame({
    'http_status': [200, 200, 404, 404, 301],
    'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]}, index=index)

df = df.reset_index()  # move the browser labels into an ordinary column
# rename a column; 'observation' refers to a column in the dataset shown on the original slide
df = df.rename(columns={"observation": "reading"})
Before rename
After rename
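For completeness, a self-contained sketch of reindex itself; the new labels below are illustrative:

import pandas as pd

index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror']
df = pd.DataFrame({'http_status': [200, 200, 404, 404, 301],
                   'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]}, index=index)

# reorder the rows to a new set of labels; labels not in the original index become NaN rows
new_index = ['Safari', 'Chrome', 'IE10', 'Comodo Dragon']
print(df.reindex(new_index))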
Handling Missing Data
import pandas as pd

df = pd.read_csv('data/kidney_disease.csv')
print(df.loc[0:4, 'id':'pcc'])      # print the first few rows and columns
print(pd.isnull(df['rbc']).head())  # check whether 'rbc' has null values
df['rbc'].isnull().sum()            # count the number of null values in 'rbc'
Univariate Imputation
• Fill with a fixed value or one derived from ‘neighbouring’ records
• Mean imputation (or mode for categorical variables)
• Interpolation
Multivariate Imputation
• Regression
• KNN, etc. (a sketch follows below)
• Do note that imputation introduces errors! You are using estimated values in place of the missing ones, and then feeding those estimates to the model. Limit imputation to roughly 5% of the values.
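A minimal sketch of multivariate (KNN) imputation using scikit-learn on hypothetical numeric data; scikit-learn is assumed to be installed:

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])

imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))  # the missing entry is filled from the nearest rows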
dropna
import pandas as pd
df = pd.read_csv('data/kidney_disease.csv')
df2 = df.dropna()
[The slides show the original dataset with NaN values and the same dataset after dropna.]
• The fillna function can “fill in” NA values with non-null data in a few ways:
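A few hedged examples (the file follows the kidney_disease dataset used in these slides; the fill values are illustrative):

import pandas as pd

df = pd.read_csv('data/kidney_disease.csv')

df_zero  = df.fillna(0)                  # replace every NaN with a fixed value
df_ffill = df.ffill()                    # propagate the last valid value forward
df_bfill = df.bfill()                    # use the next valid value backwards
df_dict  = df.fillna({'rbc': 'normal'})  # per-column fill values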
df = pd.read_csv('data/kidney_disease.csv')
[The slides show the original dataset with NaN values and the results of several fillna variants.]
fillna (mean)
import pandas as pd

df = pd.read_csv('data/kidney_disease.csv')

# Why is this step necessary? 'rc' contains non-numeric entries, so it must first be
# coerced to a numeric type (invalid values become NaN) before a mean can be computed.
df['rc'] = pd.to_numeric(df['rc'], errors='coerce')

# Replace NA with the mean of the values of the column
df['rc'] = df['rc'].fillna(df['rc'].mean())
[The slides show the dataset before and after filling with the mean.]
df2.head(10)
fillna (mode)
• The mean cannot be computed for categorical variables; use the mode instead (a sketch follows below).
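A minimal sketch of mode imputation, using the 'rbc' column of the kidney_disease dataset as an example:

import pandas as pd

df = pd.read_csv('data/kidney_disease.csv')

# .mode() returns a Series (there may be ties), so take its first entry
df['rbc'] = df['rbc'].fillna(df['rbc'].mode()[0])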
Interpolate
• Interpolation is a mathematical method that fits a function to your data and uses that function to estimate the missing values.
• The simplest type is linear interpolation, which takes the mean of the value before and the value after the missing data.
Interpolate()
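A minimal sketch of linear interpolation on a hypothetical Series:

import pandas as pd

s = pd.Series([1.0, None, None, 4.0])
print(s.interpolate())  # the gaps are filled evenly: 1.0, 2.0, 3.0, 4.0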
Combine Data Sets
merge (how='left')
import pandas as pd
df1 = pd.read_csv('data/mergedataset1s.csv')
df2 = pd.read_csv('data/mergedataset2s.csv')
df = pd.merge(df1, df2, how='left', on='customerid')
[The slides show df1 (mergedataset1s.csv) and df2 (mergedataset2s.csv) before merging, and the merged result.]
merge (how='right')
import pandas as pd
df1 = pd.read_csv('data/mergedataset1s.csv')
df2 = pd.read_csv('data/mergedataset2s.csv')
df = pd.merge(df1, df2, how='right', on='customerid')
[The slides show df1 and df2 before merging, and the merged result.]
merge (how='inner')
import pandas as pd
df1 = pd.read_csv('data/mergedataset1s.csv')
df2 = pd.read_csv('data/mergedataset2s.csv')
df = pd.merge(df1, df2, how='inner', on='customerid')
[The slides show df1 and df2 before merging, and the merged result.]
isin
import pandas as pd
df1 = pd.read_csv('data/mergedataset1s.csv')
df2 = pd.read_csv('data/mergedataset2s.csv')
df = df1[df1.customerid.isin(df2.customerid)]
[The slides show df1 and df2, and the rows of df1 whose customerid appears in df2.]
~isin
import pandas as pd
df1 = pd.read_csv('data/mergedataset1s.csv')
df2 = pd.read_csv('data/mergedataset2s.csv')
df = df1[~df1.customerid.isin(df2.customerid)]
[The slides show df1 and df2, and the rows of df1 whose customerid does not appear in df2.]
Group Data
Study the partial euro_winners dataset shown on the slide, which stores the results of the European club soccer championship since 1955.
Grouping Data
import pandas as pd
df = pd.read_csv("data/euro_winners.csv")
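A hedged example of a simple grouping on this dataset (assuming one row per final and a 'Nation' column, as used later in this section):

titles_by_nation = df.groupby('Nation').size()  # number of finals won by clubs from each nation
print(titles_by_nation)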
winnersGrp = df.groupby(['Nation', 'Winners'])
clubWins = winnersGrp.size()

att = df.groupby(['Nation', 'Winners'])[['Attendance']].sum()
Applying functions to Pandas Series and DataFrames
def toFahrenheit(x):
    # convert a temperature in Celsius to Fahrenheit
    return x * 9 / 5 + 32

series = series.apply(toFahrenheit)
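A quick usage sketch with the toFahrenheit function defined above and a hypothetical Series:

import pandas as pd

temps_c = pd.Series([0, 25, 100])
print(temps_c.apply(toFahrenheit))  # 32.0, 77.0, 212.0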
import pandas as pd
import numpy as np

data = np.random.randint(1, 10, (3, 2))
df = pd.DataFrame(data,
                  index=['Student 1', 'Student 2', 'Student 3'],
                  columns=['Reward 1', 'Reward 2'])

def multiply(x):
    return x * 2

# dataset before applying the function: the random values above; after: every value doubled
df = df.apply(multiply)
The End