

Topic 3 – Data Analysis with Pandas


https://pandas.pydata.org/docs/user_guide

Contents
• Intro to Pandas
• A first taste of Pandas
• Pandas Data Structures
  • Series, DataFrames
• Loading and Saving Data
  • Read/Write CSV, Write Excel, SQL
• Retrieving Information
  • Basic Information, Summary
• Selections
  • Subsetting rows, Subsetting columns
• Reshaping Data
  • Dropping values, Merging, pivot, unpivot, sorting, reindexing
• Handling Missing Data
  • Dropping Missing Data, Filling Missing Data
• Group Data
  • Groupby
• Applying Functions
  • apply, lambda
Intro to Pandas


What is Pandas?
• Pandas is a high-performance open source library for data analysis in Python
• De-facto standard library for data analysis using Python

• Widespread adoption with a large community behind it (817 contributors + 15,330 commits as of Jul 2017)
• Rapid iteration, with features and enhancements continuously made
• https://github.com/pandas-dev/pandas


Key Features of Pandas


• Can process a variety of data sets in different formats: tabular heterogeneous data, time series, and matrix data.
• Can load/import data from varied sources such as CSV, Excel and DB/SQL.
• Allows manipulation of datasets such as subsetting, slicing, filtering, merging, groupby, re-ordering, and re-shaping.
• Allows handling of missing data according to rules defined by the user/developer: ignore, convert to 0, and so on.
• Can be used for parsing and munging (conversion) of data as well as for modeling and statistical analysis.
• Integrates well with other Python libraries such as statsmodels, SciPy, and scikit-learn.
• Delivers fast performance, which can be sped up even further with Cython (C extensions for Python).

What problem does pandas solve?


• Python has long been great for data munging and preparation, but less so for data analysis and modelling
• pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain-specific language like R
• pandas does not implement significant modeling functionality outside of linear and panel regression; for this, look to statsmodels and scikit-learn

A first taste of Pandas


Objectives
This section walks you through a simple Pandas program that demonstrates how to:
1. Load a simple CSV file
2. Show a preview of the first n rows and last n rows of the loaded dataset
3. Display information about the loaded dataset, such as:
   • The count of how many rows and columns were loaded
   • The column names and their datatypes
4. Extract subsets of the dataset
5. Save a subset of the data


Load a simple delimited file

import pandas as pd

df = pd.read_csv('data/euro_winners.csv', sep=',')


Show first/last n rows of dataset


import pandas as pd

df = pd.read_csv('data/euro_winners.csv', sep=',')

# Get the first 5 rows of the dataset
df.head(5)

# Get the last 3 rows of the dataset
df.tail(3)


Count how many cols and rows loaded

import pandas as pd

df = pd.read_csv('data/euro_winners.csv', sep=',')

print(df.shape)


Print out the columns and their datatypes

import pandas as pd

df = pd.read_csv('data/euro_winners.csv', sep=',')

print(df.dtypes)


Extract subsets of the dataset


import pandas as pd

df = pd.read_csv('data/euro_winners.csv', sep=',')

# Extract only the three named columns
df[['Season', 'Nation', 'Winners']]

# Extract only rows with attendance more than 100,000
df2 = df[df['Attendance'] > 100000]
df2


Save a subset of the data


import pandas as pd

df = pd.read_csv('data/euro_winners.csv', sep=',')

# Extract only rows with attendance more than 100,000
df2 = df[df['Attendance'] > 100000]

df2.to_csv('goodattendance.csv')

Loading / Saving Data


Loading Data
• Pandas provides methods to load data from a variety of sources

Loading data from csv, Excel, HTML, json, SQL


csv_dataframe = pd.read_csv('my_dataset.csv', sep=',')
xls_dataframe = pd.read_excel('my_dataset.xlsx', 'Sheet1', na_values=['NA', '?'])
table_dataframe= pd.read_html('http://page.com/with/table.html')[0]
json_dataframe = pd.read_json('my_dataset.json', orient='columns')
from sqlalchemy import create_engine
engine = create_engine('sqlite:///:memory:')
sql_dataframe = pd.read_sql_table('my_table', engine, columns=['ColA', 'ColB'])
sql_dataframe = pd.read_sql("SELECT * FROM my_table;", engine)


Saving Data
• Similarly, pandas provides methods to save data to a variety of formats and destinations

Saving data to csv, Excel, HTML, json, SQL


my_dataframe.to_csv('dataset.csv')
my_dataframe.to_excel('dataset.xlsx')
html_code = my_dataframe.to_html()
my_dataframe.to_json('dataset.json')
from sqlalchemy import create_engine
engine = create_engine('sqlite:///:memory:')
sql_dataframe.to_sql('my_table', engine)

Pandas Data Structures


Overview
• There are three main data structures in pandas: Series, DataFrame, and Panel
• In this module, we will cover Series and DataFrame only (Panel has since been deprecated and removed in recent versions of pandas)


Series
• Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.)
• The axis labels are collectively referred to as the index
• The basic method to create a Series is as follows:

import pandas as pd
ser = pd.Series(data, index=idx)

• data can be many different things: a Python dict, an ndarray, a scalar value (like 5)
• index is a list of axis labels and is initialized differently depending on the nature of data (see next slides for examples)

Creating Series (using Python dictionary)


• If data is a dict and index is passed, the values in data corresponding to the labels in the index will be pulled out
• Otherwise, an index will be constructed from the keys of the dict (sorted in older versions of pandas; in insertion order in current versions)

import pandas as pd

currencyDict ={
'US':'dollar', 'UK':'pound',
'Mexico':'peso', 'China':'yuan'}

currencySeries = pd.Series(currencyDict)


Creating Series (Using numpy.ndarray)


• If data is an ndarray, index must be the same length as data
• If no index is passed, one will be created having values [0, ..., len(data) - 1].

import pandas as pd
import numpy as np

s = pd.Series(np.random.randn(5),
index=['a', 'b', 'c', 'd', 'e'])


Creating Series (using scalar values)


• If data is a scalar value, an index must be provided
• The value will be repeated to match the length of index

import pandas as pd

pd.Series(5,
index=['a', 'b', 'c', 'd', 'e'])


DataFrame
• DataFrame is the most commonly used data structure in pandas
• DataFrame is a 2-dimensional labeled data structure
• You can think of it like a spreadsheet or SQL table, or a dict of Series objects
• The constructor accepts many different types of arguments
• 2D NumPy array
• Dictionary of 1D NumPy arrays, lists, dictionaries, or Series structures
• Structured or record ndarray
• Series structures
• Another DataFrame structure


DataFrame
• The basic method to create a DataFrame is as follows:

import pandas as pd
df = pd.DataFrame(data, index, columns)

• data : numpy ndarray, dict, Series or DataFrame
• index : Index to use for the resulting frame. Defaults to np.arange(n)
• columns : Column labels to use for the resulting frame. Defaults to np.arange(n)


Create DataFrame (Using 2d Numpy Array)


import pandas as pd
import numpy as np

# data is a 10x5 ndarray of random numbers
data = np.random.randn(10, 5)
df = pd.DataFrame(data)

print(type(data))
print(type(df))
print(df)


Create DataFrame (dict of 1d Numpy array)


import pandas as pd
import numpy as np

np1 = np.array([1,2,3])
np2 = np.array([4,5,6])
np3 = np.array([7,8,9])

d = {'one' : np1,
'two' : np2,
'three': np3}

df1 = pd.DataFrame(d)
df2 = pd.DataFrame(d,
index=['a', 'b', 'c'])

In this case, the column names are derived automatically from the dict keys

Create DataFrame (Using 2d Numpy Array)


import pandas as pd
import numpy as np

data = np.random.randn(5, 3)
df = pd.DataFrame(data,
columns=['i','ii','iii'],
index=['a','b','c','d','e']
)

You can specify the column names and index labels when creating the DataFrame object


Create DataFrame (dict of lists)


import pandas as pd

d = {'weights' :[50,70.5,85.3,43.1],
'heights' : [1.54,1.73,1.82,1.6]}

df1 = pd.DataFrame(d)
df2 = pd.DataFrame(d,
index=['Mary', 'John',
'Robert', 'Christine'])

In this case, the column names are derived automatically from the dict keys


Create DataFrame (dict of Series)


import pandas as pd

d = {
    'one' : pd.Series([1, 2, 3],
                      index=['a', 'b', 'c']),
    'two' : pd.Series([1, 2, 3, 4],
                      index=['a', 'b', 'c', 'd'])
}

df = pd.DataFrame(d)

In this case, the column names are derived automatically from the dict keys


Create DataFrame (dict of Series)


import pandas as pd

stockSummaries={
'AMZN':pd.Series([346.15, 589.8,158.88],
index=['Closing price','P/E','Market Cap(B)']),
'GOOG':pd.Series([1133.43, 380.64],
index=['Closing price','Market Cap(B)']),
'FB':pd.Series([61.48, 150.92],
index=['Closing price','Market Cap(B)'])}

stockDF = pd.DataFrame(stockSummaries)

In this case, the column names are derived automatically from the dict keys

Create DataFrame (structured record/array)


import pandas as pd

data = [(1, 2, 'Hello'), (2, 3, 'World')]

df = pd.DataFrame(data)
df2 = pd.DataFrame(data,
                   index=['first', 'second'],
                   columns=['a', 'b', 'c'])
df2


Create DataFrame (list of dicts)


import pandas as pd

data = [{'a': 1, 'b': 2},
        {'a': 5, 'b': 10, 'c': 20}]

df = pd.DataFrame(data)

In this case, the column names are derived automatically from the dict keys; entries missing a key (like 'c' in the first row) are filled with NaN

Retrieving Information


Basic Information
Method       Description
df.shape     Returns (rows, columns)
df.index     Describes the index
df.columns   Describes the DataFrame columns
df.count()   Number of non-NA values
df.info()    Info on the DataFrame
df.dtypes    Data types
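A minimal sketch of these inspection methods on a small, made-up DataFrame (hypothetical data, not one of the course datasets):

import pandas as pd

df = pd.DataFrame({'name': ['Ann', 'Ben', 'Cal'],
                   'score': [85, 92, None]})

print(df.shape)    # (3, 2) - 3 rows, 2 columns
print(df.index)    # RangeIndex(start=0, stop=3, step=1)
print(df.columns)  # Index(['name', 'score'], dtype='object')
print(df.count())  # non-NA values per column: name 3, score 2
print(df.dtypes)   # name: object, score: float64
df.info()          # a concise summary of all of the above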

[Slides showing example output of df.shape, df.index, df.columns, df.count(), df.info() and df.dtypes on the loaded dataset]

Summary
Method                     Description
df.sum()                   Sum of values
df.cumsum()                Cumulative sum of values
df.min(), df.max()         Minimum / maximum values
df.idxmin(), df.idxmax()   Index labels of the minimum / maximum values
df.describe()              Summary statistics
df.mean()                  Mean of values
df.median()                Median of values
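A minimal sketch of these summary methods on a small, made-up numeric DataFrame:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

print(df.sum())       # a: 6, b: 60
print(df.cumsum())    # running totals down each column
print(df.min())       # a: 1, b: 10
print(df.idxmax())    # index label of each column's maximum: 2
print(df.mean())      # a: 2.0, b: 20.0
print(df.median())    # a: 2.0, b: 20.0
print(df.describe())  # count, mean, std, min, quartiles, max per column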

[Slides showing example output of df.sum(), df.cumsum(), df.min(), df.max(), df.idxmin(), df.idxmax(), df.describe(), df.mean() and df.median() on the original dataset df]

Subsetting columns
Method                             Description
df['width'] or df.width            Subset a single column by column name
df[['width','length','species']]   Subset multiple columns by column names
df.loc[:, 'A':'C']                 Subset a range of columns by column names using loc
df.iloc[:, 2]                      Subset a single column by its index using iloc
df.iloc[:, [0, -1]]                Subset multiple columns by their indices using iloc
df.iloc[:, 0:2]                    Subset a range of columns by index using iloc
df.loc[:, 'pop'] > 100000          Build a Boolean mask from a column for filtering
df.filter(regex='^customer')       Subset columns whose names match a regular expression,
                                   e.g. names starting with 'customer'


Subsetting columns (by column name)


import pandas as pd

df = pd.read_csv('data/euro_winners.csv', sep=',')
print(df.columns)

# Subset a single column by column name
df1 = df['Season']
df2 = df.Season

# Subset multiple columns by column names
df3 = df[['Season', 'Winners', 'Score']]

# Subset a range of columns using loc
df4 = df.loc[:, 'Season':'Score']


Subsetting columns (by index)


import pandas as pd

df = pd.read_csv('gapminder.tsv', sep='\t')

# Subset a single column by its index
df0 = df.iloc[:, 0]

# Subset multiple columns by their indices
df2 = df.iloc[:, [0, -1]]

# Subset a range of columns using iloc
df3 = df.iloc[:, 0:3]

Subsetting columns by index



Subsetting columns (by boolean logic)


import pandas as pd

df = pd.read_csv('data/euro_winners.csv', sep=',')

# Boolean mask: rows where attendance exceeds 100,000
mask = df.loc[:, 'Attendance'] > 100000

df[mask][['Season', 'Venue', 'Attendance']]

Subsetting columns by Boolean logic



Subsetting columns (by reg expression)


import pandas as pd

df = pd.read_csv('data/euro_winners.csv', sep=',')

re1 = 'er'    # Match names containing 'er'
re2 = 'on$'   # Match names ending with 'on'
re3 = '^S'    # Match names starting with 'S'
re4 = '[n]+'  # Match names with at least one 'n' (one or more 'n')

df2 = df.filter(regex=re1)

Subsetting columns by regular expressions


Subsetting Rows

Subsetting rows
Method                    Description
df.loc['a':'c']           Select rows by label
df.iloc[10:20, :]         Select rows by index
df[df.Length > 7]         Select rows by Boolean logic
df.head(n)                Select first n rows
df.tail(n)                Select last n rows
df.sample(frac=0.5)       Randomly select a fraction of rows
df.sample(n=10)           Randomly select n rows
df.nlargest(n, 'value')   Select and order top n entries
df.nsmallest(n, 'value')  Select and order bottom n entries
df.drop_duplicates()      Select unique rows only (duplicates removed)
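The slides that follow demonstrate selection by label, index, Boolean logic and regular expressions; here is a brief sketch of the remaining methods from the table, assuming the euro_winners dataset with its 'Attendance' column:

import pandas as pd

df = pd.read_csv('data/euro_winners.csv', sep=',')

half = df.sample(frac=0.5)               # a random half of the rows
ten = df.sample(n=10)                    # 10 random rows
top5 = df.nlargest(5, 'Attendance')      # the 5 best-attended finals
bottom5 = df.nsmallest(5, 'Attendance')  # the 5 worst-attended finals
unique_rows = df.drop_duplicates()       # remove exact duplicate rows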

Subsetting rows (by label)


Subsetting rows by label

import pandas as pd

df = pd.read_csv('studentsdataset.csv',
                 index_col='StudentID')

# Select rows by index label
df1 = df.loc[1516045]
df2 = df.loc[[1516045, 1532537]]


Subsetting rows (by label)


import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(6, 4),
                  index=list('abcdef'),
                  columns=list('ABCD'))

# Select rows with labels 'c' through 'f' (inclusive)
df1 = df.loc['c':'f']

Subsetting rows by label



Subsetting rows (by index)


Subsetting rows by index

import pandas as pd

df = pd.read_csv('studentsdataset.csv')

# Select rows by integer position
df1 = df.iloc[2]
df2 = df.iloc[[2, 4, 6]]


Subsetting rows (by boolean logic)


Subsetting rows by Boolean logic

import pandas as pd

df = pd.read_csv('studentsdataset.csv')

# Select only rows where StudentCourse is DBIT
df1 = df[df.StudentCourse == 'DBIT']

# Select rows where StudentCourse is DBIT or DIT
df2 = df[df.StudentCourse.isin(['DBIT', 'DIT'])]


Subsetting rows (by regex matching)


Subsetting rows by regular expression matching

import pandas as pd

df = pd.read_csv('rainfall.csv')

# Regular expression - starts with '2017'
re2017 = '^2017'

# Extract rows whose month starts with 2017
df2017 = df[df['month'].str.contains(re2017)]

ST1510 PDAS
Topic 3 Part 2 – Data Analysis with Pandas
Reshaping Data
Change the layout of a data set


Change the layout of datasets


Method                                 Description
df.drop(['Length','Height'], axis=1)   Drop columns from a DataFrame
pd.concat([df1, df2])                  Append rows to DataFrames (axis=0)
pd.concat([df1, df2], axis=1)          Append columns to DataFrames (axis=1)
df.pivot()                             Spread rows into columns
pd.melt(df)                            Gather columns into rows
df.sort_values('mpg')                  Sort DataFrame by column values
df.sort_index()                        Sort DataFrame by index values
df.reindex()                           Conform DataFrame to a new index
df.reset_index()                       Reset the index of a DataFrame
df.rename(columns={'y':'year'})        Rename the columns of a DataFrame
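Each of these methods is demonstrated on the course datasets in the slides that follow; as a self-contained preview, here is a tiny pivot/melt round trip on made-up weights data (hypothetical values mirroring Weights.csv):

import pandas as pd

df = pd.DataFrame({'observation': [1, 1, 2, 2],
                   'gender': ['F', 'M', 'F', 'M'],
                   'weight': [55.0, 70.0, 58.0, 72.0]})

# pivot: one row per observation, one column per gender value
df_wide = df.pivot(index='observation', columns='gender', values='weight')

# melt: back to long form, one (variable, value) pair per row
df_long = pd.melt(df_wide.reset_index(), id_vars=['observation'])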

Drop columns from a DataFrame (drop)

import pandas as pd

df = pd.read_csv("data/euro_winners.csv")
df2 = df.drop(['Runners-up',
               'Runner-UpNation',
               'Venue', 'Attendance'], axis=1)
df2

Append rows to DataFrames (concat)


import pandas as pd

df1 = pd.read_csv("data/Weights1.csv",
                  index_col=['name'])
df2 = pd.read_csv("data/Weights2.csv",
                  index_col=['name'])

# Stack the rows of df2 beneath df1 (axis=0 is the default)
result = pd.concat([df1, df2])

Append columns to DataFrames (concat)


import pandas as pd

df1 = pd.read_csv("data/Weights1.csv",
                  index_col=['name'])
df2 = pd.read_csv("data/Weights2.csv",
                  index_col=['name'])

# Align on the 'name' index and place the columns of df2 beside df1
result = pd.concat([df1, df2], axis=1)

Append rows to DataFrames (concat)


import pandas as pd

df1 = pd.read_csv("data/Weights1.csv", index_col=['name'])
df3 = pd.read_csv("data/Heights2.csv", index_col=['name'])

# Columns that do not overlap between df1 and df3 are filled with NaN
result = pd.concat([df1, df3])
Concat – one more example

[Slide showing a further concat example; note that one extra row from df3 is inserted into the result]

Spread rows into columns (pivot)


import pandas as pd

df = pd.read_csv("data/Weights.csv")

# 'observation' becomes the index, the values of 'gender'
# become the columns, and 'weight' fills the cells
df2 = df.pivot(index='observation', columns='gender', values='weight')

[Slides showing the dataset before and after the pivot]

Gather columns into rows (melt)


import pandas as pd

df = pd.read_csv("data/Weights.csv")

# All columns other than 'observation' and 'gender'
# are gathered into variable/value pairs
df3 = pd.melt(df, id_vars=['observation', 'gender'])

[Slides showing the dataset before and after the melt]

Sort DataFrame by column values


DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

import pandas as pd

df = pd.read_csv("data/Weights.csv")
df2 = df.sort_values(by="weight")

[Slides showing the dataset before and after the sort]

Sort DataFrame by index values


import pandas as pd
import numpy as np

dates = pd.date_range('20130101', periods=6)

df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

# Sort rows by index in descending order
df2 = df.sort_index(ascending=False)

Sort DataFrame by index values (2)


import pandas as pd

df = pd.read_csv('data/stock_index_prices.csv')
df2 = df.set_index(['TradingDate', 'PriceType'])

# Sort by the 'PriceType' level of the MultiIndex
df3 = df2.sort_index(ascending=False, level="PriceType")

Reindex a DataFrame
import pandas as pd

index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror']
df = pd.DataFrame({
    'http_status': [200, 200, 404, 404, 301],
    'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]}, index=index)

# Labels not present in the original index get rows of NaN
new_index = ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10', 'Chrome']
df2 = df.reindex(new_index)

[Slides showing the DataFrame before and after reindexing]

Reset the index of a DataFrame


import pandas as pd

df = pd.read_csv('data/stock_index_prices.csv')
df = df.set_index(['TradingDate', 'PriceType'])

# Move the index levels back into ordinary columns
df = df.reset_index()

[Slides showing the DataFrame before and after resetting the index]

Rename the columns of a DataFrame


import pandas as pd

df = pd.read_csv("data/Weights.csv")
df = df.rename(columns={"observation": "reading"})

[Slides showing the DataFrame before and after the rename]
Handling Missing Data


When / why does data become missing?


• Many data sets simply arrive with missing data, either because it exists and was not collected, or because it never existed
• For example, in a collection of financial time series, some of the time series might start on different dates; values prior to the start date would generally be marked as missing


Example: how reindexing causes missing data

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(3, 3), index=['a', 'b', 'c'],
                   columns=['one', 'two', 'three'])

# Row 'd' does not exist in df1, so it is filled with NaN
df2 = df1.reindex(['a', 'b', 'c', 'd'])


Handling Missing Data


Method            Description
isnull()          Detect missing values (NaN in numeric arrays, None/NaN in object arrays)
notnull()         Opposite of isnull(); a replacement for numpy.isfinite / ~numpy.isnan that is suitable for use on object arrays
df.dropna()       Drop rows with any column having NA/null data
df.fillna()       Replace all NA/null data with a value
df.interpolate()  Uses interpolation to 'best guess' missing numeric values
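A minimal sketch of detection and the two basic strategies on a tiny, made-up frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': ['x', 'y', None]})

print(df.isnull())    # True wherever a value is missing
print(df.notnull())   # element-wise complement of isnull()
print(df.dropna())    # keeps only the fully populated first row
print(df.fillna({'a': 0, 'b': 'unknown'}))  # per-column replacement values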


Handling Missing Data in Pandas


• isnull() and notnull() functions in pandas may be used on both Series and DataFrame
objects to detect missing values

import pandas as pd

df = pd.read_csv('data/kidney_disease.csv')
print(df.loc[0:4, 'id':'pcc'])      # print the first few rows and columns
print(pd.isnull(df['rbc']).head())  # check if 'rbc' has null values
df['rbc'].isnull().sum()            # count the null values in 'rbc'


Values considered missing


• As data comes in many shapes and forms, pandas aims to be flexible with regard to handling missing data
• While NaN is the default missing value marker for reasons of computational speed and convenience, we need to be able to easily detect this value with data of different types: floating point, integer, boolean, and general object
• In many cases, however, the Python None will arise, and we wish to also consider that "missing" or "null"
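For example, None is treated as missing just like NaN (a tiny, hypothetical check):

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, None])  # the None is stored as NaN in this float Series
print(s.isnull())                   # False, True, True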


Handling Missing Data


• There are many ways to handle missing data. Some common methods include:
  • Drop them! Simple, and will not introduce errors through imputation
  • Univariate imputation
    • Fill with a fixed value, or with a value derived from 'neighbouring' records
    • Mean imputation (mode for categorical data)
    • Interpolation
  • Multivariate imputation (see the sketch after this list)
    • Regression
    • KNN, etc.
• Do note that imputation introduces errors! You are using estimates to fill in the missing values, then using those estimates again for the model. Limit imputation to about 5% of the data.
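For the multivariate case, a minimal KNN-imputation sketch, assuming scikit-learn is installed (scikit-learn is not otherwise covered in this topic, and the numbers are made up):

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])

# Each missing entry is replaced by averaging the k nearest rows;
# here the NaN becomes (1.0 + 7.0) / 2 = 4.0
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)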

dropna
import pandas as pd

df = pd.read_csv('data/kidney_disease.csv')
df2 = df.dropna()

Original dataset with NaN values. See the next slide for comparison after using dropna


[Slide showing the dataset after dropna: all rows containing NaN values have been removed]

fillna (scalar values)

• The fillna function can "fill in" NA values with non-null data in a few ways:
  • scalar values
  • fill gaps forwards
  • fill gaps backwards

fillna (scalar values)


import pandas as pd

df = pd.read_csv('data/kidney_disease.csv')

# Replace NA with a scalar value
df2 = df.fillna(0)

Original dataset with NaN values. See the next slide for a comparison after using fillna.

[Slide showing the dataset after fillna(0): NaN values replaced with 0]

fillna (fill gaps forward)


import pandas as pd

df = pd.read_csv('data/kidney_disease.csv')

# Fill gaps forwards (newer pandas versions prefer df.ffill())
df2 = df.fillna(method='ffill')

Original dataset with NaN values. See the next slide for a comparison after using fillna.

[Slide showing the dataset after forward fill: each NaN takes the previous non-null value in its column]

fillna (fill gaps backward)


import pandas as pd

df = pd.read_csv('data/kidney_disease.csv')

# Fill gaps backwards (newer pandas versions prefer df.bfill())
df2 = df.fillna(method='bfill')

Original dataset with NaN values. See the next slide for a comparison after using fillna.

[Slide showing the dataset after backward fill: each NaN takes the next non-null value in its column]

fillna (mean)
import pandas as pd

df = pd.read_csv('data/kidney_disease.csv')

# Coerce 'rc' to numeric first; stray strings would otherwise
# prevent computing a mean (this is why this step is necessary)
df.rc = pd.to_numeric(df.rc, errors='coerce')

# Replace NA with the mean of the values of the column
df.rc = df['rc'].fillna(df['rc'].mean())

Original dataset with NaN values. See the next slide for a comparison after using fillna.

[Slide showing the dataset after fillna with the mean: NaN values in 'rc' replaced by the column mean]

Handling Missing Data – fillna (mode)


• The mean cannot be computed for categorical variables. You can use the mode instead.

import pandas as pd

df = pd.read_csv('data/kidney_disease.csv')
count1 = df['rbc'].isnull().sum()
print('Number of NaN values in rbc:', count1)

# Replace NaN in all columns with each column's mode value
df2 = df.fillna(df.mode().iloc[0])
count2 = df2['rbc'].isnull().sum()  # now 0
print('Number of NaN values in rbc:', count2)

df2.head(10)
[Slide showing the dataset after fillna with the mode: NaN values in 'rbc' replaced by the most frequent value]

Interpolate
• Interpolation is a mathematical method that fits a function to your data and uses this function to estimate the missing values.
• The simplest type of interpolation is linear interpolation, which takes the mean of the value before and the value after the missing data.

df2 = df.interpolate()  # default linear method, for numeric columns

[Slide showing the dataset before and after interpolate()]
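A small worked example with made-up numbers: the NaN sits between 2.0 and 4.0, so linear interpolation fills it with their mean, 3.0.

import numpy as np
import pandas as pd

s = pd.Series([2.0, np.nan, 4.0])
print(s.interpolate())  # 2.0, 3.0, 4.0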
Combine Data Sets


Combine Data Sets


Code Example                              Description
pd.merge(adf, bdf, how='left', on='x1')   Get all rows from adf, and only those rows from bdf that have a match with adf on column x1
pd.merge(adf, bdf, how='right', on='x1')  Get all rows from bdf, and only those rows from adf that have a match with bdf on column x1
pd.merge(adf, bdf, how='inner', on='x1')  Join only the matching rows from both adf and bdf
pd.merge(adf, bdf, how='outer', on='x1')  Join all rows from both adf and bdf
adf[adf.x1.isin(bdf.x1)]                  All rows in adf that have a match in bdf
adf[~adf.x1.isin(bdf.x1)]                 All rows in adf that do not have a match in bdf
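Before walking through each join type on the course datasets, here is a self-contained sketch with two tiny, made-up frames:

import pandas as pd

adf = pd.DataFrame({'x1': ['A', 'B', 'C'], 'x2': [1, 2, 3]})
bdf = pd.DataFrame({'x1': ['B', 'C', 'D'], 'x3': [True, False, True]})

left = pd.merge(adf, bdf, how='left', on='x1')    # A, B, C (x3 is NaN for A)
inner = pd.merge(adf, bdf, how='inner', on='x1')  # B and C only
outer = pd.merge(adf, bdf, how='outer', on='x1')  # A, B, C and D

matched = adf[adf.x1.isin(bdf.x1)]     # rows B and C
unmatched = adf[~adf.x1.isin(bdf.x1)]  # row A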


merge (how='left')
import pandas as pd
df1 = pd.read_csv('data/mergedataset1s.csv')
df2 = pd.read_csv('data/mergedataset2s.csv')
df = pd.merge(df1, df2, how='left', on='customerid')

Original datasets (df1 from mergedataset1s.csv, df2 from mergedataset2s.csv) before merging. See the next slide for the merged result.

[Slide showing the result of merge with how='left']

merge (how='right')
import pandas as pd
df1 = pd.read_csv('data/mergedataset1s.csv')
df2 = pd.read_csv('data/mergedataset2s.csv')
df = pd.merge(df1, df2, how='right', on='customerid')

Original datasets (df1 from mergedataset1s.csv, df2 from mergedataset2s.csv) before merging. See the next slide for the merged result.

[Slide showing the result of merge with how='right']

merge (how='inner')
import pandas as pd
df1 = pd.read_csv('data/mergedataset1s.csv')
df2 = pd.read_csv('data/mergedataset2s.csv')
df = pd.merge(df1, df2, how='inner', on='customerid')

Original datasets (df1 from mergedataset1s.csv, df2 from mergedataset2s.csv) before merging. See the next slide for the merged result.

[Slide showing the result of merge with how='inner']

isin
import pandas as pd
df1 = pd.read_csv('data/mergedataset1s.csv')
df2 = pd.read_csv('data/mergedataset2s.csv')
df = df1[df1.customerid.isin(df2.customerid)]

Original datasets (df1 from mergedataset1s.csv, df2 from mergedataset2s.csv) before filtering. See the next slide for the result.

[Slide showing the rows of df1 whose customerid appears in df2]

~isin
import pandas as pd
df1 = pd.read_csv('data/mergedataset1s.csv')
df2 = pd.read_csv('data/mergedataset2s.csv')
df = df1[~df1.customerid.isin(df2.customerid)]

Original datasets (df1 from mergedataset1s.csv, df2 from mergedataset2s.csv) before filtering. See the next slide for the result.

[Slide showing the rows of df1 whose customerid does not appear in df2]
Group Data


Group data

Code Example                   Description
df1 = df.groupby(by='col')     Return a GroupBy object, grouped by values in column named 'col'
df1 = df.groupby(level='ind')  Return a GroupBy object, grouped by values in index level named 'ind'
df1.groups                     The dictionary of groups
len(df1.groups)                Number of groups in the dictionary
size()                         Size of each group
agg(function)                  Aggregate each group using function
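The slides below use size() and sum(); agg() generalises this to arbitrary functions. A brief sketch on the euro_winners data (column names as used elsewhere in this topic):

import pandas as pd

df = pd.read_csv("data/euro_winners.csv")

# Several aggregates of Attendance per Nation in a single call
stats = df.groupby('Nation')['Attendance'].agg(['count', 'mean', 'max'])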

Study the partial dataset below, which stores the results of the European club soccer championship since 1955.

[Slide showing the partial euro_winners dataset]

Suppose we want to find the countries that won the championship the most times. We can rank the nations by the number of European club championships they have won.

We can achieve this by using the groupby function to group the data by Nation.

We can then use size() to count how many times each Nation appears in the dataset (each appearance indicating a win for that nation). This returns a Series, with Nation as the index.

Grouping Data
import pandas as pd

df = pd.read_csv("data/euro_winners.csv")

# Returns a DataFrameGroupBy object
nationsGrp = df.groupby(['Nation'])

# size() returns a Series, indexed by 'Nation'
nationWins = nationsGrp.size()
nationWins.sort_values(ascending=False)

Let's say we want to find not the best country, but the best clubs! We can find the total wins not only by country, but by club as well.

We can achieve this by using a multicolumn groupby in Pandas.

Grouping Data
import pandas as pd

df = pd.read_csv("data/euro_winners.csv")

# For a further breakdown by country and club,
# apply a multicolumn groupby
winnersGrp = df.groupby(['Nation', 'Winners'])

# size() returns a Series indexed by both
# 'Nation' and 'Winners'
clubWins = winnersGrp.size()
clubWins.sort_values(ascending=False)

Knowing the best teams, we now wish to sum up the attendance for their matches. Are the matches where the best teams win also the most popular (highest attendance)?

To do this, we need to aggregate the data in the Attendance column, summing the attendance figures for each matching Nation and Winners pair.

We can then sort the result if we want.

Aggregate Grouped Data


import pandas as pd

df = pd.read_csv("data/euro_winners.csv")

# Aggregate by summing the Attendance column per group
att = df.groupby(['Nation', 'Winners'])[['Attendance']].sum()

Aggregate Grouped Data


import pandas as pd

df = pd.read_csv("data/euro_winners.csv")

# Aggregate by summing the Attendance column per group
att = df.groupby(['Nation', 'Winners'])[['Attendance']].sum()

# You can choose to sort by Attendance
att.sort_values(ascending=False, by='Attendance')

With the Series of wins and the DataFrame of attendance, we can now join these two datasets to produce the final table on the right.

Do you see a trend between the Attendance and the number of Wins?

Aggregate Grouped Data


import pandas as pd

df = pd.read_csv("euro_winners.csv")

winnersGrp = df.groupby(['Nation', 'Winners'])
clubWins = winnersGrp.size()

att = df.groupby(['Nation', 'Winners'])[['Attendance']].sum()

clubWins.name = "Wins"  # Name the Series so the joined column has a label

attendance = att.join(clubWins)
attendance.sort_values(ascending=False, by='Attendance')
Applying functions to Pandas Series and DataFrames

apply (for pandas Series)

import pandas as pd

series = pd.Series([20, 21, 12],
                   index=['London', 'New York', 'Helsinki'])

def toFahrenheit(x):
    # Convert a Celsius temperature to Fahrenheit
    return x * 9 / 5 + 32

series = series.apply(toFahrenheit)

[Slides showing the Series before and after applying the function]

apply (for pandas DataFrame)

Applies a function along an axis of the DataFrame

import pandas as pd
import numpy as np

data = np.random.randint(1, 10, (3, 2))
df = pd.DataFrame(data,
                  index=['Student 1', 'Student 2', 'Student 3'],
                  columns=['Reward 1', 'Reward 2'])

def multiply(x):
    return x * 2

df = df.apply(multiply)

[Slides showing the dataset before and after applying the function]

apply (for pandas DataFrame, with lambda)

Applies a function along an axis of the DataFrame

import pandas as pd

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'year': [1989, 1990, 1994, 1979, 1975]}
df = pd.DataFrame(data)

# Create a capitalization lambda function
capitalizer = lambda x: x.upper()

# Apply the capitalizer function over the column 'name'
df['name'].apply(capitalizer)

[Slides showing the dataset before and after applying the function]
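apply() can also work row by row with axis=1, which the slides above do not show; a small sketch with made-up reward scores:

import pandas as pd

df = pd.DataFrame({'Reward 1': [3, 5], 'Reward 2': [4, 1]})

# For each row, take the larger of the two rewards
best = df.apply(lambda row: row.max(), axis=1)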
The End
