

Topic 3 – Data Analysis with Pandas


https://pandas.pydata.org/docs/user_guide

Contents
• Intro to Pandas
• A first taste of Pandas
• Pandas Data Structures
  • Series, DataFrames
• Loading and Saving Data
  • Read/Write CSV, Write Excel, SQL
• Retrieving Information
  • Basic Information, Summary
• Selections
  • Subsetting rows, Subsetting columns
• Reshaping Data
  • Dropping values, Merging, pivot, unpivot, sorting, reindexing
• Handling Missing Data
  • Dropping Missing Data, Filling Missing Data
• Group Data
  • Groupby
• Applying Functions
  • apply, lambda
Intro to Pandas


What is Pandas?
• Pandas is a high-performance open source library for data analysis in Python
• De-facto standard library for data analysis using Python

• Widespread adoption with a large community behind it (817 contributors + 15,330 commits as of Jul 2017)
• Rapid iteration, with features and enhancements continuously made
• https://github.com/pandas-dev/pandas


Key Features of Pandas


• Can process a variety of data sets in different formats: tabular heterogeneous data, time series, and matrix data.
• Can load/import data from varied sources such as CSV, Excel and DB/SQL.
• Allows manipulation of datasets such as subsetting, slicing, filtering, merging, groupby, re-ordering, and re-shaping.
• Allows handling of missing data according to rules defined by the user/developer: ignore, convert to 0, and so on.
• Can be used for parsing and munging (conversion) of data as well as for modeling and statistical analysis.
• Integrates well with other Python libraries such as statsmodels, SciPy, and scikit-learn.
• Delivers fast performance, which can be sped up even further with Cython (C extensions for Python).

What problem does pandas solve?


• Python has long been great for data munging and preparation, but less so for data analysis and modelling
• pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain-specific language like R
• pandas does not implement significant modeling functionality outside of linear and panel regression; for this, look to statsmodels and scikit-learn

A first taste of Pandas


Objectives
This section walks you through a simple Pandas program that demonstrates how to:
1. Load a simple CSV file
2. Show a preview of the first n rows and last n rows of the loaded dataset
3. Display information about the loaded dataset, such as:
   • The count of how many rows and columns were loaded
   • The column names and their datatypes
4. Extract subsets of the dataset
5. Save a subset of the data


Load a simple delimited file

import pandas as pd

df = pd.read_csv('data/euro_winners.csv', sep=',')


Show first/last n rows of dataset


import pandas as pd

df = pd.read_csv('data/euro_winners.csv', sep=',')

# Get the first 5 rows of the dataset
df.head(5)

# Get the last 3 rows of the dataset
df.tail(3)


Count how many cols and rows loaded

import pandas as pd

df = pd.read_csv('data/euro_winners.csv', sep=',')

print(df.shape)


Print out the columns and their datatypes

import pandas as pd

df = pd.read_csv('data/euro_winners.csv', sep=',')

print(df.dtypes)


Extract subsets of the dataset


import pandas as pd

df = pd.read_csv('data/euro_winners.csv', sep=',')

# Extract only the three named columns
df[['Season', 'Nation', 'Winners']]

# Extract only rows with attendance more than 100,000
df2 = df[df['Attendance'] > 100000]
df2


Save a subset of the data


import pandas as pd

df = pd.read_csv('data/euro_winners.csv', sep=',')

# Extract only rows with attendance more than 100,000
df2 = df[df['Attendance'] > 100000]

df2.to_csv('goodattendance.csv')

Loading / Saving Data


Loading Data
• Pandas provides methods to load data from a variety of sources

Loading data from csv, Excel, HTML, json, SQL


csv_dataframe = pd.read_csv('my_dataset.csv', sep=',')
xls_dataframe = pd.read_excel('my_dataset.xlsx', 'Sheet1', na_values=['NA', '?'])
table_dataframe= pd.read_html('http://page.com/with/table.html')[0]
json_dataframe = pd.read_json('my_dataset.json', orient='columns')
from sqlalchemy import create_engine
engine = create_engine('sqlite:///:memory:')
sql_dataframe = pd.read_sql_table('my_table', engine, columns=['ColA', 'ColB'])
sql_dataframe = pd.read_sql("SELECT * FROM my_table;", engine)


Saving Data
• Similarly, pandas provides methods to save data to a variety of formats and destinations

Saving data to csv, Excel, HTML, json, SQL


my_dataframe.to_csv('dataset.csv')
my_dataframe.to_excel('dataset.xlsx')
html_code = my_dataframe.to_html()
my_dataframe.to_json('dataset.json')
from sqlalchemy import create_engine
engine = create_engine('sqlite:///:memory:')
sql_dataframe.to_sql('my_table', engine)

Pandas Data Structures


Overview
• There are three main data structures in pandas: Series, DataFrame, and Panel
• In this module, we will cover Series and DataFrame only (Panel has since been deprecated and removed in recent versions of pandas)


Series
• Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.)
• The axis labels are collectively referred to as the index
• The basic method to create a Series is as follows:

import pandas as pd
ser = pd.Series(data, index=idx)

• data can be many different things: a Python dict, an ndarray, a scalar value (like 5)
• index is a list of axis labels and is initialized differently depending on the nature of data (see next slides for examples)

Creating Series (using Python dictionary)


• If data is a dict and index is passed, the values in data corresponding to the labels in the index will be pulled out
• Otherwise, an index will be constructed from the keys of the dict (sorted in older versions of pandas; in insertion order in current versions)

import pandas as pd

currencyDict ={
'US':'dollar', 'UK':'pound',
'Mexico':'peso', 'China':'yuan'}

currencySeries = pd.Series(currencyDict)


Creating Series (Using numpy.ndarray)


• If data is an ndarray, index must be the same length as data
• If no index is passed, one will be created having values [0, ..., len(data) - 1].

import pandas as pd
import numpy as np

s = pd.Series(np.random.randn(5),
index=['a', 'b', 'c', 'd', 'e'])


Creating Series (using scalar values)


• If data is a scalar value, an index must be provided
• The value will be repeated to match the length of index

import pandas as pd

pd.Series(5,
index=['a', 'b', 'c', 'd', 'e'])


DataFrame
• DataFrame is the most commonly used data structure in pandas
• DataFrame is a 2-dimensional labeled data structure
• You can think of it like a spreadsheet or SQL table, or a dict of Series objects
• The constructor accepts many different types of arguments
• 2D NumPy array
• Dictionary of 1D NumPy arrays, lists, dictionaries, or Series structures
• Structured or record ndarray
• Series structures
• Another DataFrame structure


DataFrame
• The basic method to create a DataFrame is as follows:

import pandas as pd
df = pd.DataFrame(data, index, columns)

• data : numpy ndarray, dict, Series or DataFrame
• index : Index to use for the resulting frame. Defaults to np.arange(n)
• columns : Column labels to use for the resulting frame. Defaults to np.arange(n)


Create DataFrame (Using 2d Numpy Array)


import pandas as pd
import numpy as np

# data is a 10x5 ndarray of random numbers
data = np.random.randn(10, 5)
df = pd.DataFrame(data)

print(type(data))
print(type(df))
print(df)


Create DataFrame (dict of 1d Numpy array)


import pandas as pd
import numpy as np

np1 = np.array([1,2,3])
np2 = np.array([4,5,6])
np3 = np.array([7,8,9])

d = {'one' : np1,
'two' : np2,
'three': np3}

df1 = pd.DataFrame(d)
df2 = pd.DataFrame(d,
index=['a', 'b', 'c'])

In this case, the column names are derived automatically from the dict keys

Create DataFrame (Using 2d Numpy Array)


import pandas as pd
import numpy as np

data = np.random.randn(5, 3)
df = pd.DataFrame(data,
columns=['i','ii','iii'],
index=['a','b','c','d','e']
)

You can specify the column names and index labels when creating the DataFrame object


Create DataFrame (dict of lists)


import pandas as pd

d = {'weights' :[50,70.5,85.3,43.1],
'heights' : [1.54,1.73,1.82,1.6]}

df1 = pd.DataFrame(d)
df2 = pd.DataFrame(d,
index=['Mary', 'John',
'Robert', 'Christine'])

In this case, the column names are derived automatically from the dict keys


Create DataFrame (dict of Series)


import pandas as pd

d = {
    'one' : pd.Series([1, 2, 3],
                      index=['a', 'b', 'c']),
    'two' : pd.Series([1, 2, 3, 4],
                      index=['a', 'b', 'c', 'd'])
}

df = pd.DataFrame(d)

In this case, the column names are derived automatically from the dict keys


Create DataFrame (dict of Series)


import pandas as pd

stockSummaries={
'AMZN':pd.Series([346.15, 589.8,158.88],
index=['Closing price','P/E','Market Cap(B)']),
'GOOG':pd.Series([1133.43, 380.64],
index=['Closing price','Market Cap(B)']),
'FB':pd.Series([61.48, 150.92],
index=['Closing price','Market Cap(B)'])}

stockDF = pd.DataFrame(stockSummaries)

In this case, the column names are derived automatically from the dict keys

Create DataFrame (structured record/array)


import pandas as pd

data = [(1, 2, 'Hello'), (2, 3, 'World')]

df = pd.DataFrame(data)
df2 = pd.DataFrame(data,
                   index=['first', 'second'],
                   columns=['a', 'b', 'c'])
df2


Create DataFrame (list of dicts)


import pandas as pd

data = [{'a': 1, 'b': 2},
        {'a': 5, 'b': 10, 'c': 20}]

df = pd.DataFrame(data)

In this case, the column names are derived automatically from the dict keys; entries missing a key (like 'c' in the first row) are filled with NaN

Retrieving Information


Basic Information
Method       Description
df.shape     Returns (rows, columns)
df.index     Describes the index
df.columns   Describes the DataFrame columns
df.count()   Number of non-NA values
df.info()    Info on the DataFrame
df.dtypes    Data types
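A minimal sketch of these inspection methods on a small, made-up DataFrame (hypothetical data, not one of the course datasets):

import pandas as pd

df = pd.DataFrame({'name': ['Ann', 'Ben', 'Cal'],
                   'score': [85, 92, None]})

print(df.shape)    # (3, 2) - 3 rows, 2 columns
print(df.index)    # RangeIndex(start=0, stop=3, step=1)
print(df.columns)  # Index(['name', 'score'], dtype='object')
print(df.count())  # non-NA values per column: name 3, score 2
print(df.dtypes)   # name: object, score: float64
df.info()          # a concise summary of all of the above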

[Slides showing example output of df.shape, df.index, df.columns, df.count(), df.info() and df.dtypes on the loaded dataset]

Summary
Method                     Description
df.sum()                   Sum of values
df.cumsum()                Cumulative sum of values
df.min(), df.max()         Minimum / maximum values
df.idxmin(), df.idxmax()   Index labels of the minimum / maximum values
df.describe()              Summary statistics
df.mean()                  Mean of values
df.median()                Median of values
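A minimal sketch of these summary methods on a small, made-up numeric DataFrame:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

print(df.sum())       # a: 6, b: 60
print(df.cumsum())    # running totals down each column
print(df.min())       # a: 1, b: 10
print(df.idxmax())    # index label of each column's maximum: 2
print(df.mean())      # a: 2.0, b: 20.0
print(df.median())    # a: 2.0, b: 20.0
print(df.describe())  # count, mean, std, min, quartiles, max per column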

[Slides showing example output of df.sum(), df.cumsum(), df.min(), df.max(), df.idxmin(), df.idxmax(), df.describe(), df.mean() and df.median() on the original dataset df]

Subsetting columns
Method                             Description
df['width'] or df.width            Subset a single column by column name
df[['width','length','species']]   Subset multiple columns by column names
df.loc[:, 'A':'C']                 Subset a range of columns by column names using loc
df.iloc[:, 2]                      Subset a single column by its index using iloc
df.iloc[:, [0, -1]]                Subset multiple columns by their indices using iloc
df.iloc[:, 0:2]                    Subset a range of columns by index using iloc
df.loc[:, 'pop'] > 100000          Build a Boolean mask from a column for filtering
df.filter(regex='^customer')       Subset columns whose names match a regular expression,
                                   e.g. names starting with 'customer'


Subsetting columns (by column name)


import pandas as pd

df = pd.read_csv('data/euro_winners.csv', sep=',')
print(df.columns)

# Subset a single column by column name
df1 = df['Season']
df2 = df.Season

# Subset multiple columns by column names
df3 = df[['Season', 'Winners', 'Score']]

# Subset a range of columns using loc
df4 = df.loc[:, 'Season':'Score']


Subsetting columns (by index)


import pandas as pd

df = pd.read_csv('gapminder.tsv', sep='\t')

# Subset a single column by its index
df0 = df.iloc[:, 0]

# Subset multiple columns by their indices
df2 = df.iloc[:, [0, -1]]

# Subset a range of columns using iloc
df3 = df.iloc[:, 0:3]

Subsetting columns by index



Subsetting columns (by boolean logic)


import pandas as pd

df = pd.read_csv('data/euro_winners.csv', sep=',')

# Boolean mask: rows where attendance exceeds 100,000
mask = df.loc[:, 'Attendance'] > 100000

df[mask][['Season', 'Venue', 'Attendance']]

Subsetting columns by Boolean logic



Subsetting columns (by reg expression)


import pandas as pd

df = pd.read_csv('data/euro_winners.csv', sep=',')

re1 = 'er'    # Match names containing 'er'
re2 = 'on$'   # Match names ending with 'on'
re3 = '^S'    # Match names starting with 'S'
re4 = '[n]+'  # Match names with at least one 'n' (one or more 'n')

df2 = df.filter(regex=re1)

Subsetting columns by regular expressions


Subsetting Rows

Subsetting rows
Method                    Description
df.loc['a':'c']           Select rows by label
df.iloc[10:20, :]         Select rows by index
df[df.Length > 7]         Select rows by Boolean logic
df.head(n)                Select first n rows
df.tail(n)                Select last n rows
df.sample(frac=0.5)       Randomly select a fraction of rows
df.sample(n=10)           Randomly select n rows
df.nlargest(n, 'value')   Select and order top n entries
df.nsmallest(n, 'value')  Select and order bottom n entries
df.drop_duplicates()      Select unique rows only (duplicates removed)
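The slides that follow demonstrate selection by label, index, Boolean logic and regular expressions; here is a brief sketch of the remaining methods from the table, assuming the euro_winners dataset with its 'Attendance' column:

import pandas as pd

df = pd.read_csv('data/euro_winners.csv', sep=',')

half = df.sample(frac=0.5)               # a random half of the rows
ten = df.sample(n=10)                    # 10 random rows
top5 = df.nlargest(5, 'Attendance')      # the 5 best-attended finals
bottom5 = df.nsmallest(5, 'Attendance')  # the 5 worst-attended finals
unique_rows = df.drop_duplicates()       # remove exact duplicate rows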

Subsetting rows (by label)


Subsetting rows by label

import pandas as pd

df = pd.read_csv('studentsdataset.csv',
                 index_col='StudentID')

# Select rows by index label
df1 = df.loc[1516045]
df2 = df.loc[[1516045, 1532537]]


Subsetting rows (by label)


import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(6, 4),
                  index=list('abcdef'),
                  columns=list('ABCD'))

# Select rows with labels 'c' through 'f' (inclusive)
df1 = df.loc['c':'f']

Subsetting rows by label



Subsetting rows (by index)


Subsetting rows by index

import pandas as pd

df = pd.read_csv('studentsdataset.csv')

# Select rows by integer position
df1 = df.iloc[2]
df2 = df.iloc[[2, 4, 6]]


Subsetting rows (by boolean logic)


Subsetting rows by Boolean logic

import pandas as pd

df = pd.read_csv('studentsdataset.csv')

# Select only rows where StudentCourse is DBIT
df1 = df[df.StudentCourse == 'DBIT']

# Select rows where StudentCourse is DBIT or DIT
df2 = df[df.StudentCourse.isin(['DBIT', 'DIT'])]


Subsetting rows (by regex matching)


Subsetting rows by regular expression matching

import pandas as pd

df = pd.read_csv('rainfall.csv')

# Regular expression - starts with '2017'
re2017 = '^2017'

# Extract rows whose month starts with 2017
df2017 = df[df['month'].str.contains(re2017)]

ST1510 PDAS
Topic 3 Part 2 – Data Analysis with Pandas
Reshaping Data
Change the layout of a data set


Change the layout of datasets


Method                                 Description
df.drop(['Length','Height'], axis=1)   Drop columns from a DataFrame
pd.concat([df1, df2])                  Append rows to DataFrames (axis=0)
pd.concat([df1, df2], axis=1)          Append columns to DataFrames (axis=1)
df.pivot()                             Spread rows into columns
pd.melt(df)                            Gather columns into rows
df.sort_values('mpg')                  Sort DataFrame by column values
df.sort_index()                        Sort DataFrame by index values
df.reindex()                           Conform DataFrame to a new index
df.reset_index()                       Reset the index of a DataFrame
df.rename(columns={'y':'year'})        Rename the columns of a DataFrame
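Each of these methods is demonstrated on the course datasets in the slides that follow; as a self-contained preview, here is a tiny pivot/melt round trip on made-up weights data (hypothetical values mirroring Weights.csv):

import pandas as pd

df = pd.DataFrame({'observation': [1, 1, 2, 2],
                   'gender': ['F', 'M', 'F', 'M'],
                   'weight': [55.0, 70.0, 58.0, 72.0]})

# pivot: one row per observation, one column per gender value
df_wide = df.pivot(index='observation', columns='gender', values='weight')

# melt: back to long form, one (variable, value) pair per row
df_long = pd.melt(df_wide.reset_index(), id_vars=['observation'])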

Drop columns from a DataFrame (drop)

import pandas as pd

df = pd.read_csv("data/euro_winners.csv")
df2 = df.drop(['Runners-up',
               'Runner-UpNation',
               'Venue', 'Attendance'], axis=1)
df2

Append rows to DataFrames (concat)


import pandas as pd

df1 = pd.read_csv("data/Weights1.csv",
                  index_col=['name'])
df2 = pd.read_csv("data/Weights2.csv",
                  index_col=['name'])

# Stack the rows of df2 beneath df1 (axis=0 is the default)
result = pd.concat([df1, df2])

Append columns to DataFrames (concat)


import pandas as pd

df1 = pd.read_csv("data/Weights1.csv",
                  index_col=['name'])
df2 = pd.read_csv("data/Weights2.csv",
                  index_col=['name'])

# Align on the 'name' index and place the columns of df2 beside df1
result = pd.concat([df1, df2], axis=1)

Append rows to DataFrames (concat)


import pandas as pd

df1 = pd.read_csv("data/Weights1.csv", index_col=['name'])
df3 = pd.read_csv("data/Heights2.csv", index_col=['name'])

# Columns that do not overlap between df1 and df3 are filled with NaN
result = pd.concat([df1, df3])
Concat – one more example

[Slide showing a further concat example; note that one extra row from df3 is inserted into the result]

Spread rows into columns (pivot)


import pandas as pd

df = pd.read_csv("data/Weights.csv")

# 'observation' becomes the index, the values of 'gender'
# become the columns, and 'weight' fills the cells
df2 = df.pivot(index='observation', columns='gender', values='weight')

[Slides showing the dataset before and after the pivot]

Gather columns into rows (melt)


import pandas as pd

df = pd.read_csv("data/Weights.csv")

# All columns other than 'observation' and 'gender'
# are gathered into variable/value pairs
df3 = pd.melt(df, id_vars=['observation', 'gender'])

[Slides showing the dataset before and after the melt]

Sort DataFrame by column values


DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

import pandas as pd

df = pd.read_csv("data/Weights.csv")
df2 = df.sort_values(by="weight")

[Slides showing the dataset before and after the sort]

Sort DataFrame by index values


import pandas as pd
import numpy as np

dates = pd.date_range('20130101', periods=6)

df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

# Sort rows by index in descending order
df2 = df.sort_index(ascending=False)

Sort DataFrame by index values (2)


import pandas as pd

df = pd.read_csv('data/stock_index_prices.csv')
df2 = df.set_index(['TradingDate', 'PriceType'])

# Sort by the 'PriceType' level of the MultiIndex
df3 = df2.sort_index(ascending=False, level="PriceType")

Reindex a DataFrame
import pandas as pd

index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror']
df = pd.DataFrame({
    'http_status': [200, 200, 404, 404, 301],
    'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]}, index=index)

# Labels not present in the original index get rows of NaN
new_index = ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10', 'Chrome']
df2 = df.reindex(new_index)

[Slides showing the DataFrame before and after reindexing]

Reset the index of a DataFrame


import pandas as pd

df = pd.read_csv('data/stock_index_prices.csv')
df = df.set_index(['TradingDate', 'PriceType'])

# Move the index levels back into ordinary columns
df = df.reset_index()

[Slides showing the DataFrame before and after resetting the index]

Rename the columns of a DataFrame


import pandas as pd

df = pd.read_csv("data/Weights.csv")
df = df.rename(columns={"observation": "reading"})

[Slides showing the DataFrame before and after the rename]
Handling Missing Data


When / why does data become missing?


• Many data sets simply arrive with missing data, either because it exists and was not collected, or because it never existed
• For example, in a collection of financial time series, some of the time series might start on different dates; values prior to the start date would generally be marked as missing


Example: how reindexing causes missing data

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(3, 3), index=['a', 'b', 'c'],
                   columns=['one', 'two', 'three'])

# Row 'd' does not exist in df1, so it is filled with NaN
df2 = df1.reindex(['a', 'b', 'c', 'd'])


Handling Missing Data


Method            Description
isnull()          Detect missing values (NaN in numeric arrays, None/NaN in object arrays)
notnull()         Opposite of isnull(); a replacement for numpy.isfinite / ~numpy.isnan that is suitable for use on object arrays
df.dropna()       Drop rows with any column having NA/null data
df.fillna()       Replace all NA/null data with a value
df.interpolate()  Uses interpolation to 'best guess' missing numeric values
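A minimal sketch of detection and the two basic strategies on a tiny, made-up frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                   'b': ['x', 'y', None]})

print(df.isnull())    # True wherever a value is missing
print(df.notnull())   # element-wise complement of isnull()
print(df.dropna())    # keeps only the fully populated first row
print(df.fillna({'a': 0, 'b': 'unknown'}))  # per-column replacement values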


Handling Missing Data in Pandas


• isnull() and notnull() functions in pandas may be used on both Series and DataFrame
objects to detect missing values

import pandas as pd

df = pd.read_csv('data/kidney_disease.csv')
print(df.loc[0:4, 'id':'pcc'])      # print the first few rows and columns
print(pd.isnull(df['rbc']).head())  # check if 'rbc' has null values
df['rbc'].isnull().sum()            # count the null values in 'rbc'


Values considered missing


• As data comes in many shapes and forms, pandas aims to be flexible with regard to handling missing data
• While NaN is the default missing value marker for reasons of computational speed and convenience, we need to be able to easily detect this value with data of different types: floating point, integer, boolean, and general object
• In many cases, however, the Python None will arise, and we wish to also consider that "missing" or "null"
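For example, None is treated as missing just like NaN (a tiny, hypothetical check):

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, None])  # the None is stored as NaN in this float Series
print(s.isnull())                   # False, True, True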


Handling Missing Data


• There are many ways to handle missing data. Some common methods include:
  • Drop them! Simple, and will not introduce errors through imputation
  • Univariate imputation
    • Fill with a fixed value, or with a value derived from 'neighbouring' records
    • Mean imputation (mode for categorical data)
    • Interpolation
  • Multivariate imputation (see the sketch after this list)
    • Regression
    • KNN, etc.
• Do note that imputation introduces errors! You are using estimates to fill in the missing values, then using those estimates again for the model. Limit imputation to about 5% of the data.
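For the multivariate case, a minimal KNN-imputation sketch, assuming scikit-learn is installed (scikit-learn is not otherwise covered in this topic, and the numbers are made up):

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])

# Each missing entry is replaced by averaging the k nearest rows;
# here the NaN becomes (1.0 + 7.0) / 2 = 4.0
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)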

dropna
import pandas as pd

df = pd.read_csv('data/kidney_disease.csv')
df2 = df.dropna()

Original dataset with NaN values. See the next slide for comparison after using dropna


[Slide showing the dataset after dropna: all rows containing NaN values have been removed]

fillna (scalar values)

• The fillna function can "fill in" NA values with non-null data in a few ways:
  • scalar values
  • fill gaps forwards
  • fill gaps backwards

fillna (scalar values)


import pandas as pd

df = pd.read_csv('data/kidney_disease.csv')

# Replace NA with a scalar value
df2 = df.fillna(0)

Original dataset with NaN values. See the next slide for a comparison after using fillna.

[Slide showing the dataset after fillna(0): NaN values replaced with 0]

fillna (fill gaps forward)


import pandas as pd

df = pd.read_csv('data/kidney_disease.csv')

# Fill gaps forwards (newer pandas versions prefer df.ffill())
df2 = df.fillna(method='ffill')

Original dataset with NaN values. See the next slide for a comparison after using fillna.

[Slide showing the dataset after forward fill: each NaN takes the previous non-null value in its column]

fillna (fill gaps backward)


import pandas as pd

df = pd.read_csv('data/kidney_disease.csv')

# Fill gaps backwards (newer pandas versions prefer df.bfill())
df2 = df.fillna(method='bfill')

Original dataset with NaN values. See the next slide for a comparison after using fillna.

[Slide showing the dataset after backward fill: each NaN takes the next non-null value in its column]

fillna (mean)
import pandas as pd

df = pd.read_csv('data/kidney_disease.csv')

# Coerce 'rc' to numeric first; stray strings would otherwise
# prevent computing a mean (this is why this step is necessary)
df.rc = pd.to_numeric(df.rc, errors='coerce')

# Replace NA with the mean of the values of the column
df.rc = df['rc'].fillna(df['rc'].mean())

Original dataset with NaN values. See the next slide for a comparison after using fillna.

[Slide showing the dataset after fillna with the mean: NaN values in 'rc' replaced by the column mean]

Handling Missing Data – fillna (mode)


• The mean cannot be computed for categorical variables. You can use the mode instead.

import pandas as pd

df = pd.read_csv('data/kidney_disease.csv')
count1 = df['rbc'].isnull().sum()
print('Number of NaN values in rbc:', count1)

# Replace NaN in all columns with each column's mode value
df2 = df.fillna(df.mode().iloc[0])
count2 = df2['rbc'].isnull().sum()  # now 0
print('Number of NaN values in rbc:', count2)

df2.head(10)
[Slide showing the dataset after fillna with the mode: NaN values in 'rbc' replaced by the most frequent value]

Interpolate
• Interpolation is a mathematical method that fits a function to your data and uses this function to estimate the missing values.
• The simplest type of interpolation is linear interpolation, which takes the mean of the value before and the value after the missing data.

df2 = df.interpolate()  # default linear method, for numeric columns

[Slide showing the dataset before and after interpolate()]
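A small worked example with made-up numbers: the NaN sits between 2.0 and 4.0, so linear interpolation fills it with their mean, 3.0.

import numpy as np
import pandas as pd

s = pd.Series([2.0, np.nan, 4.0])
print(s.interpolate())  # 2.0, 3.0, 4.0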
Combine Data Sets


Combine Data Sets


Code Example                              Description
pd.merge(adf, bdf, how='left', on='x1')   Get all rows from adf, and only those rows from bdf that have a match with adf on column x1
pd.merge(adf, bdf, how='right', on='x1')  Get all rows from bdf, and only those rows from adf that have a match with bdf on column x1
pd.merge(adf, bdf, how='inner', on='x1')  Join only the matching rows from both adf and bdf
pd.merge(adf, bdf, how='outer', on='x1')  Join all rows from both adf and bdf
adf[adf.x1.isin(bdf.x1)]                  All rows in adf that have a match in bdf
adf[~adf.x1.isin(bdf.x1)]                 All rows in adf that do not have a match in bdf
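Before walking through each join type on the course datasets, here is a self-contained sketch with two tiny, made-up frames:

import pandas as pd

adf = pd.DataFrame({'x1': ['A', 'B', 'C'], 'x2': [1, 2, 3]})
bdf = pd.DataFrame({'x1': ['B', 'C', 'D'], 'x3': [True, False, True]})

left = pd.merge(adf, bdf, how='left', on='x1')    # A, B, C (x3 is NaN for A)
inner = pd.merge(adf, bdf, how='inner', on='x1')  # B and C only
outer = pd.merge(adf, bdf, how='outer', on='x1')  # A, B, C and D

matched = adf[adf.x1.isin(bdf.x1)]     # rows B and C
unmatched = adf[~adf.x1.isin(bdf.x1)]  # row A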


merge (how='left')
import pandas as pd
df1 = pd.read_csv('data/mergedataset1s.csv')
df2 = pd.read_csv('data/mergedataset2s.csv')
df = pd.merge(df1, df2, how='left', on='customerid')

Original datasets (df1 from mergedataset1s.csv, df2 from mergedataset2s.csv) before merging. See the next slide for the merged result.

[Slide showing the result of merge with how='left']

merge (how='right')
import pandas as pd
df1 = pd.read_csv('data/mergedataset1s.csv')
df2 = pd.read_csv('data/mergedataset2s.csv')
df = pd.merge(df1, df2, how='right', on='customerid')

Original datasets (df1 from mergedataset1s.csv, df2 from mergedataset2s.csv) before merging. See the next slide for the merged result.

[Slide showing the result of merge with how='right']

merge (how='inner')
import pandas as pd
df1 = pd.read_csv('data/mergedataset1s.csv')
df2 = pd.read_csv('data/mergedataset2s.csv')
df = pd.merge(df1, df2, how='inner', on='customerid')

Original datasets (df1 from mergedataset1s.csv, df2 from mergedataset2s.csv) before merging. See the next slide for the merged result.

[Slide showing the result of merge with how='inner']

isin
import pandas as pd
df1 = pd.read_csv('data/mergedataset1s.csv')
df2 = pd.read_csv('data/mergedataset2s.csv')
df = df1[df1.customerid.isin(df2.customerid)]

Original datasets (df1 from mergedataset1s.csv, df2 from mergedataset2s.csv) before filtering. See the next slide for the result.

[Slide showing the rows of df1 whose customerid appears in df2]

~isin
import pandas as pd
df1 = pd.read_csv('data/mergedataset1s.csv')
df2 = pd.read_csv('data/mergedataset2s.csv')
df = df1[~df1.customerid.isin(df2.customerid)]

Original datasets (df1 from mergedataset1s.csv, df2 from mergedataset2s.csv) before filtering. See the next slide for the result.

[Slide showing the rows of df1 whose customerid does not appear in df2]
Group Data


Group data

Code Example                   Description
df1 = df.groupby(by='col')     Return a GroupBy object, grouped by values in column named 'col'
df1 = df.groupby(level='ind')  Return a GroupBy object, grouped by values in index level named 'ind'
df1.groups                     The dictionary of groups
len(df1.groups)                Number of groups in the dictionary
size()                         Size of each group
agg(function)                  Aggregate each group using function
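The slides below use size() and sum(); agg() generalises this to arbitrary functions. A brief sketch on the euro_winners data (column names as used elsewhere in this topic):

import pandas as pd

df = pd.read_csv("data/euro_winners.csv")

# Several aggregates of Attendance per Nation in a single call
stats = df.groupby('Nation')['Attendance'].agg(['count', 'mean', 'max'])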

Study the partial dataset below, which stores the results of the European club soccer championship since 1955.

[Slide showing the partial euro_winners dataset]

Suppose we want to find the countries that won the championship the most times. We can rank the nations by the number of European club championships they have won.

We can achieve this by using the groupby function to group the data by Nation.

We can then use size() to count how many times each Nation appears in the dataset (each appearance indicating a win for that nation). This returns a Series, with Nation as the index.

Grouping Data
import pandas as pd

df = pd.read_csv("data/euro_winners.csv")

# Returns a DataFrameGroupBy object
nationsGrp = df.groupby(['Nation'])

# size() returns a Series, indexed by 'Nation'
nationWins = nationsGrp.size()
nationWins.sort_values(ascending=False)

Let's say we want to find not the best country, but the best clubs! We can find the total wins not only by country, but by club as well.

We can achieve this by using a multicolumn groupby in Pandas.

Grouping Data
import pandas as pd

df = pd.read_csv("data/euro_winners.csv")

# For a further breakdown by country and club,
# apply a multicolumn groupby
winnersGrp = df.groupby(['Nation', 'Winners'])

# size() returns a Series indexed by both
# 'Nation' and 'Winners'
clubWins = winnersGrp.size()
clubWins.sort_values(ascending=False)

Knowing the best teams, we now wish to sum up the attendance for their matches. Are the matches where the best teams win also the most popular (highest attendance)?

To do this, we need to aggregate the data in the Attendance column, summing the attendance figures for each matching Nation and Winners pair.

We can then sort the result if we want.

Aggregate Grouped Data


import pandas as pd

df = pd.read_csv("data/euro_winners.csv")

# Aggregate by summing the Attendance column per group
att = df.groupby(['Nation', 'Winners'])[['Attendance']].sum()

Aggregate Grouped Data


import pandas as pd

df = pd.read_csv("data/euro_winners.csv")

# Aggregate by summing the Attendance column per group
att = df.groupby(['Nation', 'Winners'])[['Attendance']].sum()

# You can choose to sort by Attendance
att.sort_values(ascending=False, by='Attendance')

With the Series of wins and the DataFrame of attendance, we can now join these two datasets to produce the final table on the right.

Do you see a trend between the Attendance and the number of Wins?

Aggregate Grouped Data


import pandas as pd

df = pd.read_csv("euro_winners.csv")

winnersGrp = df.groupby(['Nation', 'Winners'])
clubWins = winnersGrp.size()

att = df.groupby(['Nation', 'Winners'])[['Attendance']].sum()

clubWins.name = "Wins"  # Name the Series so the joined column has a label

attendance = att.join(clubWins)
attendance.sort_values(ascending=False, by='Attendance')
Applying functions to Pandas Series and DataFrames

apply (for pandas Series)

import pandas as pd

series = pd.Series([20, 21, 12],
                   index=['London', 'New York', 'Helsinki'])

def toFahrenheit(x):
    # Convert a Celsius temperature to Fahrenheit
    return x * 9 / 5 + 32

series = series.apply(toFahrenheit)

[Slides showing the Series before and after applying the function]

apply (for pandas DataFrame)

Applies a function along an axis of the DataFrame

import pandas as pd
import numpy as np

data = np.random.randint(1, 10, (3, 2))
df = pd.DataFrame(data,
                  index=['Student 1', 'Student 2', 'Student 3'],
                  columns=['Reward 1', 'Reward 2'])

def multiply(x):
    return x * 2

df = df.apply(multiply)

[Slides showing the dataset before and after applying the function]

apply (for pandas DataFrame, with lambda)

Applies a function along an axis of the DataFrame

import pandas as pd

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'year': [1989, 1990, 1994, 1979, 1975]}
df = pd.DataFrame(data)

# Create a capitalization lambda function
capitalizer = lambda x: x.upper()

# Apply the capitalizer function over the column 'name'
df['name'].apply(capitalizer)

[Slides showing the dataset before and after applying the function]
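apply() can also work row by row with axis=1, which the slides above do not show; a small sketch with made-up reward scores:

import pandas as pd

df = pd.DataFrame({'Reward 1': [3, 5], 'Reward 2': [4, 1]})

# For each row, take the larger of the two rewards
best = df.apply(lambda row: row.max(), axis=1)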
The End
