Cleaning Data in Python

Download as pdf or txt
Download as pdf or txt
You are on page 1of 24

CLEANING DATA IN PYTHON

Tidy data
Cleaning Data in Python

Tidy data
● “Tidy Data” paper by Hadley Wickham, PhD
● Formalize the way we describe the shape of data
● Gives us a goal when forma!ing our data
● “Standard way to organize data values within a dataset”
Cleaning Data in Python

Motivation for tidy data


Cleaning Data in Python

Principles of tidy data


● Columns represent separate variables
● Rows represent individual observations
● Observational units form tables
Cleaning Data in Python

Converting to tidy data

● Be!er for reporting vs. be!er for analysis


● Tidy data makes it easier to fix common data
problems
Cleaning Data in Python

Converting to tidy data


● The data problem we are trying to fix:
● Columns containing values, instead of variables
● Solution: pd.melt()
Cleaning Data in Python

Melting
In [1]: pd.melt(frame=df, id_vars='name',
...: value_vars=['treatment a', 'treatment b'])
Out[1]:
name variable value
0 Daniel treatment a _
1 John treatment a 12
2 Jane treatment a 24
3 Daniel treatment b 42
4 John treatment b 31
5 Jane treatment b 27
Cleaning Data in Python

Melting
In [2]: pd.melt(frame=df, id_vars='name',
...: value_vars=['treatment a', 'treatment b’],
...: var_name='treatment', value_name='result')
Out[2]:
name treatment result
0 Daniel treatment a _
1 John treatment a 12
2 Jane treatment a 24
3 Daniel treatment b 42
4 John treatment b 31
5 Jane treatment b 27
CLEANING DATA IN PYTHON

Let’s practice!
CLEANING DATA IN PYTHON

Pivoting data
Cleaning Data in Python

Pivot: un-melting data


● Opposite of melting
● In melting, we turned columns into rows
● Pivoting: turn unique values into separate columns
● Analysis friendly shape to reporting friendly shape
● Violates tidy data principle: rows contain observations
● Multiple variables stored in the same column
Cleaning Data in Python

Pivot: un-melting data


Cleaning Data in Python

Pivot
In [1]: weather_tidy = weather.pivot(index='date',
...: columns='element',
...: values='value')

In [2]: print(weather_tidy)
element tmax tmin
date
2010-01-30 27.8 14.5
2010-02-02 27.3 14.4
Cleaning Data in Python

Pivot
Cleaning Data in Python

Using pivot when you have duplicate entries


In [3]: import numpy as np

In [4]: weather2_tidy = weather.pivot(values='value',


...: index='date',
...: columns='element')
Out[4]:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-9-2962bb23f5a3> in <module>()
1 weather2_tidy = weather2.pivot(values='value',
2 index='date',
----> 3 columns='element')
ValueError: Index contains duplicate entries, cannot reshape
Cleaning Data in Python

Pivot table
● Has a parameter that specifies how to deal with duplicate
values
● Example: Can aggregate the duplicate values by taking their
average
Cleaning Data in Python

Pivot table
In [5]: weather2_tidy = weather.pivot_table(values='value',
...: index='date',
...: columns='element',
...: aggfunc=np.mean)
Out[5]:
element tmax tmin
date
2010-01-30 27.8 14.5
2010-02-02 27.3 15.4
CLEANING DATA IN PYTHON

Let’s practice!
CLEANING DATA IN PYTHON

Beyond melt and


pivot
Cleaning Data in Python

Beyond melt and pivot


● Melting and pivoting are basic tools
● Another common problem:
● Columns contain multiple bits of information
Cleaning Data in Python

Beyond melt and pivot


Cleaning Data in Python

Melting and parsing


In [1]: pd.melt(frame=tb, id_vars=['country', 'year'])
Out[1]:
country year variable value
0 AD 2000 m014 0
1 AE 2000 m014 2
2 AF 2000 m014 52
3 AD 2000 m1524 0
4 AE 2000 m1524 4
5 AF 2000 m1524 228

● Nothing inherently wrong about original data shape


● Not conducive for analysis
Cleaning Data in Python

Melting and parsing


In [2]: tb_melt['sex'] = tb_melt.variable.str[0]

In [3]: tb_melt
Out[3]:
country year variable value sex
0 AD 2000 m014 0 m
1 AE 2000 m014 2 m
2 AF 2000 m014 52 m
3 AD 2000 m1524 0 m
4 AE 2000 m1524 4 m
5 AF 2000 m1524 228 m
CLEANING DATA IN PYTHON

Let’s practice!

You might also like