Cleaning Data in Python
Tidy data
● “Tidy Data” paper by Hadley Wickham, PhD
● Formalizes the way we describe the shape of data
● Gives us a goal when formatting our data
● “Standard way to organize data values within a dataset”
Melting
In [1]: pd.melt(frame=df, id_vars='name',
   ...:         value_vars=['treatment a', 'treatment b'])
Out[1]:
     name     variable value
0  Daniel  treatment a     _
1    John  treatment a    12
2    Jane  treatment a    24
3  Daniel  treatment b    42
4    John  treatment b    31
5    Jane  treatment b    27
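The slide does not show how `df` was built, so the following is a minimal, self-contained sketch of the same call. The wide layout of `df` is an assumption reconstructed from the melted output; only the values themselves come from the slide.

```python
import pandas as pd

# Assumed wide-format input: one row per person, one column per treatment.
# The values are copied from the melted output; the layout is a reconstruction.
df = pd.DataFrame({
    "name": ["Daniel", "John", "Jane"],
    "treatment a": ["_", 12, 24],
    "treatment b": [42, 31, 27],
})

# Melt the two treatment columns into rows; 'name' stays as the identifier.
melted = pd.melt(frame=df, id_vars="name",
                 value_vars=["treatment a", "treatment b"])
print(melted)
```

Each original row contributes one output row per melted column, so three people and two treatments give six rows.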
In [2]: pd.melt(frame=df, id_vars='name',
   ...:         value_vars=['treatment a', 'treatment b'],
   ...:         var_name='treatment', value_name='result')
Out[2]:
     name    treatment result
0  Daniel  treatment a      _
1    John  treatment a     12
2    Jane  treatment a     24
3  Daniel  treatment b     42
4    John  treatment b     31
5    Jane  treatment b     27
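A possible follow-up step, sketched under two assumptions: that `df` has the wide layout implied by the output, and that the `_` entry stands for a missing measurement. Neither is stated on the slide. If both hold, the melted `result` column can be converted to numbers with `pd.to_numeric`.

```python
import pandas as pd

# Assumed wide-format input reconstructed from the slide's output.
df = pd.DataFrame({
    "name": ["Daniel", "John", "Jane"],
    "treatment a": ["_", 12, 24],
    "treatment b": [42, 31, 27],
})

# Same melt as above, with descriptive names for the two new columns.
tidy = pd.melt(frame=df, id_vars="name",
               value_vars=["treatment a", "treatment b"],
               var_name="treatment", value_name="result")

# Assumption: "_" marks a missing value, so coerce non-numeric entries to NaN.
tidy["result"] = pd.to_numeric(tidy["result"], errors="coerce")
print(tidy)
```

With `errors="coerce"`, anything that cannot be parsed as a number becomes `NaN` instead of raising an error, which is only appropriate when such entries really are placeholders.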
Pivoting data
Pivot
In [1]: weather_tidy = weather.pivot(index='date',
   ...:                              columns='element',
   ...:                              values='value')
In [2]: print(weather_tidy)
element     tmax  tmin
date
2010-01-30  27.8  14.5
2010-02-02  27.3  14.4
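The long-format `weather` frame is not shown on the slide. The sketch below assumes it has one row per date and element, with the values taken from the printed result, and runs the same pivot end to end.

```python
import pandas as pd

# Assumed long-format input: one row per (date, element) pair.
# The values are copied from the printed output; the layout is a reconstruction.
weather = pd.DataFrame({
    "date": ["2010-01-30", "2010-01-30", "2010-02-02", "2010-02-02"],
    "element": ["tmax", "tmin", "tmax", "tmin"],
    "value": [27.8, 14.5, 27.3, 14.4],
})

# Each distinct 'element' becomes a column; each distinct 'date' becomes a row.
weather_tidy = weather.pivot(index="date", columns="element", values="value")
print(weather_tidy)
```

After the pivot, `date` is the index rather than a regular column; calling `weather_tidy.reset_index()` turns it back into a column if that is more convenient.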
Pivot table
● Has a parameter, aggfunc, that specifies how to deal with duplicate values
● Example: Can aggregate the duplicate values by taking their average
In [5]: weather2_tidy = weather.pivot_table(values='value',
   ...:                                     index='date',
   ...:                                     columns='element',
   ...:                                     aggfunc=np.mean)
Out[5]:
element     tmax  tmin
date
2010-01-30  27.8  14.5
2010-02-02  27.3  15.4
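To illustrate why `pivot_table` is needed, here is a sketch with a duplicated (date, element) pair. The extra `tmin` reading of 16.4 is hypothetical, chosen only so that the average matches the 15.4 shown above; the slide does not show the underlying rows. The string `'mean'` is used for `aggfunc` in place of `np.mean`, which avoids the NumPy import.

```python
import pandas as pd

# Assumed input with a duplicate: two tmin readings for 2010-02-02.
# The 16.4 value is hypothetical, picked so the mean equals 15.4.
weather = pd.DataFrame({
    "date": ["2010-01-30", "2010-01-30",
             "2010-02-02", "2010-02-02", "2010-02-02"],
    "element": ["tmax", "tmin", "tmax", "tmin", "tmin"],
    "value": [27.8, 14.5, 27.3, 14.4, 16.4],
})

# pivot() cannot reshape when an index/column pair occurs more than once.
try:
    weather.pivot(index="date", columns="element", values="value")
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print("pivot failed:", err)

# pivot_table() resolves the duplicates by aggregating them, here with the mean.
weather2_tidy = weather.pivot_table(values="value", index="date",
                                    columns="element", aggfunc="mean")
print(weather2_tidy)
```

Averaging is only one choice; whether it is the right way to resolve duplicates depends on what the repeated rows mean in the data.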
In [3]: tb_melt
Out[3]:
  country  year variable  value sex
0      AD  2000     m014      0   m
1      AE  2000     m014      2   m
2      AF  2000     m014     52   m
3      AD  2000    m1524      0   m
4      AE  2000    m1524      4   m
5      AF  2000    m1524    228   m
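The slide shows the result but not the step that produced the `sex` column. One plausible way, assuming `sex` is simply the first character of `variable`, is string slicing with the `.str` accessor. The `age_group` column below is a hypothetical addition that is not on the slide.

```python
import pandas as pd

# Melted frame rebuilt from the output above, without the 'sex' column.
tb_melt = pd.DataFrame({
    "country": ["AD", "AE", "AF", "AD", "AE", "AF"],
    "year": [2000] * 6,
    "variable": ["m014"] * 3 + ["m1524"] * 3,
    "value": [0, 2, 52, 0, 4, 228],
})

# Assumption: the first character of 'variable' encodes the sex.
tb_melt["sex"] = tb_melt["variable"].str[0]

# Hypothetical extra column: the remaining characters, read as an age-group code.
tb_melt["age_group"] = tb_melt["variable"].str[1:]
print(tb_melt)
```

Splitting one column that encodes two variables into two columns follows the tidy data goal of keeping each variable in its own column.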