Live training:
Cleaning Data in Python
About me
ADEL NEHME
Content Developer
The data science workflow
Access Data
Explore and Process Data
Extract Insights
Report or Dashboard / Model in Production
Why do we have dirty data?
Human error
Technical error
The dataset
Airbnb data featuring listings in New York
listing_id: Unique identifier for a listing
Name: Description used for a listing
Host_id: Unique identifier for each host
Host_name: Name of host
Neighbourhood_full: Borough and neighbourhood
Coordinates: Latitude, Longitude
Room_type: Type of room
Price: Price per night
Number_of_reviews: Number of reviews so far
Last_review: Date of last review
Reviews_per_month: Number of reviews per month
Availability_365: Days available per year
Rating: Average rating (0 to 5)
Number_of_stays: Number of stays so far
5_stars: Percentage of ratings that are 5 stars
Listing_added: Date listing added to site
The end result
Technologies
pandas: Popular open source data analysis tool for tabular data
matplotlib: Open source plotting library for 2-D visualizations
seaborn: Open source plotting library built on top of matplotlib
numpy: Popular open source computing tool for arrays
missingno: Open source plotting library for missing data
datetime: Package for easy date data manipulation
❗❗ Requires a Gmail account to edit ❗❗
Session outline
1 Introduction
2 Importing our dataset
3 Diagnosing our data problems
4 Q&A
5 Our to do list
6 Data cleaning
7 Q&A
8 Recap & closing notes
9 Take home question
Notebook
The data science workflow - revisited
Access Data
Explore and Process Data
Extract Insights
Report or Dashboard / Model in Production
Coming soon
Check out our upcoming webinars!
Live training: Data Viz with ggplot2 📈
DCVirtual: Webinar week 🎉
DataCamp for Enterprise: What’s New in Q2 2020 🌅
Take home question
Pick one of the following:
1) What is the average price of listings by borough? Visualize your results with a bar plot!
2) What is the average availability in days of listings by borough? Visualize your results with a bar plot!
3) What is the median price per room type in each borough? Visualize your results with a bar plot!
4) Visualize the number of listings over time.
Functions that should/could be used:
● .groupby() and .agg()
● sns.barplot(x = , y = , hue = , data = )
● sns.lineplot(x = , y = , data = )
● .dt.strftime() for extracting date components (for example the year or month) from a datetime column
Bonus points if you finish more than one question
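As a possible starting point for questions 1 and 4, here is a minimal sketch that applies `.groupby()`, `.agg()` and `.dt.strftime()` to a tiny made-up DataFrame. The column names `borough`, `price` and `listing_added`, and every value in the frame, are assumptions for illustration and not the session's actual data.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned Airbnb data (all values are invented)
airbnb = pd.DataFrame({
    "borough": ["Brooklyn", "Manhattan", "Brooklyn", "Manhattan"],
    "price": [100.0, 200.0, 50.0, 300.0],
    "listing_added": pd.to_datetime(
        ["2018-01-15", "2018-03-02", "2019-07-20", "2019-11-05"]
    ),
})

# Question 1: average price per borough
avg_price = (
    airbnb.groupby("borough")
    .agg(avg_price=("price", "mean"))
    .reset_index()
)

# Question 4: number of listings added per year
# .dt.strftime("%Y") extracts the year as a string from the datetime column
airbnb["year_added"] = airbnb["listing_added"].dt.strftime("%Y")
listings_per_year = (
    airbnb.groupby("year_added")
    .agg(n_listings=("price", "count"))
    .reset_index()
)

print(avg_price)
print(listings_per_year)
```

The two summary frames could then be passed to `sns.barplot(x="borough", y="avg_price", data=avg_price)` and `sns.lineplot(x="year_added", y="n_listings", data=listings_per_year)` respectively.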
Submission details:
● Share with us a code snippet with your output on LinkedIn, Twitter or Facebook
● Tag us on `@DataCamp` with the hashtag `#datacamplive`
Recap of the functions used
Diagnosis functions
import pandas as pd: Imports the pandas package with the alias pd
.head(): Prints the first rows of a DataFrame
.dtypes: Gets the data type of each column in a DataFrame
.info(): Returns the number of observations, data types and non-missing values per column
.describe(): Returns summary statistics for the numeric columns in a DataFrame
.isna().sum(): Returns the number of missing values per column
sns.distplot(): Plots the distribution of one variable (deprecated in newer seaborn releases in favor of sns.histplot() and sns.displot())
msno.matrix(): Visualizes the missingness matrix
msno.bar(): Visualizes a missingness bar plot
.duplicated(subset = , keep = ): Lets you find duplicates in a DataFrame based on all or a subset of columns

Treatment functions
.str.replace("", ""): Replaces one string with another for each row of a str column
.str.split("", expand = True): Splits a string column into separate columns based on a delimiter
.astype(): Converts a column to a data type of your choice
pd.to_datetime(): Converts a date column to datetime
.str.lower(): Lowercases each row in a str column
.str.strip(""): Removes the given characters from both ends of each row of a str column
.replace(): Replaces values with others in a column
.fillna(): Fills missing values of a column with a value of your choice
.drop_duplicates(): Drops duplicates
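
To show how several of these functions fit together, here is a minimal sketch on a small made-up DataFrame. The rows, the lowercase column names and the choice to fill missing prices with the median are assumptions for illustration; they are not taken from the session's notebook.

```python
import pandas as pd

# Invented rows that mimic common problems: a duplicate listing, prices stored
# as strings, inconsistent categories, missing values
df = pd.DataFrame({
    "listing_id": [1, 2, 2, 3],
    "coordinates": ["(40.63, -73.97)", "(40.78, -73.95)",
                    "(40.78, -73.95)", "(40.70, -73.99)"],
    "price": ["45$", "135$", "135$", None],
    "room_type": ["Private Room", "private room ", "private room ", "PRIVATE"],
    "last_review": ["2018-12-12", "2019-06-30", "2019-06-30", None],
})

# --- Diagnosis ---
print(df.dtypes)        # data type of each column
print(df.isna().sum())  # missing values per column
n_dupes = df.duplicated(subset="listing_id", keep="first").sum()

# --- Treatment ---
# Keep the first row for each listing_id
df = df.drop_duplicates(subset="listing_id", keep="first").copy()

# Strip the parentheses, then split the string into two columns
coords = df["coordinates"].str.strip("()").str.split(", ", expand=True)
df["latitude"] = coords[0].astype("float")
df["longitude"] = coords[1].astype("float")

# Remove the trailing "$", convert to float, fill the gap with the median
df["price"] = df["price"].str.strip("$").astype("float")
df["price"] = df["price"].fillna(df["price"].median())

# Lowercase, trim whitespace, then collapse variants into one category
df["room_type"] = df["room_type"].str.lower().str.strip()
df["room_type"] = df["room_type"].replace({"private": "private room"})

# Convert date strings to datetime (missing dates become NaT)
df["last_review"] = pd.to_datetime(df["last_review"])

print(df)
```

Calling `.str.strip()` with no argument removes surrounding whitespace, while passing characters such as `"$"` or `"()"` removes only those characters from both ends of each string.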