DATA CLEANING IN
PYTHON
Working with Python made easier
Introduction
• Data cleaning is a vital step in data analysis, and Python, with libraries like pandas, offers powerful tools for this process. Below is a guide to common data cleaning tasks using Python:
1. Import Libraries and Load Data
Python code
import pandas as pd
import numpy as np

# Load data
df = pd.read_csv('data.csv')  # Replace with your dataset path
2. Handle Missing Data
Identify Missing Values:
Python code
print(df.isnull().sum())  # Count missing values per column
print(df[df.isnull().any(axis=1)])  # Display rows with missing values

Fill Missing Values:
Python code
df['column_name'].fillna('Default Value', inplace=True)  # Fill with a default value
df['column_name'].fillna(df['column_name'].mean(), inplace=True)  # Fill with mean

Drop Missing Values:
Python code
df.dropna(inplace=True)  # Drop rows with missing values
df.dropna(axis=1, inplace=True)  # Drop columns with missing values
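• As a quick illustration, the calls above behave like this on a small made-up DataFrame (the age and city columns are hypothetical):
Python code
import pandas as pd
import numpy as np

sample = pd.DataFrame({'age': [25, np.nan, 31], 'city': ['Oslo', None, 'Bergen']})
print(sample.isnull().sum())                                 # age: 1, city: 1
sample['age'] = sample['age'].fillna(sample['age'].mean())   # fill the numeric gap with the mean (28.0)
sample = sample.dropna()                                     # drop the row whose city is still missing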
3. Remove Duplicates
Python code
df.drop_duplicates(inplace=True)  # Remove duplicate rows
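• Note that drop_duplicates can also be limited to a subset of columns when only some fields define a duplicate; the column names below are placeholders:
Python code
df.drop_duplicates(subset=['id', 'email'], keep='first', inplace=True)  # rows sharing id and email count as duplicates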
4. Standardize Data Formats
Trim Whitespace:
Python code
df['column_name'] = df['column_name'].str.strip()

Change Case:
Python code
df['column_name'] = df['column_name'].str.lower()  # Convert to lowercase

Format Dates:
Python code
df['date_column'] = pd.to_datetime(df['date_column'], format='%Y%m%d')
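• For example, on a small made-up frame these steps trim, lowercase, and parse dates in one pass (the column names and the %Y%m%d format are assumptions):
Python code
sample = pd.DataFrame({'name': ['  Alice ', 'BOB'], 'date': ['20240131', '20240215']})
sample['name'] = sample['name'].str.strip().str.lower()           # '  Alice ' -> 'alice'
sample['date'] = pd.to_datetime(sample['date'], format='%Y%m%d')  # strings -> datetime64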
5. Correct Invalid Data
Replace Invalid Values:
Python code
df['column_name'] = df['column_name'].replace(['Invalid Value'], 'Valid Value')

Remove Outliers:
Python code
# Using Z-score
from scipy.stats import zscore
df = df[(np.abs(zscore(df['numeric_column'])) < 3)]
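• Two cautions: replace can map several invalid spellings to one value in a single call, and the Z-score filter works best after missing values have been handled, since NaNs disturb the scores. A minimal sketch with made-up values:
Python code
sample = pd.Series(['ok', 'N/A', 'unknown', 'ok'])
sample = sample.replace(['N/A', 'unknown'], 'Unknown')   # several invalid spellings -> one valid value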
6. Handle Inconsistent Data
Unify Categories:
Python code
df['category_column'] = df['category_column'].replace({
    'Variation1': 'Standardized Value',
    'Variation2': 'Standardized Value'
})

Split and Combine Columns:
Python code
# Split a column
df[['first_name', 'last_name']] = df['full_name'].str.split(' ', expand=True)

# Combine columns
df['full_name'] = df['first_name'] + ' ' + df['last_name']
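• A concrete sketch on a made-up full_name column (a plain split on ' ' assumes exactly one space per name):
Python code
sample = pd.DataFrame({'full_name': ['Ada Lovelace', 'Alan Turing']})
sample[['first_name', 'last_name']] = sample['full_name'].str.split(' ', expand=True)
sample['full_name'] = sample['first_name'] + ' ' + sample['last_name']   # recombine the parts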
7. Drop Unnecessary Columns or Rows
Drop Columns:
Python code
df.drop(['unnecessary_column'], axis=1, inplace=True)

Drop Rows:
Python code
df = df[df['column_name'] != 'Unwanted Value']
8. Validate and Clean Data Types
Convert Data Types:
Python code
df['numeric_column'] = pd.to_numeric(df['numeric_column'], errors='coerce')  # Coerce invalid values to NaN
df['string_column'] = df['string_column'].astype(str)

Check for Invalid Types:
Python code
print(df.dtypes)
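• For instance, coercion turns unparseable entries into NaN instead of raising an error (values are made up):
Python code
sample = pd.Series(['1', '2', 'oops'])
print(pd.to_numeric(sample, errors='coerce'))   # 1.0, 2.0, NaN -- 'oops' could not be parsed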
9. Handle Outliers
Using IQR:
Python code
Q1 = df['numeric_column'].quantile(0.25)
Q3 = df['numeric_column'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['numeric_column'] >= Q1 - 1.5 * IQR) & (df['numeric_column'] <= Q3 + 1.5 * IQR)]
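• A tiny worked example with made-up numbers shows the effect of the 1.5 × IQR fences:
Python code
s = pd.Series([10, 12, 11, 13, 120])                        # 120 lies far outside the rest
Q1, Q3 = s.quantile(0.25), s.quantile(0.75)                 # 11 and 13
IQR = Q3 - Q1                                               # 2
print(s[(s >= Q1 - 1.5 * IQR) & (s <= Q3 + 1.5 * IQR)])     # fences at 8 and 16, so 120 is dropped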
10. Save Cleaned Data
Python code
df.to_csv('cleaned_data.csv', index=False)  # Save cleaned data to a new file
Example Workflow
Python code
# Full example
df = pd.read_csv('data.csv')

# Identify and fill missing values
df['age'].fillna(df['age'].mean(), inplace=True)

# Remove duplicates
df.drop_duplicates(inplace=True)

# Standardize case
df['name'] = df['name'].str.lower()

# Handle outliers
Q1 = df['income'].quantile(0.25)
Q3 = df['income'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['income'] >= Q1 - 1.5 * IQR) & (df['income'] <= Q3 + 1.5 * IQR)]

# Save the cleaned data
df.to_csv('cleaned_data.csv', index=False)
Conclusion
• Together, these pandas and NumPy techniques produce clean, structured, and consistent data that is ready for analysis.