DP

The document outlines a Python data preprocessing script that includes steps for importing libraries, loading a dataset, handling missing values, removing duplicates, encoding categorical data, handling outliers, and scaling features. It also covers visualizing the cleaned data and saving it to a new CSV file. The script ensures data integrity and prepares the dataset for further analysis or modeling.

Uploaded by

sidy22jan


Step-by-Step Explanation of Python Data Preprocessing Script

1. Importing Libraries

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.preprocessing import MinMaxScaler

• pandas is used for handling data structures like DataFrames.

• matplotlib.pyplot and seaborn are used for data visualization.

• MinMaxScaler from sklearn is used for feature scaling.

2. Load Dataset

df = pd.read_csv("data.csv")

• pd.read_csv("data.csv") loads the CSV file data.csv into a pandas DataFrame (df). All subsequent data manipulation takes place on this DataFrame.

3. Display Missing Values Before Handling

print("Missing values before handling:")

print(df.isnull().sum())

• df.isnull().sum() checks for missing values (i.e., NaN) in the DataFrame and counts how many there are in each column.

• The output is printed to show the number of missing values for each column.
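As a minimal sketch of what this count looks like, the snippet below builds a small made-up DataFrame (the values are hypothetical; only the column names mirror the script) and counts the missing cells per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names come from the script.
toy = pd.DataFrame({
    "Age": [25, np.nan, 40],
    "Salary": [50000, 62000, np.nan],
    "Department": ["HR", None, "IT"],
})

# isnull() marks each missing cell as True; sum() counts the True values per column.
missing_counts = toy.isnull().sum()
print(missing_counts)
```

Each of the three columns has exactly one missing cell here, so every count is 1.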

4. Handle Missing Values

if "Age" in df.columns and df["Age"].dtype != object:
    df["Age"] = df["Age"].fillna(df["Age"].mean())  # Fill Age with Mean (No inplace)

if "Salary" in df.columns and df["Salary"].dtype != object:
    df["Salary"] = df["Salary"].fillna(df["Salary"].median())  # Fill Salary with Median (No inplace)

• Missing values are handled for the "Age" and "Salary" columns.

o For Age: if the column exists and is numeric (not an object type such as strings), missing values (NaN) are replaced with the column mean, which df["Age"].mean() calculates.

o For Salary: similarly, missing values are filled with the column median (df["Salary"].median()), which is less sensitive to extreme values than the mean.

• Important: the result of fillna() is assigned back to the column instead of calling fillna(inplace=True) on df["Age"]. An in-place call on a selected column is chained assignment, which recent pandas versions warn about and which may leave df unchanged.
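The effect of the two fill strategies can be checked on a small example. This is a sketch with made-up numbers, not data from the original script; the large salary is included to show why the median is the more robust choice there.

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen so the mean and median are easy to verify by hand.
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 30.0],                # mean of 20, 40, 30 is 30
    "Salary": [30000.0, 40000.0, np.nan, 200000.0],   # median of the three known values is 40000
})

# Assign the result back to the column rather than using inplace=True.
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())
toy["Salary"] = toy["Salary"].fillna(toy["Salary"].median())

print(toy)
```

The missing Age becomes 30.0 and the missing Salary becomes 40000.0. Filling Salary with the mean instead would have inserted 90000.0, pulled upward by the single large value.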

5. Remove Duplicates

df.drop_duplicates(inplace=True)

• df.drop_duplicates() removes duplicate rows from the DataFrame, keeping the first occurrence of each.

• inplace=True modifies the original DataFrame (df) directly instead of returning a new one.
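A short sketch of the same operation on hypothetical data follows. It uses the assignment form, which produces the same rows as inplace=True but leaves the original frame available for comparison.

```python
import pandas as pd

# Hypothetical rows; the third row repeats the first exactly.
toy = pd.DataFrame({
    "Name": ["Asha", "Ben", "Asha"],
    "Age": [30, 41, 30],
})

# Returns a new DataFrame; the first occurrence of each duplicated row is kept.
deduped = toy.drop_duplicates()

print(len(toy), len(deduped))
```

Only rows that match in every column count as duplicates, so two people who share an age but not a name would both be kept.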

6. Encode Categorical Data (One-Hot Encoding for 'Department')

if "Department" in df.columns and df["Department"].dtype == object:
    df = pd.get_dummies(df, columns=["Department"], drop_first=True)

• One-hot encoding is applied to the "Department" column (if it exists and holds strings).

o pd.get_dummies() converts categorical variables into dummy/indicator variables (one binary column per category).

o drop_first=True drops the column for the first category to avoid multicollinearity (that column can be inferred from the others).

o For example, if "Department" had the values "HR", "Finance", and "IT", this step would produce two columns, "Department_HR" and "Department_IT", with 1 (or True) indicating presence and 0 (or False) indicating absence. "Finance", the first category in sorted order, is dropped; a row with both indicators off means Finance.
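To see which columns drop_first=True actually leaves behind, the sketch below encodes a hypothetical "Department" column with the three values from the example.

```python
import pandas as pd

# Hypothetical data using the three department names from the example.
toy = pd.DataFrame({"Department": ["HR", "Finance", "IT", "HR"]})

# Categories are ordered alphabetically, so "Finance" is the first one and is dropped.
encoded = pd.get_dummies(toy, columns=["Department"], drop_first=True)

print(encoded.columns.tolist())
```

The Finance row (index 1) has both remaining indicators switched off, which is how the dropped category is still recoverable.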

7. Handle Outliers Using IQR Method

numeric_cols = df.select_dtypes(include=['number']).columns  # Select only numeric columns

for col in numeric_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]

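• Outliers are handled using the Interquartile Range (IQR) method.

o IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).

o The bounds are Q1 - 1.5 * IQR (lower) and Q3 + 1.5 * IQR (upper); values outside them are treated as outliers.

o Outliers are removed by keeping only the rows that lie within these bounds. Because the loop reassigns df for each column, a row is dropped if it is an outlier in any numeric column, and later columns compute their quartiles on the already-filtered data.

o Comparisons with NaN evaluate to False, so this filter also drops any row that still has a missing value in a numeric column.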
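The bounds can be worked through by hand on a single column. The values below are hypothetical and chosen so that one point (100) is clearly an outlier; the quartiles use the default linear interpolation of Series.quantile().

```python
import pandas as pd

# Hypothetical column with one obvious outlier.
s = pd.Series([10, 12, 13, 14, 15, 16, 18, 100])

q1 = s.quantile(0.25)   # 12.75
q3 = s.quantile(0.75)   # 16.5
iqr = q3 - q1           # 3.75

lower_bound = q1 - 1.5 * iqr   # 7.125
upper_bound = q3 + 1.5 * iqr   # 22.125

# Keep only the values inside the bounds, as the script does per column.
kept = s[(s >= lower_bound) & (s <= upper_bound)]
print(kept.tolist())
```

Only the value 100 falls outside the range 7.125 to 22.125, so the other seven values are kept.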

8. Feature Scaling Using Min-Max Normalization

scaler = MinMaxScaler()

df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

• Feature scaling is performed using Min-Max normalization on the numeric columns.

o MinMaxScaler() rescales each numeric column to the range 0 to 1.

o This helps models that need features on the same scale, such as k-nearest neighbors or gradient descent-based algorithms.
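Min-Max normalization maps each value x to (x - min) / (max - min). The sketch below applies that formula with plain pandas arithmetic on hypothetical values, which is the per-column calculation MinMaxScaler performs with its default range of 0 to 1; it is an illustration of the formula, not a replacement for the scaler.

```python
import pandas as pd

# Hypothetical ages; min is 20 and max is 60.
ages = pd.Series([20.0, 30.0, 40.0, 60.0])

# (x - min) / (max - min): the smallest value maps to 0 and the largest to 1.
scaled = (ages - ages.min()) / (ages.max() - ages.min())

print(scaled.tolist())
```

The result is 0.0, 0.25, 0.5, and 1.0: the ordering and relative spacing of the values are preserved, only the scale changes.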

9. Save the Cleaned Dataset

df.to_csv("cleaned_data.csv", index=False)

• The cleaned dataset is saved to a new CSV file named "cleaned_data.csv".

• index=False ensures that the index column (row numbers) is not written to the file.
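The difference index=False makes is easiest to see by comparing the text that to_csv() produces. The sketch below uses a hypothetical two-row frame and omits the file path, in which case to_csv() returns the CSV text as a string instead of writing a file.

```python
import pandas as pd

# Hypothetical frame whose index (3, 7) has gaps, as it would after rows were filtered out.
toy = pd.DataFrame({"Age": [0.0, 1.0]}, index=[3, 7])

with_index = toy.to_csv()                # index written as an unnamed first column
without_index = toy.to_csv(index=False)  # only the data columns

print(with_index)
print(without_index)
```

With the index included, the header starts with an empty field and each row starts with its old row label; with index=False only the "Age" column is written.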

10. Display Missing Values After Handling

print("\nMissing values after handling:")

print(df.isnull().sum())

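• After handling the missing values, the count of missing values in each column is displayed again.

• The numeric columns should now show 0: "Age" and "Salary" were filled, and the outlier filter drops rows with NaN in any other numeric column. Non-numeric columns are not treated by the script, so they can still report missing values.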

11. Data Visualization (Histograms, Boxplots, and Heatmap)

Histogram for Age Distribution

if "Age" in df.columns and df["Age"].dtype != object:
    plt.figure(figsize=(6,4))
    sns.histplot(df["Age"], bins=10, kde=True, color='blue')
    plt.title("Age Distribution")
    plt.xlabel("Age")
    plt.ylabel("Count")
    plt.show()

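• A histogram is plotted for the "Age" column to visualize the distribution.

o sns.histplot() with kde=True draws the histogram with a kernel density estimate (KDE) curve overlaid.

o bins=10 specifies the number of bins to group the data into.

o Because the plot is drawn after scaling, the x-axis shows Age on the 0 to 1 scale rather than in years.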

Boxplot for Salary

if "Salary" in df.columns and df["Salary"].dtype != object:
    plt.figure(figsize=(6,4))
    sns.boxplot(x=df["Salary"], color='red')
    plt.title("Boxplot of Salary")
    plt.show()

• A boxplot is created for the "Salary" column to visualize its spread.

o sns.boxplot() shows the median, the quartiles, and any points beyond the whiskers. The data has already been IQR-filtered and scaled at this point, so the values lie between 0 and 1.

Correlation Heatmap

if len(numeric_cols) > 1:
    plt.figure(figsize=(8,6))
    sns.heatmap(df[numeric_cols].corr(), annot=True, cmap="coolwarm", fmt=".2f",
                linewidths=0.5)
    plt.title("Feature Correlation Heatmap")
    plt.show()

• A heatmap is plotted to show the correlation between numeric columns.

o df[numeric_cols].corr() calculates the correlation matrix.

o annot=True adds numerical labels for each cell in the heatmap.

o cmap="coolwarm" specifies the color map to use.

o fmt=".2f" sets the format for numbers to two decimal places.

o linewidths=0.5 draws thin separating lines between the cells.
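The heatmap is only a rendering of the matrix returned by corr(). As a sketch of what that matrix contains, the snippet below uses hypothetical columns: "Salary" rises in step with "Age", while "Score" (a made-up column, not in the script) falls in step with it.

```python
import pandas as pd

# Hypothetical data with perfectly linear relationships for easy checking.
toy = pd.DataFrame({
    "Age": [20, 30, 40, 50],
    "Salary": [2000, 3000, 4000, 5000],   # increases with Age
    "Score": [4, 3, 2, 1],                # decreases with Age
})

# Pearson correlation between every pair of numeric columns.
corr = toy.corr()
print(corr.round(2))
```

The Age/Salary entry is 1.0 and the Age/Score entry is -1.0; on a "coolwarm" color map those appear at opposite ends of the scale, with the diagonal always equal to 1.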

Final Output:

The script processes and cleans data by:

1. Handling missing values.

2. Removing duplicates.

3. One-hot encoding categorical features.

4. Handling outliers.

5. Scaling numeric features.

6. Saving the cleaned data to a new CSV file.

7. Visualizing the cleaned data.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler

# Load dataset
df = pd.read_csv("data.csv")

# Display missing values before handling
print("Missing values before handling:")
print(df.isnull().sum())

# 1️⃣ Handle Missing Values Safely (Fixing the warning by avoiding inplace=True)
if "Age" in df.columns and df["Age"].dtype != object:
    df["Age"] = df["Age"].fillna(df["Age"].mean())  # Fill Age with Mean (No inplace)

if "Salary" in df.columns and df["Salary"].dtype != object:
    df["Salary"] = df["Salary"].fillna(df["Salary"].median())  # Fill Salary with Median (No inplace)

# 2️⃣ Remove Duplicates
df.drop_duplicates(inplace=True)

# 3️⃣ Encode Categorical Data (One-Hot Encoding for 'Department')
if "Department" in df.columns and df["Department"].dtype == object:
    df = pd.get_dummies(df, columns=["Department"], drop_first=True)

# 4️⃣ Handle Outliers using IQR Method (Only for Numeric Columns)
numeric_cols = df.select_dtypes(include=['number']).columns  # Select only numeric columns

for col in numeric_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]

# 5️⃣ Feature Scaling (Normalize Numeric Features using Min-Max Scaling)
scaler = MinMaxScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

# Save the cleaned dataset
df.to_csv("cleaned_data.csv", index=False)

# Display missing values after handling
print("\nMissing values after handling:")
print(df.isnull().sum())

print("\nData preprocessing completed successfully! ✅")

# ---------------------------------------------
# 📊 DATA VISUALIZATION (With Safe Checks)
# ---------------------------------------------

# 🎯 1. Histogram for Age Distribution (Only if "Age" Exists and is Numeric)
if "Age" in df.columns and df["Age"].dtype != object:
    plt.figure(figsize=(6,4))
    sns.histplot(df["Age"], bins=10, kde=True, color='blue')
    plt.title("Age Distribution")
    plt.xlabel("Age")
    plt.ylabel("Count")
    plt.show()

# 🎯 2. Boxplot for Salary (Only if "Salary" Exists and is Numeric)
if "Salary" in df.columns and df["Salary"].dtype != object:
    plt.figure(figsize=(6,4))
    sns.boxplot(x=df["Salary"], color='red')
    plt.title("Boxplot of Salary")
    plt.show()

# 🎯 3. Correlation Heatmap (Only for Numeric Columns)
if len(numeric_cols) > 1:
    plt.figure(figsize=(8,6))
    sns.heatmap(df[numeric_cols].corr(), annot=True, cmap="coolwarm", fmt=".2f",
                linewidths=0.5)
    plt.title("Feature Correlation Heatmap")
    plt.show()
