
DP

The document outlines a Python data preprocessing script that includes steps for importing libraries, loading a dataset, handling missing values, removing duplicates, encoding categorical data, handling outliers, and scaling features. It also covers visualizing the cleaned data and saving it to a new CSV file. The script ensures data integrity and prepares the dataset for further analysis or modeling.

Uploaded by

sidy22jan

Step-by-Step Explanation of Python Data Preprocessing Script

1. Importing Libraries

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.preprocessing import MinMaxScaler

• pandas is used for handling data structures like DataFrames.
• matplotlib.pyplot and seaborn are used for data visualization.
• MinMaxScaler from sklearn is used for feature scaling.

2. Load Dataset

df = pd.read_csv("data.csv")

• pd.read_csv("data.csv") loads the CSV file data.csv into a pandas DataFrame (df). This DataFrame is where all the data manipulation will take place.
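To try the loading step without a file on disk, the sketch below reads CSV text from an in-memory buffer instead. The column names and values are hypothetical stand-ins for whatever data.csv actually contains.

```python
import io

import pandas as pd

# Hypothetical CSV content; the real data.csv is not shown in this document.
csv_text = (
    "Name,Age,Salary,Department\n"
    "Ann,30,50000,HR\n"
    "Bob,,62000,IT\n"
)

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (rows, columns)
print(df.dtypes)  # the empty Age cell is read as NaN, so Age becomes float64
```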

3. Display Missing Values Before Handling

print("Missing values before handling:")

print(df.isnull().sum())

• df.isnull() marks every missing value (NaN) in the DataFrame, and .sum() counts them per column.
• The result is printed to show the number of missing values in each column.
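As a small illustration of what the printed counts look like, here is a sketch on a hypothetical three-row DataFrame (the values are made up; only the column names follow the script).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few gaps.
toy = pd.DataFrame({
    "Age": [25, np.nan, np.nan],
    "Salary": [50000, 62000, np.nan],
    "Department": ["HR", None, "IT"],
})

# isnull() gives a True/False table; sum() counts the True values per column.
missing_counts = toy.isnull().sum()
print(missing_counts)
```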

4. Handle Missing Values

if "Age" in df.columns and df["Age"].dtype != object:
    df["Age"] = df["Age"].fillna(df["Age"].mean())  # Fill Age with mean (no inplace)

if "Salary" in df.columns and df["Salary"].dtype != object:
    df["Salary"] = df["Salary"].fillna(df["Salary"].median())  # Fill Salary with median (no inplace)

• Missing values are handled for the "Age" and "Salary" columns.
  o For Age: if the column exists and is numeric (not an object type such as strings), missing values (NaN) are replaced with the column mean, computed by df["Age"].mean().
  o For Salary: similarly, missing values are filled with the column median (df["Salary"].median()), which is less sensitive to extreme values than the mean.
• Important: instead of calling fillna(..., inplace=True) on the selected column, the result of fillna() is assigned back to the column. This avoids the chained-assignment warning that recent pandas versions raise for df["Age"].fillna(..., inplace=True).
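The difference between the two fill strategies can be seen on a toy example. This is a sketch with made-up numbers; the one very large salary is there to show why the median is often preferred for skewed columns.

```python
import numpy as np
import pandas as pd

# Hypothetical values; one salary is deliberately extreme.
toy = pd.DataFrame({
    "Age": [20.0, 30.0, np.nan, 40.0],
    "Salary": [40000.0, np.nan, 50000.0, 1000000.0],
})

# Assign the result back instead of using inplace=True.
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())              # mean of 20, 30, 40
toy["Salary"] = toy["Salary"].fillna(toy["Salary"].median())   # median of the 3 known salaries

print(toy)
```

The missing age becomes 30.0, and the missing salary becomes 50000.0 rather than being pulled upward by the extreme value.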

5. Remove Duplicates

df.drop_duplicates(inplace=True)

• df.drop_duplicates() removes duplicate rows from the DataFrame, keeping the first occurrence of each.
• inplace=True means the operation modifies the original DataFrame (df) instead of returning a new one.
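A brief sketch of the same call on hypothetical data, using the non-inplace form so the before and after can be compared. Note that the surviving rows keep their original index labels.

```python
import pandas as pd

# Hypothetical rows; the third is an exact copy of the first.
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Ann"],
    "Age": [30, 25, 30],
})

deduped = toy.drop_duplicates()  # keeps the first occurrence by default
print(deduped)

# Optional: renumber the rows after dropping.
renumbered = deduped.reset_index(drop=True)
```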

6. Encode Categorical Data (One-Hot Encoding for 'Department')

if "Department" in df.columns and df["Department"].dtype == object:
    df = pd.get_dummies(df, columns=["Department"], drop_first=True)

• One-hot encoding is applied to the "Department" column (if it exists and holds strings).
  o pd.get_dummies() converts categorical variables into dummy/indicator variables (one binary column per category).
  o drop_first=True drops the column for the first category (in sorted order) to avoid multicollinearity, since that column can be inferred from the others.
  o For example, if "Department" had the values "HR", "Finance", and "IT", this step would produce two columns, "Department_HR" and "Department_IT". "Finance" is the dropped first category and is represented by both columns being 0. In each column, 1 (True in recent pandas versions) indicates presence and 0 (False) indicates absence.
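To see which columns drop_first=True actually leaves behind, here is a sketch on a hypothetical "Department" column with three distinct values.

```python
import pandas as pd

# Hypothetical department values.
toy = pd.DataFrame({"Department": ["HR", "Finance", "IT", "HR"]})

encoded = pd.get_dummies(toy, columns=["Department"], drop_first=True)

# Categories are sorted (Finance, HR, IT) and the first one is dropped.
print(list(encoded.columns))

# Cast to int for a 0/1 view regardless of the pandas version's default dtype.
print(encoded.astype(int))
```

The "Finance" row shows 0 in both remaining columns, which is how the dropped category is encoded.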

7. Handle Outliers Using IQR Method

numeric_cols = df.select_dtypes(include=['number']).columns  # Select only numeric columns

for col in numeric_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]

• Outliers are handled using the Interquartile Range (IQR) method.
  o The IQR is the distance between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).
  o A value counts as an outlier if it lies below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.
  o Outliers are removed by keeping only the rows that fall within these bounds. The filter is applied one column at a time, so a row is dropped if it is an outlier in any numeric column. Rows where the column is NaN are dropped as well, because comparisons with NaN evaluate to False.
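The bounds are easy to check by hand on a small column. The sketch below uses made-up salary figures with one extreme value and one missing value, and applies the same filter expression as the script to a single column.

```python
import numpy as np
import pandas as pd

# Hypothetical values: 500 is far from the rest, and one entry is missing.
toy = pd.DataFrame({"Salary": [40, 42, 45, 47, 50, 500, np.nan]})

q1 = toy["Salary"].quantile(0.25)  # quantile() ignores NaN
q3 = toy["Salary"].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Same boolean mask as in the script; NaN compares as False on both sides.
filtered = toy[(toy["Salary"] >= lower_bound) & (toy["Salary"] <= upper_bound)]

print(lower_bound, upper_bound)
print(filtered)
```

Both the extreme value and the NaN row are absent from the result.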

8. Feature Scaling Using Min-Max Normalization

scaler = MinMaxScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

• Feature scaling is performed using Min-Max normalization on the numeric columns.
  o MinMaxScaler() rescales each numeric column to the range 0 to 1: the column minimum maps to 0 and the column maximum maps to 1.
  o This helps models that are sensitive to feature scale, such as k-nearest neighbors or algorithms trained with gradient descent.
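The arithmetic behind Min-Max normalization is (x - min) / (max - min), applied per column. The sketch below writes that formula out by hand in pandas on hypothetical values; it is not the scikit-learn implementation, only the same calculation for the default 0 to 1 range.

```python
import pandas as pd

# Hypothetical numeric columns on very different scales.
toy = pd.DataFrame({
    "Age": [20, 30, 40],
    "Salary": [1000, 1500, 3000],
})

# Column-wise min-max formula: subtract the minimum, divide by the range.
col_min = toy.min()
col_range = toy.max() - toy.min()
scaled = (toy - col_min) / col_range

print(scaled)
```

In this hand-written form a constant column would have a range of 0 and divide by zero, so it assumes every column has at least two distinct values.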

9. Save the Cleaned Dataset

df.to_csv("cleaned_data.csv", index=False)

• The cleaned dataset is saved to a new CSV file named "cleaned_data.csv".
• index=False ensures that the index column (row numbers) is not written to the file.
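The effect of index=False is easiest to see by writing the same hypothetical DataFrame both ways. The sketch writes to in-memory buffers rather than to cleaned_data.csv so that nothing is created on disk.

```python
import io

import pandas as pd

# Hypothetical frame whose index has gaps, as happens after rows are filtered out.
toy = pd.DataFrame({"Age": [0.0, 1.0]}, index=[3, 7])

buf_with = io.StringIO()
toy.to_csv(buf_with)                 # default: the index is written as the first column
with_index = buf_with.getvalue()

buf_without = io.StringIO()
toy.to_csv(buf_without, index=False)  # only the data columns are written
without_index = buf_without.getvalue()

print(with_index)
print(without_index)
```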

10. Display Missing Values After Handling

print("\nMissing values after handling:")

print(df.isnull().sum())

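• After handling the missing values, the count of missing values in each column is displayed again.
• The numeric columns should now show 0: "Age" and "Salary" were filled, and rows with NaN in any other numeric column were dropped by the outlier filter. Non-numeric columns are not touched by the script, so any missing values in them will still be reported.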

11. Data Visualization (Histograms, Boxplots, and Heatmap)

Histogram for Age Distribution

if "Age" in df.columns and df["Age"].dtype != object:
    plt.figure(figsize=(6,4))
    sns.histplot(df["Age"], bins=10, kde=True, color='blue')
    plt.title("Age Distribution")
    plt.xlabel("Age")
    plt.ylabel("Count")
    plt.show()

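• A histogram is plotted for the "Age" column to visualize the distribution. Because the plot is drawn after scaling, the x-axis shows scaled values between 0 and 1 rather than ages in years.
  o sns.histplot() draws the histogram, and kde=True overlays a kernel density estimate (KDE) curve.
  o bins=10 specifies the number of bins to group the data into.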

Boxplot for Salary

if "Salary" in df.columns and df["Salary"].dtype != object:
    plt.figure(figsize=(6,4))
    sns.boxplot(x=df["Salary"], color='red')
    plt.title("Boxplot of Salary")
    plt.show()

• A boxplot is created for the "Salary" column to visualize outliers and the distribution.
  o sns.boxplot() shows the distribution of "Salary" with its quartiles and outliers.

Correlation Heatmap

if len(numeric_cols) > 1:
    plt.figure(figsize=(8,6))
    sns.heatmap(df[numeric_cols].corr(), annot=True, cmap="coolwarm", fmt=".2f",
                linewidths=0.5)
    plt.title("Feature Correlation Heatmap")
    plt.show()

• A heatmap is plotted to show the correlation between numeric columns.
  o df[numeric_cols].corr() calculates the correlation matrix.
  o annot=True adds numerical labels for each cell in the heatmap.
  o cmap="coolwarm" specifies the color map to use.
  o fmt=".2f" sets the format for numbers to two decimal places.
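The matrix that the heatmap colors can be inspected directly, without plotting. The sketch below uses hypothetical columns chosen so that the result is easy to predict: y rises exactly with x, and z falls exactly as x rises.

```python
import pandas as pd

# Hypothetical columns with perfectly linear relationships.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],   # y = 2 * x
    "z": [4, 3, 2, 1],   # z = 5 - x
})

# Pearson correlation by default; the result is a square, symmetric DataFrame.
corr = toy.corr()
print(corr.round(2))
```

Values near 1 mean two columns move together, values near -1 mean they move in opposite directions, and the diagonal is always 1.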

Final Output:

The script processes and cleans data by:

1. Handling missing values.
2. Removing duplicates.
3. One-hot encoding categorical features.
4. Handling outliers.
5. Scaling numeric features.
6. Saving the cleaned data to a new CSV file.
7. Visualizing the cleaned data.


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler

# Load dataset
df = pd.read_csv("data.csv")

# Display missing values before handling
print("Missing values before handling:")
print(df.isnull().sum())

# 1️⃣ Handle Missing Values Safely (Fixing the warning by avoiding inplace=True)
if "Age" in df.columns and df["Age"].dtype != object:
    df["Age"] = df["Age"].fillna(df["Age"].mean())  # Fill Age with Mean (No inplace)

if "Salary" in df.columns and df["Salary"].dtype != object:
    df["Salary"] = df["Salary"].fillna(df["Salary"].median())  # Fill Salary with Median (No inplace)

# 2️⃣ Remove Duplicates
df.drop_duplicates(inplace=True)

# 3️⃣ Encode Categorical Data (One-Hot Encoding for 'Department')
if "Department" in df.columns and df["Department"].dtype == object:
    df = pd.get_dummies(df, columns=["Department"], drop_first=True)

# 4️⃣ Handle Outliers using IQR Method (Only for Numeric Columns)
numeric_cols = df.select_dtypes(include=['number']).columns  # Select only numeric columns

for col in numeric_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]

# 5️⃣ Feature Scaling (Normalize Numeric Features using Min-Max Scaling)
scaler = MinMaxScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

# Save the cleaned dataset
df.to_csv("cleaned_data.csv", index=False)

# Display missing values after handling
print("\nMissing values after handling:")
print(df.isnull().sum())

print("\nData preprocessing completed successfully! ✅")

# ---------------------------------------------
# 📊 DATA VISUALIZATION (With Safe Checks)
# ---------------------------------------------

# 🎯 1. Histogram for Age Distribution (Only if "Age" Exists and is Numeric)
if "Age" in df.columns and df["Age"].dtype != object:
    plt.figure(figsize=(6,4))
    sns.histplot(df["Age"], bins=10, kde=True, color='blue')
    plt.title("Age Distribution")
    plt.xlabel("Age")
    plt.ylabel("Count")
    plt.show()

# 🎯 2. Boxplot for Salary (Only if "Salary" Exists and is Numeric)
if "Salary" in df.columns and df["Salary"].dtype != object:
    plt.figure(figsize=(6,4))
    sns.boxplot(x=df["Salary"], color='red')
    plt.title("Boxplot of Salary")
    plt.show()

# 🎯 3. Correlation Heatmap (Only for Numeric Columns)
if len(numeric_cols) > 1:
    plt.figure(figsize=(8,6))
    sns.heatmap(df[numeric_cols].corr(), annot=True, cmap="coolwarm", fmt=".2f",
                linewidths=0.5)
    plt.title("Feature Correlation Heatmap")
    plt.show()
