DP
Preprocessing Script
1. Importing Libraries
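import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler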
2. Load Dataset
df = pd.read_csv("data.csv")
pd.read_csv("data.csv") loads the CSV file data.csv into a pandas DataFrame (df). This
DataFrame is where all the data manipulation will take place.
3. Check for Missing Values
print(df.isnull().sum())
df.isnull().sum() counts the missing values (NaN) in each column of the DataFrame:
isnull() marks every missing cell as True, and sum() adds up the True values per column.
The result is printed so the number of missing values in each column is visible before cleaning.
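As a small illustration of what this check reports, the sketch below builds a toy DataFrame and counts the missing entries per column. The values are made up for the example; the contents of data.csv are not shown in this document.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for data.csv.
toy = pd.DataFrame({
    "Age": [25, np.nan, 31, np.nan],
    "Salary": [50000, 62000, np.nan, 58000],
    "Department": ["HR", "IT", "IT", "HR"],
})

# isnull() marks each missing cell as True; sum() counts the True values per column.
missing_counts = toy.isnull().sum()
print(missing_counts)
```

Here "Age" has two missing entries, "Salary" has one, and "Department" has none.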
4. Handle Missing Values
Missing values are handled for the "Age" and "Salary" columns.
o For Age: If the column exists and is numeric (not an object type like strings),
missing values (NaN) are replaced with the mean of the column.
df["Age"].mean() calculates the mean.
o For Salary: Similarly, missing values in the "Salary" column are filled with the
median of the column (df["Salary"].median()).
Important: Instead of calling fillna() with inplace=True on a selected column, the result of
fillna() is assigned back to the column. This avoids the chained-assignment warnings that
pandas raises for in-place operations on a column selection.
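The fill step described above can be sketched as follows. This is a minimal example on made-up data; the use of pandas.api.types.is_numeric_dtype to express "is numeric" is an assumption, since the original check is not shown.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy data with one gap in each column.
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Salary": [30000.0, 50000.0, np.nan],
})

# Assign the result of fillna() back to the column instead of using inplace=True.
if "Age" in toy.columns and is_numeric_dtype(toy["Age"]):
    toy["Age"] = toy["Age"].fillna(toy["Age"].mean())          # mean of 20 and 40
if "Salary" in toy.columns and is_numeric_dtype(toy["Salary"]):
    toy["Salary"] = toy["Salary"].fillna(toy["Salary"].median())  # median of 30000 and 50000

print(toy)
```

The missing "Age" becomes 30.0 and the missing "Salary" becomes 40000.0; both statistics are computed from the non-missing values only.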
5. Remove Duplicates
df.drop_duplicates(inplace=True)
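drop_duplicates() removes every row that is an exact copy of an earlier row.
inplace=True means the operation modifies the original DataFrame (df) directly and returns
None, instead of returning a new DataFrame.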
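A short sketch of this behaviour on made-up rows (the "Name" column is hypothetical and only used for illustration):

```python
import pandas as pd

# Row 2 repeats row 0 exactly; row 3 shares the name but has a different age.
toy = pd.DataFrame({
    "Name": ["Ana", "Ben", "Ana", "Ana"],
    "Age": [30, 41, 30, 35],
})

# With inplace=True the call changes toy itself and returns None.
result = toy.drop_duplicates(inplace=True)

print(result)
print(toy)
```

Only the fully identical row is dropped, and the surviving rows keep their original index labels.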
One-hot encoding is applied to the "Department" column (if it exists and is categorical).
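The encoding call itself is not shown here. One common way to do this in pandas is pd.get_dummies, sketched below on made-up data; both the choice of get_dummies and the "is not numeric" check are assumptions about how the step might be written.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy data with a text-valued "Department" column.
toy = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cy"],
    "Department": ["HR", "IT", "HR"],
})

# Encode only if the column exists and holds categories rather than numbers.
if "Department" in toy.columns and not is_numeric_dtype(toy["Department"]):
    toy = pd.get_dummies(toy, columns=["Department"])

print(toy.columns.tolist())
```

The "Department" column is replaced by one indicator column per category, named with the "Department_" prefix.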
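numeric_cols = df.select_dtypes(include=['number']).columns  # numeric columns only
for col in numeric_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    df = df[(df[col] >= Q1 - 1.5 * IQR) & (df[col] <= Q3 + 1.5 * IQR)]  # keep rows within the bounds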
o IQR (interquartile range) is the distance between the 1st quartile (25th percentile) and
the 3rd quartile (75th percentile).
o A value is treated as an outlier if it falls below the lower bound, Q1 - 1.5 * IQR, or
above the upper bound, Q3 + 1.5 * IQR.
o Outliers are removed by keeping only the rows whose values lie within these bounds.
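To make the rule concrete, here is a minimal sketch that applies it to a single made-up Series; the numbers are illustrative and not taken from the dataset.

```python
import pandas as pd

# Hypothetical values with one obvious outlier (300).
salaries = pd.Series([40, 42, 45, 47, 50, 52, 55, 300])

q1 = salaries.quantile(0.25)
q3 = salaries.quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr   # values below this are outliers
upper = q3 + 1.5 * iqr   # values above this are outliers

# Keep only the values inside [lower, upper].
kept = salaries[(salaries >= lower) & (salaries <= upper)]
print(lower, upper)
print(kept.tolist())
```

With these values the bounds are 31.5 and 65.5, so 300 is dropped and the other seven values are kept.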
scaler = MinMaxScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
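With its default settings, MinMaxScaler rescales each column to the range 0 to 1 using (x - min) / (max - min). The sketch below reproduces that formula in plain pandas on made-up data, only to show what the scaled values look like; it is not a replacement for the scaler, and it assumes no column is constant.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "Age": [20.0, 30.0, 40.0],
    "Salary": [1000.0, 3000.0, 2000.0],
})
numeric_cols = toy.select_dtypes(include=["number"]).columns

# Per-column minimum and maximum.
col_min = toy[numeric_cols].min()
col_max = toy[numeric_cols].max()

# Min-max formula; a constant column would divide by zero here.
scaled = (toy[numeric_cols] - col_min) / (col_max - col_min)
print(scaled)
```

Each column's smallest value maps to 0, its largest to 1, and everything else falls in between.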
df.to_csv("cleaned_data.csv", index=False)
index=False ensures that the index column (row numbers) is not written to the file.
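The effect of index=False can be seen by writing the same made-up DataFrame twice to in-memory buffers (io.StringIO is used here only to avoid creating files):

```python
import io
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"Age": [25, 31], "Salary": [50000, 62000]})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(repr(header_with))
print(repr(header_without))
```

The first header starts with an empty field for the index, while the second lists only "Age" and "Salary".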
print(df.isnull().sum())
After the missing values have been handled, the count of missing values in each column is
displayed again.
Ideally, every missing value has been filled or removed by this point, so the output should
show 0 for all columns.
plt.figure(figsize=(6,4))
sns.histplot(df["Age"])  # histogram of the "Age" column
plt.title("Age Distribution")
plt.xlabel("Age")
plt.ylabel("Count")
plt.show()
plt.figure(figsize=(6,4))
sns.boxplot(x=df["Salary"], color='red')
plt.title("Boxplot of Salary")
plt.show()
A boxplot is created for the "Salary" column to visualize its distribution and outliers.
o sns.boxplot() draws the median and quartiles of "Salary" as a box and marks points
beyond the whiskers as outliers.
Correlation Heatmap
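if len(numeric_cols) > 1:
    plt.figure(figsize=(8,6))
    sns.heatmap(df[numeric_cols].corr())  # pairwise correlations of the numeric columns
    plt.title("Correlation Heatmap")
    plt.show()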
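A heatmap of this kind is a colored view of a correlation matrix. The sketch below computes such a matrix with DataFrame.corr() on made-up data; the "Absences" column is hypothetical and only included to show a negative correlation.

```python
import pandas as pd

# Hypothetical toy data: Salary rises with Age, Absences falls with Age.
toy = pd.DataFrame({
    "Age": [20, 30, 40, 50],
    "Salary": [2000, 3000, 4000, 5000],
    "Absences": [8, 6, 4, 2],
})
numeric_cols = toy.select_dtypes(include=["number"]).columns

# A correlation matrix needs at least two numeric columns to be informative.
if len(numeric_cols) > 1:
    corr = toy[numeric_cols].corr()
    print(corr.round(2))
```

Values near 1 indicate columns that rise together, values near -1 indicate columns that move in opposite directions, and the diagonal is always 1.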
Final Output:
1. Handling missing values.
2. Removing duplicates.
3. One-hot encoding categorical features.
4. Handling outliers.
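import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler

# Load dataset
df = pd.read_csv("data.csv")
# Count missing values per column before cleaning
print(df.isnull().sum())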
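# 1️⃣ Handle Missing Values Safely (Fixing the warning by avoiding inplace=True)
if "Age" in df.columns and pd.api.types.is_numeric_dtype(df["Age"]):
    df["Age"] = df["Age"].fillna(df["Age"].mean())  # fill with the column mean
if "Salary" in df.columns and pd.api.types.is_numeric_dtype(df["Salary"]):
    df["Salary"] = df["Salary"].fillna(df["Salary"].median())  # fill with the column median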
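# 2️⃣ Remove Duplicates
df.drop_duplicates(inplace=True)
# 3️⃣ One-Hot Encode "Department" (Only if it Exists and is Categorical)
if "Department" in df.columns and not pd.api.types.is_numeric_dtype(df["Department"]):
    df = pd.get_dummies(df, columns=["Department"])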
# 4️⃣ Handle Outliers using IQR Method (Only for Numeric Columns)
numeric_cols = df.select_dtypes(include=['number']).columns # Select only numeric columns
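for col in numeric_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    # Keep only rows within [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
    df = df[(df[col] >= Q1 - 1.5 * IQR) & (df[col] <= Q3 + 1.5 * IQR)]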
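# Scale numeric columns to the 0-1 range
scaler = MinMaxScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
# Save the cleaned dataset without the index column
df.to_csv("cleaned_data.csv", index=False)
# Count missing values per column after cleaning
print(df.isnull().sum())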
# ---------------------------------------------
# Visualizations
# ---------------------------------------------
# 🎯 1. Histogram for Age Distribution (Only if "Age" Exists and is Numeric)
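if "Age" in df.columns and pd.api.types.is_numeric_dtype(df["Age"]):
    plt.figure(figsize=(6,4))
    sns.histplot(df["Age"])  # histogram of the "Age" column
    plt.title("Age Distribution")
    plt.xlabel("Age")
    plt.ylabel("Count")
    plt.show()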
# 🎯 2. Boxplot for Salary
plt.figure(figsize=(6,4))
sns.boxplot(x=df["Salary"], color='red')
plt.title("Boxplot of Salary")
plt.show()
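# 🎯 3. Correlation Heatmap (Only if More Than One Numeric Column Exists)
if len(numeric_cols) > 1:
    plt.figure(figsize=(8,6))
    sns.heatmap(df[numeric_cols].corr())  # pairwise correlations of the numeric columns
    plt.title("Correlation Heatmap")
    plt.show()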