Data Preprocessing & Visualization1

Uploaded by

Zeha 1

Step 1: Load the Data

In [4]: import pandas as pd

# Load the data
df = pd.read_csv('data.csv')
print("Initial Data:\n", df)

Initial Data:
   ID   Age   Salary   Department  Experience
0   1  25.0  50000.0        Sales           2
1   2  30.0  60000.0  Engineering           5
2   3  22.0  45000.0        Sales           1
3   4  35.0      NaN           HR          10
4   5  28.0  70000.0  Engineering           4
5   6  40.0  80000.0           HR          15
6   7  38.0  75000.0        Sales          12
7   8   NaN  62000.0  Engineering           7
8   9  45.0  90000.0           HR          20
9  10  32.0  54000.0        Sales           6
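
Since data.csv itself is not included, the following is a minimal sketch that rebuilds the same table from the printed output and counts the missing values per column before any cleaning. The CSV_TEXT string and the missing_counts name are assumptions introduced here for illustration, not part of the original notebook.

```python
import io

import pandas as pd

# Hypothetical stand-in for data.csv, transcribed from the printed output.
CSV_TEXT = """ID,Age,Salary,Department,Experience
1,25,50000,Sales,2
2,30,60000,Engineering,5
3,22,45000,Sales,1
4,35,,HR,10
5,28,70000,Engineering,4
6,40,80000,HR,15
7,38,75000,Sales,12
8,,62000,Engineering,7
9,45,90000,HR,20
10,32,54000,Sales,6
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# Count missing values per column to see which columns need attention.
missing_counts = df.isna().sum()
print(missing_counts)

# Columns containing NaN are read as float64, which is why Age prints as 25.0.
print(df.dtypes)
```

On this table the check should flag exactly one missing value in Age and one in Salary, which are the two columns handled in the next step.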

Step 2: Data Preprocessing

1) Handling Missing Values


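In [5]: # Fill missing values in 'Age' and 'Salary' with the mean of the respective columns.
# The result is assigned back to the column: calling fillna(..., inplace=True) on df['col']
# is a chained in-place operation that recent pandas versions deprecate.
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())
print("\nData after handling missing values:\n", df)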

Data after handling missing values:
   ID        Age        Salary   Department  Experience
0   1  25.000000  50000.000000        Sales           2
1   2  30.000000  60000.000000  Engineering           5
2   3  22.000000  45000.000000        Sales           1
3   4  35.000000  65111.111111           HR          10
4   5  28.000000  70000.000000  Engineering           4
5   6  40.000000  80000.000000           HR          15
6   7  38.000000  75000.000000        Sales          12
7   8  32.777778  62000.000000  Engineering           7
8   9  45.000000  90000.000000           HR          20
9  10  32.000000  54000.000000        Sales           6
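
The filled values 32.777778 and 65111.111111 are the means of the nine observed entries in each column. The sketch below reproduces them and, as an alternative not used in the original notebook, also computes the column medians, which are less sensitive to extreme values. The names fill_values, filled, and median_values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Age and Salary columns as printed in Step 1, with the two gaps as NaN.
df = pd.DataFrame({
    "Age": [25, 30, 22, 35, 28, 40, 38, np.nan, 45, 32],
    "Salary": [50000, 60000, 45000, np.nan, 70000,
               80000, 75000, 62000, 90000, 54000],
})

# mean() skips NaN by default, so each mean is taken over 9 observed values.
fill_values = df.mean()
print(fill_values)

# Passing a Series to fillna fills each column with its matching entry.
filled = df.fillna(fill_values)

# Alternative (an assumption, not the notebook's choice): median imputation.
median_values = df.median()
print(median_values)
```

Mean imputation keeps the column mean unchanged but shrinks its variance slightly, which is worth keeping in mind when the filled column is later standardized.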

2) Handling Outliers
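In [6]: # Remove outliers from 'Salary' using the IQR method
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
# Keep rows with Salary inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
# .copy() makes the result an independent frame, so later column assignments do not act on a slice.
df = df[~((df['Salary'] < (Q1 - 1.5 * IQR)) | (df['Salary'] > (Q3 + 1.5 * IQR)))].copy()
print("\nData after removing outliers:\n", df)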

Data after removing outliers:
   ID        Age        Salary   Department  Experience
0   1  25.000000  50000.000000        Sales           2
1   2  30.000000  60000.000000  Engineering           5
2   3  22.000000  45000.000000        Sales           1
3   4  35.000000  65111.111111           HR          10
4   5  28.000000  70000.000000  Engineering           4
5   6  40.000000  80000.000000           HR          15
6   7  38.000000  75000.000000        Sales          12
7   8  32.777778  62000.000000  Engineering           7
8   9  45.000000  90000.000000           HR          20
9  10  32.000000  54000.000000        Sales           6
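
The output is identical to the previous table because no salary falls outside the IQR fences. The sketch below works through the arithmetic on the Salary column as printed after imputation. The variable names (q1, q3, lower, upper, inside) are illustrative, and between() is used here as an equivalent way to express the same filter for non-missing values.

```python
import pandas as pd

# Salary column after mean imputation, as printed in the previous step.
salary = pd.Series(
    [50000, 60000, 45000, 65111.111111, 70000,
     80000, 75000, 62000, 90000, 54000],
    name="Salary",
)

# Quartiles use pandas' default linear interpolation.
q1 = salary.quantile(0.25)
q3 = salary.quantile(0.75)
iqr = q3 - q1

# Fences: values beyond 1.5 * IQR from the quartiles count as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, fences=[{lower}, {upper}]")

# between() is inclusive on both ends, matching the notebook's < and > tests.
inside = salary.between(lower, upper)
print("rows kept:", int(inside.sum()), "of", len(salary))
```

With Q1 = 55500 and Q3 = 73750 the fences are 28125 and 101125, so the smallest salary (45000) and the largest (90000) both stay in the data.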

3) Encoding Categorical Variables


In [7]: from sklearn.preprocessing import LabelEncoder

# Encode 'Department' column
label_encoder = LabelEncoder()
df['Department'] = label_encoder.fit_transform(df['Department'])
print("\nData after encoding categorical variables:\n", df)

Data after encoding categorical variables:
   ID        Age        Salary  Department  Experience
0   1  25.000000  50000.000000           2           2
1   2  30.000000  60000.000000           0           5
2   3  22.000000  45000.000000           2           1
3   4  35.000000  65111.111111           1          10
4   5  28.000000  70000.000000           0           4
5   6  40.000000  80000.000000           1          15
6   7  38.000000  75000.000000           2          12
7   8  32.777778  62000.000000           0           7
8   9  45.000000  90000.000000           1          20
9  10  32.000000  54000.000000           2           6
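
The integer codes follow the sorted order of the labels, so Engineering becomes 0, HR becomes 1, and Sales becomes 2. The sketch below reproduces that mapping with plain pandas rather than LabelEncoder, and also shows one-hot encoding through pd.get_dummies as an alternative that the original notebook does not use. The names mapping, codes, and one_hot and the "Dept" prefix are assumptions for illustration.

```python
import pandas as pd

# Department column as printed in Step 1.
departments = pd.Series(
    ["Sales", "Engineering", "Sales", "HR", "Engineering",
     "HR", "Sales", "Engineering", "HR", "Sales"],
    name="Department",
)

# Sorted unique labels receive codes 0..n-1, the same ordering that
# LabelEncoder applies to its classes.
categories = sorted(departments.unique())
mapping = {label: code for code, label in enumerate(categories)}
codes = departments.map(mapping)
print(mapping)
print(codes.tolist())

# Alternative (an assumption): one indicator column per department,
# which avoids implying an order such as Engineering < HR < Sales.
one_hot = pd.get_dummies(departments, prefix="Dept")
print(one_hot.head())
```

Integer labels are compact, but a model that treats them as numbers will read an ordering into them. One-hot columns avoid that at the cost of extra columns.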

4) Scaling and Normalization


In [9]: from sklearn.preprocessing import StandardScaler

# Standardize numeric columns to zero mean and unit variance
scaler = StandardScaler()
df[['Age', 'Salary', 'Experience']] = scaler.fit_transform(df[['Age', 'Salary', 'Experience']])
print("\nData after scaling and normalization:\n", df)

Data after scaling and normalization:
   ID        Age     Salary  Department  Experience
0   1  -1.170477  -1.140700           2   -1.083228
1   2  -0.418027  -0.385825           0   -0.559085
2   3  -1.621947  -1.518138           2   -1.257942
3   4   0.334422   0.000000           1    0.314485
4   5  -0.719007   0.369050           0   -0.733799
5   6   1.086871   1.123925           1    1.188056
6   7   0.785892   0.746488           2    0.663914
7   8   0.000000  -0.234850           0   -0.209657
8   9   1.839321   1.878801           1    2.061627
9  10  -0.117048  -0.838750           2   -0.384371
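
StandardScaler computes z = (x - mean) / std for each column, using the population standard deviation (divisor n, not n - 1). The sketch below reproduces the Age column by hand with pandas so the printed values can be checked. The variable names are illustrative. Note that the two imputed cells come out as exactly 0 because they were filled with the column mean.

```python
import pandas as pd

# Age column after mean imputation, as printed in the earlier step.
age = pd.Series(
    [25, 30, 22, 35, 28, 40, 38, 32.777778, 45, 32],
    name="Age",
)

# ddof=0 gives the population standard deviation, matching StandardScaler.
# pandas defaults to ddof=1, which would give slightly smaller magnitudes.
z_population = (age - age.mean()) / age.std(ddof=0)
z_sample = (age - age.mean()) / age.std(ddof=1)

print(z_population.round(6).tolist())
print("first value with ddof=0:", round(z_population.iloc[0], 6))
print("first value with ddof=1:", round(z_sample.iloc[0], 6))
```

The first value with ddof=0 should agree with the -1.170477 shown in the table to within rounding. The ddof=1 variant is printed only to show how the choice of divisor changes the result on a small sample.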

Step 3: Data Visualization


In [10]: import matplotlib.pyplot as plt
import seaborn as sns

# Scatter plot of Age vs Salary
plt.figure(figsize=(12, 8))
plt.subplot(2, 2, 1)
sns.scatterplot(x='Age', y='Salary', data=df)
plt.title('Scatter plot of Age vs Salary')

# Histogram of Age
plt.subplot(2, 2, 2)
sns.histplot(df['Age'], kde=True)
plt.title('Histogram of Age')

# Heatmap of Correlation Matrix
plt.subplot(2, 2, 3)
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title('Heatmap of Correlation Matrix')

# Boxplot of Salary by Department
plt.subplot(2, 2, 4)
sns.boxplot(x='Department', y='Salary', data=df)
plt.title('Boxplot of Salary by Department')

plt.tight_layout()
plt.show()

C:\Users\Reena\anaconda3\Lib\site-packages\seaborn\_oldcore.py:1498: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):
C:\Users\Reena\anaconda3\Lib\site-packages\seaborn\_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):
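
The heatmap cell passes df.corr() over every column, including ID, which is only a row identifier. A minimal sketch of one possible refinement, assuming the frame after imputation and label encoding, is to drop ID before computing the matrix. The corr name is illustrative, and the resulting frame could be passed to sns.heatmap in place of df.corr().

```python
import pandas as pd

# Frame after imputation and label encoding, transcribed from the output.
# Correlation is unaffected by standardization, so unscaled values are used.
df = pd.DataFrame({
    "ID": range(1, 11),
    "Age": [25, 30, 22, 35, 28, 40, 38, 32.777778, 45, 32],
    "Salary": [50000, 60000, 45000, 65111.111111, 70000,
               80000, 75000, 62000, 90000, 54000],
    "Department": [2, 0, 2, 1, 0, 1, 2, 0, 1, 2],
    "Experience": [2, 5, 1, 10, 4, 15, 12, 7, 20, 6],
})

# Drop the identifier so the matrix only relates actual features.
corr = df.drop(columns="ID").corr()
print(corr.round(2))
```

Department is still included here as integer codes, so its correlations reflect the arbitrary code order and should be read with caution.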