Data Preprocessing & Visualization1

Uploaded by

Zeha 1

Step 1: Load the Data

In [4]: import pandas as pd

# Load the data

df = pd.read_csv('data.csv')
print("Initial Data:\n", df)

Initial Data:
   ID   Age   Salary   Department  Experience
0   1  25.0  50000.0        Sales           2
1   2  30.0  60000.0  Engineering           5
2   3  22.0  45000.0        Sales           1
3   4  35.0      NaN           HR          10
4   5  28.0  70000.0  Engineering           4
5   6  40.0  80000.0           HR          15
6   7  38.0  75000.0        Sales          12
7   8   NaN  62000.0  Engineering           7
8   9  45.0  90000.0           HR          20
9  10  32.0  54000.0        Sales           6
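
Before filling anything in, it can help to count the gaps. The sketch below rebuilds the ten printed rows inline (an assumption made so it runs without `data.csv`) and reports which columns and rows contain NaN. The names `missing_per_column` and `rows_with_gaps` are illustrative only.

```python
import numpy as np
import pandas as pd

# Same ten rows as the printed output above, rebuilt inline so the
# sketch does not depend on data.csv.
df = pd.DataFrame({
    "ID": range(1, 11),
    "Age": [25, 30, 22, 35, 28, 40, 38, np.nan, 45, 32],
    "Salary": [50000, 60000, 45000, np.nan, 70000,
               80000, 75000, 62000, 90000, 54000],
    "Department": ["Sales", "Engineering", "Sales", "HR", "Engineering",
                   "HR", "Sales", "Engineering", "HR", "Sales"],
    "Experience": [2, 5, 1, 10, 4, 15, 12, 7, 20, 6],
})

# Number of missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Rows that have at least one missing value
rows_with_gaps = df[df.isna().any(axis=1)]
print(rows_with_gaps)
```

For this data, one `Age` value and one `Salary` value are missing, in the rows with ID 8 and ID 4 respectively.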

Step 2: Data Preprocessing

1) Handling Missing Values

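In [5]: # Fill missing values in 'Age' and 'Salary' with the mean of the respective columns.
# Assign the result back: fillna(inplace=True) on a selected column is a chained
# assignment, which warns in pandas 2.x and has no effect under copy-on-write.
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())
print("\nData after handling missing values:\n", df)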

Data after handling missing values:

   ID        Age        Salary   Department  Experience
0   1  25.000000  50000.000000        Sales           2
1   2  30.000000  60000.000000  Engineering           5
2   3  22.000000  45000.000000        Sales           1
3   4  35.000000  65111.111111           HR          10
4   5  28.000000  70000.000000  Engineering           4
5   6  40.000000  80000.000000           HR          15
6   7  38.000000  75000.000000        Sales          12
7   8  32.777778  62000.000000  Engineering           7
8   9  45.000000  90000.000000           HR          20
9  10  32.000000  54000.000000        Sales           6
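
The two filled-in values can be checked by hand: `mean()` skips NaN, so each mean is taken over the nine known values (295 / 9 for Age, 586000 / 9 for Salary). The sketch below reproduces that arithmetic and also prints the median for comparison. The median is not used in this notebook; it is shown only as a common alternative when a column is skewed.

```python
import numpy as np
import pandas as pd

# Column values copied from the initial data; NaN marks the missing entries.
age = pd.Series([25, 30, 22, 35, 28, 40, 38, np.nan, 45, 32])
salary = pd.Series([50000, 60000, 45000, np.nan, 70000,
                    80000, 75000, 62000, 90000, 54000])

# mean() and median() ignore NaN by default (skipna=True)
for name, col in [("Age", age), ("Salary", salary)]:
    print(f"{name}: mean={col.mean():.6f}, median={col.median():.1f}")

# Mean imputation, assigning the result to a new Series
age_filled = age.fillna(age.mean())
salary_filled = salary.fillna(salary.mean())
print(age_filled.iloc[7], salary_filled.iloc[3])
```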

2) Handling Outliers
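In [6]: # Remove outliers from 'Salary' using the IQR method
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
is_outlier = (df['Salary'] < lower) | (df['Salary'] > upper)
# .copy() makes the filtered frame independent, so later column assignments do not warn
df = df[~is_outlier].copy()
print("\nData after removing outliers:\n", df)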

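Data after removing outliers:

   ID        Age        Salary   Department  Experience
0   1  25.000000  50000.000000        Sales           2
1   2  30.000000  60000.000000  Engineering           5
2   3  22.000000  45000.000000        Sales           1
3   4  35.000000  65111.111111           HR          10
4   5  28.000000  70000.000000  Engineering           4
5   6  40.000000  80000.000000           HR          15
6   7  38.000000  75000.000000        Sales          12
7   8  32.777778  62000.000000  Engineering           7
8   9  45.000000  90000.000000           HR          20
9  10  32.000000  54000.000000        Sales           6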

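
The IQR fences for this data can be worked out explicitly. The sketch below recomputes them from the imputed Salary column (values copied from the output of the previous step) and counts how many rows fall outside. The names `lower_fence`, `upper_fence` and `is_outlier` are illustrative only.

```python
import pandas as pd

# Salary column after mean imputation; 586000 / 9 is the imputed value.
salary = pd.Series([50000.0, 60000.0, 45000.0, 586000 / 9, 70000.0,
                    80000.0, 75000.0, 62000.0, 90000.0, 54000.0])

q1 = salary.quantile(0.25)
q3 = salary.quantile(0.75)
iqr = q3 - q1

# Points more than 1.5 * IQR beyond the quartiles count as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
is_outlier = (salary < lower_fence) | (salary > upper_fence)

print(f"Q1={q1:.2f}, Q3={q3:.2f}, IQR={iqr:.2f}")
print(f"fences: [{lower_fence:.2f}, {upper_fence:.2f}]")
print("rows flagged as outliers:", int(is_outlier.sum()))
```

With the default linear interpolation the quartiles are 55500 and 73750, which puts the fences at roughly 28125 and 101125. Every salary lies inside that range, so this step removes no rows, which is consistent with all ten rows appearing in the later outputs.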
3) Encoding Categorical Variables

In [7]: from sklearn.preprocessing import LabelEncoder

# Encode 'Department' as integer codes; classes are sorted alphabetically
# (Engineering=0, HR=1, Sales=2)

label_encoder = LabelEncoder()
df['Department'] = label_encoder.fit_transform(df['Department'])
print("\nData after encoding categorical variables:\n", df)

Data after encoding categorical variables:

   ID        Age        Salary  Department  Experience
0   1  25.000000  50000.000000           2           2
1   2  30.000000  60000.000000           0           5
2   3  22.000000  45000.000000           2           1
3   4  35.000000  65111.111111           1          10
4   5  28.000000  70000.000000           0           4
5   6  40.000000  80000.000000           1          15
6   7  38.000000  75000.000000           2          12
7   8  32.777778  62000.000000           0           7
8   9  45.000000  90000.000000           1          20
9  10  32.000000  54000.000000           2           6
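
Integer codes imply an order (Engineering < HR < Sales) that the departments do not actually have. One common alternative, which this notebook does not use, is one-hot encoding with `pd.get_dummies`. A minimal sketch, assuming the same ten `Department` values as above:

```python
import pandas as pd

# Only the columns needed for the illustration
departments = pd.DataFrame({
    "ID": range(1, 11),
    "Department": ["Sales", "Engineering", "Sales", "HR", "Engineering",
                   "HR", "Sales", "Engineering", "HR", "Sales"],
})

# One 0/1 indicator column per department, replacing the original column
one_hot = pd.get_dummies(departments, columns=["Department"], dtype=int)
print(one_hot)
```

Each row then has exactly one indicator set to 1, so no ordering between departments is suggested to a downstream model.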

4) Scaling and Normalization

In [9]: from sklearn.preprocessing import StandardScaler

# Standardize numeric columns to zero mean and unit variance (z-scores)

scaler = StandardScaler()
df[['Age', 'Salary', 'Experience']] = scaler.fit_transform(df[['Age', 'Salary', 'Experience']])
print("\nData after scaling and normalization:\n", df)

Data after scaling and normalization:

   ID        Age     Salary  Department  Experience
0   1  -1.170477  -1.140700           2   -1.083228
1   2  -0.418027  -0.385825           0   -0.559085
2   3  -1.621947  -1.518138           2   -1.257942
3   4   0.334422   0.000000           1    0.314485
4   5  -0.719007   0.369050           0   -0.733799
5   6   1.086871   1.123925           1    1.188056
6   7   0.785892   0.746488           2    0.663914
7   8   0.000000  -0.234850           0   -0.209657
8   9   1.839321   1.878801           1    2.061627
9  10  -0.117048  -0.838750           2   -0.384371
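
`StandardScaler` computes z = (x - mean) / std for each column, using the population standard deviation (`ddof=0`). The sketch below reproduces the `Experience` column by hand with pandas alone. It is a cross-check under that assumption, not a replacement for the scaler.

```python
import pandas as pd

# Experience values from the original data
experience = pd.Series([2, 5, 1, 10, 4, 15, 12, 7, 20, 6], dtype=float)

mean = experience.mean()
std = experience.std(ddof=0)  # population std; pandas defaults to ddof=1

# z-score: distance from the mean in units of standard deviation
z = (experience - mean) / std
print(z.round(6).tolist())
```

The first and ninth values come out at about -1.083228 and 2.061627, matching the scaled table. Note that the imputed entries (Age in row 7, Salary in row 3) scale to exactly 0 because they were set to the column mean.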

Step 3: Data Visualization

In [10]: import matplotlib.pyplot as plt
import seaborn as sns

# Scatter plot of Age vs Salary

plt.figure(figsize=(12, 8))
plt.subplot(2, 2, 1)
sns.scatterplot(x='Age', y='Salary', data=df)
plt.title('Scatter plot of Age vs Salary')

# Histogram of Age
plt.subplot(2, 2, 2)
sns.histplot(df['Age'], kde=True)
plt.title('Histogram of Age')

# Heatmap of Correlation Matrix

plt.subplot(2, 2, 3)
# Drop 'ID' first: it is a row identifier, not a feature
sns.heatmap(df.drop(columns='ID').corr(), annot=True, cmap='coolwarm')
plt.title('Heatmap of Correlation Matrix')

# Boxplot of Salary by Department

plt.subplot(2, 2, 4)
sns.boxplot(x='Department', y='Salary', data=df)
plt.title('Boxplot of Salary by Department')

plt.tight_layout()
plt.show()
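
Because `Department` was label-encoded earlier, the boxplot's x-axis shows the codes 0, 1 and 2 rather than names. A minimal sketch of mapping the codes back for plotting follows. The names `df_plot`, `code_to_name` and `DepartmentName` are illustrative, and the sketch uses the unscaled salaries for readability, whereas the notebook plots the standardized ones.

```python
import pandas as pd

# Mapping read off the encoded table: Engineering=0, HR=1, Sales=2
code_to_name = {0: "Engineering", 1: "HR", 2: "Sales"}

df_plot = pd.DataFrame({
    "Department": [2, 0, 2, 1, 0, 1, 2, 0, 1, 2],
    "Salary": [50000.0, 60000.0, 45000.0, 586000 / 9, 70000.0,
               80000.0, 75000.0, 62000.0, 90000.0, 54000.0],
})

# Readable labels for the plot axis
df_plot["DepartmentName"] = df_plot["Department"].map(code_to_name)

# The median per group is the centre line each box would show
median_salary = df_plot.groupby("DepartmentName")["Salary"].median()
print(median_salary)
```

Passing `x='DepartmentName'` instead of `x='Department'` to `sns.boxplot` would then label the boxes by name.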
