Data - Preprocessing - Jupyter Notebook

The document outlines a Jupyter Notebook for data preprocessing, detailing the steps to import libraries, read a dataset, handle missing data, encode categorical variables, split the dataset into training and test sets, and apply feature scaling. It includes code snippets for using libraries like NumPy, pandas, and scikit-learn to perform these tasks. The notebook emphasizes the importance of each preprocessing step in preparing data for machine learning models.

Uploaded by

Hnd Final


data_preprocessing - Jupyter Notebook http://localhost:8888/notebooks/Desktop/03%20Data%20Preprocessing...

Data Preprocessing Tools

Importing the libraries

In [ ]:  # Import the NumPy library for numerical operations on arrays
import numpy as np

# Import the matplotlib library for data visualization

import matplotlib.pyplot as plt

# Import the pandas library for data manipulation and analysis

import pandas as pd

Importing the dataset

The code below reads a CSV file into a pandas DataFrame, then splits the data into feature
variables (all columns except the last) and the target variable (the last column).

In [ ]:  # Read the dataset from a CSV file named 'Data.csv' into a pandas DataFrame
dataset = pd.read_csv('Data.csv')

# Select all rows and all columns except the last one as features (input variables)
# The iloc method is used for integer-location based indexing
# [:, :-1] means all rows (:) and all columns except the last one (:-1)
X = dataset.iloc[:, :-1].values

# Select all rows of the last column as the target variable (output variable)
# [:, -1] means all rows (:) and the last column (-1)
y = dataset.iloc[:, -1].values

In [ ]:  print(X)

In [ ]:  print(y)
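
The slicing above can be tried without the CSV file. The sketch below is only an illustration: the small DataFrame and its column names are hypothetical stand-ins for 'Data.csv', whose actual contents are not shown in the notebook.

```python
import pandas as pd

# Hypothetical stand-in for Data.csv: three feature columns and one target column
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

X = dataset.iloc[:, :-1].values   # all rows, every column except the last
y = dataset.iloc[:, -1].values    # all rows, last column only

print(X.shape)  # (3, 3)
print(y)        # ['No' 'Yes' 'No']
```

Note that `[:, 1]` would select the second column rather than the last one, which is why the index `-1` matters here.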

Taking care of missing data

Missing data in machine learning can lead to loss of information, bias, reduced accuracy, and
computational challenges, so it's essential to handle it properly using methods like imputation or
algorithms that support missing values to build reliable models.

1 of 5 7/23/2024, 1:19 PM

In [ ]:  # Import SimpleImputer from the scikit-learn library for handling missing data
from sklearn.impute import SimpleImputer

# Create an instance of the SimpleImputer class
# This imputer will replace missing values (np.nan) with the mean value of the corresponding column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

# Fit the imputer on the specified columns (columns 1 and 2) of the feature matrix X
# This step calculates the mean of these columns, which will be used to replace missing values
imputer.fit(X[:, 1:3])

# Transform the specified columns (columns 1 and 2) of the feature matrix X
# This step replaces any missing values in these columns with the calculated mean
X[:, 1:3] = imputer.transform(X[:, 1:3])

In [ ]:  print(X)
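
To see what mean imputation does, here is a minimal sketch on a small made-up array (the numbers are assumptions chosen for illustration, not values from the dataset). It fits and transforms in one call with `fit_transform` and inspects the learned column means through the `statistics_` attribute.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric columns with one missing entry in each
values = np.array([[44.0, 72000.0],
                   [27.0, np.nan],
                   [np.nan, 54000.0],
                   [31.0, 60000.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(values)

# statistics_ holds the per-column means computed from the non-missing entries
# (here 34.0 for the first column and 62000.0 for the second)
print(imputer.statistics_)
print(filled)
```

Each missing entry is replaced by the mean of the observed values in its own column; rows are never dropped.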

Encoding categorical data

We encode categorical data to convert it into a numerical format that machine learning
algorithms can interpret and process.

Machine learning algorithms require numerical input to perform mathematical operations and
make predictions, so encoding categorical data ensures the model can work with and learn
from these features.

Encoding the Independent Variable

In [ ]:  # Import ColumnTransformer and OneHotEncoder from scikit-learn for feature transformation
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Create an instance of ColumnTransformer
# The 'encoder' transformer applies OneHotEncoder to the first column (index 0)
# remainder='passthrough' means all other columns not specified will remain unchanged
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')

# Fit the ColumnTransformer to the feature matrix X and transform it
# This step applies one-hot encoding to the specified column and leaves other columns unchanged
X = np.array(ct.fit_transform(X))

In [ ]:  print(X)
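
The effect of one-hot encoding is easier to see on a tiny example. The sketch below uses a hypothetical three-row feature matrix (the values are assumptions, not the notebook's data) and additionally passes `sparse_threshold=0`, which is not in the cell above, so that the result is always a dense array.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical feature matrix: a categorical column followed by two numeric ones
X = np.array([["France", 44.0, 72000.0],
              ["Spain", 27.0, 48000.0],
              ["Germany", 30.0, 54000.0]], dtype=object)

# sparse_threshold=0 forces a dense result, which keeps this small example easy to print
ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,
)
X_encoded = np.array(ct.fit_transform(X))

# The encoded columns come first, one per category in sorted order
# (France, Germany, Spain), followed by the untouched numeric columns
print(X_encoded.shape)  # (3, 5)
print(X_encoded)
```

The single categorical column becomes three indicator columns, so the matrix grows from three columns to five.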

Encoding the Dependent Variable


In [ ]:  # Import LabelEncoder from scikit-learn for encoding categorical labels into numerical values
from sklearn.preprocessing import LabelEncoder

# Create an instance of LabelEncoder
le = LabelEncoder()

# Fit the LabelEncoder on the target variable y and transform it into numerical values
# This step converts each unique categorical label in y to a corresponding integer
y = le.fit_transform(y)

In [ ]:  print(y)
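
As a quick illustration of the label encoding above, the following sketch encodes a hypothetical list of Yes/No labels (the labels are assumed for the example) and maps the integers back to the original strings.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target labels
y = ["No", "Yes", "No", "Yes", "Yes"]

le = LabelEncoder()
y_encoded = le.fit_transform(y)

# classes_ lists the labels in sorted order; each label's position is its integer code
print(le.classes_)  # ['No' 'Yes']
print(y_encoded)    # [0 1 0 1 1]

# inverse_transform recovers the original labels from the integer codes
print(le.inverse_transform(y_encoded))
```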

Splitting the dataset into the Training set and Test set

Splitting the dataset into a training set and a test set is crucial to evaluate how well the model
generalizes to new, unseen data. The training set is used to train the model, while the test set
assesses its performance and ensures it isn't overfitting to the training data.

In [ ]:  # Import train_test_split from scikit-learn to split the dataset into training and test sets
from sklearn.model_selection import train_test_split

# Split the feature matrix X and target variable y into training and testing sets
# test_size=0.2 specifies that 20% of the data will be used for testing and 80% for training
# random_state=1 ensures reproducibility by setting a seed for random number generation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [ ]:  print(X_train)

In [ ]:  print(X_test)

In [ ]:  print(y_train)

In [ ]:  print(y_test)
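
The split sizes and the role of `random_state` can be checked on synthetic data. This is a minimal sketch with ten made-up samples; the arrays are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten hypothetical samples with two features each, plus a binary target
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)

# Repeating the call with the same seed reproduces exactly the same split
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.2, random_state=1
)
print(np.array_equal(X_test, X_test2))  # True
```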

Feature Scaling
Two scaling methods are easy to confuse. Scaling the data to a fixed range, usually [0, 1],
is Min-Max Scaling, not StandardScaler. StandardScaler standardizes the data
to have a mean of 0 and a standard deviation of 1. Both are explained below:

StandardScaler
StandardScaler standardizes features by removing the mean and scaling to unit variance.
Mathematically,

$$z = \frac{x - \mu}{\sigma}$$

where $\mu$ is the mean and $\sigma$ is the standard deviation of the feature.

This puts all features on a comparable scale, preventing features with larger
scales from dominating the model.
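
The standardization formula can be verified against scikit-learn directly. The sketch below uses a single made-up feature column and compares a manual computation with `StandardScaler`; the input values are assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One hypothetical feature column
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

mu = x.mean(axis=0)
sigma = x.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses
z_manual = (x - mu) / sigma

z_sklearn = StandardScaler().fit_transform(x)

print(np.allclose(z_manual, z_sklearn))  # True
print(z_sklearn.mean(), z_sklearn.std())  # approximately 0 and 1
```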


Min-Max Scaling
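Min-Max Scaling (Normalization) scales the data to a fixed range, usually [0, 1].
Mathematically,

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

where $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of the feature.

Min-Max Scaling is useful for algorithms that require a bounded feature space.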
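
For comparison, Min-Max Scaling can be sketched with scikit-learn's `MinMaxScaler`, which the notebook itself does not use. The input column is a made-up example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One hypothetical feature column
x = np.array([[10.0], [20.0], [40.0]])

# Manual computation: subtract the minimum, divide by the range
x_manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# MinMaxScaler uses the range [0, 1] by default
x_sklearn = MinMaxScaler().fit_transform(x)

print(np.allclose(x_manual, x_sklearn))  # True
print(x_sklearn.min(), x_sklearn.max())  # 0.0 1.0
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.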

Example Code with StandardScaler


In [ ]:  # Import StandardScaler from scikit-learn for feature scaling

from sklearn.preprocessing import StandardScaler

# Create an instance of StandardScaler

sc = StandardScaler()

# Fit the scaler on the training data and transform it

# This step standardizes the features by removing the mean and scaling to unit variance
X_train = sc.fit_transform(X_train)

# Transform the test data using the same scaler fitted on the training data
# This ensures that the test data is scaled in the same way as the training data
X_test = sc.transform(X_test)


In [ ]:  print(X_train)

In [ ]:  print(X_test)
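
The reason for calling `fit_transform` on the training set but only `transform` on the test set can be checked numerically. In this sketch the two small arrays are hypothetical; the point is that the test data is scaled with the mean and standard deviation learned from the training data, so no information from the test set leaks into the scaler.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test features
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # learns mean and std from the training set
X_test_scaled = sc.transform(X_test)        # reuses those training statistics

# The scaler's stored mean is the training mean, not the test mean
print(sc.mean_)  # [2.]

# The test value is scaled with the training mean and standard deviation
expected = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.allclose(X_test_scaled, expected))  # True
```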

