data_preprocessing - Jupyter Notebook http://localhost:8888/notebooks/Desktop/03%20Data%20Preprocessing...
Data Preprocessing Tools
Importing the libraries
In [ ]: # Import the NumPy library for numerical operations on arrays
import numpy as np
# Import the matplotlib library for data visualization
import matplotlib.pyplot as plt
# Import the pandas library for data manipulation and analysis
import pandas as pd
Importing the dataset
The code below reads a CSV file into a pandas DataFrame, then splits the data into feature
variables (all columns except the last) and the target variable (the last column).
In [ ]: # Read the dataset from a CSV file named 'Data.csv' into a pandas DataFrame
dataset = pd.read_csv('Data.csv')
# Select all rows and all columns except the last one as features (input variables)
# The iloc method is used for integer-location based indexing
# [:, :-1] means all rows (:) and all columns except the last one (:-1)
X = dataset.iloc[:, :-1].values
# Select all rows of the last column as the target variable (output variable)
# The iloc method is used for integer-location based indexing
# [:, -1] means all rows (:) and the last column (-1)
y = dataset.iloc[:, -1].values
In [ ]: print(X)
In [ ]: print(y)
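As a minimal sketch of the slicing above, the following uses a small hand-made DataFrame in place of 'Data.csv'. The column names and values are assumptions chosen for illustration, not the contents of the real file. It also shows why the target needs index -1: index 1 would select the second column, not the last one.

```python
import pandas as pd

# Hypothetical stand-in for Data.csv; column names and values are illustrative only
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

X = dataset.iloc[:, :-1].values   # all rows, every column except the last
y = dataset.iloc[:, -1].values    # all rows, only the last column

print(X.shape)                    # (3, 3): three rows, three feature columns
print(list(y))                    # the 'Purchased' labels
print(list(dataset.iloc[:, 1]))   # index 1 is the 'Age' column, not the target
```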
Taking care of missing data
Missing data in machine learning can lead to loss of information, bias, reduced accuracy, and
computational challenges, so it's essential to handle it properly using methods like imputation or
algorithms that support missing values to build reliable models.
1 of 5 7/23/2024, 1:19 PM
In [ ]: # Import SimpleImputer from the scikit-learn library for handling missing data
from sklearn.impute import SimpleImputer
# Create an instance of the SimpleImputer class
# This imputer will replace missing values (np.nan) with the mean value of the corresponding column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Fit the imputer on the specified columns (columns 1 and 2) of the feature matrix X
# This step calculates the mean of these columns, which will be used to replace missing values
imputer.fit(X[:, 1:3])
# Transform the specified columns (columns 1 and 2) of the feature matrix X
# This step replaces any missing values in these columns with the calculated mean
X[:, 1:3] = imputer.transform(X[:, 1:3])
In [ ]: print(X)
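To make the fit and transform steps concrete, here is a small sketch on a toy array. The values are made up for illustration. After fitting, the learned column means are available in the `statistics_` attribute, and each `np.nan` is replaced by the mean of its own column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy numeric columns (illustrative values); np.nan marks the missing entries
cols = np.array([[40.0, 60000.0],
                 [np.nan, 50000.0],
                 [30.0, np.nan],
                 [50.0, 70000.0]])

imputer_demo = SimpleImputer(missing_values=np.nan, strategy='mean')
filled = imputer_demo.fit_transform(cols)  # fit and transform in one call

print(imputer_demo.statistics_)  # per-column means learned during fit: 40 and 60000
print(filled)                    # the nan entries are replaced by those means
```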
Encoding categorical data
Machine learning algorithms perform mathematical operations on their inputs, so they require
numerical data. Encoding converts categorical features into a numerical format that the model
can process and learn from.
Encoding the Independent Variable
In [ ]: # Import ColumnTransformer and OneHotEncoder from scikit-learn for feature transformation
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# Create an instance of ColumnTransformer
# The 'encoder' transformer applies OneHotEncoder to the first column (index 0)
# remainder='passthrough' means all other columns not specified will remain unchanged
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
# Fit the ColumnTransformer to the feature matrix X and transform it
# This step applies one-hot encoding to the specified column and leaves other columns unchanged
X = np.array(ct.fit_transform(X))
In [ ]: print(X)
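The sketch below shows what one-hot encoding produces on a toy feature matrix. The data is illustrative, and `sparse_threshold=0` is an addition made for this example so that the result is always a dense array; the notebook cell above does not set it.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy feature matrix (illustrative): a category in column 0, a number in column 1
X_demo = np.array([["France", 44.0],
                   ["Spain", 27.0],
                   ["Germany", 30.0],
                   ["Spain", 38.0]], dtype=object)

ct_demo = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,  # assumption for this sketch: always return a dense array
)
X_encoded = np.array(ct_demo.fit_transform(X_demo))

# The encoder sorts categories alphabetically: France, Germany, Spain.
# Each row gets three indicator columns followed by the untouched numeric column.
print(X_encoded)
```

The encoded columns come first in the output, followed by the passthrough columns, so the original column order is not preserved.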
Encoding the Dependent Variable
In [ ]: # Import LabelEncoder from scikit-learn for encoding categorical labels into numeric values
from sklearn.preprocessing import LabelEncoder
# Create an instance of LabelEncoder
le = LabelEncoder()
# Fit the LabelEncoder on the target variable y and transform it into numerical values
# This step converts each unique categorical label in y to a corresponding integer
y = le.fit_transform(y)
In [ ]: print(y)
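A short sketch of the label mapping, using made-up labels: `LabelEncoder` sorts the unique labels and assigns integers in that order, and `inverse_transform` maps the integers back.

```python
from sklearn.preprocessing import LabelEncoder

y_demo = ["No", "Yes", "No", "Yes", "Yes"]  # illustrative labels

le_demo = LabelEncoder()
y_encoded = le_demo.fit_transform(y_demo)

print(le_demo.classes_)                       # sorted unique labels: 'No', 'Yes'
print(y_encoded)                              # 'No' becomes 0, 'Yes' becomes 1
print(le_demo.inverse_transform(y_encoded))   # recovers the original labels
```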
Splitting the dataset into the Training set and Test set
Splitting the dataset into a training set and a test set is crucial to evaluate how well the model
generalizes to new, unseen data. The training set is used to train the model, while the test set
assesses its performance and ensures it isn't overfitting to the training data.
In [ ]: # Import train_test_split from scikit-learn to split the dataset into training and test sets
from sklearn.model_selection import train_test_split
# Split the feature matrix X and target variable y into training and testing sets
# test_size=0.2 specifies that 20% of the data will be used for testing and 80% for training
# random_state=1 ensures reproducibility by setting a seed for random number generation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
In [ ]: print(X_train)
In [ ]: print(X_test)
In [ ]: print(y_train)
In [ ]: print(y_test)
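To illustrate the effect of `test_size` and `random_state`, the sketch below splits ten made-up samples. With `test_size=0.2`, two samples go to the test set and eight to the training set, and repeating the call with the same seed gives the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten illustrative samples with two features each, plus alternating labels
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)

# The same seed reproduces the same split
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
print(np.array_equal(X_te, X_te2))  # True
```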
Feature Scaling
Feature scaling puts the features on comparable scales. Two common techniques are
standardization (StandardScaler), which rescales each feature to a mean of 0 and a standard
deviation of 1, and Min-Max Scaling, which maps each feature to a fixed range, usually [0, 1].
StandardScaler
StandardScaler standardizes features by removing the mean and scaling to unit variance.
Mathematically,
$$z = \frac{x - \mu}{\sigma}$$
where $\mu$ is the mean of the feature and $\sigma$ is its standard deviation.
This puts all features on a comparable scale, preventing features with larger
scales from dominating the model.
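Standardization subtracts each column's mean and divides by its standard deviation. The sketch below does this by hand in NumPy on made-up values. It uses the population standard deviation (`ddof=0`, the NumPy default), which is also what StandardScaler uses.

```python
import numpy as np

# Illustrative feature matrix: two columns on very different scales
X_demo = np.array([[25.0, 40000.0],
                   [35.0, 60000.0],
                   [45.0, 80000.0]])

mu = X_demo.mean(axis=0)     # per-column mean
sigma = X_demo.std(axis=0)   # per-column population standard deviation (ddof=0)
Z = (X_demo - mu) / sigma    # standardized features

print(Z.mean(axis=0))  # approximately 0 for each column
print(Z.std(axis=0))   # approximately 1 for each column
```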
Min-Max Scaling
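Min-Max Scaling (Normalization) scales the data to a fixed range, usually [0, 1]. Mathematically,
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
where $x_{\min}$ and $x_{\max}$ are the minimum and maximum of the feature.
Min-Max Scaling is useful for algorithms that require a bounded feature space.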
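Min-max scaling subtracts each column's minimum and divides by the column's range. As a rough sketch, it can be written by hand in NumPy; the values are illustrative. scikit-learn offers the same transformation as `MinMaxScaler` in `sklearn.preprocessing`.

```python
import numpy as np

# Illustrative feature matrix: two columns on very different scales
X_demo = np.array([[25.0, 40000.0],
                   [35.0, 60000.0],
                   [45.0, 80000.0]])

x_min = X_demo.min(axis=0)   # per-column minimum
x_max = X_demo.max(axis=0)   # per-column maximum
X_scaled = (X_demo - x_min) / (x_max - x_min)

# Each column now runs from 0 (its minimum) to 1 (its maximum);
# the middle row maps to 0.5 in both columns
print(X_scaled)
```

This hand-written version divides by zero if a column is constant, so it is only meant to show the arithmetic.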
Example Code with StandardScaler
In [ ]: # Import StandardScaler from scikit-learn for feature scaling
from sklearn.preprocessing import StandardScaler
# Create an instance of StandardScaler
sc = StandardScaler()
# Fit the scaler on the training data and transform it
# This step standardizes the features by removing the mean and scaling to unit variance
X_train = sc.fit_transform(X_train)
# Transform the test data using the same scaler fitted on the training data
# This ensures that the test data is scaled in the same way as the training data
X_test = sc.transform(X_test)
In [ ]: print(X_train)
In [ ]: print(X_test)
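The reason for calling `fit_transform` on the training set but only `transform` on the test set can be seen in a small sketch with made-up values: the scaler learns its mean and scale from the training rows alone and reuses them for the test rows, so no information from the test set leaks into preprocessing.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative single-feature data; the split is hand-made for the example
X_train_demo = np.array([[10.0], [20.0], [30.0]])
X_test_demo = np.array([[20.0], [40.0]])

sc_demo = StandardScaler()
X_train_scaled = sc_demo.fit_transform(X_train_demo)  # learns mean and scale from training rows
X_test_scaled = sc_demo.transform(X_test_demo)        # reuses the training mean and scale

print(sc_demo.mean_)          # training mean: 20
print(X_test_scaled.ravel())  # the first test value is 0 because 20 equals the training mean
```

The scaled test set does not generally have mean 0 and standard deviation 1, which is expected: it is measured against the training distribution.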