ML_Unit_2
DATA PREPARATION
Categories of Data
• Validation Data – Dataset utilised to fine-tune the model and evaluate performance during
training. A subset of data with similar features but different prices is designated as validation data.
Data preparation is defined as gathering, combining, cleaning, and transforming raw data to make accurate
predictions in machine learning.
Common issues in raw data:
• Missing Data
• Anomalies
• Unstructured Data Format
• Limited Features
• Understanding Feature Engineering
Steps in Data Preparation Process
1. Data Collection – gather raw data.
Data sources: UC Irvine, Kaggle, Amazon’s AWS, Wikipedia, Quora.com
2. Data Cleaning – handling missing values and filtering outliers
3. Data Transformation – normalisation/standardisation, encoding categorical variables
4. Data Reduction – dimensionality reduction, feature selection
5. Feature Engineering – feature transformation, feature creation
6. Data Splitting – training data, validation data, test data
Load Data and Explore Data
• Load Data – Import the necessary libraries and load the data into a dataframe (usually done using pandas).
• Exploring Data – Understand its characteristics: features, rows, possible missing values, and the type of data.
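A minimal sketch of this step, assuming the data lives in a CSV file named housing.csv (the filename is a placeholder):

import pandas as pd

# Load the raw data into a DataFrame (the filename is a placeholder)
data = pd.read_csv("housing.csv")

# Explore its characteristics: features, rows, types, missing values
print(data.head())          # first five rows
print(data.shape)           # (number of rows, number of columns)
data.info()                 # column names, dtypes, non-null counts
print(data.describe())      # summary statistics for numerical features
print(data.isnull().sum())  # missing values per column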
The OS module in Python provides functions for interacting with the operating
system. OS comes under Python’s standard utility modules. This module provides a
portable way of using operating system-dependent functionality.
The *os* and *os.path* modules include many functions to interact with the file
system.
Python-OS-Module Functions
Here we will discuss some important functions of the Python os module :
•Handling the Current Working Directory
•Creating a Directory
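A short sketch of these two functions (the datasets directory and file name are just examples):

import os

# Handling the current working directory
print("Current working directory:", os.getcwd())

# Creating a directory; exist_ok=True avoids an error if it already exists
os.makedirs("datasets", exist_ok=True)

# os.path helpers build portable, OS-independent paths
csv_path = os.path.join("datasets", "housing.csv")
print("File exists:", os.path.exists(csv_path))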
Data Visualisation Techniques
Scatter Plot
plt.scatter(x, y)
plt.title("Scatter Plot")
Bar Chart
plt.bar(x, y)
plt.title("Bar Chart")
Histogram
# histogram of total_bill
plt.hist(data['total_bill'])
plt.title("Histogram")
Line Chart
plt.plot(x, y)
plt.title("Line Chart")
plt.ylabel('Y-Axis')
plt.xlabel('X-Axis')
plt.show()
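The snippets above assume matplotlib has been imported and that a DataFrame data plus arrays x and y already exist; a minimal, self-contained setup (the numbers are made up):

import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the tips-style dataset used in the snippets above
data = pd.DataFrame({"total_bill": [16.99, 10.34, 21.01, 23.68, 24.59,
                                    25.29, 8.77, 26.88, 15.04, 14.78]})
x = list(range(len(data)))
y = data["total_bill"]

plt.hist(data["total_bill"])  # histogram of total_bill
plt.title("Histogram")
plt.show()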
Box Plot
Box plots provide a visual summary of the data, from which we can quickly identify the
median, the quartiles, how dispersed the data is, and any outliers.
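A minimal sketch with made-up scores; the single high value appears as a point beyond the whisker:

import matplotlib.pyplot as plt

scores = [55, 60, 62, 65, 67, 70, 72, 75, 78, 95]  # illustrative data
plt.boxplot(scores)  # box shows median and quartiles; whiskers show spread
plt.title("Box Plot")
plt.show()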
Heatmap
Heatmaps use colour gradients to represent data values in a matrix format, making it easier to identify relationships or
correlations in large datasets.
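A minimal sketch using seaborn (assumed to be available) to draw a correlation heatmap over made-up scores:

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Correlation matrix of numerical features, shown as a colour gradient
df = pd.DataFrame({"math": [70, 80, 90, 65],
                   "physics": [68, 82, 88, 60],
                   "chemistry": [75, 78, 85, 70]})
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
plt.title("Heatmap")
plt.show()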
Data Cleaning
Missing values in a column can be filled with the column’s overall mean, median, or mode, or more accurately with scikit-learn’s KNNImputer, which borrows values from similar rows.
For example, consider a dataset with a missing value in a column representing a student’s math
score. Instead of simply filling this missing value with the overall mean or median of the math
scores, KNNImputer finds the k-nearest students (based on other features like scores in physics,
chemistry, etc.) and imputes the missing value using the mean or median of these neighbors’ math
scores.
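A runnable sketch of that example with made-up scores; KNNImputer fills the missing math score from the 2 nearest students:

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Student scores with a missing math score (np.nan)
scores = pd.DataFrame({"math":      [90, np.nan, 85, 40, 45],
                       "physics":   [92, 88, 84, 42, 48],
                       "chemistry": [89, 91, 86, 44, 41]})

# Impute using the (unweighted) mean of the 2 nearest students
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(scores), columns=scores.columns)
print(filled)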
Handling Outliers
• An outlier is a data point that significantly deviates from the rest of the data. It can be either much higher or much lower than the other data points.
Identification of Outliers
Statistical Methods:
• Z-Score: calculates how many standard deviations each data point lies from the mean and flags points whose Z-scores exceed a chosen threshold (commonly |Z| > 3).
• IQR (Interquartile Range):
IQR = Q3 - Q1
Outlier Detection:
Data points below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are often considered outliers.
Distance-Based Methods:
• K-Nearest Neighbors (KNN): KNN identifies outliers as data points whose K nearest neighbors are far away
from them.
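A sketch of both statistical methods on a made-up array (a threshold of 2 rather than 3 is used only because the sample is tiny):

import numpy as np

data = np.array([10, 12, 11, 13, 12, 14, 11, 98])  # 98 is an outlier

# Z-score method: flag points far from the mean in standard-deviation units
z = (data - data.mean()) / data.std()
print("Z-score outliers:", data[np.abs(z) > 2])

# IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
mask = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)
print("IQR outliers:", data[mask])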
Methods to handle Outliers
1. Removal:
This involves identifying and removing outliers from the dataset before training the model. Common methods include:
• Thresholding: Outliers are identified as data points exceeding a certain threshold (e.g., Z-score > 3).
• Distance-based methods: Outliers are identified based on their distance from their nearest neighbors.
• Clustering: Outliers are identified as points not belonging to any cluster or belonging to very small clusters.
2. Transformation:
Transformations make the data more normally distributed and lessen the influence of extreme values.
3. Winsorizing:
Extreme values are capped at chosen percentile limits rather than removed.
4. Binning:
Values are grouped into discrete intervals (bins), which dampens the effect of extreme values.
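A sketch of winsorizing via percentile capping, together with a log transformation, on made-up data:

import numpy as np

data = np.array([10, 12, 11, 13, 12, 14, 11, 98])

# Winsorizing: cap values at the 5th and 95th percentiles instead of removing them
low, high = np.percentile(data, [5, 95])
print(np.clip(data, low, high))

# Log transformation: log1p compresses extreme values and reduces skew
print(np.log1p(data))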
Z-Score (Pg 2.34)
The Z-score is also called a standard score.
This value helps to understand how far a data point is from the mean.
After setting a threshold value, the Z-scores of data points can be used to define outliers.
Z-score = (data_point - mean) / std. deviation
Data Transformation
• Normalisation
A data transformation that scales the values of numerical features to a standard range, typically between 0 and 1.
• Standardisation
Data standardisation, a crucial data transformation technique, converts data into a uniform format with a
mean of 0 and a standard deviation of 1, which ensures consistency and makes analysis and machine learning
model training easier.
Z = (x - mean) / standard deviation
Scikit-Learn provides a transformer called StandardScaler for standardisation.
• Log Transformation
Log transformation is a common technique used to reduce skewness in a distribution and make it more
symmetric.
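A sketch of all three transformations using scikit-learn and numpy on a made-up income column:

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({"income": [20000, 35000, 50000, 120000, 300000]})

# Normalisation: scale to the range [0, 1]
df["income_norm"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()

# Standardisation: mean 0, standard deviation 1  (Z = (x - mean) / std)
df["income_std"] = StandardScaler().fit_transform(df[["income"]]).ravel()

# Log transformation: reduce right skew
df["income_log"] = np.log1p(df["income"])
print(df)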
1. Label Encoding
Label Encoding is a simple and straightforward method that assigns a unique integer to each category. This method is
suitable for ordinal data where the order of categories is meaningful.
2. One-Hot Encoding
One-Hot Encoding converts categorical data into a binary matrix, where each category is represented by a
binary vector. This method is suitable for nominal data, where categories have no inherent order.
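A sketch of both encodings; the column names and values are illustrative:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"size": ["small", "medium", "large", "medium"],   # ordinal
                   "city": ["Delhi", "Mumbai", "Delhi", "Chennai"]})  # nominal

# Label Encoding: one integer per category.  Note that LabelEncoder assigns
# integers alphabetically; map an explicit order for truly ordinal features.
df["size_encoded"] = LabelEncoder().fit_transform(df["size"])

# One-Hot Encoding: one binary column per category
df = pd.get_dummies(df, columns=["city"])
print(df)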
Data Reduction
•Improved Efficiency:
Large datasets can be computationally expensive to process, making data reduction
crucial for speeding up training and prediction times.
•Simplified Analysis:
Reduced datasets are easier to visualize and analyze, making it easier to understand
patterns and insights.
Common Methods of Data Reduction
• Feature Selection – Chooses a subset of relevant features from the original dataset while
discarding irrelevant features.
• Feature Extraction – Transforms original features into lower dimensionality space to capture
essential information.
• Dimensionality Reduction – Reduces the number of input variables in the dataset, improving
model efficiency. Techniques like PCA and LDA are used.
Example to demonstrate Data Reduction
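A minimal PCA sketch on scikit-learn's built-in Iris data (the dataset choice here is an assumption):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)
print("Original shape:", X.shape)  # (150, 4)

# Project the 4 features onto 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print("Reduced shape:", X_reduced.shape)  # (150, 2)
print("Variance retained:", pca.explained_variance_ratio_.sum())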
Feature Engineering
Create new features from existing data, transform data into new, useful formats, or enhance the quality of
the data.
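A small feature-creation sketch with hypothetical housing-style columns:

import pandas as pd

df = pd.DataFrame({"total_rooms": [880, 7099, 1467],
                   "households":  [126, 1138, 190]})

# Feature creation: a ratio feature often carries more signal than raw counts
df["rooms_per_household"] = df["total_rooms"] / df["households"]
print(df)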
Data Splitting
• Training Data – Data used for training the model; usually 80% of the data is used for training.
• Validation Data – This subset is used to make decisions about which models to use.
• Test Data – Used to evaluate the final model’s performance after it is trained and
validated.
Methods of splitting:
• Random Splitting
• Stratified Splitting
• Time-based Splitting
Why Data Splitting is important
Avoiding Overfitting – When a model is trained on extensive data, noise
may be learned along with the genuine patterns. To avoid this negative
impact, the data is split into training and validation data.
Data splitting is crucial for evaluating the model's performance accurately and ensuring its
ability to generalise to new, unseen data.
• Random Sampling – In this method, the data is collected at random, meaning every
item of the universe has an equal chance of being selected for the investigation. In other
words, each item has an equal probability of being in the sample, which makes the method
impartial. Used for large datasets.
• Stratified Sampling – Stratified or mixed sampling is a method in which the population is divided
into different groups with different characteristics, known as strata, and from those strata
some items are selected to represent the population. While forming strata, the investigator
has to ensure that each stratum is represented in the correct proportion.
For example, there are 60 students in Class 10. Of these 60 students, 10 opted for Arts and
Humanities, 30 for Commerce, and 20 for Science in Class 11. The population of 60 students is thus
divided into three strata, viz. Arts and Humanities, Commerce, and Science, containing
10, 30, and 20 students, respectively. For the investigation, items are then selected
proportionately from each stratum so that the sample represents the entire population.
Creating Test Set in Python
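A sketch using scikit-learn's train_test_split on a made-up dataset; the stratify argument mirrors the Class 10 example above:

import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative data: 20 students across Arts, Commerce, and Science streams
data = pd.DataFrame({"score": range(60, 80),
                     "stream": ["Arts"] * 4 + ["Commerce"] * 10 + ["Science"] * 6})

# Random splitting: 80% training, 20% test
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)

# Stratified splitting: each stream keeps its proportion in both subsets
strat_train, strat_test = train_test_split(
    data, test_size=0.2, stratify=data["stream"], random_state=42)
print(strat_test["stream"].value_counts())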
Preparing the Data for the Machine Learning Algorithm
Data Cleaning
• Handling Missing Values
• Handling Outliers
• Define the objective
• Establish the data pipeline
• Evaluate existing solutions
• Problem Formulation