
UNIT 2

DATA PREPARATION
Categories of Data

• Labelled Data
• Unlabelled Data

Data in Machine Learning

• Training Data
• Validation Data
• Testing Data

• Training Data – The initial dataset used to train the machine learning model.

Example – In a house price prediction project, the model learns the relationship between the input features and the target label from the training data.

• Validation Data – The dataset used to fine-tune the model and evaluate its performance during training.
A subset of houses with similar features but different prices is designated as validation data.

• Testing Data – The dataset used to assess the model's performance on unseen data.

Following training and validation, a fresh dataset, completely new to the model, is given to predict the selling prices of the houses.
Data Preparation in Machine Learning

Data preparation is defined as gathering, combining, cleaning and transforming raw data so that a machine learning model can make accurate predictions.

Benefits of Data Preparation

• Identifies data issues or errors
• Ensures reliable predictions
• Removes duplicate content
• Enhances model performance
Data Preparation issues in Machine Learning

• Missing Data
• Anomalies
• Unstructured Data Format
• Limited Features
• Understanding Feature Engineering
Steps in Data Preparation Process

1. Data Collection – Gather raw data. Common data sources: UC Irvine, Kaggle, Amazon's AWS, Wikipedia, Quora.com
2. Data Cleaning – Handling missing values and filtering outliers
3. Data Transformation – Normalisation/standardisation, encoding categorical variables
4. Feature Engineering – Feature transformation, feature creation
5. Data Reduction – Dimensionality reduction, feature selection
6. Data Splitting – Training data, validation data, test data
Load Data and Explore Data

• Load Data – Import the necessary libraries and load the data into a DataFrame (usually done using pandas).
• Exploring Data – Understand its characteristics: features, rows, possible missing values, and the type of data.
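
A minimal sketch of these two steps, assuming the data lives in a hypothetical file named data.csv:

import pandas as pd

# load the dataset into a DataFrame
df = pd.read_csv("data.csv")

# explore its characteristics
print(df.head())          # first few rows
df.info()                 # column types and non-null counts
print(df.isnull().sum())  # missing values per column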
The OS module in Python provides functions for interacting with the operating
system. OS comes under Python’s standard utility modules. This module provides a
portable way of using operating system-dependent functionality.
The os and os.path modules include many functions to interact with the file system.
Python-OS-Module Functions
Here we will discuss some important functions of the Python os module:
• Handling the Current Working Directory
• Creating a Directory
• Listing out Files and Directories with Python
• Deleting Directory or Files using Python
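
A short sketch of these functions (demo_dir is a hypothetical directory name):

import os

# current working directory
print(os.getcwd())

# create a directory
os.mkdir("demo_dir")

# list files and directories in the current directory
print(os.listdir("."))

# delete the (empty) directory
os.rmdir("demo_dir")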


Using describe() for Descriptive Statistics
The describe() method is a powerful tool to generate descriptive statistics of a
DataFrame. It provides a comprehensive summary including count, mean, standard
deviation, minimum, 25th percentile, median (50th percentile), 75th percentile and
maximum.
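
A quick illustration on a tiny DataFrame with illustrative values:

import pandas as pd

df = pd.DataFrame({"score": [10, 20, 30, 40, 50]})

# count, mean, std, min, quartiles and max for each numeric column
print(df.describe())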
Data Visualisation Techniques

Scatter Plots

• Displays the relationship between two variables by plotting data points.
• Useful in data preparation for exploring associations between variables and detecting outliers.

import pandas as pd
import matplotlib.pyplot as plt

# reading the database
data = pd.read_csv("tips.csv")

# Scatter plot with day against tip
plt.scatter(data['day'], data['tip'])

# Adding Title to the Plot
plt.title("Scatter Plot")

# Setting the X and Y labels
plt.xlabel('Day')
plt.ylabel('Tip')
plt.show()
Data Visualisation Techniques

Bar Plots

A bar plot uses rectangular bars to represent data categories, with bar length or height proportional to their values. It compares discrete categories, with one axis for categories and the other for values.

import pandas as pd
import matplotlib.pyplot as plt

# reading the database
data = pd.read_csv("tips.csv")

# Bar chart with day against tip
plt.bar(data['day'], data['tip'])

plt.title("Bar Chart")

# Setting the X and Y labels
plt.xlabel('Day')
plt.ylabel('Tip')
plt.show()
Data Visualisation Techniques

Histogram

A histogram is a graph showing frequency distributions.

import pandas as pd
import matplotlib.pyplot as plt

# reading the database
data = pd.read_csv("tips.csv")

# histogram of total_bill
plt.hist(data['total_bill'])

plt.title("Histogram")
plt.show()
Data Visualisation Techniques

Line Chart

• Depicts data trends over time by connecting data points with lines, making them ideal for tracking changes and patterns.

import matplotlib.pyplot as plt

x = [10, 20, 30, 40]
y = [20, 25, 35, 55]

plt.plot(x, y)

plt.title("Line Chart")
plt.ylabel('Y-Axis')
plt.xlabel('X-Axis')
plt.show()
Data Visualisation Techniques

Box Plot

Box plots provide a visual summary of the data, from which we can quickly identify the median of the data, how dispersed the data is, and potential outliers.
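
A minimal sketch, reusing the tips.csv dataset assumed in the earlier examples:

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("tips.csv")

# box plot of the total_bill column
plt.boxplot(data['total_bill'])
plt.title("Box Plot")
plt.ylabel('Total Bill')
plt.show()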
Data Visualisation Techniques

Heatmap

Uses colour gradients to represent data values in a matrix format, making it easier to identify relationships or correlations in large datasets.

Creating a heatmap of feature correlations in a customer survey can highlight strong relationships between satisfaction scores and purchase behaviour.
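
A sketch using the seaborn library (an assumed choice; the survey columns are hypothetical):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# hypothetical numeric survey data
df = pd.DataFrame({
    "satisfaction": [4, 5, 3, 4, 2],
    "purchases":    [10, 12, 6, 9, 3],
    "visits":       [7, 9, 4, 6, 2],
})

# heatmap of feature correlations
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
plt.title("Feature Correlation Heatmap")
plt.show()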
Preparing the data for Machine Learning Algorithm

Data Cleaning

• Handling Missing Values


• Handling Outliers

Methods to handle missing values

• Deletion
• Mean/Median/Mode Imputation (Pg 2.32)
• Forward Fill/Backward Fill
• K-nearest neighbour imputation
• Prediction model
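
A minimal sketch of deletion, mean imputation, and forward/backward fill with pandas (the math column is illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({"math": [80, np.nan, 90, np.nan, 70]})

dropped = df.dropna()                       # deletion
mean_filled = df.fillna(df["math"].mean())  # mean imputation
ffilled = df.ffill()                        # forward fill
bfilled = df.bfill()                        # backward fill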
KNNImputer is a scikit-learn class used to fill in or predict the missing values in a dataset. It is a more useful method that works on the basic approach of the KNN algorithm rather than the naive approach of filling all the values with the mean or the median. In this approach, we specify the number of neighbours to consider, known as the K parameter. The missing value is then imputed using the mean of those neighbours.

For example, consider a dataset with a missing value in a column representing a student’s math
score. Instead of simply filling this missing value with the overall mean or median of the math
scores, KNNImputer finds the k-nearest students (based on other features like scores in physics,
chemistry, etc.) and imputes the missing value using the mean or median of these neighbors’ math
scores.
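
A minimal sketch of that example with scikit-learn's KNNImputer (scores are illustrative):

import numpy as np
from sklearn.impute import KNNImputer

# rows: students; columns: physics, chemistry, math (one math score missing)
X = [[70, 75, 72],
     [80, 85, np.nan],
     [78, 82, 80],
     [50, 55, 52]]

imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
# the missing math score is replaced by the mean math score
# of the 2 students most similar on the other features
print(X_imputed)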
Handling Outliers

• An outlier is a data point that significantly deviates from the rest of the data. It can be either much higher or much lower than the other data points.
Identification of Outliers

Statistical Methods:

• Z-Score: This method measures how many standard deviations a data point lies from the mean, and identifies outliers as those whose Z-score exceeds a chosen threshold.

Z-Score = (Data point - Mean) / Standard Deviation

• IQR (Interquartile Range):

IQR = Q3 - Q1

Outlier Detection:
Data points below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are often considered outliers.

Distance-Based Methods:

• K-Nearest Neighbors (KNN): KNN identifies outliers as data points whose K nearest neighbors are far away from them.
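
A short sketch of both statistical methods (the data and the Z-score threshold of 2 are illustrative):

import numpy as np
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])  # 95 is an obvious outlier

# Z-score method
z = (s - s.mean()) / s.std()
print(s[np.abs(z) > 2])  # a threshold of 2 suits this small sample

# IQR method
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
print(s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)])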
Methods to handle Outliers

1. Removal:

This involves identifying and removing outliers from the dataset before training the model. Common methods include:

• Thresholding: Outliers are identified as data points exceeding a certain threshold (e.g., Z-score > 3).
• Distance-based methods: Outliers are identified based on their distance from their nearest neighbors.
• Clustering: Outliers are identified as points not belonging to any cluster or belonging to very small clusters.

2. Transformation:

This involves transforming the data to reduce the influence of outliers.

Transformations make the data more normally distributed and lessen the influence of extreme values.

3. Winsorizing

Replacing outlier values with the nearest non-outlier values.

4. Binning

Grouping outliers into a separate category.


Methods to handle Outliers

Z-score (Pg 2.34)
A Z-score is also called a standard score.
This value helps to understand how far a data point is from the mean.

After setting up a threshold value, one can use the Z-scores of data points to define the outliers.

Z-score = (data_point - mean) / std. deviation
Data Transformation

• Normalisation
A data transformation that scales the values of numerical features to a standard range between 0 and 1.

# apply normalization techniques
for column in df:
    df[column] = (df[column] - df[column].min()) / (df[column].max() - df[column].min())

# view normalized data
print(df)
Data Transformation

• Standardisation
Data standardization, a crucial data transformation technique, converts data into a uniform format with a mean of 0 and a standard deviation of 1, ensuring consistency and facilitating easier analysis and machine learning model training.

Z = (x - mean) / standard deviation

Scikit-Learn provides a transformer called StandardScaler for standardization.

• Log Transformation
Log transformation is a common technique used to reduce skewness in a distribution and make it more symmetric.

The log transformation can be applied using the NumPy library, which provides a natural logarithm function, np.log(), that can be applied to an array of data.
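
A minimal sketch of both techniques (the income column is illustrative):

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"income": [20000, 35000, 50000, 120000, 400000]})

# standardisation: mean 0, standard deviation 1
scaler = StandardScaler()
df["income_std"] = scaler.fit_transform(df[["income"]]).ravel()

# log transformation to reduce the right skew of income
df["income_log"] = np.log(df["income"])
print(df)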
Data Transformation

• Encoding Categorical Variables – One-Hot Encoding, Label Encoding

Categorical data is a common occurrence in many datasets, especially in fields like marketing, finance, and social sciences. Unlike numerical data, categorical data represents discrete values or categories, such as gender, country, or product type. Machine learning algorithms, however, require numerical input, making it essential to convert categorical data into a numerical format. This process is known as encoding.

1. Label Encoding

Label Encoding is a simple and straightforward method that assigns a unique integer to each category. This method is suitable for ordinal data where the order of categories is meaningful.

from sklearn.preprocessing import LabelEncoder

data = ['red', 'blue', 'green', 'blue', 'red']
label_encoder = LabelEncoder()
encoded_data = label_encoder.fit_transform(data)
print(encoded_data)
# Output: [2 0 1 0 2]
2. One-Hot Encoding

One-Hot Encoding converts categorical data into a binary matrix, where each category is represented by a
binary vector. This method is suitable for nominal data.
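
A minimal sketch using pandas' get_dummies on the same colour list as above:

import pandas as pd

data = ['red', 'blue', 'green', 'blue', 'red']

# each category becomes its own binary column
one_hot = pd.get_dummies(pd.Series(data))
print(one_hot)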
Data Reduction

• Improved Efficiency:
Large datasets can be computationally expensive to process, making data reduction crucial for speeding up training and prediction times.

• Reduced Storage Costs:
Smaller datasets require less storage space, leading to cost savings, especially when dealing with massive datasets.

• Enhanced Model Performance:
By removing irrelevant or redundant features, data reduction can help machine learning models generalize better and avoid overfitting.

• Simplified Analysis:
Reduced datasets are easier to visualize and analyze, making it easier to understand patterns and insights.
Common Methods of Data Reduction

• Feature Selection – Chooses a subset of relevant features from the original dataset while discarding irrelevant features.

• Feature Extraction – Transforms the original features into a lower-dimensional space to capture essential information.

• Instance Selection – Chooses a subset of representative instances from the dataset while maintaining data characteristics.

• Dimensionality Reduction – Reduces the number of input variables in the dataset, improving model efficiency. Techniques like PCA and LDA are used.
Example to demonstrate Data Reduction
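
A minimal sketch using scikit-learn's PCA on illustrative data, reducing four features to two components:

import numpy as np
from sklearn.decomposition import PCA

# 5 samples, 4 features (illustrative data)
X = np.array([[2.5, 2.4, 0.5, 0.7],
              [0.5, 0.7, 2.2, 2.9],
              [2.2, 2.9, 1.9, 2.2],
              [1.9, 2.2, 3.1, 3.0],
              [3.1, 3.0, 2.3, 2.7]])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                # (5, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component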
Feature Engineering

Creates new features from existing data, transforms data into new useful formats, or enhances the quality of the data.

Components of Feature Engineering

• Feature Creation – Creating new data from existing data.

• Feature Transformation – Common transformations include normalisation, scaling, and applying mathematical functions like logarithms or exponentials.
Example to demonstrate Feature Creation and Feature Transformation
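
A minimal sketch of both components, assuming a hypothetical housing DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price":     [250000, 400000, 320000],
    "area_sqft": [1000, 1600, 1200],
})

# Feature Creation: derive a new feature from existing columns
df["price_per_sqft"] = df["price"] / df["area_sqft"]

# Feature Transformation: apply a logarithm to compress the price scale
df["log_price"] = np.log(df["price"])
print(df)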
Data Splitting

• Training Data – Data used for training. Usually 80% of the data is used for training.
• Validation Data – This subset is used to make decisions about which models to use.
• Test Data – Used to evaluate the final model's performance after it is trained and validated.

Methods of Data Splitting

• Random Splitting
• Stratified Splitting
• Time based Splitting
Why Data Splitting is important

Avoiding Overfitting – When extensive data is used for training, there is a possibility of noise getting included in the data. To avoid its negative impact, the data is split into training and validation data.

Model Validation – The process of identifying the best model.

Assessing Model Performance – Critical for understanding how the model works.

Improving Model Robustness – Reliability is enhanced.


Create Test Set

Crucial for evaluating the model's performance accurately and ensuring its ability to generalise to new, unseen data.

Steps to create a test set

• Allocation – Allocate a portion of the dataset for testing.
• Purpose – A critical component in the model development process.
• Independence – The test set should be independent of the training data; this ensures reliability.
• Evaluation – Testing the model on a separate test set allows for an evaluation of its predictive capabilities.
• Validation – Provides a benchmark for assessing performance and identifying issues like overfitting and underfitting.
• Consistency – Provides reliability in model evaluation.
Techniques for test set creation

• Random Sampling – In this method of sampling, the data is collected at random. Every item of the universe has an equal chance of getting selected for the investigation purpose. In other words, each item has an equal probability of being in the sample, which makes the method impartial. Used for large datasets.

• Stratified Sampling – Stratified or mixed sampling is a method in which the population is divided into different groups, also known as strata, with different characteristics, and from those strata some of the items are selected to represent the population. The investigator, while forming strata, has to ensure that each stratum is represented in the correct proportion.

For example, there are 60 students in Class 10th. Out of these 60 students, 10 opted for Arts and Humanities, 30 opted for Commerce, and 20 opted for Science in Class 11th. This means the population of 60 students is divided into three strata, viz. Arts and Humanities, Commerce, and Science, containing 10, 30, and 20 students, respectively. Now, for investigation purposes, items are selected proportionately from each of the strata so that the sample represents the entire population.
Creating Test Set in Python
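
A minimal sketch using scikit-learn's train_test_split (X and y are illustrative); the stratify argument implements stratified sampling:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# random split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# stratified split: preserves the class proportions of y
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)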
Structured Approach in implementing Machine Learning

• Define objective
• Establish data streamline
• Evaluate existing solutions
• Problem Formulation
