ML_Unit_2
DATA PREPARATION
Categories of Data
• Validation Data – Dataset utilised to fine-tune the model and evaluate performance during
training. A subset of data with similar features but different prices is designated as validation data.
Data preparation is defined as gathering, combining, cleaning, and transforming raw data to make accurate
predictions in machine learning.
Common issues in raw data:
• Missing Data
• Anomalies
• Unstructured Data Format
• Limited Features
• Understanding Feature Engineering
Steps in Data Preparation Process
1. Data Collection – gather raw data.
Data sources: UC Irvine, Kaggle, Amazon’s AWS, Wikipedia, Quora.com
2. Data Cleaning – handling missing values and filtering outliers
3. Data Transformation – normalisation/standardisation, encoding categorical variables
4. Data Reduction – dimensionality reduction, feature selection
5. Feature Engineering – feature transformation, feature creation
6. Data Splitting – training data, validation data, test data
Load Data and Explore Data
• Load Data – Import the necessary libraries and load the data into a dataframe (usually done using pandas).
• Exploring Data – Understand its characteristics: features, rows, possible missing values, and the type of data.
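A minimal sketch of this step, assuming the data lives in a CSV file named housing.csv (the filename is a placeholder):

import pandas as pd

# Load the raw data into a DataFrame (the filename is a placeholder)
data = pd.read_csv("housing.csv")

# Explore its characteristics: features, rows, types, missing values
print(data.head())          # first five rows
print(data.shape)           # (number of rows, number of columns)
data.info()                 # column names, dtypes, non-null counts
print(data.describe())      # summary statistics for numerical features
print(data.isnull().sum())  # missing values per column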
The OS module in Python provides functions for interacting with the operating
system. OS comes under Python’s standard utility modules. This module provides a
portable way of using operating system-dependent functionality.
The *os* and *os.path* modules include many functions to interact with the file
system.
Python-OS-Module Functions
Here we will discuss some important functions of the Python os module :
•Handling the Current Working Directory
•Creating a Directory
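A short sketch of these two functions (the datasets directory and file name are just examples):

import os

# Handling the current working directory
print("Current working directory:", os.getcwd())

# Creating a directory; exist_ok=True avoids an error if it already exists
os.makedirs("datasets", exist_ok=True)

# os.path helpers build portable, OS-independent paths
csv_path = os.path.join("datasets", "housing.csv")
print("File exists:", os.path.exists(csv_path))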
Data Visualisation Techniques
Scatter Plot
plt.scatter(x, y)
plt.title("Scatter Plot")
Bar Chart
plt.bar(x, y)
plt.title("Bar Chart")
Histogram
# histogram of total_bill
plt.hist(data['total_bill'])
plt.title("Histogram")
Line Chart
plt.plot(x, y)
plt.title("Line Chart")
plt.ylabel('Y-Axis')
plt.xlabel('X-Axis')
plt.show()
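The snippets above assume matplotlib has been imported and that a DataFrame data plus arrays x and y already exist; a minimal, self-contained setup (the numbers are made up):

import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the tips-style dataset used in the snippets above
data = pd.DataFrame({"total_bill": [16.99, 10.34, 21.01, 23.68, 24.59,
                                    25.29, 8.77, 26.88, 15.04, 14.78]})
x = list(range(len(data)))
y = data["total_bill"]

plt.hist(data["total_bill"])  # histogram of total_bill
plt.title("Histogram")
plt.show()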
Box Plot
Box plots provide a visual summary of the data, from which we can quickly identify the
median, the quartiles, how dispersed the data is, and any outliers.
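A minimal sketch with made-up scores; the single high value appears as a point beyond the whisker:

import matplotlib.pyplot as plt

scores = [55, 60, 62, 65, 67, 70, 72, 75, 78, 95]  # illustrative data
plt.boxplot(scores)  # box shows median and quartiles; whiskers show spread
plt.title("Box Plot")
plt.show()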
Heatmap
Heatmaps use colour gradients to represent data values in a matrix format, making it easier to identify relationships or
correlations in large datasets.
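A minimal sketch using seaborn (assumed to be available) to draw a correlation heatmap over made-up scores:

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Correlation matrix of numerical features, shown as a colour gradient
df = pd.DataFrame({"math": [70, 80, 90, 65],
                   "physics": [68, 82, 88, 60],
                   "chemistry": [75, 78, 85, 70]})
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
plt.title("Heatmap")
plt.show()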
Data Cleaning
Missing values in a column can be filled with the column’s overall mean, median, or mode, or more accurately with scikit-learn’s KNNImputer, which borrows values from similar rows.
For example, consider a dataset with a missing value in a column representing a student’s math
score. Instead of simply filling this missing value with the overall mean or median of the math
scores, KNNImputer finds the k-nearest students (based on other features like scores in physics,
chemistry, etc.) and imputes the missing value using the mean or median of these neighbors’ math
scores.
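A runnable sketch of that example with made-up scores; KNNImputer fills the missing math score from the 2 nearest students:

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Student scores with a missing math score (np.nan)
scores = pd.DataFrame({"math":      [90, np.nan, 85, 40, 45],
                       "physics":   [92, 88, 84, 42, 48],
                       "chemistry": [89, 91, 86, 44, 41]})

# Impute using the (unweighted) mean of the 2 nearest students
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(scores), columns=scores.columns)
print(filled)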
Handling Outliers
• An outlier is a data point that significantly deviates from the rest of the data. It can be either much higher or much lower than the other data points.
Identification of Outliers
Statistical Methods:
• Z-Score: calculates how many standard deviations each data point lies from the mean and flags points whose Z-scores exceed a chosen threshold (commonly |Z| > 3).
• IQR (Interquartile Range):
IQR = Q3 - Q1
Outlier Detection:
Data points below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are often considered outliers.
Distance-Based Methods:
• K-Nearest Neighbors (KNN): KNN identifies outliers as data points whose K nearest neighbors are far away
from them.
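A sketch of both statistical methods on a made-up array (a threshold of 2 rather than 3 is used only because the sample is tiny):

import numpy as np

data = np.array([10, 12, 11, 13, 12, 14, 11, 98])  # 98 is an outlier

# Z-score method: flag points far from the mean in standard-deviation units
z = (data - data.mean()) / data.std()
print("Z-score outliers:", data[np.abs(z) > 2])

# IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
mask = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)
print("IQR outliers:", data[mask])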
Methods to handle Outliers
1. Removal:
This involves identifying and removing outliers from the dataset before training the model. Common methods include:
• Thresholding: Outliers are identified as data points exceeding a certain threshold (e.g., Z-score > 3).
• Distance-based methods: Outliers are identified based on their distance from their nearest neighbors.
• Clustering: Outliers are identified as points not belonging to any cluster or belonging to very small clusters.
2. Transformation:
Transformations make the data more normally distributed and lessen the influence of extreme values.
3. Winsorizing:
Extreme values are capped at chosen percentile limits rather than removed.
4. Binning:
Values are grouped into discrete intervals (bins), which dampens the effect of extreme values.
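A sketch of winsorizing via percentile capping, together with a log transformation, on made-up data:

import numpy as np

data = np.array([10, 12, 11, 13, 12, 14, 11, 98])

# Winsorizing: cap values at the 5th and 95th percentiles instead of removing them
low, high = np.percentile(data, [5, 95])
print(np.clip(data, low, high))

# Log transformation: log1p compresses extreme values and reduces skew
print(np.log1p(data))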
Z-Score (Pg 2.34)
The Z-score is also called a standard score.
This value helps to understand how far a data point is from the mean.
After setting a threshold value, the Z-scores of data points can be used to define outliers.
Z-score = (data_point - mean) / std. deviation
Data Transformation
• Normalisation
A data transformation that scales the values of numerical features to a standard range, typically between 0 and 1.
• Standardisation
Data standardisation, a crucial data transformation technique, converts data into a uniform format with a
mean of 0 and a standard deviation of 1, which ensures consistency and makes analysis and machine learning
model training easier.
Z = (x - mean) / standard deviation
Scikit-Learn provides a transformer called StandardScaler for standardisation.
• Log Transformation
Log transformation is a common technique used to reduce skewness in a distribution and make it more
symmetric.
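A sketch of all three transformations using scikit-learn and numpy on a made-up income column:

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({"income": [20000, 35000, 50000, 120000, 300000]})

# Normalisation: scale to the range [0, 1]
df["income_norm"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()

# Standardisation: mean 0, standard deviation 1  (Z = (x - mean) / std)
df["income_std"] = StandardScaler().fit_transform(df[["income"]]).ravel()

# Log transformation: reduce right skew
df["income_log"] = np.log1p(df["income"])
print(df)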
1. Label Encoding
Label Encoding is a simple and straightforward method that assigns a unique integer to each category. This method is
suitable for ordinal data where the order of categories is meaningful.
2. One-Hot Encoding
One-Hot Encoding converts categorical data into a binary matrix, where each category is represented by a
binary vector. This method is suitable for nominal data, where categories have no inherent order.
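A sketch of both encodings; the column names and values are illustrative:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"size": ["small", "medium", "large", "medium"],   # ordinal
                   "city": ["Delhi", "Mumbai", "Delhi", "Chennai"]})  # nominal

# Label Encoding: one integer per category.  Note that LabelEncoder assigns
# integers alphabetically; map an explicit order for truly ordinal features.
df["size_encoded"] = LabelEncoder().fit_transform(df["size"])

# One-Hot Encoding: one binary column per category
df = pd.get_dummies(df, columns=["city"])
print(df)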
Data Reduction
•Improved Efficiency:
Large datasets can be computationally expensive to process, making data reduction
crucial for speeding up training and prediction times.
•Simplified Analysis:
Reduced datasets are easier to visualize and analyze, making it easier to understand
patterns and insights.
Common Methods of Data Reduction
• Feature Selection – Chooses a subset of relevant features from the original dataset while
discarding irrelevant features.
• Feature Extraction – Transforms original features into lower dimensionality space to capture
essential information.
• Dimensionality Reduction – Reduces the number of input variables in the dataset, improving
model efficiency. Techniques like PCA and LDA are used.
Example to demonstrate Data Reduction
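A minimal PCA sketch on scikit-learn's built-in Iris data (the dataset choice here is an assumption):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)
print("Original shape:", X.shape)  # (150, 4)

# Project the 4 features onto 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print("Reduced shape:", X_reduced.shape)  # (150, 2)
print("Variance retained:", pca.explained_variance_ratio_.sum())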
Feature Engineering
Create new features from existing data, transform data into new, useful formats, or enhance the quality of
the data.
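A small feature-creation sketch with hypothetical housing-style columns:

import pandas as pd

df = pd.DataFrame({"total_rooms": [880, 7099, 1467],
                   "households":  [126, 1138, 190]})

# Feature creation: a ratio feature often carries more signal than raw counts
df["rooms_per_household"] = df["total_rooms"] / df["households"]
print(df)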
Data Splitting
• Training Data – Data used for training the model; usually 80% of the data is used for training.
• Validation Data – This subset is used to make decisions about which models to use.
• Test Data – Used to evaluate the final model’s performance after it is trained and
validated.
Methods of splitting:
• Random Splitting
• Stratified Splitting
• Time-based Splitting
Why Data Splitting is important
Avoiding Overfitting – When a model is trained on extensive data, noise
may be learned along with the genuine patterns. To avoid this negative
impact, the data is split into training and validation data.
Data splitting is crucial for evaluating the model's performance accurately and ensuring its
ability to generalise to new, unseen data.
• Random Sampling – In this method, the data is collected at random, meaning every
item of the universe has an equal chance of being selected for the investigation. In other
words, each item has an equal probability of being in the sample, which makes the method
impartial. Used for large datasets.
• Stratified Sampling – Stratified or mixed sampling is a method in which the population is divided
into different groups with different characteristics, known as strata, and from those strata
some items are selected to represent the population. While forming strata, the investigator
has to ensure that each stratum is represented in the correct proportion.
For example, there are 60 students in Class 10. Of these 60 students, 10 opted for Arts and
Humanities, 30 for Commerce, and 20 for Science in Class 11. The population of 60 students is thus
divided into three strata, viz. Arts and Humanities, Commerce, and Science, containing
10, 30, and 20 students, respectively. For the investigation, items are then selected
proportionately from each stratum so that the sample represents the entire population.
Creating Test Set in Python
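A sketch using scikit-learn's train_test_split on a made-up dataset; the stratify argument mirrors the Class 10 example above:

import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative data: 20 students across Arts, Commerce, and Science streams
data = pd.DataFrame({"score": range(60, 80),
                     "stream": ["Arts"] * 4 + ["Commerce"] * 10 + ["Science"] * 6})

# Random splitting: 80% training, 20% test
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)

# Stratified splitting: each stream keeps its proportion in both subsets
strat_train, strat_test = train_test_split(
    data, test_size=0.2, stratify=data["stream"], random_state=42)
print(strat_test["stream"].value_counts())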
Preparing the Data for the Machine Learning Algorithm
Data Cleaning
• Handling Missing Values
• Handling Outliers
• Define the objective
• Establish the data pipeline
• Evaluate existing solutions
• Problem Formulation