
Chapter 2

Data Handling Techniques


Content
• Data cleaning
• Data transformations
• Outlier detection and visualization
• Imbalanced dataset handling
• Feature selection and extraction
Importance of data preprocessing

• It improves accuracy and reliability. Preprocessing removes missing or inconsistent data values resulting from human or computer error, which improves the accuracy and quality of a dataset and makes it more reliable.
• It makes data consistent. When collecting data, it is possible to have duplicates, and discarding them during preprocessing ensures the data values used for analysis are consistent, which helps produce accurate results.
• It increases the data's algorithm readability. Preprocessing enhances the data's quality and makes it easier for machine learning algorithms to read, use, and interpret.
Features of data preprocessing

• Preprocessing has many features that make it an important preparation step for data analysis. The two main features, explained briefly below, are:
• Data validation
• Data imputation
Features of data preprocessing

• Data validation: This is the process where businesses analyze and assess the raw data for a project to determine whether it is complete and accurate enough to achieve the best results.

• Data imputation: Data imputation is the process of inputting missing values and rectifying data errors found during validation, either manually or programmatically, for example through business process automation.
Data Cleaning
Effective Strategies for Handling Missing Values in Data
Analysis (How to handle missing values?)

• What Is a Missing Value?
• Missing data is defined as values that are not stored (or not present) for some variable(s) in a given dataset. For example, in the Titanic dataset the columns 'Age' and 'Cabin' contain some missing values.
Effective Strategies for Handling Missing Values in Data Analysis

• How Is a Missing Value Represented in a Dataset?
• In a raw dataset, a blank cell indicates a missing value.
• In pandas, missing values are usually represented by NaN, which stands for Not a Number.
Effective Strategies for Handling Missing Values in Data Analysis

Why Is Data Missing From the Dataset?

• Past data might get corrupted due to improper maintenance.

• Observations were not recorded for certain fields, for example because of a failure in recording the values due to human error.

• The user intentionally did not provide the values.

• Item nonresponse: the participant refused to respond to certain items.


Types of Missing Values

• Formally, missing values are categorized as follows:
• Missing Completely At Random (MCAR)
• Missing At Random (MAR)
• Missing Not At Random (MNAR)
Missing Completely At Random (MCAR)

• In MCAR, the probability of data being missing is the same for all observations. There is no relationship between the missing data and any other values, observed or unobserved (data that is not recorded), in the dataset. In other words, missing values are completely independent of the other data, and there is no pattern.

• In the case of MCAR, a value could be missing due to human error, a system or equipment failure, loss of a sample, or some unsatisfactory technicality while recording the values. For example, suppose a library tracks overdue books and some overdue-book values are missing in the computer system. The reason might be human error, such as the librarian forgetting to type in the values, so the missing values are not related to any other variable in the system. MCAR should not be assumed lightly, as it is a rare case in practice. The advantage of such data is that the statistical analysis remains unbiased.
Missing At Random (MAR)

• MAR means that the reason for the missing values can be explained by variables for which you have complete information, since there is some relationship between the missing data and other observed values. The data is not missing for all observations; it is missing only within sub-samples of the data, and there is some pattern in the missing values.

• For example, in survey data you may find that everyone has answered 'Gender', but 'Age' values are mostly missing for people who answered their 'Gender' as 'female' (because many women prefer not to reveal their age).

• So, the probability of the data being missing depends only on the observed data. Here the variables 'Gender' and 'Age' are related: the reason for the missing 'Age' values can be explained by 'Gender', but you cannot predict the missing value itself.

• Suppose a library poll asks for gender and the number of overdue books. Assume most women answer the poll and men are less likely to answer. Why the data is missing can then be explained by another factor, namely gender. In this case, the statistical analysis might be biased, and an unbiased estimate of the parameters can be obtained only by modeling the missing data.
Missing Not At Random (MNAR)

• Missing values depend on the unobserved data. If there is some structure or pattern in the missing data that the other observed data cannot explain, it is considered Missing Not At Random (MNAR).

• If the missing data does not fall under MCAR or MAR, it can be categorized as MNAR. It can happen because of people's reluctance to provide the required information; a specific group of respondents may not answer some questions in a survey.

• For example, suppose a library poll asks for the respondent's name and the number of overdue books. People with no overdue books are likely to answer the poll, while people with many overdue books are less likely to answer. So, in this case, the missing values for the number of overdue books depend on how many books are overdue.

• Another example: people with a lower income may refuse to share some information in a survey or questionnaire.

• In the case of MNAR as well, the statistical analysis might result in bias.
Handling Missing Data
• Why Do We Need to Care About Handling Missing Data?

• It is important to handle the missing values appropriately.

• Many machine learning algorithms fail if the dataset contains missing values. However, some algorithms, such as k-nearest neighbors and Naive Bayes, can handle data with missing values.

• You may end up building a biased machine learning model, leading to incorrect results if
the missing values are not handled properly.

• Missing data can lead to a lack of precision in the statistical analysis.


Handling Missing Values

• Now that you have found the missing data, how do you handle the missing values?

• Analyze each column with missing values carefully to understand why those values are missing; this information is crucial for choosing the strategy for handling them.
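A minimal sketch of how the missing values might be found in the first place with pandas; the file name and dataset are assumptions, not from the slides:

import pandas as pd

train_df = pd.read_csv("train.csv")       # hypothetical loan/Titanic-style dataset
print(train_df.isnull().sum())            # number of missing values per column
print(train_df.isnull().mean() * 100)     # percentage of missing values per column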

• There are 2 primary ways of handling missing values:

1.Deleting the Missing values

2.Imputing the Missing Values


Deleting the Missing value

• Generally, this approach is not recommended. It is one of the quick-and-dirty techniques used to deal with missing values. If the missing value is of the type Missing Not At Random (MNAR), it should not be deleted.
• If the missing value is of type Missing At Random (MAR) or Missing Completely At Random (MCAR), it can be deleted. (In the analysis, all cases with available data are utilized, while missing observations are assumed to be completely at random (MCAR) and addressed through pairwise deletion.)
Deleting the Missing value

• The disadvantage of this method is that one might end up deleting useful data from the dataset.
• There are two ways to delete missing data values:
• Deleting the entire row (listwise deletion): if a row has many missing values, you can drop the entire row. Note that if every row has some column value missing, you might end up deleting the whole dataset. The code to drop rows (and, similarly, columns) is shown on the next slide.
• Deleting the entire column: if a particular column has a large share of missing values, that column can be dropped.
Deleting entire Row and Column

# Delete rows that contain any missing value (full rows are deleted)
df = train_df.dropna(axis=0)
df.isnull().sum()

# Delete an entire column, e.g. 'Dependents' (full column is deleted)
df = train_df.drop(['Dependents'], axis=1)
df.isnull().sum()

Imputing the Missing Value

• There are many imputation methods for replacing missing values. You can use different Python libraries, such as pandas and scikit-learn, to do this. Let's go through some of the ways of replacing missing values.
Replacing with an arbitrary value

• If you can make an educated guess about the missing value, then
you can replace it with some arbitrary value using the following
code. E.g., in the following code, we are replacing the missing
values of the ‘Dependents’ column with ‘0’.
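A minimal sketch of that replacement, assuming the hypothetical loan-style train_df used throughout these slides:

# replace missing values in 'Dependents' with the arbitrary value '0'
train_df['Dependents'] = train_df['Dependents'].fillna('0')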
Replacing with the mean

• This is the most common method of imputing missing values in numeric columns. If there are outliers, the mean will not be appropriate; in such cases, the outliers need to be treated first.

• Use the 'fillna' method for imputing the columns 'LoanAmount' and 'Credit_History' with the mean of the respective column values.
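A minimal sketch of mean imputation with 'fillna', assuming the same hypothetical train_df:

train_df['LoanAmount'] = train_df['LoanAmount'].fillna(train_df['LoanAmount'].mean())
train_df['Credit_History'] = train_df['Credit_History'].fillna(train_df['Credit_History'].mean())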
Replacing with the mode

• The mode is the most frequently occurring value. It is used in the case of categorical features. You can use the 'fillna' method for imputing the categorical columns 'Gender', 'Married', and 'Self_Employed'.
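A minimal sketch of mode imputation for those categorical columns:

for col in ['Gender', 'Married', 'Self_Employed']:
    # mode() returns a Series; its first entry is the most frequent value
    train_df[col] = train_df[col].fillna(train_df[col].mode()[0])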
Replacing with the median

• The median is the middle value. It is better to use the median for imputation when outliers are present. You can use the 'fillna' method for imputing the column 'Loan_Amount_Term' with the median value:

train_df['Loan_Amount_Term'] = train_df['Loan_Amount_Term'].fillna(train_df['Loan_Amount_Term'].median())
Replacing with the previous value – forward fill

• In some cases, imputing the values with the previous value instead of the mean, mode, or median is more appropriate.

• This is called forward fill.

• It is mostly used in time series data. You can use the 'fillna' function with the parameter method='ffill'.
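A minimal sketch of forward fill (the column name is illustrative; newer pandas versions prefer the dedicated .ffill() method over the 'method' parameter):

train_df['Loan_Amount_Term'] = train_df['Loan_Amount_Term'].fillna(method='ffill')
# equivalent, and preferred in recent pandas:
# train_df['Loan_Amount_Term'] = train_df['Loan_Amount_Term'].ffill()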
Replacing with the next value – backward fill

• In backward fill, the missing value is imputed using the next value in the column (with 'fillna' and method='bfill', or the .bfill() method).
Interpolation

• Missing values can also be imputed using interpolation. Pandas' interpolate method can be used to replace missing values with different interpolation methods such as 'polynomial', 'linear', and 'quadratic'. The default method is 'linear'.
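A minimal sketch of interpolation-based imputation (the column name is illustrative):

train_df['LoanAmount'] = train_df['LoanAmount'].interpolate(method='linear')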
Data Transformation
Data Transformation-in detail
• Data transformation is used when data needs to be converted to match that of
the destination system
• Data transformation is the process of converting raw data into a format or
structure that would be more suitable for model building and also data discovery
in general.
• It is an imperative step in feature engineering that facilitates discovering insights.
Feature Transforming techniques
1) MinMax Scaler

2) Standard Scaler

3) MaxAbsScaler

4) Robust Scaler

5) Quantile Transformer Scaler

6) Power Transformer Scaler

7) Unit Vector Scaler/Normalizer

8) Custom Transformer
Base code in python

Dataset with Income, Age and Department for some firm and its employees.
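A minimal sketch of such a dataset, using the Income and Age values that appear in the worked examples on the following slides (the Department values are assumptions):

import pandas as pd

df = pd.DataFrame({
    'Income': [15000, 1800, 120000, 10000],
    'Age': [25, 18, 45, 51],
    'Department': ['HR', 'Sales', 'IT', 'Sales'],
})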
Base code in python
• Observe that 'Department' is categorical data; only the numeric columns are scaled.

df_scaled = df.copy()
col_names = ['Income', 'Age']
features = df_scaled[col_names]

We will execute this snippet before using each new scaler.
MinMax Scaler

• The MinMax scaler is one of the simplest scalers to understand. It scales all the data into the range between 0 and 1.

• x_scaled = (x – x_min) / (x_max – x_min)

Though (0, 1) is the default range, we can define our own range of minimum and maximum values as well.
MinMax Scaler

Worked example for the Income column, with x_min = 1,800 and x_max = 1,20,000, so x_max – x_min = 1,18,200:

Income (given)   x – x_min                     x_scaled = (x – x_min)/(x_max – x_min)
15,000           15,000 – 1,800 = 13,200       13,200 / 1,18,200 = 0.1117
1,800            1,800 – 1,800 = 0             0
1,20,000         1,20,000 – 1,800 = 1,18,200   1
10,000           10,000 – 1,800 = 8,200        8,200 / 1,18,200 = 0.0694

Observe each column carefully: the minimum scaled value is 0 and the maximum is 1, hence the name MinMax scaler.
E.g., if you do not want Age = 0, the range can be set to (5, 10) instead. How?
MinMax Scaler

The min-max scaler lets you set the range in which you want the variables to be.
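A minimal sketch using scikit-learn's MinMaxScaler, including a custom range:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()                         # default range (0, 1)
df_scaled[col_names] = scaler.fit_transform(features.values)

scaler = MinMaxScaler(feature_range=(5, 10))    # custom range, e.g. (5, 10)
df_scaled[col_names] = scaler.fit_transform(features.values)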
Standard Scaler

• For each feature, the Standard Scaler scales the values so that the mean is 0 and the standard deviation (and hence the variance) is 1.

x_scaled = (x – mean) / std_dev

The Standard Scaler assumes that the distribution of the variable is normal. If the variables are not normally distributed, either choose a different scaler, or first convert the variables to a normal distribution and then apply this scaler.
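A minimal sketch using scikit-learn's StandardScaler on the numeric columns:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_scaled[col_names] = scaler.fit_transform(features.values)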
Standard Scaler- Normal data distribution

• A histogram helps to check whether the data is normally distributed.


Standard Scaler – Normal data distribution

Worked example for the Age column, with mean = 34.75:

Age (given)   X – mean   (X – mean)^2   X_scaled = (X – mean) / SD
25            -9.75      95.06          -9.75 / 13.64 = -0.71
18            -16.75     280.56         -16.75 / 13.64 = -1.23
45            10.25      105.06         10.25 / 13.64 = 0.75
51            16.25      264.06         16.25 / 13.64 = 1.19

Sum of squares = 744.75; variance = 744.75 / 4 = 186.19; standard deviation SD = √186.19 = 13.64
MaxAbsScaler

• In simplest terms, the MaxAbs scaler takes the maximum absolute value of each column and divides each value in the column by that maximum.

• It first takes the absolute value of each entry in the column and then takes the maximum of those. This operation scales the data into the range [-1, 1].
MaxAbsScaler

Age and Income cannot be negative, so let us take one more variable, Balance. The maximum absolute value is 2000.

Balance value   Absolute value   Value divided by max absolute value
100             100              100 / 2000 = 0.05
-263            263              -263 / 2000 = -0.1315
2000            2000             2000 / 2000 = 1
-5              5                -5 / 2000 = -0.0025

Compare this with percentile marks.
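A minimal sketch using scikit-learn's MaxAbsScaler on that hypothetical Balance column:

import pandas as pd
from sklearn.preprocessing import MaxAbsScaler

balance = pd.DataFrame({'Balance': [100, -263, 2000, -5]})
scaler = MaxAbsScaler()
balance_scaled = scaler.fit_transform(balance)   # divides every value by max(|Balance|) = 2000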
Robust Scaler

• As you may have noticed, each of the scalers we have used so far relies on values such as the mean, maximum, and minimum of the columns.
• All these values are sensitive to outliers.
• If there are too many outliers in the data, they will influence the mean, the max value, or the min value.
• Thus, even if we scale this data using the above methods, we cannot guarantee balanced data with a normal distribution.
Robust Scaler

The Robust Scaler, as the name suggests, is not sensitive to outliers. This scaler:
1. removes the median from the data
2. scales the data by the Interquartile Range (IQR)
• Q1 = median of the first half of the data (25th percentile)
• Q2 = the actual median (50th percentile)
• Q3 = median of the second half of the data (75th percentile)
IQR = Q3 – Q1
x_scaled = (x – Q2) / (Q3 – Q1)
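A minimal sketch using scikit-learn's RobustScaler (it centers on the median and scales by the IQR):

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
df_scaled[col_names] = scaler.fit_transform(features.values)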
Quantile Transformer Scaler

• The Quantile Transformer Scaler converts the variable's distribution to a normal distribution and scales it accordingly.
• A few important points regarding the Quantile Transformer Scaler:
1. It computes the cumulative distribution function (CDF) of the variable.
2. It uses this CDF to map the values to a uniform distribution.
3. It maps the obtained values to the desired output distribution (e.g. normal) using the associated quantile function.
Quantile Transformer Scaler

• Because this scaler changes the very distribution of the variables, linear relationships among variables may be destroyed by using it.
• Thus, it is best suited to non-linear data.
• It is most useful for larger datasets.
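A minimal sketch using scikit-learn's QuantileTransformer (n_quantiles should not exceed the number of rows; scikit-learn clips it otherwise):

from sklearn.preprocessing import QuantileTransformer

scaler = QuantileTransformer(output_distribution='normal', n_quantiles=4, random_state=0)
df_scaled[col_names] = scaler.fit_transform(features.values)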
Log Transform

• It is primarily used to convert a skewed distribution into a normal (or at least less-skewed) distribution.
• In this transform, we take the log of the values in a column and use these values as the column instead.
• Why does it work? Because the log function compresses large numbers. For example (base 10):
• log(10) = 1
• log(100) = 2, and
• log(10000) = 4.
Log Transform

• Thus, in our example, the histogram of Income ranges from 0 to 1,20,000.

While our Income column had extreme values ranging from 1,800 to 1,20,000, the (natural) log values now range from approximately 7.5 to 11.7. Thus, the log operation plays a dual role:

• Reducing the impact of too-low values
• Reducing the impact of too-high values

If our data has negative values, NaN, or zeros, we cannot apply the log transform directly, since the log of zero and of negative numbers is undefined; values between 0 and 1 are mapped to negative outputs, which may also be undesirable.
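A minimal sketch of the log transform on the hypothetical Income column (all values here are positive; np.log1p, which computes log(1 + x), is a common alternative when zeros are present):

import numpy as np

df['Income_log'] = np.log(df['Income'])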
Original data and Log Transformed data
Power Transformer Scaler

• The Power Transformer also changes the distribution of the variable, making it more Gaussian (normal).
• The Power Transformer automates this decision-making by introducing a parameter called lambda. It decides on a generalized power transform by finding the best value of lambda using either:
1. the Box-Cox transform (positive values only), or
2. the Yeo-Johnson transform (both positive and negative values).
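A minimal sketch using scikit-learn's PowerTransformer:

from sklearn.preprocessing import PowerTransformer

pt = PowerTransformer(method='yeo-johnson')   # handles positive and negative values
# pt = PowerTransformer(method='box-cox')     # strictly positive values only
df_scaled[col_names] = pt.fit_transform(features.values)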
Unit Vector Scaler/Normalizer

• Normalization is the process of scaling individual samples to have unit norm.
• The most interesting part is that, unlike the other scalers, which work on individual column values, the Normalizer works on the rows.
• Each row of the dataframe with at least one non-zero component is rescaled independently of the other samples so that its norm (l1, l2, or max) equals one.
Unit Vector Scaler/Normalizer

• If we are using the L1 norm, each value in a row is divided by the sum of the absolute values of that row, so that the absolute values along the row sum to 1.
• If we are using the L2 norm, each value in a row is divided by the square root of the sum of squares of that row, so that the squares of the values along the row sum to 1.

If you check the first row: (0.999999)^2 + (0.001667)^2 ≈ 1.000
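A minimal sketch using scikit-learn's Normalizer on the numeric columns:

from sklearn.preprocessing import Normalizer

norm = Normalizer(norm='l2')     # 'l1' and 'max' are also available
df_scaled[col_names] = norm.fit_transform(features.values)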
Custom Transformer

• A custom transformer applies your own transform, as required by the data.
• Suppose our feature transformation technique involves taking log to the base 2 of the values. NumPy provides a function called log2; let us use it.
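A minimal sketch of a custom transformer wrapping np.log2 with scikit-learn's FunctionTransformer:

import numpy as np
from sklearn.preprocessing import FunctionTransformer

log2_transformer = FunctionTransformer(np.log2, validate=True)
df_scaled[col_names] = log2_transformer.fit_transform(features.values)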
(Figures: histograms of the data transformed with log base 2 and log base 10.)
Categorical Feature Handling

Categorical data can be encoded with Label Encoding or One-Hot Encoding. Label encoding replaces each category (e.g. India, Japan, the US) with an integer; one-hot encoding creates one binary column per category (Feature 0, Feature 1, Feature 2).

With label encoding, there is a very high probability that the model captures an ordinal relationship between the countries, such as India < Japan < the US.
One-Hot Encoding vs Label Encoding

We apply One-Hot Encoding when:
1. The categorical feature is not ordinal (like the countries above).
2. The number of categories is small, so one-hot encoding can be applied effectively.

We apply Label Encoding when:
1. The categorical feature is ordinal (like Jr. KG, Sr. KG, primary school, high school).
2. The number of categories is quite large, as one-hot encoding can lead to high memory consumption.
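A minimal sketch of both encodings with pandas and scikit-learn, on a hypothetical Country column:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

countries = pd.DataFrame({'Country': ['India', 'Japan', 'US', 'India']})

# label encoding: each category becomes an integer
le = LabelEncoder()
countries['Country_label'] = le.fit_transform(countries['Country'])

# one-hot encoding: one binary column per category
one_hot = pd.get_dummies(countries['Country'], prefix='Country')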
How to Impute Missing Values for Categorical
Features?
• Strategy 1: Delete the missing observations.
• Strategy 2: Replace missing values with the most frequent value.
• Strategy 3: Delete the variable that has missing values.
• Strategy 4: Develop a model to predict the missing values.
Apply Strategy-4
(Develop a model to predict missing values).

– Read and load the encoded dataset.
– Treat the records with missing values as our testing data.
– Treat the records without missing values as our training data.
– Separate the dependent and independent variables.
– Fit a Logistic Regression model.
– Predict the class for the missing records.


Outlier detection and visualization
Scatter plot

Box Plot
A box plot is another very simple visualization tool to detect outliers; it uses the concept of the interquartile range (IQR).



Outlier detection and Visualization
There are several ways to treat outliers in a dataset, depending on the
nature of the outliers and the problem being solved.
• Trimming
• Trimming excludes the outlier values from our analysis. The data becomes thin when many outliers are present in the dataset. Its main advantage is that it is the fastest approach.
• Capping
• In this technique, outlier values are capped at a limit: values above an upper limit (or below a lower limit) are replaced with that limit value, so the extreme values no longer dominate the column (a minimal sketch follows this list).
• Discretization
• In this technique, we form groups (bins) and place the outliers into a particular group, forcing them to behave in the same way as the other points in that group. This technique is also known as binning.
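A minimal sketch of capping (winsorization) at the 1st and 99th percentiles of the hypothetical Income column:

low, high = df['Income'].quantile([0.01, 0.99])
df['Income_capped'] = df['Income'].clip(lower=low, upper=high)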
Normal, Skewed and Other distribution

• For normal distributions: use the empirical relations of the normal distribution. Data points that fall below mean − 3·sigma or above mean + 3·sigma are outliers, where mean and sigma are the average value and standard deviation of a particular column.
• For skewed distributions: use the Inter-Quartile Range (IQR) proximity rule. Data points that fall below Q1 − 1.5·IQR or above Q3 + 1.5·IQR are outliers, where Q1 and Q3 are the 25th and 75th percentiles of the dataset, respectively, and IQR = Q3 − Q1.
• For other distributions: use a percentile-based approach. For example, data points above the 99th percentile or below the 1st percentile are considered outliers.
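A minimal sketch of all three detection rules on the hypothetical Income column:

col = df['Income']

# normal distribution: mean +/- 3 sigma rule
mean, sigma = col.mean(), col.std()
outliers_3sigma = df[(col < mean - 3 * sigma) | (col > mean + 3 * sigma)]

# skewed distribution: IQR proximity rule
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
outliers_iqr = df[(col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)]

# other distributions: percentile rule
p1, p99 = col.quantile(0.01), col.quantile(0.99)
outliers_pct = df[(col < p1) | (col > p99)]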
Imbalanced data handling
Data imbalance Problem
• Classification problems are quite common in the machine learning world. As we
know in the classification problem we try to predict the class label by studying the
input data or predictor where the target or output variable is a categorical variable
in nature.

• Imbalanced data refers to datasets where the target class has an uneven distribution of observations, i.e. one class label has a very high number of observations and the other has a very low number.

• Let's understand imbalanced dataset handling with an example.


Data imbalance
• Let's assume that XYZ is a bank that issues credit cards to its customers. The bank is concerned that fraudulent transactions are going on, and when it checks its data it finds that for every 2,000 transactions there are only 30 recorded frauds.

• So, the fraud rate is less than 2%, or we can say that more than 98% of transactions are "No Fraud" in nature. Here, the class "No Fraud" is called the majority class, and the much smaller "Fraud" class is called the minority class.
Data imbalance
More examples of imbalanced data are:
· Disease diagnosis
· Customer churn prediction
· Fraud detection
· Natural disaster
Data imbalance
• The main problem with imbalanced dataset prediction is: how accurately are we actually predicting both the majority and the minority class?
• Let's explain it with a disease-diagnosis example.
• Assume we are going to predict disease from an existing dataset where, for every 100 records, only 5 patients are diagnosed with the disease. What are the majority and minority classes?
• The majority class is the 95% with no disease, and the minority class is the 5% with the disease.
• Now, an ML model might predict that all 100 out of 100 patients have no disease.
Data imbalance
• Sometimes, when the records of one class far outnumber those of the other class, our classifier may become biased towards predicting the majority class.

• In this case, the confusion matrix for the classification problem shows how well our model classifies the target classes, and we arrive at the accuracy of the model from the confusion matrix.

• Accuracy is calculated as the total number of correct predictions made by the model divided by the total number of predictions. In the above case it is (0 + 95) / (0 + 95 + 0 + 5) = 0.95, or 95%. This means the model completely fails to identify the minority class, yet its accuracy score is still 95%.
Data imbalance
• Thus our traditional approach of classification and model
accuracy calculation is not useful in the case of the imbalanced
dataset.
ML Classifier with an Imbalanced Dataset

(Diagram: a new case is classified into the majority class "healthy" or the minority class "heart disease".)

• If the classifier identifies the minority class poorly, i.e. more of this class is wrongfully predicted as the majority class, false negatives (FN) increase.
• If the classifier predicts the minority class but the prediction is erroneous, false positives (FP) increase and the precision metric will be low.
Approach to deal with the imbalanced dataset
problem
• In rare-event cases like fraud detection or disease prediction, it is vital to identify the minority classes correctly. Techniques for this are:
1. Choose Proper Evaluation Metric

• The accuracy of a classifier is the total number of correct predictions by the classifier divided by the total number of predictions. This may be good enough for a well-balanced class distribution, but it is not ideal for the imbalanced class problem. Among the other metrics, precision measures how accurate the classifier's predictions of a specific class are, and recall measures the classifier's ability to identify a class.

• For an imbalanced dataset, the F1 score is a more appropriate metric. It is the harmonic mean of precision and recall:

F1 = 2 × (precision × recall) / (precision + recall)
Choose Proper Evaluation Metric

If the classifier predicts the minority class but the prediction is erroneous and false positives increase, the precision metric will be low, and so will the F1 score. Also, if the classifier identifies the minority class poorly, i.e. more of this class is wrongfully predicted as the majority class, then false negatives will increase, so recall and the F1 score will be low. The F1 score only increases if both the number and the quality of predictions improve.

The F1 score keeps the balance between precision and recall and improves only if the classifier identifies more of a certain class correctly.
Choose proper evaluation matrix
• Precision:
• the number of true positives divided by all positive predictions.
• Precision is also called Positive Predictive Value.
• It is a measure of a classifier’s exactness.
• Low precision indicates a high number of false positives.
• Recall:
• the number of true positives divided by the number of positive values in the test data.
• The recall is also called Sensitivity or the True Positive Rate. It is a measure of a classifier’s
completeness.
• Low recall indicates a high number of false negatives.
• F1 Score:
• the harmonic mean of precision and recall.
• Area Under ROC Curve (AUROC):
• AUROC represents the likelihood of your model distinguishing observations from two classes.
• In other words, if you randomly select one observation from each class, what’s the
probability that your model will be able to “rank” them correctly?
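A minimal sketch of computing these metrics with scikit-learn (y_true, y_pred and y_proba are assumed to hold the true labels, predicted labels and predicted probabilities):

from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auroc = roc_auc_score(y_true, y_proba)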
2. Resampling (Oversampling and Under sampling)

• This technique is used to upsample or downsample the minority or majority class.

• When we are using an imbalanced dataset, we can oversample the minority class using replacement.

• This technique is called oversampling.

• Similarly, we can randomly delete rows from the majority class to match the minority class count, which is called undersampling.

• After sampling the data we can get a balanced dataset for both majority and minority classes. So, when both
classes have a similar number of records present in the dataset, we can assume that the classifier will give
equal importance to both classes.
(Diagrams: undersampling removes majority-class samples; oversampling duplicates minority-class samples.)
2. Resampling (Oversampling and Under sampling)

• An example of this technique using sklearn's resample() is shown below. Here, Is_Lead is our target variable. Let's see the distribution of the classes in the target.
2. Resampling (Oversampling and Under sampling)

It has been observed that our target class is imbalanced, so we'll upsample the data so that the minority class matches the majority class.

import pandas as pd
from sklearn.utils import resample

# create two different dataframes for the majority and minority class
df_majority = df_train[df_train['Is_Lead'] == 0]
df_minority = df_train[df_train['Is_Lead'] == 1]

# upsample the minority class
df_minority_upsampled = resample(df_minority,
                                 replace=True,        # sample with replacement
                                 n_samples=131177,    # to match the majority class size
                                 random_state=42)     # reproducible results

# combine the majority class with the upsampled minority class
df_upsampled = pd.concat([df_minority_upsampled, df_majority])

After upsampling, the class distribution is balanced, as shown below.
2. Resampling (Oversampling and Under sampling)

sklearn.utils.resample can be used both to undersample the majority class and to oversample minority class instances.
3. SMOTE

• Synthetic Minority Oversampling Technique (SMOTE) is another technique to oversample the minority class.
• Simply adding duplicate records of the minority class often doesn't add any new information to the model.
• In SMOTE, new instances are synthesized from the existing data. In simple words, SMOTE looks at minority class instances, uses k nearest neighbors to select a random nearest neighbor, and creates a synthetic instance at a random point between them in feature space.
SMOTE algorithm works in 4 simple steps:

1. Choose a minority class instance as the input vector.


2. Find its k nearest neighbors (k_neighbors is specified as an
argument in the SMOTE() function).
3. Choose one of these neighbors and place a synthetic point
anywhere on the line joining the point under consideration and
its chosen neighbor.
4. Repeat the steps until the data is balanced.
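A minimal sketch using SMOTE from the third-party imbalanced-learn package (X and y are assumed to hold the features and the imbalanced target):

from imblearn.over_sampling import SMOTE

sm = SMOTE(k_neighbors=5, random_state=42)
X_resampled, y_resampled = sm.fit_resample(X, y)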
Feature selection and Extraction
Feature Selection and extraction

• In real-life machine learning problems, it is rare that all the variables in the dataset are useful for building a model.

• Adding redundant variables reduces the model's generalization capability and may also reduce the overall accuracy of a classifier.

• Furthermore, adding more variables to a model increases its overall complexity.
Feature Selection Techniques in Machine Learning
• Feature selection is the process of selecting the subset
of the relevant features and leaving out the irrelevant
features present in a dataset to build a model of high
accuracy.

• In other words, it is a way of selecting the optimal


features from the input dataset.

• Three methods are used for feature selection:
1. Filter methods
2. Wrapper methods
3. Embedded methods
1. Filter Methods
• Features are selected on the basis of statistical measures.
• The filter method filters out irrelevant features and redundant columns from the model by ranking features using different metrics.
Some common techniques of Filter methods are
• Information Gain
• Chi-square Test
• Fisher’s Score
• Correlation Coefficient
• Variance Threshold
• Mean Absolute Difference (MAD)
2. Wrapper Methods
• In wrapper methodology, the selection of features is done by
considering it as a search problem, in which different combinations are
made, evaluated, and compared with other combinations.
• It trains the algorithm by using the subset of features iteratively.
• On the basis of the output of the model, features are added or
subtracted, and with this feature set, the model is trained again.

Some techniques of wrapper methods are


• Forward Feature Selection
• Backward Feature Elimination
• Exhaustive Feature Selection
• Recursive Feature Elimination
3. Embedded Methods
• Embedded methods combine the advantages of both filter
and wrapper methods by considering the interaction of
features along with low computational cost.
• These are fast processing methods similar to the filter
method but more accurate than the filter method.
• These methods are also iterative: they evaluate each model-training iteration and extract the features that contribute the most to the training in that iteration.
Some techniques of embedded methods are:
• Regularization (L1)
• Tree-based methods
Types of Feature Selection Methods in ML

Filter Methods: Filter methods pick up the intrinsic properties of the features, measured via univariate statistics instead of cross-validation performance. These methods are faster and less computationally expensive than wrapper methods; when dealing with high-dimensional data, it is computationally cheaper to use filter methods. Techniques: Information Gain, Chi-square Test, Fisher's Score, Correlation Coefficient, Variance Threshold, Mean Absolute Difference (MAD).

Wrapper Methods: Wrappers require some method to search the space of all possible subsets of features, assessing their quality by learning and evaluating a classifier with each feature subset. Wrapper methods usually result in better predictive accuracy than filter methods. Techniques: Forward Feature Selection, Backward Feature Elimination, Exhaustive Feature Selection, Recursive Feature Elimination.

Embedded Methods: These methods combine the benefits of both the wrapper and filter methods by including interactions of features while maintaining reasonable computational cost. Embedded methods are iterative in the sense that they take care of each iteration of the model training process and carefully extract those features which contribute the most to the training for a particular iteration. Techniques: Regularization (L1), Tree-based methods.
Information Gain

• Information gain calculates the reduction in entropy from a transformation of a dataset.

• It can be used for feature selection by evaluating the information gain of each variable in the context of the target variable.
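A minimal sketch using scikit-learn's mutual information score (closely related to information gain) for each feature against the target (X, y assumed):

from sklearn.feature_selection import mutual_info_classif

info_gain = mutual_info_classif(X, y)   # one score per feature; higher = more informative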
Chi-square Test

• The Chi-square test is used for categorical features in a dataset.

• A chi-square test is a statistical test that is used to compare observed and expected results.

• The goal of this test is to identify whether a disparity between actual and predicted data is due to chance or to a link between the variables under consideration.
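A minimal sketch of chi-square-based selection with SelectKBest (X must contain non-negative values, e.g. counts or one-hot encoded categories; X, y assumed):

from sklearn.feature_selection import SelectKBest, chi2

selector = SelectKBest(score_func=chi2, k=5)
X_selected = selector.fit_transform(X, y)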
Fisher’s Score

• Fisher's Score is calculated as the ratio of between-class variance to within-class variance:

F_i = Σ_j n_j (μ_ij − μ_i)² / Σ_j n_j ρ_ij

where μ_ij and ρ_ij are the mean and the variance of the i-th feature in the j-th class, respectively, n_j is the number of instances in the j-th class, and μ_i is the overall mean of the i-th feature.

• A higher Fisher's Score implies that the feature is more discriminative and valuable for the study.
Correlation Coefficient
• A correlation coefficient is a numerical measure of some type of correlation, i.e. a statistical relationship between two variables. It lies between -1 and +1.

• Good variables correlate highly with the target. Furthermore, variables should be correlated with the target but uncorrelated among themselves.

• The higher the correlation with the target variable, the better the chances of the variable being included in the model.

• We need to set an absolute value, say 0.5, as the threshold for selecting variables. If we find that two predictor variables are correlated with each other, we can drop the one with the lower correlation coefficient with the target variable.
Correlation coefficient formula

r = Σ (x_i − x̄)(y_i − ȳ) / √( Σ (x_i − x̄)² · Σ (y_i − ȳ)² )

• r = correlation coefficient
• x_i = values of the x-variable in a sample
• y_i = values of the y-variable in a sample
• x̄ and ȳ = means of x and y, respectively
Variance Threshold
• It removes all features whose variance doesn't meet some threshold.
• By default, it removes all zero-variance features, i.e. features with the same value in all samples.
• The assumption made by this method is that higher-variance features are likely to contain more information.
• The get_support method returns a Boolean vector where True means the variable does not have zero variance (i.e. it was kept).
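A minimal sketch using scikit-learn's VarianceThreshold (X assumed):

from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.0)   # default: drop zero-variance features
X_reduced = selector.fit_transform(X)
print(selector.get_support())                 # True = feature kept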
Mean Absolute Difference (MAD)
• The mean absolute difference (MAD) computes the absolute difference from the mean value.

• The higher the MAD, the higher the discriminatory power.

• This method is similar to the variance threshold method, but the difference is that there is no squaring in MAD.
(Wrapper Methods)
Forward Feature Selection and Backward Feature Elimination

• Forward selection: This is an iterative approach where we initially start with an empty set of features and keep adding the feature that best improves our model after each iteration. The stopping criterion is reached when the addition of a new variable no longer improves the performance of the model.
• Backward elimination: This is also an iterative approach where we initially start with all features, and after each iteration we remove the least significant feature. The stopping criterion is reached when no improvement in the performance of the model is observed after a feature is removed.
Recursive Feature Elimination
• Given an external estimator that assigns weights to features (e.g., the
coefficients of a linear model), the goal of recursive feature
elimination (RFE) is to select features by recursively considering
smaller and smaller sets of features.
• First, the estimator is trained on the initial set of features, and each
feature’s importance is obtained either through a coef_ attribute or a
feature_importances_ attribute.
• Then, the least important features are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.
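A minimal sketch of RFE with a logistic regression estimator (X, y assumed):

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
X_rfe = rfe.fit_transform(X, y)
print(rfe.support_)    # Boolean mask of selected features
print(rfe.ranking_)    # rank 1 = selected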
Exhaustive Feature Selection
• It tries every possible combination of the variables and returns the best-performing subset.

• This can be computationally expensive, especially with a large number of features.
Embedded Methods
Regularization (L1)
• This method adds a penalty to different parameters of the machine
learning model to avoid over-fitting of the model.
• This approach of feature selection uses Lasso (L1 regularization) and
Elastic nets (L1 and L2 regularization). The penalty is applied over the
coefficients, thus bringing down some coefficients to zero. The
features having zero coefficient can be removed from the dataset.
• Lasso or L1 has the property that can shrink some of the coefficients
to zero. Therefore, that feature can be removed from the model.
• (Note: Ridge Regression allows coefficients to be very close to zero but never
actually zero)
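A minimal sketch of L1-based selection with Lasso and SelectFromModel (X, y assumed):

from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel

lasso = Lasso(alpha=0.01)            # L1 penalty drives some coefficients to exactly zero
selector = SelectFromModel(lasso)
X_l1 = selector.fit_transform(X, y)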
Tree-based methods
• Methods such as Random Forest and Gradient Boosting provide feature importances that can be used to select features as well.

• Nodes with the greatest decrease in impurity occur at the start of the trees, while nodes with the least decrease in impurity occur at the end of the trees.

• Thus, by pruning trees below a particular node, we can create a subset of the most important features.
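A minimal sketch of ranking features by Random Forest importances (X assumed to be a DataFrame, y the target):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.head(10))   # most important features first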
Feature Extraction techniques
• Feature Extraction aims to reduce the number of features in a dataset by
creating new features from the existing ones (and then discarding the
original features).
• These new reduced set of features should then be able to summarize most
of the information contained in the original set of features.
• In this way, a summarized version of the original features can be created
from a combination of the original set.
Techniques:
• PCA (Principal Component Analysis)
• ICA (Independent Component Analysis)
• LDA (Linear Discriminant Analysis)
• Autoencoders
Principal Component Analysis (PCA)
• PCA is one of the most widely used linear dimensionality reduction techniques.

• When using PCA, we take our original data as input and try to find a combination of the input features that best summarizes the original data distribution, so as to reduce its original dimensionality.

• PCA does this by maximizing variances and minimizing the reconstruction error by looking at pairwise distances.

• In PCA, the original data is projected onto a set of orthogonal axes, and each of the axes is ranked in order of importance.
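A minimal sketch of PCA with scikit-learn (X assumed; features should usually be standardized first):

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print(pca.explained_variance_ratio_)   # share of variance captured by each component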



Independent Component Analysis (ICA)
• ICA is a linear dimensionality reduction method which takes as input data a
mixture of independent components and it aims to correctly identify each of
them (deleting all the unnecessary noise).
• Two input features can be considered independent if both their linear and non-linear dependence is equal to zero.
• Independent Component Analysis is commonly used in medical applications such
as EEG and fMRI analysis to separate useful signals from unhelpful ones.
• As a simple example of an ICA application, let’s consider we are given an audio
registration in which there are two different people talking.
• Using ICA we could, for example, try to identify the two different independent
components in the registration (the two different people).
• In this way, we could make our unsupervised learning algorithm recognize
between the different speakers in the conversation.



Linear Discriminant Analysis (LDA)

• LDA aims to maximize the distance between the mean of each class and
minimize the spreading within the class itself.
• LDA therefore uses within-class and between-class scatter as its measures. This is a good choice because maximizing the distance between the means of each class when projecting the data into a lower-dimensional space can lead to better classification results.



Autoencoder
• Autoencoders are a family of Machine Learning algorithms which can
be used as a dimensionality reduction technique.

• The main difference between Autoencoders and other dimensionality


reduction techniques is that Autoencoders use non-linear
transformations to project data from a high dimension to a lower one.

• There exist different types of autoencoders, such as:
1. Denoising Autoencoder
2. Variational Autoencoder
3. Convolutional Autoencoder
4. Sparse Autoencoder



Autoencoder

1. Encoder: takes the input data and compresses it, removing the possible noise and unhelpful information. The output of the encoder stage is usually called the bottleneck or latent space.

2. Decoder: takes the encoded latent space as input and tries to reproduce the original autoencoder input using just its compressed form (the encoded latent space).
