CH1
Feature Engineering
What is Feature Engineering?
• Feature engineering is the process of selecting, manipulating, and transforming raw data into features that better represent the underlying problem and thus improve the accuracy of the machine learning model.
Example: encoding the categorical Height feature as numbers

Height (category)    Height (encoded)
Tall                 0
Medium               1
Short                2
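A minimal sketch of this kind of category-to-number mapping with pandas; the DataFrame and the Tall/Medium/Short codes follow the table above and are purely illustrative.

```python
import pandas as pd

# Illustrative data matching the Height table above
df = pd.DataFrame({"Height": ["Tall", "Medium", "Short", "Medium"]})

# Map each category to the integer code from the table
height_map = {"Tall": 0, "Medium": 1, "Short": 2}
df["Height_encoded"] = df["Height"].map(height_map)
print(df)
```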
One Hot Encoding
• It is a method for converting categorical variables into a binary
format.
• It creates new binary columns (0s and 1s) for each category in the
original variable.
• Each category in the original column is represented as a separate
column, where a value of 1 indicates the presence of that category,
and 0 indicates its absence.
Wherever the fruit is “Apple,” the Apple column will have a value of 1,
while the other fruit columns (like Mango or Orange) will contain 0.
One Hot Encoding – Example
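A minimal sketch of one-hot encoding with pandas; the Fruit column and its Apple/Mango/Orange values follow the example above, and the DataFrame itself is illustrative.

```python
import pandas as pd

# Illustrative data matching the Apple/Mango/Orange example above
df = pd.DataFrame({"Fruit": ["Apple", "Mango", "Orange", "Apple"]})

# get_dummies creates one binary (0/1) column per category
encoded = pd.get_dummies(df, columns=["Fruit"], dtype=int)
print(encoded)
# Rows where Fruit was "Apple" have Fruit_Apple = 1 and the other fruit columns = 0
```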
Binarization
• Often, raw numeric frequencies or counts are not necessary for building models, especially for the methods used in building recommender engines; merely using raw counts is not good practice there.
• Suppose our task is to build a recommender to recommend songs to users. One
component of the recommender might predict how much a user will enjoy a
particular song.
• In this case, the raw listen count is not a robust measure of user taste. Why?
Users have different listening habits. Some people might put their favorite songs
on infinite loop, while others might savor them only on special occasions. We
can’t necessarily say that someone who listens to a song 20 times must like it
twice as much as someone else who listens to it 10 times.
• In this case, a binary feature is preferred over a count-based feature. A more robust representation of user preference is to binarize the count and clip all counts greater than 1 to 1. In other words, if the user listened to a song at least once, the feature takes the value 1; otherwise it is 0.
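A minimal sketch of binarizing listen counts, assuming an illustrative NumPy array of per-song counts for one user.

```python
import numpy as np

# Illustrative raw listen counts for one user across five songs
listen_counts = np.array([0, 20, 3, 0, 1])

# Binarize: any count of at least 1 becomes 1, zero stays 0
binarized = (listen_counts >= 1).astype(int)
print(binarized)  # [0 1 1 0 1]
```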
Binning
• Often when working with numeric data, you might come across
features or attributes which depict raw measures such as values or
frequencies.
• In many cases, often the distributions of these attributes are skewed
in the sense that some sets of values will occur a lot and some will be
very rare.
• Besides that, there is also the added problem of varying range of
these values.
• Suppose we are talking about song or video view counts. In some
cases, the view counts will be abnormally large and in some cases
very small. Directly using these features in modeling might cause
issues. For example, if we are calculating similarity, a large count in
one element of the data vector would outweigh the similarity in all
other elements, which could throw off the entire similarity
measurement for that feature vector.
• One solution is to contain the scale by quantizing the count. In other
words, we group the counts into bins, and get rid of the actual count
values. Quantization maps a continuous number to a discrete one. We
can think of the discretized numbers as an ordered sequence of bins
that represent a measure of intensity.
• Data binning, or bucketing, is a process used to minimize the effects
of observation errors. It is the process of transforming numerical
variables into their categorical counterparts.
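A minimal sketch of binning (quantizing) skewed counts; the view counts and the log-like bin boundaries are illustrative choices, not prescribed values.

```python
import numpy as np

# Illustrative view counts spanning several orders of magnitude
view_counts = np.array([3, 18, 250, 9200, 87, 1_300_000])

# Fixed bins on a log-like scale: 1-9, 10-99, 100-999, ...
bins = [1, 10, 100, 1_000, 10_000, 100_000, 1_000_000, 10_000_000]

# Each raw count is replaced by the index of the bin it falls into
binned = np.digitize(view_counts, bins)
print(binned)  # [1 2 3 4 2 7]
```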
Z-Score Method
• The Z-score of a data point is calculated as the number of standard
deviations it falls away from the mean of the dataset. Z-score
represents a data point’s distance from the mean in terms of the
standard deviation. Mathematically, the Z-score for a data point x is
calculated as:
• Z-score = (x − mean) / standard deviation
• Using Z-Score, we can define the upper and lower bounds of a
dataset. A data point with a Z-score greater than a certain threshold
(usually 2.5 or 3) is considered an outlier.
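A minimal sketch of flagging outliers by Z-score, using an illustrative array with one extreme value and a threshold of 3.

```python
import numpy as np

# Illustrative data: 19 ordinary values plus one extreme value (100.0)
data = np.array([12.1, 13.4, 14.0, 13.7, 14.9, 15.2, 12.8, 14.4, 13.1, 14.7,
                 13.9, 15.0, 12.5, 14.2, 13.6, 14.8, 13.3, 14.5, 12.9, 100.0])

# Z-score = (x - mean) / standard deviation
z_scores = (data - data.mean()) / data.std()

# Points more than 3 standard deviations from the mean are treated as outliers
outliers = data[np.abs(z_scores) > 3]
print(outliers)  # [100.]
```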
Feature Selection
• Feature selection in machine learning refers to selecting essential
features from all features and discarding the useless ones. Having
redundant variables present in data reduces the model's
generalization capability and may also reduce the overall accuracy of
a classifier.
• Moreover, having too many redundant features increases the training
time of the model as well. Hence it is essential to identify and select
the most appropriate features from the data and remove the
irrelevant or less important features.
• Feature selection in machine learning helps in selecting important
features and removing useless ones.
• Removing less important features improves the model's
generalization capabilities. The model can focus on only vital features
and not try to learn misleading patterns in less important features.
• Feature selection in machine learning helps in improving the training
time of the model.
• Sometimes, it also helps in improving the performance of the model.
• In addition, feature selection in machine learning makes code
building, debugging, maintenance, and understanding easier.
• 1. Filter Method:
• In the Filter Method, features are selected based on statistical measures. The filter method filters out the model's irrelevant features and redundant columns by ranking them using different metrics.
• Filter methods need low computational time and do not overfit the
data. Some standard techniques of Filter methods are as follows:
• Chi-square Test: The chi-square value is calculated between each
feature and the target variable, and the desired number of features
with the best chi-square value is selected.
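A minimal sketch of chi-square filtering with scikit-learn's SelectKBest; the Iris dataset and k=2 are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Chi-square requires non-negative feature values, which Iris satisfies
X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest chi-square score against the target
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)   # chi-square score of each original feature
print(X_selected.shape)   # (150, 2)
```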
Wrapper methods
• Wrapper methods, also referred to as greedy algorithms, train the model using a subset of features in an iterative manner.
• Based on the conclusions drawn from the previous round of training, features are added or removed.
• Stopping criteria for selecting the best subset are usually pre-defined by the person training the model, such as the performance of the model decreasing or a specific number of features being reached.
• The main advantage of wrapper methods over the filter methods is
that they provide an optimal set of features for training the model,
thus resulting in better accuracy than the filter methods but are
computationally more expensive.
• Some techniques used are:
• Forward selection – This is an iterative approach where we start with an empty set of features and, after each iteration, add the feature that best improves the model. We stop when adding a new variable no longer improves the performance of the model.
• Backward elimination – This is also an iterative approach where we start with all features and, after each iteration, remove the least significant feature. We stop when removing a feature no longer improves the performance of the model.
• Bi-directional elimination – This method uses forward selection and backward elimination simultaneously to reach one unique solution.
• Recursive elimination – This greedy optimization method selects features by recursively considering smaller and smaller sets of features. The estimator is trained on an initial set of features and their importance is obtained from the feature_importances_ attribute. The least important features are then removed from the current set until we are left with the required number of features (see the sketch after this list).
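A minimal sketch of recursive feature elimination with scikit-learn's RFE; the dataset, the estimator, and the choice of 5 features are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Recursively drop the least important features until 5 remain,
# using the estimator's coefficients to judge importance
estimator = LogisticRegression(max_iter=5000)
rfe = RFE(estimator=estimator, n_features_to_select=5)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # rank 1 = selected; larger ranks were eliminated earlier
```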
Embedded methods
• In embedded methods, the feature selection algorithm is blended as
part of the learning algorithm, thus having its own built-in feature
selection methods.
• Embedded methods overcome the drawbacks of filter and wrapper methods and merge their advantages: they are fast like filter methods, more accurate than filter methods, and also take combinations of features into consideration.
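A minimal sketch of an embedded approach using a tree ensemble's built-in feature importances with scikit-learn's SelectFromModel; the dataset and the median threshold are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_breast_cancer(return_X_y=True)

# The forest computes feature importances as part of its own training,
# so feature selection is built into the learning algorithm itself
forest = RandomForestClassifier(n_estimators=100, random_state=0)
selector = SelectFromModel(forest, threshold="median")
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)  # only features with above-median importance are kept
```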
Model Selection
• It is the process of choosing the best-suited model for a particular problem.
• Selecting a model depends on various factors such as the dataset, the task, the nature of the model, etc.
• A model can be selected based on:
• Type of data available:
• Images and videos – CNN
• Text or speech data – RNN
• Numerical data – SVM, logistic regression, decision trees, etc.
• The task we need to carry out:
• Classification – classify a data point into one of the classes – SVM (higher-dimensional data), logistic regression (binary classes), decision trees
• Regression – linear regression, random forest, polynomial regression, etc.
• Clustering – clustering algorithms such as k-means or hierarchical clustering
Example: Linear Regression
• Advantages:
• Very simple to implement
• Performs well on data with a linear relationship
• Disadvantages:
• Not suitable for data with a non-linear relationship
• Prone to underfitting
• Sensitive to outliers
Example: Logistic Regression
• Advantages:
• Easy to implement
• Performs well on data with a linear relationship
• Disadvantages:
• High-dimensional datasets can cause overfitting
• Difficult to capture complex relationships in a dataset
• Sensitive to outliers
• Needs a larger dataset
Example: Decision Tree
• Advantages:
• Can be used for both classification and regression
• Easy to interpret
• No need for normalization or scaling
• Not sensitive to outliers
• Disadvantages:
• Prone to overfitting
• Small changes in the data can alter the tree structure, causing instability
• Training time is relatively higher