CH1


Model Selection and Feature Engineering
What is Feature engineering?
• Feature engineering is the process of selecting, manipulating, and transforming raw data into features that better represent the underlying problem and thus improve the accuracy of the machine learning model.

• Feature Engineering is the process of creating new features or transforming existing features to improve the performance of a machine-learning model. It involves selecting relevant information from raw data and transforming it into a format that can be easily understood by a model. The goal is to improve model accuracy by providing more meaningful and relevant information.
Features
• In the context of machine learning, a feature (also known as a variable
or attribute) is an individual measurable property or characteristic of a
data point that is used as input for a machine learning algorithm.
• Features can be numerical, categorical, or text-based
• For example, in a dataset of housing prices, features could include the
number of bedrooms, the square footage, the location, and the age of
the property. In a dataset of customer demographics, features could
include age, gender, income level, and occupation.
• The choice and quality of features are critical in machine learning, as
they can greatly impact the accuracy and performance of the model.
Types of features
• Generally, all machine learning algorithms take input data to generate the output. The input data is usually in tabular form, consisting of rows (instances or observations) and columns (variables or attributes), and these attributes are often known as features.
• In feature engineering, two types of features are commonly used: numerical
and categorical.
• Numerical features are continuous values that can be measured on a scale.
Examples of numerical features include age, height, weight, and income.
• Categorical features are discrete values that can be grouped into categories.
Examples of categorical features include gender, color, and zip code.
Categorical features typically need to be converted to numerical features
before they can be used in machine learning algorithms.
• Feature engineering in ML contains mainly four processes:
• Feature Creation
• Feature Transformation
• Feature Extraction
• Feature Selection.
Feature Creation
• Feature Creation is the process of generating new features based on
domain knowledge or by observing patterns in the data.
• For example, if you have a dataset with the birth date of individuals, you can create a new feature, age, by subtracting the birth date from the current date.
• This technique is advantageous when the existing features do not
provide enough information to train and develop a robust machine-
learning model or when the existing features are not in a usable
format for an ML model development.
• Types of Feature Creation:
• Domain-Specific: Creating new features based on domain knowledge,
such as creating features based on business rules or industry
standards.
• Data-Driven: Creating new features by observing patterns in the data,
such as calculating aggregations or creating interaction features.
• Synthetic: Generating new features by combining existing features or
synthesizing new data points.
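As a brief sketch of data-driven feature creation, assuming pandas and a hypothetical birth_date column (the age calculation uses a simple days-divided-by-365 approximation):

```python
import pandas as pd

# Hypothetical dataset with a birth_date column (illustrative values)
df = pd.DataFrame({"birth_date": ["1990-05-01", "1985-11-23", "2000-02-14"]})
df["birth_date"] = pd.to_datetime(df["birth_date"])

# Feature creation: derive an "age" feature from the birth date
today = pd.Timestamp.today()
df["age"] = (today - df["birth_date"]).dt.days // 365
print(df)
```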
• Feature Transformation is the process of transforming the features
into a more suitable representation for the machine learning model.
• Feature transformation is a mathematical transformation in which we
apply a mathematical formula to a particular column (feature) and
transform the values, which are useful for our further analysis.
• This is done to ensure that the model can effectively learn from the
data.
• Example: applying a log transform to a highly skewed numeric feature.
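A minimal sketch of such a transformation, assuming NumPy/pandas and a made-up, right-skewed income column:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed income column (illustrative values)
df = pd.DataFrame({"income": [25_000, 32_000, 40_000, 120_000, 1_500_000]})

# Log transform compresses the long right tail; log1p also handles zeros safely
df["income_log"] = np.log1p(df["income"])
print(df)
```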
Types of Feature Transformation:
• Normalization: Rescaling the features to have a similar range, such as between
0 and 1, to prevent some features from dominating others.
• Scaling: Scaling is a technique used to transform numerical variables to have a
similar scale, so that they can be compared more easily. Rescaling the features
to have a similar scale, such as having a standard deviation of 1, to make sure
the model considers all features equally.
• Encoding: Transforming categorical features into a numerical representation.
Examples are one-hot encoding and label encoding.
• Transformation: Transforming the features using mathematical operations to
change the distribution or scale of the features. Examples are logarithmic,
square root, and reciprocal transformations.
Normalization
• Min-Max Normalization
• Min-Max normalization scales the data to fit within a specified range,
usually between 0 and 1. The formula for min-max normalization is:
• Normalized value = (value − min) / (max − min)
• For example, consider a dataset containing ages ranging from 20 to
60. If we want to scale the ages using min-max normalization, an age
of 20 would be scaled to 0 and an age of 60 would be scaled to 1. An
age of 40 would be scaled to 0.5, sitting directly in the middle of the
new scale.
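A minimal sketch of this, assuming scikit-learn's MinMaxScaler and the ages from the example above:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

ages = np.array([[20.0], [40.0], [60.0]])   # ages from the example above
scaler = MinMaxScaler()                     # rescales each column to [0, 1]
print(scaler.fit_transform(ages).ravel())   # [0.  0.5 1. ]
```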
Normalization
• Z-score normalization, or standardization, scales the data so that it has a
mean of 0 and a standard deviation of 1. The formula for z-score
normalization is:
• Normalized value = (value − mean) / standard deviation
• For example, consider a dataset containing test scores from a class of
students. The scores range from 50 to 100 with a mean of 75 and a
standard deviation of 10. A score of 75 would be scaled to 0 (since it’s
the mean), a score of 85 would be scaled to 1 (one standard deviation
above the mean), and a score of 65 would be scaled to -1 (one standard
deviation below the mean).
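A small sketch of the z-score calculation, plugging in the mean (75) and standard deviation (10) stated in the example rather than estimating them from data:

```python
import numpy as np

scores = np.array([65.0, 75.0, 85.0])
mean, std = 75.0, 10.0                 # values stated in the example above
z_scores = (scores - mean) / std
print(z_scores)                        # [-1.  0.  1.]
```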
• Decimal Scaling Normalization
• This technique scales the data by moving the decimal point of values.
The number of decimal places moved depends on the maximum
absolute value in the dataset.
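A rough sketch of decimal scaling with NumPy; the sample values are made up, and the scaling factor is taken as the smallest power of 10 that brings the largest absolute value below 1:

```python
import numpy as np

x = np.array([120.0, -450.0, 3150.0, 78.0])   # made-up sample values

# Smallest power of 10 that brings the largest absolute value below 1
j = int(np.floor(np.log10(np.abs(x).max()))) + 1
print(j, x / 10**j)                            # scaled values now lie in (-1, 1)
```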
• Suppose we have a dataset of house prices, where the house sizes are
measured in square feet and range from 500 to 5000. The prices of
the houses range from $50,000 to $500,000. These two features (size
and price) are on vastly different scales. Normalization can help bring
these features onto a similar scale, typically in the range of 0 to 1.
• In machine learning, normalization is critical to ensure that all inputs
are treated equally. For instance, consider a machine learning model
that predicts house prices based on features like house size and the
number of bedrooms. If the house size is measured in square feet and
ranges from 500 to 5000, and the number of bedrooms ranges from 1
to 5, the model might unduly focus on the house size simply because
its values are larger. Normalizing these features can help ensure the
model treats both inputs equally.
• Feature Scaling is a technique to standardize the independent
features present in the data in a fixed range. It is performed during
the data pre-processing to handle highly varying magnitudes or values
or units.
• Absolute Maximum Scaling:
• This method of scaling requires two steps:
• We should first select the maximum absolute value out of all the entries
of a particular measure.
• Then after this, we divide each entry of the column by this maximum
value.
• After performing the above-mentioned two steps we will observe that
each entry of the column lies in the range of -1 to 1.
• Min-Max Scaling
• This method of scaling requires the following two steps:
• First, we are supposed to find the minimum and the maximum value
of the column.
• Then we will subtract the minimum value from the entry and divide
the result by the difference between the maximum and the minimum
value.
• As this method uses the maximum and the minimum values, it is also prone to outliers, but after performing the above two steps the data will lie in the range 0 to 1.
• Standardization
• This method of scaling is basically based on the central tendencies
and variance of the data.
• First, we should calculate the mean and standard deviation of the
data we would like to normalize.
• Then we are supposed to subtract the mean value from each entry
and then divide the result by the standard deviation.
• This gives data with a mean equal to zero and a standard deviation equal to 1 (it centres and rescales the data, but does not by itself turn a skewed distribution into a normal one).
Robust Scaling
• In this method of scaling, we use two main statistical measures of the data:
• Median
• Inter-Quartile Range (IQR)
• After calculating these two values, we subtract the median from each entry and then divide the result by the interquartile range.
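A compact sketch comparing the four scaling methods described above, assuming scikit-learn's MaxAbsScaler, MinMaxScaler, StandardScaler, and RobustScaler on one made-up column containing an outlier:

```python
import numpy as np
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler,
                                   RobustScaler, StandardScaler)

# One numeric column with an outlier, to compare the four scalers described above
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

for scaler in (MaxAbsScaler(), MinMaxScaler(), StandardScaler(), RobustScaler()):
    print(type(scaler).__name__, scaler.fit_transform(x).ravel().round(3))
```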
Feature Extraction
• Feature extraction entails constructing new features that retain the key information from the original data but in a more efficient form, transforming raw data into a set of numerical features that a computer program can easily understand and use.
• When working with huge datasets, particularly in fields such as image processing, natural language processing, and signal processing, it is common to encounter data containing many characteristics, many of which may be useless or redundant. Feature extraction simplifies the data; the extracted features capture the essential characteristics of the original data, allowing for more efficient processing and analysis.
Different types of Techniques for
Feature Extraction
• Statistical Methods
• Statistical methods are widely used in feature extraction to summarize and
explain patterns of data. Common data attributes include:
• Mean: The average value of a dataset.
• Median: The middle value of a dataset when it is sorted in ascending order.
• Standard Deviation: A measure of the spread or dispersion of a sample.
• Correlation and Covariance: Measures of the linear relationship between
two or more factors.
• Regression Analysis: A way to model the link between a dependent variable
and one or more independent factors.
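A short sketch of these summary statistics, assuming pandas and a made-up two-column dataset:

```python
import pandas as pd

# Made-up dataset with two numeric features
df = pd.DataFrame({"height_cm": [160, 172, 181, 168, 175],
                   "weight_kg": [55, 70, 82, 63, 74]})

print(df.mean())      # mean of each column
print(df.median())    # median of each column
print(df.std())       # standard deviation (spread) of each column
print(df.corr())      # Pearson correlation between the two features
```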
Dimensionality Reduction Methods
for feature extraction
• Dimensionality reduction is the process of reducing the number of
features (or dimensions) in a dataset while retaining as much
information as possible.
• It is a process of transforming high-dimensional data into a lower-dimensional space that still preserves the essence of the original data.
• Dimensionality reduction can help to mitigate the problems of working with high-dimensional data by reducing the complexity of the model and improving its generalization performance.
• There are two main approaches to dimensionality reduction: feature
selection and feature extraction.
Feature selection
• Feature selection is a process that chooses a subset of features from
the original features so that the feature space is optimally reduced
according to a certain criterion.
• The goal is to reduce the dimensionality of the dataset while retaining
the most important features.
• There are several methods for feature selection, including
• filter methods
• wrapper methods
• embedded methods.
Filter Methods
• These methods are generally used while doing the pre-processing
step.
• These methods select features from the dataset irrespective of the
use of any machine learning algorithm
• In terms of computation, they are very fast and inexpensive and are
very good for removing duplicated, correlated, redundant features
• Each feature is evaluated individually, which can work well when features are informative in isolation (do not depend on other features) but lags behind when a combination of features would increase the overall performance of the model.
• Information Gain – It is defined as the amount of information provided
by the feature for identifying the target value and measures reduction
in the entropy values. Information gain of each attribute is calculated
considering the target values for feature selection.
• Chi-square test — Chi-square method (X2) is generally used to test the
relationship between categorical variables. It compares the observed
values from different attributes of the dataset to its expected value.
• Fisher’s Score – Fisher’s Score selects each feature independently
according to their scores under Fisher criterion leading to a suboptimal
set of features. The larger the Fisher’s score is, the better is the selected
feature.
• Correlation Coefficient – Pearson’s Correlation Coefficient is a measure
of quantifying the association between the two continuous variables
and the direction of the relationship with its values ranging from -1 to 1.
• Variance Threshold – It is an approach where all features are removed whose variance
doesn’t meet the specific threshold. By default, this method removes features having zero
variance. The assumption made using this method is higher variance features are likely to
contain more information.
• Mean Absolute Difference (MAD) – This method is similar to variance threshold method
but the difference is there is no square in MAD. This method calculates the mean absolute
difference from the mean value.
• Dispersion Ratio – Dispersion ratio is defined as the ratio of the Arithmetic mean (AM) to
that of Geometric mean (GM) for a given feature. Its value ranges from +1 to ∞ as AM ≥
GM for a given feature. Higher dispersion ratio implies a more relevant feature.
• Mutual Dependence – This method measures if two variables are mutually dependent, and
thus provides the amount of information obtained for one variable on observing the other
variable. Depending on the presence/absence of a feature, it measures the amount of
information that feature contributes to making the target prediction.
• Relief – This method measures the quality of attributes by randomly sampling an instance from the dataset and updating each feature's weight based on the difference between the selected instance and its two nearest instances, one of the same class and one of the opposite class.
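A small sketch of two of these filter techniques, assuming scikit-learn and its built-in Iris dataset; the values of k and the variance threshold are arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, VarianceThreshold, chi2

X, y = load_iris(return_X_y=True)

# Chi-square filter: keep the 2 features most associated with the target
X_chi2 = SelectKBest(score_func=chi2, k=2).fit_transform(X, y)

# Variance threshold: drop features whose variance falls below 0.5
X_var = VarianceThreshold(threshold=0.5).fit_transform(X)

print(X.shape, X_chi2.shape, X_var.shape)
```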
Embedded methods
• In embedded methods, the feature selection algorithm is blended as
part of the learning algorithm, thus having its own built-in feature
selection methods.
• Embedded methods overcome the drawbacks of filter and wrapper methods and merge their advantages: they are nearly as fast as filter methods, more accurate than filter methods, and take combinations of features into consideration as well.
• Some techniques used are:
• Regularization – This method adds a penalty to different parameters of
the machine learning model to avoid over-fitting of the model. This
approach of feature selection uses Lasso (L1 regularization) and Elastic
nets (L1 and L2 regularization). The penalty is applied over the
coefficients, thus bringing down some coefficients to zero. The features
having zero coefficient can be removed from the dataset.
• Tree-based methods – Methods such as Random Forest and Gradient Boosting provide feature importance as a way to select features. Feature importance tells us which features have a greater impact on the target feature.
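A small sketch of both embedded approaches, assuming scikit-learn and its built-in diabetes dataset; the Lasso alpha value is an arbitrary illustration:

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

# L1 regularization (Lasso) drives some coefficients exactly to zero;
# features with zero coefficients can be dropped. alpha=0.5 is arbitrary.
lasso = Lasso(alpha=0.5).fit(X, y)
print("Coefficients set to zero:", int((lasso.coef_ == 0).sum()))

# Tree-based importance: higher values indicate more influential features
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print("Feature importances:", forest.feature_importances_.round(3))
```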
Feature Extraction
• It involves creating new features by combining or transforming the
original features.
• The goal is to create a set of features that captures the essence of the
original data in a lower-dimensional space.
• There are several methods for feature extraction, including
• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)
• Generalized Discriminant Analysis (GDA)
Principal Component Analysis (PCA)
• The main goal of Principal Component Analysis (PCA) is to reduce the dimensionality of a dataset while preserving the most important patterns or relationships between the variables, without any prior knowledge of the target variables.
• Principal Component Analysis (PCA) is a technique for dimensionality
reduction that identifies a set of orthogonal axes, called principal
components, that capture the maximum variance in the data.
• The principal components are linear combinations of the original
variables in the dataset and are ordered in decreasing order of
importance.
• The total variance captured by all the principal components is equal to
the total variance in the original dataset.
• The first principal component captures the most variation in the data, while the second principal component captures the maximum variance that is orthogonal to the first, and so on.
• Diagram from textbook
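A minimal PCA sketch, assuming scikit-learn and its built-in Iris dataset, keeping the first two principal components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Project the 4 original features onto the first 2 principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

print(X_pca.shape)                    # (150, 2)
print(pca.explained_variance_ratio_)  # share of variance captured by each component
```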
Feature Engineering on Numerical Data, Categorical Data, and Text Data
Feature Transformation
• One-Hot Encoding
• Binning
• Binarization
Encoding in machine learning
• Transforming categorical features into a numerical representation.
• Categorical features are generally divided into 3 types:
• A. Binary: Either/or
Examples:
• Yes, No
• True, False
• B. Ordinal: Specific ordered Groups.
Examples:
• low, medium, high
• cold, hot, lava Hot
• C. Nominal: Unordered Groups. Examples
• cat, dog, tiger
• pizza, burger, coke
Label Encoding
• Label Encoding is a technique used to convert categorical columns into numerical ones so that they can be used by machine learning models, which only take numerical data.
• e.g., for a column income level having elements low, medium, or high, we can replace these elements with 1, 2, 3, where 1 represents 'low', 2 'medium', and 3 'high'.
• Through this type of encoding, we try to preserve the meaning of the
element where higher weights are assigned to the elements having
higher priority.
• Suppose we have a column Height in some dataset that has elements as Tall,
Medium, and short. To convert this categorical column into a numerical
column we will apply label encoding to this column. After applying label
encoding, the Height column is converted into a numerical column having
elements 0,1, and 2 where 0 is the label for tall, 1 is the label for medium,
and 2 is the label for short height.

Height    Height (encoded)
Tall      0
Medium    1
Short     2
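A sketch that reproduces the table above with an explicit pandas mapping; note that scikit-learn's LabelEncoder would instead assign codes alphabetically (Medium=0, Short=1, Tall=2), so an explicit mapping is used to preserve the intended order:

```python
import pandas as pd

df = pd.DataFrame({"Height": ["Tall", "Medium", "Short"]})

# Explicit mapping reproducing the table above; scikit-learn's LabelEncoder
# would instead assign codes alphabetically (Medium=0, Short=1, Tall=2)
mapping = {"Tall": 0, "Medium": 1, "Short": 2}
df["Height_encoded"] = df["Height"].map(mapping)
print(df)
```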
One Hot Encoding
• It is a method for converting categorical variables into a binary
format.
• It creates new binary columns (0s and 1s) for each category in the
original variable.
• Each category in the original column is represented as a separate
column, where a value of 1 indicates the presence of that category,
and 0 indicates its absence.
Wherever the fruit is “Apple,” the Apple column will have a value of 1,
while the other fruit columns (like Mango or Orange) will contain 0.
One-Hot Encoding - Example
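A minimal one-hot encoding sketch, assuming pandas and a made-up Fruit column:

```python
import pandas as pd

df = pd.DataFrame({"Fruit": ["Apple", "Mango", "Orange", "Apple"]})

# One binary column per category: 1 marks presence of that category, 0 its absence
one_hot = pd.get_dummies(df["Fruit"], dtype=int)
print(one_hot)
```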
Binarization
• Often, raw numeric frequencies or counts are not necessary in building models, especially for methods used in building recommender engines; merely using raw frequencies or counts is not good practice.
• Suppose our task is to build a recommender to recommend songs to users. One
component of the recommender might predict how much a user will enjoy a
particular song.
• In this case, the raw listen count is not a robust measure of user taste. Why ??
Users have different listening habits. Some people might put their favorite songs
on infinite loop, while others might savor them only on special occasions. We
can’t necessarily say that someone who listens to a song 20 times must like it
twice as much as someone else who listens to it 10 times.
• In this case, a binary feature is preferred as opposed to a count-based feature. A more robust representation of user preference is to binarize the count and clip all counts greater than 1 to 1. In other words, if the user listened to a song at least once, the feature value is 1; otherwise it is 0.
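A short sketch of binarizing made-up listen counts, assuming scikit-learn's Binarizer (with threshold 0, any count greater than zero becomes 1):

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical raw listen counts for five users
listen_counts = np.array([[0], [1], [7], [20], [3]])

# Any count greater than 0 becomes 1: "has the user listened at least once?"
binarized = Binarizer(threshold=0).fit_transform(listen_counts)
print(binarized.ravel())   # [0 1 1 1 1]
```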
Binning
• Often when working with numeric data, you might come across
features or attributes which depict raw measures such as values or
frequencies.
• In many cases, often the distributions of these attributes are skewed
in the sense that some sets of values will occur a lot and some will be
very rare.
• Besides that, there is also the added problem of varying range of
these values.
Binning
• Suppose we are talking about song or video view counts. In some
cases, the view counts will be abnormally large and in some cases
very small. Directly using these features in modeling might cause
issues. For example, if we are calculating similarity, A large count in
one element of the data vector would outweigh the similarity in all
other elements, which could throw off the entire similarity
measurement for that feature vector.
• One solution is to contain the scale by quantizing the count. In other
words, we group the counts into bins, and get rid of the actual count
values. Quantization maps a continuous number to a discrete one. We
can think of the discretized numbers as an ordered sequence of bins
that represent a measure of intensity.
• Data binning, or bucketing, is a process used to minimize the effects of observation errors. It is the process of transforming numerical variables into their categorical counterparts.
• In other words, binning will take a column with continuous numbers and place the numbers in "bins" based on ranges that we determine. This will give us a new categorical variable feature.
• This algorithm divides the continuous variable into several categories having bins or ranges of the same width. Let x be the number of categories and max and min be the maximum and minimum values in the concerned column. Then the width (w) will be:
• w = ⌊(max − min) / x⌋
• and the categories will be:
• [min, min + w − 1], [min + w, min + 2w − 1], [min + 2w, min + 3w − 1], ..., [min + (x − 1)w, max]
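A small sketch of binning made-up view counts with pandas: pd.cut performs the equal-width binning described above, while pd.qcut shows the common quantile-based alternative for skewed counts:

```python
import pandas as pd

view_counts = pd.Series([3, 15, 90, 350, 1200, 52000])   # made-up view counts

# Equal-width binning into 4 bins (integer bin labels replace the raw counts)
equal_width = pd.cut(view_counts, bins=4, labels=False)

# Quantile-based binning is a common alternative for heavily skewed counts
quantile_bins = pd.qcut(view_counts, q=3, labels=False)

print(equal_width.tolist(), quantile_bins.tolist())
```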
Outlier detection

• Z-Score Method
• The Z-score of a data point is calculated as the number of standard
deviations it falls away from the mean of the dataset. Z-score
represents a data point’s distance from the mean in terms of the
standard deviation. Mathematically, the Z-score for a data point x is
calculated as -
• Z-score = (x − mean) / standard deviation
• Using Z-Score, we can define the upper and lower bounds of a
dataset. A data point with a Z-score greater than a certain threshold
(usually 2.5 or 3) is considered an outlier.
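A quick sketch of z-score outlier detection with NumPy on made-up data, using a threshold of 2:

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 95])     # 95 looks like an outlier
z_scores = (data - data.mean()) / data.std()

threshold = 2.0                               # 2.5 or 3 are also common
print(data[np.abs(z_scores) > threshold])     # [95]
```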
Feature Selection
• Feature selection in machine learning refers to selecting essential
features from all features and discarding the useless ones. Having
redundant variables present in data reduces the model's
generalization capability and may also reduce the overall accuracy of
a classifier.
• Moreover, having too many redundant features increases the training
time of the model as well. Hence it is essential to identify and select
the most appropriate features from the data and remove the
irrelevant or less important features.
• Feature selection in machine learning helps in selecting important
features and removing useless ones.
• Removing less important features improves the model's
generalization capabilities. The model can focus on only vital features
and not try to learn misleading patterns in less important features.
• Feature selection in machine learning helps in improving the training
time of the model.
• Sometimes, it also helps in improving the performance of the model.
• In addition, feature selection in machine learning makes code
building, debugging, maintenance, and understanding easier.
• 1. Filter Method:
• In the Filter Method, features are selected based on statistical measures. The filter method filters out the model's irrelevant and redundant features by ranking them using different metrics.
• Filter methods need low computational time and do not overfit the
data. Some standard techniques of Filter methods are as follows:
• Chi-square Test: The chi-square value is calculated between each
feature and the target variable, and the desired number of features
with the best chi-square value is selected.
Wrapper methods
• Wrapper methods, also referred to as greedy algorithms, train the algorithm using a subset of features in an iterative manner.
• Based on the conclusions drawn from the previously trained model, features are added or removed.
• Stopping criteria for selecting the best subset are usually pre-defined
by the person training the model such as when the performance of
the model decreases or a specific number of features has been
achieved.
• The main advantage of wrapper methods over the filter methods is
that they provide an optimal set of features for training the model,
thus resulting in better accuracy than the filter methods but are
computationally more expensive.
• Some techniques used are:
• Forward selection – This method is an iterative approach where we initially start with an empty set of features and keep adding the feature that best improves our model after each iteration. The stopping criterion is reached when adding a new variable no longer improves the performance of the model.
• Backward elimination – This method is also an iterative approach where we initially start with all features and, after each iteration, remove the least significant feature. The stopping criterion is reached when no further improvement in the performance of the model is observed after removing a feature.
• Bi-directional elimination – This method uses both forward selection and
backward elimination technique simultaneously to reach one unique solution.
• Recursive elimination – This greedy optimization method selects features by recursively considering smaller and smaller sets of features. The estimator is trained on an initial set of features, and their importance is obtained from the feature_importances_ attribute. The least important features are then removed from the current set of features until we are left with the required number of features.
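A minimal sketch of recursive feature elimination, assuming scikit-learn's RFE on its built-in breast-cancer dataset; the estimator and the number of features to keep are arbitrary choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Recursive feature elimination: repeatedly drop the least important feature
estimator = LogisticRegression(max_iter=5000)
rfe = RFE(estimator, n_features_to_select=5).fit(X, y)

print("Selected feature indices:", [i for i, keep in enumerate(rfe.support_) if keep])
```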
Embedded methods
• In embedded methods, the feature selection algorithm is blended as
part of the learning algorithm, thus having its own built-in feature
selection methods.
• Embedded methods overcome the drawbacks of filter and wrapper methods and merge their advantages: they are nearly as fast as filter methods, more accurate than filter methods, and take combinations of features into consideration as well.
Model Selection
• It is the process of choosing the best-suited model for a particular problem.
• Selecting a model depends on various factors such as the dataset, the task, the nature of the model, etc.
• A model can be selected based on:
• Type of data available:
• Images and videos - CNN
• Text data or speech data - RNN
• Numerical data - SVM, logistic regression, decision trees, etc.
• Based on the task we need to carry out:
• Classification (classify a data point into one of the classes) - SVM (higher dimensions), logistic regression (binary class), decision trees
• Regression - linear regression, random forest, polynomial regression, etc.
• Clustering - K-Means clustering, hierarchical clustering, etc.
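A brief sketch of comparing candidate models for a classification task with 5-fold cross-validation, assuming scikit-learn; the three candidates and the Iris dataset are only for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Compare candidate models on the same task with 5-fold cross-validation
candidates = {"LogisticRegression": LogisticRegression(max_iter=1000),
              "SVM": SVC(),
              "DecisionTree": DecisionTreeClassifier(random_state=0)}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```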
Example: Linear Regression
• Advantages: Very simple to implement
• Performs well on data with a linear relationship
• Disadvantages:
• Not suitable for data having a non-linear relationship
• Underfitting issue
• Sensitive to outliers
Example: Logistic Regression
Advantages:
Easy to implement
Performs well on data with a linear relationship

Disadvantages:
High-dimensional datasets cause overfitting
Difficult to capture complex relationships in a dataset
Sensitive to outliers
Needs a larger dataset
Example: Decision Tree
Advantages: Can be used for both classification and regression
Easy to interpret
No need for normalization or scaling
Not sensitive to outliers

Disadvantages: Overfitting issue
Small changes in the data alter the tree structure, causing instability
Training time is relatively higher
