Unit No: 4
Basics of Feature Engineering (3170724)
Key drivers
Measures and overall process
Feature Construction
New features are created from a combination
of original features.
Some of the commonly used operators for
combining the original features include:
For Boolean features: Conjunction, Disjunction, Negation, etc.
For nominal features: Cartesian product, M of N, etc.
For numerical features: Min, Max, Addition, Subtraction, Multiplication, Division, Average, Equivalence, Inequality, etc.
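A minimal pandas sketch of feature construction (the data frame and column names are invented for illustration, not taken from the slides):

import pandas as pd

# Hypothetical housing-style data; all column names are assumptions.
df = pd.DataFrame({
    "length_ft": [20, 30, 25],
    "breadth_ft": [10, 15, 12],
    "is_corner_plot": [True, False, True],
    "has_garden": [False, False, True],
})

# Numerical construction: multiplication and division of original features.
df["area_sqft"] = df["length_ft"] * df["breadth_ft"]
df["aspect_ratio"] = df["length_ft"] / df["breadth_ft"]

# Boolean construction: conjunction and disjunction of original features.
df["corner_with_garden"] = df["is_corner_plot"] & df["has_garden"]
df["corner_or_garden"] = df["is_corner_plot"] | df["has_garden"]

print(df)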
Feature Extraction
Say we have a data set with a feature set Fi (F1, F2, …, Fn). After feature extraction using a mapping function f(F1, F2, …, Fn), we will have a new set of features F′ (F′1, F′2, …, F′m) such that F′i = f(Fi) and m < n.
Principal Component Analysis (PCA)
Every data set has multiple attributes or dimensions, many of which may be similar to each other.
For example, a housing data set with 5 dimensions can be reduced to 2 features.
A 1-D sample can be represented on a line.
2-D data can be represented in a 2-D graph.
3-D data can be represented in a 3-D plot.
Visualizing 4-D data is difficult.
The worked example below uses a 2-D data set with two features, Gene 1 and Gene 2, measured across several samples.
Step 1: Plot the data, with Gene 1 on the x-axis and Gene 2 on the y-axis.
Step 2: Find the average values of Gene 1 and Gene 2 and plot that center point.
Step 3: Shift the data so that the center of the data lies on the origin.
Step 4: Find a line that best fits the data.
We start by drawing a random line through the origin and projecting the data points onto that line.
Now we rotate that line. The goal is either to minimize the distances between the actual points and their projected points, or equivalently to maximize the distances between the origin and the projected points.
We will calculate the distances between the origin and the projected points.
We calculate the sum of the squared distances; the best-fitting line (PC1) is the one that maximizes this sum.
PC1 has a slope of 0.25, i.e. for every four units along the x-axis (Gene 1), the data rises only one unit along the y-axis (Gene 2).
We can find the length of this direction vector using the Pythagorean theorem:
length = sqrt(4² + 1²) = sqrt(17) ≈ 4.12
We can convert the eigenvalue (the sum of squared distances) into the variation around the origin by dividing it by the sample size minus 1.
Step 5: Find the eigenvector and eigenvalue.
When we do PCA with SVD, the direction vector is scaled to unit length; the component values change, but their ratio remains the same.
The 1-unit-long vector, consisting of 0.97 parts Gene 1 and 0.242 parts Gene 2, is called the eigenvector for PC1.
The sum of squared distances is called the eigenvalue for PC1.
The equation relating eigenvalues and eigenvectors is A v = λ v, where:
A = the (covariance) matrix being decomposed
v = eigenvector
λ = the scalar quantity termed the eigenvalue
Step 6: Now find PC2.
Since this is 2-D data, PC2 is the line through the origin that is perpendicular to PC1.
After normalizing the PC2 direction to unit length, we finally obtain our PC2.
Now that we have both PC1 and PC2, we finally project our points onto PC1 and PC2.
Then, to draw the final PCA plot, we simply rotate the graph so that PC1 is horizontal, and we use the projected points to find where the samples go on the PCA plot.
Assume for this example that the variation for PC1 = 15 and for PC2 = 3.
So the total variation across both PCs is 18.
PC1 accounts for 83% of the total variation (15/18).
PC2 accounts for 17% of the total variation (3/18).
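As an illustration (not part of the slides), the same quantities can be obtained with scikit-learn's PCA on synthetic 2-D "gene" data; the data values below are invented for demonstration:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic 2-D data ("Gene 1" and "Gene 2") with correlated features.
gene1 = rng.normal(10, 2, 20)
gene2 = 0.25 * gene1 + rng.normal(0, 0.5, 20)
X = np.column_stack([gene1, gene2])

pca = PCA(n_components=2)
pca.fit(X)                            # centering is done internally
print(pca.components_)                # unit-length eigenvectors (PC1, PC2)
print(pca.explained_variance_)        # variation along each PC (eigenvalue / (n - 1))
print(pca.explained_variance_ratio_)  # analogous to the 15/18 vs 3/18 split above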
Calculating PCA for 3-D data
We calculate the PCs in the same way as for 2-D data:
First, center the data.
Find PC1; the best-fitting line now has 3 components.
Then find PC2, perpendicular to PC1.
Then find PC3, perpendicular to both PC1 and PC2.
Now we find the total variation of the PCs.
Working of PCA
1. First, calculate the covariance matrix of the data set.
2. Then, calculate the eigenvalues of the covariance matrix.
3. The eigenvector with the highest eigenvalue represents the direction of highest variance; this identifies the first principal component.
4. The eigenvector with the next highest eigenvalue represents the direction with the highest remaining variance that is also orthogonal to the first direction; this identifies the second principal component.
5. In this way, identify the top k eigenvectors with the top k eigenvalues to obtain the k principal components.
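The following sketch (not from the slides) implements these five steps with NumPy; the random 5-D data and the choice k = 2 are arbitrary:

import numpy as np

def pca_via_covariance(X, k):
    # Step 0: center the data (shift so the mean is at the origin).
    Xc = X - X.mean(axis=0)
    # Step 1: covariance matrix of the data set.
    cov = np.cov(Xc, rowvar=False)
    # Step 2: eigenvalues and eigenvectors of the covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigh: symmetric matrix
    # Steps 3-5: sort by decreasing eigenvalue and keep the top k eigenvectors.
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:k]]
    # Project the centered data onto the k principal components.
    return Xc @ components, eigvals[order[:k]]

# Example: random 5-D data reduced to 2 principal components.
X = np.random.default_rng(1).normal(size=(100, 5))
scores, top_eigvals = pca_via_covariance(X, k=2)
print(scores.shape, top_eigvals)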
Singular Value Decomposition (SVD)
Singular value decomposition (SVD) is a matrix factorization technique commonly used in linear algebra.
The SVD of a matrix A (m × n) is a factorization of the form:
A = U S Vᵀ
where U and V are orthonormal matrices: U is an m × m unitary matrix, V is an n × n unitary matrix, and S is an m × n rectangular diagonal matrix. The diagonal entries of S are known as the singular values of matrix A. The columns of U and V are called the left-singular and right-singular vectors of matrix A, respectively.
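A minimal NumPy sketch of this factorization (the random matrix is an arbitrary example, not data from the slides):

import numpy as np

A = np.random.default_rng(2).normal(size=(6, 4))   # an m x n data matrix

# Full SVD: A = U @ S @ Vt, with U (m x m), Vt (n x n), and the singular
# values of A on the diagonal of S.
U, s, Vt = np.linalg.svd(A, full_matrices=True)
S = np.zeros(A.shape)
np.fill_diagonal(S, s)

print(np.allclose(A, U @ S @ Vt))   # True: the factorization reconstructs A

# Keeping only the largest k singular values gives a rank-k approximation,
# which is how SVD is used for dimensionality reduction.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]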
The SVD of a data matrix has properties that make it useful for feature extraction: in particular, keeping only the largest singular values (and their corresponding singular vectors) yields a low-rank approximation of A, which is how the dimensionality of the data is reduced.
Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) is another commonly used feature extraction technique, like PCA or SVD. The objective of LDA is similar in the sense that it also intends to transform a data set into a lower-dimensional feature space.
However, unlike PCA, the focus of LDA is not on capturing the variability of the data set. Instead, LDA focuses on class separability, i.e. separating the features based on class so as to avoid over-fitting of the machine learning model.
Working of LDA
Similar to PCA, LDA creates a new axis onto which the data is projected in a lower dimension.
If the data is simply projected onto an existing axis, the dimensionality is reduced but the classes are not separated. So LDA instead separates the data along a new axis, chosen as follows:
Step 1: Maximize the distance between the means of the categories.
Step 2: Minimize the variation (scatter, s²) within each category.
Step 3: Optimize both at once by maximizing the ratio of the squared distance between the means to the total scatter, i.e. (μ1 − μ2)² / (s1² + s2²).
LDA for a 3-D data set
LDA for 3 categories
Difference between LDA and PCA
PCA is unsupervised: it finds the directions of maximum variance without using class labels. LDA is supervised: it uses the class labels to find the directions that maximize class separability.
Below are the steps to be followed:
1. Calculate the mean vectors for the individual classes.
2. Calculate the intra-class and inter-class scatter matrices.
3. Calculate the eigenvalues and eigenvectors for S_W⁻¹ S_B, where S_W is the intra-class (within-class) scatter matrix and S_B is the inter-class (between-class) scatter matrix:
S_W = Σ_i Σ_{x in class i} (x − m_i)(x − m_i)ᵀ
S_B = Σ_i N_i (m_i − m)(m_i − m)ᵀ
Here m_i is the sample mean of the i-th class, m is the overall mean of the data set, and N_i is the sample size of class i.
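A minimal NumPy sketch of these steps (the two synthetic Gaussian classes are an arbitrary example, not data from the slides):

import numpy as np

def lda_directions(X, y, k=1):
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)                   # m: overall mean
    n_features = X.shape[1]
    S_W = np.zeros((n_features, n_features))        # intra-class scatter
    S_B = np.zeros((n_features, n_features))        # inter-class scatter
    for c in classes:
        Xc = X[y == c]
        m_i = Xc.mean(axis=0)                       # class mean m_i
        S_W += (Xc - m_i).T @ (Xc - m_i)
        diff = (m_i - overall_mean).reshape(-1, 1)
        S_B += Xc.shape[0] * (diff @ diff.T)        # weighted by N_i
    # Eigenvectors of S_W^-1 S_B give the discriminant directions.
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs[:, order[:k]].real

# Example: two Gaussian classes in 3-D, reduced to 1 discriminant axis.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(3, 1, (50, 3))])
y = np.array([0] * 50 + [1] * 50)
W = lda_directions(X, y, k=1)
X_proj = X @ W        # data projected onto the new axis
print(W.shape, X_proj.shape)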
Issues in high-dimensional data
With a very large number of features, model training becomes computationally expensive and the model is more prone to over-fitting; this motivates feature selection.
The objectives of feature selection are to build faster and more cost-effective (less computationally intensive) models, to improve model performance, and to gain a better understanding of the underlying process that generated the data.
Correlation-based similarity measure
Correlation is a measure of linear dependency between two features; features that are highly correlated with each other carry largely redundant information, so one of them can be dropped.
Distance-based similarity measures
Euclidean distance: the Euclidean distance between two features F1 and F2 (measured over n instances) is calculated as:
d(F1, F2) = sqrt( Σ_i (F1_i − F2_i)² )
Minkowski distance: a more generalized form of the Euclidean distance:
d(F1, F2) = ( Σ_i |F1_i − F2_i|^r )^(1/r)
Minkowski distance takes the form of the Euclidean distance when r = 2.
Manhattan distance: at r = 1, it takes the form of the Manhattan distance, as shown below:
d(F1, F2) = Σ_i |F1_i − F2_i|
Hamming distance: a special case of the Manhattan distance, used more frequently to calculate the distance between binary vectors.
For example, the Hamming distance between the two vectors 01101011 and 11001001 is 3, since they differ in three bit positions.
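A small Python sketch (not from the slides) computing these distances; the example vectors F1 and F2 are arbitrary, while the binary vectors are the ones from the Hamming example above:

import numpy as np

def minkowski(f1, f2, r):
    # General Minkowski distance; r = 2 gives Euclidean, r = 1 gives Manhattan.
    return np.sum(np.abs(np.asarray(f1) - np.asarray(f2)) ** r) ** (1.0 / r)

F1 = [1.0, 2.0, 3.0, 4.0]
F2 = [2.0, 4.0, 4.0, 6.0]
print(minkowski(F1, F2, r=2))   # Euclidean distance
print(minkowski(F1, F2, r=1))   # Manhattan distance

# Hamming distance between the two binary vectors from the slide.
a = np.array(list("01101011"), dtype=int)
b = np.array(list("11001001"), dtype=int)
print(np.sum(a != b))           # 3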
Other similarity measures
Jaccard index/coefficient: for two features having binary values, the Jaccard index is measured as
J = n11 / (n01 + n10 + n11)
where n11 = number of cases where both features have value 1, n01 = number of cases where the first feature is 0 and the second is 1, and n10 = number of cases where the first feature is 1 and the second is 0.
Jaccard distance: d = 1 − J.
Example: consider two features F1 and F2 having values (0, 1, 1, 0, 1, 0, 1, 0) and (1, 1, 0, 0, 1, 0, 0, 0). Here n11 = 2, n01 = 1, n10 = 2, so J = 2/5 = 0.4 and the Jaccard distance d = 0.6.
Simple matching coefficient (SMC): almost the same as the Jaccard coefficient, except that it also includes the cases where both features have a value of 0:
SMC = (n11 + n00) / (n00 + n01 + n10 + n11)
For the example above, n00 = 3, so SMC = (2 + 3) / 8 = 0.625.
Cosine similarity: one of the most popular measures in text classification, calculated as
cos(x, y) = (x · y) / (||x|| ||y||)
where x · y is the dot product of the vectors x and y, and ||x|| and ||y|| are their lengths.
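A small Python/NumPy sketch (not from the slides) computing the Jaccard index, SMC, and cosine similarity for the two binary features used in the example above:

import numpy as np

F1 = np.array([0, 1, 1, 0, 1, 0, 1, 0])
F2 = np.array([1, 1, 0, 0, 1, 0, 0, 0])

n11 = np.sum((F1 == 1) & (F2 == 1))
n00 = np.sum((F1 == 0) & (F2 == 0))
n10 = np.sum((F1 == 1) & (F2 == 0))
n01 = np.sum((F1 == 0) & (F2 == 1))

jaccard = n11 / (n01 + n10 + n11)             # 2 / 5 = 0.4
smc = (n11 + n00) / (n00 + n01 + n10 + n11)   # 5 / 8 = 0.625
cosine = F1 @ F2 / (np.linalg.norm(F1) * np.linalg.norm(F2))

print(jaccard, 1 - jaccard, smc, cosine)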
Overall feature selection process
A typical feature selection process consists of four steps:
1. Generation of possible subsets
2. Subset evaluation
3. Stopping the search based on some stopping criterion
4. Validation of the result
Feature selection approaches
There are four types of approaches for feature selection:
1. Filter approach
2. Wrapper approach
3. Hybrid approach
4. Embedded approach
Filter approach
In the filter approach, the feature subset is selected based on statistical measures that assess the merit of the features from the data perspective. No learning algorithm is employed to evaluate the goodness of the selected features.
Some of the common statistical tests conducted on features as part of the filter approach are Pearson's correlation, information gain, Fisher score, analysis of variance (ANOVA), chi-square, etc.
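As an illustration (not taken from the slides), here is a minimal sketch of the filter approach using scikit-learn's SelectKBest with an ANOVA F-test; the Iris data set and k = 2 are arbitrary choices:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Score each feature with an ANOVA F-test against the class label and keep
# the 2 best ones; no learning algorithm is involved in the selection.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)         # per-feature ANOVA scores
print(selector.get_support())   # boolean mask of the selected features
print(X_selected.shape)         # (150, 2)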
Wrapper approach
In the wrapper approach, identification of the best feature subset is done using the induction algorithm as a black box. The feature selection algorithm searches for a good feature subset using the induction algorithm itself as part of the evaluation function.
Since the learning model is trained and evaluated for every candidate subset, the wrapper approach is computationally very expensive. However, its performance is generally superior to that of the filter approach.
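A minimal sketch of the wrapper idea, assuming scikit-learn is available: a greedy forward search in which each candidate subset is evaluated by cross-validating the induction algorithm (logistic regression here is an arbitrary choice):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

selected, remaining = [], list(range(X.shape[1]))
# Greedy forward selection: at each step add the feature that most improves
# the cross-validated accuracy of the induction algorithm (the "black box").
for _ in range(2):                       # pick 2 features for illustration
    scores = {f: cross_val_score(model, X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    best = max(scores, key=scores.get)
    selected.append(best)
    remaining.remove(best)
    print("selected feature", best, "cv accuracy", round(scores[best], 3))

print("final subset:", selected)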
Hybrid approach
The hybrid approach takes advantage of both the filter and wrapper approaches. A typical hybrid algorithm uses the statistical tests of the filter approach to decide the best subset for a given cardinality, and a learning algorithm to select the final best subset among the best subsets across different cardinalities.
Embedded approach
The embedded approach is quite similar to the wrapper approach in that it also uses an inductive algorithm to evaluate the generated feature subsets. However, the difference is that it performs feature selection and classification simultaneously.
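A minimal sketch of the embedded idea, assuming scikit-learn: an L1-regularized (lasso-style) logistic regression whose training drives the coefficients of unhelpful features to zero, so selection and classification happen in one step; the data set and C value are arbitrary choices:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# L1 regularization zeroes out the coefficients of unhelpful features,
# performing feature selection while the classifier is trained.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
model.fit(X, y)

kept = np.flatnonzero(model.coef_[0])
print("features kept:", kept)
print("features dropped:", X.shape[1] - kept.size)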
Department of CE
Thanks