Unit No: 4 Basics of Feature Engineering (3170724)


Department of CE
ML: Machine Learning

Unit No: 4
Basics of Feature Engineering (3170724)

Prof. Hetvy Jadeja


Outline:
- Feature and feature engineering
- Feature transformation: construction and extraction
- Feature subset selection:
  ~ Issues in high-dimensional data
  ~ Key drivers
  ~ Measures and overall process


What is a feature?
- A feature is an attribute of a data set that is used in a machine learning process. There is a view among certain machine learning practitioners that only those attributes which are meaningful to a machine learning problem should be called features.
- The features in a data set are also called its dimensions, so a data set having 'n' features is called an n-dimensional data set.
- Let's take the example of a famous machine learning data set, Iris.
What is feature engineering?
- Feature engineering refers to the process of translating a data set into features such that these features are able to represent the data set more effectively and result in better learning performance.
- It has two major elements:
1. Feature transformation - transforms the data into a new set of features which can effectively represent the underlying problem.
2. Feature subset selection - derives a subset of features from the full feature set which is most meaningful.
Feature transformation
- There are two variants of feature transformation:
1. Feature construction - discover missing information about the relationships between features and augment the feature space by creating additional features.
2. Feature extraction - extract or create a new set of features from the original set of features using some functional mapping.
Feature construction
- Feature construction involves transforming a given set of input features to generate a new set of more powerful features. Hence, if there are 'n' features or dimensions in a data set, after feature construction 'm' more features or dimensions may get added. So at the end, the data set will become 'n + m' dimensional.
- Let us take some examples to understand feature construction more clearly.
Encoding categorical (nominal) variables
- Say the data set has features age, city of origin, parents athlete (i.e. indicating whether any one of the parents was an athlete) and chance of win. The feature chance of win is a class variable while the others are predictor variables.
- Any machine learning algorithm, whether it is a classification algorithm (like kNN) or a regression algorithm, requires numerical figures to learn from.
- In this case, feature construction can be used to create new dummy features which are usable by machine learning algorithms.
- The conversion of nominal categorical data into numerical data is called one-hot encoding.
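A minimal sketch of one-hot encoding, assuming pandas is available; the data frame below is a hypothetical miniature of the athletes data set, not the one from the slides:

import pandas as pd

# Hypothetical miniature version of the athletes data set
df = pd.DataFrame({
    "age": [18, 20, 23, 19],
    "city_of_origin": ["City A", "City B", "City A", "City C"],
    "parents_athlete": ["Yes", "No", "No", "Yes"],
    "chance_of_win": ["Yes", "Yes", "No", "No"],
})

# One-hot encode the nominal predictors into dummy (0/1) features
encoded = pd.get_dummies(df, columns=["city_of_origin", "parents_athlete"])
print(encoded.head())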
Encoding categorical (ordinal) variables
- Let's take the example of a student data set. Assume that there are three variables - science marks, maths marks and grade, as shown in the figure. As we can see, grade is an ordinal variable with values A, B, C, and D.
- To transform this variable into a numeric variable, we can create a feature num_grade mapping a numeric value against each ordinal value. In the context of the current example, grades A, B, C, and D are mapped to values 1, 2, 3, and 4 in the transformed variable.
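A minimal sketch of this ordinal mapping, assuming pandas; the student data frame is an illustrative stand-in for the one in the slides:

import pandas as pd

# Hypothetical student data set with an ordinal grade column
students = pd.DataFrame({
    "science_marks": [78, 56, 87, 91],
    "maths_marks": [65, 72, 81, 95],
    "grade": ["B", "C", "A", "A"],
})

# Map each ordinal grade to a numeric value, following the mapping used above
grade_map = {"A": 1, "B": 2, "C": 3, "D": 4}
students["num_grade"] = students["grade"].map(grade_map)
print(students)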
Transforming numeric (continuous) features to categorical features
- Sometimes there is a need to transform a continuous numerical variable into a categorical variable. For example, we may want to treat the real estate price prediction problem, which is a regression problem, as a real estate price category prediction, which is a classification problem.
- In that case, we can 'bin' the numerical data into multiple categories based on the data range. In the context of the real estate price prediction example, the original data set has a numerical feature apartment_price.
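A minimal sketch of such binning, assuming pandas; the prices and bin boundaries below are illustrative (in practice the boundaries would come from the data range):

import pandas as pd

# Hypothetical apartment_price values (say, in thousands)
prices = pd.Series([350, 720, 1500, 980, 410, 2300])

# Bin the continuous prices into three price categories based on the data range
price_category = pd.cut(
    prices,
    bins=[0, 500, 1000, float("inf")],
    labels=["Low", "Medium", "High"],
)
print(price_category)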
Text-specific feature construction
- Text mining is an important area of research, not only for technology practitioners but also for industry practitioners.
- However, making sense of text data, due to the inherent unstructured nature of the data, is not so straightforward.
- The text data chunks that we can think about do not have readily available features, like structured data sets, on which machine learning tasks can be executed. All machine learning models need numerical data as input. So the text data in the data sets needs to be transformed into numerical features.
- Text data, or corpus, which is the more popular keyword, is converted to a numerical representation following a process known as vectorization.
- In this process, word occurrences in all documents belonging to the corpus are consolidated in the form of a bag-of-words.
- There are three major steps that are followed: 1. tokenize, 2. count, 3. normalize.
- In order to tokenize a corpus, the blank spaces and punctuation are used as delimiters to separate out the words, or tokens. Then the number of occurrences of each token is counted for each document. Lastly, tokens are weighted with reducing importance when they occur in the majority of the documents.
- A matrix is then formed with each token representing a column and a specific document of the corpus representing each row. Each cell contains the count of occurrences of the token in that document. This matrix is known as a document-term matrix (also known as a term-document matrix). The figure represents a typical document-term matrix which forms an input to a machine learning model.
- Bag-of-words example:
1. Jim and Pam travelled by bus.
2. The train was late.
3. The flight was full. Traveling by flight is expensive.
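A minimal sketch of building the document-term matrix for these three documents, assuming scikit-learn is available; CountVectorizer performs the tokenize and count steps (TfidfVectorizer would additionally apply the normalization/weighting step):

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Jim and Pam travelled by bus.",
    "The train was late.",
    "The flight was full. Traveling by flight is expensive.",
]

# Tokenize and count: each row is a document, each column a token
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(dtm.toarray())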
Feature extraction
- New features are created from a combination of original features.
- Some of the commonly used operators for combining the original features include:
- For Boolean features: conjunctions, disjunctions, negation, etc.
- For nominal features: Cartesian product, M of N, etc.
- For numerical features: min, max, addition, subtraction, multiplication, division, average, equivalence, inequality, etc.
- Say we have a data set with a feature set F (F1, F2, ..., Fn). After feature extraction using a mapping function f(F1, F2, ..., Fn), we will have a set of features F' (F'1, F'2, ..., F'm) such that F'i = f(Fi) and m < n.
Principal Component Analysis (PCA)
- Every data set has multiple attributes or dimensions, many of which might have similarity with each other.
- For example, a housing data set with 5 dimensions can be reduced to 2 features:
  Size (Area), Number of Rooms, Number of Bathrooms -> Size feature
  Schools Around, Crime Rate -> Location feature
- Principal Component Analysis is a popular unsupervised learning technique for reducing the dimensionality of data.
- It increases interpretability yet, at the same time, minimizes information loss. It helps to find the most significant features in a data set and makes the data easy to plot in 2D and 3D.
- In PCA, a new set of features is extracted from the original features; these new features are quite dissimilar in nature. So an n-dimensional feature space gets transformed to an m-dimensional feature space, where the dimensions are orthogonal to each other, i.e. completely independent of each other.
- Consider the following example:
- 1-D data can be represented on a line.
- 2-D data can be represented in a 2-D graph.
- 3-D data can be represented in a 3-D plot.
- Visualizing 4-D data is difficult.
- So, PCA is used to take multi-dimensional data and convert it into a 2-D or 3-D plot.
- Calculating PCA for 2-D data (features Gene 1 and Gene 2)
Step 1: Plot your data points.
Step 2: Find the average values of Gene 1 and Gene 2 and plot this center point.
Step 3: Shift the data so that the center of the data is on the origin.
Step 4: Find a line that best fits the data.
- We start by drawing a random line through the origin and projecting the data points onto that line.
- Now we rotate that line; the goal is either to minimize the distance between each actual point and its projected point, or equivalently to maximize the distance between the origin and the projected points.
- We will calculate the distances between the origin and the projected points.
- We calculate the sum of all the squared distances:
SS (sum of squared distances) = d1² + d2² + d3² + d4² + d5² + d6²
- PC1 has a slope of 0.25, i.e. for every four units along the x-axis (Gene 1) the line moves only one unit along the y-axis (Gene 2).
- We can find the length of this direction vector (the red line) using the Pythagorean theorem:
length = sqrt(4² + 1²) = sqrt(17) ≈ 4.12
- We can convert the eigenvalues into variation around the origin by dividing by the sample size minus 1:
variation for a PC = eigenvalue of that PC / (n - 1)
- Step 5: Find the eigenvector and eigenvalue.
- When we do PCA with SVD, the direction vector is scaled to be 1 unit long; we obtain new component values, but the ratio between them remains the same.
- The 1-unit-long vector consisting of 0.97 parts of Gene 1 and 0.242 parts of Gene 2 is called the eigenvector for PC1.
- The sum of squared distances is called the eigenvalue for PC1.
- The equation relating eigenvalue and eigenvector is:
A·v = λ·v
where A is the matrix being decomposed, v is the eigenvector and λ is the scalar quantity termed the eigenvalue.
- Step 6: Now find PC2.
- As it is 2-D data, PC2 is simply the line through the origin that is perpendicular to PC1.
- After normalization (scaling the PC2 direction vector to unit length), we have finally obtained our PC2.
- Now that we have both PC1 and PC2, we finally project our points onto PC1 and PC2.
- Then, to draw the final PCA plot, we just rotate our graph so that PC1 is horizontal.
- And we use the projected points to find where the samples go on the PCA plot.
- Assume for this example that the variation for PC1 = 15 and for PC2 = 3.
- So the total variation across both PCs is 18.
- PC1 accounts for 83% of the total variation (i.e. 15/18).
- PC2 accounts for 17% of the total variation (i.e. 3/18).
- Calculating PCA for 3-D data
- We calculate the PCs similarly to the 2-D case:
- First center the data.
- Find PC1; now the direction vector has 3 components.
- Then find PC2, perpendicular to PC1.
- Then find PC3, perpendicular to PC1 and PC2.
- Now we find the total variation of the PCs.
- If PC1 and PC2 account for the vast majority of the variation, a 2-D graph is sufficient to represent our data.
- To convert into a 2-D graph:
- We strip away everything but PC1, PC2 and the data.
- We project our data onto PC1 and PC2.
- Now we rotate our graph so that PC1 is horizontal.
- Now we project back the sample points corresponding to PC1 and PC2.
- Calculating PCA for 4-D data
- We can represent the 4-D data in a 2-D form by finding PC1, PC2, PC3 and PC4, and then discovering which PCs account for most of the variation.
- Working of PCA
1. First, calculate the covariance matrix of the data set.
2. Then, calculate the eigenvalues of the covariance matrix.
3. The eigenvector having the highest eigenvalue represents the direction in which there is the highest variance. This helps in identifying the first principal component.
4. The eigenvector having the next highest eigenvalue represents the direction in which the data has the highest remaining variance and which is also orthogonal to the first direction. This helps in identifying the second principal component.
5. Like this, identify the top 'k' eigenvectors having the top 'k' eigenvalues so as to get the 'k' principal components.
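A minimal NumPy sketch of these steps, using a small made-up data matrix X (samples as rows, features as columns):

import numpy as np

# Hypothetical data: 6 samples, 3 features
X = np.array([[2.5, 2.4, 1.2],
              [0.5, 0.7, 0.3],
              [2.2, 2.9, 1.1],
              [1.9, 2.2, 0.9],
              [3.1, 3.0, 1.4],
              [2.3, 2.7, 1.0]])

k = 2                                    # number of principal components to keep
Xc = X - X.mean(axis=0)                  # center the data
cov = np.cov(Xc, rowvar=False)           # 1. covariance matrix of the data set
eigvals, eigvecs = np.linalg.eigh(cov)   # 2. eigenvalues/eigenvectors (eigh: symmetric matrix)
order = np.argsort(eigvals)[::-1]        # 3-4. sort directions by decreasing variance
top_k = eigvecs[:, order[:k]]            # 5. top-k eigenvectors = principal components
X_reduced = Xc @ top_k                   # project the data onto the k principal components
print(X_reduced)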
Singular Value Decomposition (SVD)
- Singular value decomposition (SVD) is a matrix factorization technique commonly used in linear algebra.
- SVD of a matrix A (m × n) is a factorization of the form:
A = U S V^T
- where U and V are orthonormal matrices: U is an m × m unitary matrix, V is an n × n unitary matrix, and S is an m × n rectangular diagonal matrix. The diagonal entries of S are known as the singular values of matrix A. The columns of U and V are called the left-singular and right-singular vectors of matrix A, respectively.
- SVD of a data matrix is expected to have the properties highlighted below:
1. Patterns in the attributes are captured by the right-singular vectors, i.e. the columns of V.
2. Patterns among the instances are captured by the left-singular vectors, i.e. the columns of U.
3. The larger a singular value, the larger is the part of matrix A that it, together with its associated vectors, accounts for.
4. A new data matrix with 'k' attributes is obtained by keeping only the top 'k' singular values and their associated singular vectors, i.e. A ≈ U_k S_k V_k^T, with the reduced representation given by A·V_k (equivalently U_k·S_k).
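A minimal NumPy sketch of a truncated SVD along these lines; the matrix A below is made up for illustration:

import numpy as np

# Hypothetical data matrix: 5 instances (rows) x 4 attributes (columns)
A = np.array([[4.0, 2.0, 0.5, 1.0],
              [3.5, 1.8, 0.4, 0.9],
              [0.5, 0.3, 3.9, 4.2],
              [0.6, 0.4, 4.1, 4.0],
              [2.0, 1.0, 2.0, 2.1]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U * diag(s) * Vt

k = 2                           # keep the two largest singular values
A_reduced = A @ Vt[:k].T        # new data matrix with k attributes (equivalently U[:, :k] * s[:k])
A_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]      # rank-k reconstruction of A
print(A_reduced)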
Linear Discriminant Analysis (LDA)
- Suppose we have a data set of cancer drug effect.
- Linear discriminant analysis (LDA) is another commonly used feature extraction technique, like PCA or SVD. The objective of LDA is similar in the sense that it intends to transform a data set into a lower-dimensional feature space.
- However, unlike PCA, the focus of LDA is not to capture the data set variability. Instead, LDA focuses on class separability, i.e. separating the features based on class separability so as to avoid over-fitting of the machine learning model.
- Working of LDA
- Similar to PCA, LDA creates a new axis onto which the data is projected in fewer dimensions.
- If the data is simply projected onto such an axis, the classes may not be separated. So LDA chooses the new axis so that the classes are separated:
- Step 1: Maximize the distance between the means of the categories.
- Step 2: Minimize the variation (scatter, s²) within each category.
- Step 3: Optimize the trade-off between the two, i.e. maximize (distance between the means)² / (sum of the scatters).
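A minimal sketch using scikit-learn's LinearDiscriminantAnalysis, assuming scikit-learn is available; the two-feature, two-class drug-effect data below is made up:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical drug-effect data: 2 features, 2 classes (0 = not effective, 1 = effective)
X = np.array([[1.0, 2.1], [1.5, 1.8], [2.0, 2.5], [1.2, 2.0],
              [6.0, 7.1], [6.5, 6.8], [7.0, 7.5], [6.2, 7.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# LDA projects the data onto at most (number of classes - 1) axes chosen for class separability
lda = LinearDiscriminantAnalysis(n_components=1)
X_lda = lda.fit_transform(X, y)
print(X_lda.ravel())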
- LDA for a 3-D data set: the same procedure applies, projecting the 3-D data onto a lower-dimensional axis chosen for class separability.
- LDA for 3 categories: the same two criteria are used, now measured with respect to a central point of all the data.
- Difference between LDA and PCA: PCA looks for the directions of maximum overall variation, while LDA looks for the directions that best separate the known categories.
- Below are the steps to be followed:
1. Calculate the mean vectors for the individual classes.
2. Calculate the intra-class and inter-class scatter matrices.
3. Calculate the eigenvalues and eigenvectors for S_W and S_B, where S_W is the intra-class (within-class) scatter matrix and S_B is the inter-class (between-class) scatter matrix; here m_i is the sample mean for each class, m is the overall mean of the data set, and N_i is the sample size of each class.
4. Identify the top 'k' eigenvectors having the top 'k' eigenvalues.
Feature subset selection
- Say we are trying to predict the weight of students based on past information about similar students, which is captured in a 'student weight' data set. The student weight data set has features such as Roll Number, Age, Height, and Weight.
- We can well understand that roll number can have no bearing whatsoever in predicting student weight. So we can eliminate the feature roll number and build a feature subset to be considered in this machine learning problem.
Issues in high-dimensional data
- With the rapid innovations in the digital space, the volume of data generated has increased to an unbelievable extent.
- Breakthroughs in the storage technology area have made storage of large quantities of data quite cheap.
- Two application domains in particular have seen drastic development: biomedical research and text categorization.
- With high-dimensional data, a very high quantity of computational resources and a large amount of time are required.
- The performance of the model, both for supervised and unsupervised machine learning tasks, also degrades sharply due to unnecessary noise in the data. This is often referred to as the curse of dimensionality.
- The objectives of feature selection:
1. Having a faster and more cost-effective (i.e. less need for computational resources) learning model
2. Improving the efficiency of the learning model
3. Having a better understanding of the underlying model that generated the data
Key drivers of feature selection - feature relevance and redundancy
- Feature relevance:
- In supervised learning: each of the predictor variables is expected to contribute information to decide the value of the class label. In case a variable is not contributing any information, it is said to be irrelevant.
- In case the information contribution for prediction is very little, the variable is said to be weakly relevant.
- The remaining variables, which make a significant contribution to the prediction task, are said to be strongly relevant variables.
- E.g. the roll number of a student doesn't contribute any significant information in predicting what the weight of a student would be.
Measures of feature relevance
- For supervised learning, mutual information is considered a good measure of the information contribution of a feature to decide the value of the class label.
- The higher the mutual information of a feature, the more relevant that feature is.
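A minimal sketch of ranking features by mutual information with scikit-learn's mutual_info_classif; the data set below is synthetic and only for illustration:

import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Hypothetical data: 3 features, binary class label
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + 0.1 * rng.normal(size=100) > 0).astype(int)  # label depends mostly on feature 0

mi = mutual_info_classif(X, y, random_state=0)
print(mi)   # feature 0 should show the highest mutual information with the class label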
- In unsupervised learning: certain variables do not contribute any useful information for deciding the similarity or dissimilarity of data instances.
- Hence, those variables make no significant information contribution in the grouping process. These variables are marked as irrelevant variables in the context of the unsupervised machine learning task.
- E.g. if we are trying to group together students with similar academic capabilities, roll number really cannot contribute any information whatsoever.
- In the case of unsupervised learning, the entropy of the set of features is calculated, leaving out one feature at a time, for all the features. Then the features are ranked in descending order of the information gain from each feature, and the top 'k' features are selected as relevant features.
- Feature redundancy:
- A feature may contribute information which is similar to the information contributed by one or more other features.
- For example, in the weight prediction problem referred to earlier in the section, both the features Age and Height contribute similar information: Age and Height increase with each other, so we can remove either one.
- All features having potential redundancy are candidates for rejection in the final feature subset. Only a small number of representative features out of a set of potentially redundant features are considered for being a part of the final feature subset.
Correlation-based similarity measure
- Correlation is a measure of linear dependency between two random variables. Pearson's product moment correlation coefficient is one of the most popular and accepted measures of correlation between two random variables.
- Correlation values range between +1 and -1. A correlation of ±1 indicates perfect correlation.
- In case the correlation is 0, the features seem to have no linear relationship.
- For two random feature variables F1 and F2, the Pearson correlation coefficient is defined as:
r = cov(F1, F2) / (σ(F1) · σ(F2))
where cov(F1, F2) is the covariance of F1 and F2, and σ(F1), σ(F2) are their standard deviations.
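A minimal sketch of computing this coefficient for two hypothetical feature columns with NumPy:

import numpy as np

F1 = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # e.g. Age
F2 = np.array([110, 122, 135, 142, 158])    # e.g. Height

r = np.corrcoef(F1, F2)[0, 1]   # Pearson correlation coefficient
print(round(r, 3))              # close to +1 => the two features are highly redundant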
Distance-based similarity measure
- Euclidean distance:
- The Euclidean distance between two features F1 and F2 is calculated as:
d(F1, F2) = sqrt( Σ_i (F1i - F2i)² )
where F1i and F2i are the values of the two features for the i-th instance.
- Minkowski distance:
- A more generalized form of the Euclidean distance:
d(F1, F2) = ( Σ_i |F1i - F2i|^r )^(1/r)
- Minkowski distance takes the form of Euclidean distance when r = 2.
- Manhattan distance:
- At r = 1, it takes the form of Manhattan distance, as shown below:
d(F1, F2) = Σ_i |F1i - F2i|
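A minimal NumPy sketch of these distances on two made-up feature vectors:

import numpy as np

F1 = np.array([2.0, 4.0, 6.0, 8.0])
F2 = np.array([1.0, 7.0, 5.0, 10.0])

def minkowski(a, b, r):
    # Generalized Minkowski distance; r = 2 gives Euclidean, r = 1 gives Manhattan
    return np.sum(np.abs(a - b) ** r) ** (1.0 / r)

print(minkowski(F1, F2, r=2))   # Euclidean distance
print(minkowski(F1, F2, r=1))   # Manhattan distance
print(minkowski(F1, F2, r=3))   # another Minkowski distance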
- Hamming distance:
- A specific example of Manhattan distance, used more frequently to calculate the distance between binary vectors, is the Hamming distance.
- For example, the Hamming distance between the two vectors 01101011 and 11001001 is 3.
Other similarity measures
- Jaccard index/coefficient:
- For two features having binary values, the Jaccard index is measured as:
J = n11 / (n01 + n10 + n11)
where n11 = number of cases where both features have value 1, n01 = number of cases where the first feature is 0 and the second is 1, and n10 = number of cases where the first feature is 1 and the second is 0.
- Jaccard distance: d = 1 - J
- Exercise: consider two features F1 and F2 having values (0, 1, 1, 0, 1, 0, 1, 0) and (1, 1, 0, 0, 1, 0, 0, 0). Calculate the Jaccard index.
- Simple matching coefficient (SMC):
- The simple matching coefficient is almost the same as the Jaccard coefficient except that it also includes the number of cases where both features have a value of 0:
SMC = (n00 + n11) / (n00 + n01 + n10 + n11)
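A minimal sketch working through the exercise above in plain Python; the counts n00, n01, n10, n11 are computed directly from the two vectors:

F1 = [0, 1, 1, 0, 1, 0, 1, 0]
F2 = [1, 1, 0, 0, 1, 0, 0, 0]

# Count the four possible (F1, F2) combinations across all instances
n11 = sum(1 for a, b in zip(F1, F2) if a == 1 and b == 1)   # both 1 -> 2
n00 = sum(1 for a, b in zip(F1, F2) if a == 0 and b == 0)   # both 0 -> 3
n10 = sum(1 for a, b in zip(F1, F2) if a == 1 and b == 0)   # 1 and 0 -> 2
n01 = sum(1 for a, b in zip(F1, F2) if a == 0 and b == 1)   # 0 and 1 -> 1

jaccard = n11 / (n01 + n10 + n11)             # 2 / 5 = 0.4
smc = (n00 + n11) / (n00 + n01 + n10 + n11)   # 5 / 8 = 0.625
print(jaccard, 1 - jaccard, smc)              # Jaccard index, Jaccard distance, SMC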
- Cosine similarity:
- Cosine similarity, which is one of the most popular measures in text classification, is calculated as:
cos(x, y) = (x · y) / (||x|| · ||y||)
where x · y is the dot product of the two feature vectors and ||x||, ||y|| are their Euclidean norms.
Overall feature selection process
- A typical feature selection process consists of four steps:
1. generation of possible subsets
2. subset evaluation
3. stop searching based on some stopping criterion
4. validation of the result
Feature selection approaches
- There are four types of approach for feature selection:
1. Filter approach
2. Wrapper approach
3. Hybrid approach
4. Embedded approach
Filter approach
- In the filter approach, the feature subset is selected based on statistical measures used to assess the merits of the features from the data perspective. No learning algorithm is employed to evaluate the goodness of the features selected.
- Some of the common statistical tests conducted on features as part of the filter approach are Pearson's correlation, information gain, Fisher score, analysis of variance (ANOVA), Chi-square, etc.
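A minimal scikit-learn sketch of a filter-style selection using an ANOVA F-test score to keep the top-k features; the Iris data set and k = 2 are illustrative choices, not prescribed by the slides:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Score each feature with the ANOVA F-test and keep the 2 best ones;
# no learning algorithm is involved in judging the features
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.get_support())   # boolean mask of the selected features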
Wrapper approach
- In the wrapper approach, identification of the best feature subset is done using the induction algorithm as a black box. The feature selection algorithm searches for a good feature subset using the induction algorithm itself as a part of the evaluation function.
- Since for every candidate subset the learning model is trained and the result is evaluated by running the learning algorithm, the wrapper approach is computationally very expensive. However, the performance is generally superior compared to the filter approach.
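One common wrapper-style technique is recursive feature elimination; a minimal scikit-learn sketch, where the estimator and data set are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# The induction algorithm (logistic regression) is used as a black box:
# features are repeatedly dropped based on how the trained model weighs them
estimator = LogisticRegression(max_iter=1000)
rfe = RFE(estimator, n_features_to_select=2)
rfe.fit(X, y)
print(rfe.support_)   # which features survived the wrapper search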
Hybrid approach
- The hybrid approach takes advantage of both the filter and wrapper approaches. A typical hybrid algorithm makes use of both the statistical tests used in the filter approach to decide the best subsets for a given cardinality, and a learning algorithm to select the final best subset among the best subsets across different cardinalities.
Embedded approach
- The embedded approach is quite similar to the wrapper approach as it also uses an inductive algorithm to evaluate the generated feature subsets. However, the difference is that it performs feature selection and classification simultaneously.
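A minimal sketch of an embedded-style selection, where an L1-regularized model picks features while it is being trained; the model and data set choices are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# The L1 penalty drives the weights of unhelpful features to zero,
# so feature selection happens as part of fitting the classifier itself
clf = LinearSVC(C=0.05, penalty="l1", dual=False, max_iter=5000)
selector = SelectFromModel(clf)
X_selected = selector.fit_transform(X, y)
print(selector.get_support())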
Department of CE

Thanks

Unit No: 4 - Basics of Feature Engineering (3170724)
Prof. Hetvy Jadeja
