Principal Component Analysis
Deeksha M
4SH19CS016
Shri Devi Institute Of Technology
Mangalore
What Is Principal Component Analysis?
• Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used on large
data sets: it transforms a large set of variables into a smaller one that still contains most of the
information in the original set.
• Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the trick
in dimensionality reduction is to trade a little accuracy for simplicity. Smaller data sets are easier to
explore and visualize, and machine learning algorithms can analyze them much faster when there are no
extraneous variables to process.
• So to sum up, the idea of PCA is simple: reduce the number of variables of a data set while preserving
as much information as possible, as the sketch below illustrates.
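As a quick illustration of that idea, here is a minimal sketch using scikit-learn's PCA on a made-up 4-variable data set; the random data, the array shapes, and the choice of two components are assumptions for the example, not part of the method itself.

# Minimal sketch: reduce a made-up 4-variable data set to 2 principal components with scikit-learn.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))          # 100 samples, 4 original variables (illustrative data)

pca = PCA(n_components=2)              # keep only the 2 most informative components
X_reduced = pca.fit_transform(X)       # shape (100, 2)

print(X_reduced.shape)
print(pca.explained_variance_ratio_)   # share of the total variance kept by each component

The steps below walk through what such a library call does internally.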
Step by Step Explanation of PCA
1. STANDARDIZATION
The aim of this step is to standardize the range of the initial variables so that each of them contributes equally to the
analysis. The reason why it is critical to perform standardization prior to PCA is that PCA is quite sensitive to the
variances of the initial variables. That is, if there are large differences between the ranges of the initial variables, the
variables with larger ranges will dominate over those with small ranges (for example, a variable that ranges between 0
and 100 will dominate over a variable that ranges between 0 and 1), which leads to biased results. Transforming the
data to comparable scales prevents this problem.
Mathematically, this is done by subtracting the mean and dividing by the standard deviation for each value of each
variable: z = (value - mean) / standard deviation.
Once the standardization is done, all the variables will be transformed to the same scale.
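A minimal sketch of this step with NumPy, assuming the data is held in a 2-D array X with one column per variable (the array itself is made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=[0.5, 50.0], scale=[0.1, 20.0], size=(100, 2))   # two variables on very different scales

# z-score standardization: subtract each column's mean and divide by its standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))   # approximately 0 for every variable
print(X_std.std(axis=0))    # exactly 1 for every variable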
2. COVARIANCE MATRIX COMPUTATION
The aim of this step is to understand how the variables of the input data set are varying
from the mean with respect to each other, or in other words, to see if there is any
relationship between them. Sometimes variables are highly correlated in such a way that
they contain redundant information. In order to identify these correlations, we compute
the covariance matrix.
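A sketch of this computation with NumPy, continuing the illustrative standardized array from the previous sketch (re-created here so the snippet runs on its own):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                         # made-up data: 100 samples, 3 variables
X_std = (X - X.mean(axis=0)) / X.std(axis=0)          # standardized as in step 1

# Covariance matrix: rowvar=False means each column is a variable, each row an observation
cov_matrix = np.cov(X_std, rowvar=False)

print(cov_matrix.shape)   # (3, 3): one row and one column per variable
print(cov_matrix)         # diagonal entries are ~1 after standardization; off-diagonals are covariances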
3. EIGENVECTORS AND EIGENVALUES
Principal components are new variables that are constructed as linear combinations or mixtures of the
initial variables. These combinations are done in such a way that the new variables (i.e., principal
components) are uncorrelated and most of the information within the initial variables is squeezed or
compressed into the first components. So, the idea is that 10-dimensional data gives you 10 principal
components, but PCA tries to put the maximum possible information in the first component, then the
maximum remaining information in the second, and so on.
How PCA Constructs the Principal Components
As there are as many principal components as there are variables in the data, principal
components are constructed in such a manner that the first principal component accounts for
the largest possible variance in the data set.
The second principal component is calculated in the same way, with the condition that it is
uncorrelated with (i.e., perpendicular to) the first principal component and that it accounts for
the next highest variance.
This continues until a total of p principal components have been calculated, equal to the
original number of variables. In practice, these directions are the eigenvectors of the covariance
matrix, and the amount of variance carried by each component is given by the corresponding
eigenvalue, so ordering the eigenvectors by their eigenvalues in descending order gives the
principal components in order of significance.
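A sketch of this eigendecomposition with NumPy, using the same illustrative arrays as before (all variable names here are assumptions for the example):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
cov_matrix = np.cov(X_std, rowvar=False)

# Eigendecomposition of the symmetric covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# eigh returns eigenvalues in ascending order; reverse so the first component
# is the one that accounts for the largest variance
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]    # columns are the principal component directions

explained = eigenvalues / eigenvalues.sum()
print(explained)   # fraction of the total variance carried by each component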
4. FEATURE VECTOR
As we saw in the previous step, computing the eigenvectors and ordering them by their eigenvalues in
descending order allows us to find the principal components in order of significance. In this step, we
choose whether to keep all of these components or to discard those of lesser significance (those with
low eigenvalues), and form a matrix with the remaining ones that we call the feature vector.
So, the feature vector is simply a matrix that has as columns the eigenvectors of the components that
we decide to keep. This makes it the first step towards dimensionality reduction, because if we choose
to keep only k eigenvectors (components) out of the original p, the final data set will have only k dimensions.
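Continuing the illustrative sketch, keeping only the top k eigenvectors forms the feature vector (the name feature_vector and the choice k = 2 are assumptions for the example):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_std, rowvar=False))
order = np.argsort(eigenvalues)[::-1]                  # components sorted by decreasing eigenvalue
eigenvectors = eigenvectors[:, order]

# Feature vector: keep only the k most significant eigenvectors (k = 2 here, for illustration)
k = 2
feature_vector = eigenvectors[:, :k]
print(feature_vector.shape)   # (3, 2): original dimensionality x number of kept components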
5. RECAST THE DATA ALONG THE PRINCIPAL COMPONENT AXES
In the previous steps, apart from standardization, you do not make any changes to the data;
you just select the principal components and form the feature vector, but the input data set
always remains in terms of the original axes (i.e., in terms of the initial variables).
In this step, which is the last one, the aim is to use the feature vector formed from the
eigenvectors of the covariance matrix to reorient the data from the original axes to the ones
represented by the principal components (hence the name Principal Component Analysis).
This can be done by multiplying the transpose of the feature vector by the transpose of the
standardized original data set (equivalently, multiplying the standardized data matrix by the
feature vector).
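A sketch of this final projection with NumPy, reusing the same illustrative arrays as in the earlier sketches:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X_std = (X - X.mean(axis=0)) / X.std(axis=0)                               # step 1: standardize
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_std, rowvar=False))    # steps 2-3
order = np.argsort(eigenvalues)[::-1]
feature_vector = eigenvectors[:, order][:, :2]                             # step 4: keep the top 2 components

# Step 5: reorient the data onto the principal component axes.
# feature_vector.T @ X_std.T gives the projected data with one column per sample;
# transposing back returns the usual samples-by-components layout.
X_projected = (feature_vector.T @ X_std.T).T                               # shape (100, 2)

print(X_projected.shape)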
Limitations
• As noted above, the results of PCA depend on the scaling of the variables. This can be cured by scaling each feature by its standard
deviation, so that one ends up with dimensionless features with unit variance.
• The applicability of PCA as described above is limited by certain (tacit) assumptions made in its derivation. In particular, PCA can
capture linear correlations between the features but fails when this assumption is violated. In some cases, coordinate transformations
can restore the linearity assumption, and PCA can then be applied.
• Another limitation is the mean-removal process before constructing the covariance matrix for PCA. In fields such as astronomy, all
the signals are non-negative, and the mean-removal process will force the mean of some astrophysical exposures to be zero, which
consequently creates unphysical negative fluxes, and forward modeling has to be performed to recover the true magnitude of the
signals. As an alternative method, non-negative matrix factorisation focuses only on the non-negative elements in the matrices, which
is well suited for astrophysical observations. See the relation between PCA and non-negative matrix factorisation for more.