Principal Component Analysis


Principal Component Analysis (PCA)

As the number of features, or dimensions, in a dataset increases, the amount of data
required to obtain a statistically significant result increases exponentially. This can
lead to overfitting, increased computation time, and reduced accuracy of machine
learning models. These problems, which arise when working with high-dimensional
data, are collectively known as the curse of dimensionality.
As the number of dimensions increases, the number of possible combinations of
feature values also increases exponentially. This makes it computationally difficult to
obtain a representative sample of the data, and expensive to perform tasks such as
clustering or classification. Additionally, some machine learning algorithms are
sensitive to the number of dimensions and require more data to reach the same level
of accuracy they achieve on lower-dimensional data.
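The exponential growth described above can be illustrated with a small sketch. It assumes, purely for illustration, that each feature is split into a fixed number of bins and that the samples are spread evenly over the resulting grid; the function names `cells_in_grid` and `samples_per_cell` are hypothetical and not part of any library.

```python
def cells_in_grid(bins_per_axis: int, n_dims: int) -> int:
    """Number of grid cells when each of n_dims features is split into bins_per_axis bins."""
    return bins_per_axis ** n_dims


def samples_per_cell(n_samples: int, bins_per_axis: int, n_dims: int) -> float:
    """Average number of samples per cell, assuming samples are spread evenly (an idealization)."""
    return n_samples / cells_in_grid(bins_per_axis, n_dims)


# With one million samples and 10 bins per feature, the grid empties out quickly
# as the number of features grows.
for n_dims in (1, 2, 5, 10):
    print(n_dims, cells_in_grid(10, n_dims), samples_per_cell(1_000_000, 10, n_dims))
```

Keeping the same density of samples per cell would require the sample count to grow by a factor of 10 for every added feature, which is the sense in which the data requirement grows exponentially.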
To address the curse of dimensionality, feature engineering techniques such as
feature selection and feature extraction are used. Dimensionality reduction is a
type of feature extraction that aims to reduce the number of input features
while retaining as much of the original information as possible.
In this article, we discuss one of the most popular dimensionality reduction
techniques: Principal Component Analysis (PCA).
What is Principal Component Analysis (PCA)?
Principal Component Analysis (PCA) was introduced by the mathematician Karl
Pearson in 1901. It works on the condition that when data in a higher-dimensional
space is mapped to a lower-dimensional space, the variance of the data in the
lower-dimensional space should be as large as possible.
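This variance-maximization condition can be written compactly. The notation here is introduced for illustration and does not come from the original text: X is the data matrix with n rows after each column has been centered to zero mean, S is the sample covariance matrix, and w is a unit-length direction vector.

```latex
S = \frac{1}{n-1} X^{\top} X, \qquad
\operatorname{Var}(X w) = w^{\top} S \, w, \qquad
w_1 = \arg\max_{\|w\| = 1} \; w^{\top} S \, w
```

The maximizer w_1 is the eigenvector of S with the largest eigenvalue, and that eigenvalue equals the variance of the data projected onto w_1.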
• Principal Component Analysis (PCA) is a statistical procedure that uses
an orthogonal transformation to convert a set of correlated variables into
a set of uncorrelated variables. PCA is one of the most widely used tools in
exploratory data analysis and in machine learning for predictive models.
• PCA is an unsupervised learning technique used to examine the
interrelations among a set of variables. It is closely related to, though
distinct from, factor analysis. Geometrically, the first principal component
is a line of best fit through the data, measured by perpendicular distance to
the line rather than by the vertical distance that regression minimizes.
• The main goal of Principal Component Analysis (PCA) is to reduce the
dimensionality of a dataset while preserving the most important patterns or
relationships between the variables without any prior knowledge of the
target variables.
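The decorrelation property in the first bullet can be checked numerically. The following is a minimal sketch using NumPy on synthetic data; the helper name `pca_scores` and the two-feature dataset are assumptions made for illustration. No target variable appears anywhere, which reflects the unsupervised nature of the method.

```python
import numpy as np


def pca_scores(X: np.ndarray) -> np.ndarray:
    """Project centered data onto the eigenvectors of its covariance matrix."""
    Xc = X - X.mean(axis=0)                    # center each feature
    cov = np.cov(Xc, rowvar=False)             # feature-by-feature covariance
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh handles symmetric matrices, ascending order
    order = np.argsort(eigvals)[::-1]          # reorder to decreasing variance
    return Xc @ eigvecs[:, order]


# Synthetic data: the second feature is roughly twice the first, plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
X = np.column_stack([x, 2.0 * x + rng.normal(scale=0.5, size=500)])

scores = pca_scores(X)
corr_before = np.corrcoef(X, rowvar=False)[0, 1]       # strongly correlated inputs
corr_after = np.corrcoef(scores, rowvar=False)[0, 1]   # approximately zero after the transform
print(corr_before, corr_after)
```

Because the eigenvectors of a symmetric covariance matrix are orthogonal, the transformation is a rotation of the centered data, and the covariance matrix of the rotated data is diagonal up to floating-point error.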
PCA reduces the dimensionality of a data set by finding a new set of variables,
smaller than the original set, that retains most of the sample’s information and
is useful for the regression and classification of data.
1. PCA is a technique for dimensionality reduction that identifies a set of
orthogonal axes, called principal components, that capture the maximum
variance in the data. The principal components are linear combinations of
the original variables in the dataset and are ordered in decreasing order of
importance. The total variance captured by all the principal components is
equal to the total variance in the original dataset.
2. The first principal component captures the most variation in the data. The
second principal component captures the maximum remaining variance in a
direction orthogonal to the first principal component, and so on.
3. PCA can be used for a variety of purposes, including data visualization,
feature selection, and data compression. In data visualization, PCA can be
used to plot high-dimensional data in two or three dimensions, making it
easier to interpret. In feature selection, the weights of the original
variables in the leading components can help identify the most influential
variables in a dataset. In data compression, PCA can be used to reduce the
size of a dataset while losing relatively little information.
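4. In PCA, it is assumed that the information is carried in the variance of the
features, that is, the higher the variation in a feature, the more information
that feature carries. Because variance depends on the units of measurement,
features are usually standardized before PCA is applied.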
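The properties listed above can be sketched in a few lines of NumPy. This is one common way to compute PCA, through an eigen-decomposition of the covariance matrix, and is not necessarily the method any particular library uses. The function name `pca` and the synthetic five-feature dataset (generated from two hidden factors plus a little noise) are assumptions made for illustration.

```python
import numpy as np


def pca(X: np.ndarray, n_components: int):
    """Return (components, variances, scores) for X of shape (n_samples, n_features).

    components: the first n_components principal axes, one per row
    variances:  variance along every principal axis, in decreasing order
    scores:     the data projected onto the kept axes
    """
    Xc = X - X.mean(axis=0)                      # center each feature
    cov = np.cov(Xc, rowvar=False)               # sample covariance (divides by n - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)       # ascending eigenvalues, orthonormal eigenvectors
    order = np.argsort(eigvals)[::-1]            # sort by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components].T
    scores = Xc @ components.T
    return components, eigvals, scores


# Synthetic data: 5 observed features driven by 2 hidden factors plus small noise.
rng = np.random.default_rng(42)
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 5))

components, variances, scores = pca(X, n_components=2)
total_variance = X.var(axis=0, ddof=1).sum()     # sum of per-feature variances
explained_ratio = variances / variances.sum()    # share of variance per component

print(explained_ratio)
print(np.isclose(variances.sum(), total_variance))
```

The `scores` array has two columns and could be passed directly to a scatter plot for visualization. Since the data were built from two factors, the first two entries of `explained_ratio` account for nearly all of the variance, and the sum of all component variances matches the total variance of the original features.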
Overall, PCA is a powerful tool for data analysis and can help to simplify complex
datasets, making them easier to understand and work with.
