Visualizing Multi-collinearity in Python

Multi-collinearity: Business Situations
• To analyze the relationship of company size and revenue to stock price in a regression model, market capitalization and revenue are used as independent variables

• A company's market capitalization and its total revenues are strongly correlated: as a company earns increasing revenues, it also grows in size. This leads to a multi-collinearity problem
What is Multi-collinearity?
• Multi-collinearity is present when two or more features are correlated with each other

• Correlation between independent and dependent features is desired

• Multi-collinearity among independent features is less desired in some settings

• Collinear features can be omitted, as they are not necessarily more informative than the features they are correlated with

• Identifying these features is a form of feature selection
What is Multi-collinearity?
• Prior to training predictive models on a dataset, it is key to identify and understand multi-collinearity

• We need to limit highly collinear features, as they can lead to misleading outcomes when explaining models
Why visualize Multi-collinearity?
• Checking correlation between independent and dependent features is typically done during EDA

• It provides insight into which features are informative for prediction

• For feature selection, it is not always necessary to visually inspect feature correlations

• The VIF (Variance Inflation Factor) can be used to detect multi-collinearity
• With multi-collinearity, regression coefficients are still consistent but not reliable, since their standard errors are inflated

• This means the model's predictive power is not reduced, but the coefficients may not be statistically significant [Type II error (FN)]

• Multi-collinearity is often signalled by a high coefficient of determination (R²) alongside individually insignificant coefficients
• Correlation between features is visualized using a correlation matrix and the corresponding heatmap

• If the dataset has a large number of features, extracting any information from the matrix becomes complex

• With 50 features, we have a matrix with shape 50 x 50

• There must be a better way: the clustermap (see the sketch below)
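A minimal sketch contrasting the two views, using a synthetic DataFrame of 50 collinear features built purely for illustration (all names and data here are made up):

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
# Toy data: 5 noisy copies of 10 base signals -> 50 collinear features
base = rng.normal(size=(200, 10))
df = pd.DataFrame(
    np.hstack([base + 0.1 * rng.normal(size=(200, 10)) for _ in range(5)]),
    columns=[f"f{i}" for i in range(50)],
)
corr = df.corr()

# Plain heatmap: rows and columns keep their original order,
# so the block structure of the 50 x 50 matrix is hard to see
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.show()

# Clustermap: hierarchical clustering reorders rows and columns,
# grouping blocks of collinear features together
sns.clustermap(corr, cmap="coolwarm", center=0)
plt.show()

The clustermap draws dendrograms on both axes, so groups of mutually collinear features appear as contiguous blocks instead of being scattered across the matrix.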
Variance Inflation Factor
• For the ith independent variable, VIF_i = 1 / (1 - R_i²), where R_i² is the unadjusted coefficient of determination obtained by regressing the ith independent variable on the remaining ones

• The reciprocal of VIF is known as tolerance

• Calculation of VIF [refer to the attached slides]

• If R_i² = 0, the ith independent variable cannot be predicted from the remaining independent variables

• When VIF = tolerance = 1, the ith independent variable is not correlated with the remaining ones, which means multi-collinearity does not exist [here the variance of the ith regression coefficient is not inflated]

• VIF > 4 or tolerance < 0.25 indicates that multi-collinearity might exist and further investigation is required

• When VIF > 10 or tolerance < 0.1, there is significant multi-collinearity which needs to be addressed (see the sketch below)
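One common way to compute VIFs in Python is statsmodels' variance_inflation_factor; the sketch below applies it to synthetic data echoing the earlier market-cap/revenue example (the variable names and numbers are illustrative assumptions, not from the slides):

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
revenue = rng.normal(100, 20, size=500)
market_cap = 5 * revenue + rng.normal(0, 10, size=500)  # strongly correlated
employees = rng.normal(1000, 100, size=500)             # independent
X = pd.DataFrame({"revenue": revenue,
                  "market_cap": market_cap,
                  "employees": employees})

# Add an intercept so each auxiliary regression includes a constant
X_const = sm.add_constant(X)

# VIF_i = 1 / (1 - R_i^2); skip index 0, which is the constant itself
vifs = pd.Series(
    [variance_inflation_factor(X_const.values, i)
     for i in range(1, X_const.shape[1])],
    index=X.columns,
)
print(vifs)        # revenue and market_cap should show VIF >> 10
print(1.0 / vifs)  # tolerance is the reciprocal of VIF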
There are situations where high VIFs can be safely ignored without suffering from multi-collinearity. The following are three such situations (the second is illustrated after this list):

• High VIFs exist only in control variables, not in the variables of interest. Here the variables of interest are not collinear with each other or with the control variables [the regression coefficients of interest are not impacted]
• When high VIFs are caused by including products or powers of other variables, multi-collinearity does not cause negative impacts [e.g. a regression model that includes both x and x² as independent variables]
• When a dummy variable that represents more than two categories has a high VIF, multi-collinearity does not necessarily exist [the dummy variables will always have high VIFs if there is a small proportion of cases in one category, regardless of whether the categorical variables are correlated with other variables]
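The second situation can be demonstrated directly; this sketch (with made-up data) shows that including both x and x² inflates the VIFs even though the polynomial term is intentional:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=300)
X = sm.add_constant(pd.DataFrame({"x": x, "x_squared": x ** 2}))

for i, name in enumerate(["x", "x_squared"], start=1):
    print(name, variance_inflation_factor(X.values, i))
# Both VIFs come out large because x and x**2 are strongly correlated,
# but the high values are a by-product of the intentional polynomial
# term and can be safely ignored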
Correction of Multi-collinearity
• Remove one (or more) of the highly correlated variables

• Use principal component analysis (PCA), as sketched below

• Both approaches aim to minimize information loss while improving model predictability
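A minimal sketch of the PCA approach using two synthetic collinear features; the standardization step and the 95% variance threshold are illustrative choices, not prescribed by the slides:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
revenue = rng.normal(100, 20, size=500)
market_cap = 5 * revenue + rng.normal(0, 10, size=500)  # collinear pair
X = np.column_stack([revenue, market_cap])

# Standardize first: PCA is sensitive to feature scales
X_scaled = StandardScaler().fit_transform(X)

# Keep enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

print(pca.n_components_, pca.explained_variance_ratio_)
# The principal components are orthogonal by construction, so the
# transformed features carry no multi-collinearity

The components are harder to interpret than the raw variables, which is the usual trade-off against simply dropping one of the correlated features.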
Visualizing strongly correlated S&P 500 stocks
• S&P 500 stock data (01/01/2020 - 31/12/2021) is used to visualize collinear stocks

• Daily prices are downloaded with the yfinance package in Python (Yahoo Finance)
[Figures: daily price data of S&P 500 stocks; correlation heatmap; clustermap]
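A minimal sketch of the workflow on these slides, assuming network access and a small illustrative subset of tickers (the slides use the full S&P 500):

import yfinance as yf
import seaborn as sns
import matplotlib.pyplot as plt

tickers = ["AAPL", "MSFT", "GOOGL", "AMZN", "JPM", "BAC", "XOM", "CVX"]
# yfinance's end date is exclusive, so this covers through 31/12/2021
prices = yf.download(tickers, start="2020-01-01", end="2022-01-01")["Close"]

# Correlate daily returns rather than raw prices to avoid spurious
# correlation driven by shared long-run trends
returns = prices.pct_change().dropna()
corr = returns.corr()

# Heatmap in the original ticker order
sns.heatmap(corr, cmap="coolwarm", center=0, annot=True, fmt=".2f")
plt.show()

# Clustermap reorders tickers so sector peers (banks, oil majors, tech)
# fall into visible blocks
sns.clustermap(corr, cmap="coolwarm", center=0)
plt.show()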
