Visualising Multicollinearity in Python
Multi-collinearity Business Situations
• To analyze how company size and revenue relate to stock price, a
regression model uses market capitalization and revenue as independent
variables, which tend to be highly correlated with each other
• With 50 features
- the correlation matrix has a shape of 50 x 50
• A variance inflation factor (VIF) > 4, or equivalently a tolerance
(1/VIF) < 0.25, indicates that multi-collinearity might exist and further
investigation is required
• When high VIFs exist only in control variables and not in the variables
of interest, multi-collinearity is not a problem: the variables of interest
are collinear neither with each other nor with the control variables
[their regression coefficients are not affected]
• When high VIFs result from including products or powers of other
variables, multi-collinearity has no adverse effect [e.g. a regression
model that includes both x and x² as independent variables]
• When dummy variables that represent a categorical variable with more
than two categories have high VIFs, multi-collinearity does not
necessarily exist [these dummies will always have high VIFs if only a
small proportion of cases fall in the reference category, regardless of
whether the categorical variable is correlated with other variables]
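The 50 x 50 matrix and the VIF > 4 / tolerance < 0.25 rule of thumb can be sketched with NumPy alone. This is a minimal illustration on synthetic data, not the original analysis: the helper name `vif_from_features`, the random data, and the near-duplicate feature are all assumptions made for the example. It relies on the standard identity that the VIF of feature j equals the j-th diagonal element of the inverse correlation matrix.

```python
import numpy as np

def vif_from_features(X):
    """Return (corr, vif, tolerance) for the columns of X.

    Hypothetical helper for illustration: VIF_j is read off the diagonal
    of the inverse correlation matrix, and tolerance_j = 1 / VIF_j.
    """
    corr = np.corrcoef(X, rowvar=False)   # features are columns
    vif = np.diag(np.linalg.inv(corr))
    tolerance = 1.0 / vif
    return corr, vif, tolerance

# Synthetic data (assumption): 500 observations of 50 independent features
rng = np.random.default_rng(0)
n_obs, n_features = 500, 50
X = rng.normal(size=(n_obs, n_features))

# Make feature 1 a noisy copy of feature 0 so the two are collinear
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=n_obs)

corr, vif, tolerance = vif_from_features(X)
print(corr.shape)                 # (50, 50)

# Apply the rule of thumb from the slide: VIF > 4 (tolerance < 0.25)
flagged = np.where(vif > 4)[0]
print("Features needing further investigation:", flagged)
```

Only the two deliberately collinear features should be flagged here; with real data a flag is a prompt for further investigation, not proof of a problem.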
Correction of Multi-collinearity
• Remove one (or more) of highly correlated variables
• Goals of the correction
- minimize information loss
- improve model predictability
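One simple way to remove one of each pair of highly correlated variables is a greedy pass over the correlation matrix. The sketch below is an assumption-laden illustration: the helper name `drop_highly_correlated`, the 0.9 cut-off, and the synthetic data are not from the slides, and which member of a pair is kept depends only on column order.

```python
import numpy as np

def drop_highly_correlated(X, threshold=0.9):
    """Return indices of columns to keep.

    Hypothetical helper: a column is kept only if its absolute
    correlation with every column kept so far is <= threshold
    (the 0.9 default is an assumed cut-off).
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(corr.shape[0]):
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return keep

# Synthetic data (assumption): column 1 is a noisy copy of column 0
rng = np.random.default_rng(1)
base = rng.normal(size=(300, 3))
X = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.05 * rng.normal(size=300),
    base[:, 1],
    base[:, 2],
])

kept = drop_highly_correlated(X)
X_reduced = X[:, kept]
print("Kept columns:", kept)
```

In practice the choice of which variable to drop is better guided by domain knowledge than by column order.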
Visualizing strongly correlated S&P500 stocks
• S&P500 stock data (01/01/2020 - 31/12/2021)
- to visualize collinear stocks
- downloaded with the yfinance (Yahoo Finance) package in Python
Daily price data of S&P500 stocks
Heatmap
Clustermap
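The workflow behind these slides might look roughly like the sketch below. It is not the original code: the function names are hypothetical, the use of daily returns rather than raw prices is an assumption (the slides do not say which was used), and the tickers in the usage comment are placeholders. It assumes `yfinance`, `seaborn`, `matplotlib` and `scipy` are installed; those imports are kept inside the functions so the correlation step can run without them.

```python
import numpy as np
import pandas as pd

def download_close_prices(tickers, start="2020-01-01", end="2022-01-01"):
    """Download daily closing prices with yfinance (needs network access).

    Assumption: the end date is treated as exclusive, so 2022-01-01 is
    used to cover data up to 31/12/2021.
    """
    import yfinance as yf
    data = yf.download(tickers, start=start, end=end)
    return data["Close"]

def return_correlations(prices):
    """Correlation matrix of daily returns (assumed choice, see lead-in)."""
    returns = prices / prices.shift(1) - 1   # simple daily returns
    returns = returns.dropna(how="all")      # drop the first, all-NaN row
    return returns.corr()

def plot_heatmap_and_clustermap(corr):
    """Draw a heatmap and a hierarchically clustered heatmap of corr."""
    import matplotlib.pyplot as plt
    import seaborn as sns
    fig, ax = plt.subplots(figsize=(8, 6))
    sns.heatmap(corr, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
    # clustermap reorders rows and columns so correlated stocks sit together
    grid = sns.clustermap(corr, cmap="coolwarm", vmin=-1, vmax=1)
    return fig, grid

# Example usage (placeholder tickers; requires network access):
# prices = download_close_prices(["AAPL", "MSFT", "GOOGL"])
# corr = return_correlations(prices)
# plot_heatmap_and_clustermap(corr)
# import matplotlib.pyplot as plt; plt.show()
```

The clustermap step may fail if the correlation matrix contains missing values, for example for stocks with too little history in the period, so such columns would need to be dropped first.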