# Regression

* Regression is a supervised machine learning technique used to predict continuous values.

> Supervised learning is a category of machine learning that uses labeled datasets to train algorithms to predict outcomes and recognize patterns.

* Regression is a statistical method used to model the relationship between a dependent variable (often denoted as 'y') and one or more independent variables (often denoted as 'x'). The goal of regression analysis is to understand how the dependent variable changes as the independent variables change.
# Types Of Regression

1. Linear Regression
2. Polynomial Regression
3. Stepwise Regression
4. Decision Tree Regression
5. Random Forest Regression
6. Ridge Regression
7. Lasso Regression
8. ElasticNet Regression
9. Bayesian Linear Regression
10. Support Vector Regression

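Most of these variants are available as ready-made estimators in scikit-learn. Below is a minimal sketch of where several of them live in the library (the class names are standard scikit-learn estimators; which one to use depends on the problem and the data):
```
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, BayesianRidge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

# All of these share the same fit()/predict() interface,
# so they can be swapped into the examples that follow.
models = {
    "Linear": LinearRegression(),
    "Ridge": Ridge(),
    "Lasso": Lasso(),
    "ElasticNet": ElasticNet(),
    "Bayesian Linear": BayesianRidge(),
    "Decision Tree": DecisionTreeRegressor(),
    "Random Forest": RandomForestRegressor(),
    "Support Vector": SVR(),
}
```
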
We'll start with Linear Regression.

# Linear Regression

* Linear regression is a fundamental statistical method used to model the relationship between a dependent variable (often denoted as Y) and one or more independent variables (often denoted as X). The relationship is assumed to be linear, meaning that changes in the independent variables are associated with changes in the dependent variable in a straight-line fashion.

The basic form of linear regression for a single independent variable is:

**Y = β₀ + β₁X + ε**

Where:

* Y is the dependent variable.
* X is the independent variable.
* β₀ is the intercept, representing the value of Y when X is zero.
* β₁ is the slope coefficient, representing the change in Y for a one-unit change in X.
* ε is the error term, representing the variability in Y that is not explained by the linear relationship with X.

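As a quick illustration of how the equation is read (the numbers here are made up purely for illustration): if β₀ = 2 and β₁ = 3, then for an observation with X = 4 the model's expected value is Y = 2 + 3 × 4 = 14, and ε is whatever difference remains between 14 and the value of Y actually observed.
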
# Basic Code of Linear Regression

* This line imports the numpy library, which is widely used for numerical operations in Python. We use np as an alias for numpy, making it easier to reference functions and objects from the library.
```
import numpy as np
```

* This line imports the LinearRegression class from the linear_model module of the scikit-learn library. scikit-learn is a powerful library for machine learning tasks in Python, and LinearRegression is the class it provides for linear regression.
```
from sklearn.linear_model import LinearRegression
```
* This line creates a NumPy array X containing the independent variable values. In this example, we have a simple one-dimensional array representing the independent variable. The reshape(-1, 1) method reshapes the array into a column vector, which scikit-learn requires for the feature matrix.

```
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
```
* This line creates a NumPy array Y containing the corresponding dependent variable values. These are the observed values of the dependent variable corresponding to the independent variable values in X.
```
Y = np.array([2, 4, 5, 8, 5])
```

* This line creates an instance of the LinearRegression class, which represents the linear regression model. We'll use this object to train the model and make predictions.
```
model = LinearRegression()
```

* This line fits the linear regression model to the data. The fit() method takes two arguments: the independent variable (X) and the dependent variable (Y). This method estimates the coefficients of the linear regression equation that best fit the given data.
```
model.fit(X, Y)
```
* These lines print out the intercept (β₀) and coefficient (β₁) of the linear regression model. model.intercept_ gives the intercept value, and model.coef_ gives an array of coefficients, where model.coef_[0] corresponds to the coefficient of the first independent variable (in this case, there is only one).
```
print("Intercept:", model.intercept_)
print("Coefficient:", model.coef_[0])
```
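
For intuition about what fit() is estimating, the single-feature least-squares coefficients can also be computed directly from the data with the closed-form formulas β₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and β₀ = ȳ − β₁x̄. This is a small sketch using only the X and Y arrays defined above:
```
x = X.ravel()  # flatten the (5, 1) column vector back to a 1-D array
beta_1 = np.sum((x - x.mean()) * (Y - Y.mean())) / np.sum((x - x.mean()) ** 2)
beta_0 = Y.mean() - beta_1 * x.mean()
print("Closed-form intercept:", beta_0)  # 1.8 for this sample data
print("Closed-form slope:", beta_1)      # 1.0 for this sample data
```
These values should agree with model.intercept_ and model.coef_[0].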

* These lines demonstrate how to use the trained model to make predictions for new data.
* We create a new NumPy array new_data containing the values of the independent variable for which we want to predict the dependent variable values.
* We then use the predict() method of the model to obtain the predictions for these new data points. Finally, we print out the predicted values.
```
new_data = np.array([[6], [7]])
predictions = model.predict(new_data)
print("Predictions:", predictions)
```
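
Each prediction is simply the fitted line evaluated at the new value, i.e. intercept + coefficient × X. With the sample data above (intercept 1.8 and coefficient 1.0), the predictions for X = 6 and X = 7 should come out to roughly 7.8 and 8.8.
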
# Assumptions of Linear Regression

# Linearity:

* To assess the linearity assumption, we can visually inspect a scatter plot of the observed values versus the predicted values.
* If the relationship between them appears linear (the points cluster around a straight line), the linearity assumption is reasonable.
```
import matplotlib.pyplot as plt

predictions = model.predict(X)  # predictions on the training data
plt.scatter(predictions, Y)
plt.xlabel("Predicted Values")
plt.ylabel("Observed Values")
plt.title("Linearity Check: Observed vs Predicted")
plt.show()
```
# Homoscedasticity:
* Homoscedasticity refers to the constant variance of the residuals across all levels of the independent variable(s). We can visually inspect a plot of residuals versus predicted values to check for homoscedasticity.
```
residuals = Y - predictions
plt.scatter(predictions, residuals)
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.title("Homoscedasticity Check: Residuals vs Predicted Values")
plt.axhline(y=0, color='red', linestyle='--')  # Add horizontal line at y=0
plt.show()
```
# Normality of Residuals:
* To assess the normality of residuals, we can visually inspect a histogram or a Q-Q plot of the residuals.
```
import seaborn as sns

sns.histplot(residuals, kde=True)
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.title("Normality of Residuals: Histogram")
plt.show()

import scipy.stats as stats

stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Normal Q-Q Plot")
plt.show()
```
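
Alongside the plots, a formal normality test can be applied to the residuals. The sketch below uses the Shapiro-Wilk test from scipy.stats; keep in mind that with only five observations such a test has very little power, so treat it as a rough indication only:
```
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk statistic: {stat}, p-value: {p_value}")
# A large p-value (e.g. above 0.05) means we fail to reject the hypothesis
# that the residuals are normally distributed.
```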
# Metrics for Regression

# Mean Absolute Error (MAE)

* MAE measures the average magnitude of the errors in a set of predictions, without considering their direction. It is the average of the absolute differences between predicted and actual values.
```
from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(Y, predictions)
print(f"Mean Absolute Error (MAE): {mae}")
```
# Mean Squared Error (MSE)

* MSE measures the average of the squares of the errors. It gives more weight to larger errors, making it sensitive to outliers.
```
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(Y, predictions)
print(f"Mean Squared Error (MSE): {mse}")
```
# Root Mean Squared Error (RMSE)
* RMSE is the square root of the MSE. It provides an error metric that is in the same units as the dependent variable, making it more interpretable.
```
rmse = np.sqrt(mse)
print(f"Root Mean Squared Error (RMSE): {rmse}")
```
# R-squared (Coefficient of Determination)
* R-squared measures the proportion of the variance in the dependent variable that is predictable from the independent variables. It typically ranges from 0 to 1, where 1 indicates a perfect fit (it can even be negative for a model that fits worse than simply predicting the mean).
```
from sklearn.metrics import r2_score

r2 = r2_score(Y, predictions)
print(f"R-squared (R^2): {r2}")
```
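
R-squared never decreases as more predictors are added, so adjusted R-squared is often reported alongside it. scikit-learn does not provide it as a built-in metric, but it is easy to derive from r2 with the standard formula, as sketched below (n is the number of observations and p the number of predictors):
```
n = len(Y)          # number of observations (5 in this example)
p = X.shape[1]      # number of predictors (1 in this example)
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"Adjusted R-squared: {adjusted_r2}")
```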

> In this tutorial, the sample dataset is used for learning purposes only.