
NAME: LAVANYA KURUP

ROLL NO : 121A1047

C3

EXPERIMENT 1: TO IMPLEMENT LINEAR AND MULTIPLE LINEAR REGRESSION

AIM: In this experiment we will learn to implement linear and multiple linear regression.

THEORY:

What is Linear Regression?


Linear regression is a statistical technique used to model and analyze the
relationship between a dependent variable (outcome) and one independent
variable (predictor). It assumes a linear relationship between these variables,
meaning that changes in the independent variable result in proportional
changes in the dependent variable.

Key Concepts

1. Dependent and Independent Variables:

 Dependent Variable (Y): The outcome or response variable that you want to predict or explain.
 Independent Variable (X): The predictor or explanatory variable used to predict the dependent variable.

2. Linear Relationship:

 The relationship between X and Y is modeled as a straight line, described by the linear equation: Y = β0 + β1X + ϵ
 β0: The intercept of the line, representing the value of Y when X is zero.
 β1: The slope of the line, representing the change in Y for a one-unit change in X.
 ϵ: The error term, capturing the deviations of observed values from the predicted values.

Steps in Linear Regression

1. Formulate the Model:

 Decide the form of the linear relationship you want to model. For simple linear regression, the model is: Y = β0 + β1X + ϵ


2. Estimate Parameters:

 Ordinary Least Squares (OLS) is the most common method for estimating β0 and β1. It aims to minimize the sum of the squared differences between observed values and predicted values (errors).
 The estimated parameters (β̂0 and β̂1) are found by solving: minimize ∑(Yi − (β̂0 + β̂1Xi))², where the sum runs over i = 1 to n.
 Setting the derivatives of this sum to zero gives the closed-form estimates β̂1 = ∑(Xi − X̄)(Yi − Ȳ) / ∑(Xi − X̄)² and β̂0 = Ȳ − β̂1X̄, which is exactly what the "Using Formula" code later in this report computes.
3. Evaluate the Model:

 Residuals: The differences between observed and predicted values (Yi − Ŷi).
 Goodness of Fit:
     R-squared (R²): The proportion of variance in the dependent variable that is predictable from the independent variable. It ranges from 0 to 1, with higher values indicating a better fit.
     Adjusted R-squared: Adjusts R² for the number of predictors, providing a more accurate measure for models with multiple predictors.
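
To make the R² definition concrete, the short sketch below computes residuals and R² from first principles and checks the result against scikit-learn's r2_score; the observed and predicted values here are made up for illustration, not taken from the experiment's dataset.

import numpy as np
from sklearn.metrics import r2_score

# Made-up observed values and predictions from some fitted line (illustrative only)
y_true = np.array([3.0, 5.1, 6.9, 9.2, 11.0])
y_pred = np.array([3.2, 5.0, 7.1, 8.9, 11.2])

residuals = y_true - y_pred                       # Yi - Y_hat_i
ss_res = np.sum(residuals ** 2)                   # sum of squared residuals
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2_manual = 1 - ss_res / ss_tot                   # R^2 = 1 - SS_res / SS_tot

print("Residuals:", residuals)
print("R-squared (manual):", round(r2_manual, 4))
print("R-squared (sklearn):", round(r2_score(y_true, y_pred), 4))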
4. Check Assumptions:
 Linearity: The relationship between X and Y should be linear.
 Independence: Observations should be independent of each other.
 Homoscedasticity: The variance of residuals should be constant across all levels of X.
 Normality: Residuals should be approximately normally distributed (mainly for hypothesis testing).
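
These assumptions are usually checked visually. The sketch below, using synthetic data rather than the experiment's salary dataset, draws a residuals-versus-fitted plot: a random scatter around zero with roughly constant spread supports the linearity and homoscedasticity assumptions.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration only
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(50, 1))
y = 2.0 + 1.5 * x.ravel() + rng.normal(0, 1, size=50)

model = LinearRegression().fit(x, y)
fitted = model.predict(x)
residuals = y - fitted

plt.scatter(fitted, residuals)
plt.axhline(0, color='red', linestyle='--')   # residuals should straddle zero
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Residuals vs Fitted')
plt.show()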
5. Interpret the Results:

 Slope (β1): Indicates how much Y changes for a one-unit change in X.
 Intercept (β0): Indicates the value of Y when X is zero. In some contexts, the intercept may not have a meaningful interpretation if X = 0 is not within the range of observed data.

Applications

 Economics: Predicting GDP growth based on investment levels.


 Healthcare: Predicting patient outcomes based on treatment
variables.
 Marketing: Analyzing how advertising spending affects sales.

What is Multiple Linear Regression?


Multiple Linear Regression (MLR) is a statistical technique used to model
the relationship between one dependent variable and two or more
independent variables. It generalizes simple linear regression to account for
more than one predictor, helping you understand how multiple factors
simultaneously influence an outcome.

Model Equation

The equation for multiple linear regression is: Y = β0 + β1X1 + β2X2 + ⋯ + βnXn + ϵ

Where:

 Y is the dependent variable.
 X1, X2, …, Xn are the independent variables (predictors).
 β0 is the intercept of the regression plane.
 β1, β2, …, βn are the coefficients (slopes) of the independent variables, indicating the change in Y for a one-unit change in each corresponding Xi.
 ϵ is the error term (residual), representing the deviation of the observed values from the predicted values.
Steps in Multiple Linear Regression

1. Formulate the Model:

 Decide which independent variables to include in the model based on theory, prior research, or exploratory data analysis.
2. Estimate Parameters:

 Ordinary Least Squares (OLS): The most common method for estimating the coefficients β. OLS minimizes the sum of the squared differences between observed values and predicted values: minimize ∑(Yi − (β̂0 + β̂1Xi1 + β̂2Xi2 + ⋯ + β̂nXin))², where the sum runs over i = 1 to n.
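
Equivalently, stacking the predictors into a design matrix X (with a leading column of ones for the intercept) gives the closed-form OLS solution β̂ = (XᵀX)⁻¹Xᵀy. The sketch below solves this least-squares problem on made-up data with NumPy (np.linalg.lstsq, which is numerically safer than inverting XᵀX directly) and cross-checks against scikit-learn; it is an illustration, not part of the experiment code.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 100
X = rng.uniform(0, 5, size=(n, 2))                       # two predictors
y = 4 + 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 1, n)  # known true coefficients

# Prepend a column of ones so beta_hat[0] plays the role of the intercept
X_design = np.column_stack([np.ones(n), X])

# Solve the OLS minimization: beta_hat = argmin ||X_design @ beta - y||^2
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print("Least-squares estimates:", beta_hat)

# Cross-check with scikit-learn's OLS implementation
model = LinearRegression().fit(X, y)
print("sklearn intercept and coefs:", model.intercept_, model.coef_)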
3. Evaluate the Model:

 Residuals: Analyze the differences between observed values and the values predicted by the model.
 Goodness of Fit:
     R-squared (R²): Measures the proportion of the variance in the dependent variable that is predictable from the independent variables. It indicates how well the model explains the variability of the response data.
     Adjusted R-squared: Adjusts R² for the number of predictors in the model. It’s useful for comparing models with different numbers of predictors.
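
To my knowledge, scikit-learn's metrics module provides r2_score but no adjusted variant, so adjusted R² is typically computed by hand from the formula Adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1). In the sketch below, the observed/predicted values, n, and p are assumptions made for the sake of the example.

import numpy as np
from sklearn.metrics import r2_score

# Placeholder observed and predicted values for illustration
y_true = np.array([10.0, 12.5, 14.0, 15.5, 18.0, 20.5])
y_pred = np.array([10.4, 12.0, 14.3, 15.9, 17.6, 20.8])

n = len(y_true)   # number of observations
p = 2             # number of predictors (assumed for this example)

r2 = r2_score(y_true, y_pred)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print("R-squared:", round(r2, 3))
print("Adjusted R-squared:", round(adj_r2, 3))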
4. Check Assumptions:
 Linearity: The relationship between the dependent variable and
each independent variable should be linear.
 Independence: Observations should be independent of each
other.
 Homoscedasticity: The residuals should have constant
variance at every level of the independent variables.
 Normality of Residuals: Residuals should be approximately
normally distributed, which is important for hypothesis testing
and constructing confidence intervals.

5. Interpret the Results:

 Coefficients (β): Each coefficient represents the change in the dependent variable for a one-unit change in the corresponding independent variable, holding all other predictors constant.
 Intercept (β0): The value of Y when all independent variables are zero. Its interpretation may not always be meaningful if zero is outside the range of the data.

CODE:
1. LINEAR REGRESSION

 Using Formula

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

loc = "/content/Salary_Data.csv"
df = pd.read_csv(loc)

X = df.iloc[:, 0]
y = df.iloc[:, 1]

mean_X = np.mean(X)
mean_y = np.mean(y)

n = len(X)

numer = 0
denom = 0
for i in range(n):
    numer += (X[i] - mean_X) * (y[i] - mean_y)   # sum of cross-deviations
    denom += (X[i] - mean_X) ** 2                # sum of squared deviations in X

b1 = numer / denom            # slope = cov(X, y) / var(X)
b0 = mean_y - (b1 * mean_X)   # intercept from the means

print("Intercept b0:", b0)


print("Slope b1:", b1)

plt.scatter(X, y, color='blue', label='Scatter Plot')


plt.plot(X, b0 + b1*X, color='red', label='Regression Line')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.title('Linear Regression Fit')
plt.legend()
plt.show()
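
With b0 and b1 in hand, a prediction for a new input is one line; the 5 years of experience below is a made-up value for illustration.

# Predict the salary for a hypothetical 5 years of experience
years = 5.0
predicted_salary = b0 + b1 * years
print(f"Predicted salary for {years} years of experience: {predicted_salary:.2f}")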

 Using SkLearn

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score

loc = "/content/Salary_Data.csv"
df = pd.read_csv(loc)

# Print the first few rows
df.head()

# Check for missing values
print(df.isnull().sum())

# Drop any rows with missing values
df.dropna(inplace=True)

x = df['YearsExperience']
y = df['Salary']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Reshape to 2-D arrays, as scikit-learn expects shape (n_samples, n_features)
x_train = np.array(x_train).reshape(len(x_train), 1)
x_test = np.array(x_test).reshape(len(x_test), 1)
y_train = np.array(y_train).reshape(len(y_train), 1)
y_test = np.array(y_test).reshape(len(y_test), 1)

model = LinearRegression()
model.fit(x_train, y_train)

# Plot the fit on the training set
y_train_pred = model.predict(x_train)
plt.figure()
plt.scatter(x_train, y_train, color='blue', label='True Values')
plt.plot(x_train, y_train_pred, color='red', label='Prediction')
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.legend()

# Plot the fit on the test set
y_test_pred = model.predict(x_test)
plt.figure()
plt.scatter(x_test, y_test, color='green', label='True Values')
plt.plot(x_test, y_test_pred, color='black', label='Prediction')
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.legend()
plt.show()

print("Mean squared error =", round(mean_squared_error(y_test, y_test_pred), 2))
print("Explained variance score =", round(explained_variance_score(y_test, y_test_pred), 2))
print("R2 score =", round(r2_score(y_test, y_test_pred), 2))

2. MULTIPLE LINEAR REGRESSION

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

np.random.seed(0)
X1 = 2 * np.random.rand(100, 1)
X2 = 3 * np.random.rand(100, 1)
X = np.hstack((X1, X2))
y = 4 + 3*X1 + 2*X2 + np.random.randn(100, 1)

# Fit the multiple linear regression on both predictors together
model = LinearRegression()
model.fit(X, y)
print("Intercept:", model.intercept_)
print("Coefficients:", model.coef_)

# Single-variable fits, used only to visualise each predictor's relationship with y
model_X1 = LinearRegression()
model_X1.fit(X1, y)

model_X2 = LinearRegression()
model_X2.fit(X2, y)

y_pred_X1 = model_X1.predict(X1)
y_pred_X2 = model_X2.predict(X2)

plt.figure(figsize=(10, 6))
plt.scatter(X1, y, c='b', label='Actual data (X1)')
plt.plot(X1, y_pred_X1, color='r', label='Regression line (X1)')
plt.scatter(X2, y, c='g', label='Actual data (X2)')
plt.plot(X2, y_pred_X2, color='y', label='Regression line (X2)')
plt.xlabel('X1 and X2')
plt.ylabel('Y')
plt.title('Multiple Linear Regression')
plt.legend()
plt.show()
OUTPUT:
1. LINEAR REGRESSION:
 USING FORMULA:

 USING SKLEARN MODEL:


2. MULTIPLE LINEAR REGRESSION:
CONCLUSION:

In this experiment, we learnt how to implement simple linear and multiple linear regression, both directly from the OLS formula and using scikit-learn's LinearRegression model.
