
DIGITAL ASSIGNMENT 1

BCSE334L-Predictive Analytics
FALL SEM 2024-2025
Slot: E1+TE1

Submitted by-
Arnav Bahuguna
Reg - 21BCE3795
Q. Develop a comprehensive prediction model based on four regression techniques using a real-
time dataset of your choice. Your task includes the following components:
1. Data Preprocessing: Elaborate on the data preprocessing techniques employed. This should
cover:
• Data collection methods and the source of your real-time dataset.
• Handling of missing values, outliers, and any inconsistencies in the dataset.
• Feature selection and extraction processes.
• Data normalization or standardization techniques used.
2. Modelling: Develop prediction models using some regression techniques of your choice. (min 4
techniques).
Ensure that your assignment is well-structured, clearly written, and demonstrates a deep
understanding of regression techniques and their application to real-time datasets. Use high-quality
English and support your explanations with relevant references and citations where appropriate.

1. Data Preprocessing
1.1 Data Collection Methods and Source
For this project, the dataset used is a housing dataset loaded from a CSV file named
Housing.csv. It contains features relevant to house pricing, such as the number of bedrooms,
bathrooms, and square footage of living space. The dataset is sourced from Kaggle and is used
here to build regression models that predict house prices.

1.2 Handling Missing Values, Outliers, and Inconsistencies


Ensuring the data was clean and ready for analysis was a critical first step, so I applied several
preprocessing techniques:
• Missing Values: We begin by checking for missing values using df.isnull().sum(), which shows
whether there are any gaps in the data that need to be addressed. If missing values are found,
they can be handled either by imputing them with a statistical measure such as the mean or by
removing the affected rows or columns. In this case, every column reported zero missing
values, so no imputation was needed (a sketch of this check is included after this list).

• Outliers: Boxplots were used to identify and visualize outliers across all numerical features.
Rather than removing them, the data were standardized using StandardScaler and the spread of
values was examined to understand their potential impact on the model. Further details are
covered under 1.4 Data Normalization/Standardization below.
• Inconsistencies: Irrelevant or redundant features, such as 'id' and 'date', were removed from
the dataset to prevent them from skewing the analysis and model training.
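A minimal sketch of these checks, assuming the Housing.csv file from the notebook; the mean-imputation line is an illustrative fallback only, since this dataset has no missing values:

import pandas as pd

# Load the Kaggle housing data (same file used throughout the notebook)
df = pd.read_csv('Housing.csv')

# Check for gaps: every column in this dataset reports zero missing values
print(df.isnull().sum())

# Illustrative fallback only (not needed here): impute numeric gaps with the column mean
numeric_cols = df.select_dtypes(include='number').columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Drop identifier-like columns that would only skew the model
df = df.drop(columns=['id', 'date'])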

1.3 Feature Selection and Extraction


The primary goal was to predict the house price, so 'price' was selected as the target variable.
The remaining features, including 'bedrooms', 'bathrooms', 'sqft_living', and so on, were chosen
as predictors because of their potential influence on housing prices.
To further refine the features:
• Principal Component Analysis (PCA): PCA was applied to the features to reduce
dimensionality while retaining 95% of the variance. This step simplifies the model without
losing significant predictive power, and it also helps reduce noise, which can otherwise lead to
overfitting. A short sketch of this step is shown below.
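A condensed sketch of this step, mirroring the attached notebook; the explained-variance check at the end is an illustrative addition:

from sklearn.decomposition import PCA

# 'df' is the scaled housing frame from the preprocessing steps above
features = df.drop('price', axis=1)

pca = PCA(n_components=0.95)        # keep enough components to explain 95% of the variance
pca_features = pca.fit_transform(features)

# Illustrative check: number of components retained and the variance they capture
print(pca.n_components_, pca.explained_variance_ratio_.sum())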

1.4 Data Normalization/Standardization


Normalization is an essential preprocessing step to ensure that the features contribute equally to
the model. In this case:
The features were standardized using StandardScaler, which involves rescaling the data so that it
has a mean of 0 and a standard deviation of 1. This ensures that all features contribute equally to
the model and prevents features with larger scales from dominating the learning process.
Standardization is particularly important when features are measured in different units (e.g.,
square footage vs. the number of bedrooms).
• Boxplot before scaling:

• Boxplot after scaling:

After scaling, all features share a common range, so the extreme values that previously dominated the boxplot are far less pronounced; note that standardization rescales the data rather than removing outliers outright. A short sketch of the scaling step follows.
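A short sketch of the scaling step, following the notebook; deriving the column list from df.columns is a compact stand-in for the explicit list used there, and the final sanity check is an illustrative addition:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
feature_cols = df.columns.drop('price')   # every predictor column

# z = (x - mean) / std : each feature ends up with mean 0 and standard deviation 1
df[feature_cols] = scaler.fit_transform(df[feature_cols])

# Illustrative sanity check: means should be ~0 and standard deviations ~1
print(df[feature_cols].mean().round(3))
print(df[feature_cols].std().round(3))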
1.5 Train-Test Split
With the preprocessing complete, the dataset was split into training and testing sets, which were then
used for model training and evaluation.
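The split itself matches the attached notebook: a 70/30 split of the PCA-transformed features, with a fixed random seed for reproducibility.

from sklearn.model_selection import train_test_split

# Hold out 30% of the data for evaluation; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    pca_features, y, test_size=0.3, random_state=101)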

2. Modeling
In this project, four different regression techniques were employed to build predictive models for
housing prices. Each model was trained on the preprocessed dataset and evaluated using Mean
Absolute Error (MAE) and Root Mean Squared Error (RMSE).
2.1 Linear Regression
Linear Regression is one of the simplest and most widely used predictive modeling techniques. It
assumes a linear relationship between the dependent variable (in this case, house prices) and the
independent variables (features such as number of bedrooms, bathrooms, etc.).
• The linear regression model was trained using the standardized training dataset. The model
learns the relationship between the input features and the target variable by minimizing the
mean squared error between the predicted and actual house prices.
• Evaluation: The model was evaluated on the test set using two key metrics:
o Mean Absolute Error (MAE): This metric measures the average magnitude of the errors
in the predictions, without considering their direction. It provides a straightforward
interpretation of prediction accuracy.
o Root Mean Squared Error (RMSE): RMSE is similar to MAE but gives more weight to
larger errors. It is particularly useful for understanding the model's prediction accuracy,
especially when larger errors are undesirable.
The linear regression model serves as a baseline for comparison with more complex models.
Furthermore, a scatterplot of predicted versus actual prices shows an uphill pattern from left to
right, indicating a positive relationship between the two, i.e. the predictions track the actual values.
A condensed sketch of the fit-and-evaluate workflow follows.
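The sketch below condenses the baseline workflow; the evaluate() helper is an illustrative refactoring, while the metric calls themselves follow the attached notebook:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

def evaluate(model, X_test, y_test):
    """Illustrative helper: report MAE and RMSE for a fitted regressor."""
    preds = model.predict(X_test)
    mae = mean_absolute_error(y_test, preds)            # average absolute error
    rmse = np.sqrt(mean_squared_error(y_test, preds))   # square root of MSE; penalizes large errors
    return mae, rmse

lin_model = LinearRegression().fit(X_train, y_train)
print("Linear regression (MAE, RMSE):", evaluate(lin_model, X_test, y_test))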
2.2 Ridge Regression
Ridge Regression is an extension of linear regression that includes a regularization term, known as
the L2 penalty, to address issues of multicollinearity and prevent overfitting. Multicollinearity occurs
when independent variables are highly correlated, which can lead to unreliable estimates of the
coefficients in a linear model.
• The ridge regression model was trained on the same standardized dataset as the linear
regression model. The addition of the L2 penalty shrinks the coefficients of less important
features, thereby reducing model complexity and improving generalization to new data.
• Evaluation: The model's performance was assessed using MAE and RMSE, similar to the linear
regression model. Ridge regression is particularly effective when the dataset has highly
correlated features, as it can produce more stable and reliable estimates. A brief sketch of
selecting the penalty strength follows.
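The attached notebook uses Ridge() with its default penalty strength (alpha = 1.0); as an illustrative extension, the strength can also be chosen by cross-validation:

from sklearn.linear_model import Ridge, RidgeCV

# Default L2 penalty, as in the notebook
rdg_model = Ridge().fit(X_train, y_train)

# Illustrative: pick the penalty strength from a candidate grid via 5-fold cross-validation
rdg_cv = RidgeCV(alphas=[0.1, 1.0, 10.0, 100.0], cv=5).fit(X_train, y_train)
print("Selected alpha:", rdg_cv.alpha_)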
2.3 Decision Tree Regression
Decision Tree Regression is a non-linear model that splits the dataset into subsets based on the
values of the input features. Each split is made based on the feature that results in the greatest
reduction in variance for the target variable. The process continues recursively until the tree reaches a
maximum depth or the variance within each subset is sufficiently low.
• The decision tree regressor was trained on the preprocessed dataset. This model is particularly
useful for capturing non-linear relationships between the features and the target variable.
• Evaluation: The decision tree model's predictions were evaluated using MAE and RMSE.
Decision trees can be highly accurate on the training data but are prone to overfitting,
especially if the tree is not properly pruned or constrained. Overfitting occurs when the model
becomes too complex and captures noise in the data rather than the underlying pattern; a sketch
of constraining the tree to mitigate this follows.
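The notebook trains an unconstrained DecisionTreeRegressor(); the sketch below constrains it to reduce overfitting (the specific depth and leaf-size values are illustrative, not tuned):

from sklearn.tree import DecisionTreeRegressor

# Limiting depth and requiring a minimum leaf size keeps the tree
# from memorizing noise in the training data
dcst_model = DecisionTreeRegressor(max_depth=10, min_samples_leaf=5, random_state=101)
dcst_model.fit(X_train, y_train)
dcst_preds = dcst_model.predict(X_test)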
2.4 Random Forest Regression
Random Forest Regression is an ensemble learning technique that combines multiple decision trees
to improve predictive accuracy and robustness. Each tree in the random forest is trained on a random
subset of the data, and the final prediction is obtained by averaging the predictions of all the trees.
• The random forest regressor was implemented using the sklearn.ensemble module. This
technique is particularly effective in reducing the variance of the model, thereby producing
more stable and accurate predictions.
• Evaluation: Similar to the other models, the random forest regressor was evaluated using MAE
and RMSE. Random forests generally outperform individual decision trees by mitigating the risk
of overfitting and providing a more generalizable model. A short sketch with the ensemble's
main parameters follows.
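A sketch of the ensemble with its main parameters spelled out; n_estimators=100 is scikit-learn's default and is written explicitly here only for clarity:

from sklearn.ensemble import RandomForestRegressor

# Each of the 100 trees is trained on a bootstrap sample of the training data;
# the forest's prediction is the average of the individual tree predictions
rf_model = RandomForestRegressor(n_estimators=100, random_state=101)
rf_model.fit(X_train, y_train)
rf_preds = rf_model.predict(X_test)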
Link to ipynb
The .ipynb notebook is also attached below for reference.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('Housing.csv')
df.head()

           id             date     price  bedrooms  bathrooms  sqft_living  \
0  7229300521  20141013T000000  231300.0         2       1.00         1180
1  6414100192  20141209T000000  538000.0         3       2.25         2570
2  5631500400  20150225T000000  180000.0         2       1.00          770
3  2487200875  20141209T000000  604000.0         4       3.00         1960
4  1954400510  20150218T000000  510000.0         3       2.00         1680

   sqft_lot  floors  waterfront  view  ...  grade  sqft_above  sqft_basement  \
0      5650     1.0           0     0  ...      7        1180              0
1      7242     2.0           0     0  ...      7        2170            400
2     10000     1.0           0     0  ...      6         770              0
3      5000     1.0           0     0  ...      7        1050            910
4      8080     1.0           0     0  ...      8        1680              0

   yr_built  yr_renovated  zipcode      lat     long  sqft_living15  \
0      1955             0    98178  47.5112 -122.257           1340
1      1951          1991    98125  47.7210 -122.319           1690
2      1933             0    98028  47.7379 -122.233           2720
3      1965             0    98136  47.5208 -122.393           1360
4      1987             0    98074  47.6168 -122.045           1800

   sqft_lot15
0        5650
1        7639
2        8062
3        5000
4        7503

[5 rows x 21 columns]

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 21613 non-null int64
1 date 21613 non-null object
2 price 21613 non-null float64
3 bedrooms 21613 non-null int64
4 bathrooms 21613 non-null float64
5 sqft_living 21613 non-null int64
6 sqft_lot 21613 non-null int64
7 floors 21613 non-null float64
8 waterfront 21613 non-null int64
9 view 21613 non-null int64
10 condition 21613 non-null int64
11 grade 21613 non-null int64
12 sqft_above 21613 non-null int64
13 sqft_basement 21613 non-null int64
14 yr_built 21613 non-null int64
15 yr_renovated 21613 non-null int64
16 zipcode 21613 non-null int64
17 lat 21613 non-null float64
18 long 21613 non-null float64
19 sqft_living15 21613 non-null int64
20 sqft_lot15 21613 non-null int64
dtypes: float64(5), int64(15), object(1)
memory usage: 3.5+ MB

df.drop('date', axis=1, inplace=True)

df.isnull().sum()

id 0
price 0
bedrooms 0
bathrooms 0
sqft_living 0
sqft_lot 0
floors 0
waterfront 0
view 0
condition 0
grade 0
sqft_above 0
sqft_basement 0
yr_built 0
yr_renovated 0
zipcode 0
lat 0
long 0
sqft_living15 0
sqft_lot15 0
dtype: int64

df.drop('id', axis=1, inplace = True)

df[df.columns].plot(kind='box', figsize=(20,10))

<Axes: >

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
ftransform = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
              'floors', 'waterfront', 'view', 'condition', 'grade',
              'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated',
              'zipcode', 'lat', 'long', 'sqft_living15', 'sqft_lot15']

df[ftransform] = scaler.fit_transform(df[ftransform])

df[df.columns].plot(kind='box', figsize=(20,10))

<Axes: >
features = df.drop('price', axis=1)
y = df['price']

from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)
pca_features = pca.fit_transform(features)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(pca_features, y,
                                                    test_size=0.3,
                                                    random_state=101)

Linear Regression

from sklearn.linear_model import LinearRegression

lin_model = LinearRegression()
lin_model.fit(X_train, y_train)

LinearRegression()

lin_pred = lin_model.predict(X_test)

from sklearn.metrics import mean_absolute_error, mean_squared_error

print("Mean absolute error: ", mean_absolute_error(lin_pred, y_test))


print("Mean squared error: ", np.sqrt(mean_squared_error(lin_pred,
y_test)))
Mean absolute error: 127249.10489453698
Mean squared error: 205767.4770870467

plt.figure(figsize=(10, 6))

plt.scatter(y_test, lin_pred, color='blue', alpha=0.5)

plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
         color='red', lw=2)

plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Predicted vs. Actual Values')

plt.show()

from sklearn.linear_model import Ridge

rdg_model = Ridge()
rdg_model.fit(X_train, y_train)

Ridge()

rdg_preds = rdg_model.predict(X_test)
print("Mean absolute error: ", mean_absolute_error(rdg_preds, y_test))
print("Mean squared error: ", np.sqrt(mean_squared_error(rdg_preds,
y_test)))

Mean absolute error: 127247.01883052982


Mean squared error: 205767.5327010256

plt.figure(figsize=(10, 6))

plt.scatter(y_test, rdg_preds, color='blue', alpha=0.5)

plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
         color='red', lw=2)

plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Predicted vs. Actual Values')

plt.show()

from sklearn.tree import DecisionTreeRegressor

dcst_model = DecisionTreeRegressor()
dcst_model.fit(X_train, y_train)
DecisionTreeRegressor()

dcst_preds = dcst_model.predict(X_test)

print("Mean absolute error: ", mean_absolute_error(dcst_preds,


y_test))
print("Mean squared error: ", np.sqrt(mean_squared_error(dcst_preds,
y_test)))

Mean absolute error: 120416.09276681062


Mean squared error: 220576.43889928536

plt.figure(figsize=(10, 6))

plt.scatter(y_test, dcst_preds, color='blue', alpha=0.5)

plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
         color='red', lw=2)

plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Predicted vs. Actual Values')

plt.show()
from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor()
rf_model.fit(X_train,y_train)
rf_preds = rf_model.predict(X_test)

print("Mean absolute error: ", mean_absolute_error(rf_preds, y_test))


print("Mean squared error: ", np.sqrt(mean_squared_error(rf_preds,
y_test)))

Mean absolute error: 85981.71186822875


Mean squared error: 157137.91024699726

plt.figure(figsize=(10, 6))

plt.scatter(y_test, rf_preds, color='blue', alpha=0.5)

plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()],
         color='red', lw=2)

plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Predicted vs. Actual Values')

plt.show()
