
Overfitting and Underfitting in Python

August 25, 2025

1 Overfitting and Underfitting using Python


1. Compare Training vs Validation/Testing Performance
2. Learning Curves (Training vs Validation Error)
3. Cross Validation
4. Bias-Variance Indicators
5. Residual Analysis
6. Regularization Check

1.1 1. Compare Training vs Validation / Testing Performance


[6]: from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits

[8]: #Load Dataset


X, y = load_digits(return_X_y=True)

[10]: # Split Train / Val / Test


X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.3,
                                                random_state=42)

[12]: X.shape

[12]: (1797, 64)

[14]: y.shape

[14]: (1797,)

[16]: X_train.shape

[16]: (1257, 64)

[18]: X_temp.shape

[18]: (540, 64)

[20]: X_val.shape

[20]: (378, 64)

[22]: X_test.shape

[22]: (162, 64)

[24]: X

[24]: array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],
             [ 0.,  0.,  0., ..., 10.,  0.,  0.],
             [ 0.,  0.,  0., ..., 16.,  9.,  0.],
             ...,
             [ 0.,  0.,  1., ...,  6.,  0.,  0.],
             [ 0.,  0.,  2., ..., 12.,  0.,  0.],
             [ 0.,  0., 10., ..., 12.,  1.,  0.]])

[26]: y

[26]: array([0, 1, 2, ..., 8, 9, 8])

[28]: #Train Model


model = RandomForestClassifier()

[30]: model.fit(X_train, y_train)

[30]: RandomForestClassifier()

[32]: y_pred_train = model.predict(X_train)

[34]: y_pred_test = model.predict(X_test)

[36]: y_pred_val = model.predict(X_val)

[38]: train_accuracy = accuracy_score(y_train, y_pred_train)


val_accuracy = accuracy_score(y_val, y_pred_val)
test_accuracy = accuracy_score(y_test, y_pred_test)

[40]: print("Training Accuracy:", train_accuracy)

Training Accuracy: 1.0

[42]: print("Validation Accuracy:", val_accuracy)

Validation Accuracy: 0.9735449735449735

[46]: print("Test Accuracy:", test_accuracy)

Test Accuracy: 0.9691358024691358

[48]: # Compare
print("Training Accuracy:", train_accuracy)
print("Validation Accuracy:", val_accuracy)
print("Test Accuracy:", test_accuracy)

Training Accuracy: 1.0


Validation Accuracy: 0.9735449735449735
Test Accuracy: 0.9691358024691358
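
A quick way to act on this comparison is to look at the gap between training and validation accuracy: near-perfect training accuracy with a noticeably lower validation accuracy points to overfitting. A minimal sketch (the 0.05 cutoff is an illustrative choice, not a standard value):

[ ]: # Flag possible overfitting from the train/validation gap
gap = train_accuracy - val_accuracy
if gap > 0.05:  # illustrative threshold
    print(f"Possible overfitting: gap = {gap:.3f}")
else:
    print(f"Gap looks acceptable: gap = {gap:.3f}")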

1.2 2. Learning Curves (Training vs Validation Error)


[52]: import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LogisticRegression

[54]: #Learning Curves


train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=2000), X, y, cv=5, scoring="accuracy",
    train_sizes=np.linspace(0.1, 1.0, 10), n_jobs=-1
)

[56]: #Mean Errors


train_error = 1 - np.mean(train_scores, axis=1)
val_error = 1 - np.mean(val_scores, axis=1)

[58]: #Plot
plt.plot(train_sizes, train_error, 'o-', label="Training Error")
plt.plot(train_sizes, val_error, 'o-', label="Validation Error")
plt.xlabel("Training Size")
plt.ylabel("Error(1-Accuracy)")
plt.title("Learning Curves")
plt.legend()
plt.grid(True)
plt.show()
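
Reading the curves: if both errors converge to a low value, the model fits well; a persistent gap between a low training error and a higher validation error suggests overfitting; two curves converging at a high error suggest underfitting. To see how stable these estimates are across the 5 folds, one can also shade one standard deviation around each curve; a minimal sketch reusing the scores computed above:

[ ]: # Shade +/- 1 std (across CV folds) around each error curve
train_std = np.std(1 - train_scores, axis=1)
val_std = np.std(1 - val_scores, axis=1)
plt.plot(train_sizes, train_error, 'o-', label="Training Error")
plt.fill_between(train_sizes, train_error - train_std,
                 train_error + train_std, alpha=0.2)
plt.plot(train_sizes, val_error, 'o-', label="Validation Error")
plt.fill_between(train_sizes, val_error - val_std,
                 val_error + val_std, alpha=0.2)
plt.xlabel("Training Size")
plt.ylabel("Error (1 - Accuracy)")
plt.legend()
plt.grid(True)
plt.show()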

1.3 3. Cross Validation
[61]: from sklearn.model_selection import cross_validate
cv_results = cross_validate(
    LogisticRegression(max_iter=2000), X, y, cv=5,
    return_train_score=True, scoring="accuracy"
)

[63]: print("Train Scores:", cv_results["train_score"])


print("Test Scores:", cv_results["test_score"])
print("Mean Train Accuracy:", cv_results["train_score"].mean())
print("Mean Val Accuracy:", cv_results["test_score"].mean())

Train Scores: [1. 1. 1. 1. 1.]


Test Scores: [0.92222222 0.87222222 0.94150418 0.94150418 0.89693593]
Mean Train Accuracy: 1.0
Mean Val Accuracy: 0.9148777468276075
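
The fold-to-fold spread is informative too: perfect training scores paired with validation scores ranging from about 0.87 to 0.94 indicate the model memorizes each training fold while its generalization varies with the split. A minimal sketch of that spread:

[ ]: # Fold-to-fold spread of the validation scores
print("Val Accuracy Std:", cv_results["test_score"].std())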

1.4 4. Bias-Variance Indicators
[66]: mean_train = cv_results['train_score'].mean()
mean_val = cv_results['test_score'].mean()

[68]: if mean_train < 0.7 and mean_val < 0.7:
    print("High Bias - Underfitting")
elif mean_train > 0.9 and (mean_train - mean_val) > 0.1:
    print("High Variance - Overfitting")
else:
    print("Balanced - Good Fit")

Balanced - Good Fit
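
Note that these cutoffs (0.7, 0.9, and a 0.1 gap) are heuristics rather than standard values. Here the train-validation gap is about 0.085, just under the 0.1 threshold, so the rule reports a balanced fit even though the model scores perfectly on every training fold; a slightly stricter threshold would flag mild overfitting.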

1.5 5. Residual Analysis (Regression Example)


[71]: import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

[73]: #Train Test Split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)
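
Note: the digit labels in y are class indices (0-9), not continuous targets; fitting a linear regression to them here serves purely to illustrate residual plots. Also note that this split overwrites the X_train and y_train created in Section 1.1.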

[75]: #Fit Regression Model


reg = LinearRegression()

[77]: reg.fit(X_train, y_train)

[77]: LinearRegression()

[79]: #Predictions
y_pred_train = reg.predict(X_train)
y_pred_test = reg.predict(X_test)

[81]: #Residuals
residuals_train = y_train - y_pred_train
residuals_test = y_test - y_pred_test

[85]: # Plot Residuals


plt.scatter(y_pred_train, residuals_train, label="Train", alpha=0.6)
plt.scatter(y_pred_test, residuals_test, label="Test", alpha=0.6, color="red")
plt.axhline(0, color="black", linestyle="--")
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.title("Residual Analysis")
plt.legend()
plt.grid(True)

plt.show()

Interpretation:
1. Residuals randomly scattered around 0 - Good Fit
2. Large residuals on Test but small on Train - Overfitting
3. A systematic pattern in the residuals - Underfitting
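
The mean_squared_error imported above is a natural companion to the plot: comparing train and test MSE puts a number on what the scatter shows. A minimal sketch:

[ ]: # Quantify the residual plot: a test MSE much larger than train MSE signals overfitting
train_mse = mean_squared_error(y_train, y_pred_train)
test_mse = mean_squared_error(y_test, y_pred_test)
print("Train MSE:", train_mse)
print("Test MSE:", test_mse)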

1.6 6. Regularization Check


[89]: from sklearn.linear_model import LogisticRegression

[91]: # Without Regularization (C very large)


model_no_reg = LogisticRegression(C=1e6, max_iter=2000)
model_no_reg.fit(X_train, y_train)

[91]: LogisticRegression(C=1000000.0, max_iter=2000)

[93]: # With Stronger Regularization (C Small)


model_reg = LogisticRegression(C=0.01, max_iter=2000)
model_reg.fit(X_train, y_train)

[93]: LogisticRegression(C=0.01, max_iter=2000)

[97]: # Compare
print("No Reg - Train Acc:", accuracy_score(y_train, model_no_reg.predict(X_train)))
print("No Reg - Val Acc:", accuracy_score(y_val, model_no_reg.predict(X_val)))

No Reg - Train Acc: 1.0


No Reg - Val Acc: 0.9682539682539683

[99]: print("With Reg - Train Acc:", accuracy_score(y_train, model_reg.


↪predict(X_train)))

print("No Reg - Val Acc:", accuracy_score(y_val, model_reg.predict(X_val)))

With Reg - Train Acc: 0.9920445505171042
With Reg - Val Acc: 0.9682539682539683

Interpretation:
1. If validation accuracy improves when regularization is added - the model was overfitting.
2. If both train and validation accuracy drop - the model may already be underfitting.
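
To go beyond two hand-picked settings of C, scikit-learn's validation_curve can sweep a range of values and report train/validation scores for each; the value where validation accuracy peaks marks the regularization sweet spot. A minimal sketch (the range 1e-3 to 1e3 is an illustrative choice):

[ ]: # Sweep regularization strength C and compare train vs validation accuracy
from sklearn.model_selection import validation_curve
param_range = np.logspace(-3, 3, 7)
train_scores_c, val_scores_c = validation_curve(
    LogisticRegression(max_iter=2000), X, y,
    param_name="C", param_range=param_range, cv=5, scoring="accuracy"
)
for C, tr, va in zip(param_range,
                     train_scores_c.mean(axis=1),
                     val_scores_c.mean(axis=1)):
    print(f"C={C:g}: train={tr:.3f}, val={va:.3f}")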

