Overfitting and Underfitting in Python
August 25, 2025
1 Overfitting and Underfitting using Python
1. Compare Training vs Validation/Testing Performance
2. Learning Curves (Training vs Validation Error)
3. Cross Validation
4. Bias-Variance Indicators
5. Residual Analysis
6. Regularization Check
1.1 1. Compare Training vs Validation / Testing Performance
[6]: from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits
[8]: #Load Dataset
X, y = load_digits(return_X_y=True)
[10]: # Split Train / Val / Test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.3, random_state=42)
[12]: X.shape
[12]: (1797, 64)
[14]: y.shape
[14]: (1797,)
[16]: X_train.shape
[16]: (1257, 64)
[18]: X_temp.shape
[18]: (540, 64)
[20]: X_val.shape
[20]: (378, 64)
[22]: X_test.shape
[22]: (162, 64)
[24]: X
[24]: array([[ 0., 0., 5., …, 0., 0., 0.],
[ 0., 0., 0., …, 10., 0., 0.],
[ 0., 0., 0., …, 16., 9., 0.],
…,
[ 0., 0., 1., …, 6., 0., 0.],
[ 0., 0., 2., …, 12., 0., 0.],
[ 0., 0., 10., …, 12., 1., 0.]])
[26]: y
[26]: array([0, 1, 2, …, 8, 9, 8])
[28]: #Train Model
model = RandomForestClassifier()
[30]: model.fit(X_train, y_train)
[30]: RandomForestClassifier()
[32]: y_pred_train = model.predict(X_train)
[34]: y_pred_test = model.predict(X_test)
[36]: y_pred_val = model.predict(X_val)
[38]: train_accuracy = accuracy_score(y_train, y_pred_train)
val_accuracy = accuracy_score(y_val, y_pred_val)
test_accuracy = accuracy_score(y_test, y_pred_test)
[40]: print("Training Accuracy:", train_accuracy)
Training Accuracy: 1.0
[42]: print("Validation Accuracy:", val_accuracy)
Validation Accuracy: 0.9735449735449735
[46]: print("Test Accuracy:", test_accuracy)
Test Accuracy: 0.9691358024691358
[48]: # Compare
print("Training Accuracy:", train_accuracy)
print("Validation Accuracy:", val_accuracy)
print("Test Accuracy:", test_accuracy)
Training Accuracy: 1.0
Validation Accuracy: 0.9735449735449735
Test Accuracy: 0.9691358024691358
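The comparison above can be reduced to a single number per held-out set: the gap between training accuracy and held-out accuracy. The following is a minimal sketch that reuses the accuracies printed above; the helper name `generalization_gap` is hypothetical and not part of scikit-learn.

```python
# Accuracies copied from the printed output above
train_accuracy = 1.0
val_accuracy = 0.9735449735449735
test_accuracy = 0.9691358024691358


def generalization_gap(train_score, heldout_score):
    """Hypothetical helper: training score minus held-out score."""
    return train_score - heldout_score


val_gap = generalization_gap(train_accuracy, val_accuracy)
test_gap = generalization_gap(train_accuracy, test_accuracy)

print(f"Train - Val gap:  {val_gap:.4f}")
print(f"Train - Test gap: {test_gap:.4f}")
```

A gap of roughly three percentage points alongside a perfect training score points to mild overfitting at most. How large a gap is acceptable depends on the task, so any cutoff is a judgment call.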
1.2 2. Learning Curves (Training vs Validation Error)
[52]: import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LogisticRegression
[54]: #Learning Curves
train_sizes, train_scores, val_scores = learning_curve(
LogisticRegression(max_iter=2000), X, y, cv=5, scoring = "accuracy",
train_sizes=np.linspace(0.1, 1.0, 10), n_jobs=-1
)
[56]: #Mean Errors
train_error = 1 - np.mean(train_scores, axis=1)
val_error = 1 - np.mean(val_scores, axis=1)
[58]: #Plot
plt.plot(train_sizes, train_error, 'o-', label="Training Error")
plt.plot(train_sizes, val_error, 'o-', label="Validation Error")
plt.xlabel("Training Size")
plt.ylabel("Error (1 - Accuracy)")
plt.title("Learning Curves")
plt.legend()
plt.grid(True)
plt.show()
1.3 3. Cross Validation
[61]: from sklearn.model_selection import cross_validate
cv_results = cross_validate(
LogisticRegression(max_iter=2000), X, y, cv=5,
return_train_score=True, scoring="accuracy"
)
[63]: print("Train Scores:", cv_results["train_score"])
print("Test Scores:", cv_results["test_score"])
print("Mean Train Accuracy:", cv_results["train_score"].mean())
print("Mean Val Accuracy:", cv_results["test_score"].mean())
Train Scores: [1. 1. 1. 1. 1.]
Test Scores: [0.92222222 0.87222222 0.94150418 0.94150418 0.89693593]
Mean Train Accuracy: 1.0
Mean Val Accuracy: 0.9148777468276075
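Beyond the mean, the spread of the per-fold scores is informative: a wide spread means the model's performance depends heavily on which samples it was trained on. The sketch below summarizes the fold scores printed above; the variable names are illustrative, and the values are copied from that output rather than recomputed.

```python
import numpy as np

# Fold scores copied from the printed cross_validate output above
train_scores = np.array([1.0, 1.0, 1.0, 1.0, 1.0])
val_scores = np.array([0.92222222, 0.87222222, 0.94150418, 0.94150418, 0.89693593])

val_mean = val_scores.mean()
val_std = val_scores.std()                        # population std across the 5 folds
val_range = val_scores.max() - val_scores.min()   # best fold minus worst fold
mean_gap = train_scores.mean() - val_mean         # average train/validation gap

print(f"Mean Val Accuracy: {val_mean:.4f}")
print(f"Std across folds:  {val_std:.4f}")
print(f"Range across folds: {val_range:.4f}")
print(f"Mean Train - Val gap: {mean_gap:.4f}")
```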
1.4 4. Bias-Variance Indicators
[66]: mean_train = cv_results['train_score'].mean()
mean_val = cv_results['test_score'].mean()
[68]: if mean_train < 0.7 and mean_val < 0.7:
print("High Bias - Underfitting")
elif mean_train > 0.9 and (mean_train - mean_val) > 0.1:
print("High Variance - Overfitting")
else:
print("Balanced - Good Fit")
Balanced - Good Fit
1.5 5. Residual Analysis (Regression Example)
[71]: import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
[73]: #Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
[75]: #Fit Regression Model
reg = LinearRegression()
[77]: reg.fit(X_train, y_train)
[77]: LinearRegression()
[79]: #Predictions
y_pred_train = reg.predict(X_train)
y_pred_test = reg.predict(X_test)
[81]: #Residuals
residuals_train = y_train - y_pred_train
residuals_test = y_test - y_pred_test
[85]: # Plot Residuals
plt.scatter(y_pred_train, residuals_train, label="Train", alpha=0.6)
plt.scatter(y_pred_test, residuals_test, label="Test", alpha=0.6, color="red")
plt.axhline(0, color="black", linestyle="--")
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.title("Residual Analysis")
plt.legend()
plt.grid(True)
plt.show()
Interpretation
1. Randomly scattered around 0 - Good Fit
2. Large residuals on test but small on train - Overfitting
3. Systematic pattern - Underfitting
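The "systematic pattern" case in the interpretation above is easiest to see on data where the true relationship is known. The sketch below uses synthetic quadratic data, which is an assumption made for illustration and not the digits data used in this notebook. It fits a straight line and a quadratic with `np.polyfit`, then compares the residuals; the helper `fit_residuals` is hypothetical.

```python
import numpy as np

# Synthetic data (assumed for illustration): quadratic signal plus noise
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
y = x**2 + rng.normal(scale=0.5, size=x.size)


def fit_residuals(degree):
    """Least-squares polynomial fit of the given degree; returns y - prediction."""
    coeffs = np.polyfit(x, y, deg=degree)
    return y - np.polyval(coeffs, x)


res_linear = fit_residuals(1)      # too simple: cannot represent the curvature
res_quadratic = fit_residuals(2)   # matches the form of the signal

# Residual spread: the underfit model leaves much larger residuals
print("Residual std (linear):   ", res_linear.std())
print("Residual std (quadratic):", res_quadratic.std())

# Leftover structure: correlation of the residuals with the missing x**2 term.
# For the quadratic fit this is zero by construction of least squares.
corr_linear = np.corrcoef(res_linear, x**2)[0, 1]
corr_quadratic = np.corrcoef(res_quadratic, x**2)[0, 1]
print("Corr with x^2 (linear):   ", corr_linear)
print("Corr with x^2 (quadratic):", corr_quadratic)
```

The linear model's residuals track the missing curvature, which would show up as a U shape in a residual plot. The quadratic model's residuals are left with only the noise.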
1.6 6. Regularization Check
[89]: from sklearn.linear_model import LogisticRegression
[91]: # Without Regularization (C very large)
model_no_reg = LogisticRegression(C=1e6, max_iter=2000)
model_no_reg.fit(X_train, y_train)
[91]: LogisticRegression(C=1000000.0, max_iter=2000)
[93]: # With Stronger Regularization (C Small)
model_reg = LogisticRegression(C=0.01, max_iter=2000)
model_reg.fit(X_train, y_train)
[93]: LogisticRegression(C=0.01, max_iter=2000)
[97]: # Compare
print("No Reg - Train Acc:", accuracy_score(y_train, model_no_reg.predict(X_train)))
print("No Reg - Val Acc:", accuracy_score(y_val, model_no_reg.predict(X_val)))
No Reg - Train Acc: 1.0
No Reg - Val Acc: 0.9682539682539683
[99]: print("With Reg - Train Acc:", accuracy_score(y_train, model_reg.predict(X_train)))
print("With Reg - Val Acc:", accuracy_score(y_val, model_reg.predict(X_val)))
With Reg - Train Acc: 0.9920445505171042
With Reg - Val Acc: 0.9682539682539683
Interpretation:
1. If validation accuracy improves when regularization is added - the model was overfitting
2. If both train and validation accuracy drop - the model may already be underfitting
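Comparing only two values of `C` can miss the point where regularization starts to help or hurt. A possible extension is to sweep several values and track both accuracies, as sketched below. The grid of `C` values and the single train/validation split are assumptions chosen for brevity, so the scores will not match the ones printed above exactly.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)

# Assumption: a simple two-way split instead of the train/val/test split used earlier
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42
)

results = []
for C in [0.001, 0.01, 0.1, 1.0]:  # smaller C = stronger regularization
    clf = LogisticRegression(C=C, max_iter=2000)
    clf.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, clf.predict(X_train))
    val_acc = accuracy_score(y_val, clf.predict(X_val))
    results.append((C, train_acc, val_acc))
    print(f"C={C:<6} Train Acc: {train_acc:.4f}  Val Acc: {val_acc:.4f}")

# Pick the C with the highest validation accuracy
best_C, _, best_val_acc = max(results, key=lambda r: r[2])
print("Best C by validation accuracy:", best_C)
```

If validation accuracy peaks at an intermediate `C` while training accuracy keeps rising with larger `C`, that matches the overfitting case in the interpretation above.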