

EMERALD VALLEY PUBLIC SCHOOL

ACADEMIC YEAR: 2024 – 25

PROJECT REPORT ON

Loan Eligibility Prediction Using Machine Learning

Submitted to:
Mrs. Priya P, B.Sc., M.C.A., B.Ed., M.Phil.
PGT (CS)
Emerald Valley Public School,
Salem – 636008
Tamilnadu
EMERALD VALLEY PUBLIC SCHOOL

CERTIFICATE

This is to certify that SRIRAM R, Roll No. :
has successfully completed the project work entitled “Loan
Eligibility Prediction Using Machine Learning” in the subject
Data Science (844) laid down in the regulations of CBSE for the
purpose of Practical Examination in class XII to be held in
Emerald Valley Public School, Yercaud Foothills, Salem –
636008, during the academic year 2024 – 25.

Priya P
Name :
Signature :
Date :
First and foremost, I owe my wholehearted thanks to my parents for their love,
encouragement and moral support for completing this project.

I sincerely appreciate our Principal Mr. K. Manimaran for permitting access to the
well-equipped lab and the resources required for the project.

I am deeply appreciative of my project mentor, Priya P, for offering invaluable
guidance and motivation throughout the project. She carefully monitored my
progress, clarified my uncertainties, and provided constructive feedback that
improved the quality of my project.

My special thanks to my classmates, who were incredibly helpful. They assisted
me in various stages of the project by offering useful insights, engaging in
brainstorming sessions, and providing support.

The encouragement from my teacher, principal and friends was invaluable. I will
always remain grateful for their support.
Loan Eligibility Prediction Using Machine Learning

PROJECT DONE BY : SRIRAM R

CONTENTS

SERIAL NO.   DESCRIPTION

1   PROBLEM DEFINITION
2   REQUIREMENTS
3   INTRODUCTION
4   EXPLORATORY DATA ANALYSIS
5   SOURCE CODE
6   PREDICTION
7   VISUALIZATION
8   CONCLUSION
9   BIBLIOGRAPHY
PROBLEM DEFINITION
The objective of this project is to build a predictive machine
learning model that can determine whether an applicant will get
a loan approved based on several features of the applicant, such
as personal background information (e.g., gender, marital status,
income) and loan-related information (e.g., loan amount
requested).
Banks and financial institutions need efficient and accurate
methods to decide whether an applicant should be approved for
a loan. Traditional methods can be slow and subjective, leading
to potential inconsistencies in decision-making. A machine
learning-based system can help automate and optimize this
process by making predictions based on historical data.
HARDWARE AND SOFTWARE REQUIREMENTS

HARDWARE REQUIRED
⮚ Printer, to print the required documents of the project
⮚ Drive
⮚ Processor : Intel i5
⮚ RAM : 4 GB and above
⮚ Hard Disk : 1 TB

SOFTWARE REQUIRED
⮚ Operating System : Windows 11
⮚ Jupyter notebook
⮚ Python
⮚ Visual Studio Code
⮚ MS Word (for preparing and presenting the project)
INTRODUCTION
Introduction to Machine Learning

Machine learning, a subfield of artificial intelligence (AI), offers a promising
alternative to slow, manual loan assessment by enabling systems to make predictions
or decisions based on historical data. This project aims to leverage machine learning
to predict whether an applicant is eligible for loan approval based on their background
data. The predictive model can help financial institutions assess applicants more
efficiently and reduce the subjectivity and biases inherent in human decision-making.

At its core, machine learning involves the development of algorithms that can
analyze data, learn from it, and then make predictions or decisions based on that
learning. Machine learning has become a cornerstone of AI, and it is used in a
variety of fields such as healthcare, finance, e-commerce, entertainment, and more.

Key Concepts in Machine Learning

1. Data: The foundation of machine learning. It consists of input features (also
called variables or attributes) and labels (the outcome you want to predict).
The quality and quantity of data are critical for the success of machine
learning models.
2. Algorithms: Machine learning algorithms are the methods used to find
patterns in the data. There are different types of algorithms based on the
problem you're trying to solve.
3. Model: A model is the output of a machine learning algorithm after it has
been trained on data. It represents the patterns or relationships the algorithm
has learned and is used to make predictions or decisions on new data.
4. Training: The process of feeding data into a machine learning algorithm to
help it learn the underlying patterns.
5. Testing: After a model has been trained, it is evaluated on new, unseen data
(test data) to assess its performance and ability to generalize to real-world
situations.

Types of Machine Learning

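Machine learning is typically classified into three major categories: supervised
learning, unsupervised learning, and reinforcement learning. The first two are
described below: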

1. Supervised Learning:
o In supervised learning, the algorithm is trained on a labeled dataset,
meaning the input data is paired with the correct output (or label).
o The goal is for the model to learn a mapping from inputs to outputs,
so it can predict the output for new, unseen inputs.
o Examples: Classification (e.g., spam email detection) and Regression
(e.g., predicting house prices based on features like size, location,
etc.).
2. Unsupervised Learning:
o In unsupervised learning, the algorithm is given data without labels
and must find the underlying structure or patterns in the data on its
own.
o The goal is often to discover hidden structures like clusters or
associations in the data.
o Examples: Clustering (e.g., grouping similar customers based on
purchasing behavior) and Dimensionality Reduction (e.g., reducing
the number of variables in a dataset).
Data Requirements

o Gender (Categorical):
o Type: Nominal (e.g., Male, Female)
o Description: The gender of the applicant.
o Example: Male, Female
o Marital Status (Categorical):
o Type: Nominal (e.g., Married, Not Married)
o Description: The marital status of the applicant.
o Example: Married, Not Married
o Applicant Income (Numerical):
o Type: Continuous (e.g., monthly or annual income in dollars/rupees)
o Description: The income of the applicant.
o Example: 35000, 65000, 50000
o Co applicant Income (Numerical, optional):
o Type: Continuous (e.g., monthly or annual income in dollars/rupees)
o Description: The income of the co applicant, if any (e.g., spouse,
family member).
o Example: 30000, 20000, None (if no co applicant)
o This feature could be optional, and if present, it could be used in
combination with the applicant's income.
o Loan Amount (Numerical):
o Type: Continuous (e.g., loan amount requested in dollars/rupees)
o Description: The total amount of loan that the applicant is requesting.
o Example: 200000, 500000, 150000
o Loan Term (Categorical or Numerical):
o Type: Ordinal or Continuous (e.g., 5 years, 10 years)
o Description: The tenure (duration) for which the loan is requested.
o Example: 5 years, 10 years
o Credit History (Categorical or Binary):
o Type: Nominal or Binary (e.g., Good Credit History, Bad Credit
History)
o Description: The credit score or credit history of the applicant (if
available). A value like 1 (Good) or 0 (Bad) or a categorical scale can
be used.
o Example: Good, Poor, 0 (no credit history), 1 (positive history)
o Education (Categorical):
o Type: Nominal (e.g., Graduate, Not Graduate)
o Description: The education level of the applicant.
o Example: Graduate, Not Graduate
o Employment Status (Categorical):
o Type: Nominal (e.g., Employed, Self-Employed, Unemployed)
o Description: The employment status of the applicant.
o Example: Employed, Self-Employed, Unemployed
o Property Area (Categorical):
o Type: Nominal (e.g., Urban, Semi urban, Rural)
o Description: The area in which the applicant lives.
o Example: Urban, Semi urban, Rural
o Dependents (Categorical or Numerical):
o Type: Ordinal or Numerical (e.g., number of dependents)
o Description: The number of dependents or people financially
supported by the applicant
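
As a rough illustration of how the features listed above could be laid out as a table, the sketch below builds a tiny pandas DataFrame and one-hot encodes its nominal columns. All applicant values are invented, and the column names (including Property_Area) are assumptions that loosely follow the feature list rather than the exact headers of the project's dataset.

```python
import pandas as pd

# Hypothetical applicants; values are invented purely for illustration.
applicants = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Married": ["Yes", "No", "Yes"],
    "ApplicantIncome": [35000, 65000, 50000],
    "LoanAmount": [200000, 500000, 150000],
    "Property_Area": ["Urban", "Rural", "Semiurban"],
})

# Nominal features have no natural order, so one-hot encoding is a common choice.
nominal_cols = ["Gender", "Married", "Property_Area"]
encoded = pd.get_dummies(applicants, columns=nominal_cols)

print(encoded.columns.tolist())
```

Each nominal column becomes one indicator column per category, while the numerical columns pass through unchanged.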
Types of Models Used:

In this project, we use the Support Vector Classifier (SVC) model to predict loan
eligibility. SVC is a supervised machine learning algorithm that belongs to the
Support Vector Machine (SVM) family. It is widely used for classification tasks
due to its ability to find the optimal decision boundary (or hyperplane) that
separates classes.

For this task, the classes are loan approved and loan rejected, based on applicant
data like gender, marital status, income, etc.

Why SVC?

● Effectiveness with high-dimensional data: SVC performs well when there
are many features (input variables), as it works by finding the best separating
hyperplane in a high-dimensional space.
● Non-linear relationships: The RBF (Radial Basis Function) kernel used
in SVC helps the model handle non-linear decision boundaries, which is
useful when the relationship between the features and the target is complex.
● Robustness to overfitting: SVC is less prone to overfitting, especially when
the kernel and its parameters (such as C and gamma) are tuned appropriately.

How it Works:

1. Training: SVC is trained on the preprocessed data, where it learns to
classify loan approval based on applicant features.
2. Evaluation: The model's performance is evaluated using metrics like ROC
AUC, accuracy, precision, and the confusion matrix, which help assess its
ability to correctly classify loan approvals and rejections.
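
To make the point about non-linear decision boundaries concrete, here is a minimal sketch comparing a linear kernel with the RBF kernel, K(x, x') = exp(-gamma * ||x - x'||^2), on synthetic data. The concentric-circle dataset from scikit-learn is an assumption chosen only because no straight line can separate it; it is not the loan data, and the accuracies it produces say nothing about the project's results.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data arranged in concentric circles (not loan data).
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Same algorithm, two kernels: a straight-line boundary vs. an RBF boundary.
linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)
rbf_acc = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)

print(f"linear kernel accuracy: {linear_acc:.2f}")
print(f"RBF kernel accuracy:    {rbf_acc:.2f}")
```

On data like this the RBF kernel should score far higher than the linear one, which is the behaviour the "Non-linear relationships" point describes.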
DATA SET
1. Loan data

Link: https://media.geeksforgeeks.org/wp-content/uploads/20240903120527/loan_data.csv

Exploratory Data Analysis (EDA)


1. Data Collection:

● Obtain the loan eligibility dataset, typically available as a CSV file. This
dataset contains various features of loan applicants, such as their gender,
marital status, income, credit history, etc., and the target variable,
Loan_Status (approved or not). The dataset can be sourced from platforms
like Kaggle or any public repository.

2. Data Preprocessing:

● Handling Missing Data:

Identify columns with missing values (e.g., LoanAmount, ApplicantIncome) and
decide whether to fill missing values with statistical measures (mean, median, or
mode) or remove rows/columns with excessive missing data.

● Categorical Encoding:
Convert categorical features like Gender, Married, Education, and Loan_Status
into numerical format using techniques such as Label Encoding or One-Hot
Encoding.

● Feature Engineering:

Create new features that may help in prediction, such as combining the applicant's
and co-applicant's incomes into a single total-income feature if both are available.

● Scaling/Normalization:

Scale continuous features (e.g., ApplicantIncome, LoanAmount) using
StandardScaler to ensure that the model isn't biased due to the varying scales of
the input features.
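
The scaling step can be checked by hand. StandardScaler replaces each value x in a column with z = (x - mean) / std, using that column's mean and standard deviation. The sketch below, which uses made-up numbers rather than the loan data, compares the manual formula with scikit-learn's result.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical incomes and loan amounts on very different scales.
values = np.array([[35000.0, 200.0],
                   [65000.0, 500.0],
                   [50000.0, 150.0],
                   [42000.0, 320.0]])

# Manual z-score: subtract the column mean, divide by the column std.
manual = (values - values.mean(axis=0)) / values.std(axis=0)

# StandardScaler applies the same formula column by column.
scaled = StandardScaler().fit_transform(values)

print(np.allclose(manual, scaled))
```

After scaling, every column has mean 0 and standard deviation 1, so no single feature dominates the distance calculations inside the SVC.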

3. Exploratory Data Analysis (EDA):

● Data Shape and Structure:

Check the shape of the dataset using df.shape to understand the number of rows
and columns, and use df.info() to inspect column types and identify any missing
values or erroneous data.

● Descriptive Statistics:

Use df.describe() to get summary statistics (mean, median, standard deviation) for
the numerical columns, helping to understand the central tendency and distribution
of data.
4. Data Visualization:

● Pie Chart for Target Variable:

Visualize the distribution of Loan_Status (approved vs. rejected) using a pie chart
to understand the class balance in the dataset.

● Count Plots for Categorical Variables:

Plot count plots for categorical features such as Gender, Marital Status, and
Education using seaborn.countplot() to analyze the distribution of these features
with respect to loan approval.

● Distribution Plots for Numerical Variables:

Use seaborn.histplot() with kde=True to visualize the distribution of continuous
variables like ApplicantIncome and LoanAmount, helping identify any skewness or outliers.

5. Analyzing Relationships Between Features:

● Grouping Data:

Group the data by categorical variables (e.g., Gender, Married) and compute the
mean of LoanAmount to observe if certain groups tend to request larger loans.

● Correlation Analysis:

Create a heatmap to visualize correlations between numerical features like
ApplicantIncome, LoanAmount, and Credit_History to check for
multicollinearity or strong relationships between predictors.
6. Outlier Detection:

● Boxplots for Outliers:

Use seaborn.boxplot() to identify extreme outliers in numerical columns like
ApplicantIncome and LoanAmount, which could negatively affect model
performance. Consider removing or capping extreme values.
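
One common way to turn the boxplot idea into numbers is the interquartile range (IQR) rule, where values beyond 1.5 × IQR from the quartiles are treated as outliers. The sketch below is an illustration under that assumption, with invented income values; the report does not state which capping rule, if any, was applied.

```python
import pandas as pd

# Hypothetical incomes with one extreme value.
income = pd.Series([2500, 3000, 3200, 4100, 4500, 5000, 81000])

q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag points outside the whiskers, then cap them instead of dropping rows.
outliers = income[(income < lower) | (income > upper)]
capped = income.clip(lower=lower, upper=upper)

print(outliers.tolist())
print(capped.max())
```

Capping keeps every row in the dataset, which matters when the dataset is small.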

7. Handling Class Imbalance:

● Check Target Distribution:

Examine the distribution of Loan_Status using count plots. If the data is
imbalanced (e.g., more loan rejections), apply techniques like
RandomOverSampler to balance the dataset.
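
The idea behind random oversampling can be sketched with pandas alone: rows of the minority class are drawn again, with replacement, until both classes are the same size. This is a simplified illustration of the concept using invented data, not the internal implementation of RandomOverSampler.

```python
import pandas as pd

# Hypothetical imbalanced training labels: 6 approved (1), 2 rejected (0).
train = pd.DataFrame({
    "ApplicantIncome": [30, 45, 50, 62, 38, 55, 20, 25],
    "Loan_Status":     [1, 1, 1, 1, 1, 1, 0, 0],
})

counts = train["Loan_Status"].value_counts()
minority_label = counts.idxmin()
n_extra = int(counts.max() - counts.min())

# Resample minority rows with replacement until both classes are the same size.
minority_rows = train[train["Loan_Status"] == minority_label]
extra = minority_rows.sample(n=n_extra, replace=True, random_state=0)
balanced = pd.concat([train, extra], ignore_index=True)

print(balanced["Loan_Status"].value_counts().to_dict())
```

Because the added rows are duplicates, oversampling is normally applied to the training set only, so that copies of the same applicant never appear in both training and validation data.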

8. Feature Selection:

● Drop Irrelevant Features:

Remove columns that are irrelevant for predicting loan approval (e.g., ID, columns
with too many missing values, or columns with no meaningful contribution to the
model).

● Correlation Filtering:
Remove highly correlated features (e.g., features with correlation > 0.9) to reduce
multicollinearity, which could hinder the model's ability to learn effectively.
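
Correlation filtering can be sketched as follows. The data is synthetic, and the 0.9 threshold simply mirrors the example value mentioned above; the feature names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
income = rng.normal(50, 10, size=200)

# Hypothetical features: income_copy is nearly a duplicate of income.
features = pd.DataFrame({
    "income": income,
    "income_copy": income * 2 + rng.normal(0, 0.1, size=200),
    "loan_amount": rng.normal(150, 30, size=200),
})

# Keep only the upper triangle so each pair is checked once.
corr = features.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop any column that correlates above 0.9 with an earlier column.
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = features.drop(columns=to_drop)

print(to_drop)
```

Only one feature of each highly correlated pair is removed, so the information it carries is still available to the model through its partner.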

9. Data Splitting:

● Split Data into Features and Target:

Separate the dataset into input features (X) and target variable (Y, which is
Loan_Status).

● Training and Validation Split:

Split the data into training and validation sets (e.g., 80% for training, 20% for
validation) using train_test_split() to ensure proper model validation.
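
A minimal sketch of an 80/20 split is shown here on made-up data. The stratify argument is an optional refinement added for this illustration (the project's own split does not use it); it keeps the proportion of approved and rejected loans the same in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 70 approved (1) and 30 rejected (0).
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 70 + [0] * 30)

# stratify=y keeps the 70/30 class ratio in both subsets.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=10
)

print(X_train.shape, X_val.shape)
print(y_train.mean(), y_val.mean())
```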

10. Model Development (Post-EDA):

● Data Balancing:

If required, apply random oversampling (RandomOverSampler) or other balancing methods
to the training set to mitigate the effects of class imbalance before training the model.

● Feature Scaling:

Apply StandardScaler to normalize features, ensuring consistent scaling for
models that are sensitive to input feature magnitudes.
SOURCE CODE
1. Data Collection:
python
import pandas as pd

# Load the dataset
df = pd.read_csv('loan_data.csv')

# View the first few rows of the dataset
df.head()

2. Data Preprocessing:

Handling Missing Data:


python
# Check for missing values
df.isnull().sum()

# Fill missing values in the numeric LoanAmount column with its median
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].median())

# Fill missing values for categorical columns with the most frequent value
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
df['Married'] = df['Married'].fillna(df['Married'].mode()[0])

Categorical Encoding:
python
from sklearn.preprocessing import LabelEncoder

# Encode categorical variables
label_encoder = LabelEncoder()
df['Gender'] = label_encoder.fit_transform(df['Gender'])
df['Married'] = label_encoder.fit_transform(df['Married'])
df['Education'] = label_encoder.fit_transform(df['Education'])
df['Self_Employed'] = label_encoder.fit_transform(df['Self_Employed'])
df['Loan_Status'] = label_encoder.fit_transform(df['Loan_Status'])

Feature Engineering:
python
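# Create a TotalIncome feature by combining applicant and co-applicant income.
# 'CoapplicantIncome' is an assumed column name; skipped if the dataset lacks it.
if 'CoapplicantIncome' in df.columns:
    df['TotalIncome'] = df['ApplicantIncome'] + df['CoapplicantIncome']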

Scaling/Normalization:
python
from sklearn.preprocessing import StandardScaler

# Scale the continuous features using StandardScaler
scaler = StandardScaler()
cols_to_scale = ['ApplicantIncome', 'LoanAmount']
df[cols_to_scale] = scaler.fit_transform(df[cols_to_scale])
3. Exploratory Data Analysis (EDA):

Data Shape and Structure:


python
# Get the shape of the dataset
df.shape

# Get info about the dataset
df.info()

# Get descriptive statistics for the numerical features
df.describe()

Visualizing Loan_Status Distribution:


python
import matplotlib.pyplot as plt

# Pie chart for Loan_Status distribution
loan_status = df['Loan_Status'].value_counts()
plt.pie(loan_status.values, labels=loan_status.index, autopct='%1.1f%%')
plt.title('Loan Approval Status')
plt.show()

Count Plots for Categorical Variables:


python
import seaborn as sb

# Count plots for Gender and Marital Status with Loan_Status as hue
plt.figure(figsize=(15, 5))
for i, col in enumerate(['Gender', 'Married']):
    plt.subplot(1, 2, i+1)
    sb.countplot(data=df, x=col, hue='Loan_Status')
plt.tight_layout()
plt.show()

Distribution Plots for Numerical Variables:


python
# Distribution of ApplicantIncome and LoanAmount
plt.figure(figsize=(15, 5))
for i, col in enumerate(['ApplicantIncome', 'LoanAmount']):
    plt.subplot(1, 2, i+1)
    sb.histplot(df[col], kde=True)
plt.tight_layout()
plt.show()

4. Analyzing Relationships Between Features:

Grouping Data:
python
# Group by Gender and calculate mean loan amount
df.groupby('Gender')['LoanAmount'].mean()

# Group by Married and Gender and calculate mean loan amount
df.groupby(['Married', 'Gender'])['LoanAmount'].mean()
Correlation Heatmap:
python
# Highlight pairs of numerical features whose correlation exceeds 0.8
sb.heatmap(df.corr(numeric_only=True) > 0.8, annot=True, cbar=False)
plt.show()

5. Outlier Detection:

Boxplots for Outliers:


python
# Plot boxplots for ApplicantIncome and LoanAmount
plt.figure(figsize=(15, 5))
for i, col in enumerate(['ApplicantIncome', 'LoanAmount']):
    plt.subplot(1, 2, i+1)
    sb.boxplot(x=df[col])
plt.tight_layout()
plt.show()

6. Handling Class Imbalance:


python
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split

# Separate features and target. 'Loan_ID' is an assumed name for the
# identifier column; errors='ignore' skips it if the dataset has no such column.
X = df.drop(columns=['Loan_Status', 'Loan_ID'], errors='ignore')
y = df['Loan_Status']

# Split data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=10)

# Apply RandomOverSampler to balance the training set only
ros = RandomOverSampler(sampling_strategy='minority', random_state=0)
X_train, y_train = ros.fit_resample(X_train, y_train)

# Verify the shape after oversampling
X_train.shape, y_train.shape

7. Model Training and Evaluation:

Model Training (Support Vector Classifier - SVC):


python
from sklearn.svm import SVC
from sklearn.metrics import (roc_auc_score, confusion_matrix,
                             classification_report)

# Initialize SVC model


model = SVC(kernel='rbf', random_state=10)

# Train the model


model.fit(X_train, y_train)

# Evaluate on training data


train_accuracy = roc_auc_score(y_train, model.predict(X_train))
print(f'Training ROC AUC Score: {train_accuracy}')
# Evaluate on validation data
val_accuracy = roc_auc_score(y_val, model.predict(X_val))
print(f'Validation ROC AUC Score: {val_accuracy}')
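
ROC AUC is usually computed from continuous scores rather than hard 0/1 predictions, because the metric measures how well the model ranks samples across all thresholds. The sketch below shows both variants side by side. It uses a synthetic dataset from make_classification as a stand-in for the loan data, so the printed numbers are not the project's results.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the loan data (not the real dataset).
X, y = make_classification(n_samples=500, n_features=6, random_state=10)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=10
)

model = SVC(kernel="rbf", random_state=10).fit(X_train, y_train)

# Hard 0/1 predictions give a single operating point on the ROC curve.
auc_from_labels = roc_auc_score(y_val, model.predict(X_val))

# Continuous scores let roc_auc_score rank samples across all thresholds.
auc_from_scores = roc_auc_score(y_val, model.decision_function(X_val))

print(f"AUC from predict():           {auc_from_labels:.3f}")
print(f"AUC from decision_function(): {auc_from_scores:.3f}")
```

For SVC, decision_function() returns the signed distance to the separating boundary, which serves as the continuous score without needing probability estimates.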

Confusion Matrix:
python
# Generate confusion matrix for validation set
cm = confusion_matrix(y_val, model.predict(X_val))

# Plot confusion matrix


plt.figure(figsize=(6, 6))
sb.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

Classification Report:
python
# Get the classification report
print(classification_report(y_val, model.predict(X_val)))
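
The precision and recall values in a classification report follow directly from the four counts in the confusion matrix. The worked example below uses invented labels and predictions (not the project's validation results) and checks the hand calculation against scikit-learn.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical validation labels and predictions (1 = approved, 0 = rejected).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of the predicted approvals, how many were right
recall = tp / (tp + fn)      # of the true approvals, how many were found

print(tn, fp, fn, tp)
print(round(precision, 3), round(recall, 3))
```

Here 5 of the 6 predicted approvals are correct and 5 of the 6 true approvals are found, so both precision and recall equal 5/6.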
Prediction
# Placeholder input: one row from the validation set stands in for a new
# applicant. Real input must have the same columns, encoding, and scaling
# as the training data.
input_data = X_val.iloc[[0]]
# making prediction
prediction = model.predict(input_data)
print(prediction)
# 1 means approved, assuming LabelEncoder mapped 'Y' to 1 and 'N' to 0
if prediction[0] == 1:
    print('eligible')
else:
    print('not eligible')
THE OUTPUT OF THE PROGRAM TELLS US WHETHER THE PERSON
WITH THE GIVEN LOAN DATA IS ELIGIBLE OR NOT.

VISUALIZATION
Pie chart for the Loan_Status column:

Count plots grouped by the Loan_Status categories:

Boxplots used to find outliers in the numerical columns:

Confusion matrix for the validation set, computed with confusion_matrix
from sklearn.metrics and plotted as a seaborn heatmap:
Conclusion:

In this project, we developed a machine learning model to
predict loan eligibility based on factors like gender,
marital status, income, and loan amount. After
preprocessing the data, including handling missing values
and balancing the dataset, we applied a Support Vector
Classifier (SVC) to build the prediction model.

The model achieved reasonable performance on the
training data, with evaluation metrics like the ROC AUC
score helping to assess its effectiveness. While the dataset
was small and had limited features, it provided valuable
insights into the factors influencing loan approvals, such
as gender and marital status.

Future improvements could include incorporating more
features and exploring other machine learning models to
enhance prediction accuracy.
Bibliography:
1. Source code: www.geeksforgeeks.org
2. Images: www.geeksforgeeks.org
A few images used in this document were captured as screenshots
using the Snipping Tool on a personal computer.