d.sce project (2)
PROJECT REPORT ON
Submitted to:
Mrs. Priya P, B.Sc., M.C.A., B.Ed., M.Phil.
PGT (CS)
Emerald Valley Public School,
Salem – 636008
Tamil Nadu
EMERALD VALLEY PUBLIC SCHOOL
CERTIFICATE
Priya P
Name :
Signature :
Date :
First and foremost, I owe my wholehearted thanks to my parents for their love,
encouragement and moral support for completing this project.
I sincerely appreciate our Principal Mr. K. Manimaran for permitting access to the
well-equipped lab and the resources required for the project.
The encouragement from my teacher, principal and friends was invaluable. I will
always remain grateful for their support.
Loan Eligibility Prediction Using
Machine Learning
CONTENT
SERIAL NO.    DESCRIPTION
1             PROBLEM DEFINITION
2             REQUIREMENTS
3             INTRODUCTION
5             SOURCE CODE
6             PREDICTION
7             VISUALIZATION
8             CONCLUSION
9             BIBLIOGRAPHY
PROBLEM DEFINITION
The objective of this project is to build a predictive machine
learning model that can determine whether an applicant will get
a loan approved based on several features of the applicant, such
as personal background information (e.g., gender, marital status,
income) and loan-related information (e.g., loan amount
requested).
Banks and financial institutions need efficient and accurate
methods to decide whether an applicant should be approved for
a loan. Traditional methods can be slow and subjective, leading
to potential inconsistencies in decision-making. A machine
learning-based system can help automate and optimize this
process by making predictions based on historical data.
HARDWARE AND SOFTWARE
REQUIREMENTS
HARDWARE REQUIRED
⮚ Printer, to print the required documents of the project
⮚ Drive
⮚ Processor : Intel i5
⮚ RAM : 4 GB and above
⮚ Hard Disk : 1 TB
SOFTWARE REQUIRED
⮚ Operating System : Windows 11
⮚ Jupyter Notebook
⮚ Python
⮚ Visual Studio Code
⮚ MS Word (for preparing and presenting the project)
INTRODUCTION
Introduction to Machine Learning
At its core, machine learning involves the development of algorithms that can
analyze data, learn from it, and then make predictions or decisions based on that
learning. Machine learning has become a cornerstone of AI, and it is used in a
variety of fields such as healthcare, finance, e-commerce, entertainment, and more.
1. Supervised Learning:
o In supervised learning, the algorithm is trained on a labeled dataset,
meaning the input data is paired with the correct output (or label).
o The goal is for the model to learn a mapping from inputs to outputs,
so it can predict the output for new, unseen inputs.
o Examples: Classification (e.g., spam email detection) and Regression
(e.g., predicting house prices based on features like size, location,
etc.).
2. Unsupervised Learning:
o In unsupervised learning, the algorithm is given data without labels
and must find the underlying structure or patterns in the data on its
own.
o The goal is often to discover hidden structures like clusters or
associations in the data.
o Examples: Clustering (e.g., grouping similar customers based on
purchasing behavior) and Dimensionality Reduction (e.g., reducing
the number of variables in a dataset).
Data Requirements
In this project, we use the Support Vector Classifier (SVC) model to predict loan
eligibility. SVC is a supervised machine learning algorithm that belongs to the
Support Vector Machine (SVM) family. It is widely used for classification tasks
due to its ability to find the optimal decision boundary (or hyperplane) that
separates classes.
For this task, the classes are loan approved and loan rejected, based on applicant
data like gender, marital status, income, etc.
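As a rough illustration of the decision-boundary idea, the sketch below fits scikit-learn's SVC with a linear kernel on a tiny made-up two-feature dataset and prints the learned hyperplane. The data points are invented for illustration and are not taken from the loan dataset.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up 2-D points: class 0 sits near the origin, class 1 further out
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
              [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]])
y = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel makes the separating hyperplane directly inspectable
clf = SVC(kernel="linear")
clf.fit(X, y)

# For a linear kernel the hyperplane is w . x + b = 0
print("w =", clf.coef_[0], "b =", clf.intercept_[0])
# The support vectors are the training points that define the margin
print("support vectors:\n", clf.support_vectors_)

new_points = np.array([[2.0, 2.0], [6.0, 7.0]])
predictions = clf.predict(new_points)
print("predictions:", predictions)
```

With other kernels (for example the default "rbf") the boundary is no longer a straight line in the original feature space, so coef_ is not available.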
Why SVC?
How it Works:
Link: https://media.geeksforgeeks.org/wp-content/uploads/20240903120527/loan_data.csv
● Obtain the loan eligibility dataset, typically available as a CSV file. This
dataset contains various features of loan applicants, such as their gender,
marital status, income, credit history, etc., and the target variable,
Loan_Status (approved or not). The dataset can be sourced from platforms
like Kaggle or any public repository.
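A minimal sketch of the loading step with pandas is given below. The helper name load_loan_data is hypothetical, and the sample rows and the Loan_ID and ApplicantIncome column names are assumptions used only so the example runs without downloading the real file.

```python
import io
import pandas as pd

def load_loan_data(source):
    """Read the loan CSV from a file path, URL, or file-like object."""
    return pd.read_csv(source)

# Small in-memory stand-in for the real CSV (assumed column names, made-up rows)
sample_csv = io.StringIO(
    "Loan_ID,Gender,Married,ApplicantIncome,LoanAmount,Loan_Status\n"
    "LP001,Male,Yes,4583,128,N\n"
    "LP002,Female,No,3000,66,Y\n"
)

df = load_loan_data(sample_csv)
print(df.head())
```

In the project itself, the same function could be called with the path of the downloaded CSV file in place of sample_csv.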
2. Data Preprocessing:
● Categorical Encoding:
Convert categorical features like Gender, Married, Education, and Loan_Status
into numerical format using techniques such as Label Encoding or One-Hot
Encoding.
● Feature Engineering:
Create new features that may help in prediction, such as combining SibSp and
Parch to form a new feature like FamilySize, if the dataset contains such columns.
● Scaling/Normalization:
Check the shape of the dataset using df.shape to understand the number of rows
and columns, and use df.info() to inspect column types and identify any missing
values or erroneous data.
● Descriptive Statistics:
Use df.describe() to get summary statistics (mean, median, standard deviation) for
the numerical columns, helping to understand the central tendency and distribution
of data.
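The inspection calls mentioned above can be tried on a small made-up DataFrame, as sketched here. The rows and the ApplicantIncome column name are assumptions for illustration; in the project, df would be the loaded loan dataset.

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing value in each of two columns
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", None, "Female"],
    "ApplicantIncome": [4583, 3000, 2583, 6000, 5417],
    "LoanAmount": [128.0, 66.0, np.nan, 141.0, 267.0],
})

print(df.shape)        # (number of rows, number of columns)
df.info()              # column types and non-null counts
missing = df.isnull().sum()
print(missing)         # missing values per column
print(df.describe())   # count, mean, std, min, quartiles, max for numeric columns
```

Note that df.describe() reports the median as the 50% row rather than under the name "median".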
4. Data Visualization:
Visualize the distribution of Loan_Status (approved vs. rejected) using a pie chart
to understand the class balance in the dataset.
Plot count plots for categorical features such as Gender, Marital Status, and
Education using seaborn.countplot() to analyze the distribution of these features
with respect to loan approval.
● Grouping Data:
Group the data by categorical variables (e.g., Gender, Married) and compute the
mean of LoanAmount to observe if certain groups tend to request larger loans.
● Correlation Analysis:
8. Feature Selection:
Remove columns that are irrelevant for predicting loan approval (e.g., ID, columns
with too many missing values, or columns with no meaningful contribution to the
model).
● Correlation Filtering:
Remove highly correlated features (e.g., features with correlation > 0.9) to reduce
multicollinearity, which could hinder the model's ability to learn effectively.
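One possible way to implement the correlation filter is sketched below. The helper name drop_highly_correlated, the AnnualIncome column and the sample values are all hypothetical; the 0.9 threshold follows the example in the text.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.9):
    """Drop one column from every pair whose absolute correlation exceeds threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair of columns is checked once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Made-up data: AnnualIncome is just 12 x ApplicantIncome, so it is redundant
sample = pd.DataFrame({
    "ApplicantIncome": [1000, 2000, 3000, 4000, 5000],
    "AnnualIncome": [12000, 24000, 36000, 48000, 60000],
    "LoanAmount": [120, 80, 150, 60, 130],
})

filtered, dropped = drop_highly_correlated(sample, threshold=0.9)
print("dropped:", dropped)
print(filtered.columns.tolist())
```

This sketch always keeps the first column of a correlated pair; which of the two to keep is a judgment call that depends on the dataset.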
9. Data Splitting:
Separate the dataset into input features (X) and target variable (Y, which is
Loan_Status).
Split the data into training and validation sets (e.g., 80% for training, 20% for
validation) using train_test_split() to ensure proper model validation.
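The split described above might look like the following sketch. The ten rows and the ApplicantIncome column name are made up for illustration; stratify=y is an optional extra that keeps the class proportions similar in both sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up, already-encoded data (1 = approved, 0 = rejected)
df = pd.DataFrame({
    "ApplicantIncome": [4583, 3000, 2583, 6000, 5417, 2333, 3036, 4006, 12841, 3200],
    "LoanAmount": [128, 66, 120, 141, 267, 95, 158, 168, 349, 70],
    "Loan_Status": [1, 0, 1, 1, 0, 1, 0, 1, 1, 0],
})

X = df.drop(columns=["Loan_Status"])  # input features
y = df["Loan_Status"]                 # target variable

# 80% training, 20% validation; random_state makes the split repeatable
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_val.shape)
```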
● Data Balancing:
● Feature Scaling:
2. Data Preprocessing:
# Fill missing values for LoanAmount and other numeric columns with median
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].median())
# Fill missing values for categorical columns with the most frequent value
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
df['Married'] = df['Married'].fillna(df['Married'].mode()[0])
Categorical Encoding:
from sklearn.preprocessing import LabelEncoder
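A minimal sketch of how LabelEncoder could be applied to the categorical columns is shown below. The three sample rows are made up; keeping one encoder per column in a dictionary is just one convenient way to be able to reverse the encoding later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up sample of the categorical columns
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Married": ["Yes", "No", "Yes"],
    "Loan_Status": ["Y", "N", "Y"],
})

encoders = {}
for col in ["Gender", "Married", "Loan_Status"]:
    encoders[col] = LabelEncoder()
    # Classes are numbered in sorted order, e.g. Female -> 0, Male -> 1
    df[col] = encoders[col].fit_transform(df[col])

print(df)
print(list(encoders["Loan_Status"].classes_))
```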
Feature Engineering:
# Creating a new feature FamilySize by combining SibSp and Parch
# (this assumes the DataFrame actually contains SibSp and Parch columns)
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
Scaling/Normalization:
from sklearn.preprocessing import StandardScaler
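A short sketch of how StandardScaler is typically used follows. The numbers are made up; the point is that the scaler is fitted on the training data only and then reused, unchanged, on the validation data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up features: [income, loan amount]
X_train = np.array([[1000.0, 100.0], [3000.0, 150.0], [5000.0, 200.0]])
X_val = np.array([[3000.0, 150.0]])

scaler = StandardScaler()
# Learn the mean and standard deviation from the training data only
X_train_scaled = scaler.fit_transform(X_train)
# Apply the same transformation to the validation data
X_val_scaled = scaler.transform(X_val)

print(X_train_scaled)
print(X_val_scaled)
```

Scaling matters for SVC because the algorithm works with distances, so a feature with large values (such as income) would otherwise dominate a feature with small values.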
import matplotlib.pyplot as plt
import seaborn as sb
# Count plots for Gender and Marital Status with Loan_Status as hue
plt.subplots(figsize=(15, 5))
for i, col in enumerate(['Gender', 'Married']):
    plt.subplot(1, 2, i+1)
    sb.countplot(data=df, x=col, hue='Loan_Status')
plt.tight_layout()
plt.show()
Grouping Data:
# Group by Gender and calculate mean loan amount
df.groupby('Gender').mean(numeric_only=True)['LoanAmount']
5. Outlier Detection:
Confusion Matrix:
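from sklearn.metrics import confusion_matrix
# Generate confusion matrix for validation set
# (model, X_val and y_val come from the training and splitting steps)
cm = confusion_matrix(y_val, model.predict(X_val))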
Classification Report:
from sklearn.metrics import classification_report
# Get the classification report
print(classification_report(y_val, model.predict(X_val)))
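The evaluation snippets above assume that a fitted model and a validation split already exist. A self-contained sketch of the whole train-and-evaluate flow is given below; it uses synthetic data from make_classification as a stand-in for the loan dataset, so the numbers it prints say nothing about the real project.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data standing in for the prepared loan features
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = SVC(kernel="rbf")
model.fit(X_train, y_train)
y_pred = model.predict(X_val)

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_val, y_pred, labels=[0, 1])
print(cm)
# Precision, recall and F1-score for each class
print(classification_report(y_val, y_pred))
```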
Prediction
input_data = ["your loan data"]
# convert text to feature vectors
# (feature_extraction is assumed to be the transformer fitted during training)
input_data_features = feature_extraction.transform(input_data)
# making prediction
prediction = model.predict(input_data_features)
print(prediction)
if prediction[0] == 1:
    print('eligible')
else:
    print('not eligible')
THE OUTPUT OF THE PROGRAM TELLS US WHETHER THE PERSON
WITH THE GIVEN LOAN DATA IS ELIGIBLE OR NOT
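For numeric loan data, the same prediction step could be sketched as follows, assuming the model is a scaler followed by an SVC. The feature list, the training rows and the helper name predict_eligibility are hypothetical and chosen only to keep the example short.

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

FEATURES = ["ApplicantIncome", "LoanAmount"]  # assumed feature columns

# Made-up training data (1 = approved, 0 = rejected)
train = pd.DataFrame({
    "ApplicantIncome": [5000, 6000, 5500, 7000, 1500, 1200, 1000, 1800],
    "LoanAmount": [100, 120, 90, 150, 200, 180, 220, 250],
    "Loan_Status": [1, 1, 1, 1, 0, 0, 0, 0],
})

# The pipeline scales the inputs and then applies the classifier
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(train[FEATURES], train["Loan_Status"])

def predict_eligibility(model, applicant):
    """Return 'eligible' or 'not eligible' for one applicant given as a dict."""
    row = pd.DataFrame([applicant], columns=FEATURES)
    return "eligible" if model.predict(row)[0] == 1 else "not eligible"

result = predict_eligibility(model, {"ApplicantIncome": 6500, "LoanAmount": 110})
print(result)
```

Wrapping the scaler and the classifier in one pipeline means a new applicant's values are scaled in exactly the same way as the training data.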
Visualization
Pie chart for Loan Status column:
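A possible way to draw this pie chart with matplotlib is sketched below. The helper name plot_loan_status_pie and the four sample rows are assumptions; in the project, df would be the loaded loan dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt

def plot_loan_status_pie(df):
    """Draw a pie chart of the Loan_Status class distribution and return the axes."""
    counts = df["Loan_Status"].value_counts()
    fig, ax = plt.subplots(figsize=(5, 5))
    # autopct prints the percentage share on each slice
    ax.pie(counts.values, labels=counts.index, autopct="%1.1f%%")
    ax.set_title("Loan_Status distribution")
    return ax

# Made-up sample: three approved (Y) and one rejected (N)
df = pd.DataFrame({"Loan_Status": ["Y", "Y", "N", "Y"]})
ax = plot_loan_status_pie(df)
# Call plt.show() here to display the chart in a notebook or window
```

A strongly uneven pie chart would indicate class imbalance, which is worth knowing before judging the model by accuracy alone.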