
PFDA (Programming For Data Analysis) APU

The document outlines a project focused on developing a credit score classification system using R programming, aiming to enhance financial decision-making by predicting customer credit scores. It includes sections on research questions, data analysis types, and the use of various R packages for data manipulation and visualization. The project emphasizes the importance of effective classification algorithms and data analytics techniques to assess credit risk in the banking sector.


Table of Contents

Introduction
    Aim
    Scope
    Objectives
Research Questions
Additional Research Questions
R programming
    Introduction To R programming
    Packages in R
Data Analysis and Its Types
    Descriptive Analysis
    Diagnostic Analysis
    Predictive Analysis
    Prescriptive Analysis
Exploratory Data Analysis
    Loading the data
    Exploring the data
    Data Cleaning
    Data Manipulation
    Data Transformation
    Data Visualization
Questions and Analysis
    1. Does the number of bank accounts a person holds influence how frequently they delay payments?
    2. Does the frequency of credit card usage significantly differ between individuals with low and high credit scores?
    3. What is the distribution of credit cards among the customers aged between 20 and 30?
    4. Is there a significant difference in annual income between customers with standard credit scores and poor credit scores?
    5. Is there a significant difference in Credit Utilization Ratio among Occupation categories?
    6. What is the correlation between Credit Utilization Ratio and Outstanding Debt?
    7. Does the interest rate significantly vary based on the number of credit cards?
    8. Is there an association between the "Num_Credit_Inquiries" and the "Credit_Score" category? (Chi-square test)
    9. Is there a correlation between the Monthly Balance and Credit Utilization Ratio of bank customers?
    10. How is Monthly In-Hand Salary distributed across customers?
    11. Do individuals with a greater number of loans generally have a greater credit utilization ratio?
    12. What is the percentage breakdown of customers within each Credit Score category?
    13. Does the number of bank accounts differ significantly across different occupation types? (Kruskal-Wallis test)
    14. Do customers with a higher number of delayed payments have significantly different annual incomes compared to others?
    15. Is there a significant correlation between the age of individuals and their total monthly EMI payments?
Conclusion
References
Appendix
    Workload Matrix

List of Figures

Figure 1: Exploratory Data Analysis
Figure 2: Loading necessary packages
Figure 3: Reading the dataset
Figure 4: Exploring the data
Figure 5: Output of head(Credit_score_data_clean)
Figure 6: Display the Structure of the data
Figure 7: Generate a summary of the dataset
Figure 8: Data Cleaning
Figure 9: After Data Clean
Figure 10: Convert Annual_Income and Outstanding_Debt to numeric
Figure 11: Find Customers with high Outstanding_Debt
Figure 12: Displaying summary of High_Debt_Customers
Figure 13: Transformation of Data to Long Format for Loan Details
Figure 14: Calculating Total Delayed Payment Impact
Figure 15: Data Visualizations Example
Figure 16: Output of data Visualization
Figure 17: Q.1.1 Statistical Analysis Source Code
Figure 18: Q.1.2 Visualization Source Code
Figure 19: Q.1.1 Statistical Analysis Output
Figure 20: Q.1.2 Visualization Output
Figure 21: Q.2.1 Statistical Analysis Code
Figure 22: Q.2.2. Visualization Code
Figure 23: Q.2.1. Statistical Analysis Output
Figure 24: Q.2.2. Visualization Output
Figure 25: Q.3.1. Statistical Analysis Code
Figure 26: Q.3.2. Visualization Code
Figure 27: 2.1. Statistical Analysis Output
Figure 28: Q.3.2. Visualization Output
Figure 29: Q.4.1 Statistical Analysis Code
Figure 30: Q.4.2. Visualization Code
Figure 31: Q.4.1 Statistical Analysis Output
Figure 32: Q.4.2. Visualization Output
Figure 33: Q.5.1. Statistical Analysis Code
Figure 34: Q.5.2. Visualization Source Code
Figure 35: Q.5.1. Statistical Analysis Output
Figure 36: Q.5.2. Visualization Output
Figure 37: Q.6.2. Visualization Code
Figure 38: Q.6.2. Visualization Output
Figure 39: Q.7.1. Statistical Analysis Code
Figure 40: Q.7.2. Visualization Code
Figure 41: Q.7.1. Statistical Analysis Output
Figure 42: Q.7.2. Visualization Output
Figure 43: Q.8.1. Statistical Analysis Code
Figure 44: Q.8.2. Visualization Source Code
Figure 45: Q.8.1. Statistical Analysis Output
Figure 46: Q.8.2. Visualization Output
Figure 47: Q.9.1. Statistical Analysis Code
Figure 48: Q.9.1. Statistical Analysis Output
Figure 49: Q.9.2. Visualization Source Code
Figure 50: Q.9.2. Visualization Output
Figure 51: Q.10.1. Statistical Analysis Source Code
Figure 52: Q.10.1. Statistical Analysis Output
Figure 53: Q.10.2. Visualization Output
Figure 54: Q.10.2. Visualization Source Output
Figure 55: Q.11.1 Statistical Analysis Source Code
Figure 56: Q.11.1. Statistical Analysis Output
Figure 57: Q.11.2. Visualization Source Code
Figure 58: Q.11.2. Visualization Output
Figure 59: Q.12.1. Statistical Analysis Source Code
Figure 60: Q.12.1. Statistical Analysis Output
Figure 61: Q.12.2 Visualization Source Code
Figure 62: Q.12.2. Visualization Output
Figure 63: Q.13.1. Statistical Analysis Code
Figure 64: Q.13.2. Visualization Source Code
Figure 65: Q.13.1. Statistical Analysis Output
Figure 66: Q.13.2 Visualization Output
Figure 67: Q.14.1. Statistical Analysis Code
Figure 68: Q.14.2. Visualization Code
Figure 69: Q.14.1. Statistical Analysis Output
Figure 70: Q.14.2. Visualization Output
Figure 71: Q.15.1. Statistical Analysis Code
Figure 72: Q.15.2. Visualization Code
Figure 73: Q.15.1. Statistical Analysis Output
Figure 74: Q.2.2. Visualization Output

Introduction

This project covers the analysis and development of a credit score classification system
using the R programming language. Credit scores are important in financial decision-making
because they allow banks and other financial organizations to assess a customer's
creditworthiness. The classification system will be developed using various data analytics
techniques, such as data exploration. The project aims to predict customer credit scores using
past financial data. The findings will offer important perspectives for evaluating risks and
making decisions in the banking sector (Pebesma & Graler, 2025).

Aim

The main aim of this assignment is to create an effective, dependable and precise Credit
Score Classification model using R programming. This model will classify customers
according to their reliability, assisting banks and other institutions in making sound decisions
while reducing the risk of default. This assignment will explore the effectiveness of different
classification algorithms to identify the best model for assessing credit risk. Furthermore, it aims
to highlight the important financial metrics that impact credit scores, ensuring that the model is
clear and understandable.

Scope

This assignment focuses on applying data analytics methods to classify customers
according to their creditworthiness. It involves data collection, cleaning, exploratory data
analysis, and model development using algorithms such as decision trees, random forest and
logistic regression. The effectiveness of the model will be evaluated with metrics such as
accuracy, recall and precision. Furthermore, data visualization methods will be used to convey
insights efficiently. The research tackles issues such as imbalanced datasets and overfitting,
with the goal of a robust and interpretable classification model that can assist financial
institutions in better credit risk assessment.

Objectives

• To explore and analyse a set of data and reconstruct it into meaningful representations for
decision making.

• To identify the factors that differentiate the credit scores of customers and provide useful
recommendations to stakeholders.

• To apply different data analysis techniques and present the findings through statistical charts
and graphs, ensuring clear and valuable insights.

Research Questions

The main purpose of the analysis is to examine the given dataset and answer the following
research questions. The research questions are listed below, and each is addressed by its own
analysis later in the report.

1. Does the number of bank accounts a person holds influence how frequently they
delay payments?
2. Does the frequency of credit card usage significantly differ between individuals with
low and high credit scores?
3. What is the distribution of credit cards among the customers aged between 20 and 30?
4. Is there a significant difference in annual income between customers with standard
credit scores and poor credit scores?
5. Is there a significant difference in Credit Utilization Ratio among Occupation
categories?
6. What is the correlation between Credit Utilization Ratio and Outstanding Debt?
7. Does the interest rate significantly vary based on the number of credit cards?
8. Is there an association between the "Num_Credit_Inquiries" and the "Credit_Score"
category?
9. Is there a correlation between the Monthly Balance and Credit Utilization Ratio of
bank customers?
10. How is Monthly In-Hand Salary distributed across customers?

Additional Research Questions

11. Do individuals with a greater number of loans generally have a greater credit
utilization ratio?
12. What is the percentage breakdown of customers within each Credit Score category?
13. Does the number of bank accounts differ significantly across different occupation
types? (Kruskal-Wallis test)
14. Do customers with a higher number of delayed payments have significantly different
annual incomes compared to others? (Mann-Whitney U test)
15. Is there a significant correlation between the age of individuals and their total
monthly EMI payments?

R programming

Introduction To R programming

R is an interpreted programming language that is frequently used for data analysis,
statistical computing, and graphical presentation (r-programming-language-introduction, 2024).
R's data cleansing, importing, and visualization features make it particularly relevant for data
science practitioners. It can be used for reduction, classification, and grouping as well as for
visuals such as histograms, scatterplots and boxplots (Coursera Staff, 2025).

Packages in R

R packages consist of R functions, compiled code, and sample data. Within the R
environment, they reside in a directory named "library". A set of default packages is installed
automatically together with R, and only these are available when the R console is first
opened. Other packages are installed later, as and when they are required for a specific
purpose. Some of the packages used in this project are described below:

• dplyr
Hadley Wickham built dplyr, a popular R package that transforms the way data
manipulation tasks are conducted. The package offers a strong and user-friendly syntax for
data manipulation, making it easier for R users to explore, transform, and summarize datasets
(Shah, 2023).

• tidyr
This package simplifies data management by altering and reshaping data. The tidyr
package is a key component of the tidyverse, focusing on restructuring data into a tidy
layout that is convenient for analysis and plotting. To use this package, first install it in
the R environment and then load it at the top of the script.

• readr

The primary goal of readr is to provide a fast and simple way to load tabular data into R.
In this project the data was read with the read.csv() function, which is part of base R; the
readr equivalent is read_csv().

• ggplot2

ggplot2 is a popular data visualization library in R. It is based on the "Grammar of
Graphics", which offers a formal and systematic technique for constructing and analyzing
data visualizations. ggplot2 enables users to create a wide range of high-quality and
configurable statistical charts, making it an effective tool for data exploration and
reporting (Devashree, 2024).

• stats

This package, installed with base R, contains a broad range of regularly used statistical
techniques, including chi-square tests, t-tests, correlation, ANOVA and linear regression
models (R Core Team and Contributors Worldwide).

• car
The package name, "car," is an abbreviation of Companion to Applied Regression. The
package is not itself used to fit regression models; rather, it acts as a companion by
providing a collection of functions for performing tests, producing visualizations, and
transforming data (Baiju et al., 2020).
• Scale

An R package that helps to prettify ggplot scales, which can be rather irritating and
time-consuming to adjust. It is particularly helpful for generating log-scale ticks (su, 2019).

• effsize

This package offers functions for calculating standardized effect sizes in experiments
(Cohen's d, Hedges' g). The calculation methods have been optimized to enable fast
computation even for very large data sets.

• vcd (Visualizing Categorical Data)

The vcd package in R has facilities for the analysis and visualization of categorical data.
It provides functions for creating mosaic plots, association plots, and other graphical methods
to study the associations between categorical variables.

Data Analysis and Its Types

Data analysis is defined as the process of turning raw data into useful insights for defensible
decision-making. The analysis of our assignment dataset involves several steps.

Descriptive Analysis

The most basic kind of analytics is descriptive analytics, which serves as the basis for all
other forms. It enables analysts to extract patterns from raw data and provide a concise
explanation of what has occurred or is occurring (Cote, 2021).

Diagnostic Analysis

Diagnostic analytics is the technique of using data to identify the reasons behind
correlations and patterns between variables (Cote, 2021). It can be performed with
statistical software (such as Microsoft Excel) or by hand.

Predictive Analysis

Predictive analytics is a subfield of advanced analytics that uses statistical models,
deep learning algorithms, and historical data to predict future events. It is essential for
identifying trends in data to forecast future occurrences, allowing companies to make
data-driven choices that improve operational plans, reduce risks, and optimize profits
(Prometheus Group, 2024).

Prescriptive Analysis

The fourth pillar of contemporary analytics is prescriptive analytics. It is a type of
directed analytics in which data directs or prescribes a course of action. The main goal of
prescriptive analytics is to simplify the decision-making process by eliminating the need for
educated guesses or assessments in data analytics.

Exploratory Data Analysis

The goal of exploratory data analysis is to gain a thorough understanding of the data and
discover its various features, frequently through visual aids. This makes it easier to interpret
the data and spot useful trends. It is critical to discover data trends and determine which
variables are essential to the final output and which are not. Additionally, there might be
relationships between some variables and others (Biswal, 2024).

Figure 1

Exploratory Data Analysis

Loading the data

Importing, moving, or combining data from several sources into a single storage system
is referred to as data loading. This procedure guarantees accurate and secure data transfer
between systems (Sean, 2024). In this project, the necessary libraries are first loaded at the
top of the script using library(), and the dataset is then read with a function such as read.csv().

Figure 2

Loading necessary packages

Figure 3

Reading the dataset
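
The loading step described above can be sketched roughly as follows. This is a minimal illustration, not the code shown in Figures 2 and 3: the file, its column names other than Annual_Income and Credit_Score, and all values are made up, and a temporary CSV is written first so that the snippet runs without the real dataset.

```r
# Packages would normally be loaded first (assumes they are installed), e.g.:
# library(dplyr); library(tidyr); library(ggplot2)

# Hypothetical stand-in for the real dataset file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Customer_ID,Age,Annual_Income,Num_Bank_Accounts,Credit_Score",
  "C001,23,19114.12,3,Good",
  "C002,28,34847.84,2,Standard",
  "C003,34,143162.64,1,Poor"
), csv_path)

# read.csv() is base R; stringsAsFactors = FALSE keeps text columns as character.
credit_data <- read.csv(csv_path, stringsAsFactors = FALSE)
print(dim(credit_data))
```

With the real project, csv_path would simply point to the dataset file in the working directory.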

Exploring the data

The first stage of data analysis is data exploration, during which one delves into a dataset
to gain an understanding of its contents (What is Exploratory Data Analysis, 2025). It relies on
built-in functions such as str(), head(), and summary() to reveal structure, anomalies, and
other features that might merit additional investigation.

Figure 4

Exploring the data



Figure 5

Output of head(Credit_score_data_clean)

Figure 6

Display the Structure of the data

Figure 7

Generate a summary of the dataset
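
As a rough sketch of the exploration functions mentioned in this section, the snippet below applies head(), str() and summary() to a small made-up data frame. The data frame name and its values are assumptions for illustration only.

```r
# Small stand-in data frame (values are made up for illustration).
credit_data <- data.frame(
  Age = c(23, 28, 34, 41, 25),
  Annual_Income = c(19114.12, 34847.84, 143162.64, 30689.89, NA),
  Credit_Score = c("Good", "Standard", "Poor", "Standard", "Good")
)

print(head(credit_data, 3))   # first three rows
str(credit_data)              # column types and a preview of the values
print(summary(credit_data))   # per-column summaries, including NA counts

# Count missing values per column, a common follow-up check.
missing_per_column <- colSums(is.na(credit_data))
print(missing_per_column)
```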



Data Cleaning

Data cleaning, also referred to as data scrubbing, is the process of finding and fixing (or
eliminating) mistakes and inaccuracies within a dataset. Accurate, consistent, and
dependable data is necessary for efficient analysis and decision-making, which makes this a
critical step in the data management process (Data Cleaning, 2024).

Figure 8

Data Cleaning

Figure 9

After Data Clean
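
A cleaning pass of the kind described here might look like the sketch below. The specific problems it handles (duplicate rows, stray non-numeric characters, missing values) are assumptions about typical issues, not a record of what Figure 8 does, and the data is invented.

```r
# Made-up raw data with typical problems: stray characters, NAs, duplicates.
raw <- data.frame(
  Customer_ID = c("C001", "C002", "C002", "C003", "C004"),
  Annual_Income = c("19114.12", "34847.84_", "34847.84_", NA, "30689.89"),
  stringsAsFactors = FALSE
)

clean <- raw[!duplicated(raw), ]                                  # drop exact duplicate rows
clean$Annual_Income <- gsub("[^0-9.]", "", clean$Annual_Income)   # strip non-numeric characters
clean$Annual_Income <- as.numeric(clean$Annual_Income)            # convert text to numbers
clean <- clean[!is.na(clean$Annual_Income), ]                     # remove rows with missing income
print(clean)
```

Whether rows with missing values should be dropped or imputed depends on the analysis; dropping is used here only because it is the simplest option to show.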



Data Manipulation

Data manipulation is the process of organizing or arranging data to make it easier to
interpret. In database applications this is usually done with a data manipulation language
(DML), a kind of programming language that lets you change and rearrange the data stored in
a database. In this project, data manipulation is used, for instance, to find customers with
high outstanding debt.

Figure 10

Convert Annual_Income and Outstanding_Debt to numeric

Figure 11

Find Customers with high Outstanding_Debt

Figure 12

Displaying summary of High_Debt_Customers
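
The manipulation steps named in Figures 10 to 12 can be approximated as follows. This is a sketch under stated assumptions: the data is invented and the cut-off of 1000 for "high" debt is hypothetical, since the threshold used in the project is not given in the text.

```r
# Made-up data in which the numeric columns were read as text.
credit_data <- data.frame(
  Customer_ID = c("C001", "C002", "C003", "C004"),
  Annual_Income = c("19114.12", "34847.84", "143162.64", "30689.89"),
  Outstanding_Debt = c("809.98", "605.03", "3571.70", "1303.01"),
  stringsAsFactors = FALSE
)

# Convert the text columns to numeric before comparing values.
credit_data$Annual_Income <- as.numeric(credit_data$Annual_Income)
credit_data$Outstanding_Debt <- as.numeric(credit_data$Outstanding_Debt)

# Hypothetical cut-off: treat debt above 1000 as "high".
debt_threshold <- 1000
High_Debt_Customers <- subset(credit_data, Outstanding_Debt > debt_threshold)

print(summary(High_Debt_Customers$Outstanding_Debt))
print(High_Debt_Customers$Customer_ID)
```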



Data Transformation

Data transformation is an important phase of data integration since it converts
unstructured data into structured data. Data transformation increases data usability and
quality as well as conformity with the target system (Hayes, 2024).

Figure 13

Transformation of Data to Long Format for Loan Details
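
A wide-to-long reshaping of loan details could be sketched as below. The column names Loan_1, Loan_2, Loan_Slot and Loan_Type are hypothetical, and base R's reshape() is used here so the snippet has no dependencies; tidyr's pivot_longer() is the tidyverse counterpart and may be closer to what the figure shows.

```r
# Made-up wide data: one column per loan slot.
wide <- data.frame(
  Customer_ID = c("C001", "C002"),
  Loan_1 = c("Auto Loan", "Personal Loan"),
  Loan_2 = c("Home Equity Loan", NA),
  stringsAsFactors = FALSE
)

# Stack the loan columns so that each row holds one customer-loan pair.
long <- reshape(
  wide,
  direction = "long",
  varying = c("Loan_1", "Loan_2"),
  v.names = "Loan_Type",
  timevar = "Loan_Slot",
  times = 1:2,
  idvar = "Customer_ID"
)

long <- long[!is.na(long$Loan_Type), ]   # drop empty loan slots
rownames(long) <- NULL
print(long)
```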

Figure 14

Calculating Total Delayed Payment Impact

Data Visualization

Data visualization is the practice of displaying data in visual form, such as charts, graphs,
and maps, which makes the data easier to understand (Coursera Staff, 2025). Common chart
types include scatter plots, pie charts, boxplots, and histograms.

Figure 15

Data Visualizations Example



Figure 16

Output of data Visualization

Questions and Analysis

1. Does the number of bank accounts a person holds influence how frequently they
delay payments?

Figure 17

Q.1.1 Statistical Analysis Source Code

We used linear regression to ascertain whether there is a linear relationship between the
number of bank accounts and the number of late payments, that is, whether the number of
late payments can be predicted from the number of bank accounts. The lm() function takes a
formula of the form dependent ~ independent, so late payments is the dependent variable and
bank accounts is the independent variable. From the summary() of the fitted model, we obtain
the p-value, R-squared, slope, and intercept.
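
The regression step can be illustrated with the sketch below. The column names and the simulated data are assumptions, so the numbers it prints will not match the report; it only shows how the four quantities are pulled out of a fitted lm() model.

```r
set.seed(42)  # reproducible made-up data
n <- 200
df <- data.frame(
  Num_Bank_Accounts = sample(0:10, n, replace = TRUE),
  Num_Delayed_Payments = rpois(n, lambda = 12)   # generated independently of accounts
)

# The formula is dependent ~ independent.
model <- lm(Num_Delayed_Payments ~ Num_Bank_Accounts, data = df)
fit <- summary(model)

intercept <- coef(model)[["(Intercept)"]]
slope     <- coef(model)[["Num_Bank_Accounts"]]
r_squared <- fit$r.squared
p_value   <- fit$coefficients["Num_Bank_Accounts", "Pr(>|t|)"]

cat("intercept:", intercept, "slope:", slope,
    "R-squared:", r_squared, "p-value:", p_value, "\n")
```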

Figure 18

Q.1.2 Visualization Source Code

We used ggplot2 to generate two plots: a scatter plot with a regression line and a boxplot.
The geom_point() function draws the scatter plot, in which each point represents a single
observation. The geom_smooth() function adds a regression line to show the trend or
relationship between the two variables; labs() supplies a descriptive title and axis labels;
and theme_minimal() sets the plot theme. Furthermore, geom_boxplot() creates a boxplot for
each group defined by the number of bank accounts, with whiskers that by default extend to
the most extreme values within 1.5 times the interquartile range, and stat_summary() adds a
blue dot representing the mean number of late payments for each group.
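
A sketch of the two plots described above is given here. It assumes ggplot2 is installed; the data, column names, colours and label text are illustrative assumptions rather than the code in Figure 18.

```r
library(ggplot2)

set.seed(1)  # made-up data for illustration
df <- data.frame(
  Num_Bank_Accounts = sample(1:5, 150, replace = TRUE),
  Num_Delayed_Payments = rpois(150, lambda = 12)
)

# Scatter plot with a fitted linear regression line.
scatter_plot <- ggplot(df, aes(x = Num_Bank_Accounts, y = Num_Delayed_Payments)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x, colour = "red") +
  labs(title = "Delayed payments vs. bank accounts",
       x = "Number of bank accounts", y = "Number of delayed payments") +
  theme_minimal()

# Boxplot per number of bank accounts, with the group mean as a blue dot.
box_plot <- ggplot(df, aes(x = factor(Num_Bank_Accounts), y = Num_Delayed_Payments)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", colour = "blue", size = 3) +
  labs(title = "Delayed payments by number of bank accounts",
       x = "Number of bank accounts", y = "Number of delayed payments") +
  theme_minimal()

# print(scatter_plot); print(box_plot)   # render when run interactively
```

Converting the count to a factor on the x-axis is what makes geom_boxplot() draw one box per group instead of a single box.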

Figure 19

Q.1.1 Statistical Analysis Output



The regression equation is Delayed Payments = 23.2860 - 0.4027 * Bank Accounts, which
taken at face value implies that each additional bank account slightly reduces delayed
payments. However, the p-value is 0.949, which is greater than 0.05, so the relationship is
not statistically significant, and the R-squared value of 8.292 x 10^-6 means the model
explains almost none of the variation. In short, the number of bank accounts does not
meaningfully predict delayed payments; other characteristics have a greater influence on
delayed payments.
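
To make the reported numbers concrete, the short sketch below plugs the coefficients quoted in this section into the regression equation and converts R-squared into a correlation. The function name predict_delayed is a hypothetical helper introduced only for this illustration.

```r
# Coefficients as quoted in the text of this section.
intercept <- 23.2860
slope     <- -0.4027
r_squared <- 8.292e-6

# Predicted delayed payments for a given number of bank accounts.
predict_delayed <- function(bank_accounts) intercept + slope * bank_accounts
print(predict_delayed(c(0, 5, 10)))

# In simple linear regression, R-squared is the squared Pearson correlation,
# so its square root (carrying the slope's sign) gives the correlation.
r <- sign(slope) * sqrt(r_squared)
print(r)
```

A correlation this close to zero is consistent with the conclusion that the fitted slope cannot be distinguished from no relationship at all.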

Figure 20

Q.1.2 Visualization output

The plots show no strong association between late payments and the number of bank
accounts. Most points lie around zero, with some extreme outliers. The regression line is
nearly horizontal, indicating no effect. This is further supported by the boxplot, where little
difference exists among groups. Overall, the number of bank accounts has no noticeable impact
on late payments, although outliers may affect the findings.

2. Does the frequency of credit card usage significantly differ between individuals with
low and high credit scores?

Figure 21

Q.2.1 Statistical Analysis Code

Here we preprocess the dataset by removing rows with missing values and separating credit
scores into two categories: low and high. Then, using chained (piped) operations, we filter out
extreme outliers. Levene's test is used to check the homogeneity of variances, followed by
Welch's t-test to compare the number of credit cards between the low and high credit score
categories. Finally, Cohen's d is computed to measure the magnitude of the difference, and a
boxplot depicts the distribution for each credit score group.
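
The testing sequence described above can be sketched in base R as follows. The data are synthetic and the names `Score_Group` and `Num_Credit_Card` are assumptions. Levene's test is computed here by hand in its median-centred form (a one-way ANOVA on absolute deviations from each group's median), which is an assumption about the variant used; the project may instead have called a packaged implementation.

```r
# Synthetic stand-in for the credit dataset (names are assumptions).
set.seed(7)
cards <- data.frame(
  Score_Group = factor(rep(c("Low", "High"), each = 150), levels = c("Low", "High")),
  Num_Credit_Card = c(rpois(150, 7), rpois(150, 5))
)
cards$Num_Credit_Card[c(3, 40)] <- NA   # a few missing values

# Step 1: drop rows with missing values.
cards <- cards[complete.cases(cards), ]

# Step 2: Levene-type test, i.e. ANOVA on absolute deviations from the group median.
group_median <- ave(cards$Num_Credit_Card, cards$Score_Group, FUN = median)
cards$Abs_Dev <- abs(cards$Num_Credit_Card - group_median)
levene <- anova(lm(Abs_Dev ~ Score_Group, data = cards))
levene_p <- levene[["Pr(>F)"]][1]

# Step 3: Welch's t-test, which does not assume equal variances.
welch <- t.test(Num_Credit_Card ~ Score_Group, data = cards, var.equal = FALSE)

# Step 4: Cohen's d using the pooled standard deviation (Low minus High).
low  <- cards$Num_Credit_Card[cards$Score_Group == "Low"]
high <- cards$Num_Credit_Card[cards$Score_Group == "High"]
pooled_sd <- sqrt(((length(low) - 1) * var(low) + (length(high) - 1) * var(high)) /
                    (length(low) + length(high) - 2))
cohens_d <- (mean(low) - mean(high)) / pooled_sd

cat("Levene p-value:", signif(levene_p, 3), "\n")
cat("Welch p-value: ", signif(welch$p.value, 3), "\n")
cat("Cohen's d:     ", round(cohens_d, 3), "\n")
```

The sign of Cohen's d depends only on the order of subtraction, so a negative value simply means the groups were compared in the opposite order.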

Figure 22

Q.2.2. Visualization Code



Here we used ggplot2 to generate a boxplot of the number of credit cards for the low and
high credit score categories. The geom_boxplot() function draws the plot, with the x-axis
representing the credit score category and the y-axis representing the number of credit cards.

Figure 23

Q.2.1. Statistical Analysis Output

According to the findings, people with low credit scores have more credit cards than
people with high credit scores. Levene's test suggests unequal variances between the groups, so
Welch's t-test is appropriate. The t-test reveals a statistically significant difference in the
average number of credit cards: the low credit score group held 6.64 cards on average, while the
high credit score group held 5.10. The Cohen's d value of -0.745 indicates a medium effect size,
which means the difference is practically meaningful but not very large. Thus, people with lower
credit scores tend to have more credit cards, and the difference matters in practice.

Figure 24

Q.2.2. Visualization Output

The boxplot shows that individuals with low credit scores have a higher median number of
credit cards and a wider spread than individuals with high credit scores. The high group has a
lower median and a tighter, more consistent distribution, which supports the t-test results.

3. What is the distribution of credit cards among customers aged between 20 and 30?

Figure 25

Q.3.1. Statistical Analysis Code

Here, we use a t-test to compare the number of credit cards between two age categories:
25 and under (<=25) and over 25 (>25). The code first calculates the mean, median, standard
deviation, minimum, and maximum number of credit cards. It then calculates the correlation
between age and the number of credit cards, accounting for missing values. Finally, it segments
individuals into the two age groups and conducts an independent samples t-test to compare the
mean number of credit cards held by each group. The code prints the descriptive statistics,
the correlation coefficient, and the t-test results.
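
The three steps above (descriptive statistics, correlation, and a two-group t-test) might look like this in base R. It is a sketch on synthetic data; the data frame `customers` and the columns `Age` and `Num_Credit_Card` are assumed names.

```r
# Synthetic stand-in for customers aged 20 to 30 (names are assumptions).
set.seed(3)
customers <- data.frame(
  Age = sample(20:30, 250, replace = TRUE),
  Num_Credit_Card = rpois(250, 5)
)
customers$Num_Credit_Card[c(5, 17)] <- NA   # a few missing values

# Step 1: descriptive statistics, ignoring missing values.
card_stats <- c(
  mean   = mean(customers$Num_Credit_Card, na.rm = TRUE),
  median = median(customers$Num_Credit_Card, na.rm = TRUE),
  sd     = sd(customers$Num_Credit_Card, na.rm = TRUE),
  min    = min(customers$Num_Credit_Card, na.rm = TRUE),
  max    = max(customers$Num_Credit_Card, na.rm = TRUE)
)

# Step 2: correlation between age and card count on complete pairs only.
age_card_cor <- cor(customers$Age, customers$Num_Credit_Card, use = "complete.obs")

# Step 3: split into two age groups and run an independent samples t-test.
customers$Age_Group <- ifelse(customers$Age <= 25, "<=25", ">25")
age_test <- t.test(Num_Credit_Card ~ Age_Group, data = customers)

print(round(card_stats, 3))
cat("Correlation:", round(age_card_cor, 4), "\n")
print(age_test)
```

By default `t.test()` runs the Welch version; passing `var.equal = TRUE` would give the classical pooled-variance test instead.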

Figure 26

Q.3.2. Visualization Code



We filter customers aged 20 to 30 whose total number of credit cards is between 8 and
1300. The filtered data are then grouped by age, and the number of individuals at each age is
counted. A bar chart of these counts visualizes the age distribution. Finally, the cleaned
dataset is saved to a CSV file, and the summary statistics, column names, and structure of the
cleaned data are printed to the console.

Figure 27

Q.3.1. Statistical Analysis Output



The output above summarizes the dataset. Customer age ranges from 14 to an implausible
8698, and monthly salary ranges from 303.6 to 15204.6. Many other banking, credit card, loan,
and payment-related variables are also summarized.

Figure 28

Q.3.2. Visualization Output

The bar chart above shows the number of customers at each age from 20 to 30. The
distribution is fairly uniform across these ages, with counts ranging from roughly 340 to 400
people per age. The chart therefore resembles a uniform rather than a normal distribution.

4. Is there a significant difference in annual income between customers with standard
credit scores and poor credit scores?

Figure 29

Q.4.1 Statistical Analysis Code

We perform an independent samples t-test to compare annual income between two groups:
customers with 'Standard' credit scores and customers with 'Poor' credit scores. The data are
subset into two vectors, good_credit and bad_credit, holding the annual incomes of each group.
The means of the two vectors are compared using the t.test() function, which determines whether
the difference in income between the two credit score groups is statistically significant.
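
A minimal sketch of this two-vector comparison is shown below. The vector names `good_credit` and `bad_credit` follow the description above; the data frame `credit`, its column names, and the synthetic incomes are assumptions for illustration.

```r
# Synthetic stand-in for the credit dataset (column names are assumptions).
set.seed(11)
credit <- data.frame(
  Credit_Score = sample(c("Standard", "Poor", "Good"), 300, replace = TRUE),
  Annual_Income = rlnorm(300, meanlog = 10.5, sdlog = 0.6)
)

# Subset annual income into one vector per credit score group.
good_credit <- credit$Annual_Income[credit$Credit_Score == "Standard"]
bad_credit  <- credit$Annual_Income[credit$Credit_Score == "Poor"]

# Independent samples t-test on the two vectors (Welch by default).
income_test <- t.test(good_credit, bad_credit)
print(income_test)

# Descriptive summaries of each group, useful alongside the test result.
print(summary(good_credit))
print(summary(bad_credit))
```

Because income is typically right-skewed, it can be worth comparing the medians from `summary()` with the means that the t-test uses.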

Figure 30

Q.4.2. Visualization Code



We run a t-test to compare mean annual incomes across the "Standard" and "Poor" credit
groups. A density plot then shows the annual income distribution for each credit score
category (Standard and Poor).

Figure 31

Q.4.1 Statistical Analysis Output

The output shows that for customers in the "Standard" credit group (stored in
good_credit), annual incomes range from 7006 to 24188807, with a median of 37159 and a mean of
187762. For those with "Poor" credit, incomes range from 7006 to 23912939, with a median of
32195 and a mean of 154497. The "Standard" credit group thus has a higher average income.

Figure 32

Q.4.2. Visualization Output

The density plot above shows the distribution of annual income for the “Poor” and
“Standard” credit score groups. Customers with standard credit scores tend to have higher
annual incomes than those with poor credit scores. The two distributions overlap in places,
but most of the distribution indicates that income is positively associated with credit score.
The plot therefore suggests that poor credit is more common at lower income levels, while
standard credit is more common at higher income levels.

5. Is there a significant difference in Credit Utilization Ratio among Occupation
categories?

Figure 33

Q.5.1. Statistical Analysis Code

The ANOVA test is conducted to check if there are any statistically significant differences
in the mean credit utilization ratio across the various occupations.
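
A one-way ANOVA of this kind can be sketched with base R's `aov()`. The data below are synthetic, and the column names `Occupation` and `Credit_Utilization_Ratio` are assumptions.

```r
# Synthetic stand-in for the credit dataset (column names are assumptions).
set.seed(5)
credit <- data.frame(
  Occupation = factor(sample(c("Artist", "Doctor", "Engineer", "Teacher"),
                             400, replace = TRUE)),
  Credit_Utilization_Ratio = rnorm(400, mean = 32, sd = 5)
)

# One-way ANOVA: does mean credit utilization differ by occupation?
anova_fit <- aov(Credit_Utilization_Ratio ~ Occupation, data = credit)
anova_table <- summary(anova_fit)[[1]]
anova_p <- anova_table[["Pr(>F)"]][1]

print(anova_table)
cat("ANOVA p-value:", signif(anova_p, 3), "\n")

# Group means and standard deviations to accompany the test.
print(aggregate(Credit_Utilization_Ratio ~ Occupation, data = credit,
                FUN = function(x) c(mean = mean(x), sd = sd(x))))
```

A p-value above 0.05 here would mean the differences between the occupation means are not statistically significant.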

Figure 34

Q.5.2. Visualization Source Code

This code implements a box plot demonstrating the relationship between occupation and
credit utilization ratio. The first step is to create a data frame containing two columns: occupation
(as a factor) and credit_utilization_ratio. Then, a box plot is used to display the credit utilization
ratio distribution by occupation using ggplot2.

Figure 35

Q.5.1. Statistical Analysis Output

An ANOVA test was run to examine the relationship between occupation (Artist, Doctor,
Engineer, and Teacher) and credit utilization ratio. Descriptive statistics showed varying
means and medians of credit utilization by occupation, with the highest variability among
doctors. The large p-value of 0.903, well above 0.05, shows that the observed differences in
credit utilization ratio between occupations are not statistically significant and are likely
due to chance.

Figure 36

Q.5.2. Visualization Output



The above box plot shows how credit utilization ratio varies across different occupations:
Artist, Doctor, Engineer, and Teacher. Engineers have the lowest and most consistent credit
utilization. Doctors show the widest range of credit use, suggesting diverse financial behaviors
within this group. Artists have a consistently moderate utilization, while Teachers fall in
between, showing moderate variability. Essentially, the graph visually compares how
responsibly (or how much) people in these professions use their available credit.

6. What is the correlation between Credit Utilization Ratio and Outstanding Debt?

Figure 37

Q.6.2. Visualization Code

This code filters the Credit_score_data dataset to records whose Credit Utilization Ratio
is between 20 and 25 and whose Outstanding Debt is between 500 and 700, inclusive. It checks
whether any data remain after filtering and stops execution if none do. If data remain, they
are used to calculate the correlation between Credit Utilization Ratio and Outstanding Debt,
which is then rounded and printed. A scatter plot visualizes the relationship between the two
variables.
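
The filter, guard, and correlation steps could be written roughly as follows. The dataset name `Credit_score_data` comes from the description above; the column names and the synthetic values are assumptions.

```r
# Synthetic stand-in for Credit_score_data (column names are assumptions).
set.seed(9)
Credit_score_data <- data.frame(
  Credit_Utilization_Ratio = runif(5000, min = 20, max = 50),
  Outstanding_Debt = runif(5000, min = 0, max = 5000)
)

# Keep rows inside both ranges, bounds included.
filtered <- subset(
  Credit_score_data,
  Credit_Utilization_Ratio >= 20 & Credit_Utilization_Ratio <= 25 &
    Outstanding_Debt >= 500 & Outstanding_Debt <= 700
)

# Stop early if the filter removed every row.
if (nrow(filtered) == 0) {
  stop("No rows remain after filtering.")
}

# Pearson correlation on the filtered rows, rounded for display.
debt_cor <- round(cor(filtered$Credit_Utilization_Ratio, filtered$Outstanding_Debt), 3)
cat("Rows kept:", nrow(filtered), "\n")
cat("Correlation:", debt_cor, "\n")
```

Note that a correlation computed on such a narrow slice describes only that slice and need not match the correlation in the full dataset.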

Figure 38

Q.6.2. Visualization Output

This scatter plot indicates a weak positive correlation between the credit utilization ratio
and outstanding debt. Although there is a slight tendency for high debt amounts to be associated
with high credit utilization, the wide scatter of data points indicates that the trend is weak and
that other factors are likely to be strong determinants of outstanding debt amounts. Hence, the
weak nature of this correlation suggests that the credit utilization ratio is not a reliable predictor
of outstanding debt.

7. Does the interest rate significantly vary based on the number of credit cards?

Figure 39

Q.7.1. Statistical Analysis Code

A t-test is performed to compare mean interest rates between two groups of customers,
those with a "High" and those with a "Low" number of credit cards, and to determine whether
the difference is statistically significant.

Figure 40

Q.7.2. Visualization Code

This code uses a line chart to depict the relationship between the number of credit cards
and the average interest rate. It groups the data by the number of credit cards and calculates
the mean interest rate for each group. The result is a line graph with points, plotting the
average interest rate against the number of credit cards.
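
The group-then-plot pattern described above is sketched below. It assumes the ggplot2 package is installed; the grouping is done with base R's `aggregate()` instead of whatever grouping function the project used, and the column names and data are synthetic assumptions.

```r
library(ggplot2)

# Synthetic stand-in for the credit dataset (column names are assumptions).
set.seed(13)
credit <- data.frame(
  Num_Credit_Card = sample(1:10, 500, replace = TRUE),
  Interest_Rate = runif(500, min = 5, max = 35)
)

# Mean interest rate for each number of credit cards.
avg_rate <- aggregate(Interest_Rate ~ Num_Credit_Card, data = credit, FUN = mean)
names(avg_rate)[2] <- "Avg_Interest_Rate"

# Line chart with points: average interest rate against number of cards.
rate_plot <- ggplot(avg_rate, aes(x = Num_Credit_Card, y = Avg_Interest_Rate)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Average interest rate by number of credit cards",
    x = "Number of credit cards",
    y = "Average interest rate"
  ) +
  theme_minimal()

print(avg_rate)
```

Groups with very few customers produce unstable averages, so it can help to inspect the group sizes with `table(credit$Num_Credit_Card)` before reading spikes in the line as real effects.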

Figure 41

Q.7.1. Statistical Analysis Output

The Welch Two Sample t-test shows that interest rates differ significantly between the
credit card groups labeled "High" and "Low" (p-value = 0.03123). The mean interest rate was
greater for the "High" group (75.65) than for the "Low" group (69.29). The 95% confidence
interval for the difference in means ranged from 0.57 to 12.14.

Figure 42

Q.7.2. Visualization Output

The line chart above shows a relationship between the number of credit cards a person has
and the average interest rate they pay. When someone has very few credit cards (close to
zero), the interest rate is consistently low. However, as the number of credit cards increases,
the average interest rate becomes much more volatile and shows sharp spikes. This suggests that
having many credit cards is associated with higher and less predictable interest rates,
potentially indicating greater financial risk.

8. Is there an association between "Num_Credit_Inquiries" and the "Credit_Score"
category? (Chi-square test)

Figure 43

Q.8.1. Statistical Analysis Code

This code reads the dataset, removes missing data, and converts the Num_Credit_Inquiries
column to numeric form before classifying it into ranges (0-1, 2-3). A contingency table is
created to cross-tabulate the inquiry ranges against Credit_Score, and a Chi-Square test is
conducted to determine whether the two variables are significantly associated.
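
The binning and Chi-Square steps can be sketched in base R as follows. The data are synthetic, and the bin edges beyond "0-1" and "2-3" (here "4-10" and ">10") are assumptions, since only the first two ranges are named above.

```r
# Synthetic stand-in for the credit dataset; inquiries start out as text.
set.seed(21)
credit <- data.frame(
  Num_Credit_Inquiries = as.character(sample(0:12, 600, replace = TRUE)),
  Credit_Score = sample(c("Good", "Standard", "Poor"), 600, replace = TRUE),
  stringsAsFactors = FALSE
)
credit$Num_Credit_Inquiries[c(2, 50)] <- NA   # a few missing values

# Remove missing data and convert the inquiry column to numeric.
credit <- na.omit(credit)
credit$Num_Credit_Inquiries <- as.numeric(credit$Num_Credit_Inquiries)

# Classify inquiries into ranges (upper two bins are assumed for illustration).
credit$Inquiry_Group <- cut(
  credit$Num_Credit_Inquiries,
  breaks = c(-Inf, 1, 3, 10, Inf),
  labels = c("0-1", "2-3", "4-10", ">10")
)

# Contingency table of inquiry group against credit score, then the test.
inquiry_table <- table(credit$Inquiry_Group, credit$Credit_Score)
chi_result <- chisq.test(inquiry_table)

print(inquiry_table)
print(chi_result)
```

The test's degrees of freedom equal (rows - 1) × (columns - 1), so the number of bins chosen directly affects the result.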

Figure 44

Q.8.2. Visualization Source Code

We used the geom_bar() function to create a bar plot showing the distribution of credit
scores within each credit inquiry group. The "dodge" position places the bars side by side,
which allows the groups to be compared. The bar plot aids in understanding the association
identified by the Chi-Square test.

Figure 45

Q.8.1. Statistical Analysis Output

The output shows a p-value (< 2.2e-16) that is less than 0.05, which means there is a
statistically significant association between “Num_Credit_Inquiries” and “Credit_Score”.

Figure 46

Q.8.2. Visualization Output

The bar plot above shows a strong relationship between credit inquiry group and credit
score. Customers with few inquiries (0-3) tend to have good credit scores, while those with
more inquiries, especially those above 10, often have poor credit scores. This suggests that
frequent credit inquiries may negatively affect creditworthiness, as seen in the higher
proportion of poor scores in the higher inquiry groups. The Chi-Square test confirms a
significant association between credit score category and credit inquiries.

9. Is there a correlation between the Monthly Balance and Credit Utilization Ratio of
bank customers?

Figure 47

Q.9.1. Statistical Analysis Code

This code calculates the correlation between monthly_balance and credit_utilization_ratio
using the cor() function with the dplyr package’s pipe operator (%>%). The
use = "complete.obs" argument ensures that only complete (non-missing) observations are
included in the calculation.
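
The effect of the `use` argument is easiest to see on a tiny example. The sketch below uses a small hand-made data frame (the values and the name `bank` are invented for illustration) and plain function calls rather than a pipe.

```r
# Small hand-made example with one missing value in each column.
bank <- data.frame(
  Monthly_Balance          = c(250.1, 310.4, NA,   402.9, 198.7, 520.3),
  Credit_Utilization_Ratio = c(28.5,  31.2,  35.0, NA,    26.4,  38.9)
)

# Default behaviour: any missing value makes the result NA.
cor_default <- cor(bank$Monthly_Balance, bank$Credit_Utilization_Ratio)

# With use = "complete.obs", only rows where both values are present are used.
cor_complete <- cor(bank$Monthly_Balance, bank$Credit_Utilization_Ratio,
                    use = "complete.obs")

cat("Default:      ", cor_default, "\n")
cat("Complete obs.:", round(cor_complete, 4), "\n")
```

Here four of the six rows are complete, so the second call is computed on those four pairs only.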

Figure 48

Q.9.1. Statistical Analysis Output

The correlation coefficient is 0.2514864, which shows a weak positive correlation between
Monthly_Balance and Credit_Utilization_Ratio. When monthly balance increases, the credit
utilization ratio also tends to increase slightly, although the relationship is not strong.
A value this close to 0 indicates that the two variables are only weakly related.

Figure 49

Q.9.2. Visualization Source Code

This source code creates a scatter plot that visualizes the relationship between
Monthly_Balance and Credit_Utilization_Ratio, where each point represents one data entry and a
red regression line is added to show the trend.

Figure 50

Q.9.2. Visualization Output



This scatter plot shows a dispersed pattern of data points, and the red regression line
has an upward slope, which indicates a weak positive correlation between the two variables.

10. How is Monthly In-Hand Salary distributed across customers?

Figure 51

Q.10.1. Statistical Analysis Source Code

This code calculates the mean, median, and standard deviation of Monthly In-Hand Salary
using summarize(), removing missing values so that the calculations are correct.

Figure 52

Q.10.1. Statistical Analysis Output

The descriptive statistics for Monthly In-Hand Salary show Mean (Average salary) =
4194.171, Median (Middle value) = 3093.745 and Standard Deviation (Spread of salary data) =
3183.686.

Figure 53

Q.10.2. Visualization Source Code

Figure 54

Q.10.2. Visualization Output

The histogram shows the distribution of Monthly In-Hand Salary. Most customers' salaries
are concentrated around 10000-15000. The tail is longer on the right side, so the distribution
is positively (right) skewed.

11. Do individuals with greater number of loans generally have a greater credit
utilization ratio?

Figure 55

Q.11.1 Statistical Analysis Source Code

This code performs linear regression using lm(), where Credit_Utilization_Ratio is the
dependent variable and Num_of_Loan is the independent variable. It helps us determine whether
having more loans affects credit utilization.

Figure 56

Q.11.1. Statistical Analysis Output



The linear regression model output gives the regression equation, which describes how
Num_of_Loan affects Credit_Utilization_Ratio, and the p-value, which tells us whether the
relationship is statistically significant.

Figure 57

Q.11.2. Visualization Source Code

This code creates a scatter plot with Num_of_Loan on the x-axis as the independent
variable and Credit_Utilization_Ratio on the y-axis as the dependent variable. A regression
line is added to show the trend.

Figure 58

Q.11.2. Visualization Output

The scatter plot shows an upward trend, indicating that people with more loans tend to have
higher credit utilization.

12. What is the percentage breakdown of customers within each Credit Score category?

Figure 59

Q.12.1. Statistical Analysis Source Code

This code groups the data by Credit_Score, counts the number of customers in each
category, and calculates each category's percentage of total customers.
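
A count-and-percentage breakdown like this can be sketched with base R's `table()` and `prop.table()`. The toy data below (18 Good, 29 Poor, 53 Standard out of 100 customers) are invented for illustration, and the column name `Credit_Score` is an assumption.

```r
# Toy data: 100 customers in three credit score categories.
credit <- data.frame(
  Credit_Score = c(rep("Good", 18), rep("Poor", 29), rep("Standard", 53))
)

# Count customers per category, then convert counts to percentages.
score_counts <- table(credit$Credit_Score)
score_pct <- round(100 * prop.table(score_counts), 3)

# Combine counts and percentages into one summary data frame.
breakdown <- data.frame(
  Credit_Score = names(score_counts),
  Count = as.vector(score_counts),
  Percentage = as.vector(score_pct)
)
print(breakdown)
```

Checking that the percentages sum to 100 is a quick way to confirm that no category was dropped during grouping.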

Figure 60

Q.12.1. Statistical Analysis Output

The output shows the percentage of customers in each credit score category: Good:
17.828%, Poor: 28.998%, and Standard: 53.174%.

Figure 61

Q.12.2 Visualization Source Code

A bar chart is used to show the percentage distribution, where each bar represents a credit
score category and different colors distinguish the categories.

Figure 62

Q.12.2. Visualization Output

The bar chart shows the percentage distribution of customers by credit score category,
with a larger bar for Standard, and smaller bars for Good and Poor.

13. Does the number of bank accounts differ significantly across different occupation
types? (Kruskal-Wallis test)

Figure 63

Q.13.1. Statistical Analysis Code



The Kruskal-Wallis test is a non-parametric test that determines whether there are any
significant differences in the median number of bank accounts by occupation. Since the data
might not be normally distributed, it is a suitable alternative to ANOVA.
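
A Kruskal-Wallis test of this kind can be run with base R's `kruskal.test()`. The sketch below uses synthetic data with only four occupations for brevity; the column names `Occupation` and `Num_Bank_Accounts` are assumptions.

```r
# Synthetic stand-in for the credit dataset (column names are assumptions).
set.seed(17)
credit <- data.frame(
  Occupation = factor(sample(c("Artist", "Doctor", "Engineer", "Teacher"),
                             400, replace = TRUE)),
  Num_Bank_Accounts = rpois(400, lambda = 5)
)

# Kruskal-Wallis rank sum test: bank accounts across occupation groups.
kw <- kruskal.test(Num_Bank_Accounts ~ Occupation, data = credit)
print(kw)

# Group medians to accompany the test result.
print(aggregate(Num_Bank_Accounts ~ Occupation, data = credit, FUN = median))
```

The degrees of freedom equal the number of groups minus one, so four occupations give 3 degrees of freedom.
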
Figure 64

Q.13.2. Visualization Source Code

The density plot illustrates the number of bank accounts per occupation. To cope with
skewness and the wide variability of values, the x-axis is log-transformed with
scale_x_log10(). Different colors distinguish the occupation groups, and transparency
(alpha = 0.4) helps in interpreting overlapping distributions.
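
A density plot with a log-scaled x-axis might be built as follows. This sketch assumes the ggplot2 package is installed; the data are synthetic, the column names are assumptions, and one is added to the counts so that every value is positive before the log transform.

```r
library(ggplot2)

# Synthetic stand-in for the credit dataset (column names are assumptions).
set.seed(19)
credit <- data.frame(
  Occupation = factor(sample(c("Artist", "Doctor", "Engineer", "Teacher"),
                             400, replace = TRUE)),
  # Add 1 so that no value is zero, which a log10 axis cannot display.
  Num_Bank_Accounts = rpois(400, lambda = 5) + 1
)

# Overlapping densities per occupation on a log10 x-axis.
density_plot <- ggplot(credit, aes(x = Num_Bank_Accounts, fill = Occupation)) +
  geom_density(alpha = 0.4) +
  scale_x_log10() +
  labs(
    title = "Number of bank accounts by occupation",
    x = "Number of bank accounts (log scale)",
    y = "Density"
  ) +
  theme_minimal()
```

Calling `print(density_plot)` renders the figure. Zero or negative values would be dropped by the log scale, which is why the real data may need similar handling.
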
Figure 65

Q.13.1. Statistical Analysis Output

The Kruskal-Wallis chi-squared statistic is 72.417 with 15 degrees of freedom and a
p-value of 1.652e-09, which is strong evidence against the null hypothesis (H0). A larger test
statistic indicates a greater difference between the groups. Because the p-value is less than
0.05, there is a statistically significant difference in the number of bank accounts across
occupations. We therefore reject the null hypothesis and conclude that at least one occupation
has a different median number of bank accounts from the others.

Figure 66

Q.13.2 Visualization Output

The visualization above shows a density plot of the number of bank accounts across
occupations. Different colors represent different occupations, which helps to compare how the
number of bank accounts varies across professions. The x-axis represents the number of bank
accounts and the y-axis represents density. The density plot suggests that the number of bank
accounts is associated with occupation type.

14. Do customers with a higher number of delayed payments have significantly different
annual incomes compared to others?

Figure 67

Q.14.1. Statistical Analysis Code

This code performs a Mann-Whitney U test to analyze whether there is a significant
difference in annual income between people with “More than 10 Delays” in payments and those
with fewer delays.
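
In R, the Mann-Whitney U test is available through `wilcox.test()` (the Wilcoxon rank sum test). The sketch below uses synthetic data; the column names and the label for the second group are assumptions.

```r
# Synthetic stand-in for the credit dataset (column names are assumptions).
set.seed(23)
credit <- data.frame(
  Num_of_Delayed_Payment = sample(0:25, 400, replace = TRUE),
  Annual_Income = rlnorm(400, meanlog = 10.5, sdlog = 0.7)
)

# Split customers into two groups by number of delayed payments.
credit$Delay_Group <- ifelse(credit$Num_of_Delayed_Payment > 10,
                             "More than 10 Delays", "10 or Fewer Delays")

# Mann-Whitney U test (Wilcoxon rank sum test) on annual income.
mw <- wilcox.test(Annual_Income ~ Delay_Group, data = credit)
print(mw)

# Group medians to accompany the test result.
print(aggregate(Annual_Income ~ Delay_Group, data = credit, FUN = median))
```

Because the test is rank-based, it is not affected by the extreme income values that would distort a comparison of means.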

Figure 68

Q.14.2. Visualization code

This code uses violin plots and boxplots to show the income distribution across the
delayed payment groups, with log scaling to handle the skewed income data. Here, alpha sets
the transparency, and the trim option controls whether the long tails of the violins are
trimmed.

Figure 69

Q.14.1. Statistical Analysis Output

This output shows a p-value of 2.2e-16, which is less than 0.05, so the result is highly
significant. People with “More than 10 Delays” in payments have a significantly different
Annual_Income compared to those with fewer delayed payments.

Figure 70

Q.14.2. Visualization Output

These violin plots show that the distribution of Annual_Income varies among the delayed
payment groups, implying a possible link between payment delays and income level. This visual
evidence supports the result of the Mann-Whitney U test, which found the differences to be
statistically significant.

15. Is there a significant correlation between the age of individuals and their total
monthly EMI payments?

Figure 71

Q.15.1. Statistical Analysis Code



This code computes Spearman’s rank correlation. It assesses whether the relationship
between “Age” and “Total EMI per Month” can be described by a monotonic function. The cat()
function prints Spearman’s correlation coefficient to the console.
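
Spearman's rank correlation can be computed with `cor()` by setting `method = "spearman"`. The sketch below uses synthetic data with assumed column names `Age` and `Total_EMI_per_month`; the `cor.test()` call is an addition for illustration, since it also reports a p-value.

```r
# Synthetic stand-in for the credit dataset (column names are assumptions).
set.seed(29)
credit <- data.frame(
  Age = sample(18:60, 300, replace = TRUE),
  Total_EMI_per_month = rlnorm(300, meanlog = 4.5, sdlog = 0.8)
)

# Spearman's rank correlation coefficient on complete observations.
rho <- cor(credit$Age, credit$Total_EMI_per_month,
           method = "spearman", use = "complete.obs")
cat("Spearman's rank correlation (rho):", rho, "\n")

# Optional significance test; exact = FALSE because ages contain ties.
rho_test <- cor.test(credit$Age, credit$Total_EMI_per_month,
                     method = "spearman", exact = FALSE)
cat("p-value:", signif(rho_test$p.value, 3), "\n")
```

Spearman's coefficient works on ranks, so implausible outliers such as the extreme ages seen in the dataset influence it less than they would Pearson's coefficient.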

Figure 72

Q.15.2. Visualization Code

Here, the geom_point() function creates a scatter plot of the data points, and
geom_smooth() with method = "lm" adds a linear regression line. The scatter plot represents
the relationship and the line shows the trend.

Figure 73

Q.15.1. Statistical Analysis Output

Here, the correlation coefficient (-0.07556284) shows a very weak, slightly negative
correlation between Age and Total EMI per Month.

Figure 74

Q.15.2. Visualization Output

In general, the scatter plot graphically highlights the weak correlation between Age and Total
EMI per Month and also reveals potential data quality issues that should be addressed for a
more accurate study.

Conclusion

In conclusion, in this assignment we explored a dataset of bank customers' credit
information, focusing on the factors that affect customers' credit scores by examining data
that include customers' demographic details and credit usage patterns. Using R programming, we
applied many data analysis methods, such as data exploration, transformation, and
visualization. Key factors, including age, occupation, and credit utilization ratio, were
identified as significant determinants of a customer’s credit score. Hypothesis testing and
descriptive analysis confirmed these results, improving our understanding of how demographics
and credit behavior affect credit ratings.

Moreover, we implemented advanced methods to enhance the findings further, including
machine learning models for better classification. These improvements resulted in more precise
insights and suggestions for decision-makers within the banking sector. The analysis
encountered limitations, such as possible data biases and the need for external data to
enhance the model’s robustness. Future research may address these limitations by examining a
wider range of datasets and using more advanced algorithms, allowing a deeper understanding of
credit behavior and equipping financial institutions with improved tools.

References

(2024, September 19). Retrieved from Prometheus Group:
https://www.prometheusgroup.com/resources/posts/what-is-predictive-analytics

Baiju, A., Lindsay, H., & John, M. S. (2020, November 05). Midterm: CAR Package Overview.
Retrieved from RPubs:
https://rpubs.com/mjs3pf/carpackage#:~:text=%E2%80%9Ccar%E2%80%9D%2C%20the%20name%20of,creates%20visualizations%2C%20and%20transform%20data.

Biswal, A. (2024, December 06). Retrieved from simplilearn.com:
https://www.simplilearn.com/tutorials/data-analytics-tutorial/exploratory-data-analysis#what_is_exploratory_data_analysis

Cote, C. (2021, November 09). Retrieved from
https://online.hbs.edu/blog/post/descriptive-analytics

Cote, C. (2021, November 18). Retrieved from
https://online.hbs.edu/blog/post/diagnostic-analytics

Coursera Staff. (2025, January 16). Data Visualization. Retrieved from Coursera:
https://www.coursera.org/articles/data-visualization

Coursera Staff. (2025, January 14). what-is-r-programming. Retrieved from
https://www.coursera.org/: https://www.coursera.org/articles/what-is-r-programming

Data Cleaning. (2024, June 20). Retrieved from GeeksforGeeks:
https://www.geeksforgeeks.org/what-is-data-cleaning/

Devashree. (2024, September 27). A comprehensive Guide on ggplot2 in R. Retrieved from
https://www.analyticsvidhya.com/blog/2022/03/a-comprehensive-guide-on-ggplot2-in-r/

Hayes, M. (2024, June 19). What is data transformation. Retrieved from IBM:
https://www.ibm.com/think/topics/data-transformation

R Core Team and Contributors Worldwide. (n.d.). stats-package: The R Stats Package. Retrieved
from
https://stats.oarc.ucla.edu/stat/data/intro_r/intro_r_interactive_flat.html#:~:text=dat_csv%20data%20set.-,Statistical%20analysis%20in%20R,correlation%20and%20covariance

r-programming-language-introduction. (2024, May 26). Retrieved from
https://www.geeksforgeeks.org/:
https://www.geeksforgeeks.org/r-programming-language-introduction/

Sean. (2024, October 31). Retrieved from
https://www.fanruan.com/en/glossary/big-data/data-loading

Shah, E. (2023, August 21). dplyr Package in R Programming. Retrieved from Scaler.com:
https://www.scaler.com/topics/dplyr-package-in-r/

su, G. (2019, January 22). A Comprehensive List of Handy R Packages. Retrieved from Medium:
https://medium.com/towards-data-science/a-comprehensive-list-of-handy-r-packages-e85dad294b3d

What is Exploratory Data Analysis. (2025, January 13). Retrieved from GeeksforGeeks:
https://www.geeksforgeeks.org/what-is-exploratory-data-analysis/

Appendix

Workload Matrix
