PFDA (Programming For Data Analysis) APU
Table of Contents
Introduction
Aim
Scope
Objectives
Research Questions
R programming
Introduction To R programming
Packages in R
Descriptive Analysis
Diagnostic Analysis
Predictive Analysis
Prescriptive Analysis
Data Cleaning
Data Manipulation
Data Transformation
Data Visualization
1. Does the number of bank accounts a person holds influence how frequently they delay
payments?
2. Does the frequency of credit card usage significantly differ between individuals with low
and high credit scores?
3. What is the distribution of credit cards among the customers aged in between 20-30?
4. Is there a significant difference in annual income between customers with standard credit
scores and poor credit scores?
6. What is the correlation between Credit Utilization Ratio and Outstanding Debt?
7. Does the interest rate significantly vary based on the number of credit cards?
9. Is there a correlation between the Monthly Balance and Credit Utilization Ratio of bank
customers?
11. Do individuals with greater number of loans generally have a greater credit
utilization ratio?
12. What is the percentage breakdown of customers within each Credit Score category?
13. Does the number of bank accounts differ significantly across different occupation
types? (Kruskal-Wallis test)
14. Do customers with a higher number of delayed payments have significantly different
annual incomes compared to others?
15. Is there a significant correlation between the age of individuals and their total monthly
EMI payments?
Conclusion
References
Appendix
Workload Matrix
Introduction
This project covers the analysis and development of a credit score classification system
using the R programming language. Credit scores are important in financial decision-making
because they allow banks and other financial organizations to assess a customer’s
creditworthiness. The classification system is developed using data analytics techniques
such as data exploration. The project aims to predict customer credit scores from past financial
data. The findings offer useful perspectives for evaluating risk and making decisions in
the banking sector (Pebesma & Graler, 2025).
Aim
The main aim of this assignment is to create an effective, dependable and precise Credit
Score Classification model using R programming. This model will classify customers
according to their reliability, assisting banks and other institutions in making well-informed
decisions while reducing the risk of default. This assignment will explore the effectiveness of
different classification algorithms to identify the best model for assessing credit risk.
Furthermore, it aims to emphasize the important financial metrics that impact credit scores,
ensuring that the model is clear and understandable.
Scope
Objectives
To explore and analyse a set of data and reconstruct it into meaningful representations for
decision making.
To identify the factors that differentiate the credit scores of customers and provide useful
recommendations to stakeholders.
To apply different data analysis techniques and present the findings through statistical charts
and graphs, ensuring clear and valuable insights.
Research Questions
The main purpose of this analysis is to examine the given dataset and answer the following
research questions. The research questions, each addressed by its own analysis, are
given below.
1. Does the number of bank accounts a person holds influence how frequently they
delay payments?
2. Does the frequency of credit card usage significantly differ between individuals with
low and high credit scores?
3. What is the distribution of credit cards among the customers aged in between 20-30?
4. Is there a significant difference in annual income between customers with standard
credit scores and poor credit scores?
5. Is there a significant difference in Credit Utilization Ratio among Occupation
categories?
6. What is the correlation between Credit Utilization Ratio and Outstanding Debt?
7. Does the interest rate significantly vary based on the number of credit cards?
8. Is there an association between the "Num_Credit_Inquiries" and the "Credit_Score"
category?
9. Is there a correlation between the Monthly Balance and Credit Utilization Ratio of
bank customers?
10. How is Monthly In-Hand Salary distributed across customers?
11. Do individuals with greater number of loans generally have a greater credit
utilization ratio?
12. What is the percentage breakdown of customers within each Credit Score category?
13. Does the number of bank accounts differ significantly across different occupation
types? (Kruskal-Wallis test)
14. Do customers with a higher number of delayed payments have significantly different
annual incomes compared to others? (Mann-Whitney U test)
15. Is there a significant correlation between the age of individuals and their total
monthly EMI payments?
R programming
Introduction To R programming
Packages in R
R packages consist of R functions, compiled code, and sample data. Within the R
environment, they reside in a directory named “library”. A standard set of packages is
installed together with R, and other packages are installed later, as and when they are required
for a specific purpose. When the R console is first opened, only the default packages are
loaded. Some of the packages used in this assignment are described below:
dplyr:
Hadley Wickham built dplyr, a popular R package that transforms the way data
manipulation tasks are conducted. The package provides a strong and user-friendly syntax for
data manipulation, making it easier for R users to explore, transform, and summarize datasets
(Shah, 2023).
Tidyr
This package simplifies data management by altering and reshaping data. The tidyr
package is a key component of the tidyverse, focusing on reshaping data into a tidy,
consistent form. To use this package, it must first be installed in the R environment and then
loaded at the top of the script.
Readr
The primary goal of readr is to provide a rapid and simple way to load tabular data into R.
Its CSV reader is read_csv(); the read.csv() function used in this assignment is the base R
equivalent.
ggplot2
ggplot2 is a popular data visualization library for R. It is based on the
“Grammar of Graphics”, which offers a formal and systematic technique for
constructing and analyzing data visualizations. ggplot2 enables users to create a wide
range of high-quality and configurable statistical charts, making it an effective tool for data
exploration and reporting (Devashree, 2024).
Stats
This package, installed with base R, contains a broad range of regularly used statistical
techniques, including chi-square tests, t-tests, correlation, ANOVA and linear regression
models (R Core Team and Contributors Worldwide).
Car
The package name, “car,” is an abbreviation of Companion to Applied
Regression. The package is not itself used to perform applied regression techniques; rather,
it serves as a companion, providing a collection of functions for performing tests, producing
visualizations, and transforming data (Baiju et al., 2020).
Scale
This R package helps format and prettify ggplot scales, which can otherwise be irritating and
time-consuming to adjust. It is especially helpful for generating log-scale ticks (su, 2019).
Effsize
This package offers functions for calculating standardized effect sizes in experiments
(Cohen's d, Hedges' g). The calculation methods have been optimized to enable fast
computation even for very large data sets.
The vcd package in R has facilities for the analysis and visualization of categorical data.
It provides functions for creating mosaic plots, association plots, and other graphical methods
to study the associations between categorical variables.
Data analysis is the process of turning unprocessed data into useful insights for defensible
decision-making. The analysis of our assignment dataset involves several steps.
Descriptive Analysis
The most basic kind of analytics is descriptive analytics, which serves as the basis for all
other forms. It makes it possible to extract patterns from unprocessed data and provide a
concise explanation of what has occurred or is occurring (Cote, 2021).
Diagnostic Analysis
The technique of using data to identify the reasons behind correlations and patterns
between variables is known as diagnostic analytics (Cote, 2021). It can be performed with
statistical software (like Microsoft Excel) or by hand.
Predictive Analysis
Prescriptive Analysis
The goal of prescriptive analytics is to simplify the decision-making process by eliminating
the need for educated guesses or assessments in data analytics.
The goal of exploratory data analysis is to gain a thorough understanding of the data and
discover its various features, frequently through visual aids. This makes it easier to interpret
the data and spot useful trends. It is critical to discover data trends and determine which
aspects are essential to the final output and which are not. Additionally, there might be
relationships between some variables and others (Biswal, 2024).
Figure 1
Importing, moving, or combining data from several sources into a single storage system
is referred to as data loading. This procedure guarantees accurate and secure data transfer
between systems (Sean, 2024). In this assignment, the necessary libraries are first loaded at the
top of the script with library(), the working environment is set up, and the dataset is then read
with a function such as read.csv().
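As a minimal sketch of this loading step, the example below writes a tiny CSV file to a temporary location and reads it back with read.csv(). The file contents and the column names (Age, Num_Bank_Accounts, Credit_Score) are illustrative assumptions, not the assignment dataset.

```r
# stats ships with R; add-on packages such as dplyr would be loaded the same way.
library(stats)

# Hypothetical three-row file, created only so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Age,Num_Bank_Accounts,Credit_Score",
  "23,3,Good",
  "41,6,Poor",
  "35,4,Standard"
), csv_path)

# read.csv() is the base R reader; keep text columns as character vectors.
credit_data <- read.csv(csv_path, stringsAsFactors = FALSE)
print(dim(credit_data))
```

In a real script, csv_path would simply be the path to the dataset file.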
Figure 2
Figure 3
The first stage of data analysis is data exploration, during which one delves into a dataset
to gain an understanding of its contents (What is Exploratory Data Analysis, 2025). It relies on
built-in functions such as str(), head(), and summary() to reveal structure, anomalies, and
other features that might merit additional investigation.
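The exploration functions named above can be sketched on a small, made-up data frame. The column names and values here are assumptions for illustration; the implausible age and the missing salary are included on purpose to show what exploration can surface.

```r
# Hypothetical data frame standing in for the loaded dataset.
credit_data <- data.frame(
  Age = c(23, 41, 35, 8698, 29),
  Monthly_Inhand_Salary = c(1800.5, NA, 3093.7, 5200.0, 2750.2),
  Credit_Score = c("Good", "Poor", "Standard", "Standard", "Poor")
)

str(credit_data)               # column types and a preview of the values
print(head(credit_data, 3))    # first three rows
print(summary(credit_data))    # min, max, quartiles, and NA counts per column

# Count missing values in each column.
missing_per_column <- colSums(is.na(credit_data))
print(missing_per_column)
```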
Figure 4
Figure 5
Figure 6
Figure 7
Data Cleaning
Data cleaning, also referred to as data scrubbing, is the process of finding and fixing (or
eliminating) mistakes and inaccuracies within a dataset. Accurate, consistent, and
dependable data is necessary for efficient analysis and decision-making, which makes this a
critical step in the data management process (Data Cleaning, 2024).
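A minimal sketch of typical cleaning steps is shown below, using base R only. The data, the column names, and the plausible age range of 18 to 100 are assumptions chosen for illustration and are not taken from the assignment code.

```r
# Hypothetical raw data with a missing age, an implausible age, and a duplicate row.
raw <- data.frame(
  Age = c(23, 41, NA, 8698, 29, 29),
  Annual_Income = c(21000, 54000, 37000, 45000, 30500, 30500)
)

# 1. Drop rows that contain any missing value.
no_missing <- raw[complete.cases(raw), ]

# 2. Keep only ages inside an assumed plausible range.
plausible <- no_missing[no_missing$Age >= 18 & no_missing$Age <= 100, ]

# 3. Remove exact duplicate rows.
cleaned <- unique(plausible)
print(cleaned)
```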
Figure 8: Data Cleaning
Figure 9
Data Manipulation
Figure 10
Figure 11
Figure 12
Data Transformation
Figure 13
Figure 14
Data Visualization
Data visualization is the practice of displaying data in visual form, such as charts, graphs,
and maps, which makes the data easier to understand (Coursera Staff, 2025). Common forms
include scatter plots, pie charts, boxplots, and histograms.
Figure 15
Figure 16
1. Does the number of bank accounts a person holds influence how frequently they
delay payments?
Figure 17
We used linear regression to ascertain whether there is a relationship between the
number of bank accounts and the number of late payments, that is, whether the number of late
payments can be predicted from the number of bank accounts. In the lm() formula, the
dependent variable (late payments) is written on the left of the ~ and the independent variable
(bank accounts) on the right. From the model summary, we obtain the p-value, R-squared,
slope, and intercept.
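The regression step can be sketched as follows. The data frame and its column names (Num_Bank_Accounts, Num_of_Delayed_Payment) are made-up assumptions, so the fitted values say nothing about the real dataset; the point is only how the slope, intercept, R-squared, and p-value are pulled from the model.

```r
# Hypothetical data: eight customers with made-up values.
accounts <- data.frame(
  Num_Bank_Accounts = c(1, 2, 3, 4, 5, 6, 7, 8),
  Num_of_Delayed_Payment = c(5, 7, 6, 9, 8, 11, 10, 13)
)

# Dependent variable on the left of ~, independent variable on the right.
model <- lm(Num_of_Delayed_Payment ~ Num_Bank_Accounts, data = accounts)
model_summary <- summary(model)

# Extract the quantities discussed in the text.
intercept <- coef(model)[["(Intercept)"]]
slope <- coef(model)[["Num_Bank_Accounts"]]
r_squared <- model_summary$r.squared
p_value <- model_summary$coefficients["Num_Bank_Accounts", "Pr(>|t|)"]

cat("Intercept:", round(intercept, 3), "\n")
cat("Slope:", round(slope, 3), "\n")
cat("R-squared:", round(r_squared, 3), "\n")
cat("p-value:", signif(p_value, 3), "\n")
```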
Figure 18
We used ggplot2 to generate two important plots: a scatter plot with a regression line
and a boxplot. The geom_point() function draws the scatter plot, in which each point
represents a single observation. The geom_smooth() function adds a regression line to show
the trend or relationship between the two variables; labs() supplies a descriptive title and axis
labels; and theme_minimal() sets the plot theme. Furthermore, geom_boxplot() creates a
boxplot for each group defined by the number of bank accounts, with whiskers showing the
spread of values, and stat_summary() adds a blue dot representing the mean number of late
payments for each group.
Figure 19
Figure 20
The plots show no strong association between late payments and the number of bank
accounts. Most points lie around zero, with some extreme outliers. The regression line is
essentially horizontal, indicating little or no effect. The boxplot supports this, showing little
difference among groups. Overall, the number of bank accounts has no noticeable impact on
late payments, although outliers may affect the findings.
2. Does the frequency of credit card usage significantly differ between individuals with
low and high credit scores?
Figure 21
Here we preprocess the dataset by eliminating rows with missing values and separating credit
scores into two categories: low and high. Extreme outliers are then filtered out using chained
(piped) operations. Levene’s test is used to check homogeneity of variance, followed by a
Welch’s t-test to compare the number of credit cards across the low and high credit score
categories. Finally, Cohen’s d is used to determine the magnitude of the difference, and a
boxplot depicts the distribution for each credit score group.
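The comparison can be sketched in base R as shown below. The two vectors are made-up card counts, and Cohen's d is computed by hand with the pooled standard deviation instead of the effsize package; Levene's test is left out of the sketch because it needs the car package. The sign of d depends on which group is subtracted from which.

```r
# Hypothetical numbers of credit cards for the two credit score groups.
low_cards  <- c(7, 8, 6, 9, 7, 5, 8)
high_cards <- c(5, 4, 6, 5, 3, 6, 5)

# Welch's t-test: var.equal = FALSE does not assume equal variances.
welch <- t.test(low_cards, high_cards, var.equal = FALSE)

# Cohen's d using the pooled standard deviation (low minus high).
n_low <- length(low_cards)
n_high <- length(high_cards)
pooled_sd <- sqrt(((n_low - 1) * var(low_cards) + (n_high - 1) * var(high_cards)) /
                    (n_low + n_high - 2))
cohens_d <- (mean(low_cards) - mean(high_cards)) / pooled_sd

cat("p-value:", signif(welch$p.value, 3), "\n")
cat("Cohen's d:", round(cohens_d, 3), "\n")
```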
Figure 22
Here we used ggplot2 to generate a boxplot of the number of credit cards for the low and
high credit score categories. The geom_boxplot() function generates the plot, with the x-axis
representing the credit score group and the y-axis representing the number of credit cards.
Figure 23
According to the findings, those with low credit scores have more credit cards than
people with high credit scores. Levene’s test suggests unequal variances between the groups,
so Welch’s t-test is appropriate. The t-test reveals a statistically significant difference in the
average number of credit cards. The low credit score group held 6.64 cards on average, while
the high credit score group held 5.10. The Cohen’s d value of -0.745 indicates a medium effect
size, which means the difference is practically meaningful but not very large. Thus, people
with lower credit ratings tend to have more credit cards, and the difference matters in practice.
Figure 24
The boxplot illustrates that low credit-score individuals have a higher median number of
credit cards and a wider range than high credit-score individuals. The high group has a lower
median and a narrower, more consistent spread, which supports the t-test results.
3. What is the distribution of credit cards among the customers aged in between 20-
30?
Figure 25
Here, we use a t-test to compare the number of credit cards between two age categories,
<=25 and >25. The code first calculates the mean, median, standard deviation, minimum, and
maximum number of credit cards. It then calculates the correlation between age and the
number of credit cards, taking missing values into account. Finally, it segments individuals into
the two age groups and conducts an independent samples t-test to compare the mean number
of credit cards held by each group. The code prints the descriptive statistics, the correlation
coefficient, and the t-test results.
Figure 26
We filter customers aged 20 to 30 with a total number of credit cards between 8 and 1300. The
filtered data is then grouped by age, and the number of individuals at each age is counted.
From these counts, a bar chart is drawn to visualize the age distribution. The cleaned dataset is
saved to a CSV file, and summary statistics, column names, and the structure of the cleaned
data are printed to the console.
Figure 27
The summary above describes a dataset in which customer age ranges from 14 to the
implausible figure of 8698, and monthly salary ranges from as little as 303.6 to as much as
15204.6. Many other banking, credit card, loan, and payment-related variables are also present
and need to be examined.
Figure 28
The above bar graph shows the distribution of credit card holders across ages 20
to 30. The chart shows a fairly uniform distribution across the ages considered, with counts
ranging from roughly 340 to 400 people per age. The distribution is therefore close to uniform
rather than bell-shaped.
Figure 29
We perform an independent samples t-test to compare annual income between two groups:
'Standard' credit scores and 'Poor' credit scores. The data is subset to create two vectors,
good_credit and bad_credit, holding the annual incomes of each group. The means of
the two vectors are compared using the t.test() function, which helps determine whether there
is a statistically significant difference in income between the two credit score groups.
Figure 30
We ran the t-test to compare mean annual incomes across the "Standard" and "Poor" credit
groups, and used a density plot to show the annual income distribution for each credit score
category (Standard and Poor).
Figure 31
The output shows that for individuals marked as having "Good" credit, annual
incomes range from 7006 to 24188807, with a median of 37159 and a mean of 187762, while
for those with "Poor" credit, incomes range from 7006 to 23912939, with a median of 32195
and a mean of 154497. The "Good" credit group thus has a higher average income.
Figure 32
The above density plot illustrates the relationship between annual income and credit
score for the “Poor” and “Standard” groups. It shows that people with standard credit scores
tend to have a higher annual income than those with poor credit scores. The two distributions
overlap in some income ranges. However, most of the distribution indicates that income is
positively associated with credit score. The plot therefore suggests that poor credit is
associated with lower income levels, while standard credit is associated with higher income
levels.
Figure 33
The ANOVA test is conducted to check if there are any statistically significant differences
in the mean credit utilization ratio across the various occupations.
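A one-way ANOVA of this kind can be sketched with aov(). The occupations match those named in this report, but the sixteen credit utilization values are made up for illustration, so the resulting p-value is not the one reported in the assignment.

```r
# Hypothetical data: four customers per occupation with made-up ratios.
occupation_data <- data.frame(
  Occupation = factor(rep(c("Artist", "Doctor", "Engineer", "Teacher"), each = 4)),
  Credit_Utilization_Ratio = c(31.2, 33.5, 29.8, 32.1,
                               30.4, 35.9, 27.6, 33.0,
                               31.0, 32.2, 30.9, 31.7,
                               32.8, 30.1, 33.4, 29.9)
)

# One-way ANOVA: does the mean ratio differ across occupations?
anova_fit <- aov(Credit_Utilization_Ratio ~ Occupation, data = occupation_data)
anova_table <- summary(anova_fit)[[1]]
print(anova_table)

# The first row of the table belongs to the Occupation term.
p_value <- anova_table[["Pr(>F)"]][1]
cat("p-value:", round(p_value, 3), "\n")
```

A p-value above 0.05 means the differences between group means could plausibly be due to chance.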
Figure 34
This code implements a box plot demonstrating the relationship between occupation and
credit utilization ratio. The first step is to create a data frame containing two columns: occupation
(as a factor) and credit_utilization_ratio. Then, a box plot is used to display the credit utilization
ratio distribution by occupation using ggplot2.
Figure 35
An ANOVA test was run to examine the relationship between occupation
(Artist, Doctor, Engineer and Teacher) and credit utilization ratio. Descriptive statistics showed
varying means and medians of credit utilization by occupation, with the highest variability
among doctors. The large p-value of 0.903, well above 0.05, shows that the observed
differences in credit utilization ratio between occupations are likely due to chance.
Figure 36
The above box plot shows how credit utilization ratio varies across different occupations:
Artist, Doctor, Engineer, and Teacher. Engineers have the lowest and most consistent credit
utilization. Doctors show the widest range of credit use, suggesting diverse financial behaviors
within this group. Artists have a consistently moderate utilization, while Teachers fall in
between, showing moderate variability. Essentially, the graph visually compares how
responsibly (or how much) people in these professions use their available credit.
6. What is the correlation between Credit Utilization Ratio and Outstanding Debt?
Figure 37
This code filters the Credit_score_data dataset to records whose Credit Utilization
Ratio is between 20 and 25 and whose Outstanding Debt is between 500 and 700, both
inclusive. It checks whether any data remains after filtering and stops execution if none does.
If data remains, it is used to calculate the correlation between Credit Utilization Ratio and
Outstanding Debt in the filtered dataset, which is then rounded and printed. A scatter
plot is used to visualize the relationship between Credit Utilization Ratio and Outstanding Debt.
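The filter, guard, and correlation steps can be sketched in base R as below. The seven rows are made up and the column names are assumed spellings; only the thresholds (20 to 25 and 500 to 700) come from the description above.

```r
# Hypothetical rows; only some fall inside both ranges.
credit_score_data <- data.frame(
  Credit_Utilization_Ratio = c(21.5, 24.0, 30.2, 22.8, 19.9, 23.3, 20.4),
  Outstanding_Debt = c(520, 690, 610, 575, 640, 655, 900)
)

# Keep rows inside both inclusive ranges.
filtered <- subset(
  credit_score_data,
  Credit_Utilization_Ratio >= 20 & Credit_Utilization_Ratio <= 25 &
    Outstanding_Debt >= 500 & Outstanding_Debt <= 700
)

# Stop early if the filter removed everything.
if (nrow(filtered) == 0) {
  stop("No rows left after filtering")
}

# Pearson correlation on the filtered rows, rounded for printing.
correlation <- round(cor(filtered$Credit_Utilization_Ratio,
                         filtered$Outstanding_Debt), 3)
cat("Correlation:", correlation, "\n")
```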
Figure 38
This scatter plot indicates a weak positive correlation between the credit utilization ratio
and outstanding debt. Although there is a slight tendency for high debt amounts to be associated
with high credit utilization, the wide scatter of data points indicates that the trend is weak and
that other factors are likely to be strong determinants of outstanding debt amounts. Hence, the
weak nature of this correlation suggests that the credit utilization ratio is not a reliable predictor
of outstanding debt.
7. Does the interest rate significantly vary based on the number of credit cards?
Figure 39
A t-test was performed to compare mean interest rates between two groups of customers,
those with a high and those with a low number of credit cards, to determine whether there is a
statistically significant difference in interest rates based on the number of credit cards held.
Figure 40
This code uses a line chart to depict the relationship between the number of
credit cards and the average interest rate. It groups the data by the number of credit
cards and calculates the mean interest rate for each group. The result is a line graph with
points, plotting the average interest rate against the number of credit cards.
Figure 41
The Welch Two Sample t-test shows that interest rates were significantly different
between the credit card groups labeled "High" and "Low" (p-value = 0.03123). The mean
interest rate was greater for the "High" group (75.65) than for the "Low" group (69.29). The
95% confidence interval for the difference in means ranged from 0.57 to 12.14.
Figure 42
The above line chart shows a clear relationship between the number of credit cards a
person has and the average interest rate they pay. When someone has very few credit
cards (close to zero), the interest rate is consistently low. However, as the number of credit
cards increases, the average interest rate becomes much more volatile and shows significant,
sharp spikes. This suggests that having many credit cards is associated with higher and less
predictable interest rates, potentially indicating greater financial risk.
Figure 43
This code reads the dataset, removes missing data, and converts the column
Num_Credit_Inquiries to numeric form before classifying it into ranges (0-1, 2-3). A
contingency table is created to cross-tabulate the credit inquiry groups against Credit Score,
and a Chi-Square test is conducted to determine whether these variables are significantly
associated.
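The grouping, contingency table, and Chi-Square test can be sketched as follows. The 120 rows are made up, and the three inquiry ranges used here (0-3, 4-9, 10+) are assumptions for illustration rather than the exact ranges in the assignment code.

```r
# Hypothetical data, built from repeated values so the cell counts are known.
credit_inquiries <- data.frame(
  Num_Credit_Inquiries = c(rep(1, 30), rep(2, 10), rep(6, 20),
                           rep(7, 20), rep(12, 10), rep(14, 30)),
  Credit_Score = c(rep("Good", 30), rep("Poor", 10), rep("Good", 20),
                   rep("Poor", 20), rep("Good", 10), rep("Poor", 30))
)

# Classify the numeric inquiry counts into assumed ranges.
credit_inquiries$Inquiry_Group <- cut(
  credit_inquiries$Num_Credit_Inquiries,
  breaks = c(-Inf, 3, 9, Inf),
  labels = c("0-3", "4-9", "10+")
)

# Contingency table of inquiry group against credit score.
contingency <- table(credit_inquiries$Inquiry_Group, credit_inquiries$Credit_Score)
print(contingency)

# Chi-Square test of independence.
chi_result <- chisq.test(contingency)
print(chi_result)
```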
Figure 44
We used the geom_bar() function to create a bar plot displaying the distribution of credit
scores across the credit inquiry groups. The "dodge" position places the bars side by side,
which allows the groups to be compared. The bar plot aids in understanding the relationship
established by the Chi-Square test.
Figure 45
The output shows a p-value (< 2.2e-16) that is less than 0.05, which means there is a
statistically significant association between “Num_Credit_Inquiries” and
“Credit_Score”.
Figure 46
The above bar plot shows a strong relationship between the credit inquiry group and
credit score. Customers with few inquiries (0-3) tend to have good credit scores, while those
with more inquiries, especially above 10, often have poor credit scores. This suggests that
frequent credit inquiries may negatively impact creditworthiness, as seen in the higher
proportion of poor scores in the higher inquiry groups. The Chi-Square test confirms a
significant association between credit score category and credit inquiries.
9. Is there a correlation between the Monthly Balance and Credit Utilization Ratio of
bank customers?
Figure 47
Figure 48
The correlation coefficient is 0.2514864, which shows a weak positive correlation between
Monthly_Balance and Credit_Utilization_Ratio. This means that when the monthly balance
increases, the credit utilization ratio also tends to increase slightly, although the relationship is
not strong. A value this close to 0 indicates that the two variables are not strongly related.
Figure 49
This scatter plot code visualizes the relationship between Monthly_Balance and
Credit_Utilization_Ratio, where each point represents one data entry and a red regression
line is added to show the trend.
Figure 50
This scatter plot shows a dispersed pattern of data points, and the red regression line has
an upward slope, which indicates a weak positive correlation between the two variables.
Figure 51
This code calculates the mean, median and standard deviation of Monthly In-Hand Salary
using summarize(), removing missing values so that the calculations are correct.
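The effect of removing missing values can be sketched in base R, used here in place of dplyr's summarize() so that the example needs no extra packages. The salary vector is made up for illustration.

```r
# Hypothetical monthly salaries with two missing entries.
salary <- c(1500, 2200, NA, 3100, 4800, 12500, NA, 2900)

# Without na.rm = TRUE a single NA makes the result NA.
print(mean(salary))

# na.rm = TRUE drops the missing values before computing each statistic.
salary_stats <- c(
  mean = mean(salary, na.rm = TRUE),
  median = median(salary, na.rm = TRUE),
  sd = sd(salary, na.rm = TRUE)
)
print(round(salary_stats, 2))
```

A mean well above the median, as in this made-up vector, is a typical sign of right skew.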
Figure 52
The descriptive statistics for Monthly In-Hand Salary show Mean (Average salary) =
4194.171, Median (Middle value) = 3093.745 and Standard Deviation (Spread of salary data) =
3183.686.
Figure 53
Figure 54
The histogram shows the distribution of Monthly In-Hand Salary. Most
customers' salaries are concentrated around 10000-15000. The tail is longer on the right side,
so the distribution is positively (right) skewed.
11. Do individuals with greater number of loans generally have a greater credit
utilization ratio?
Figure 55
Figure 56
Figure 57
In this code, a scatter plot is created with Num_of_Loan on the x-axis as the independent
variable and Credit_Utilization_Ratio on the y-axis as the dependent variable, and a regression
line is added to show the trend.
Figure 58
The scatter plot shows an upward trend, indicating that people with more loans tend to have
higher credit utilization.
12. What is the percentage breakdown of customers within each Credit Score category?
Figure 59
This code groups the data by Credit_Score, counts the number of customers in each
category, and calculates each category's percentage of the total customers.
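The count and percentage calculation can be sketched with table() and prop.table() in base R. The vector of 100 labels is made up for illustration and is not the assignment data.

```r
# Hypothetical credit score labels for 100 customers.
credit_score <- c(rep("Good", 18), rep("Poor", 29), rep("Standard", 53))

# Count customers per category.
counts <- table(credit_score)

# Convert counts to percentages of the total, rounded to one decimal place.
percentages <- round(100 * prop.table(counts), 1)
print(percentages)
```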
Figure 60
The output shows the percentage of customers in each credit score category such as Good:
17.828%, Poor: 28.998% and Standard: 53.174%.
Figure 61
A bar chart is used to show the percentage distribution, where each bar represents a credit
score category and a different color is used for each category.
Figure 62
The bar chart shows the percentage distribution of customers by credit score category,
with a larger bar for Standard, and smaller bars for Good and Poor.
13. Does the number of bank accounts differ significantly across different occupation
types? (Kruskal-Wallis test)
Figure 63
The Kruskal-Wallis test is a non-parametric test that determines whether there are any
significant differences in the median number of bank accounts by occupation. Since the data
might not be normally distributed, this is a suitable alternative to ANOVA.
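A Kruskal-Wallis test can be sketched with kruskal.test() from the stats package. The three occupations and the account counts below are made up for illustration, so the result does not reflect the real dataset.

```r
# Hypothetical data: five customers per occupation.
accounts_by_job <- data.frame(
  Occupation = factor(rep(c("Artist", "Doctor", "Engineer"), each = 5)),
  Num_Bank_Accounts = c(2, 3, 3, 4, 2,
                        5, 6, 7, 6, 8,
                        3, 4, 5, 4, 3)
)

# Rank-based test: no normality assumption is needed.
kw <- kruskal.test(Num_Bank_Accounts ~ Occupation, data = accounts_by_job)
print(kw)

cat("Degrees of freedom:", unname(kw$parameter), "\n")
cat("p-value:", signif(kw$p.value, 3), "\n")
```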
Figure 64
The density plot illustrates the number of bank accounts per occupation. To cope
with skewness and the great variability of values, the x-axis is transformed with a logarithmic
scale (scale_x_log10()). Different colors distinguish the occupation groups, and
transparency (alpha = 0.4) helps interpret overlapping distributions.
Figure 65
Figure 66
The visualization above shows the density plot of the number of bank accounts
across occupations. Different colors represent different occupations, helping to compare how
the number of bank accounts varies across professions. The x-axis represents the number of
bank accounts and the y-axis represents density. The density plot suggests that the number of
bank accounts is associated with occupation type.
14. Do customers with a higher number of delayed payments have significantly different
annual incomes compared to others?
Figure 67
This code performs a Mann-Whitney U test to analyze whether there is a significant
difference in annual income between people with “More than 10 Delays” in payments and
those with fewer delays.
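In R, the Mann-Whitney U test is run with wilcox.test() on two independent groups. The sketch below uses made-up data; the column names and the label for the second group ("10 or Fewer Delays") are assumptions for illustration.

```r
# Hypothetical data: ten customers with made-up delays and incomes.
payments <- data.frame(
  Num_of_Delayed_Payment = c(2, 5, 12, 15, 8, 20, 11, 3, 14, 9),
  Annual_Income = c(52000, 61000, 28000, 31000, 58000,
                    25000, 33000, 70000, 29500, 47000)
)

# Split customers into two groups by number of delayed payments.
payments$Delay_Group <- ifelse(payments$Num_of_Delayed_Payment > 10,
                               "More than 10 Delays", "10 or Fewer Delays")

# Mann-Whitney U test (Wilcoxon rank-sum test) comparing income between groups.
mw <- wilcox.test(Annual_Income ~ Delay_Group, data = payments)
print(mw)
```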
Figure 68
This code uses violin plots and boxplots to show the income distribution across the
delayed payment groups, with log scaling to handle the skewed income data. Here alpha sets
the transparency, and the trim argument controls whether the long tails of the violins are
trimmed.
Figure 69
The output shows a p-value of 2.2e-16, which is less than 0.05, so the result is highly
significant: people with “More than 10 Delays” in payments have significantly
different “Annual_Income” compared to those with fewer delayed payments.
Figure 70
The violin plots show that the distribution of Annual_Income varies
among the delayed payment groups, implying a possible link between payment delays
and income level. This visual evidence supports the statistical result of the Mann-Whitney U
test, which found the differences to be statistically significant.
15. Is there a significant correlation between the age of individuals and their total
monthly EMI payments?
Figure 71
This code computes Spearman’s rank correlation. It assesses whether the relationship
between “Age” and “Total EMI per Month” can be described by a monotonic function. The
cat() function prints the computed Spearman’s correlation coefficient to the console.
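Spearman's rank correlation can be sketched with cor() and cor.test() from the stats package. The two vectors are made up for illustration, and the variable name total_emi is an assumption, not the column name used in the assignment.

```r
# Hypothetical ages and monthly EMI totals for seven customers.
age <- c(22, 25, 31, 38, 44, 52, 60)
total_emi <- c(180, 95, 210, 160, 120, 140, 90)

# Spearman's rho is the Pearson correlation of the ranks.
rho <- cor(age, total_emi, method = "spearman", use = "complete.obs")

# cor.test() adds a significance test for the same coefficient.
spearman_test <- cor.test(age, total_emi, method = "spearman")

cat("Spearman's rho:", round(rho, 3), "\n")
cat("p-value:", signif(spearman_test$p.value, 3), "\n")
```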
Figure 72
Here the geom_point() function creates a scatter plot of the data points, and geom_smooth()
with method = "lm" adds a linear regression line; the scatter plot represents the relationship
and the line shows the trend.
Figure 73
Here the correlation coefficient (-0.07556284) indicates a very weak, slightly negative
correlation between Age and Total EMI per Month.
Figure 74
In general, the scatter plot graphically highlights the weak correlation between Age and Total
EMI per Month and also reveals potential data quality issues that could be addressed for a
more accurate study.
Conclusion
In conclusion, in this assignment we explored a dataset of bank customers' credit
information, focusing on understanding the factors that affect customers' credit scores
by examining data that includes customers' demographic details and credit usage patterns. By
using R programming, we learned about many data analysis methods such as data exploration,
transformation and visualization. Key factors, including age, occupation and credit utilization
ratio, were identified as significant determinants of a customer’s credit score. By implementing
hypothesis testing and descriptive analysis we confirmed these findings, improving the
understanding of how demographic and credit behavior affect credit ratings.
References
Baiju, A., Lindsay, H., & John, M. S. (2020, November 05). Midterm: CAR Package Overview.
Retrieved from RPubs: https://rpubs.com/mjs3pf/carpackage
Coursera Staff. (2025, January 16). Data Visualization. Retrieved from Coursera:
https://www.coursera.org/articles/data-visualization
Hayes, M. (2024, June 19). What is data transformation. Retrieved from IBM:
https://www.ibm.com/think/topics/data-transformation
R Core Team and Contributors Worldwide. (n.d.). stats-package: The R Stats Package. Retrieved
from https://stats.oarc.ucla.edu/stat/data/intro_r/intro_r_interactive_flat.html
Shah, E. (2023, August 21). dplyr Package in R Programming. Retrieved from Scaler.com:
https://www.scaler.com/topics/dplyr-package-in-r/
su, G. (2019, January 22). A Comprehensive List of Handy R Packages. Retrieved from Medium:
https://medium.com/towards-data-science/a-comprehensive-list-of-handy-r-packages-e85dad294b3d
What is Exploratory Data Analysis. (2025, January 13). Retrieved from GeeksforGeeks:
https://www.geeksforgeeks.org/what-is-exploratory-data-analysis/
Appendix
Workload Matrix