DA 1 - Intro To RStudio
Introduction to RStudio
This first data analysis assignment is intended to help you familiarize yourself with the RStudio
interface and get comfortable running small chunks of R code. If you have not done so already,
please work through the Introduction to RStudio & Tutorial in the Getting Started & Course
Resources module on Canvas.
For this assignment, you will need the loan50.csv dataset which can be found on the Data
Analysis 1 assignment page and in the Introduction to RStudio & Tutorial.
You can find a description of the variables recorded in the loan50.csv dataset on the OpenIntro
Statistics loan50 info page.
The Introduction to RStudio & Tutorial walks you through using some of the basic, built-in
functions in R. Read through and run each line of code to ensure you understand what the
functions are doing and what types of output each produces.
Load the present dataset using the data() function, just as we did in the tutorial. Note that if
you do not have the openintro package loaded, you will not be able to access this dataset.
The tutorial discusses how to open packages using the library() function. Once you have the
present dataset loaded, answer the following questions.
b. (1 point) What are the dimensions of the data frame? Dimensions refers to the number of
rows and variables.
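The loading step and the dimension check might look like the sketch below. The commented lines assume the openintro package is installed. The small `toy` data frame is a hypothetical stand-in with made-up values, used only so the example runs on its own.

```r
# With the openintro package installed, the real steps would be:
#   library(openintro)
#   data(present)
#   dim(present)

# Hypothetical stand-in with made-up values, used only to show the output shape.
toy <- data.frame(
  year  = c(2000, 2001, 2002),
  boys  = c(105, 110, 102),
  girls = c(100, 104, 99)
)

dims <- dim(toy)  # integer vector: number of rows, then number of columns
dims
nrow(toy)         # rows only
ncol(toy)         # columns (variables) only
```

The first value returned by dim() is the number of rows and the second is the number of variables.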
ST 314 Data Analysis 1
e. (1 point) In which year was the total number of births in the U.S. the largest? Hint: First
calculate the total for each year and save it as a new variable. Then sort your dataset in
descending order based on the total column. You can do this interactively in the data viewer by
clicking on the arrows next to the variable names. To include the sorted result in your report,
you will need two new functions: arrange() sorts the rows of the data frame by a variable, and
wrapping that variable in desc() puts the rows in descending order. The sample code is
provided below.
library(dplyr)  # provides %>%, arrange(), and desc()
present %>%
  arrange(desc(total))
1905
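The two steps in the hint (compute a total, then sort) can also be sketched in base R. This is only an illustration: the `births` data frame and its counts are made up, and the column names `boys` and `girls` are assumed to match those in the present dataset.

```r
# Toy stand-in for `present`; the counts are made up for illustration only.
births <- data.frame(
  year  = c(2000, 2001, 2002),
  boys  = c(105, 110, 102),
  girls = c(100, 104, 99)
)

# Step 1: save the yearly total as a new column.
births$total <- births$boys + births$girls

# Step 2: sort rows from largest to smallest total.
births_sorted <- births[order(births$total, decreasing = TRUE), ]

# The first row now holds the year with the largest total.
top_year <- births_sorted$year[1]
top_year

# A dplyr version of the same idea (assuming dplyr is loaded) would be:
#   births %>% mutate(total = boys + girls) %>% arrange(desc(total))
```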
For this portion of the assignment, you’ll practice using R to explore the total_credit_limit
variable in the loan50.csv data set.
a. (1 point) Construct a histogram of the total credit limit data. Include informative labels and a
title. Include your histogram below.
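One possible way to build a labeled histogram with base R's hist() is sketched here. The values in `credit_limit` are made up and are not the loan50 data. The commented read.csv() lines assume the file sits in the working directory.

```r
# Made-up credit limits in dollars; NOT the loan50 values.
credit_limit <- c(50000, 80000, 100000, 120000, 150000, 200000, 800000)

# For the assignment, the column could be read from the file instead, e.g.
#   loans <- read.csv("loan50.csv")
#   credit_limit <- loans$total_credit_limit

h <- hist(
  credit_limit,
  main = "Distribution of Total Credit Limit",  # plot title
  xlab = "Total credit limit (dollars)",        # x-axis label
  ylab = "Number of loans"                      # y-axis label
)
```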
b. (1 point) Construct a boxplot of the total credit limit data. Include informative labels and a
title. Include your boxplot below.
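A labeled boxplot can be sketched with base R's boxplot(). The values below are made up for illustration and are not the loan50 data.

```r
# Made-up credit limits in dollars; NOT the loan50 values.
credit_limit <- c(50000, 80000, 100000, 120000, 150000, 200000, 800000)

b <- boxplot(
  credit_limit,
  horizontal = TRUE,                          # draw the box left to right
  main = "Boxplot of Total Credit Limit",     # plot title
  xlab = "Total credit limit (dollars)"       # axis label for the data values
)

# Points drawn individually beyond the whiskers are stored in b$out.
b$out
```

Any value listed in `b$out` is one that the plot flags as a potential outlier.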
c. (1 point) Using the histogram you constructed in part a and the boxplot from part b, describe
the shape of the distribution of the total credit limit variable and comment on the presence of
any outliers.
The histogram shows that the distribution is skewed to the right. Most of the total credit limit
values fall between 70,000 and 300,000 dollars. There are outliers, including one at about
800,000 dollars.
d. (1.5 points) Calculate the mean of the total credit limit data.
The mean total credit limit is 208,547 dollars.
e. (1.5 points) Calculate the median of the total credit limit data.
The median total credit limit is 147,664 dollars.
f. (1 point) Which measure of center (mean or median) is more appropriate for these data?
Why? Consider the shape of the distribution discussed in part c.
The median is more appropriate. Because the distribution is right-skewed with a high outlier,
the mean is pulled upward, while the median is resistant to extreme values.
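The effect of a single large value on the two measures of center can be seen in a small sketch. The numbers are made up and are not the loan50 data.

```r
# Made-up credit limits in dollars; the last value plays the role of an outlier.
credit_limit <- c(50000, 80000, 100000, 120000, 150000, 200000, 800000)
no_outlier   <- credit_limit[credit_limit < 800000]

# Mean and median with the outlier included.
mean(credit_limit)
median(credit_limit)

# Mean and median with the outlier removed.
mean(no_outlier)
median(no_outlier)

# How far each measure moves when the outlier is added.
mean_shift   <- mean(credit_limit) - mean(no_outlier)
median_shift <- median(credit_limit) - median(no_outlier)
```

In this toy example the mean moves much further than the median when the large value is included, which is the behavior the answer above relies on.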
g. (1.5 points) Calculate the standard deviation of the total credit limit data.
h. (1.5 points) Calculate the interquartile range of the total credit limit data.
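Both measures of spread have built-in functions in R, sd() and IQR(). A minimal sketch on made-up values (not the loan50 data):

```r
# Made-up credit limits in dollars; NOT the loan50 values.
credit_limit <- c(50000, 80000, 100000, 120000, 150000, 200000, 800000)

spread_sd  <- sd(credit_limit)    # sample standard deviation
spread_iqr <- IQR(credit_limit)   # third quartile minus first quartile

spread_sd
spread_iqr

# The quartiles that the IQR is built from.
quantile(credit_limit, c(0.25, 0.75))
```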
Let’s continue to explore the total credit limit data, but now consider how total credit limit may
vary between homeownership status (stored in the homeownership variable).
i. (1 point) Construct a side-by-side boxplot for total credit limit broken up by homeownership
type. Include informative labels and a title.
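A side-by-side boxplot can be sketched with the formula interface of boxplot(), where the variable to the left of `~` is split by the grouping variable on the right. The `loans_toy` data frame is hypothetical; its values are made up and it uses only two homeownership categories.

```r
# Hypothetical stand-in for the loan data; values are made up.
loans_toy <- data.frame(
  total_credit_limit = c(50000, 80000, 100000, 200000, 300000, 800000),
  homeownership = c("rent", "rent", "rent", "mortgage", "mortgage", "mortgage")
)

b <- boxplot(
  total_credit_limit ~ homeownership,   # response ~ grouping variable
  data = loans_toy,
  main = "Total Credit Limit by Homeownership Status",
  xlab = "Homeownership status",
  ylab = "Total credit limit (dollars)"
)

# Group labels in the order the boxes are drawn.
b$names
```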
j. (2 points) How do the distributions of total credit limit compare for homeownership status?
Comment on the shape, center, spread, and presence of outliers for the two groups.
The distributions for renters and for people who own their homes outright do not differ much,
although the renting group has outliers. In the mortgage group, most credit limits fall between
200,000 and 4 million dollars, and there is an outlier at about 800,000 dollars.
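Numerical summaries by group can support a comparison like this one. A possible sketch uses tapply() to apply a function within each homeownership category. The data frame is hypothetical and its values are made up.

```r
# Hypothetical stand-in for the loan data; values are made up.
loans_toy <- data.frame(
  total_credit_limit = c(50000, 80000, 100000, 200000, 300000, 800000),
  homeownership = c("rent", "rent", "rent", "mortgage", "mortgage", "mortgage")
)

# Center: median credit limit within each group.
med <- tapply(loans_toy$total_credit_limit, loans_toy$homeownership, median)

# Spread: interquartile range within each group.
iqr <- tapply(loans_toy$total_credit_limit, loans_toy$homeownership, IQR)

med
iqr
```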
Finally, we’ll focus our attention only on the loan purpose variable (stored in loan_purpose).
k. (2 points) Construct a table of counts for the loan purpose variable. Report the number of
observations in each category below.
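A table of counts for a categorical variable can be sketched with table(). The `purpose` vector is made up for illustration. In the assignment the input would be the loan_purpose column.

```r
# Made-up loan purposes; NOT the loan50 values.
purpose <- c("debt_consolidation", "credit_card", "debt_consolidation",
             "car", "credit_card", "debt_consolidation")

# Count how many observations fall in each category.
counts <- table(purpose)
counts
```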
l. (2 points) Construct a table of proportions for the loan purpose variable. Report the
proportions for each category below.
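Proportions can be obtained by passing a table of counts to prop.table(), which divides each count by the total. A minimal sketch on made-up values:

```r
# Made-up loan purposes; NOT the loan50 values.
purpose <- c("debt_consolidation", "credit_card", "debt_consolidation",
             "car", "credit_card", "debt_consolidation")

# Convert counts to proportions that sum to 1.
props <- prop.table(table(purpose))

# Round for reporting.
round(props, 2)
```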
m. (1 point) Construct a barplot that displays the distribution of loan purpose types. Include
informative labels and a title. Include your barplot below.
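A barplot of a categorical variable can be sketched by passing a table of counts to barplot(). The `purpose` vector below is made up and is not the loan50 data.

```r
# Made-up loan purposes; NOT the loan50 values.
purpose <- c("debt_consolidation", "credit_card", "debt_consolidation",
             "car", "credit_card", "debt_consolidation")

counts <- table(purpose)

# barplot() draws one bar per category and returns the bar midpoints.
mids <- barplot(
  counts,
  main = "Distribution of Loan Purpose",
  xlab = "Loan purpose",
  ylab = "Number of loans"
)
```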