STA1007S Lab 4: Scatterplots and Basic Programming: "Hist"

Download as pdf or txt
Download as pdf or txt
You are on page 1of 9

STA1007S Lab 4: Scatterplots and basic programming

SUBMISSION INSTRUCTIONS:

Your answers need to be submitted on Vula.

Go into the Submissions section and click on Lab Session 4 to access the submission form. Please note that
the answers get automatically marked and so have to be in the correct format:

ENTER YOUR ANSWERS TO 2 DECIMAL PLACES UNLESS THE ANSWER IS A ZERO OR AN INTE-
GER (for example if the answer is 0 you just enter 0 and not 0.00, or if the answer is 2 you enter 2 and not 2.00).

DO NOT INCLUDE ANY UNITS (ie meters, mgs, etc).

PROBABILITIES MUST BE BETWEEN 0 AND 1, SO A 50% CHANCE WOULD CORRESPOND TO A


PROBABILITY OF 0.5.

Introduction
By now, we’ve become familiar with the basic functionality of R and how to read in data. We’ve also see how
to plot data using various types of graphs. In this lab, we’ll keep using the Class_survey2 data set. We’ll first
have a look at another important plotting function. And then on to something exciting! We mentioned earlier
that R also offers a complete programming environment. Together with all the functionality for plotting and
analysing data, this makes R a very power tool. In this lab, we are going to take a glimpse at some of the
basic programming functionality in R, namely loops and if-else statements.
You will find that most of the R code necessary to execute the R commands is provided. This lab is meant to
be practice for you, so even if the code and the output of the code is provided, you are expected to create
your own script, run the pieces of code yourself and check whether the output is what you would expect it to
be. Every now and then, you will be asked to fill in blank pieces of code marked as ---. In addition to “fill
in the code”, you will need to answer other questions for which you must produce plots, run your own code
or explore your data. The questions you need to submit through Vula will appear in the submission boxes.
At any time you might call the function help(), to obtain information from any function you want. E.g. If
you wanted to obtain a description of how the function sample() works, you can at any time type in the
console (bottom left panel in RStudio):
help("hist")

or you can just type:


?hist

You should make it a habit to check the help files of the functions you use for the first time.

Start a new R script and import your data


By now, you should already know how to start a new R script in an existing R project and write a few lines
describing what you are going to do. Do this now.

1
Remember to add a line to clean your working environment and one to double check that your working
directory is correct.
Now, read into R the Class_Survey2 data set (make sure it’s the correct file) and check that it has been read
correctly by using the functions: head(), summary(), etc. We will assume that you named the data frame
containing your data classData and this is how we will refer to it during the lab. It is perfectly fine if you
want to call it something else, you will just need to adapt your code accordingly.
Remember to save your script frequently!

Scatterplots and the plot() function


When we covered visual data analysis a while ago, we used scatterplots to explore the relationship between
two numerical variables. This is a common situation and we will learn how to describe such relationships
with the use of statistical models later in the course. Here, we’ll learn how to produce scatterplots.
R has a number of built-in data sets that come with the base installation and are useful as examples. One of
them is the cars data set. It contains data on the speed of cars and the distance it took them to stop. You
load the data set with the command data(cars). We ask you to explore the data set using the summary(),
head() and str() functions (demonstrated in earlier labs). Also type ?cars to get some more information
on this data set.
data(cars)

SUBMISSION:
Vula Question 1 How many rows and columns does the ‘cars‘ data set have?
Vula Question 2 What were the minimum and maximum observed distances?
Now let’s have a look at the relationship between speed and distance. From basic physics, we expect that the
speed of a car causes variation in the distance needed to stop: the faster it goes, the longer it takes to stop.
We tend to plot the variable that measures the cause of the relationship on the x axis and the variable that
measures the effect on the y axis. We call the former the ‘independent’ or ‘explanatory’ variable. The latter
is called the ‘dependent’ or ‘response’ variable (more on this later in the course).
The plot() function uses the formula syntax that is standard across many R functions: response ~
explanatory. We are using the data argument to tell R in which data set the variables are:
plot(dist ~ speed, data=cars)

2
100 120
80
dist

60
40
20
0

5 10 15 20 25

speed
As expected, we see that higher speeds are associated with longer stopping distances.
There are a number of things we can change if we are not happy with the defaults. For example, we may
not like the box around the plot; the y axis label is particularly uninformative. And we want the axis
annotation to always be horizontal. We can make these tweaks using various arguments and by drawing the
axes separately:
plot(dist ~ speed, data=cars, axes=F, ylab = "Distance")
axis(side=1)
axis(side=2, las=1)

120

100

80
Distance

60

40

20

5 10 15 20 25

speed
Play around with these settings to see which is doing what. Also, have a look at all the other graphical

3
parameter settings you can change by typing ?par into your console.
Now on to our own data. First read the “Class_Survey2.csv” data set into R. With the argument
stringsAsFactors = TRUE, we ask R to convert the string variables into factors right away.
Remember to use the summary(), head() and str() functions (demonstrated in earlier labs) to check that R
has read in the file correctly.
Now plot height (response) against age (explanatory variable):
plot(---, data=---)

SUBMISSION:
Vula Question 3 Which statement best describes the relationship between age and height in this data set?
a. Age has a positive effect on height.
b. Age and height are positively related.
c. There is no clear relationship between age and height.

The plot() function can do more than scatterplots. . .


The plot() function is actually a generic function and you will encounter it often. Depending on the type of
object you pass on to the function, it does different things. For example, if the response variable is a numeric
variable and the explanatory variable is categorical, plot() produces side-by-side box plots by default.
plot(height ~ eyes, data=classData)
2.0
1.8
height

1.6
1.4
1.2

Blue Brown Green Hazel Other

eyes
You’ve learned how to do side-by-side box plots using the boxplot function in Lab 2. Go back and check
how this was done and compare to the method using the plot() function – they are very similar.

4
Programming in R – the basics
Loops
Computers are good at doing the same thing over and over. . . In fact, whenever you are sitting at a computer
and are faced with a task that is boringly repetitive, you should probably consider whether you can program
your computer to do the job for you.
One of the most important tools for programming tasks that need to be repeated many times are loops. And
you will be excited to hear – we hope – that R can do loops!
There are different types of loops. Most often, you have a task that needs to be carried out a certain number
of times. In R, this can be achieved with for loops. For example, we want to write “R is great” 5 times.
This is how we’d do it:
for (i in 1:5) print("R is great")

## [1] "R is great"


## [1] "R is great"
## [1] "R is great"
## [1] "R is great"
## [1] "R is great"
The key to understanding how this works, we need to look at what happens inside the first pair of brackets.
The variable i is a running variable (you can use any name for this variable, it doesn’t need to be i). With
the next bit of code, we are telling R what values i should take on. In this case, we create a sequence from 1
to 5 with the code 1:5 and the in statement makes the variable i run through these values. We can see this
if we ask R to print the value of i at each iteration of the loop:
for (i in 1:5) print(i)

## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
Ok, so far so good. But usually, we want to give R more complicated tasks to carry out repeatedly. For
example, we might want to draw random samples of 10 values from the variable height in our classData
data set, and we’d like to do this 5 times, i.e. draw five samples of size 10. The function sample() does this.
So let’s try:
for (i in 1:5) sample(classData$height, size = 10)

Hmm, nothing happened? In fact, R did the work but since we didn’t tell it what to do with these samples,
they got immediately dropped from memory again. Let’s say that we actually want to know what the mean –
x̄ – is for each sample. We first need to set up a vector that will hold these mean values; call this vector
means10. At this stage, it also becomes more efficient to set the number of samples as a variable, rather than
a fixed number; call this number nsamples.
nsamples <- 1000
means10 <- numeric(nsamples)

for (i in 1:nsamples) {means10[i] <- mean(sample(classData$height, size = 10))}

Study the code above carefully. We set nsamples to 1000, i.e. we want 1000 samples. Then, we set up a
numeric vector of length 1000 to hold the 1000 means we are going to obtain. In the for loop, we let the
running variable i run from 1 to nsamples (i.e. 1000 in this case). Then, we let R step through this loop
nsamples times, each time taking a random sample from the height variable, calculating the mean and

5
storing it in position i of the means10 vector. Also notice that we added curly brackets {} around the
instructions we want R to loop over. The curly brackets are not strictly needed in this case but they are
needed if the commands take up multiple lines of code.
Ok, so we now expect to have 1000 sample means stored in the means10 vector, and we can have a look at
them.
hist(means10)

Histogram of means10
250
200
150
Frequency

100
50
0

1.50 1.55 1.60 1.65 1.70 1.75

means10
These means nicely follow the familiar bell-shaped curve that we call the ‘normal’ distribution. Think back
to Chapter 5 in the notes, on estimation; this is entirely expected for sample means. If we take the height
measurements in our data set as the population (i.e. we are only interested in the distribution of height in our
STA1007 class), this is the sampling distribution of the mean taken from samples of size 10. What determines
the width of this sampling distribution?
Let’s take 1000 samples of size 5, calculate their means and plot the resulting sampling distribution for
samples of size 5. While we are at it, let’s plot these alongside the distribution we already obtained and the
distribution of the height variable itself.
means5 <- numeric(nsamples)
for (i in 1:---) {--- <- mean(---)}

par(mfrow=c(1,3))
breaks <- seq(min(classData$height), max(classData$height), length=30)
hist(classData$height, breaks=breaks)
hist(means5, breaks=breaks)
hist(means10, breaks=breaks)

6
Histogram of classData$height Histogram of means5 Histogram of means10
25

250
150
20

200
Frequency

Frequency

Frequency
15

150
100
10

100
50

50
5
0

0
1.2 1.4 1.6 1.8 2.0 1.2 1.4 1.6 1.8 2.0 1.2 1.4 1.6 1.8 2.0

classData$height means5 means10

A note about the plotting: we wanted to plot the three histograms next to each other and we wanted them
to use the same bins so that we can compare the distributions more easily. The line par(mfrow=c(1,3))
changes the number of panels R plots on the same graphics device. We are telling R that we want 1 row with
3 columns (c(1,3); the usual R notation, rows first and then columns) of graphs. We explicitly define the
breaks as a sequence of 30 equally spaced breakpoints that span the range of values in the height variable,
which is the one with the biggest range.
We can clearly see from this figure that the sample means follow a normal distribution much more neatly
than the original values, and that they are less variable. Means of samples of 10 are less variable, i.e. more
precise, than means of samples of 5. This confirms what we saw in Chapter 5. The sampling distributions
are centered around the same value as the data, i.e. the sample mean is an unbiased estimate of the true
population mean.

SUBMISSION:
Vula Question 4. Obtain 1000 samples of 10 from the variable ’classData$age’ and calculate the median
for each of them. Call this the sampling distribution of the median age. What is the mean of this sampling
distribution?

If - else statements
Another important tool in programming are ‘if-else’ statements. We want R to check whether a statement is
true, then do one thing if it is and another thing if it isn’t.
How many in our class are smoking? From the five-number summary of the variable cigs, we can see that at
least 34 of you answered 0 when asked how many cigarettes you are smoking per day (how can we tell?).
summary(classData$cigs)

## Min. 1st Qu. Median Mean 3rd Qu. Max.

7
## 0.000 0.000 0.000 0.199 0.000 8.000
We could use loops and an if statement to count the number of smokers in our data set:
nsmokers <- 0
for (i in 1:length(classData$cigs)) {if (classData$cigs[i]>0) nsmokers <- nsmokers + 1}
nsmokers

## [1] 11
Let’s have a careful look at this. We first define the object nsmokers and assign a value of 0 to it – before we
look at the data, we haven’t found any smokers yet. Then, we use a loop to step through the classData$cigs
vector and for each value, we ask whether it is greater than zero (i.e. the person smokes), and if it is, we add
1 to the object nsmokers. The result is that we have 11 smokers in this data set.
Now, let’s create a new variable called “smoker” with the values “no” for those who don’t smoke, and “yes”
for those who do smoke (regardless of how much).
smoker <- character()

for (i in 1:length(classData$cigs)) {if (classData$cigs[i]==0) smoker[i] <- "no"


else smoker[i] <- "yes"}

table(smoker)

## smoker
## no yes
## 185 11
Now, we are asking R to do two different things, depending on the value of classData$cigs, using an else
statement in addition to the if statement. If it is zero, we assign the value no to the variable smoker and if
it is not zero, we assign the value yes. The table function tabulates the values of the variable smoker (we
will look at the table function in more detail in the next Lab). We can see that we have 185 non-smokers
and 11 in this class.
There are in fact simpler ways to get the answers above from R. There is the very handy ifelse function
that allows us to tell R to do two different things depending on whether a condition is true or not. It has
three slots. The first is for specifying the condition; the second is for telling R what to return when the
condition is TRUE; and the third is for telling R what to return when the condition is FALSE. The ifelse
function works on vectors, which makes this very elegant.
smoker2 <- ifelse(classData$cigs==0, "no", "yes")

table(smoker2)

## smoker2
## no yes
## 185 11
And the question about how many smokers we have could also be answered by:
length(classData$cigs[classData$cigs>0])

## [1] 11

SUBMISSION:
Vula Question 5. How many students in this class are higher than 1.7m? How many are shorter?

8
The commands you learned today
These are the functions and operators that you learned today. Fill in your own description of what they do.
plot()
axis()
for()
sample()
if else
table()
ifelse()

You might also like