

Introduction to R: RStudio, using the help facility, using the R script.
Dr Diana Abdul Wahab

Sem II 2022/2023

About the course…


Course Description:
• Computational data analysis is an essential part of modern statistics.
• This experiential work-based learning course employs computational, graphical,
and numerical approaches to solve statistical problems.
• The course focuses on an open-source statistical programming language as an ideal
computing environment.
• The goal of this course is to introduce students to the R programming environment
and related ecosystem and thus provide them with an in-demand skill set, in both
the research and business environments.
• This course provides guidance to students through the steps of importing,
wrangling, exploring, and modeling the data, and communicating the results.
• This course prepares students to take up analytic and data science courses in the
future.
• No previous programming experience is assumed.
Course Syllabus:
1. Introduction to R: RStudio, using the help facility, using the R script.
2. Data structures: vectors, matrices, lists and data frames.
3. Reading data into R: from various data sources, merging data across data sources,
exploratory data analysis, using dplyr and attach functions.
4. Statistical modeling functions: lm and glm.
5. Writing your own functions: implement specific computational algorithms, R
function syntax, argument handling.
6. Iterating with R: logic and flow control.
7. Data visualization: ggplot()
8. Simulation: bootstrap, Monte-Carlo simulation, random number generation,
permutation tests as alternatives to classical hypothesis tests, evaluate modeling
assumptions (normality/MLE), compare efficiency of different sampling
methodologies.
9. Debugging and maintenance: fixing errors and testing, efficient programming,
optimization.
10. Dynamic and web reporting: R Markdown, knitr and Shiny.
11. [Case study] Application: predicting death using Titanic data.
Course Assessment:
1. Test 1: 15%, 4.5.2023, 6:30-7:30pm, Week 2-6
2. Test 2: 20%, 15.6.2023, 6:30-8pm, Week 8-12
3. Collaborative assignment: 25%, due Week 14
Communication:
• Dr Diana Abdul Wahab
• diana.abdwahab@um.edu.my
• 03-7967 3639
• Room E04, Building H10
• Consultation: email me for online discussion on Google Meet

Introduction to R
What we’re going to do today:
1. Obtain and install R, with an overview of its uses and general information on
getting started.
2. Use a text editor for the code, with recommendations for a general working style.
3. Obtain assistance using help files.
4. Load packages.

Introduction to R and RStudio, R script, calling functions, running code.


You should install R on your own computer at the first opportunity. Visit
http://cran.r-project.org/
Try to spend some time getting used to the basics of the software, including arithmetic
operations and functions. There are many excellent online tutorials for this purpose.
R is portable, and works equally well on Windows, macOS, and Linux.
Help Files:
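• Help files can be accessed with the ? and ?? commands: ?name opens the help page
for a function called name, and ??word searches all installed help pages for word.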
• The internet has almost all the answers, and knows much more about R than I do.
• If you have a problem, it’s extremely likely that someone will have had the same
difficulty already, and posted a question on an internet forum.
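As a minimal sketch of the help commands mentioned above (the functions mean() and rnorm() are chosen here only as familiar examples):

```r
# Open the help page for a function whose exact name you know
?mean
help("mean")               # the same thing, written as a function call

# Search the installed documentation when you do not know the name
??"standard deviation"

# Show a function's arguments without opening the full help page
args(rnorm)

# Run the examples given at the bottom of a help page
example(mean)
```

Most help pages end with an Examples section, so example() is often the quickest way to see a function in action.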
Get Going with the Classes:
• Learning R has much in common with learning a natural language
• The more you learn the more elegantly you will be able to express yourself
• There is a smaller core of ‘everyday’ language which we will focus on, and which you
will be expected to understand in exams and practical assessments
• Lectures will not follow the notes exactly, so be prepared to take your own notes
• The practical classes will complement the lectures, and you can be examined on
anything we study in either
The Best Practice!
Don’t copy and paste the commands from this guide into R; you will find it very hard to
remember the details of the language and will have to look everything up when you come
to code something yourself.
If you find any mistakes or omissions in these notes, I’d be very grateful to be informed.
Why R?
• Statistics for relatively advanced users: R has thousands of packages, designed,
maintained, and widely used by statisticians.
• Statistical graphics: more advanced than Stata or SPSS.
• Flexible code: R has a rather liberal syntax, and variables don’t need to be declared
as they would in (for example) C++, which makes it very easy to code in. This also
has disadvantages in terms of how safe the code is.
• Vectorization: R is designed to make it very easy to write functions which are
applied pointwise to every element of a vector. This is extremely useful in statistics.
• R is powerful: if a command doesn’t exist already, you can code it yourself.
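The points above about flexible code, vectorization and writing your own commands can be sketched as follows. The numbers are made up for illustration, and cv() is a hypothetical helper, not a built-in R function.

```r
# Variables are created by assignment; no type declaration is needed,
# and the same name can later hold a different type.
value <- 3.14
value <- "now a character string"
class(value)

# Vectorization: one operation works on every element of a vector.
heights_cm <- c(150, 162, 175)     # made-up numbers for illustration
heights_m  <- heights_cm / 100
heights_m

# Writing your own command: a hypothetical helper, cv(), for the
# coefficient of variation (standard deviation divided by the mean).
cv <- function(v) {
  sd(v) / mean(v)
}
cv(c(2, 4, 6))
```

Reusing the name value for a different type is allowed, which is convenient but also illustrates the safety drawback mentioned above.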
R Interface using R Studio:
• Very popular, with a nice interface and well thought out.
• Can be a bit buggy, so make sure you update it regularly.
• Available on all platforms.
• Download it from https://rstudio.com/products/rstudio/download/
• Make sure you have installed R itself (the base installation) first!

Basic Arithmetic and Objects


R has a command line interface that accepts commands typed at the prompt, which is
marked by a > symbol. If you type a command and press return, R will evaluate it and
print the result for you.
1 + 2

## [1] 3

x <- 10
x - 2

## [1] 8
Note: Assignment can also be done with = instead of <-, but in this class I’d prefer to use
the <- sign.
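A short sketch of the two assignment operators and the everyday arithmetic operators may help here. The variable names a and b are arbitrary.

```r
# Both operators create the same object when used at the prompt
a <- 10
b = 10
identical(a, b)   # TRUE

# Common arithmetic operators
7 + 2      # addition
7 - 2      # subtraction
7 * 2      # multiplication
7 / 2      # division
7 ^ 2      # exponentiation
7 %% 2     # remainder after division: 1
7 %/% 2    # integer division: 3

# Inside a function call, = names an argument rather than assigning.
# Here digits = 2 sets the 'digits' argument of round().
round(x = 3.14159, digits = 2)
```

Because = has this second role inside function calls, keeping <- for assignment makes code easier to read.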

Documenting script code


• Whatever you write in the console will disappear once you close the app.
• You can use an R script to type in all your code. You can save this file.
• In RStudio: File > New File > R Script.
• Type your code in the R script. In RStudio, press Ctrl+Enter (Cmd+Enter on macOS)
to run the current line or selection; in the Windows R GUI the shortcut is Ctrl+R.
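As an illustration of what a documented script might look like, here is a minimal sketch. The file name first_script.R and the marks are hypothetical.

```r
# first_script.R   (a hypothetical file name)
# Lines that start with # are comments: R ignores them, so use them
# to record what each step does and why.

# Step 1: enter the data (made-up exam marks, for illustration only)
marks <- c(62, 75, 58, 90)

# Step 2: summarise
mean_mark <- mean(marks)
mean_mark

# To run the whole saved file at once, type this in the console:
# source("first_script.R")
```

Comments cost nothing to write and make it much easier to understand your own script when you return to it weeks later.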

Vector
The key feature which makes R very useful for statistics is that it is vectorized. This means
that many operations can be performed point-wise on a vector. The function c() is used to
create vectors:
x <- c(1,-1,3.5,2)
x

## [1] 1.0 -1.0 3.5 2.0

x+2

## [1] 3.0 1.0 5.5 4.0

sum((x - mean(x))^2)

## [1] 10.6875
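
A few more vectorized operations, as a sketch. The second vector w is made up for illustration. The last two lines check that the sum of squared deviations computed above, divided by n - 1, agrees with the built-in var().

```r
x <- c(1, -1, 3.5, 2)

# Arithmetic between two vectors of the same length is element by element
w <- c(10, 20, 30, 40)
x + w
x * w

# Most mathematical functions are also applied to each element
abs(x)
x^2

# The sum of squared deviations is the numerator of the sample variance,
# so dividing by n - 1 should agree with var()
n <- length(x)
sum((x - mean(x))^2) / (n - 1)
var(x)
```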

Exercise
1. What is a vector?
2. The weights of five people before and after a diet program are given:
Before: 78 72 78 79 105
After: 67 65 79 70 93
Read the ‘before’ and ‘after’ values into two different vectors called before and after. Use R
to evaluate the amount of weight lost for each participant. What is the average amount of
weight lost?

Colon Operator
Some useful vectors can be created quickly with R. The colon operator is used to generate
integer sequences:
1:10

## [1] 1 2 3 4 5 6 7 8 9 10

9:5

## [1] 9 8 7 6 5

2^(0:10)

## [1] 1 2 4 8 16 32 64 128 256 512 1024

1:3 + rep(seq(from=0,by=10,to=30), each=3)

## [1] 1 2 3 11 12 13 21 22 23 31 32 33
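
Two details of the examples above are worth a sketch: operator precedence around the colon, and the recycling of the shorter vector in the last example. The variable n is introduced here only for illustration.

```r
n <- 5

# The colon binds more tightly than + and -, so these differ:
1:n - 1      # (1:n) - 1, i.e. 0 1 2 3 4
1:(n - 1)    # 1 2 3 4

# But ^ binds more tightly than the colon:
2^(0:3)      # 1 2 4 8
2^0:3        # (2^0):3, i.e. 1 2 3

# Recycling: the shorter vector 1:3 is reused to match the longer one
1:3 + c(0, 0, 0, 10, 10, 10)   # 1 2 3 11 12 13
```

When in doubt, add brackets: they cost nothing and make the intended order explicit.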

seq() Function
More generally, the function seq() can generate any arithmetic progression.
seq(from=2, to=6, by=0.4)

## [1] 2.0 2.4 2.8 3.2 3.6 4.0 4.4 4.8 5.2 5.6 6.0

seq(from=-1, to=1, length=6)

## [1] -1.0 -0.6 -0.2 0.2 0.6 1.0
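
Some further uses of seq(), as a sketch. Note that length in the call above is an abbreviation of the full argument name length.out.

```r
# Decreasing sequences need a negative step
seq(from = 10, to = 1, by = -3)     # 10 7 4 1

# If the step does not land exactly on 'to', the sequence stops before it
seq(from = 0, to = 1, by = 0.3)     # 0.0 0.3 0.6 0.9

# length.out is the full name of the argument abbreviated as length
seq(from = -1, to = 1, length.out = 6)

# seq_len(n) gives 1, 2, ..., n and returns an empty vector when n is 0
seq_len(4)                          # 1 2 3 4
seq_len(0)                          # integer(0)
1:0                                 # 1 0, often not what is wanted
```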

rep() Function
Sometimes it’s necessary to have repeated values, for which we use rep():
rep(5,3)

## [1] 5 5 5

rep(2:5,each=3)

## [1] 2 2 2 3 3 3 4 4 4 5 5 5

rep(-1:3, length.out=10)

## [1] -1 0 1 2 3 -1 0 1 2 3
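
The difference between the times and each arguments of rep() can be sketched as follows:

```r
# times repeats the whole vector
rep(1:2, times = 3)                 # 1 2 1 2 1 2

# each repeats every element before moving on
rep(1:2, each = 3)                  # 1 1 1 2 2 2

# a vector of times gives a separate count for each element
rep(c("a", "b"), times = c(2, 3))   # "a" "a" "b" "b" "b"

# each is applied first, then times
rep(1:2, each = 2, times = 2)       # 1 1 2 2 1 1 2 2
```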

Exercise
Create the following vectors in R using seq() and rep().
• 1, 1.5, 2, 2.5, … , 12
• 1, 8, 27, 64, …, 1000.
• 1, 0, 3, 0, 5, 0, 7, …, 0, 49.

Subsetting
x <- c(5,9,2,14,-4)
x[3]

## [1] 2

x[c(2,3,5)]

## [1] 9 2 -4

x[1:3]

## [1] 5 9 2

x[3:length(x)]

## [1] 2 14 -4

x > 4

## [1] TRUE TRUE FALSE TRUE FALSE

x[x > 4]

## [1] 5 9 14

x[-1]

## [1] 9 2 14 -4

x[-c(1,4)]

## [1] 9 2 -4
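
The examples above select entries by position (positive indices), by exclusion (negative indices) and by condition (logical vectors). A few related tools, as a sketch using the same x; the copy y is introduced only so that x stays unchanged:

```r
x <- c(5, 9, 2, 14, -4)

# which() returns the positions where a condition is TRUE
which(x > 4)          # 1 2 4

# Subsetting on the left of <- changes only the selected entries
y <- x                # work on a copy so x is unchanged
y[y < 0] <- 0
y                     # 5 9 2 14 0

# Asking for a position past the end gives NA rather than an error
x[7]                  # NA

# head() and tail() are shortcuts for the first and last entries
head(x, 2)            # 5 9
tail(x, 2)            # 14 -4
```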

Exercise
The built-in vector LETTERS contains the uppercase letters of the alphabet. Produce a
vector of
• the first 12 letters;
• the odd ‘numbered’ letters;
• the (English) consonants.

Logical Operators
x <= 2

## [1] FALSE FALSE TRUE FALSE TRUE

x == 2

## [1] FALSE FALSE TRUE FALSE FALSE

x != 2

## [1] TRUE TRUE FALSE TRUE TRUE

(x>0) & (x<10)

## [1] TRUE TRUE TRUE FALSE FALSE

(x==5)|(x>10)

## [1] TRUE FALSE FALSE TRUE FALSE

!(x>5)

## [1] TRUE FALSE TRUE FALSE TRUE
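
Logical vectors can also be summarised. A minimal sketch with the same x as above:

```r
x <- c(5, 9, 2, 14, -4)

any(x < 0)      # TRUE: at least one entry is negative
all(x > 0)      # FALSE: not every entry is positive

# TRUE counts as 1 and FALSE as 0 in arithmetic
sum(x > 4)      # 3, the number of entries greater than 4
mean(x > 4)     # 0.6, the proportion of entries greater than 4

# %in% tests membership element by element
x %in% c(2, 14) # FALSE FALSE TRUE TRUE FALSE
```

Counting with sum() and taking proportions with mean() are very common in simulation work.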

Exercise
The function rnorm() generates normal random variables. For instance, rnorm(10) gives a
vector of 10 i.i.d. standard normals. Generate 20 standard normals, and store them as x.
Then obtain subvectors of
• the entries in x which are less than 1;
• the entries between -1/2 and 1;
• the entries whose absolute value is larger than 1.5.
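
Because rnorm() is random, your numbers will differ from your neighbour's. If you need reproducible results, one option is set.seed(), sketched below. The seed 2023 and the names first, second and z are arbitrary choices for illustration.

```r
# rnorm() draws different numbers each time it is called ...
rnorm(3)
rnorm(3)

# ... unless the random number generator is reset to the same seed first.
# The seed value 2023 is arbitrary.
set.seed(2023)
first <- rnorm(3)
set.seed(2023)
second <- rnorm(3)
identical(first, second)   # TRUE

# Optional arguments change the mean and standard deviation
z <- rnorm(1000, mean = 50, sd = 10)
length(z)                  # 1000
```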

Packages
R comes with a series of default packages. A package is a collection of previously
programmed functions, often including functions for specific tasks. It is tempting to call this
a library, but the R community refers to it as a package.
There are two types of packages: those that come with the base installation of R and
packages that you must manually download and install. With the base installation we mean
the big executable file that you downloaded and installed before. The base version contains
the most common packages.
There are thousands of user-contributed packages that are not part of the base
installation, most of which are available from CRAN, the package repository linked from
the R website. Many packages are available that
will allow you to execute the same statistical calculations as commercial packages. For
example, the multivariate vegan package can execute methods that are possible using
commercial packages such as PRIMER, PCORD, CANOCO, and Brodgar.
Loading a package that came with the base installation may be accomplished either by a
mouse click or by entering a specific command. For instance, to load the MASS package,
type the command:
library(MASS)

and press enter. You now have access to all functions in the MASS package. So what next?
You could read a book, such as that by Venables and Ripley (2002), to learn all that you can
do with this package.
If you need to use a package not included in the base installation, you can type:
install.packages("name_of_package")
library("name_of_package")
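
A slightly more defensive pattern, as a sketch, installs a package only when it is missing. MASS is used as the example because it is the package discussed above. The variable name pkg is arbitrary, and install.packages() needs an internet connection.

```r
pkg <- "MASS"   # example package name; replace with the one you need

# Install only if the package is not already available on this computer
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# character.only = TRUE tells library() that pkg holds the name as a string
library(pkg, character.only = TRUE)

# List the packages currently attached in this session
search()

# Call a single function without attaching the package, using ::
MASS::fractions(0.75)
```

Installing is needed only once per computer, whereas library() must be called in every new R session in which you use the package.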
