See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/374088889

Data Analysis Using R Programming Language

Presentation · September 2023


DOI: 10.13140/RG.2.2.24265.31841/1

1 author:

Ahmed Elshahhat
Zagazig University
All content following this page was uploaded by Ahmed Elshahhat on 22 September 2023.


Amoud University
International Faculty Development Programme
Empowering Scholar: Unlocking the Path to High-Impact Research Publications

Online Workshop

Data Analysis Using R Programming Language

By

Dr. Ahmed Elshahhat


PhD in Statistics, Cairo University, Giza, Egypt
Lecturer of Statistics, Information Systems Dep.,
Faculty of Technology and Development,
Zagazig University

September 2023

DOI:10.13140/RG.2.2.24265.31841
To the late Prof. Samir K. Ashour

Professor of Mathematical Statistics,


Cairo University, Giza, Egypt

1943-2022

1 / 196 Dr. Ahmed Elshahhat Data Analysis Using R Programming Language


Dr. Ahmed Elshahhat is currently working as a lecturer in statistics in the
Department of Information Systems, Faculty of Technology and Development,
Zagazig University. He obtained a master’s degree in statistics (control) in
2016 from Cairo University and a doctorate in statistics (reliability) in 2019
from the same university. He has 60+ published research papers in journals of
repute. He has made contributions in the areas of distribution theory,
reliability theory, Bayesian inference, MCMC techniques, parametric inference,
designing new censoring plans, generating new families, R programming
language, etc.


To become a perfect statistician

Try to be a perfect programmer

▶ Dr. Ahmed Elshahhat

Outline

1 Overview
2 Data Structures
3 R Data Import
4 R Statistics
5 R Graphics
6 Inference


R is a powerful programming environment that provides a scripting language for
data handling, data visualization, and statistics with excellent graphical support.
This workshop will give you the basic tools to start exploring the R environment
and all it has to offer. This repository contains teaching materials for a 3- to 4-hour,
hands-on workshop on "Data Analysis Using R Programming Language".

Learning Objectives

1 An Overview of R Programming Language:
What is R?
R Advantages
R Restrictions
Installing R System
R Programming Tools
Top 10 R Programming Books, Courses, Online Resources
2 Data Structures:
Vectors
Matrices
Arrays
3 R Data Import
4 R Statistics
5 R Graphics:
Bar & Box
Histogram & Density
Heatmap
Pairs
QQ
3D
6 Inference:
Parameter Estimation
Monte Carlo of Parameter Estimation
Linear Regression Models
Monte Carlo of Linear Regression Models


Schedule, Dataset & Installation Requirements

Schedule:
Activity           Time (in minutes)
Overview           30
Data Structures    35
R Statistics       20
R Graphics         35
Inference          60
Dataset: All R scripts used in this workshop are available within these slides.
Installation Requirements: Download the latest version of R.

Overview

Data Science: Python vs SAS vs R


Python is a free, open-source, general-purpose programming language that has
become very popular in data science. It is easy to learn and understand, and it
is used by many large companies such as Google, Quora, and Reddit.
SAS is one of the long-established leaders in the field of data science. It is
also easy to learn, but it is not open source and is an expensive option for a
beginner. It is therefore used mainly by large companies such as Nestle,
Barclays, Volvo, and HSBC.
R is a popular language for statistics. It is a counterpart of SAS and is free,
as it is an open-source platform. It is mainly used in academia and research.
So, the question is not which one to choose, but how to make the best use of
these programming languages for your specific use cases.
More details can be found on the Mindmajix training platform: https://mindmajix.com/python-vs-sas-vs-r

In July 2016, Burtch Works’ HR department asked over 1,000 quantitative
professionals which language they preferred: SAS, R, or Python.

SAS is expensive commercial software and is mostly used by large corporations
with huge budgets, while Python and R are free software that can be downloaded
by anyone.
More details can be found on the Edvancer website: https://edvancer.in/r-python-or-sas-which-one-should-you-learn-first/

What is R?

1 R is a programming language for data analysis and graphics.
2 R was initially written by Ross Ihaka and Robert Gentleman at the Department
of Statistics of the University of Auckland, New Zealand, during the 1990s, and
has been developed with contributions from all over the world since mid-1997.
3 All information about R can be found at http://www.R-project.org
4 The R system contains two major components:
1. Base system: contains the R language software.
2. User-contributed add-on packages.

R Advantages

1 R is open source.
2 R has a wide community.
3 R produces outstanding graphical output.
4 R is easy to learn and understand.
5 More than 18,000 packages are available and free.
6 R runs on macOS, Linux, and Microsoft Windows.
7 R is cross-platform: the same code runs on many operating systems.
8 R is excellent for simulation, programming, computer-intensive analyses, etc.
9 In R, anyone is welcome to provide bug fixes, code enhancements, and new
packages.
10 Built-in help for the base system is available without an internet connection.

R Restrictions

1 R has a steep learning curve: a minimal level of learning is needed before it becomes useful.
2 In R, anyone may make mistakes without realizing it.
3 The quality of some packages is less than perfect.
4 There is no commercial support.
5 There is no one to complain to if something doesn’t work.
6 Working with large data sets is limited by RAM.
7 R is a software application that many people devote their own time to developing.
8 R commands give little thought to memory management.
9 R can consume all available memory.
10 Data preparation and organization can be messier and more mistake-prone in R
than in SPSS or SAS.

Installing R System

1 Go to the official site of R, https://www.r-project.org/.
2 Click on the CRAN link in the left sidebar.
3 Select a mirror.
4 Choose your operating system from the list (Linux, macOS, or Windows).
5 Click on the link that downloads the base distribution.
6 Run the downloaded file and follow the instructions to install R.
7 Save R. Have fun!

Installing add-on Packages

1 All packages are available at: https://cloud.r-project.org/web/packages/
2 Pick a package from the list.
3 To install and then load an add-on package:
1. install.packages("package name")
2. library("package name")
4 Verify that the package is installed with
"package name" %in% rownames(installed.packages())
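The install, load, and verify steps above can be combined into a short script. This is a minimal sketch rather than a prescribed workflow: the variable names `pkg` and `is_installed` are illustrative, and the package "stats" is chosen only because it ships with every R installation, so the example runs without an internet connection.

```r
# Illustrative package name; "stats" is part of every R installation
pkg <- "stats"

# Exact-name check against the list of installed packages
is_installed <- pkg %in% rownames(installed.packages())
print(is_installed)

# Install only when the package is missing
# (this step needs an internet connection and a CRAN mirror)
if (!is_installed) {
  install.packages(pkg)
}

# Load the package; character.only = TRUE lets library() take a variable
library(pkg, character.only = TRUE)
```

Checking the row names of installed.packages() matches the package name exactly, whereas a pattern search over the whole table can also match partial names or other fields.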

R Session
R Console

Console: displays outputs, which are usually not saved.

R Editor

Editor (File > New script): scripts are typed here and saved as text documents.

R Operators
Arithmetic Operators

Operator ‘+’ adds two (or more) objects element by element

x <- c(1, 2.5, 5)
y <- c(4, 4.5, 3)
print(x + y)
[1] 5 7 8

Operator ‘-’ subtracts one object from another element by element

x <- c(1, 2.5, 5)
y <- c(4, 4.5, 3)
print(x - y)
[1] -3 -2 2

Operator ‘*’ multiplies two objects element by element

x <- c(1, 2.5, 5)
y <- c(4, 4.5, 3)
print(x * y)
[1] 4.00 11.25 15.00

Operator ‘/’ divides one object by another element by element (division by zero gives Inf)

x <- c(1, 1, 6)
y <- c(0, 4, 3)
print(x / y)
[1] Inf 0.25 2.00

Operator ‘%%’ returns the remainder of the division of the 1st object by the 2nd

x <- c(5, 4, 2)
y <- c(2, 1, 4)
print(x %% y)
[1] 1 0 2

Operator ‘%/%’ returns the integer quotient of the division of the 1st object by the 2nd

x <- c(4, 4, 2)
y <- c(2, 0, 4)
print(x %/% y)
[1] 2 Inf 0

Operator ‘^’ raises the 1st object to the power of the 2nd object

x <- c(4, 3, 2)
y <- c(3, 0, 5)
print(x ^ y)
[1] 64 1 32
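The integer quotient and the remainder fit together: multiplying the quotient by the divisor and adding the remainder gives back the original value. The following is a small sketch of that relationship; the vectors and the names `q` and `r` are chosen only for illustration.

```r
x <- c(7, 9, 2)
y <- c(2, 4, 5)

q <- x %/% y   # integer quotient: 3 2 0
r <- x %% y    # remainder:        1 1 2

print(q)
print(r)

# Quotient times divisor plus remainder reconstructs x: 7 9 2
print(q * y + r)
print(all(q * y + r == x))  # TRUE
```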


Relational Operators

Operator ‘>’ compares element by element and returns TRUE wherever an element of
the 1st object is greater than the corresponding element of the 2nd object

x <- c(4, 3, 2)
y <- c(3, 0, 5)
print(x > y)
[1] TRUE TRUE FALSE

Operator ‘<’ compares element by element and returns TRUE wherever an element of
the 1st object is less than the corresponding element of the 2nd object

x <- c(4, 3, 2)
y <- c(3, 0, 5)
print(x < y)
[1] FALSE FALSE TRUE

Operator ‘<=’ compares element by element and returns TRUE wherever an element of
the 1st object is less than or equal to the corresponding element of the 2nd object

x <- c(4, 3, 2)
y <- c(7, 0, 5)
print(x <= y)
[1] TRUE FALSE TRUE

Operator ‘>=’ compares element by element and returns TRUE wherever an element of
the 1st object is greater than or equal to the corresponding element of the 2nd object

x <- c(4, 3, 2)
y <- c(7, 0, 5)
print(x >= y)
[1] FALSE TRUE FALSE

Operator ‘==’ compares element by element and returns TRUE wherever an element of
the 1st object is equal to the corresponding element of the 2nd object

x <- c(4, 3, 2)
y <- c(7, 3, 5)
print(x == y)
[1] FALSE TRUE FALSE

Operator ‘!=’ compares element by element and returns TRUE wherever an element of
the 1st object is not equal to the corresponding element of the 2nd object

x <- c(4, 3, 2)
y <- c(7, 3, 5)
print(x != y)
[1] TRUE FALSE TRUE
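A comparison returns one logical value per element, not a single TRUE or FALSE for the whole vector. When a single answer is needed, the base functions all() and any() can summarize the result. The sketch below uses illustrative vectors and the illustrative name `gt`.

```r
x <- c(4, 3, 2)
y <- c(3, 0, 5)

gt <- x > y        # one logical value per element
print(gt)          # TRUE TRUE FALSE

print(all(gt))     # FALSE: not every element of x exceeds y
print(any(gt))     # TRUE: at least one element of x exceeds y
print(which(gt))   # 1 2: positions where the comparison holds
print(sum(gt))     # 2: number of TRUE values
```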


Miscellaneous Operators

Operator ‘:’ creates a sequence of consecutive numbers

x <- c(1:10)
print(x)
[1] 1 2 3 4 5 6 7 8 9 10

Operator ‘%in%’ checks, for each element of the 1st vector, whether it belongs to the 2nd vector

x <- c(1, 4, 7)
y <- c(1:6)
print(x %in% y)
[1] TRUE TRUE FALSE

R Programming Tools
Commonly Used Functions

Function             Description
table                counts of each distinct value
c                    concatenate values
print                show a value
which                indices of TRUE elements
length               number of values
summary              generic summary statistics
dim                  dimensions of a matrix or array
min, max             minimum, maximum
help(), ?            show documentation
rbind, cbind         bind vectors as rows, as columns
class                class (type) of an object
apply                apply a function over rows or columns
sort, order, rank    sorted values, sorting indices, ranks
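A few of the functions listed above are easiest to tell apart by running them on the same data. The sketch below uses a small illustrative vector `v` and matrix `m`; the data and the result names are made up for this example.

```r
v <- c(30, 10, 20, 10)

print(length(v))        # 4
print(table(v))         # counts: 10 appears twice, 20 and 30 once
idx <- which(v == 10)   # positions of the value 10: 2 4
print(idx)

s <- sort(v)            # sorted values:    10 10 20 30
o <- order(v)           # sorting indices:   2  4  3  1
k <- rank(v)            # ranks (ties averaged): 4.0 1.5 3.0 1.5
print(s); print(o); print(k)

m <- rbind(c(1, 2, 3), c(4, 5, 6))  # bind two vectors as rows
print(dim(m))                       # 2 3
row_sums <- apply(m, 1, sum)        # over rows:    6 15
col_max  <- apply(m, 2, max)        # over columns: 4 5 6
print(row_sums); print(col_max)
```

sort() returns the values themselves, order() returns the positions that would sort the vector, and rank() returns the position each value would take after sorting.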

R Programming Tools
Commonly Used Functions, Cont’d

Function                  Description
mean(x)                   average
var(x)                    variance
cor(x)                    correlation
cov(x)                    covariance
sqrt(x)                   square root
log10(x)                  logarithm to base 10
sin(x), cos(x), tan(x)    trigonometric functions
log(x)                    natural logarithm
seq(x)                    sequence generation
median(x)                 middle value of the sorted data
mad(x)                    median absolute deviation
d, p, q, r                prefixes for density, cumulative probability, quantile, and random-number generation functions

R Programming Tools
Probability Distribution Functions

Distribution         Functions
Beta                 pbeta, qbeta, dbeta, rbeta
Binomial             pbinom, qbinom, dbinom, rbinom
Cauchy               pcauchy, qcauchy, dcauchy, rcauchy
Chi-Square           pchisq, qchisq, dchisq, rchisq
Exponential          pexp, qexp, dexp, rexp
F                    pf, qf, df, rf
Gamma                pgamma, qgamma, dgamma, rgamma
Geometric            pgeom, qgeom, dgeom, rgeom
Hypergeometric       phyper, qhyper, dhyper, rhyper
Logistic             plogis, qlogis, dlogis, rlogis

Log-Normal           plnorm, qlnorm, dlnorm, rlnorm
Negative Binomial    pnbinom, qnbinom, dnbinom, rnbinom
Normal               pnorm, qnorm, dnorm, rnorm
Poisson              ppois, qpois, dpois, rpois
Student-t            pt, qt, dt, rt
Uniform              punif, qunif, dunif, runif
Studentized Range    ptukey, qtukey (base R provides no density or random-generation function for this distribution)
Weibull              pweibull, qweibull, dweibull, rweibull
Wilcoxon’s Rank      pwilcox, qwilcox, dwilcox, rwilcox

For more details about R functions (with examples), see: https://statisticsglobe.com/
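The d, p, q, and r prefixes follow the same pattern for every distribution in the tables. As a short sketch, here they are applied to the standard normal distribution; the result names and the seed value are arbitrary choices for illustration.

```r
d <- dnorm(0)        # density at 0, about 0.3989
p <- pnorm(1.96)     # P(Z <= 1.96), about 0.975
q <- qnorm(0.975)    # 97.5% quantile, about 1.96

set.seed(1)          # arbitrary seed so the random draws are reproducible
r <- rnorm(5)        # five random draws from N(0, 1)

print(c(d, p, q))
print(r)

# The quantile function inverts the cumulative probability function
print(pnorm(q))      # 0.975
```

The same calls work with other distributions by swapping the suffix and supplying that distribution's parameters, for example pexp(2, rate = 1) or rbinom(5, size = 10, prob = 0.3).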

Top 10 R Programming Books

The book listed first does not have to be better than the others. They are all deserving
of inclusion on the list, in our opinion.

#1 R in Action. For details see Kabacoff (2015).
#2 R for Data Science. For details see Wickham and Grolemund (2016).
#3 The Art of R Programming. For details see Matloff (2011).
#4 Hands-On Programming with R. For details see Grolemund (2014).
#5 R Graphics Cookbook. For details see Chang (2018).
#6 The Big R-Book. For details see De Brouwer (2020).
#7 Practical Data Science with R. For details see Mount and Zumel (2019).
#8 R for Everyone. For details see Lander (2014).
#9 The Book of R. For details see Davies (2016).
#10 The R Book. For details see Crawley (2012).

Top 10 R Programming Courses

The best online courses to learn R programming, the language used by data
analysts and statisticians to structure, analyze, and visualize data.

Data Analysis with R Programming (Google)
https://www.coursera.org/learn/data-analysis-r

R Programming Fundamentals (Stanford University)
https://online.stanford.edu/courses/xfds112-r-programming-fundamentals

Data Science: R Basics (Harvard University)
https://pll.harvard.edu/course/data-science-r-basics?delta=0

Data Analysis with R (Facebook)
https://www.facebook.com/groups/2101100100212657/

The Analytics Edge (Massachusetts Institute of Technology)
https://www.classcentral.com/course/edx-the-analytics-edge-1623

Introduction to R (DataCamp)
https://www.datacamp.com/courses/free-introduction-to-r

Swirl: Learn R
https://swirlstats.com/

Introduction to Business Analytics with R (University of Illinois)
https://www.coursera.org/learn/business-analytics-r

Introduction to Probability and Data with R (Duke University)
https://www.coursera.org/learn/probability-intro

R Programming A-Z (Udemy)
https://www.udemy.com/course/r-programming

Top 10 R Programming Online Resources

In R, you might run into a situation or two that requires some expert help. The
websites listed can provide the assistance you need.

#1 R-bloggers
Note: The R-bloggers website comprises the efforts of more than 750 R bloggers.
https://www.r-bloggers.com/

#2 Microsoft R Application Network (Revolution R Open)
Note: In 2015, Microsoft acquired Inside-R’s parent company Revolution Analytics. One result of this acquisition is
the Microsoft R Application Network (MRAN).
https://mran.microsoft.com/

#3 Quick-R
Note: Professor Rob Kabacoff at Wesleyan University created this website to introduce you to R and its applications.
https://www.statmethods.net/

#4 RStudio
Note: RStudio is an online learning page that links to tutorials and examples to help you master R and related tools.
https://www.rstudio.com/

#5 Statistics Globe
Note: Statistics Globe is an education platform that provides free programming tutorials in R and Python as well as
theoretical explanations for the field of statistics and data science.
https://statisticsglobe.com/

#6 Stack Overflow
Note: Stack Overflow is a multimillion-member community of programmers dedicated to helping each other. You can
search their Q&A base for help with a problem, or you can ask a question.
https://stackoverflow.com/

#7 R Tutorial
Note: R Tutorial is designed for software programmers, statisticians, and data miners who are looking to
develop statistical software using R programming.
https://www.tutorialspoint.com/r/index.htm

#8 R Programming Tutorial
Note: R Programming Tutorial is designed for both beginners and professionals. The tutorial provides all the basic
and advanced concepts of data analysis and visualization.
https://www.javatpoint.com/r-tutorial

#9 RDocumentation
Note: RDocumentation enables you to search for R packages and functions that suit your needs.
https://www.rdocumentation.org/

#10 R Manuals
Note: If you want to go directly to the source, visit the R manuals page.
https://cran.r-project.org/manuals.html

Data Structures

R Data Types
In R, there are 6 basic data types called: logical, numeric, integer, complex,
character and raw.

Data Types

print("abc") # Character
[1] "abc"

print(5L) # Integer (the L suffix marks an integer; a plain 5 is stored as numeric)
[1] 5

print(c(10, 20)) # Numeric
[1] 10 20

print(TRUE) # Logical
[1] TRUE

print(2+3i) # Complex
[1] 2+3i

print(charToRaw("hello")) # Raw
[1] 68 65 6c 6c 6f

Note: The charToRaw() command converts each character to its American Standard Code for Information Interchange
(ASCII) value, displayed in hexadecimal.
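To check which of the six basic types a value has, the base functions typeof() and class() can be used. The sketch below is illustrative only; the list `values` and the name `types` are made up for this example.

```r
# One value of each basic type
values <- list(TRUE, 5L, 5, 2+3i, "abc", charToRaw("a"))

# typeof() reports the internal storage type of each value
types <- sapply(values, typeof)
print(types)
# "logical" "integer" "double" "complex" "character" "raw"

# A plain number is stored as double and has class "numeric"
print(class(5))        # "numeric"
print(is.integer(5))   # FALSE
print(is.integer(5L))  # TRUE
```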

R Data Structures
R has a wide variety of data structures, including vectors, factors, matrices, arrays, data
frames, and lists.

Vectors

A vector is the basic data structure in R. It stores elements of a single type, which can be
one of six: logical, integer, double, complex, character, or raw.
Vectors

x <- 5:15; print(x) # Sequence from 5 to 15


[1] 5 6 7 8 9 10 11 12 13 14 15

x <- seq (5 ,15); print(x) # Sequence from 5 to 15


[1] 5 6 7 8 9 10 11 12 13 14 15

x <- 5.5:12.5; print(x) # Sequence from 5.5 to 12.5


[1] 5.5 6.5 7.5 8.5 9.5 10.5 11.5 12.5


Vectors, Cont’d

x <- 5.5:13; print(x) # Sequence 5.5 to 13


[1] 5.5 6.5 7.5 8.5 9.5 10.5 11.5 12.5

x <- seq (2,6,by =0.5); print(x) # Sequence 2 to 6 by 0.5


[1] 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0

x <- seq (2 ,6 ,0.5); print(x) # Sequence 2 to 6 by 0.5


[1] 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0


Vectors, Cont’d

x <- c(1 ,2 ,3 ,4 ,5); y <- c(6 ,7 ,8 ,9 ,10) # Vectors x and y


x; y
[1] 1 2 3 4 5
[1] 6 7 8 9 10

x[2]; y[2] # Access 2nd element in x and y


[1] 2
[1] 7

x[ -2]; y[-2] # Exclude 2nd element in x and y


[1] 1 3 4 5
[1] 6 8 9 10


Vectors, Cont’d

x <- c(1 ,2 ,3 ,4 ,5); y <- c(6 ,7 ,8 ,9 ,10) # Vectors x and y


x; y
[1] 1 2 3 4 5
[1] 6 7 8 9 10

x[c(1 ,5) ]; y[c(1 ,5)] # Access (1st ,5th) items in x,y


[1] 1 5
[1] 6 10

x[-c(1 ,5) ]; y[-c(1 ,5)] # Exclude (1st ,5th) items in x,y


[1] 2 3 4
[1] 7 8 9


Vectors, Cont’d
x; y
[1] 1 2 3 4 5
[1] 6 7 8 9 10

x[1]= -1; x # Replace 1st item in x


[1] -1 2 3 4 5
y[c(1 ,5) ]=c(10 ,100); y # Replace (1st ,5th) items in y
[1] 10 7 8 9 100

A=x+y; A # Vector addition


[1] 9 9 11 13 105
B=x-y; B # Vector subtraction
[1] -11 -5 -5 -5 -95


Vectors, Cont’d
x;y
[1] 1 2 3 4 5
[1] 6 7 8 9 10

x+5; y-6 # Add 5 to x; subtract 6 from y


[1] 6 7 8 9 10
[1] 0 1 2 3 4

x*5; y/2 # Multiply x by 5; divide y by 2


[1] 5 10 15 20 25
[1] 3.0 3.5 4.0 4.5 5.0


Vectors, Cont’d
x;y
[1] 1 2 3 4 5
[1] 6 7 8 9 10

x[x <5]; x[x >=2]; x[x <1] # Access elements from x


[1] 1 2 3 4
[1] 2 3 4 5
numeric (0)

y^2; sqrt(y) # Square of y; square root of y


[1] 36 49 64 81 100
[1] 2.449490 2.645751 2.828427 3.000000 3.162278


Vectors, Cont’d
data <- rep(c(2 ,4 ,6) , times =3)
print(data) # Repeat vector 3 times
[1] 2 4 6 2 4 6 2 4 6

data <- rep(c(2 ,4 ,6) , each =3)


print(data) # Repeat each item 3 times
[1] 2 2 2 4 4 4 6 6 6

data <- rep(seq (1 ,3 ,0.5) , times =2)


print(data) # Repeat sequence 2 times
[1] 1.0 1.5 2.0 2.5 3.0 1.0 1.5 2.0 2.5 3.0


Vectors, Cont’d
for (i in seq (1 ,3 ,0.5)) {
print(i) # Sequence 1 to 3 by 0.5 separately
}
[1] 1
[1] 1.5
[1] 2
[1] 2.5
[1] 3


Vectors, Cont’d
data <- c(1 ,2 ,3 ,4 ,5 ,6)

for (i in data) {
if (i %% 2 == 0)
print(i) # Print even integers
}
[1] 2
[1] 4
[1] 6


Vectors, Cont’d
data <- c(1 ,2 ,3 ,4 ,5 ,6)

for (i in data) {
if (i %% 2 != 0)
print(i) # Print odd integers
}
[1] 1
[1] 3
[1] 5


Vectors, Cont’d
data <- c(1 ,2 ,3 ,4 ,5 ,6)

for (i in data) {
if (i %% 2 == 1)
print(i) # Print odd integers
}
[1] 1
[1] 3
[1] 5
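
The even/odd loops can also be written without an explicit for, using the same logical indexing shown earlier for x[x < 5]. A minimal sketch follows; the names evens and odds are illustrative.

```r
data <- c(1, 2, 3, 4, 5, 6)

evens <- data[data %% 2 == 0]  # keep elements whose remainder is 0
odds  <- data[data %% 2 != 0]  # keep elements whose remainder is not 0

print(evens)
print(odds)
```

The vectorized form returns the selected values as a vector, which is usually more convenient than printing them one by one.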

Matrices

A matrix is a two-dimensional data structure in which data are arranged into rows
and columns. In R, a matrix is created with the matrix() function, whose basic syntax is
matrix(x, nrow, ncol, byrow) # Create a matrix
x - data items of the same type
nrow - number of rows
ncol - number of columns
byrow (optional) - if TRUE, the matrix is filled row-wise. By
default (FALSE), the matrix is filled column-wise.
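
The effect of byrow is easiest to see side by side. The sketch below fills the same data both ways; A_col and A_row are illustrative names.

```r
x <- c(9, 1, 2, 3, 4, 5, 6, 7, 8)

A_col <- matrix(x, nrow = 3, ncol = 3)                # column-wise (default)
A_row <- matrix(x, nrow = 3, ncol = 3, byrow = TRUE)  # row-wise

print(A_col)
print(A_row)
```

Filling row-wise gives the transpose of the column-wise result when the matrix is square, as it is here.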


Matrices
x = c(9, 1, 2, 3, 4, 5, 6, 7, 8)
# Create a 3x3 matrix
A = matrix(x, nrow = 3, ncol = 3); print(A)
     [,1] [,2] [,3]
[1,]    9    3    6
[2,]    1    4    7
[3,]    2    5    8

A = matrix(x, 3, 3); print(A)
     [,1] [,2] [,3]
[1,]    9    3    6
[2,]    1    4    7
[3,]    2    5    8


Matrices, Cont’d
A = matrix(c(9, 1, 2, 3, 4, 5, 6, 7, 8), nrow = 3, ncol = 3); print(A)
     [,1] [,2] [,3]
[1,]    9    3    6
[2,]    1    4    7
[3,]    2    5    8

dim(A) # Dimension of A
[1] 3 3

det(A) # Determinant of A
[1] -27

diag(A) # Diagonal elements of A
[1] 9 4 8


Matrices, Cont’d

sum(diag(A)) # Trace of A
[1] 21

qr(A)$rank # Rank of A
[1] 3

A[2, 2] # Access (2nd row, 2nd column) element


[1] 4


Matrices, Cont’d
A[, 1]; A[1, ]
[1] 9 1 2
[1] 9 3 6

matrix(A[, 1], 3, 1) # Access 1st column as matrix
     [,1]
[1,]    9
[2,]    1
[3,]    2

matrix(A[1, ], 1, 3) # Access 1st row as matrix
     [,1] [,2] [,3]
[1,]    9    3    6


Matrices, Cont’d
cbind(A[1, ]) # Transpose 1st row to a column
     [,1]
[1,]    9
[2,]    3
[3,]    6

rbind(A[, 1]) # Transpose 1st column to a row
     [,1] [,2] [,3]
[1,]    9    1    2

A[, 1] = c(2, 0, 2); A # Replace 1st column by other items
     [,1] [,2] [,3]
[1,]    2    3    6
[2,]    0    4    7
[3,]    2    5    8


Matrices, Cont’d
colSums(A); rowSums(A) # Sum columns; sum rows of A
[1] 12 12 21
[1] 18 12 15

t(A) # Transpose A
     [,1] [,2] [,3]
[1,]    9    1    2
[2,]    3    4    5
[3,]    6    7    8

solve(A) # Inverse matrix of A
           [,1]       [,2]       [,3]
[1,]  0.1111111 -0.2222222  0.1111111
[2,] -0.2222222 -2.2222222  2.1111111
[3,]  0.1111111  1.4444444 -1.2222222
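
One way to check the inverse is to multiply it by A and compare the product with the identity matrix. The sketch below does this with a numerical tolerance, since floating-point results are rarely exact; A_inv, check and is_identity are illustrative names.

```r
A <- matrix(c(9, 1, 2, 3, 4, 5, 6, 7, 8), nrow = 3, ncol = 3)

A_inv <- solve(A)       # inverse of A
check <- A %*% A_inv    # should be close to the 3x3 identity matrix
print(round(check, 10))

# all.equal() compares with a small tolerance instead of exact equality
is_identity <- isTRUE(all.equal(check, diag(3)))
print(is_identity)
```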


Matrices, Cont’d
x = c(9, 1, 2, 3, 4, 5, 6, 7, 8)       # Data x
y = c(0, 2, 4, 6, 8, 10, 12, 14, 16)   # Data y

A = matrix(x, nrow = 3, ncol = 3); A # Matrix A
     [,1] [,2] [,3]
[1,]    9    3    6
[2,]    1    4    7
[3,]    2    5    8

B = matrix(y, nrow = 3, ncol = 3); B # Matrix B
     [,1] [,2] [,3]
[1,]    0    6   12
[2,]    2    8   14
[3,]    4   10   16


Matrices, Cont’d

Z = A + B; Z # Add matrix A to B
     [,1] [,2] [,3]
[1,]    9    9   18
[2,]    3   12   21
[3,]    6   15   24

Z = A - B; Z # Subtract matrix B from A
     [,1] [,2] [,3]
[1,]    9   -3   -6
[2,]   -1   -4   -7
[3,]   -2   -5   -8


Matrices, Cont’d

Z = A / B; Z # Divide matrix A by B element-wise (9/0 gives Inf)
     [,1] [,2] [,3]
[1,]  Inf  0.5  0.5
[2,]  0.5  0.5  0.5
[3,]  0.5  0.5  0.5


Matrices, Cont’d

Z = A * B; Z # Element-wise product of A and B
     [,1] [,2] [,3]
[1,]    0   18   72
[2,]    2   32   98
[3,]    8   50  128

Z = A %*% B; Z # Matrix product of A and B
     [,1] [,2] [,3]
[1,]   30  138  246
[2,]   36  108  180
[3,]   42  132  222

Arrays

An array is a data structure that can store data of the same type in more than two
dimensions. In R, an array is created with the array() function, whose basic syntax is
array(x, dim = c(nrow, ncol, nmat)) # Create an array

x - data items of the same type
nrow - number of rows
ncol - number of columns
nmat - number of matrices


Arrays

A = array(c(1:12), dim = c(2, 3, 2))
print(A) # Create two matrices 2x3
, , 1

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

, , 2

     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12


Arrays, Cont’d

A[,,2] # Access matrix #2
     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12

A[, 3, 1] # Access 3rd column in matrix #1
[1] 5 6

A[2, 3, 2] # Access 2nd item of 3rd col. in matrix #2
[1] 12

12 %in% A[,,2] # Check element 12 in matrix #2
[1] TRUE


Arrays, Cont’d
A1 <- matrix(c(1:6), 2, 3, byrow = TRUE)    # Matrix A1
A2 <- matrix(c(-1:-6), 2, 3, byrow = TRUE)  # Matrix A2
col.names <- c("COL1", "COL2", "COL3")      # Col.names
row.names <- c("ROW1", "ROW2")              # Row.names
mat.names <- c("Matrix1", "Matrix2")        # Matrix.names

Array <- array(c(A1, A2), dim = c(2, 3, 2),
  dimnames = list(row.names, col.names, mat.names)
) # Set A1 and A2 to array
print(Array)


Arrays, Cont’d
, , Matrix1

     COL1 COL2 COL3
ROW1    1    2    3
ROW2    4    5    6

, , Matrix2

     COL1 COL2 COL3
ROW1   -1   -2   -3
ROW2   -4   -5   -6


Arrays, Cont’d
Mat <- Array[,,1] + Array[,,2] # Add arrays
print(Mat)
     COL1 COL2 COL3
ROW1    0    0    0
ROW2    0    0    0

Transpose.1 <- t(Array[,,1]) # Transpose 1st array
print(Transpose.1)
     ROW1 ROW2
COL1    1    4
COL2    2    5
COL3    3    6


Arrays, Cont’d

Res.1 <- apply(Array, c(1), sum) # Sum rows in arrays
print(Res.1)
ROW1 ROW2
   0    0

Res.2 <- apply(Array, c(2), sum) # Sum columns in arrays
print(Res.2)
COL1 COL2 COL3
   0    0    0


Arrays, Cont’d

Res.3 <- apply(Array, c(3), sum) # Sum the items of each matrix
print(Res.3)
Matrix1 Matrix2
     21     -21

Res.4 <- sum(Array) # Sum all items in the array
print(Res.4)
[1] 0
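
The second argument of apply() (MARGIN) selects which dimensions are kept. As a further sketch, MARGIN = c(1, 2) keeps rows and columns and sums over the third dimension, which gives the same values as adding the two matrices. The name cell_sums is illustrative, and the array is rebuilt without dimnames to keep the example short.

```r
A1 <- matrix(1:6, 2, 3, byrow = TRUE)
A2 <- matrix(-1:-6, 2, 3, byrow = TRUE)
Array <- array(c(A1, A2), dim = c(2, 3, 2))

# Keep dimensions 1 (rows) and 2 (columns); sum across the matrices
cell_sums <- apply(Array, c(1, 2), sum)
print(cell_sums)
```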


Arrays, Cont’d

# Multiply the two matrices element-wise
Res.5 <- Array[,,1] * Array[,,2]
print(Res.5)
     COL1 COL2 COL3
ROW1   -1   -4   -9
ROW2  -16  -25  -36
# Matrix product of the 1st matrix and the transposed 2nd matrix
Res.6 <- Array[,,1] %*% t(Array[,,2])
print(Res.6)
     ROW1 ROW2
ROW1  -14  -32
ROW2  -32  -77

R Data Import

In R, one can read data from files stored outside the R environment.
One can also write data into files, which are then stored and accessed by
the operating system. R can read and write various file formats, such
as txt, Excel, csv, and xml.
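
As a quick sketch of both directions, the code below writes a small data frame to a temporary csv file and reads it back. The data frame staff, the file path and the name staff_back are made up for this illustration.

```r
staff <- data.frame(id = 1:3,
                    name = c("Ahmed", "Islam", "Adam"),
                    salary = c(2850, 2680, 2540))

path <- tempfile(fileext = ".csv")          # temporary file, deleted below
write.csv(staff, path, row.names = FALSE)   # write without a row-name column
staff_back <- read.csv(path)                # read the same file back
print(staff_back)

unlink(path)                                # remove the temporary file
```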


Read a csv file


The CSV (Comma Separated Value) file is a plain text file in which the values in
the columns are separated by a comma. To see how we read CSV files in R, let’s
consider the following data present in the file named staff.csv as
id,name,salary,age,jobe
1,Ahmed,2850,32,Prof
2,Islam,2680,28,Eng
3,Adam,2540,25,Dr
4,Asmaa,2760,30,HR
5,Mona,2400,26,IT

In R, the basic syntax of the read.csv() function is

read.csv(file, header = TRUE, sep = ",", quote = "\"",
         dec = ".", fill = TRUE, comment.char = "", ...)
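
The sep and dec arguments matter when a file does not follow the comma/point convention. The sketch below, built on a made-up semicolon-separated file with decimal commas, shows how they are set; the file contents and variable names are assumptions for illustration only.

```r
path <- tempfile(fileext = ".csv")
writeLines(c("id;name;salary",
             "1;Ahmed;2850,5",
             "2;Islam;2680,0"), path)

# sep gives the column separator, dec the decimal mark used in the file
staff <- read.csv(path, header = TRUE, sep = ";", dec = ",")
print(staff$salary)

unlink(path)
```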

Read a csv file, Cont’d
Given the path of the staff.csv file on your computer, the read.csv()
function reads the file as
read_data <- read.csv("C:\\Users\\king\\Desktop\\staff.csv")
print(read_data) # Print all data in staff.csv
The contents of the csv file are then displayed as:
id name salary age jobe
1 1 Ahmed 2850 32 Prof
2 2 Islam 2680 28 Eng
3 3 Adam 2540 25 Dr
4 4 Asmaa 2760 30 HR
5 5 Mona 2400 26 IT
cat("Total Columns :", ncol(read_data)) # No. of columns
Total Columns : 5
cat("Total Rows:", nrow(read_data)) # No. of rows
Total Rows: 5


Read a csv file, Cont’d


Next, different functions are used to analyze the data in staff.csv as:
min_data <- min(read_data$ salary )
print(min_data) # Minimum value of salary
[1] 2400

max_data <- max(read_data$age)


print(max_data) # Maximum value of age
[1] 32

sub_data <- subset (read_data , salary > 2600)


print(sub_data) # Get salaries greater than 2600
id name salary age jobe
1 1 Ahmed 2850 32 Prof
2 2 Islam 2680 28 Eng
4 4 Asmaa 2760 30 HR

Read a csv file, Cont’d

min_person <- subset(read_data, salary == min(salary))
print(min_person) # Get the person having min salary
id name salary age jobe
5 5 Mona 2400 26 IT

people <- subset ( read_data , jobe == "HR")


print( people ) # Get people working as HR
id name salary age jobe
4 4 Asmaa 2760 30 HR

info_people <- subset(read_data, salary > 2700 & jobe == "Eng")
print(info_people) # Get people working as Eng having salary > 2700
[1] id name salary age jobe
<0 rows> (or 0-length row.names) # No row satisfies both conditions
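
The same selections can be written with bracket indexing instead of subset(). The sketch below rebuilds the staff data as a data frame in code (so that it runs without the csv file) and repeats two of the queries; high_paid and eng_high are illustrative names.

```r
read_data <- data.frame(
  id     = 1:5,
  name   = c("Ahmed", "Islam", "Adam", "Asmaa", "Mona"),
  salary = c(2850, 2680, 2540, 2760, 2400),
  age    = c(32, 28, 25, 30, 26),
  jobe   = c("Prof", "Eng", "Dr", "HR", "IT")
)

# Rows are selected by a logical condition placed before the comma
high_paid <- read_data[read_data$salary > 2600, ]
eng_high  <- read_data[read_data$salary > 2700 & read_data$jobe == "Eng", ]

print(high_paid$name)
print(nrow(eng_high))   # number of rows matching both conditions
```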

Read a xlsx file


Microsoft Excel is the most widely used spreadsheet program; it stores data in
.xls (or .xlsx) format. R can read these files directly using Excel-specific
packages such as readxl, xlsx, and openxlsx. To see how to read an
Excel file in R, suppose we have a workbook named students.xlsx.

Next, we demonstrate how to use the read_excel() function to read
the xlsx file from your current working directory.


Read a xlsx file, Cont’d


First, install and load the readxl package in the R environment as
install.packages("readxl") # Install readxl package
library("readxl") # Load readxl package

In R, the basic syntax of the read_excel() function is

read_excel(path, sheet = NULL, range = NULL, col_names = TRUE,
  col_types = NULL, na = "", trim_ws = TRUE, skip = 0,
  n_max = Inf, guess_max = min(1000, n_max),
  progress = readxl_progress(), .name_repair = "unique"
)

Read a xlsx file, Cont’d
Given the path of the students.xlsx file on your computer, the read_excel()
function reads the file as
# Read students.xlsx file
read_data <- read_excel("students.xlsx", sheet = 1)
print(read_data) # Print all data in students.xlsx
id name level age college
1 1 Ahmed 4 21 Business
2 2 Islam 3 20 Engineering
3 3 Adam 2 18 Engineering
4 4 Asmaa 2 19 Arts
5 5 Mimi 1 18 Law

cat("Total Columns :", ncol(read_data)) # No. of columns


Total Columns : 5
cat("Total Rows:", nrow(read_data)) # No. of rows
Total Rows: 5


Read a xlsx file, Cont’d


Next, different functions are used to analyze the data in students.xlsx as:
min_data <- min(read_data$age)
print(min_data) # Minimum age
[1] 18

max_data <- max(read_data$level)


print(max_data) # Maximum level
[1] 4

sub_data <- subset (read_data , age > 18)


print(sub_data) # Get people having age > 18
id name level age college
1 1 Ahmed 4 21 Business
2 2 Islam 3 20 Engineering
3 4 Asmaa 2 19 Arts


Read a xlsx file, Cont’d

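# sheetIndex and rowIndex are arguments of xlsx::read.xlsx(), not of read_excel();
# with readxl, n_max limits the number of data rows that are read
sub_data <- read_excel("students.xlsx", sheet = 1, n_max = 2)
print(sub_data) # Access the rows 1:2
id name level age college
1 1 Ahmed 4 21 Business
2 2 Islam 3 20 Engineering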

Read a txt file
A txt file is a kind of computer file that is structured as a sequence of lines of
electronic text. To see how we read txt files in R, let’s consider the following data
present in the file named guest.txt as
id name star age country
1 1 Ahmed 4 21 UK
2 2 Islam 3 20 Germany
3 3 Adam 3 18 France
4 4 Asmaa 5 19 Canada
5 5 Mimi 2 18 USA

In R, the basic syntax of the read.table() function is

read.table(file,   # txt data file
  header = FALSE,  # TRUE if the first line contains the column names
  sep = "",        # Column separator ("" means any whitespace)
  dec = ".")       # Character used for the decimal point
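
The guest.txt layout has one fewer field in its header line than in its data lines. In that case read.table() uses the first column as row names. The sketch below reproduces this with the text argument, so no file is needed; guest_txt and guests are illustrative names and only three of the rows are used.

```r
guest_txt <- "id name star age country
1 1 Ahmed 4 21 UK
2 2 Islam 3 20 Germany
3 3 Adam 3 18 France"

# text = lets read.table() parse a character string instead of a file
guests <- read.table(text = guest_txt, header = TRUE)

print(dim(guests))        # rows and columns after row names are split off
print(rownames(guests))
print(guests$country)
```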

Read a txt file, Cont’d
Given the path of the guest.txt file on your computer, the read.table()
function reads the file as
# Read guest.txt file
read_data <- read.table("guest.txt", header = TRUE)
print(read_data) # Read all data in guest.txt
id name star age country
1 1 Ahmed 4 21 UK
2 2 Islam 3 20 Germany
3 3 Adam 3 18 France
4 4 Asmaa 5 19 Canada
5 5 Mimi 2 18 USA

cat("Total Columns :", ncol(read_data)) # No. of columns


Total Columns : 5
cat("Total Rows:", nrow(read_data)) # No. of rows
Total Rows: 5


Read a txt file, Cont’d

max_data <- max(read_data$age)


print(max_data) # Maximum age
[1] 21

min_data <- min(read_data$star)


print(min_data) # Minimum star
[1] 2

sub.1 <- subset(read_data, star == 3)
print(sub.1) # Guests staying in a three-star room
id name star age country
2 2 Islam 3 20 Germany
3 3 Adam 3 18 France


Read a txt file, Cont’d

sub.2 <- subset(read_data[, c(1:3)]) # Access columns 1:3
print(sub.2)
id name star
1 1 Ahmed 4
2 2 Islam 3
3 3 Adam 3
4 4 Asmaa 5
5 5 Mimi 2

sub.3 <- subset(read_data[c(1:3), ]) # Access rows 1:3
print(sub.3)
id name star age country
1 1 Ahmed 4 21 UK
2 2 Islam 3 20 Germany
3 3 Adam 3 18 France

R Statistics

Statistics

Statistics
set.seed (1234) # Set seed for reproducibility

x <- rnorm (1000 ,0 ,1) # Generate random sample

print(mean(x)) # Mean
[1] -0.0265972

print( median (x)) # Median


[1] -0.0397941

print(range (x)) # Range of x


[1] -3.396064 3.195901
Note: The set.seed() function fixes the state of the random number generator, so the same random sample is
reproduced each time the code is run.
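
A short sketch of what the seed does in practice: two draws made after the same seed are identical, while a further draw without reseeding continues the random stream. The names a, b and c_new are illustrative.

```r
set.seed(1234); a <- rnorm(5)
set.seed(1234); b <- rnorm(5)   # same seed, hence the same draws
c_new <- rnorm(5)               # no reseeding: the stream simply continues

print(identical(a, b))
print(identical(a, c_new))
```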


Statistics, Cont’d

print( quantile (x ,0.75) ) # 3rd Quartile


75%
0.6158186

print(mad(x)) # Median absolute deviation


[1] 0.9522307

print(var(x)) # Variance
[1] 0.9946825

print(sd(x)) # Standard deviation


[1] 0.9973377
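
To make the link between var() and sd() explicit, the sketch below recomputes both from the definition of the sample variance (divisor n - 1) and compares them with the built-in functions. The names var_manual and sd_manual are ours.

```r
set.seed(1234)
x <- rnorm(1000, 0, 1)
n <- length(x)

var_manual <- sum((x - mean(x))^2) / (n - 1)  # sample variance
sd_manual  <- sqrt(var_manual)                # standard deviation

print(all.equal(var_manual, var(x)))
print(all.equal(sd_manual, sd(x)))
```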


Statistics, Cont’d

print(min(x)) # Minimum value


[1] -3.396064

print(max(x)) # Maximum value


[1] 3.195901

print( summary (x)) # Statistics summary


Min. 1st Qu. Median Mean 3rd Qu. Max.
-3.39606 -0.67325 -0.03979 -0.02660 0.61582 3.19590

R Graphics

R Plots

Simple Plot #1

set.seed (123)
x <- rnorm (500) # Generate sample x from N(0 ,1)
y <- x + rnorm (500) # Generate sample y
plot(x, y) # Plot samples x and y


Suppose we have four quantities, namely x, y, z1 and z2, formulated respectively as:

$$x = \mathrm{rnorm}(500), \qquad y = x + \mathrm{rnorm}(100),$$
$$z_1 = x - 2y + 100, \qquad z_2 = (z_1 + 5)\,\frac{e^{z_1/100}}{\log(z_1)}.$$


Simple Plot #2

set.seed (123)
x <- rnorm (500)
y <- x + rnorm (100)
z1 <- x - 2*y + 100
z2 <- (z1 +5)*(exp(z1/100)/log(z1))
plot(z1 , z2 , lwd = 3, col = "coral",
xlab = expression (z[1]) , ylab = expression (z[2]) ,
main = expression (
frac ((z [1]+5) *e^frac(z[1] ,100) ,log(z[1]))
)
)
# Plot sample z1 and z2


Suppose we have two different sequences x1 and x2 as

x1 = 1 : 10
x2 = 1 : 10

Simple Plot #3

par(mfrow = c(2, 3))


plot(x1, x2, type = "l", main = "type='l'", lwd = 6)
plot(x1, x2, type = "s", main = "type='s'", lwd = 5)
plot(x1, x2, type = "p", main = "type='p'", lwd = 4)
plot(x1, x2, type = "o", main = "type='o'", lwd = 3)
plot(x1, x2, type = "b", main = "type='b'", lwd = 2)
plot(x1, x2, type = "h", main = "type='h'", lwd = 1)

Barplot

A barplot (or barchart; bargraph) illustrates the association between a numeric


and a categorical variable.

x = rnorm(50)
y = x + rnorm(50)

Barplot Plot

barplot (x, col =2) # Draw barplot


For more details, see https://statisticsglobe.com/barplot-in-r

Boxplot

A boxplot (or box-and-whisker plot) displays the distribution of a numerical


variable based on different statistics namely: minimum non-outlier; first-quartile;
median; third-quartile; and maximum non-outlier.

x = rnorm(50)
y = x + rnorm(50)

Boxplot Plot

boxplot (x, col =2) # Draw boxplot


For more details, see https://statisticsglobe.com/boxplot-in-r
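
The five statistics drawn by a boxplot can also be obtained as numbers with boxplot.stats(). The sketch below assumes the same kind of sample as on this slide; the seed and the name stats are our additions.

```r
set.seed(123)          # seed added here only to make the sketch reproducible
x <- rnorm(50)

stats <- boxplot.stats(x)
print(stats$stats)     # lower whisker, lower hinge, median, upper hinge, upper whisker
print(stats$out)       # observations that would be drawn as outliers
```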

Histogram

A histogram represents the frequencies of values of a variable bucketed into


ranges. The height of each bar shows the amount of observations within each
range.

x = rnorm(50)
y = x + rnorm(50)

Histogram Plot

set.seed (123)
x <- rnorm (50)
y <- x + rnorm (50)
par(mfrow=c(1 ,2) , oma=c(0 ,0 ,0 ,0))
hist(x)
hist(y) # Draw histograms of x & y in one row

For more details see; https://www.tutorialspoint.com/r/r_histograms.htm


Density Plot

A density (kernel density or density trace) plot shows the distribution of a


numerical variable over a continuous interval.

x = rnorm(50)
y = x + rnorm(50)

Density Plot

set.seed (123)
x <- rnorm (50)
y <- x + rnorm (50)
plot( density (x))
polygon ( density (x), col = 1) # Draw density

For more details, see https://statisticsglobe.com/kernel-density-plot-in-base-r

Histogram & Density Plot

A histogram with an overlaid kernel density curve (Gaussian kernel by default) in R.

x = rnorm(50)
y = x + rnorm(50)

Histogram & Density Plot

set.seed (123)
x <- rnorm (50)
y <- x + rnorm (50)
hist(x, prob = TRUE) # Draw histogram and density
lines( density (x), lwd =3, col = "red")

For more details, see https://statisticsglobe.com/kernel-density-plot-in-base-r


A general plot shows a scatterplot, a barplot, a boxplot, a time series plot, a time-based plot,
and the plot of a specified function in a 2×3 window.

General Plot
set.seed (123)
x <- rnorm (500)
y <- x + rnorm (500)
Data_1 <- ts( matrix (x ,500 ,1) ,start=c(0 ,1) ,frequency =12)
Data_2 <- seq(as.Date("2005/1/1"),by="month",length =50)
Data_3 <- factor ( mtcars $cyl)
Data_4 <- function (x) x^2
Data_5 <- rnorm (32)
Data_6 <- rnorm (50)

The ts() function converts a numeric vector into a time series object.


General Plot, Cont’d


par( mfrow =c(2 ,3) , oma=c(0 ,0 ,0 ,0))
# Scatterplot
plot(x, y, main = " Scatterplot ")
# Barplot
plot(Data_3, main = " Barplot ",xlab="Data_3",col =2)
# Boxplot
plot(Data_3, Data_5, main=" Boxplot ", xlab="Data_3",
ylab="Data_5",col =3)
# Time series plot
plot(Data_1, main = "Time series ", col =4)
# Time - based plot
plot(Data_2, Data_6, main = "Time based plot",
xlab="Data_2", ylab="Data_6", col =6, lwd =3)
# Plot a specified function
plot(Data_4, 0, 10, main = expression (x^2) ,col =2, lwd =4)
For more details, see https://r-coder.com/plot-r/

Heatmap

A heatmap (or shading matrix) visualizes individual values of a matrix with


specified colors.
Heatmap
library(ggplot2); library(tidyr)
dir.create("data"); dir.create("output")
download.file(url = "https://tinyurl.com/mine-data-csv",
              destfile = "data/mine-data.csv")
mine.data <- read.csv(file = "data/mine-data.csv")
mine.long <- pivot_longer(data = mine.data,
  cols = -c(1:3), names_to = "Class", values_to = "Abundance")
mine.heatmap <- ggplot(data = mine.long,
  mapping = aes(x = Sample.name, y = Class, fill = Abundance)) +
  geom_tile() + xlab(label = "Sample") +
  scale_fill_gradient(low = "yellow", high = "red")
mine.heatmap # Draw heatmap
For more details see; https://jcoliver.github.io/

Pairs Plot

A pairs plot is a plot of a matrix consisting of scatterplots for each


variable-combination of a data frame.
set.seed (123)
x <- rnorm (50)
y <- x + rnorm (50)
pairs(data.frame(x, y)) # Draw pairs
For more details, see https://statisticsglobe.com/r-pairs-plot-example/

Pairs Plot

set.seed (123)
x1 <- rnorm (1000) # Create variable
x2 <- x1 + rnorm (1000 , 0, 2)
x3 <- 3 * x1 - x2 + rnorm (1000 ,0 ,4)
PR <- data.frame(x1, x2, x3)
pairs(PR) # Draw pairs

For more details, see https://statisticsglobe.com/r-pairs-plot-example/


Pairs Plot
set.seed (123)
library (" ggplot2 ")
library (" GGally ")
x1 <- rnorm (1000) # Create variable x1
x2 <- x1 + rnorm (1000 ,0 ,2) # Create variable x2
x3 <- 3*x1 -x2 + rnorm (1000 ,0 ,4) # Create variable x3
data <- data.frame(x1, x2, x3) # Combine all variables
ggpairs(data) # Apply ggpairs function
cor(x1, x2) # Correlation between x1 and x2
cor(x1, x3) # Correlation between x1 and x3
cor(x2, x3) # Correlation between x2 and x3

For more details, see https://statisticsglobe.com/r-pairs-plot-example/
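
All pairwise correlations can also be obtained in a single call by passing the whole data frame to cor(). This base-R sketch needs neither ggplot2 nor GGally; cor_mat is an illustrative name.

```r
set.seed(123)
x1 <- rnorm(1000)
x2 <- x1 + rnorm(1000, 0, 2)
x3 <- 3 * x1 - x2 + rnorm(1000, 0, 4)

# cor() on a data frame returns the 3x3 matrix of pairwise correlations
cor_mat <- cor(data.frame(x1, x2, x3))
print(round(cor_mat, 3))
```

The off-diagonal entries are the same quantities as the individual cor() calls, for example cor_mat["x1", "x3"] equals cor(x1, x3).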

QQ Plot

A quantile-quantile (QQ) plot compares the empirical quantiles obtained in the


sample versus the quantiles calculated from a theoretical distribution.
set.seed (123)
x <- rnorm (1000)
qqnorm (x) # Draw QQ plot
qqline (x, lwd =3, col = "red") # Add QQ line
For more details, see https://statisticsglobe.com/r-qqplot-qqnorm-qqline-function

3D Plot

The persp() function draws 3D surface plots in R; its arguments allow us to add a title, change the
viewing direction, and add color and shade to the plot. Here, the plotted surface is

$$G = \sqrt{x^2 + y^2}$$

3D Plot

G <- function (x,y) sqrt(x^2 + y^2)


x <- y <- seq(-1, 1, length = 30)
z <- outer(x, y, G)
persp(x, y, z, main="3D Plot", zlab = " Height ",theta =30,
phi =11 , col ="cyan", shade =0.3) # Draw 3D plot

For more details, see https://www.geeksforgeeks.org/creating-3d-plots-in-r-programming-persp-function/
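
The grid of heights passed to persp() is built by outer(), which evaluates G at every combination of x and y. A tiny sketch with two points per axis (values chosen by us so that the results are easy to check by hand):

```r
G <- function(x, y) sqrt(x^2 + y^2)

x <- c(0, 3)
y <- c(0, 4)
z <- outer(x, y, G)   # z[i, j] = G(x[i], y[j])
print(z)
```

For instance, z[2, 2] is G(3, 4), the length of the hypotenuse of a 3-4-5 triangle.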


Inference

In statistics we often use a parametric probability model, with one or more unknown parameters, to describe the behavior of a population. The role of data is to provide estimates of the parameters of the probability model. The three most widely used statistical estimation techniques are:

Maximum Likelihood Estimation
Least-Squares Estimation
Bayesian Estimation

Hint: It would be interesting to investigate Bayes MCMC methods, but due to the
time limit of this workshop this part of statistical inference will be investigated later.

Inference Parameter Estimation

Maximum Likelihood
MLE-Two dimensional

Here, the maximum likelihood estimation method is used to estimate the parameter(s) of a target population (e.g., Weibull) given a sample.
MLE via maxLik function
set.seed(1234)                        # Set seed for reproducibility
library(maxLik)                       # Load maxLik package
alpha <- 2                            # Shape parameter value
lambda <- 1                           # Scale parameter value
x <- rweibull(20, alpha, lambda)      # Simulate random sample
n <- length(x)                        # No. of observations
LL <- function(param) {               # Log-likelihood function
  alpha <- param[1]
  lambda <- param[2]
  sum(log(alpha * lambda * x^(alpha - 1) * exp(-lambda * x^alpha)))
}
fit <- maxLik(LL, start = c(alpha, lambda))   # Maximize, starting at the true values
For more details see Henningsen and Toomet (2011).
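The log-likelihood coded above corresponds to the density f(x) = alpha * lambda * x^(alpha - 1) * exp(-lambda * x^alpha). Expanding the logarithm gives n log(alpha) + n log(lambda) + (alpha - 1) sum(log x) - lambda sum(x^alpha). The sketch below checks that both forms agree and, purely as a cross-check, maximizes the expanded form with base R optim() rather than maxLik. The function names `LL_direct` and `LL_expanded` and the small positive lower bounds are assumptions made for illustration.

```r
set.seed(1234)
x <- rweibull(20, shape = 2, scale = 1)
n <- length(x)

# Log-likelihood written as sum(log(density)), as in the slide
LL_direct <- function(param) {
  a <- param[1]; l <- param[2]
  sum(log(a * l * x^(a - 1) * exp(-l * x^a)))
}

# Same log-likelihood with the logarithm expanded term by term
LL_expanded <- function(param) {
  a <- param[1]; l <- param[2]
  n * log(a) + n * log(l) + (a - 1) * sum(log(x)) - l * sum(x^a)
}

# optim() minimizes, so the negative log-likelihood is passed
fit_optim <- optim(c(2, 1), function(p) -LL_expanded(p),
                   method = "L-BFGS-B", lower = c(1e-6, 1e-6))
fit_optim$par     # Estimates of (alpha, lambda)
```

The expanded form avoids multiplying many small density values before taking the logarithm, which can matter for larger samples.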

MLE via maxLik function, Cont’d
summary(fit)
--------------------------------------------
Maximum Likelihood estimation
BFGS maximization, 12 iterations
Return code 0: successful convergence
Log-Likelihood: -10.36542
2 free parameters
Estimates:
     Estimate Std. error t value  Pr(> t)
[1,]   2.2905     0.3798   6.032 1.62e-09 ***
[2,]   0.8986     0.2184   4.115 3.87e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
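The estimates and standard errors in this summary can be turned into approximate 95% confidence intervals. The sketch below uses the usual normal (Wald) approximation, estimate ± z × standard error. That interval type is an assumption made here and is not part of the workshop script; the numbers are copied from the output above.

```r
est <- c(alpha = 2.2905, lambda = 0.8986)   # Estimates from summary(fit)
se  <- c(alpha = 0.3798, lambda = 0.2184)   # Standard errors from summary(fit)

z  <- qnorm(0.975)                          # Normal quantile for a 95% interval
ci <- cbind(lower = est - z * se, upper = est + z * se)
round(ci, 3)
```

Both intervals contain the true values used in the simulation (alpha = 2, lambda = 1).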


Least-Squares
LSE-Two dimensional

Here, the least-squares estimation method is used to find the best-fitting parameter(s) of a target population (e.g., Weibull) from a data set by minimizing the sum of squared differences between the theoretical and empirical CDFs.
LSE
set.seed(1234)                        # Set seed for reproducibility
alpha <- 2                            # Shape parameter value
lambda <- 1                           # Scale parameter value
start <- c(alpha, lambda)             # Start values
x <- rweibull(20, alpha, lambda)      # Simulate random sample
n <- length(x)                        # No. of observations
lower <- c(0, 0); upper <- c(Inf, Inf)   # Parameter bounds
Dweibull <- function(x, param) {      # Weibull CDF
  alpha <- param[1]
  lambda <- param[2]
  1 - exp(-lambda * x^alpha)
}

LSE, Cont’d
LSE <- function(param, x, CDF) {      # Objective function
  x <- sort(x)
  n <- length(x)
  D <- rep(0, length.out = n)
  for (i in 1:n) { D[i] <- (CDF(x[i], param) - (i / (n + 1)))^2 }
  sum(D)
}

OLS <- function(CDF, start, data, lim_inf, lim_sup) {
  fit <- nlminb(start = start, objective = LSE, x = data, CDF = CDF,
                lower = lim_inf, upper = lim_sup)   # Minimize the sum of squares
  return(fit$par)
}
For more details see Swain et al. (1988).
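Written out, the objective minimized by the LSE function above is the sum of squared differences between the Weibull CDF at the ordered observations and the plotting positions i/(n+1). The notation $x_{(i)}$ for the i-th ordered observation and the symbol $S$ are introduced here for illustration:

```latex
S(\alpha, \lambda) = \sum_{i=1}^{n} \left[ F\!\left(x_{(i)}; \alpha, \lambda\right) - \frac{i}{n+1} \right]^{2},
\qquad
F(x; \alpha, \lambda) = 1 - \exp\!\left(-\lambda x^{\alpha}\right).
```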

LSE, Cont’d
OLS_weibull <- OLS(Dweibull, start, x, lower, upper)
print(OLS_weibull)

[1] 2.1672734 0.9257518

summary(OLS_weibull)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.9258  1.2361  1.5465  1.5465  1.8569  2.1673

Inference Monte Carlo of Parameter Estimation

Monte Carlo simulation is a computerized mathematical technique that generates random sample data from a known distribution for numerical experiments. The method is applied to quantitative risk analysis and decision-making problems. It was first used by scientists working on the atom bomb in the 1940s and has since been adopted by professionals in fields such as finance, project management, energy, manufacturing, engineering, research and development, insurance, oil and gas, and transportation.


Monte Carlo Simulation


Introduction

Monte Carlo Simulation


The main properties of the Monte Carlo method are:
It depends on generating random samples.
Its input distribution must be known.
Its result must be known while performing an experiment.

Monte Carlo Simulation, Cont’d


The main advantages of the Monte Carlo method are:
Easy to implement.
Provides statistical sampling for numerical experiments.
Provides approximate solutions to complex mathematical problems.
The main disadvantages of the Monte Carlo method are:
Time consuming to get the desired output.
Its results only approximate the true values; they are not exact.
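A small experiment illustrates these points. The sketch below is not part of the workshop script: it estimates P(X > 1) for a Weibull variable with shape 2 and scale 1, a quantity chosen here because its exact value exp(-1) is known, so the approximation error is visible. The names `p_hat`, `p_true`, and `se_hat` are introduced for illustration.

```r
set.seed(2023)
N <- 100000                               # No. of random draws
x <- rweibull(N, shape = 2, scale = 1)    # Input distribution is known

p_hat  <- mean(x > 1)                     # Monte Carlo estimate of P(X > 1)
p_true <- exp(-1)                         # Exact value: 1 - F(1) = exp(-1^2)
se_hat <- sqrt(p_hat * (1 - p_hat) / N)   # Standard error of the estimate

c(estimate = p_hat, exact = p_true, std_error = se_hat)
```

The estimate is close to the exact value but not equal to it, and the standard error shrinks only at the rate 1/sqrt(N), which is why accurate results can be time consuming.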

Next, we discuss an R script that runs a Monte Carlo simulation of the Weibull parameters based on complete samples, using two classical estimation methods:

Maximum Likelihood Estimation
Least-Squares Estimation


Monte Carlo Simulation


Maximum Likelihood

Step #1 - Determine the inputs


set.seed(1234)                        # Set seed for reproducibility
library(maxLik)                       # Load maxLik package
alpha = 0.7; lambda = 0.4             # True parameter values
N = 10000; n = 50                     # No. of replications; sample size
w = x = matrix(0, N, n)               # Matrices of generated samples
Est = matrix(0, N, 2)                 # Matrix of simulation outputs
for (i in 1:N) {
  w[i, ] <- runif(n, 0, 1)            # Uniform draws
  x[i, ] <- ((-1 / lambda) * log(1 - w[i, ]))^(1 / alpha)   # Inverse-CDF Weibull sample
  LL <- function(param) {             # Log-likelihood of the i-th sample
    alpha <- param[1]
    lambda <- param[2]
    sum(log(alpha * lambda * x[i, ]^(alpha - 1) * exp(-lambda * x[i, ]^alpha)))
  }
  Est[i, ] <- maxLik(LL, start = c(alpha, lambda))$estimate
}
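The samples in the loop above are generated by inverting the CDF F(x) = 1 - exp(-lambda * x^alpha), which gives x = (-log(1 - u) / lambda)^(1 / alpha) for a uniform draw u. The sketch below checks this against base R's qweibull(); the names `x_inv` and `x_q` are introduced here for illustration. Note that R's Weibull functions use a scale parameter, which under this parameterization equals lambda^(-1/alpha).

```r
alpha <- 0.7; lambda <- 0.4

set.seed(1)
u <- runif(5)                                      # Uniform(0, 1) draws

x_inv <- (-log(1 - u) / lambda)^(1 / alpha)        # Inverse-CDF formula from the loop
x_q   <- qweibull(u, shape = alpha,
                  scale = lambda^(-1 / alpha))     # Same quantiles via base R

all.equal(x_inv, x_q)
```

When lambda = 1, as in the earlier single-sample examples, the scale is also 1, so rweibull(20, alpha, lambda) there agrees with this parameterization.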

Step #2 - Run Monte Carlo experiment and get outputs

MLE_1 = mean(Est[, 1]); MLE_2 = mean(Est[, 2])
MSE_1 = mean((Est[, 1] - alpha)^2); MSE_2 = mean((Est[, 2] - lambda)^2)
MAB_1 = mean(abs(Est[, 1] - alpha)); MAB_2 = mean(abs(Est[, 2] - lambda))

Res_1 = c(MLE_1, MSE_1, MAB_1); Res_1   # Av.Est; MSE; MAB of alpha

[1] 0.719777654 0.007196305 0.065652642

Res_2 = c(MLE_2, MSE_2, MAB_2); Res_2   # Av.Est; MSE; MAB of lambda

[1] 0.398973150 0.006927434 0.066228848
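The three summaries computed above can be written as formulas. Here $\theta$ stands for the true value of alpha or lambda and $\hat{\theta}^{(j)}$ for its estimate in the j-th of the N replications; this notation is introduced here, and MAB is read as mean absolute bias, which the slides do not spell out:

```latex
\bar{\theta} = \frac{1}{N} \sum_{j=1}^{N} \hat{\theta}^{(j)}, \qquad
\mathrm{MSE} = \frac{1}{N} \sum_{j=1}^{N} \left( \hat{\theta}^{(j)} - \theta \right)^{2}, \qquad
\mathrm{MAB} = \frac{1}{N} \sum_{j=1}^{N} \left| \hat{\theta}^{(j)} - \theta \right| .
```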


Monte Carlo Simulation


Least-Squares

Step #1 - Determine the inputs


set.seed(1234)                        # Set seed for reproducibility
alpha = 0.7; lambda = 0.4             # True parameter values
N = 10000                             # No. of replications
n = 50                                # Sample size
w = x = matrix(0, N, n)               # Matrices of generated samples
Est = matrix(0, N, 2)                 # Matrix of simulation outputs
lower = c(0, 0); upper = c(Inf, Inf)  # Parameter bounds

cdf_weibull = function(x, param) {    # Weibull CDF
  alpha = param[1]
  lambda = param[2]
  res = 1 - exp(-lambda * x^alpha)
  return(res)
}

Step #1 - Determine the inputs, Cont’d


LSE = function(param, x, cdf) {       # Objective function
  x = sort(x)
  n = length(x)
  D = rep(0, length.out = n)
  for (i in 1:n) {
    D[i] = (cdf(x[i], param) - (i / (n + 1)))^2
  }
  sum(D)
}

OLS = function(cdf, start, x, lim_inf, lim_sup) {
  fit = nlminb(start = start, objective = LSE, x = x, cdf = cdf,
               lower = lim_inf, upper = lim_sup)   # Minimize the sum of squares
  return(fit$par)
}

Step #2 - Run Monte Carlo experiment and get outputs


for (i in 1:N) {
  w[i, ] <- runif(n, 0, 1)
  x[i, ] <- sort(((-1 / lambda) * log(1 - w[i, ]))^(1 / alpha))
  Est[i, ] <- OLS(cdf = cdf_weibull, start = c(alpha, lambda), x = x[i, ],
                  lim_inf = lower, lim_sup = upper)
}

LSE_1 = mean(Est[, 1]); LSE_2 = mean(Est[, 2])
MSE_1 = mean((Est[, 1] - alpha)^2); MSE_2 = mean((Est[, 2] - lambda)^2)
MAB_1 = mean(abs(Est[, 1] - alpha)); MAB_2 = mean(abs(Est[, 2] - lambda))

Res_1 = c(LSE_1, MSE_1, MAB_1); Res_1   # Av.Est; MSE; MAB of alpha

[1] 0.702144763 0.009249139 0.075082418

Res_2 = c(LSE_2, MSE_2, MAB_2); Res_2   # Av.Est; MSE; MAB of lambda

[1] 0.406889288 0.007449279 0.068068669

Inference Linear Regression Models

Linear Regression
Simple

Regression analysis is a widely used statistical method for modeling the relationship between two variables. One variable is called the predictor variable, whose value is gathered through experiments. The other is called the response variable, whose value is derived from the predictor variable.
Simple Linear Regression
The general mathematical expression for a simple linear regression is:
y = ax + b

y - response variable.
x - predictor variable.
a & b - regression coefficients (slope and intercept).

Simple Linear Regression, Cont’d


In R, the lm() function is used to create the relationship model between the predictor and the response variable. The basic syntax of lm() for simple linear regression is
lm(formula, data)

formula - an expression presenting the relation between x and y.
data - the data frame containing the variables named in the formula.

Simple Linear Regression, Cont’d


The following example models the relationship between height in meters (x) and goal success percentage (y). The simple linear regression model between x and y is fitted as:
y <- c(0.63, 0.81, 0.56, 0.91, 0.47, 0.57, 0.76, 0.72, 0.62, 0.48)
x <- c(1.51, 1.74, 1.38, 1.86, 1.28, 1.36, 1.79, 1.63, 1.52, 1.31)

mydata <- data.frame(y = y, x = x)
model <- lm(y ~ x, data = mydata)   # Apply lm() function
print(summary(model))               # Call fit results
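For a single predictor, the least-squares coefficients that lm() reports have a closed form: the slope is cov(x, y) / var(x) and the intercept is mean(y) - slope * mean(x). The sketch below recomputes them by hand for the same data; the variable names `slope` and `intercept` are introduced here for illustration.

```r
y <- c(0.63, 0.81, 0.56, 0.91, 0.47, 0.57, 0.76, 0.72, 0.62, 0.48)
x <- c(1.51, 1.74, 1.38, 1.86, 1.28, 1.36, 1.79, 1.63, 1.52, 1.31)

slope     <- cov(x, y) / var(x)             # a in y = ax + b
intercept <- mean(y) - slope * mean(x)      # b in y = ax + b

c(intercept = intercept, slope = slope)     # Closed-form coefficients
coef(lm(y ~ x))                             # Same values from lm()
```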

Simple Linear Regression, Cont’d


Call:
lm(formula = y ~ x, data = mydata)

Residuals:
      Min        1Q    Median        3Q       Max
-0.063002 -0.016629  0.000412  0.018944  0.039775

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.38455    0.08049  -4.778  0.00139 **
x            0.67461    0.05191  12.997 1.16e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.03253 on 8 degrees of freedom
Multiple R-squared:  0.9548,  Adjusted R-squared:  0.9491
F-statistic: 168.9 on 1 and 8 DF,  p-value: 1.164e-06

Simple Linear Regression, Cont’d


print(model)                        # Call the fitted model

Call:
lm(formula = y ~ x, data = mydata)

Coefficients:
(Intercept)            x
    -0.3846       0.6746

a <- data.frame(x = 1.75)
pred <- predict(model, a)           # Predict y if x is 1.75
print(pred)
        1
0.7960174
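The point prediction is simply intercept + slope × 1.75. predict() can also attach an interval to it through its `interval` argument. The sketch below refits the model so that it runs on its own; the names `new_x`, `by_hand`, and `pred_int`, and the choice of a 95% prediction interval, are assumptions made for illustration.

```r
y <- c(0.63, 0.81, 0.56, 0.91, 0.47, 0.57, 0.76, 0.72, 0.62, 0.48)
x <- c(1.51, 1.74, 1.38, 1.86, 1.28, 1.36, 1.79, 1.63, 1.52, 1.31)
mydata <- data.frame(y = y, x = x)
model  <- lm(y ~ x, data = mydata)

new_x   <- data.frame(x = 1.75)
by_hand <- unname(coef(model)[1] + coef(model)[2] * 1.75)   # Intercept + slope * x

# Columns: fit (point prediction), lwr and upr (interval limits)
pred_int <- predict(model, new_x, interval = "prediction", level = 0.95)
pred_int
```

A prediction interval covers a single new observation, so it is wider than the confidence interval for the mean response obtained with interval = "confidence".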

Simple Linear Regression, Cont’d


plot(x, y, cex = 1.3, pch = 16, col = "blue",
     main = "Height & Goal Success Regression",
     xlab = "Height in Meters",
     ylab = "Goal Success Percentage")   # Scatter plot of the data
abline(lm(y ~ x), lwd = 3)               # Add the fitted regression line


Linear Regression
Multiple

Multiple regression extends linear regression to relationships among more than two variables. In simple linear regression we have one predictor and one response variable, but in multiple regression we have more than one predictor variable and one response variable.
Multiple Linear Regression
The general mathematical expression for a multiple linear regression is:
$y = a + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n$

y - response variable.
x_1, x_2, ..., x_n - predictor variables.
a, b_1, b_2, ..., b_n - regression coefficients.

Multiple Linear Regression, Cont’d


In R, the lm() function is used to create the relationship model between the predictors and the response variable. The basic syntax of lm() for multiple linear regression is
lm(formula, data)

formula - an expression presenting the relation between the response variable and predictor variables.
data - the data frame containing the variables named in the formula.

Multiple Linear Regression, Cont’d


The following example models the relationship between height in meters (x1), age (x2), and goal success percentage (y). The multiple linear regression model between x1, x2, and y is fitted as:
y  <- c(0.63, 0.81, 0.56, 0.91, 0.47, 0.57, 0.76, 0.72, 0.62, 0.48)
x1 <- c(1.51, 1.74, 1.38, 1.86, 1.28, 1.36, 1.79, 1.63, 1.52, 1.31)
x2 <- c(25, 22, 19, 20, 28, 26, 29, 31, 25, 24)

mydata <- data.frame(y = y, x1 = x1, x2 = x2)
model <- lm(y ~ x1 + x2, data = mydata)   # Apply lm() function
print(summary(model))                     # Call fit results

Multiple Linear Regression, Cont’d


Call:
lm(formula = y ~ x1 + x2, data = mydata)

Residuals:
      Min        1Q    Median        3Q       Max
-0.045299 -0.018143 -0.000696  0.018518  0.040738

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.277006   0.101562  -2.727   0.0294 *
x1           0.670162   0.047952  13.976 2.27e-06 ***
x2          -0.004044   0.002607  -1.551   0.1647
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.03 on 7 degrees of freedom
Multiple R-squared:  0.9664,  Adjusted R-squared:  0.9567
F-statistic: 100.5 on 2 and 7 DF,  p-value: 6.988e-06

Multiple Linear Regression, Cont’d


print(model)                              # Call the fitted model

Call:
lm(formula = y ~ x1 + x2, data = mydata)

Coefficients:
(Intercept)           x1           x2
  -0.277006     0.670162    -0.004044

a <- data.frame(x1 = 1.75, x2 = 25)
pred <- predict(model, a)                 # Predict y if x1 is 1.75 and x2 is 25
print(pred)
        1
0.7946699
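With several predictors, the least-squares coefficients solve the normal equations (X'X) b = X'y, where X is the design matrix with a leading column of ones for the intercept. The sketch below reproduces the lm() coefficients that way; the names `X` and `beta_hat` are introduced for illustration, and lm() itself uses a QR decomposition rather than this direct solve.

```r
y  <- c(0.63, 0.81, 0.56, 0.91, 0.47, 0.57, 0.76, 0.72, 0.62, 0.48)
x1 <- c(1.51, 1.74, 1.38, 1.86, 1.28, 1.36, 1.79, 1.63, 1.52, 1.31)
x2 <- c(25, 22, 19, 20, 28, 26, 29, 31, 25, 24)

X <- cbind(1, x1, x2)                            # Design matrix with intercept column
beta_hat <- solve(crossprod(X), crossprod(X, y)) # Solve (X'X) b = X'y

as.vector(beta_hat)                              # (a, b1, b2) by hand
coef(lm(y ~ x1 + x2))                            # Same values from lm()
```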

Inference Monte Carlo of Linear Regression Models

Monte Carlo Simulation


Linear Regression

Next, we discuss R scripts that run Monte Carlo simulations of both simple and multiple linear regression models.

Simple Linear Regression

Monte Carlo Simulation of Simple Linear Regression


set.seed(1234)                              # Set seed for reproducibility
N = 10000                                   # No. of replications
n = 50; sd = 2                              # Sample size; error standard deviation
beta_0 = 150; beta_1 = 4                    # True coefficient values
Est = MSE = MAB = matrix(0, N, 2)           # Output matrices
linear = function(n, beta_0, beta_1) {      # Simulate data and fit the model
  x1 = rnorm(n)
  error = rnorm(n, 0, sd)
  y = beta_0 + beta_1 * x1 + error
  mydata = data.frame(y, x1)
  lm(y ~ x1, data = mydata)                 # Return the fitted model
}
Res = replicate(N, linear(n, beta_0, beta_1), simplify = FALSE)

Monte Carlo Simulation of Simple Linear Regression, Cont’d


theta = c(beta_0, beta_1)

for (i in 1:N) {
  Est[i, ] = coef(Res[[i]])                             # Estimates
  MSE[i, ] = (theta - coef(Res[[i]]))^2                 # Squared errors
  MAB[i, ] = abs(theta - coef(Res[[i]])) / abs(theta)   # Relative absolute biases
}
Reg_1 = mean(Est[, 1]); Reg_2 = mean(Est[, 2])          # Av.Ests
MSE_1 = mean(MSE[, 1]); MSE_2 = mean(MSE[, 2])          # MSEs
MAB_1 = mean(MAB[, 1]); MAB_2 = mean(MAB[, 2])          # MABs (relative to true value)

Res_1 = c(Reg_1, MSE_1, MAB_1); Res_1

[1] 149.999998 0.08201119 0.00151946

Res_2 = c(Reg_2, MSE_2, MAB_2); Res_2

[1] 3.99619458 0.08534729 0.05788384
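The same experiment can be written more compactly by returning only the coefficients from each replication, which avoids storing N fitted model objects. The sketch below is an alternative arrangement, not the workshop's script: the helper `sim_once`, the names `sigma`, `beta`, `avg`, and `mse`, and the smaller N = 1000 (chosen to keep the run short) are assumptions made for illustration, so the numbers differ slightly from those above.

```r
set.seed(1234)
N <- 1000; n <- 50; sigma <- 2              # Replications; sample size; error sd
beta <- c(150, 4)                           # True intercept and slope

sim_once <- function() {                    # One replication: simulate, fit, return coefs
  x1 <- rnorm(n)
  y  <- beta[1] + beta[2] * x1 + rnorm(n, 0, sigma)
  coef(lm(y ~ x1))
}

Est <- t(replicate(N, sim_once()))          # N x 2 matrix of estimates
avg <- colMeans(Est)                        # Average estimates
mse <- colMeans(sweep(Est, 2, beta)^2)      # Mean squared errors
rbind(avg, mse)
```

For this design the variance of the intercept estimate is roughly sigma^2 / n = 0.08, which is in line with the MSE of about 0.082 reported above.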

Multiple Linear Regression

Monte Carlo Simulation of Multiple Linear Regression


set.seed(1234)                              # Set seed for reproducibility
N = 10000                                   # No. of replications
n = 50; sd = 2; ngroups = 2                 # Sample size; error sd; no. of predictors
beta_0 = 5; beta_1 = -2; beta_2 = 4         # True parameter values
Est = MSE = MAB = matrix(0, N, 3)           # Output matrices
# Simulate data and fit the target lm
linear = function(n, beta_0, beta_1, beta_2) {
  x = rnorm(ngroups * n)
  x1 = x[1:n]
  x2 = x[(n + 1):(ngroups * n)]
  error = rnorm(n, 0, sd)
  y = beta_0 + beta_1 * x1 + beta_2 * x2 + error
  mydata = data.frame(y, x1, x2)
  lm(y ~ x1 + x2, data = mydata)            # Return the fitted model
}
Res = replicate(N, linear(n, beta_0, beta_1, beta_2), simplify = FALSE)

Monte Carlo Simulation of Multiple Linear Regression, Cont’d


theta = c(beta_0, beta_1, beta_2)

for (i in 1:N) {
  Est[i, ] = coef(Res[[i]])                             # Estimates
  MSE[i, ] = (theta - coef(Res[[i]]))^2                 # Squared errors
  MAB[i, ] = abs(theta - coef(Res[[i]])) / abs(theta)   # Relative absolute biases
}
# Calculate Av. Ests
Reg_1 = mean(Est[, 1]); Reg_2 = mean(Est[, 2]); Reg_3 = mean(Est[, 3])
# Calculate MSEs
MSE_1 = mean(MSE[, 1]); MSE_2 = mean(MSE[, 2]); MSE_3 = mean(MSE[, 3])
# Calculate MABs
MAB_1 = mean(MAB[, 1]); MAB_2 = mean(MAB[, 2]); MAB_3 = mean(MAB[, 3])

Res_1 = c(Reg_1, MSE_1, MAB_1); Res_1

[1] 4.99598587 0.08406701 0.04635600

Res_2 = c(Reg_2, MSE_2, MAB_2); Res_2

[1] -2.0000555 0.0876486 0.1174803

Res_3 = c(Reg_3, MSE_3, MAB_3); Res_3

[1] 3.99709386 0.08689668 0.05820485


This presentation is a new edition of the workshop entitled "R Programming Language for Data Analytics", Cairo University, by Elshahhat (2022).


Chang, W. (2018). R Graphics Cookbook: Practical Recipes for Visualizing Data. O’Reilly Media, Inc.
Crawley, M. J. (2012). The R Book. John Wiley & Sons.
Davies, T. M. (2016). The Book of R: A First Course in Programming and Statistics. No Starch Press.
De Brouwer, P. J. (2020). The Big R-Book: From Data Science to Learning Machines and Big Data. John Wiley & Sons.
Elshahhat, A. (2022). R programming language for data analytics. The 55th Annual International Conference on Data Science, Cairo University.
Grolemund, G. (2014). Hands-on Programming with R: Write Your Own Functions and Simulations. O’Reilly Media, Inc.
Henningsen, A. and Toomet, O. (2011). maxLik: A package for maximum likelihood estimation in R. Computational Statistics, 26(3):443–458.
Kabacoff, R. I. (2015). R in Action: Data Analysis and Graphics with R. Simon and Schuster, Shelter Island, New York.
Lander, J. P. (2014). R for Everyone: Advanced Analytics and Graphics. Pearson Education.
Matloff, N. (2011). The Art of R Programming: A Tour of Statistical Software Design. No Starch Press.
Mount, J. and Zumel, N. (2019). Practical Data Science with R. Simon and Schuster, Shelter Island, New York.
Swain, J. J., Venkatraman, S., and Wilson, J. R. (1988). Least-squares estimation of distribution functions in Johnson’s translation system. Journal of Statistical Computation and Simulation, 29(4):271–297.
Wickham, H. and Grolemund, G. (2016). R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, Inc.