Data Analysis Using r 04-07-24
Data Analysis Using r 04-07-24
called as Statistics
– Dr O to
According V the
Shanmuga
definitionSundaram
of Statistics
• Associate Professor / PG Mathematics
Studying
– STC, POLLACHI
Analysis
Interpretation
Presenting
Organising the data or finalising the data
DATA ANALYTICS USING R
Example
1. The number of people in the town who are
watching TV out of total population in the town.
Example
1. Regression Model
2. Machine learning Algorithms
3. Time series Model
DATA ANALYTICS USING R
What is R Programming?
– Dr O V Shanmuga Sundaram
• Associate Professor / PG Mathematics
– STC, POLLACHI
cs
ram
1. Statistical Interfaces a ti
d a e m
u n
a th
2. Data Analysis S M
u or ga / PG
m
han ess
3. Machine Learning oAlgorithms
f
V S Pr I
O a e
t LAC H
r soc POL i
– D s TC,
• A S
–
cs
ram a ti
for data analysis.
d a e m
u n a th
For different types of calculation S
a on G M
arrays, lists and vectors, R
u g r / P
n m sso
a ofe
contains a suite ofhoperators.
V S Pr I
O a e C H
tdataLAhandling
r soc POL
It provides i
effective and storage facility.
– D
• As TC,
–S
It is an open-source, powerful, and highly extensible software.
2. Academic Research cs
ram a ti
d a h em
3. Government (FDA, National u n Weather
a t Service)
S
a / PG M
g
u or
4. Retail m
n fess
h a
S Pro I
O V te C H
5. SocialrMediac i a LL A
– D Ass TC, PO
o
• –S
6. Data Journalism
7. Manufacturing
cs
ram a ti
d a e m
u n
a th
S M
u or ga / PG
m
h an ess
of
V S Pr I
e H
r O c iat LAC
OL
– D sso , P
• A – STC
Definition: Compiler
A compiler is a special program that translates a
programming language’s source m code intoti cs machine
code, bytecode or another programming a ra m a language.
d
n written e
th in a high level,
The source code is typically u
S GM a
human readable language a
g r /suchP as Java or C, C++.
u
m sso
n
a ofe
Sh r
V
Definition: Interpretere P H I
O i a t LAC
D r soc directly
An Interpreter OL executes instructions written
– A s TC,
in a programming
P
or scripting language without
• –S
previously converting them to an object code or
machine code. Examples of interpreted languages
are Perl, Python, R and MATLAB.
cs
ram a ti
d a e m
u n
a th
S M
u or ga / PG
m
h an ess
of
V S Pr I
e H
r O c iat LAC
OL
– D sso , P
• A – STC
cs
ram a ti
d a e m
u n
a th
S M
u or ga / PG
m
h an ess
of
V S Pr I
e H
r O c iat LAC
OL
– D sso , P
• A – STC
[[3]]
function (x) .Primitive("sin")
>
DATA ANALYTICS USING R
R Data Types
Matrices
A matrix is a two-dimensional rectangular data set.
It can be created using a vector input to the matrix
function.
m ti cs
#create a matrix a ra m a
n d th e
u a
> M=matrix(c('a','b','c','d','e','f'), nrow=2,ncol=3,byrow=TRUE)
S M
> print(M)
g a / PG
[,1] [,2] [,3]
m u or
[1,] "a" "b" "c" a n fess
[2,] "d" "e" "f" S h ro I
V e P H
> O t
i a LL A C
r
D ss , PO
oc
> M=matrix(c('a','b','c','d','e','f'), nrow=3,ncol=2,byrow=TRUE)
–
> print(M) • A TC
– S
[,1] [,2]
[1,] "a" "b"
[2,] "c" "d"
[3,] "e" "f"
>
>
DATA ANALYTICS USING R
R Data Types
Factors
Factors are the r-objects which are created using a
vector.
cs
ram
a ti
It stores the vector along with d athe h
m
distinct
e values of
u n a t
the elements in the vector a S labels.
as G M
u g r/P
The labels are an m sso
h always
fe character irrespective of
S Pro I
whether itr O
V t
is numerice C
or
H
character or Boolean etc, in
c i a LL A
– D Asso TC, PO
the input• vector.
–S They are useful in statistical
modeling.
Factors are created using the factor() function.
The nlevels functions gives the count of levels.
DATA ANALYTICS USING R
R Data Types
Factors
# create a vector
> apple_colors<-
c('green','green','yellow','red','red','red','green')
m ti cs
>
a ra m a
#create a factor n d th e
S u M a
a / PG
> factor_apple<-factor(apple_colors)
g
> m u or
a n fess
#print the factor S h ro I
V e P H
O
> print(factor_apple) t
i a LL AC
r c
– D green
[1] green A s o
s Cyellow
, PO red red red green
• T
– S yellow
Levels: green red
>
> print(nlevels(factor_apple))
[1] 3
>
DATA ANALYTICS USING R
R Data Types
Data Frames
Data frames are tabular data objects.
Unlike a matrix in data frame each column can
contain different modes of data.
The first column can be numericrwhile m ti cs
a
a he m
the
a second
column can be character and n d
third tcolumn can be
u
S GM a
logical. a
g r/P
It is a list of vectors u
mof equal o length.
n
a ofe s s
Sh r
V e P H I
Data Frames O are i t LAC using the data.frame()
created
a
D r soc POL
–
function. A s TC,
• –S
cs
ram a ti
d a e m
u n
a th
S M
u or ga / PG
m
h an ess
of
V S Pr I
e H
r O c iat LAC
OL
– D sso , P
• A – STC
cs
ram a ti
d a e m
u n
a th
S M
u or ga / PG
m
h an ess
of
V S Pr I
e H
r O c iat LAC
OL
– D sso , P
• A – STC
cs
ram a ti
d a e m
u n
a th
S M
u or ga / PG
m
h an ess
of
V S Pr I
e H
r O c iat LAC
OL
– D sso , P
• A – STC
Output[1]
STC COLLEGE POLLACHI
DATA ANALYTICS USING R
Syntax of R Programming
A sample program is generated
cs
ram a ti
d a e m
u n
a th
S M
u or ga / PG
m
h an essof
V S Pr I
If we run the program
O a e
t LAC H
i
r soc POL
– D s ,
• A – STC
>string”Hello World”
print(string)
DATA ANALYTICS USING R
Syntax of R Programming
1. Multi-line comment
#Trick for multi-line comment
if(False){
cs
ram
a ti
d a
‘R is an interpreted computer programming h e m language
u n a t
which was created by a SRossG MIhaka and Robert
u g r/P
n m sso
Gentleman at hthe a oUniversity
fe of Auckland, New
V S Pr I
e
O ciat LLAC
Zealand.,’r }
H
– D Asso TC, PO
#My First• program
–S in R programming
>string”Hello World”
print(string)
Syntax
– A s TC,
• –S Example:
For equal to operators Var1 “hello”
Variable name = value Print(var1)
For leftward operator Output: “hello”
Variable namevalue
DATA ANALYTICS USING R
Rules for naming a R variable
1. A valid variable name consists of a combination of
alphabets, numbers, dot(.), and underscore (_)
characters. Ex. Var.1_ is valid
m ti cs
2. No other special character is allowed. a ra m a
n d th e
Ex. Var$! Or var#1 – invalid S u M a
3.It can start with alphabets g a or P G characters.
dot
m u or /
Ex. .var and vara–nvalid es s
Sh with
4.It should notV start Pr of
numbers
I or underscore
O a te AC H
Ex. 2varr or _var
c i – Oinvalid
L L
– D ss , P
o
5. If a variable
• A – starts
ST
C with a dot, the next thing after
the dot cannot be a number Ex. .3var – invalid
cs
ram
Roll No. Student name CATI
a ti
CATII
d a e m
1 A1
u n
a
A
th
26 32
S M
ga / PG
2 A2 B 25 25
3 A3
mu or C 19 31
4 A4
h an ess
of D 14 26
V S Pr I
e H
5 A5 E 25 28
O i a t LAC
Dr soc POL
6 A6 F 32 32
– 7
A s TC,
A7 G 29 42
•8 –S
A8 H 25 26
9 A9 I 31 38
10 A10 J 35 39
11 A11 K 33 31
12 A12 L 35 36
m
/
1 A1
a ofe
A n26
ss 32
h
V S e Pr CHI
2 A2 B 25 25
O i a t LA
r soc POL
3 A3 C 19 31 Roll No. Student name CATI CATII
4
–
A4D s , D 14 26 1 A1 A 26 32
5 A5 • A – STEC 25 28 2 A2 B 25 25
6 A6 F 32 32 3 A3 C 19 31
4 A4 D 14 26
5 A5 E 25 28
6 A6 F 32 32
7 A7 G 29 42
8 A8 H 25 26
n m sA2so
5 28 a 2
h 3 rof A3 e B 25 25
6 32 V S P I C 19 31
O a te AC H A4
7 42 r i
c P5OL
4
L
D 14 26
D
–26 As TC, s o A5 E 25 28
8 • mean(dataA[2:5,]$
9 38 –S CATII ) # To find the mean CATII
marks of 2nd row to 5th row
10 39 [1] 27.5
11 31
12 36
2 A2 25 25
5 A5 25 28
cs
ram a ti
d a e m
u n
a th
S M
u or ga / PG
m
h an ess
of
V S Pr I
e H
r O c iat LAC
OL
– D sso , P
• A – STC
Bank of America makes use of R for financial reporting. With the help of
R, the data scientists at BOA are able to analyze financial losses and
DATA ANALYTICS USING R
make use of R’s visualization tools.
HEALTHCARE
m i c s
2. R is most widely used for exploratory data analysis. r a t
a most popular
R’s
package ggplot2 is considered to be one d aof thehbest
m
e visualization
u n a t
S
libraries due to its aesthetics and interactivity.
a / PG M
g
u toovalidate
r
3.
n
R also allows hypothesis testingm s s statistical models.
h a ofe
4. V S
You can find a correlation Pr
between I the variables in R using
O a e
t used A C H
r soc POL
the lm() function thati is L for establishing linear regression as well
– D
Aslinear
as multivariable
• TC
,regression.
–S
5. Moreover, with the help of R, you can develop predictive models that
make use of machine learning algorithms to find the occurrences of
future events.
cs
ram a ti
d a e m
n th
u status a and its social network
Facebook – Facebook uses R to update S
a colleague M
G interactions with R.
graph. It is also used for predictingg P
u or on Hadoop. It also relies on R for
/
Ford Motor Company – m Ford relies
a n e s s
statistical analysis as
S h wellroas f carrying out data-driven support for
decision making.V e P H I
O i a t important
A C
Foursquare r – R isc an L L stack behind Foursquare’s famed
– D Assengine.
recommendation
o
C ,P
O
John Deere • – Statisticians
–S
T at John Deere use R for time series
modeling and also geospatial analysis in a reliable and reproducible
way. The results are then integrated with Excel and SAP.
New York Times – R is used in the news cycle at The New York Times
to crunch data and prepare graphics before they go for printing.
ANZ Bank – ANZ, the fourth largest bank in Australia uses R for its
credit risk analysis. DATA ANALYTICS USING R
cs
ram a ti
d a e m
u n
a th
S M
u or ga / PG
m
h an ess
of
V S Pr I
e H
r O c iat LAC
OL
– D sso , P
• A – STC
cs
ram a ti
d a e m
u n
a th
S M
u or ga / PG
m
h an ess
of
V S Pr I
e H
r O c iat LAC
OL
– D sso , P
• A – STC