Data Analysis Using r 04-07-24

The document provides an overview of data analysis using R programming, highlighting its definition, applications, and advantages. It explains the significance of statistics and statistical computing, the features of R, and its major uses in various fields such as machine learning and data journalism. Additionally, it discusses the advantages and disadvantages of R programming, emphasizing its open-source nature and active community support.

Data Analysis using R Programming

Dr. O.V.SHANMUGA SUNDARAM


Associate Professor/Mathematics
Sree Saraswathi Thyagaraja College
Pollachi-642107
DATA ANALYTICS USING R
What is Statistics?

Statistics is a branch of Mathematics.

The study and manipulation of data is called Statistics.

According to the definition of Statistics, it covers:
• Studying
• Analysis
• Interpretation
• Presenting
• Organising the data or finalising the data
Example
1. The number of people in a town who watch TV, out of the total
population of the town.

What is Statistical Computing?


Statistical Computing is the bond between Statistics and Computer Science.
It is also called Computational Statistics.

Example
1. Regression Model
2. Machine learning Algorithms
3. Time series Model
What is R Programming?

R is a general-purpose programming language.
It is an interpreted programming language: the code is executed line by line.
R programming is mainly used in data analysis and research fields.
R supports procedural programming with functions and, for some functions,
object-oriented programming with generic functions.
It is widely used as statistical software.
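The two styles mentioned above can be sketched as follows. This is only an illustration: the names circle_area, describe, "student" and "course" are made up for this sketch and do not come from the slides.

```r
# Procedural style: an ordinary function.
circle_area <- function(radius) {
  pi * radius^2
}

# Object-oriented style with a generic function (S3):
# the method that runs depends on the class of the argument.
describe <- function(x, ...) {
  UseMethod("describe")
}

describe.student <- function(x, ...) {
  paste("Student:", x$name)
}

describe.course <- function(x, ...) {
  paste("Course:", x$title)
}

# Fallback used when no class-specific method exists.
describe.default <- function(x, ...) {
  "Unknown object"
}

s <- structure(list(name = "A"), class = "student")
k <- structure(list(title = "Data Analytics using R"), class = "course")

print(circle_area(2))
print(describe(s))
print(describe(k))
```

The same call, describe(), gives a different result for each class, which is what "generic function" means here.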
Why learn R Programming language?

R is one of the most popular statistical programming languages among scientists.
It is heavily used in the fields of Machine Learning, scientific computing and
Statistical Analysis.
Since R is an interpreted programming language, you can run your code without
any compiler, using the interpreter. This makes development easier.
R can be used to perform vector calculations. It is a vector language.
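A minimal sketch of what "vector calculations" means in practice. The variable names and the pass mark of 20 are assumptions chosen for illustration.

```r
# One command operates on every element of the vector; no loop is needed.
marks <- c(26, 25, 19, 14)

scaled <- marks * 2                 # multiply every element by 2
bonus  <- marks + c(1, 2, 3, 4)     # element-wise addition of two vectors
passed <- marks >= 20               # element-wise comparison (assumed pass mark: 20)
total  <- sum(marks)                # reduce the whole vector to one value

print(scaled)
print(bonus)
print(passed)
print(total)
```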


Major uses of R Programming

1. Statistical Inference
2. Data Analysis
3. Machine Learning Algorithms


History of R Programming
• R grew out of the S programming language, which dates from the 1970s
• S was updated in the 1990s as R
• Designed by Ross Ihaka and Robert Gentleman, statisticians at the
University of Auckland, New Zealand
• The name R comes from the first letter of their first names
• It is a free software environment for statistical computing and graphics
• Currently, packages for R are hosted on the CRAN (Comprehensive R Archive
Network) package repository
• The designers aimed to provide a free and flexible software tool for
statistical analysis and data visualization, making it accessible to all
statisticians, researchers and data analysts worldwide.
Features of R Programming Language
R is a domain-specific programming language which aims to do data analysis.
It has some unique features which make it very powerful, the most important
arguably being the notation of vectors.
These vectors allow us to perform a complex operation on a set of values in a
single command.
R programming has the following features:
It is a simple and effective programming language which has been well developed.
It is data analysis software.
It is a well-designed, easy, and effective language which has the concepts of
user-defined functions, looping, conditionals, and various I/O features.
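A small sketch that puts a user-defined function, a loop, a conditional and console output together. The function name grade_marks and the default pass mark of 20 are assumptions for illustration only.

```r
# User-defined function with a default argument.
grade_marks <- function(marks, pass_mark = 20) {
  result <- character(length(marks))
  # Looping over the positions of the vector.
  for (i in seq_along(marks)) {
    # Conditional: decide the label for each mark.
    if (marks[i] >= pass_mark) {
      result[i] <- "pass"
    } else {
      result[i] <- "fail"
    }
  }
  result
}

grades <- grade_marks(c(26, 25, 19, 14))

# Simple output to the console.
cat("Grades:", grades, "\n")
```

In everyday R the loop would usually be replaced by the vectorised ifelse(marks >= pass_mark, "pass", "fail"); the loop is kept here only to show the construct.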
It has a consistent and incorporated set of tools which are used for data
analysis.
For different types of calculation on arrays, lists and vectors, R contains a
suite of operators.
It provides effective data handling and storage facility.
It is an open-source, powerful, and highly extensible software.
It provides highly extensible graphical techniques.
It allows us to perform multiple calculations using vectors.
R is an interpreted language.
Application of ‘R’ Programming

1. Fintech Companies (Financial Services)
2. Academic Research
3. Government (FDA, National Weather Service)
4. Retail
5. Social Media
6. Data Journalism
7. Manufacturing
8. Healthcare


The companies or organizations that use R
Airbnb
Microsoft
Uber
Facebook
Ford
Google
X
IBM
American Express
HP
Advantages and Disadvantages of ‘R’ Programming
1. Excellent for Statistical Computing and Analysis
R is a statistical language created by statisticians. Thus, it excels in
statistical computation. R is one of the most used programming languages for
developing statistical tools.
2. Open Source
R is an open-source programming language. Anyone can work with R without any
license or fee. Due to this, R has a huge community that contributes to its
environment.
3. A Large Variety of Libraries
R's massive community support has resulted in a very large collection of
libraries. R is famous for its graphical libraries. These libraries support
and enhance the R development environment. R has libraries with a huge
variety of applications.
4. Cross-platform Support
R is machine independent: it supports cross-platform operation. Thus, it is
usable on many different operating systems.
5. Supports Various Data Types
R can perform operations on vectors, arrays, matrices, and various other data
objects of varying sizes.
6. Can do Data Cleansing, Data Wrangling, and Web Scraping
R can collect data from the internet through web scraping and other means. It
can also perform data cleansing. Data cleansing is the process of detecting
and removing or correcting inaccurate or corrupt records. R is also useful for
data wrangling, which is the process of converting raw data into the desired
format for easier consumption.
7. Powerful Graphics
R has extensive libraries that can produce production-quality graphs and
visualizations. These graphics can be of static as well as dynamic nature.
8. Highly Active Community
The R community is very active. There are users from all around the world to
help and support you. Many of the latest ideas and technologies appear in the
R community.
9. Parallel and Distributed Computing
Using libraries like ddR or multidplyr, R can process large data sets using
parallel or distributed computing.
10. Doesn't need a Compiler
R is an interpreted language. This means that it does not need a compiler to
turn the code into an executable program. Instead, R interprets the provided
code into lower-level calls and pre-compiled code.
11. Compatible with other Programming Languages
R is compatible with other languages like C, C++, and FORTRAN. Other languages
like .NET, Java and Python can also directly manipulate R objects.


12. Used in Machine Learning
R can be useful for machine learning as well. Facebook does a lot of its
machine learning research with R. Sentiment analysis and mood prediction are
done using R. The best use of R when it comes to machine learning is in
exploration or when building one-off models.
13. Can Interact with Databases
R contains several packages that enable it to interact with databases. Some
of these packages are ROracle, Open Database Connectivity (ODBC) Protocol,
RMySQL, etc.
14. Comprehensive Environment
R has a very comprehensive development environment. It helps in statistical
computing as well as software development. R is an object-oriented programming
language. It also has a robust package called Shiny which can produce
full-fledged web apps. R can also be useful for developing software packages.
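Advantage 6 above describes data cleansing as detecting and removing or correcting bad records. A minimal base-R sketch of that idea follows. The data frame, its column names and the assumed valid range of 0 to 50 marks are all invented for illustration.

```r
# Raw records with typical problems: a duplicate row, a non-numeric
# entry ("n/a") and an out-of-range value (140).
raw <- data.frame(
  name  = c("A", "B", "B", "C", "D"),
  marks = c("26", "25", "25", "n/a", "140"),
  stringsAsFactors = FALSE
)

# Step 1: remove exact duplicate rows.
clean <- unique(raw)

# Step 2: convert marks to numbers; entries that cannot be parsed become NA.
clean$marks <- suppressWarnings(as.numeric(clean$marks))

# Step 3: drop records whose marks are missing.
clean <- clean[!is.na(clean$marks), ]

# Step 4: keep only marks inside the assumed valid range (0 to 50).
clean <- clean[clean$marks >= 0 & clean$marks <= 50, ]

print(clean)
```

Whether a bad record should be removed or corrected depends on the data; this sketch only removes.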


Disadvantages of ‘R’ Programming
1. Steep Learning Curve
As many have said, R makes easy things hard, and hard things easy. R's syntax
is very different from other languages, and so are its data types. The
learning curve for R is pretty steep for a beginner. Though R is a bit
difficult in the beginning, data science enthusiasts still prefer to learn it
due to the amazing features of R.
2. Some Packages may be of Poor Quality
CRAN houses more than 10,000 libraries and packages. Some of them are
redundant as well. Due to the large quantity, some of the packages may be of
poor quality.
3. Poor Memory Management
R commands don't concern themselves with memory management. As a result, R
can take up all the available memory.
4. Slow Speed
The programs and functions in R are spread across different packages. This
can make it slower than alternatives such as MATLAB and Python.
5. Poor Security
R lacks basic security measures. So making web apps with it is not always safe.
6. No Dedicated Support Team
R has no dedicated support team to help a user with their issues and
problems. But the community is quite large, so everybody helps each other out.
7. Flexible Syntax
R is a flexible programming language and there are no strict guidelines to
follow. You need to maintain proper coding standards to avoid messy and
complicated code.
Installation of ‘R’ Programming

Definition: Compiler
A compiler is a special program that translates a programming language's
source code into machine code, bytecode or another programming language.
The source code is typically written in a high-level, human-readable language
such as Java, C or C++.

Definition: Interpreter
An interpreter directly executes instructions written in a programming or
scripting language without previously converting them to object code or
machine code. Examples of interpreted languages are Perl, Python, R and MATLAB.


Installation of ‘R’ Programming

HOW COMPILER WORKS

SOURCE CODE -> COMPILER -> MACHINE CODE -> OUTPUT

HOW INTERPRETER WORKS

SOURCE CODE -> INTERPRETER -> OUTPUT




Installation of ‘R’ Programming
COMPILER                              INTERPRETER

A compiler takes the entire           An interpreter takes a single
program in one go.                    line of code at a time.

The compiler generates an             The interpreter never produces
intermediate machine code.            any intermediate machine code.

The compiler is best suited for       An interpreter is best suited for
the production environment.           a software development environment.

The compiler is used by               An interpreter is used by
programming languages such as         programming languages such as
C, C++, C#, Scala, Java, etc.         Python, PHP, Perl, Ruby, etc.
R Data Types

The variables are assigned with R-objects and the data type of the R-object
becomes the data type of the variable. There are many types of R-objects:
Vectors
Lists
Matrices
Arrays
Factors
Data Frames
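A short sketch of the statement above that a variable takes the type of whatever object is assigned to it. The variable names are arbitrary.

```r
# The same variable name can hold different kinds of R-objects over time;
# class() reports the type of the object currently assigned.
x <- c(1, 2, 3)
class_numeric <- class(x)

x <- "three"
class_character <- class(x)

x <- list(1, "a", TRUE)
class_list <- class(x)

print(c(class_numeric, class_character, class_list))
```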


R Data Types
Vectors
When you want to create a vector with more than one element, use the c()
function, which combines the elements into a vector.
> apple<-c('red','green','yellow')
> print(apple)
[1] "red"    "green"  "yellow"
> b<-c(1,2,3)
> print(b)
[1] 1 2 3
> print(class(apple))
[1] "character"
>
R Data Types
Lists
A list is an R-object which can contain many different types of elements
inside it, like vectors, functions and even another list.
#create a list
> list1<-list(c(1,2,3,4),25.6,sin)
> print(list1)
[[1]]
[1] 1 2 3 4

[[2]]
[1] 25.6

[[3]]
function (x)  .Primitive("sin")
>
R Data Types
Matrices
A matrix is a two-dimensional rectangular data set.
It can be created using a vector input to the matrix function.
#create a matrix
> M=matrix(c('a','b','c','d','e','f'), nrow=2,ncol=3,byrow=TRUE)
> print(M)
     [,1] [,2] [,3]
[1,] "a"  "b"  "c"
[2,] "d"  "e"  "f"
>
> M=matrix(c('a','b','c','d','e','f'), nrow=3,ncol=2,byrow=TRUE)
> print(M)
     [,1] [,2]
[1,] "a"  "b"
[2,] "c"  "d"
[3,] "e"  "f"
>


R Data Types
Arrays
While matrices are confined to two dimensions, arrays can be of any number of
dimensions.
The array function takes a dim attribute which creates the required number of
dimensions.
In the below example we create an array with two elements which are 3 x 3
matrices each.
#create an array
> a<-array(c('green','yellow'), dim=c(3,3,2))
> print(a)
, , 1

     [,1]     [,2]     [,3]
[1,] "green"  "yellow" "green"
[2,] "yellow" "green"  "yellow"
[3,] "green"  "yellow" "green"

, , 2

     [,1]     [,2]     [,3]
[1,] "yellow" "green"  "yellow"
[2,] "green"  "yellow" "green"
[3,] "yellow" "green"  "yellow"

>
R Data Types
Factors
Factors are the R-objects which are created using a vector.
A factor stores the vector along with the distinct values of the elements in
the vector as labels.
The labels are always character, irrespective of whether the input vector is
numeric, character, Boolean, etc. They are useful in statistical modeling.
Factors are created using the factor() function.
The nlevels() function gives the count of levels.
R Data Types
Factors
# create a vector
> apple_colors<-c('green','green','yellow','red','red','red','green')
>
# create a factor
> factor_apple<-factor(apple_colors)
>
# print the factor
> print(factor_apple)
[1] green  green  yellow red    red    red    green
Levels: green red yellow
>
> print(nlevels(factor_apple))
[1] 3
>
R Data Types
Data Frames
Data frames are tabular data objects.
Unlike a matrix, in a data frame each column can contain different modes of
data.
The first column can be numeric while the second column can be character and
the third column can be logical.
It is a list of vectors of equal length.
Data frames are created using the data.frame() function.


R Data Types
Data Frames
# create the data frame
# (columns are named with =, so the first column is called gender)
> BMI<-data.frame(
+ gender=c("Male","Male","Female"),
+ height=c(152,171.5,165),
+ weight=c(81,93,78),
+ Age=c(42,38,26))
>
> print(BMI)
  gender height weight Age
1   Male  152.0     81  42
2   Male  171.5     93  38
3 Female  165.0     78  26
>
Syntax of R Programming
R Programming is a very popular programming language which is broadly used in
data analysis. The way in which we define its code is quite simple. "Hello
World" is the basic program for all languages.
Programs can be developed in two modes:
1. The command prompt
2. The script file
1. The command prompt ("R" Console)
At the command prompt, the code is written on the command line. It interacts
directly with the interpreter.
Example:
> "stc"
[1] "stc"
> "STC
+ COLLEGE POLLACHI"
[1] "STC\nCOLLEGE POLLACHI"
>
Installing, Running, and Interacting with R



Syntax of R Programming
2. The script file (RStudio)
An R script file is another way in which we can write our programs. We then
execute those scripts at the command prompt with the help of the R
interpreter known as Rscript.
We make a text file and write the following code. We save this file with the
.R extension, as Filename.R.
string <- "STC COLLEGE POLLACHI"
print(string)

Output
[1] "STC COLLEGE POLLACHI"
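The slide describes running a script with Rscript from the command prompt. As an alternative sketch that stays inside an R session, the base function source() also runs a script file. The temporary location and the separate environment used here are assumptions made so the example does not touch your working directory.

```r
# Write a small script file to a temporary folder.
script_path <- file.path(tempdir(), "Filename.R")
writeLines(
  c('string <- "STC COLLEGE POLLACHI"',
    'print(string)'),
  script_path
)

# Run the script. Its variables are created in script_env
# instead of the global workspace.
script_env <- new.env()
source(script_path, local = script_env)

# The variable defined by the script is now available.
result <- get("string", envir = script_env)
```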
Syntax of R Programming
A sample program is generated
[screenshot]
If we run the program
[screenshot]


Syntax of R Programming
Comments
In R programming, comments are the programmer-readable explanations in the
source code of an R program.
The purpose of adding these comments is to make the source code easier to
understand. Compilers and interpreters generally ignore these comments.
Two types of comments
1. Single line comment
2. Multi-line comment


Syntax of R Programming
1. Single line comment
In R programming, there is only a single-line comment. R doesn't support
multi-line comments. But if we want multi-line comments, we can put the text
in a false block.
Single line comment
#My first program in R programming
string <- "Hello World"
print(string)
2. Multi-line comment
#Trick for multi-line comment
if(FALSE){
"R is an interpreted computer programming language
which was created by Ross Ihaka and Robert
Gentleman at the University of Auckland, New
Zealand."
}
#My first program in R programming
string <- "Hello World"
print(string)


Variables in R Programming
• Store the information to be manipulated and referenced in the R program
• Store an atomic vector, a group of atomic vectors, or a combination of many
R objects.
• R supports three ways of variable assignment. Operators use an arrow or an
equal sign to assign values to variables.
1. Using equal to operator – the value on the right is assigned to the name
on the left.
2. Leftward operator – data is copied from right to left.
3. Rightward operator – data is copied from left to right.

Syntax
For equal to operator: variable_name = value
For leftward operator: variable_name <- value

Example:
var1 <- "hello"
print(var1)
Output:
[1] "hello"
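A brief sketch of the assignment forms side by side. The slide names the equal-to and leftward operators; the rightward arrow (->) shown last also exists in R and is presumably the third way the slide counts. The variable names are arbitrary.

```r
# Equal-to operator
var1 = "hello"

# Leftward operator: the value on the right is stored in the name on the left
var2 <- "hello"

# Rightward operator: the value on the left is stored in the name on the right
"hello" -> var3

print(var1)

# All three variables hold the same value.
all_same <- identical(var1, var2) && identical(var2, var3)
print(all_same)
```

Inside function calls, = names an argument rather than creating a variable, which is one reason many R users prefer <- for assignment.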
Rules for naming a R variable
1. A valid variable name consists of a combination of letters, numbers,
dot (.), and underscore (_) characters. Ex. var.1_ is valid
2. No other special character is allowed. Ex. var$! or var#1 – invalid
3. It can start with letters or a dot. Ex. .var and var – valid
4. It should not start with numbers or underscore. Ex. 2var or _var – invalid
5. If a variable starts with a dot, the next thing after the dot cannot be a
number. Ex. .3var – invalid
6. Variable name should not be a reserved keyword. Ex. TRUE, FALSE, etc.
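One way to check these rules mechanically is the base function make.names(), which rewrites any string into a syntactically valid name. If the string comes back unchanged, it was already valid. The helper name is_valid_name is made up for this sketch.

```r
# A name is treated as valid when make.names() leaves it unchanged.
is_valid_name <- function(name) {
  make.names(name) == name
}

# The examples from the rules above.
candidates <- c("var.1_", "var#1", ".var", "var",
                "2var", "_var", ".3var", "TRUE")

validity <- vapply(candidates, is_valid_name, logical(1))
print(validity)
```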


Data manipulation – Combining data – Obtaining subset of
data – Data sorting – Data aggregation

getwd() # To get working directory

# Save the excel file as a csv (Comma delimited) file in the working directory
# To import the file "TestmarksA"
dataA = read.csv("TestmarksA.csv")
dataA
   Sl.No. Name IT.1 IT.II
1      A1    A   26    32
2      A2    B   25    25
3      A3    C   19    31
4      A4    D   14    26
5      A5    E   25    28
6      A6    F   32    32
7      A7    G   29    42
8      A8    H   25    26
9      A9    I   31    38
10    A10    J   35    39
11    A11    K   33    31
12    A12    L   35    36
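The file TestmarksA.csv is not distributed with these slides. To follow along without it, the same table can be typed in directly, as in this sketch. The values are copied from the output above; writing to a temporary folder instead of the working directory is an assumption made for safety.

```r
# Recreate the table shown above by hand.
dataA <- data.frame(
  Sl.No. = paste0("A", 1:12),
  Name   = LETTERS[1:12],
  IT.1   = c(26, 25, 19, 14, 25, 32, 29, 25, 31, 35, 33, 35),
  IT.II  = c(32, 25, 31, 26, 28, 32, 42, 26, 38, 39, 31, 36)
)

# Optional round trip: save as CSV and read it back with read.csv(),
# which mirrors the import step on the slide.
csv_path <- file.path(tempdir(), "TestmarksA.csv")
write.csv(dataA, csv_path, row.names = FALSE)
dataA_from_file <- read.csv(csv_path)

print(dataA_from_file)
```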
# To rename column names
colnames(dataA)=c("Roll No.","Student name","CATI","CATII")
dataA
   Roll No. Student name CATI CATII
1        A1            A   26    32
2        A2            B   25    25
3        A3            C   19    31
4        A4            D   14    26
5        A5            E   25    28
6        A6            F   32    32
7        A7            G   29    42
8        A8            H   25    26
9        A9            I   31    38
10      A10            J   35    39
11      A11            K   33    31
12      A12            L   35    36


nrow(dataA)
[1] 12
ncol(dataA)
[1] 4
dim(dataA)
[1] 12  4
names(dataA)
[1] "Roll No."     "Student name" "CATI"         "CATII"
head(dataA)
  Roll No. Student name CATI CATII
1       A1            A   26    32
2       A2            B   25    25
3       A3            C   19    31
4       A4            D   14    26
5       A5            E   25    28
6       A6            F   32    32
head(dataA,8)
  Roll No. Student name CATI CATII
1       A1            A   26    32
2       A2            B   25    25
3       A3            C   19    31
4       A4            D   14    26
5       A5            E   25    28
6       A6            F   32    32
7       A7            G   29    42
8       A8            H   25    26


#Extracting rows and columns (Indexing)
dataA["CATII"] # To extract the column headed by CATII
   CATII
1     32
2     25
3     31
4     26
5     28
6     32
7     42
8     26
9     38
10    39
11    31
12    36

dataA[8,] # To extract the 8th row
  Roll No. Student name CATI CATII
8       A8            H   25    26

dataA[2:5,] # To extract second to fifth rows
  Roll No. Student name CATI CATII
2       A2            B   25    25
3       A3            C   19    31
4       A4            D   14    26
5       A5            E   25    28

mean(dataA[2:5,]$CATII) # To find the mean CATII marks of 2nd row to 5th row
[1] 27.5


dataA[c(3,11),] # To extract the 3rd and 11th rows
   Roll No. Student name CATI CATII
3        A3            C   19    31
11      A11            K   33    31

#Extracting specific rows and columns (Slicing)
dataA[c(1,2,5), c(1,3,4)] # To extract rows 1,2,5 and columns 1,3,4
  Roll No. CATI CATII
1       A1   26    32
2       A2   25    25
5       A5   25    28


#Adding semester marks to existing data
Sem=c(78,46,60,45,63,74,81,61,61,79,84,74)
dataA1=cbind(dataA,Sem)
dataA1
   Roll No. Student name CATI CATII Sem
1        A1            A   26    32  78
2        A2            B   25    25  46
3        A3            C   19    31  60
4        A4            D   14    26  45
5        A5            E   25    28  63
6        A6            F   32    32  74
7        A7            G   29    42  81
8        A8            H   25    26  61
9        A9            I   31    38  61
10      A10            J   35    39  79
11      A11            K   33    31  84
12      A12            L   35    36  74
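The section title also names data sorting and data aggregation, but no slide text for those steps survives here. The following is only a sketch of how they are commonly done in base R with order(), colMeans() and aggregate(). The simplified column names, the cutoff of 60 and the "above"/"below" labels are assumptions, not part of the original material.

```r
# The marks table from above, retyped with simplified column names.
marks <- data.frame(
  roll  = paste0("A", 1:12),
  CATI  = c(26, 25, 19, 14, 25, 32, 29, 25, 31, 35, 33, 35),
  CATII = c(32, 25, 31, 26, 28, 32, 42, 26, 38, 39, 31, 36),
  Sem   = c(78, 46, 60, 45, 63, 74, 81, 61, 61, 79, 84, 74)
)

# Sorting: order() returns row positions; the minus sign gives descending order.
sorted <- marks[order(-marks$Sem), ]
top3 <- head(sorted$roll, 3)

# Aggregation over columns: mean of each marks column.
col_means <- colMeans(marks[, c("CATI", "CATII", "Sem")])

# Aggregation by group: mean Sem marks for students at or above
# a hypothetical cutoff of 60, and for those below it.
marks$result <- ifelse(marks$Sem >= 60, "above", "below")
by_group <- aggregate(Sem ~ result, data = marks, FUN = mean)

print(top3)
print(col_means)
print(by_group)
```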




Applications of R



FINANCE
Data Science is widely used in the financial industry.
R provides tools for moving averages, autoregression and time-series
analysis, which form the crux of financial applications. R is widely used for
credit risk analysis at firms like ANZ and for portfolio management.

BANKING
Banking industries make use of R for credit risk modeling and other forms of
risk analytics.

Banks make heavy usage of the Mortgage Haircut Model that allows them to take
over the property in case of loan defaults.

Bank of America makes use of R for financial reporting. With the help of R,
the data scientists at BOA are able to analyze financial losses and make use
of R's visualization tools.
HEALTHCARE

Genetics, Bioinformatics, Drug Discovery and Epidemiology are some of the
fields in healthcare that make heavy usage of R. With the help of R, these
companies are able to crunch data and process information, providing an
essential backdrop for further analysis and data processing.
R is also popular for its Bioconductor package that provides various
functionalities for analyzing genomic data.

SOCIAL MEDIA
For many beginners in Data Science and R, social media is a data playground.
Sentiment Analysis and other forms of social media data mining are some of
the important statistical tools that are used with R.
Mining user sentiment is another popular category in social media analytics.
With the help of R, companies are able to model statistical tools that
analyze user sentiments, allowing them to improve their experiences.
E-COMMERCE

R is one of the standard tools used in e-commerce.

Since these internet-based companies have to deal with various forms of data,
structured and unstructured, as well as from varying data sources like
spreadsheets and databases (SQL & NoSQL), R proves to be an effective choice
for these industries.
Various statistical procedures like linear modeling are necessary to analyze
the purchases made by the customers as well as in predicting product sales.
Furthermore, companies use R for carrying out A/B testing analysis across the
pages of their products.

MANUFACTURING
Manufacturing companies like Ford, Mondelez, and John Deere use R to analyze
customer sentiment. This helps them optimize their product according to
trending consumer interests and also to match their production volume to
varying market demand. They also use R to minimize their production costs and
maximize profits.
Some More Applications of R
1. R is commonly used for descriptive statistics. Descriptive statistics
   summarize the main features of the data. R supports a variety of
   summary statistics, such as measures of central tendency, measures of
   variability, kurtosis, and skewness.
2. R is widely used for exploratory data analysis. R's most popular
   package, ggplot2, is considered one of the best visualization
   libraries due to its aesthetics; interactivity is available through
   companion packages such as plotly.
3. R also allows hypothesis testing to validate statistical models.
4. You can measure the correlation between variables in R with the cor()
   function and model their relationship with the lm() function, which
   fits simple linear regression as well as multivariable linear
   regression.
5. Moreover, with the help of R, you can develop predictive models that
   use machine learning algorithms to predict future events.
6. R is also useful for developing statistical software packages and for
   implementing analytical processing in other software suites.
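Several of the applications listed above can be tried with base R alone. The sketch below is only an illustration: it uses the built-in mtcars dataset, and the helper functions sample_skewness() and sample_kurtosis() are hypothetical names defined here (base R has no built-in skewness or kurtosis function), using the simple moment-based definitions.

```r
# Built-in dataset shipped with R; chosen here only for illustration.
mpg <- mtcars$mpg

# Descriptive statistics: central tendency and variability
central_tendency <- c(mean = mean(mpg), median = median(mpg))
variability <- c(sd = sd(mpg), variance = var(mpg), iqr = IQR(mpg))

# Hypothetical helpers: moment-based skewness and kurtosis
sample_skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}
sample_kurtosis <- function(x) {
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2
}

# Hypothesis test: do automatic (am = 0) and manual (am = 1) cars differ in mpg?
group_test <- t.test(mpg ~ am, data = mtcars)

# Correlation between weight and fuel efficiency
wt_mpg_cor <- cor(mtcars$wt, mtcars$mpg)

# Simple and multivariable linear regression with lm()
simple_fit <- lm(mpg ~ wt, data = mtcars)
multi_fit  <- lm(mpg ~ wt + hp, data = mtcars)

print(central_tendency)
print(variability)
print(c(skewness = sample_skewness(mpg), kurtosis = sample_kurtosis(mpg)))
print(group_test$p.value)
print(wt_mpg_cor)
print(coef(simple_fit))
print(coef(multi_fit))
```

Calling summary(simple_fit) or summary(multi_fit) prints standard errors, t-statistics, and R-squared for each model, which is the usual next step when validating a fitted regression.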
Real-Life Use Cases of R Language
• Facebook – Facebook uses R to update status and its social network
  graph. It also uses R to predict colleague interactions.
• Ford Motor Company – Ford relies on Hadoop. It also relies on R for
  statistical analysis as well as for data-driven decision support.
• Foursquare – R is an important part of the stack behind Foursquare's
  famed recommendation engine.
• John Deere – Statisticians at John Deere use R for time series
  modeling and geospatial analysis in a reliable and reproducible way.
  The results are then integrated with Excel and SAP.
• New York Times – R is used in the news cycle at The New York Times to
  crunch data and prepare graphics before they go to print.
• ANZ Bank – ANZ, the fourth largest bank in Australia, uses R for its
  credit risk analysis.

More Links
