Unit 1

The document provides an overview of the R programming language, highlighting its features, applications, advantages, and disadvantages. R is widely used for statistical computing, data analysis, and visualization, with a strong community and numerous packages available for various tasks. It also covers basic commands, data structures, and handling of missing values in R.

Uploaded by

kakingareak

R Environment:
 R is a popular programming language used for statistical computing and graphical
presentation.
 It is an integrated suite of software facilities for data manipulation, calculation and
graphical display.
 It is an interpreted programming language.
 It is a software environment that is widely used for statistical computing and data
analysis.
 It was developed by Ross Ihaka and Robert Gentleman at the University of
Auckland, New Zealand, in 1993.
 R is an open-source implementation of the S programming language.
 The R Core Team was formed in 1997 to develop the language further.
 At the time of writing, the current version is 4.4.1, released on 14 June 2024.

Why Use R?
 It is a great resource for data analysis, data visualization, data science and machine
learning
 It provides many statistical techniques (such as statistical tests, classification, clustering
and data reduction)
 It is easy to draw graphs in R, such as pie charts, histograms, box plots and scatter plots.
 It is platform independent and works on Windows, Mac and Linux.
 It is open-source and free.
 It has strong community support.
 It has many packages (libraries of functions) that can be used to solve different
problems

Features of R
 R is a well-developed, simple and effective programming language which includes
conditionals, loops, user defined recursive functions and input and output facilities.
 R has an effective data handling and storage facility.
 R provides operators for calculations on arrays, lists, vectors and matrices.
 R provides a large, coherent and integrated collection of tools for data analysis.
 R provides graphical facilities for data analysis and display, either directly on the
computer screen or as printed hard copy.

Applications of R Programming in the Real World:

1. Statistical Analysis and Data Visualization:


It offers a comprehensive toolkit for performing an array of statistical tests, from basic
descriptive statistics to advanced regression models.
R shines in the aspect of data visualization. Packages like ggplot2 provide a flexible and
powerful platform for creating compelling graphs and charts, facilitating a visual understanding
of complex datasets.
2. Data Exploration and Cleaning:
R handles missing values and outliers and helps ensure overall data quality before in-depth
analysis, so that datasets are cleaned and prepared for accurate and reliable results.
3. Predictive Modeling and Machine Learning:
It provides an array of algorithms for regression, classification, and clustering, making it a
preferred language for building predictive models.
In real-time applications, such as predicting stock prices, customer behavior, or disease
outcomes, R's machine learning capabilities prove invaluable, driving data-driven decision-
making.
4. Biostatistics and Healthcare:
Biostatistics: R is used for analyzing clinical trial data, conducting epidemiological studies,
and helping healthcare professionals make data-driven decisions.
Healthcare: in genomics, R is used for analyzing genetic data, identifying patterns
associated with diseases, and contributing to personalized medicine.
5. Finance and Risk Management:
The financial sector relies on R for risk modeling, portfolio optimization, and analyzing market
trends. R's ability to handle large datasets is crucial in financial analytics where real-time
insights can drive strategic decisions.
6. Social Sciences and Market Research:
R finds extensive use in social sciences for survey analysis, sentiment analysis on social media,
and understanding public opinion.
7. Environmental Science and Climate Research:
R is used to analyze climate data, predict environmental trends, and assess the impact of human
activities on ecosystems.
Advantages of R language
 R is one of the most comprehensive statistical analysis packages, and new statistical
techniques often appear first in R.
 It is open source, so anyone can download, run and modify R free of charge.
 R is cross-platform and runs on GNU/Linux, Windows and macOS.
 Everyone is welcome to contribute new packages, bug fixes and code enhancements.
Disadvantages of R language
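 The quality of some contributed packages is lower than expected.
 R keeps objects in memory, so large datasets can consume all available memory.
 There is no vendor support, so there is nobody to complain to if something doesn't work.
 R code can run slower than equivalent code in other languages such as
Python and MATLAB, particularly when loops are used in place of vectorized operations.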
R and statistics:
 R is used as a statistics system.
 It is an environment within which many classical and modern statistical techniques have
been implemented.
 A few of these are built into the base R environment, but many are supplied as
packages.
 There are about 25 packages supplied with R (the standard and recommended packages).
 Many more packages are available through CRAN (Comprehensive R Archive Network):
https://cran.r-project.org/
 CRAN stores R's executable files, source code and documentation, as well as packages
contributed by users.
R commands:
 R is an expression language with a very simple syntax
 It is case sensitive so A and a are different symbols and would refer to different variables
 Commands are separated either by a semi-colon (‘;’), or by a newline.
 Elementary commands can be grouped together into one compound expression by
braces (‘{’ and ‘}’).
 Comments can be put anywhere, starting with a hash mark (‘#’)
 The set of symbols which can be used in R names depends on the operating system and
country within which R is being run.
 Normally all alphanumeric symbols are allowed, plus ‘.’ and ‘_’, with the restriction that
a name must start with ‘.’ or a letter.
 If it starts with ‘.’, the second character must not be a digit.
 Names are effectively unlimited in length.
 Elementary commands consist of either expressions or assignments.
 If an expression is given as a command, it is evaluated, printed (unless specifically made
invisible), and the value is lost.
 An assignment also evaluates an expression and passes the value to a variable but the
result is not automatically printed.
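The syntax rules above can be tried out in a short session. This is only a minimal sketch; the variable names (A, a, b, d, total, ratio) are made up for illustration.

```r
# R is case sensitive: A and a are two different variables
A <- 10
a <- 2

# Two commands on one line, separated by a semicolon
b <- A + a; d <- A - a

# Several commands grouped into one compound expression with braces;
# the value of the group is the value of its last expression
total <- {
  s <- b + d
  s * 2
}

# An expression used as a command is evaluated and printed
A * a

# An assignment is evaluated but not printed;
# wrapping it in parentheses prints the assigned value
(ratio <- A / a)
```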
Data permanency and removing objects:
 The entities that R creates and manipulates are known as objects.
 These may be variables, arrays of numbers, character strings, functions, or more general
structures built from such components.
 During an R session, objects are created and stored by name
 The R command objects() (alternatively, ls()) displays the names of the objects which are
currently stored within R.
 The collection of objects currently stored is called the workspace.
 To remove objects the function rm is available:
rm(x, y, z, ink, junk, temp, foo, bar)
 All objects created during an R session can be stored permanently in a file for use in
future R sessions.
 At the end of each session, R offers the opportunity to save all currently available objects.
 If this is accepted, the objects are written to a file called .RData in the current directory,
and the command lines used in the session are saved to a file called .Rhistory.
 When R is started at a later time from the same directory, it reloads the workspace from
this file.
 At the same time the associated command history is reloaded.
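As a rough sketch of working with the workspace, the following creates, lists, removes and restores objects. The object names tmp_a and tmp_b are hypothetical, and a temporary file stands in for .RData so that nothing is written to the current directory.

```r
# Create a few objects in the workspace (names are illustrative)
tmp_a <- 1:3
tmp_b <- "hello"

# List the names of the objects currently stored
ls()

# Remove one object and check that it is gone
rm(tmp_a)
exists("tmp_a", inherits = FALSE)   # FALSE
exists("tmp_b", inherits = FALSE)   # TRUE

# Save a selected object to a file and load it back later
# (the file is written to a temporary location here)
f <- tempfile(fileext = ".RData")
save(tmp_b, file = f)
rm(tmp_b)
load(f)
tmp_b
```

To save every object at once, base R also provides save.image(), which by default writes the whole workspace to .RData.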
Vectors and assignment:
 R operates on named data structures.
 The simplest such structure is the numeric vector.
 It is a single entity consisting of an ordered collection of numbers.
 To set up a vector named x, consisting of five numbers, use:
x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
This is an assignment statement using the function c(), which concatenates its arguments
into a vector. Assignment can also be made in the other direction with the -> operator.
Example:
x <- c(1, 2, 3, 4, 5)
print(x)
c(10.4, 5.6, 3.1, 6.4, 21.7) -> y
print(y)
z <- 1/x
print(z)
a <- c(x, 0, x)
print(a)
Output:
[1] 1 2 3 4 5
[1] 10.4  5.6  3.1  6.4 21.7
[1] 1.0000000 0.5000000 0.3333333 0.2500000 0.2000000
[1] 1 2 3 4 5 0 1 2 3 4 5
Vector arithmetic:
 Vectors can be used in arithmetic expressions, in which case the operations are performed
element by element.
 The elementary arithmetic operators are the usual +, -, *, / and ^ for raising to a power.
 In addition to the common arithmetic functions, the other functions like log, exp, sin,
cos, tan, sqrt, and so on, all have their usual meaning.
Ex:
a <- c(1, 2, 3)
b <- c(10, 20, 30)
a + 100
a+b
a / 100
(a + b) / 10
# Take the integers from 1 to 5, then add 100 to each
1:5 + 100
a^2
log(a)
sin(a)
cos(a)
exp(a)

Output:
[1] 101 102 103
[1] 11 22 33
[1] 0.01 0.02 0.03
[1] 1.1 2.2 3.3
[1] 101 102 103 104 105
[1] 1 4 9
[1] 0.0000000 0.6931472 1.0986123
[1] 0.8414710 0.9092974 0.1411200
[1] 0.5403023 -0.4161468 -0.9899925
[1] 2.718282 7.389056 20.085537
 max and min select the largest and smallest values in their arguments, even if they are
given several vectors. The length() function returns the number of elements in a
vector.
Ex:
a <- c(2, 1, 5, 4, 5,6)
min(a) # [1] 1
max(a) # [1] 6
length(a) # [1] 6
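The example above passes a single vector. As a small sketch of the "several vectors" case, the following uses two illustrative vectors, p and q. range() is a related base R function that returns both extremes at once.

```r
p <- c(2, 1, 5)
q <- c(9, 4)

max(p, q)         # largest value across both vectors: 9
min(p, q)         # smallest value across both vectors: 1
range(p, q)       # smallest and largest of all values: 1 9
length(c(p, q))   # 5 elements in the combined vector
```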

 Statistical measures like mean and median are essential for summarizing and
understanding the central tendency of a dataset. In R, these measures can be calculated
easily using built-in functions.

Mean : It is calculated by taking the sum of the values and dividing by the number of values in
a data series. The function mean() is used to calculate this in R.

Ex:
a <- c(2, 1, 5, 4, 5,6)
mean(a) # [1] 3.833333

Median : The median() function in R is used to compute the median (middle value) of a numeric
data set.

Ex:
a <- c(2, 1, 5, 4, 5,6)
median(a) # [1] 4.5
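To make the two definitions concrete, the sketch below recomputes the mean and the median by hand and compares them with the built-in functions. The names manual_mean and manual_median are illustrative only.

```r
a <- c(2, 1, 5, 4, 5, 6)

# Mean: sum of the values divided by how many there are
manual_mean <- sum(a) / length(a)

# Median: sort the values, then take the middle one
# (or the average of the two middle ones when the count is even)
s <- sort(a)
n <- length(s)
manual_median <- if (n %% 2 == 1) s[(n + 1) / 2] else (s[n / 2] + s[n / 2 + 1]) / 2

manual_mean     # same as mean(a)
manual_median   # same as median(a)
```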

 The parallel maximum and minimum functions pmax and pmin return a vector (of length
equal to their longest argument) that contains in each element the largest (smallest)
element in that position in any of the input vectors.
Ex:
a<-c(1,2,3,4)
b<-c(5,6)
pmax(a,b) # [1] 5 6 5 6
pmin(a,b) # [1] 1 2 3 4
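In the example above b is shorter than a, so R recycles it to 5 6 5 6 before comparing position by position. The sketch below, using the same a and b, makes that recycling explicit and contrasts the parallel functions with plain max and min, which collapse everything to a single value. The name b_recycled is illustrative.

```r
a <- c(1, 2, 3, 4)
b <- c(5, 6)

# b is recycled to the length of a
b_recycled <- rep(b, length.out = length(a))   # 5 6 5 6

pmax(a, b_recycled)   # element-wise maximum, same result as pmax(a, b)
max(a, b)             # single overall maximum: 6
min(a, b)             # single overall minimum: 1
```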

 The sqrt() function in R is used to compute the square root of a number or each element
in a numeric vector. The sqrt() function will return NaN for negative numbers because
the square root of a negative number is not defined in the realm of real numbers.

Syntax:
sqrt(x)
Ex:1
result <- sqrt(16)
print(result)
Output:
[1] 4
Ex:2
numbers <- c(4, 9, 16, 25)
sqrt_numbers <- sqrt(numbers)
print(sqrt_numbers)
Output:
[1] 2 3 4 5
Ex:3
result <- sqrt("a")
print(result)
Output:
Error in sqrt("a") : non-numeric argument to mathematical function
Execution halted
Ex:4
result <- sqrt(-9)
print(result)
Output:
Warning message:
In sqrt(-9) : NaNs produced
[1] NaN
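If a complex result is wanted rather than NaN, the input can be converted to a complex number first. This is a small sketch using base R's as.complex(); the variable names are illustrative.

```r
# sqrt() of a negative real number gives NaN (with a warning)
real_result <- suppressWarnings(sqrt(-9))
is.nan(real_result)        # TRUE

# Converting to complex first gives the complex root
complex_result <- sqrt(as.complex(-9))
complex_result             # 0+3i
```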
 sort(x) returns a vector of the same size as x with the elements arranged in increasing
order
Ex:
a<-c(11,2,1,23)
sort(a)
Output:
[1] 1 2 11 23

Generating regular sequences:


The seq() function in R can be used to generate a sequence of numbers.
Syntax:
seq(from, to, by, length.out, along.with)
Where:
from : the starting value of the sequence.
to : the end value of the sequence.
by : the increment of the sequence. If it is omitted, it defaults to 1 (or -1 when to is
smaller than from); when length.out is given instead, it is calculated as
(to - from) / (length.out - 1).
length.out : the desired total length of the sequence.
along.with : produces a sequence of the same length as the given vector.
Examples:
x <- seq(15)
print(x)
Output:
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

seq(from=1,to=10)
Output:
[1] 1 2 3 4 5 6 7 8 9 10

seq(1:5)
Output:
[1] 1 2 3 4 5

seq(-1,-10)
Output:
[1] -1 -2 -3 -4 -5 -6 -7 -8 -9 -10

seq(1,5,by=2)
Output:
[1] 1 3 5

seq(from=5, to=10)
Output:
[1] 5 6 7 8 9 10

seq(from=-5, length.out=5, by=2)
Output:
[1] -5 -3 -1 1 3

seq(-3, 2, by=.5)
Output:
[1] -3.0 -2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0
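The examples above do not use length.out together with to, or along.with. A brief sketch of both follows; the vector fruits is made up for illustration, and seq_len() and seq_along() are the base R shortcuts for these two cases.

```r
# length.out: five equally spaced values from 0 to 1
# (by is worked out as (1 - 0) / (5 - 1) = 0.25)
seq(0, 1, length.out = 5)

# along.with: one index per element of another vector
fruits <- c("apple", "banana", "cherry")   # illustrative vector
seq(along.with = fruits)

# seq_len() and seq_along() are the usual shortcuts for these cases
seq_len(5)
seq_along(fruits)
```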

A related function is rep() which can be used for replicating an object in various complicated
ways.

a<-c(2,3)
d<-rep(a, times=2)
d
d1<-rep(a, each=2)
d1
Output:
[1] 2 3 2 3
[1] 2 2 3 3

Logical Vectors:

 Logical vectors in R are vectors that consist of the values TRUE, FALSE, and NA (“not
available”, used for missing values).
 They are commonly used in conditional statements, subsetting, and logical operations.
 Logical vectors are fundamental in R for performing tasks that involve condition checking
and data filtering.
Ways to create Logical Vectors:
1. Direct Assignment
a <- c(TRUE, FALSE, TRUE, NA)
2. Using comparison operations
Ex:1
v <- c(10, 20, 30, 40, 50)
vec <- v > 25 # TRUE where the element is greater than 25
vec
Output:
[1] FALSE FALSE TRUE TRUE TRUE
Ex:2
a<-c(4,3,7,6,1)
res<-a%%2==0
print(res)
Output:
[1] TRUE FALSE FALSE TRUE FALSE

 The type of a logical vector, as reported by typeof(), is "logical".


Ex:
x1<-c(TRUE,TRUE,NA,FALSE)
print(typeof(x1))
print(x1)
Output:
[1] "logical"
[1] TRUE TRUE NA FALSE
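Since logical vectors are mainly used for subsetting, filtering and condition checking, here is a short sketch of those uses. It reuses the vector v from the comparison example; the name big is illustrative.

```r
v <- c(10, 20, 30, 40, 50)
big <- v > 25            # FALSE FALSE TRUE TRUE TRUE

# Subsetting: keep only the elements where the condition is TRUE
v[big]                   # 30 40 50

# Logical operators combine conditions element by element
v[v > 15 & v < 45]       # 20 30 40
v[v < 15 | v > 45]       # 10 50
v[!big]                  # 10 20

# TRUE counts as 1 and FALSE as 0 in arithmetic
sum(big)                 # number of elements above 25: 3
any(big)                 # TRUE, at least one element qualifies
all(big)                 # FALSE, not every element qualifies

# if() needs a single TRUE or FALSE, so any() or all() is used here
if (any(big)) print("some values exceed 25")
```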

Missing Values:

Missing values are those elements that are not known. NA (“not available”) is the reserved word
that indicates a missing value. NaN (“not a number”) marks the result of an undefined numerical
computation such as 0/0.
Dealing with Missing Values in R:
Missing values in R are handled with the use of some pre-defined functions:
1. is.na() Function for Finding Missing Values:
This function returns a logical vector of the same length as its argument, with TRUE in each
position where the value is missing (NA or NaN) and FALSE elsewhere.
Ex:
x<- c(NA, 3, 4, NA, NA, NA)
is.na(x)

Output:
[1] TRUE FALSE FALSE TRUE TRUE TRUE
2. Using the is.nan() Function
NaN values are produced by undefined numerical computations. The is.nan() function checks for
them: it returns a logical vector that is TRUE in each position holding NaN and FALSE
elsewhere. Unlike is.na(), it returns FALSE for NA.
Ex:
myVector <- c(NA, 100, 241, NA, 0 / 0, 101, 0 / 0)
is.nan(myVector)
Output:
[1] FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE
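Once missing values have been located, they usually need to be skipped or dropped before computing summaries. The following sketch shows two common base R approaches, the na.rm argument and subsetting with !is.na(). The vector x is illustrative.

```r
x <- c(NA, 3, 4, NA, 0 / 0)   # 0 / 0 produces NaN

is.na(x)     # TRUE for both NA and NaN
is.nan(x)    # TRUE only for NaN

sum(is.na(x))          # how many values are missing: 3
mean(x)                # still missing, because missing values propagate
mean(x, na.rm = TRUE)  # 3.5, missing values are dropped first

# Keep only the known values
x[!is.na(x)]           # 3 4
```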

Character vectors:
A character vector in R is a vector of type character that is used to store strings and NA values.
Character strings are entered using either matching double (") or single (') quotes, but are
printed using double quotes (or sometimes without quotes).
A string may contain letters, digits and other characters, and may be of any length.
For example, c('ABC','abc','AbC','12AB') is a character vector of length 4.
Ex:
a<-c('A','45C',"abc")
print(a)
res<-is.character(a)
print(res)
Output:
[1] "A" "45C" "abc"
[1] TRUE
Objects, their modes and attributes:
Intrinsic attributes: mode and length
 The entities R operates on are technically known as objects.
 Examples are vectors of numeric (real) or complex values, vectors of logical values and
vectors of character strings.
 These are known as “atomic” structures since their components are all of the same type, or
mode, namely numeric, complex, logical, character and raw.
 Vectors must have their values all of the same mode.
 Thus any given vector must be unambiguously either logical, numeric, complex, character or
raw.
 R also operates on objects called lists, which are of mode list.
 These are ordered sequences of objects which individually can be of any mode
 Lists are known as “recursive” rather than atomic structures since their components can
themselves be lists in their own right.
 The other recursive structures are those of mode function and expression.
 Functions are the objects that form part of the R system along with similar user-written
functions.
 The mode of an object is the basic type of its fundamental constituents. Mode and length are
properties that every object has.
 The functions mode(object) and length(object) can be used to find out the mode and length
of any defined structure.
 For example, if z is a complex vector of length 100, then mode(z) is the character string
"complex" and length(z) is 100.
Changing the length of an object:
An “empty” object may still have a mode. For example, e <- numeric() makes e an empty
vector structure of mode numeric. Similarly, character() is an empty character vector. Once an
object of any size has been created, new components may be added to it simply by giving it
an index value outside its previous range. Thus e[3] <- 17 now makes e a vector of length 3
(the first two components of which are both NA).
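The mode and length properties, and the way an object's length can change, can be seen in a short sketch. The objects z, lst, e and e_extended are illustrative; length(e) <- n is the base R replacement form for truncating or extending an object.

```r
# Mode and length of some atomic vectors
z <- complex(real = 1:100, imaginary = 0)
mode(z)                 # "complex"
length(z)               # 100
mode(c(TRUE, NA))       # "logical"
mode(c(1, "a", TRUE))   # "character": mixed values are coerced to one mode

# A list is recursive: components can differ in mode and can be lists
lst <- list(1, "a", list(TRUE, 2))
mode(lst)               # "list"
length(lst)             # 3 top-level components
mode(sum)               # "function"

# An empty object still has a mode
e <- numeric()
mode(e)                 # "numeric"
length(e)               # 0

# Assigning beyond the end extends the vector, padding with NA
e[3] <- 17
e_extended <- e         # copy kept for inspection: NA NA 17

# length<- can also truncate or extend an object
length(e) <- 2          # NA NA
length(e) <- 4          # NA NA NA NA
```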
Getting and setting attributes:
 The attr() function is used to get or set the value of a specific attribute of an object.
 Attributes in R are metadata that can be attached to R objects to provide additional
information.
 Common examples of attributes include names, dimensions, class, and others.
attr(x, which)
attr(x, which) <- value
x: The R object from which you want to get or to which you want to set an attribute.
which: A string specifying the name of the attribute that you want to access or modify.
value: The new value to assign to the attribute.
Getting an Attribute:
To get the value of an attribute, you use the attr() function with two arguments: the object and
the name of the attribute.
mat <- matrix(1:6, nrow = 2, ncol = 3)
attr(mat, "dim") # 2 3
Setting an Attribute:
To set the value of an attribute, use the attr() function with the assignment form.
vec <- 1:6
attr(vec, "dim") <- c(2, 3)
vec
Output:
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
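Besides dim, the same mechanism handles names and user-defined attributes. The sketch below assumes a made-up vector scores and a made-up attribute called "units"; attributes() is the base R function that lists everything attached to an object.

```r
scores <- c(90, 75, 82)

# names is an attribute: set it with attr() and the printed vector is labelled
attr(scores, "names") <- c("maths", "physics", "chemistry")
scores

# Any name can be used for a custom attribute ("units" is made up here)
attr(scores, "units") <- "percent"
attr(scores, "units")        # "percent"

# attributes() lists everything attached to the object
attributes(scores)

# Asking for an attribute that was never set returns NULL
attr(scores, "dim")          # NULL

# Assigning NULL removes an attribute
attr(scores, "units") <- NULL
```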

Class of an Object:
 In R, the class of an object is an attribute that defines how the object should behave and
interact with functions and methods.
 The class of an object is fundamental in R's object-oriented programming, determining
the object's type and how it is treated by generic functions.
Ex:
x <- 42
class(x) # "numeric"
y <- "Hello"
class(y) # "character"
Setting the Class of an Object:
One can set or change the class of an object using the class() function with the assignment
form:
class(x) <- "integer"
class(x) # "integer"
unclass():
 The unclass function in R is used to remove the class attribute from an object.
 This can be useful when you want to strip an object of its class-specific behavior and
revert it to a more basic form, typically a vector or list.
 By doing this, the object no longer behaves according to its class methods but rather as a
basic data type.
unclass(object)
Example:
# Create a factor
f <- factor(c("apple", "banana", "cherry", "apple"))
print(f)
# Unclass the factor
uf <- unclass(f)
print(uf)
Output:
[1] apple banana cherry apple
Levels: apple banana cherry
[1] 1 2 3 1
attr(,"levels")
[1] "apple" "banana" "cherry"
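To illustrate how the class attribute steers generic functions, here is a minimal sketch with a made-up class name, "celsius", and a print method written for it. Neither exists in base R; they are assumptions for the example only.

```r
# A plain number tagged with a made-up class name, "celsius"
temp <- structure(21.5, class = "celsius")

# A print method for that class; print() is a generic function and
# dispatches to print.celsius() because of the class attribute
print.celsius <- function(x, ...) {
  cat(format(unclass(x)), "degrees Celsius\n")
  invisible(x)
}

print(temp)                 # uses print.celsius()
class(temp)                 # "celsius"
inherits(temp, "celsius")   # TRUE

# After unclass() the object is an ordinary numeric value again
plain <- unclass(temp)
class(plain)                # "numeric"
print(plain)                # default printing: [1] 21.5
```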
Ordered and unordered factors:
Factors:
 A factor is a data structure used for fields that take only a predefined, finite
number of values.
 Such fields are variables that take a limited number of different values.
 Factors are data objects used to categorize data and store it as levels.
 They can be built from both integer and string values, and are useful for columns that have a
limited number of unique values.
 A factor is a vector object used to specify a discrete classification (grouping) of the
components of other vectors of the same length.
 R provides both ordered and unordered factors.
Arguments of the factor() function:
 x: the vector that needs to be converted into a factor.
 levels: the set of distinct values that x may take.
 labels: an optional character vector of labels for the levels, given in the same order as levels.
 exclude: values to be excluded when forming the set of levels.
 ordered: a logical flag that decides whether the levels are regarded as ordered.
 nmax: an upper bound on the number of levels.
Ex:
x <-c("female", "male", "male", "female")
print(x)
# Converting the vector x into a factor named gender
gender <- factor(x)
print(gender)
Output:
[1] "female" "male" "male" "female"
[1] female male male female
Levels: female male
To display levels:
gender <- factor(c("male", "female", "male", "female"))
levels(gender)
Output :
[1] "female" "male"
levels argument inside the factor():
Ex:
gender <- factor(c("female", "male", "male", "female"), levels = c("female", "transgender",
"male"))
gender
Output:
[1] female male male female
Levels: female transgender male
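The labels argument has not been shown yet. In the sketch below, short codes are mapped to readable labels. The vector codes and its "x" entry are made up to show what happens to a value that is not among the levels.

```r
codes <- c("f", "m", "m", "f", "x")   # "x" stands for an unwanted entry

gender <- factor(codes,
                 levels = c("f", "m"),            # values to recognise
                 labels = c("female", "male"))    # names shown for them

gender            # "x" is not among the levels, so it becomes NA
levels(gender)    # "female" "male"
nlevels(gender)   # 2
table(gender)     # counts per level: female 2, male 2
```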

Checking for a Factor in R:


 The function is.factor() is used to check whether a variable is a factor; it returns
TRUE if it is a factor and FALSE otherwise.
Ex:
gender <- factor(c("female", "male", "male", "female"));
print(is.factor(gender))
Output:
[1] TRUE
 Function class() is also used to check whether the variable is a factor and if true returns
“factor”.
Ex:
gender <- factor(c("female", "male", "male", "female"));
class(gender)
Output:
[1] "factor"
Accessing elements of a Factor in R:
Like we access elements of a vector, the same way we access the elements of a factor.
If gender is a factor then gender[i] would mean accessing an ith element in the factor.
Example :
gender <- factor(c("male", "female", "male", "female"))
gender[3]
gender[c(2, 4)]
Output:
[1] male
Levels: female male
[1] female female
Levels: female male
Excluding an element with a negative index:
gender[-3]
Output:
[1] male female female
Levels: female male
Modification of a Factor in R:
gender <- factor(c("male","female", "male", "female"))
gender[2]<-"male"
gender
Output:
[1] male male male female
Levels: female male
Adding a new value to the levels:
gender <- factor(c("male","female", "male", "female"))
levels(gender)<-c(levels(gender),"Male")
gender[3]<-"Male"
gender
Output:
[1] male female Male female
Levels: female male Male
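The level has to be added first because a factor only accepts values that are among its levels. As a sketch of what happens otherwise (the value "other" and the name was_na are illustrations), assigning an unknown value stores NA and raises a warning, which is suppressed here to keep the output quiet.

```r
gender <- factor(c("male", "female", "male", "female"))

# "other" is not a level yet. Without suppressWarnings() R would report:
# invalid factor level, NA generated
suppressWarnings(gender[2] <- "other")
was_na <- is.na(gender[2])   # TRUE: the element became NA
levels(gender)               # unchanged: "female" "male"

# Extending the levels first makes the same assignment valid
levels(gender) <- c(levels(gender), "other")
gender[2] <- "other"
gender
```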
The function tapply() and ragged arrays:
 The tapply function in R is used to apply a function over subsets of a vector.
 It stands for "table apply" and is commonly used for group-wise operations.
 A "ragged array" typically refers to an array where sub-arrays (or groups) have different
lengths, which can complicate applying functions directly.
Syntax: tapply(vector, factor, fun)
Parameters:
vector: the data vector to be split
factor: a factor (or a list of factors) of the same length as vector, defining the groups
fun: the function to be applied to each group
Ex:
data <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
groups <- factor(c('A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C'))
ressum <- tapply(data, groups, sum)
print(ressum)
result <- tapply(data, groups, mean)
print(result)

Output:
 A  B  C
 3 12 40
  A   B   C
1.5 4.0 8.0
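To see why the grouped data form a ragged array, the sketch below inspects the group sizes and the sub-vectors themselves, reusing data and groups from the example. split() is a base R function; the names sizes, pieces and ranges are illustrative.

```r
data <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
groups <- factor(c('A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C'))

# The groups have different sizes, which is what makes the array "ragged"
sizes <- tapply(data, groups, length)
sizes                        # A: 2, B: 3, C: 5

# split() shows the sub-vectors that tapply() hands to the function
pieces <- split(data, groups)
pieces$B                     # 3 4 5

# When the function returns more than one value per group,
# tapply() returns a list-like array rather than a plain vector
ranges <- tapply(data, groups, range)
ranges[["C"]]                # 6 10
```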
Ordered factors:
 An ordered factor is a special type of factor where the levels are ordered.
 This ordering allows for comparisons between levels.
 For example, if you have an ordered factor representing "low", "medium", and
"high", we can say that "medium" is greater than "low".
 The ordered() function creates such ordered factors.
Ex:
size = c("small", "large", "large", "small",
"medium", "large", "medium", "medium")
size_factor <- factor(size)
print(size_factor)

# ordering the levels
ordered.size <- factor(size, levels = c(
"small", "medium", "large"), ordered = TRUE)
print(ordered.size)
Output:
[1] small large large small medium large medium medium
Levels: large medium small
[1] small large large small medium large medium medium
Levels: small < medium < large
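Once the levels are ordered, comparison operators and functions such as max() and sort() respect that order. The following sketch reuses the size data from the example above; the name same is illustrative.

```r
size <- c("small", "large", "large", "small",
          "medium", "large", "medium", "medium")
ordered.size <- factor(size, levels = c("small", "medium", "large"),
                       ordered = TRUE)

# Comparisons follow the level order, not alphabetical order
ordered.size[1] < ordered.size[2]    # small < large: TRUE
ordered.size >= "medium"             # TRUE where the size is medium or large

max(ordered.size)                    # large
sort(ordered.size)                   # small values first, large last

# ordered() is a shorthand for factor(..., ordered = TRUE)
same <- ordered(size, levels = c("small", "medium", "large"))
identical(same, ordered.size)        # TRUE
```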

*****
