Computational Statistics Using the R Language
(Unit 1) Introduction to Statistics and the R Language
Key Takeaways
Statistics is the study and manipulation of data, including ways to gather, review,
analyze, and draw conclusions from data.
The two major areas of statistics are descriptive and inferential statistics.
Statistics are present in almost every department of every company and are an
integral part of investing.
Applications of Statistics
Now click on the link shown in the image above so that R base starts downloading; afterwards, go back to the main page and click on Install RStudio to download it as well.
Using Terminal
R --version
user@Ubuntu:~$ R
(Note that the R version should be 3.6+ to be able to install all packages like tm, e1071,
etc.) If there is an issue with the R version, see the end of the post.
$ cd Downloads/
$ ls
rstudio-1.2.5042-amd64.deb
Step 6: Test the R Studio using the basic “Hello world!” command
and exit.
Alternatively, RStudio can be installed through Ubuntu Software as well, but using the
above approach generally guarantees the latest version is installed.
If there are issues with the R version being downloaded, or the previously installed
version is older, check the R version with
R --version
Add the key to secure APT from the CRAN package list:
Add the latest CRAN repository to the repository list. (This is for Ubuntu 18.04
specifically):
Conclusion
The number 1 you see in brackets before the 2 (i.e., [1]) is telling you that this line of
results starts with the first result. That fact is obvious here because there is only one
result. To make this idea clearer, let’s show you a result with multiple lines.
Here we see that we created a new object called x , which now appears in our Global
Environment. This gives us another great opportunity to discuss some new
concepts.
First, we created the x object in the Console by assigning the value 2 to the letter x.
We did this by typing “x” followed by a less than symbol (<), a dash symbol (-), and
the number 2. R is kind of unique in this way. I have never seen another programming
language (although I’m sure they are out there) that uses <- to assign values to
variables. By the way, <- is called the assignment operator (or assignment arrow),
and ”assign” here means “make x contain 2” or “put 2 inside x.”
In many other languages you would write that as x = 2 . But, for whatever reason, in R
it is <- . Unfortunately, <- is more awkward to type than = . Fortunately, RStudio
gives us a keyboard shortcut to make it easier. To type the assignment operator in
RStudio, just hold down Option + - (dash key) on a Mac or Alt + - (dash key) on a PC
and RStudio will insert <- complete with spaces on either side of the arrow. This may
still seem awkward at first, but you will get used to it.
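In the Console, the assignment just described looks like the following minimal sketch:

```r
# Assign the value 2 to x with the assignment operator
x <- 2
# Typing the object's name prints its value
x
#> [1] 2
```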
🗒Side Note: A note about using the letter “x”: By convention, the letter “x” is a widely
used variable name. You will see it used a lot in example documents and online.
However, there is nothing special about the letter x. We could have just as easily used
any other letter ( a <- 2 ), word ( variable <- 2 ), or descriptive name ( my_favorite_number
<- 2 ) that is allowed by R.
Second, you can see that our Global Environment now includes the object x , which
has a value of 2. In this case, we would say that x is a numeric vector of length 1
(i.e., it has one value stored in it). We will talk more about vectors and vector types later.
CRAN serves as the primary platform for sharing packages with the R community.
install.packages("package_name")
R
# Installing a package from CRAN
install.packages("ggplot2")
This will download and install the ggplot2 package from CRAN, along with any
dependencies that it requires. Once the package is installed, you can load it into your
R session using the library() function:
R
# Load the installed package
library(ggplot2)
You can also browse the CRAN website (https://cran.r-project.org/) to search for
packages and read their documentation. The website provides information on how to
install packages, as well as news and updates about the R community.
One can make contributions to CRAN which involves submitting new R packages or
updates for review. Developers must adhere to guidelines ensuring proper
documentation and functionality testing. For example, a developer creating a data
visualization package can share it with the R community through CRAN after meeting
the submission requirements.
Conclusion
$ R
This will launch the interpreter and now let’s write a basic Hello World program to get
started.
We can see that “Hello, World!” is being printed on the console. Now we can do the
same thing using print() which prints to the console. Usually, we will write our code
inside scripts which are called
RScripts
in R. To create one, write the below given code in a file and save it as
myFile.R
and then run it in console by writing:
Rscript myFile.R
Syntax of R program
A program in R is made up of three things: variables, comments, and keywords.
Variables are used to store data, comments are used to improve code readability,
and keywords are reserved words that hold a specific meaning for the interpreter.
Variables in R
Previously, we wrote all our code inside print() calls, but we had no way to refer to
those values in order to perform further operations. This problem can be solved by
using variables, which, as in any other programming language, are names given to
reserved memory locations that can store any type of data. In R, assignment can
be denoted in three ways:
1. = (Simple Assignment)
2. <- (Leftward Assignment)
3. -> (Rightward Assignment)
Example:
Output:
"Simple Assignment"
"Leftward Assignment!"
"Rightward Assignment"
The rightward assignment is less common and can be confusing for some
programmers, so it is generally recommended to use the <- or = operator for
assigning values in R.
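The three assignment forms can be sketched as follows; the printed strings match the output shown above:

```r
# Simple assignment with =
a = "Simple Assignment"
# Leftward assignment with <-
b <- "Leftward Assignment!"
# Rightward assignment with -> (value on the left, name on the right)
"Rightward Assignment" -> d
print(a)
#> [1] "Simple Assignment"
print(b)
#> [1] "Leftward Assignment!"
print(d)
#> [1] "Rightward Assignment"
```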
Output:
From the above output, we can see that both comments were ignored by the
interpreter.
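As an illustrative sketch of both single-line comment styles in R (the specific variable and value here are assumptions, chosen only for demonstration):

```r
# This is a comment on its own line and is ignored by the interpreter
x <- 10  # A comment can also follow code on the same line
print(x)
#> [1] 10
```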
Keywords in R
Keywords are words reserved by the language because they have a special meaning;
thus a keyword can't be used as a variable name, function name, etc. We can view
these keywords by using either help(reserved) or ?reserved.
The remaining reserved words are used as constants: TRUE/FALSE are the boolean
constants, NaN denotes a Not-a-Number value, and NULL is used to represent an
undefined value.
Each variable in R has an associated data type. Each R-Data Type requires different
amounts of memory and has some specific operations which can be performed over
it.
1. numeric – (3, 6.7, 121)
3. logical – (TRUE, FALSE)
R Programming language has the following basic R-data types and the following
table shows the data type and the values that each data type can take.
Basic Data Type: Character
Values: "a", "b", "c", …, "@", "#", "$", …, "1", "2", … etc.
Example: character_value <- "Hello Geeks"
Real numbers with a decimal point are represented using this data type in R. It uses a
format for double-precision floating-point numbers to represent numerical values.
R
# A simple R program
x = 5.6
print(class(x))
print(typeof(x))
Output
[1] "numeric"
[1] "double"
R
# A simple R program
y = 5
print(class(y))
print(typeof(y))
Output
When R stores a number in a variable, it stores it as a "double" value by default, i.e.
as a double-precision floating-point number.
This means that a value such as 5 here is stored with a type of double and a class of
numeric. That y is not an integer here can be confirmed with
the is.integer() function.
R
# A simple R program
y = 5
# is y an integer?
print(is.integer(y))
Output
[1] FALSE
You can create as well as convert a value into an integer type using
the as.integer() function.
You can also use the capital ‘L’ notation as a suffix to denote that a particular value is
of the integer R data type.
R
# A simple R program
x = as.integer(5)
print(class(x))
print(typeof(x))
y = 5L
print(class(y))
print(typeof(y))
Output
[1] "integer"
[1] "integer"
[1] "integer"
[1] "integer"
Boolean values, which have two possible values, are represented by this R data type:
FALSE or TRUE
R
# A simple R program
# Sample values
x = 4
y = 3
z = x > y
print(z)
print(class(z))
print(typeof(z))
Output
[1] TRUE
[1] "logical"
[1] "logical"
R
# A simple R program
x = 4 + 3i
print(class(x))
print(typeof(x))
Output
[1] "complex"
[1] "complex"
R
# A simple R program
char = "Geeksforgeeks"
print(class(char))
print(typeof(char))
Output
There are several tasks that can be done using R data types. Let’s understand each
task with its action and the syntax for doing the task along with an R code to illustrate
the task.
R
# Create a raw vector
x <- as.raw(c(1, 2, 3, 4, 5))
print(x)
Output
[1] 01 02 03 04 05
Five elements make up this raw vector x, each of which represents a raw byte value.
Syntax
class(object)
Example
R
# A simple R program
# Logical
print(class(TRUE))
# Integer
print(class(3L))
# Numeric
print(class(10.5))
# Complex
print(class(1+2i))
# Character
print(class("12-04-2020"))
Output
[1] "logical"
[1] "integer"
[1] "numeric"
[1] "complex"
[1] "character"
Type verification
You can verify the data type of an object if you are in doubt about its data type.
To do that, use the prefix "is." before the data type as a command.
Syntax:
is.data_type(object)
Example
R
# A simple R program
# Logical
print(is.logical(TRUE))
# Integer
print(is.integer(3L))
# Numeric
print(is.numeric(10.5))
# Complex
print(is.complex(1+2i))
# Character
print(is.character("12-04-2020"))
print(is.integer("a"))
Output
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
[1] FALSE
[1] FALSE
Syntax
as.data_type(object)
Note: Not all coercions are possible; if an invalid coercion is attempted, it returns an
NA value with a warning.
Example
R
# A simple R program
# Logical to numeric
print(as.numeric(TRUE))
# Integer to complex
print(as.complex(3L))
# Numeric to logical
print(as.logical(10.5))
# Complex to character
print(as.character(1+2i))
# Not possible: returns NA with a warning
print(as.numeric("12-04-2020"))
Output
[1] 1
[1] 3+0i
[1] TRUE
[1] "1+2i"
[1] NA
Warning message:
In print(as.numeric("12-04-2020")) : NAs introduced by coercion
date() Function
date() function is used to return the system's date and time.
Syntax: date()
Parameters:
Does not accept any parameters
Example:
# R program to illustrate
# date function
date()
Output:
Sys.Date() Function
Sys.Date() function is used to return the system’s date.
Example:
# R program to illustrate
# Sys.Date function
Sys.Date()
Output:
[1] "2020-06-11"
Sys.time()
Sys.time() function is used to return the system’s date and time.
Syntax: Sys.time()
Parameters:
Does not accept any parameters
Example:
# R program to illustrate
# Sys.time function
Sys.time()
Output:
Sys.timezone()
Sys.timezone() function is used to return the current time zone.
Syntax: Sys.timezone()
Parameters:
Does not accept any parameters
Example:
# R program to illustrate
# Sys.timezone function
Sys.timezone()
Output:
[1] "Etc/UTC"
R’s base data structures are often organized by their dimensionality (1D, 2D, or nD)
and whether they’re homogeneous (all elements must be of the identical type) or
heterogeneous (the elements can be of various types). This gives rise to the
data structures most frequently utilized in data analysis:
Vectors
Lists
Dataframes
Matrices
Arrays
Factors
Tibbles
Vectors
A vector is an ordered collection of basic data types of a given length. The key
point is that all the elements of a vector must be of the identical data type, i.e.
vectors are homogeneous data structures. Vectors are one-dimensional data structures.
Example:
R
# R program to illustrate a vector
# Vectors (ordered collection of the same data type)
X = c(1, 3, 5, 7, 8)
print(X)
Output:
[1] 1 3 5 7 8
Lists
A list is a generic object consisting of an ordered collection of objects. Lists are
heterogeneous data structures. These are also one-dimensional data structures. A list
can be a list of vectors, list of matrices, a list of characters and a list of functions and
so on.
Example:
R
# R program to illustrate a list
# The first attribute is a numeric vector
# containing the employee IDs, created using the c() command
empId = c(1, 2, 3, 4)
# The second attribute is the employee name,
# which is a character vector
empName = c("Debi", "Sandeep", "Subham", "Shiba")
# The third attribute is the number of employees
numberOfEmp = 4
# We can combine these different data types into a list
# using the list() command
empList = list(empId, empName, numberOfEmp)
print(empList)
Output:
[[1]]
[1] 1 2 3 4
[[2]]
[1] "Debi" "Sandeep" "Subham" "Shiba"
[[3]]
[1] 4
Dataframes
Dataframes are generic data objects in R which are used to store tabular data.
Dataframes are the most popular data objects in R programming because we are
comfortable seeing data in tabular form. They are two-dimensional,
heterogeneous data structures, implemented as lists of vectors of equal length.
A dataframe must have column names, and every row should have a unique row
name.
Example:
R
# R program to illustrate a dataframe
# A character vector
Name = c("Amiya", "Raj", "Asish")
# (Language and Age are further vectors created the same way)
# To create a dataframe use the data.frame() command
# and pass each of the vectors we have created
# as arguments to the function data.frame()
df = data.frame(Name, Language, Age)
print(df)
Output:
Matrices
A matrix is a rectangular arrangement of numbers in rows and columns. In a matrix,
as we know rows are the ones that run horizontally and columns are the ones that run
vertically. Matrices are two-dimensional, homogeneous data structures.
Now, let's see how to create a matrix in R. To create a matrix in R you need to use the
function called matrix(). The arguments to matrix() are the set of elements in a
vector, together with the number of rows and the number of columns you want in your
matrix. The important point to remember is that, by default, matrices are filled in
column-wise order.
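The column-wise default can be seen by contrasting it with byrow = TRUE; the values below are chosen only for illustration:

```r
# Default fill is column-wise: the vector runs down column 1 first
A <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)
print(A)
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

# byrow = TRUE fills row-wise instead
B <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3, byrow = TRUE)
print(B)
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6
```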
Example:
# A 2 x 3 matrix filled column-wise (values chosen for
# illustration; the original data vector was not preserved)
A = matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)
print(A)
Output:
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
Arrays
Arrays are the R data objects which store the data in more than two dimensions.
Arrays are n-dimensional data structures. For example, if we create an array of
dimensions (2, 3, 3) then it creates 3 rectangular matrices each with 2 rows and 3
columns. They are homogeneous data structures.
Now, let’s see how to create arrays in R. To create an array in R you need to use the
function called array(). The arguments to this array() are the set of elements in
vectors and you have to pass a vector containing the dimensions of the array.
Example:
R
# R program to illustrate an array
A = array(
    # The data: eight values filling two 2 x 2 matrices
    c(1, 2, 3, 4, 5, 6, 7, 8),
    # Creating two rectangular matrices,
    # each with two rows and two columns
    dim = c(2, 2, 2)
)
print(A)
Output:
, , 1

     [,1] [,2]
[1,]    1    3
[2,]    2    4

, , 2

     [,1] [,2]
[1,]    5    7
[2,]    6    8
Factors
Factors are the data objects which are used to categorize the data and store it as
levels. They are useful for storing categorical data. They can store both strings and
integers. They are useful to categorize unique values in columns like "TRUE" or
"FALSE", or "MALE" or "FEMALE". They are useful in data analysis for statistical
modeling.
Now, let’s see how to create factors in R. To create a factor in R you need to use the
function called factor(). The argument to this factor() is the vector.
Example:
R
# R program to illustrate factors
# Creating a factor using factor()
fac = factor(c("Male", "Female", "Male", "Male",
               "Female", "Male", "Female"))
print(fac)
Output:
[1] Male   Female Male   Male   Female Male   Female
Levels: Female Male
Tibbles
Tibbles are an enhanced version of data frames in R, part of the tidyverse. They offer
improved printing, stricter column types, consistent subsetting behavior, and allow
variables to be referred to as objects. Tibbles provide a modern, user-friendly
approach to tabular data in R.
Now, let’s see how we can create a tibble in R. To create tibbles in R we can use
the tibble function from the tibble package, which is part of the tidyverse.
Example:
# Load the tibble package (part of the tidyverse)
library(tibble)
# Create a tibble with three columns: name, age, and city
my_data <- tibble(
    name = c("Sandeep", "Amit", "Aman"),
    age = c(25, 30, 35),
    city = c("Pune", "Jaipur", "Delhi")
)
print(my_data)
Output:
Control statements are expressions used to control the execution and flow of the
program based on the conditions provided in the statements. These structures are
used to make a decision after assessing the variable. In this article, we’ll discuss all
the control statements with the examples.
if condition
if-else condition
for loop
nested loops
while loop
repeat loop
return statement
next statement
if condition
This control structure checks whether the expression provided in parentheses is true.
If it is true, the statements in the braces {} are executed.
Syntax:
if(expression){
statements
....
....
}
Example:
x <- 100
if(x > 10){
  print("x is greater than 10")
}
Output:
[1] "x is greater than 10"
if-else condition
It is similar to if condition but when the test expression in if condition fails, then
statements in else condition are executed.
Syntax:
if(expression){
statements
....
....
}
else{
statements
....
....
}
Example:
x <- 5
if(x > 10){
  print("x is greater than 10")
}else{
  print("x is less than or equal to 10")
}
Output:
[1] "x is less than or equal to 10"
for loop
It is a type of loop or sequence of statements executed repeatedly until exit condition
is reached.
Syntax:
for(value in vector){
statements
....
....
}
Example:
x <- letters[4:10]
for(i in x){
  print(i)
}
Output:
[1] "d"
[1] "e"
[1] "f"
[1] "g"
[1] "h"
[1] "i"
[1] "j"
Nested loops
Nested loops are similar to simple loops: nested means a loop inside another loop.
Nested loops are commonly used to traverse and manipulate matrices.
Example:
# Defining matrix
m <- matrix(2:15, 2)
for (r in seq(nrow(m))) {
  for (c in seq(ncol(m))) {
    print(m[r, c])
  }
}
Output:
[1] 2
[1] 4
[1] 6
[1] 8
[1] 10
[1] 12
[1] 14
[1] 3
[1] 5
[1] 7
[1] 9
[1] 11
[1] 13
[1] 15
while loop
The while loop is another kind of loop that iterates as long as a condition is satisfied.
The test expression is checked before each execution of the loop body.
Syntax:
while(expression){
statement
....
....
}
Example:
x = 1
# Print 1 to 5
while(x <= 5){
  print(x)
  x = x + 1
}
Output:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
repeat loop
The repeat loop executes a block of statements repeatedly until the break statement
inside it is executed.
Syntax:
repeat {
statements
....
....
if(expression) {
break
}
}
Example:
x = 1
# Print 1 to 5
repeat{
  print(x)
  x = x + 1
  if(x > 5){
    break
  }
}
Output:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
return statement
return statement is used to return the result of an executed function and returns
control to the calling function.
Syntax:
return(expression)
Example:
# Checks whether a value is positive, negative or zero
func <- function(x){
  if(x > 0){
    return("Positive")
  }else if(x < 0){
    return("Negative")
  }else{
    return("Zero")
  }
}
func(1)
func(0)
func(-1)
Output:
[1] "Positive"
[1] "Zero"
[1] "Negative"
next statement
next statement is used to skip the current iteration without executing the further
statements and continues the next iteration cycle without terminating the loop.
Example:
# Defining vector
x <- 1:10
# Print even numbers only, skipping odd ones with next
for(i in x){
  if(i %% 2 != 0){
    next
  }
  print(i)
}
Output:
[1] 2
[1] 4
[1] 6
[1] 8
[1] 10
Statistical measures like mean, median, and mode are essential for summarizing and
understanding the central tendency of a dataset. In R, these measures can be calculated
easily using built-in functions. This article will provide a comprehensive guide on how to
calculate mean, median, and mode in R Programming Language.
R
# R program to import data into R
# Import the data using read.csv()
myData = read.csv("CardioGoodFitness.csv",
                  stringsAsFactors = F)
Output:
R
# R program to illustrate
# Descriptive Analysis
# Import the data using read.csv()
myData = read.csv("CardioGoodFitness.csv",
                  stringsAsFactors = F)
# Mean of the Age column
mean(myData$Age)
Output:
[1] 28.78889
median(myData$Age)
Output:
[1] 26
mode = function(){
return(sort(-table(myData$Age))[1])
}
mode()
Output:
25: -25
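To see why sort(-table(...))[1] finds the mode, here is a sketch on a small assumed vector (the article itself uses myData$Age, which is not available here):

```r
# Hypothetical ages; 25 occurs most often
ages <- c(25, 25, 26, 28, 25, 30)
# table() counts each value; negating and sorting ascending puts the
# most frequent value first, with its name being the value itself
tab <- sort(-table(ages))
print(tab[1])
# The mode is recovered from the name of the first element
mode_value <- as.numeric(names(tab)[1])
print(mode_value)
#> [1] 25
```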
R
# R program to illustrate
# Descriptive Analysis
# Import the library
library(modeest)
# Most frequent value of Age
mfv(myData$Age)
Output:
[1] 25
R
x <- c(1, 2, NA, 4, 5, NA, 7, 8, NA, 9, 10)
mean(x, na.rm = TRUE)
median(x, na.rm = TRUE)
Output:
[1] 5.75
[1] 6
Adding up all the non-missing data is the first step in calculating the mean of x.
This function, median(x, na.rm = TRUE), finds the median of the non-missing values
in x. Any NA values in x are ensured to be omitted from the calculation by the na.rm =
TRUE option.
var() function in R Language computes the sample variance of a vector. It is a measure
of how far the values are spread from the mean.
# variance of vector
x <- c(1, 2, 3, 4, 5, 6, 7)
var(x)
Output:
[1] 4.666667
Here in the above code, we took an example vector "x" and calculated its variance.
sd() Function
sd() function is used to compute the standard deviation of given values in R. It is the
square root of its variance.
Syntax: sd(x)
x2 <- c(1, 2, 3, 4, 5, 6, 7)
sd(x2)
Output:
[1] 2.160247
Here in the above code, we took an example vector “x2” and calculated its standard
deviation.
The range can be defined as the difference between the maximum and minimum
elements in the given data, the data can be a vector or a dataframe. So we can define the
range as the difference between maximum_value – minimum_value
Syntax:
max(vector)-min(vector)
If a vector contains NA values, then we should use the na.rm argument to exclude
them.
Example:
R
# create vector
data = c(12, 45, NA, NA, 67, 23, 45, 78, NA, 89)
# display
print(data)
# find range
print(max(data, na.rm = TRUE) - min(data, na.rm = TRUE))
Output:
[1] 12 45 NA NA 67 23 45 78 NA 89
[1] 77
Syntax:
max(dataframe$column_name,na.rm=TRUE)-
min(dataframe$column_name,na.rm=TRUE)
where
Example:
R
# create dataframe
data = data.frame(column1=c(12, 45, NA, NA, 67, 23, 45, 78, NA, 89),
column2=c(34, 41, NA, NA, 27, 23, 55, 78, NA, 73))
# display
print(data)
# range of column2
print(max(data$column2, na.rm = TRUE) - min(data$column2, na.rm = TRUE))
Output:
column1 column2
1 12 34
2 45 41
3 NA NA
4 NA NA
5 67 27
6 23 23
7 45 55
8 78 78
9 NA NA
10 89 73
[1] 55
For the range across an entire dataframe:
max(dataframe,na.rm=TRUE)-min(dataframe,na.rm=TRUE)
Example:
R
# create dataframe
data = data.frame(column1=c(12, 45, NA, NA, 67, 23, 45, 78, NA, 89),
column2=c(34, 41, NA, NA, 27, 23, 55, 78, NA, 73))
# display
print(data)
# range across the whole dataframe
print(max(data, na.rm = TRUE) - min(data, na.rm = TRUE))
Output:
column1 column2
1 12 34
2 45 41
3 NA NA
4 NA NA
5 67 27
6 23 23
7 45 55
8 78 78
9 NA NA
10 89 73
[1] 77
Syntax:
range(vector/dataframe)
Example:
R
# create vector
data = c(12, 45, NA, NA, 67, 23, 45, 78, NA, 89)
# display
print(data)
print(range(data, na.rm=TRUE))
Output:
[1] 12 45 NA NA 67 23 45 78 NA 89
[1] 12 89
1. Bar Charts
Bar charts are one of the most common visualization tools, used to represent and
compare categorical data by displaying rectangular bars. A bar chart has an X and a
Y axis, where the X axis represents the categories and the Y axis represents the
values. The height of a bar represents the value for that category on the Y axis;
longer bars indicate higher values. There are various types of bar charts, such as the
horizontal bar chart, stacked bar chart, grouped bar chart, and diverging bar chart.
Ranking: Useful when we have data with categories that need to be ranked from
highest to lowest.
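As a minimal sketch with assumed data, base R's barplot() draws such a chart:

```r
# Hypothetical category counts (assumed data)
sales <- c(Apples = 40, Bananas = 25, Cherries = 35)
# barplot() draws one rectangular bar per category;
# bar height encodes the value on the Y axis
barplot(sales, main = "Fruit Sales", ylab = "Units sold")
```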
2. Line Charts
A line chart (or line graph) is used to represent data over a time series. It
displays data as a series of data points, called markers, connected by
line segments, showing the change between values over time. This chart is
typically used to compare trends, view patterns, or track price movements.
Line charts are also used for comparing trends among multiple data
series.
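A minimal line chart sketch in base R, with assumed data:

```r
# Hypothetical monthly values (assumed data)
months <- 1:6
revenue <- c(10, 12, 9, 15, 18, 17)
# type = "b" draws markers connected by line segments
plot(months, revenue, type = "b", main = "Revenue over time",
     xlab = "Month", ylab = "Revenue")
```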
3. Pie Charts
A pie chart is a circular data visualization tool that is divided into slices to
represent numerical proportions or percentages of a whole. Each slice in a pie chart
corresponds to a category in the dataset, and the angle of the slice is
proportional to the share it represents. Pie charts are only effective with a small
number of categories. The simple pie chart and the exploded pie chart are common
variants.
Pie charts are used to display categorical data to show the proportion of parts to
the whole, i.e. how different categories make up a total quantity.
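A minimal pie chart sketch with assumed shares:

```r
# Hypothetical shares of a whole (assumed data)
shares <- c(A = 50, B = 30, C = 20)
# Each slice angle is proportional to its share of sum(shares)
pie(shares, main = "Category shares")
```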
4. Scatter Charts
Scatter charts are excellent for exploring the relationship between numerical variables
and for identifying trends, outliers, and subgroup differences.
They are used when we have two sets of numerical data to plot as one collection
of X and Y coordinates.
Scatter charts are best used for identifying outliers or unusual observations in
your data.
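A minimal scatter chart sketch with assumed paired data:

```r
# Hypothetical paired numerical observations (assumed data)
x <- c(1.2, 2.5, 3.1, 4.8, 5.0, 6.3)
y <- c(2.0, 4.1, 6.5, 9.7, 10.2, 12.9)
# Each point is one (x, y) observation
plot(x, y, main = "Scatter chart", xlab = "x", ylab = "y")
```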
5. Histogram
A histogram represents the distribution of numerical data by dividing it into
intervals (bins) and displaying the frequency of data in each bin as a bar. It is
commonly used to visualize the underlying distribution of a dataset and to discover
patterns such as skewness, central tendency, and variability. Histograms are
valuable tools for exploring data distributions, detecting outliers, and assessing
data quality.
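A minimal histogram sketch, here on simulated data:

```r
# 1000 draws from a normal distribution (simulated data)
set.seed(1)
values <- rnorm(1000, mean = 50, sd = 10)
# hist() divides the range into bins and shows the frequency per bin
hist(values, breaks = 20, main = "Histogram", xlab = "Value")
```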
Identify Outliers: Box plots help identify outliers and extreme values within
datasets, aiding in data cleansing and anomaly detection.
Design principles are crucial in any context because they provide a foundational
framework for creating effective journeys and helping to make choices that enhance
the overall user experience and communication.
Effective data visualization relies on 12 key design principles that help convey
information accurately and efficiently. The key principles are:
1. Clarity
The visualization should be clear and easily understood by the intended audience.
2. Simplicity
Understand what message or insight you want to communicate and design for that
purpose.
4. Consistency
5. Contextualization
Provide context for the data being presented.
6. Accuracy
7. Visual Encoding
Choose appropriate visual encodings for the data types you are visualizing.
8. Intuitiveness
9. Interactivity
Although aesthetics are subjective, a visually appealing design can engage viewers
and increase their interest in the data.
12. Hierarchy
Work out hierarchy of information early on and always remind yourself of what the
purpose of representing the data is.
Ultimately, design principles play a pivotal role in streamlining the design process, a
facet of their significance that extends far beyond the realms of aesthetics. By
adhering to these principles, designers and creators can ensure that their work is not
only visually pleasing but also thoughtful, impactful, and harmonious for the end user.
Probability
Probability is the branch of mathematics that is concerned with the chances of
occurrence of events and possibilities. Further, it gives a method of measuring the
probability of uncertainty and predicting events in the future by using the available
information.
Probability is a measure of how likely an event is to occur. It ranges from 0 (impossible
event) to 1 (certain event). The probability of an event A is written P(A).
Basic Concepts
Experiment: An action or process that leads to one or more outcomes. For example,
tossing a coin.
Sample Space (S): The set of all possible outcomes of an experiment. For a coin toss,
S={Heads, Tails}.
Event: A subset of the sample space. For instance, getting a head when tossing a
coin.
Table of Content
Probability
Properties of Probability
Conditional Probability
Bayes’ Theorem
Independence of Events
Probability Distributions
Applications of Probability
Axioms of Probability
There are three axioms that form the basis of probability:
1. Non-Negativity: The probability of any event A is non-negative:
P(A) ≥ 0
2. Normalization: The total probability of all possible outcomes of the sample space
S is one:
P(S) = 1
3. Additivity: For any two mutually exclusive events A and B (i.e., events that
cannot occur simultaneously), the probability of their union is the sum of their
individual probabilities:
P(A ∪ B) = P(A) + P(B)
Properties of Probability
Properties of Probability: Probability is a branch of mathematics that specifies how
likely an event can occur. The value of a probability is between 0 and 1: zero (0)
indicates an impossible event, and one (1) indicates a certain event.
Non-Negativity: The probability of any event is always non-negative. For any event
A,
P(A) ≥ 0.
Normalization: The total probability of the sample space S is one:
P(S) = 1.
Additivity (Sum Rule): For any two mutually exclusive (disjoint) events A and B, the
probability of their union is the sum of their individual probabilities:
P(A ∪ B) = P(A) + P(B)
Conditional Probability: The probability of A given that B has occurred is:
P(A | B) = P(A ∩ B) / P(B), provided P(B) > 0.
Multiplication Rule: For any two events A and B, the probability of both occurring
(i.e., the intersection) is:
P(A ∩ B) = P(A | B) ⋅ P(B)
Example: We can notice that in all the above examples the probability is always between
0 and 1.
Example: Getting a head and getting a tail when a coin is tossed are mutually exclusive
events.
P(A) + P(A') = 1
Example: When a coin is tossed, the probability of getting a head is 1/2, and the
complementary event of getting a head is getting a tail, so the probability of getting a tail
is 1/2.
8. If A and B are 2 events that are not mutually exclusive events then
P(AUB)=P(A)+P(B)-P(A∩B)
P(A∩B)=P(A)+P(B)-P(AUB)
Note: 2 events are said to be mutually not exclusive when they have at least one common
outcome.
Example: What is the probability of getting an even number or less than 4 when a die is
rolled?
Solution:
P(Even) = 3/6 = 1/2
P(Number < 4) = 3/6 = 1/2
P(Even and Number < 4) = 1/6 (only the number 2)
P(Even or Number < 4) = 1/2 + 1/2 - 1/6 = 5/6
Example: What is the probability of getting a 1, 2, or 3 when a fair die is rolled?
Solution:
P(A)=1/6
P(B)=1/6
P(C)=1/6
P(A U B U C)=(1/6)+(1/6)+(1/6)=3/6=1/2
Conditional Probability
Conditional probability quantifies the probability of an event A given that another event B
has occurred. It is defined as:
P(A | B) = P(A ∩ B) / P(B), provided P(B) > 0.
By the law of total probability, if B1, B2, …, Bn partition the sample space, then:
P(A) = Σi P(A | Bi) P(Bi)
Bayes’ Theorem
Bayes’ Theorem provides a way to update the probability of a hypothesis H based on new
evidence E:
P(H | E) = P(E | H) P(H) / P(E)
where P(H) is the prior probability of the hypothesis, P(E | H) is the likelihood of the
evidence given the hypothesis, and P(E) is the marginal likelihood of the evidence.
Independence of Events
Two events A and B are said to be independent if the occurrence of one does not affect
the probability of the occurrence of the other:
P(A∩B)=P(A)⋅P(B)
Discrete Random Variables: These are countable in the sense that they take on one
of a finite or countably infinite set of values.
For a discrete random variable X with possible values x1, x2, … and corresponding
probabilities p1, p2, …, the expected value is:
E(X) = Σi xi ⋅ P(X = xi)
A continuous random variable is instead described by a probability density function f(x),
and its standard deviation is the square root of its variance:
σX = √Var(X)
Probability Distributions
This basically gives the manner in which probabilities are spread over the values of the
random variable involved. Some common distributions include:
Discrete Distributions
Binomial Distribution
Poisson Distribution
Continuous Distributions
Normal Distribution
Exponential Distribution
Example 2: A coin is tossed. Verify that the sum of probabilities of all outcomes is 1.
Example 4: The probability of rain tomorrow is 0.3. What is the probability it won’t
rain?
Example 5: In a standard deck, compare P(drawing a king) and P(drawing a face card).
Example 6: In a class, 60% of students play soccer, 30% play basketball, and 20%
play both. What percentage plays either soccer or basketball?
Example 7: A fair coin is tossed twice. What’s the probability of getting heads both
times?
Solution: P(H on first toss) = 1/2, P(H on second toss) = 1/2
P(H and H) = 1/2 × 1/2 = 1/4
Example 8: In a deck of 52 cards, what’s the probability of drawing a king, given that
it’s a face card?
Example 9: 30% of students are in Science. 80% of Science students and 60% of non-
Science students wear glasses. What percentage of all students wear glasses?
Example 10: 1% of people have a certain disease. The test for this disease is 95%
accurate (both for positive and negative results). If a person tests positive, what’s the
probability they have the disease?
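Example 10 can be worked with Bayes' Theorem; the following sketch uses only the numbers stated in the problem:

```r
prior <- 0.01   # P(disease)
sens  <- 0.95   # P(positive | disease)
spec  <- 0.95   # P(negative | no disease)
# P(positive) by the law of total probability
p_pos <- prior * sens + (1 - prior) * (1 - spec)
# P(disease | positive) by Bayes' Theorem
posterior <- prior * sens / p_pos
print(posterior)   # about 0.161, despite the "95% accurate" test
```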
The probability mass function gives the probability of each value: f(x) = P(X = x)
The cumulative distribution function gives the probability up to a value: F(x) = P(X ≤ x)
Step 2: Define random variable X as the event for which the probability has to be
found.
Step 3: Consider the possible values of x and find the probabilities for each value.
Step 4: Write all the values of x and their respective probabilities in tabular form to
get the discrete probability distribution.
Bernoulli Distribution
Binomial Distribution
Poisson Distribution
Geometric Distribution
Binomial Distribution
A discrete probability distribution that involves a number of trials n, a probability of
success p, and a probability of failure (1 - p) is called the Binomial distribution. The
probability mass function of the Binomial distribution is given by:
P(X = k) = C(n, k) p^k (1 - p)^(n - k)
There are only two possible outcomes: success or failure, yes or no, true or false.
dbinom()
dbinom(k, n, p)
pbinom()
pbinom(k, n, p)
qbinom()
qbinom(P, n, p)
rbinom()
rbinom(n, N, p)
dbinom() Function
This function is used to find the probability at a particular value for data that follows
the binomial distribution, i.e. it finds:
P(X = k)
Syntax:
dbinom(k, n, p)
Example:
# Assumed parameters for illustration: n = 10 trials, p = 0.5
# probability at k = 3
dbinom(3, 10, 0.5)
# probability distribution for k = 0 to n
x <- 0:10
probs <- dbinom(x, 10, 0.5)
data.frame(x, probs)
Output :
The above piece of code first finds the probability at k=3, then it displays a data frame
containing the probability distribution for k from 0 to 10 which in this case is 0 to n.
pbinom() Function
The function pbinom() is used to find, for data following the binomial distribution, the
cumulative probability up to a given value, i.e. it finds
P(X <= k)
Syntax:
pbinom(k, n, p)
Example:
Output :
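A runnable pbinom() sketch, with the same assumed parameters (n = 10, p = 0.5):

```r
# Cumulative probability P(X <= 3) for a binomial with n = 10, p = 0.5
pbinom(3, size = 10, prob = 0.5)   # 0.171875

# Equivalent to summing the individual probabilities
sum(dbinom(0:3, size = 10, prob = 0.5))
```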
Syntax:
qbinom(P, n, p)
Example:
Output :
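qbinom() is the inverse of pbinom(): it returns the smallest k whose cumulative probability reaches P. A sketch with assumed parameters n = 10, p = 0.5:

```r
# Smallest k with P(X <= k) >= 0.5 for a binomial with n = 10, p = 0.5
qbinom(0.5, size = 10, prob = 0.5)   # 5, the median of the distribution

# qbinom() undoes pbinom()
qbinom(pbinom(3, 10, 0.5), size = 10, prob = 0.5)   # recovers 3
```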
Syntax:
rbinom(n, N, p)
Example:
Output:
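And rbinom() for random generation; the seed below is set only so the draws are reproducible:

```r
set.seed(123)   # reproducible draws
# 8 random values, each the number of successes in N = 10 trials with p = 0.5
samples <- rbinom(8, size = 10, prob = 0.5)
samples
```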
The Poisson distribution is a probability distribution that can be used to model events that
are rare and independent, and occur over a given time or space. Some properties and
applications of the Poisson distribution include:
R
dpois(2, 3)
dpois(6, 6)
Output:
[1] 0.2240418
[1] 0.1606231
R
ppois(2, 3)
ppois(6, 6)
Output:
[1] 0.4231901
[1] 0.6063028
R
rpois(2, 3)
rpois(6, 6)
Output:
[1] 2 3
[1] 6 7 6 10 9 4
R
y <- c(.01, .05, .1, .2)
qpois(y, 2)
qpois(y, 6)
Output:
[1] 0 0 0 1
[1] 1 2 3 4
In this article, we'll look into Real Life Applications of Continuous Probability Distribution.
Distributions that are continuous are commonly defined by probability density functions
(PDFs), which express the probability of the variable assuming a given value within its
range.
Key Points
Continuous Random Variable: In a continuous probability distribution, the random
variable can take on any value within a specified interval or range. This means that
the variable can theoretically assume an infinite number of values within that range.
For example:
f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))
Mean μ of a continuous random variable X is its average value, and it is given by:
μ = ∫_{−∞}^{∞} x · f(x) dx
4. Variance
σ² = ∫_{−∞}^{∞} (x − μ)² · f(x) dx
Definition:
Properties:
1. Symmetry: The curve is symmetrical about the mean, meaning half the data points
lie to the left of the mean and half to the right.
2. Mean, Median, and Mode: In a normal distribution, the mean, median, and mode are
all equal.
3. Standard Deviation: The standard deviation determines the spread of the data. A
larger standard deviation indicates a wider spread, while a smaller standard deviation
indicates a narrower spread.
4. Empirical Rule: Approximately 68% of the data falls within one standard deviation of
the mean, 95% within two standard deviations, and 99.7% within three standard
deviations.
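The empirical rule can be verified directly with pnorm() on the standard normal:

```r
# Probability mass within 1, 2 and 3 standard deviations of the mean
within_1 <- pnorm(1) - pnorm(-1)   # ~0.6827
within_2 <- pnorm(2) - pnorm(-2)   # ~0.9545
within_3 <- pnorm(3) - pnorm(-3)   # ~0.9973
c(within_1, within_2, within_3)
```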
Applications:
The normal distribution is widely used in various fields due to its versatility and the
frequency with which it appears in natural phenomena. Here are some key applications:
Statistics:
Hypothesis testing
Confidence intervals
Regression analysis
Finance:
Risk assessment
Portfolio management
Engineering:
Quality control
Reliability analysis
Signal processing
Natural Sciences:
Physics
Chemistry
Biology
Social Sciences:
Psychology
Sociology
Economics
Central Limit Theorem: This theorem states that the distribution of sample means
approaches a normal distribution as the sample size increases, regardless of the
underlying population distribution. This makes the normal distribution a powerful tool
for statistical inference.
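A quick simulation (a hypothetical setup, not from the text) makes the theorem concrete: means of samples drawn from a skewed exponential population are approximately normal:

```r
set.seed(42)
# Population: exponential with rate 1 (clearly non-normal).
# Draw 1000 samples of size 50 and record each sample's mean.
sample_means <- replicate(1000, mean(rexp(50, rate = 1)))

mean(sample_means)   # close to the population mean 1/rate = 1
sd(sample_means)     # close to the theoretical SE 1/sqrt(50) ~ 0.141

# The histogram of the sample means is roughly bell-shaped
hist(sample_means, main = "Sampling Distribution of the Mean")
```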
R provides several functions to work with the normal distribution. Here are the key ones:
1. dnorm()
Example:
Code snippet
# Probability density at x = 1 for a standard normal distribution
dnorm(1)
2. pnorm()
Example:
Code snippet
# Probability of a value less than 1.96 in a standard normal distribution
pnorm(1.96)
3. qnorm()
p : The probability.
Example:
Code snippet
# 95th percentile of a normal distribution with mean 100 and SD 15
qnorm(0.95, mean = 100, sd = 15)
4. rnorm()
Example:
Code snippet
# Generate 10 random numbers from a normal distribution with mean 50 and SD 10
rnorm(10, mean = 50, sd = 10)
Definition:
The exponential distribution is a continuous probability distribution that models the time
elapsed between events in a Poisson process. This means it's used to describe the time
between occurrences of events that happen at a constant average rate.
Properties:
1. Memorylessness: P(X > s + t | X > s) = P(X > t); the time already waited does not change the distribution of the remaining waiting time.
2. Positive Support: The exponential distribution is defined only for positive values, as it models time intervals.
3. Mean and Variance: The mean and variance of an exponential distribution with rate parameter λ are:
Mean: 1/λ
Variance: 1/λ²
Applications:
1. Reliability Engineering:
2. Queueing Theory:
3. Telecommunications:
4. Physics:
5. Finance:
6. Biology:
In R:
By understanding the properties and applications of the exponential distribution, you can
effectively model and analyze a variety of real-world phenomena.
1. dexp(x, rate = 1)
Example:
Code snippet
# Probability density at x = 2 for a rate parameter of 0.5
dexp(2, rate = 0.5)
2. pexp(q, rate = 1)
Example:
Code snippet
# Probability of a value less than 3 for a rate parameter of 1
pexp(3, rate = 1)
3. qexp(p, rate = 1)
p is the probability.
Example:
Code snippet
# 90th percentile of an exponential distribution with rate parameter 2
qexp(0.9, rate = 2)
4. rexp(n, rate = 1)
Example:
Code snippet
# Generate 10 random numbers from an exponential distribution with rate parameter 1.5
rexp(10, rate = 1.5)
By understanding and utilizing these functions, you can effectively work with exponential
distributions in R for various statistical analyses and simulations.
Binomial Distribution:
Code snippet
# Generate 10 random numbers from a binomial distribution
# with 10 trials and probability of success 0.5
rbinom(n = 10, size = 10, prob = 0.5)
Poisson Distribution:
Code snippet
# Generate 5 random numbers from a Poisson distribution with a rate parameter of 2
rpois(n = 5, lambda = 2)
Geometric Distribution:
Code snippet
# Generate 20 random numbers from a geometric distribution with probability of success 0.3
rgeom(n = 20, prob = 0.3)
Negative Binomial Distribution:
Code snippet
# Generate 15 random numbers from a negative binomial distribution
# with 5 successes and probability of success 0.2
rnbinom(n = 15, size = 5, prob = 0.2)
Hypergeometric Distribution:
Code snippet
# Generate 10 random numbers from a hypergeometric distribution with a
# population size of 20, number of successes in the population of 10,
# and sample size of 5
rhyper(nn = 10, m = 10, n = 10, k = 5)
Continuous Distributions:
Normal Distribution:
Code snippet
# Generate 20 random numbers from a normal distribution with mean 10 and standard deviation 2
rnorm(n = 20, mean = 10, sd = 2)
Uniform Distribution:
Code snippet
# Generate 15 random numbers from a uniform distribution between 0 and 1
runif(n = 15, min = 0, max = 1)
Exponential Distribution:
Code snippet
# Generate 10 random numbers from an exponential distribution with rate parameter 0.5
rexp(n = 10, rate = 0.5)
Gamma Distribution:
Code snippet
# Generate 8 random numbers from a gamma distribution with shape parameter 2 and rate parameter 1
rgamma(n = 8, shape = 2, rate = 1)
Beta Distribution:
Code snippet
# Generate 12 random numbers from a beta distribution with shape parameters 2 and 3
rbeta(n = 12, shape1 = 2, shape2 = 3)
Key Points:
The arguments n , mean , sd , lambda , prob , etc., specify the parameters of the
distribution.
You can adjust the parameters to generate random samples with different
characteristics.
By understanding these functions and their parameters, you can effectively generate
random samples from various probability distributions in R to conduct simulations,
statistical analysis, and data modeling.
Usage
Arguments
data (Numeric Vector). A vector of numbers that can be inputted to estimate the parameters of the distributional forms.
distr (String). The distribution to be fitted. Right now only norm or mnorm is supported.
init (List). Initialization parameters for each distribution. For mixtures, each named element in the list should be a vector with length equal to the number of components.
args_list (List). Named list of additional arguments passed on to fitdist and normalmixEM.
... Other parameters passed to fitdistrplus or normalmixEM.
Details
The package fitdistrplus is used to estimate parameters of the normal distribution, while the normalmixEM function (from the mixtools package) is used to estimate parameters of the mixture normal distribution.
So far we suggest only estimating two components for the mixture normal distribution.
For default options, we mostly use the defaults from the packages themselves. The only difference is the mixture normal distribution, where the convergence parameters were loosened and more iterations were allowed for convergence.
Value
A named list with all the parameter names and values
Method of moments:
Calculate sample moments (mean, variance, etc.) and equate them to theoretical
moments of the distribution.
Maximum likelihood estimation:
Find the parameters that maximize the likelihood function of the data.
Bayesian estimation:
Combine prior information about the parameters with the data to obtain posterior
distribution.
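The first two approaches can be sketched in base R for a normal sample; the simulated data and starting values below are illustrative, and the Bayesian route is omitted for brevity:

```r
set.seed(7)
x <- rnorm(500, mean = 10, sd = 2)   # simulated data with known parameters

# Method of moments: equate sample mean/variance to mu and sigma^2
mu_mom    <- mean(x)
sigma_mom <- sqrt(mean((x - mu_mom)^2))
c(mu_mom, sigma_mom)

# Maximum likelihood: minimize the negative log-likelihood with optim();
# sigma is optimized on the log scale so it stays positive
negloglik <- function(par) {
  -sum(dnorm(x, mean = par[1], sd = exp(par[2]), log = TRUE))
}
fit <- optim(c(mean(x), log(sd(x))), negloglik)
c(mu_mle = fit$par[1], sigma_mle = exp(fit$par[2]))
```

For the normal distribution the two methods give nearly identical answers; they differ more for skewed or heavy-tailed families.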
Percentiles are measures of relative position, indicating the value below which a given percentage of the data lies. In R, the quantile() function gets the job done.
R
x<-c(2,13,5,36,12,50)
res<-quantile(x,probs=0.5)
res
Output:
50%
12.5
R
x<-c(2,13,5,36,12,50)
res<-quantile(x,probs=c(0.5,0.75))
res
Output:
50% 75%
12.50 30.25
R
df<-data.frame(x=c(2,13,5,36,12,50),
y=c('a','b','c','c','c','b'))
res<-quantile(df$x,probs=c(0.35,0.7))
res
Output:
35% 70%
10.25 24.50
df<-data.frame(x=c(2,13,5,36,12,50),
y=c('a','b','c','c','c','b'),
z=c(2.1,6,3.8,4.8,2.2,1.1))
sub_df<-df[,c('x','z')]
res<-apply(sub_df,2,quantile,probs=0.5)
res
Output:
x z
12.5 3.0
R
library(dplyr)
df<-data.frame(x=c(2,13,5,36,12,50),
y=c('a','b','c','c','c','b'))
res<-df %>% group_by(y) %>% summarize(res=quantile(x,probs=0.5))
res
Output:
A tibble: 3 x 2
y res
<chr> <dbl>
a 2
b 31.5
c 12
R
df<-data.frame(x=c(2,13,5,36,12,50),
y=c('a','b','c','c','c','b'),
z=c(2.1,6,3.8,4.8,2.2,1.1))
n<-length(df$x)
# plot the sorted values against their empirical cumulative probabilities
plot((1:n - 0.5)/n, sort(df$x), type = "b",
xlab = "Probability",
ylab = "Value")
Output:
Probabilities using R
Sample Space (S): The set of all possible outcomes of a random experiment.
Calculating Probabilities in R
R offers various functions and packages for calculating Probability in R and performing
statistical analyses. Some commonly used functions include:
dbinom(): Computes the probability mass function (PMF) for the binomial distribution.
pnorm(): Calculates the cumulative distribution function (CDF) for the normal
distribution.
R
# Define the sample space
sample_space <- c(1, 2, 3, 4, 5, 6)

# Define an event, for example, rolling an even number
event <- c(2, 4, 6)

# Probability = favourable outcomes / total outcomes
probability <- length(event) / length(sample_space)
print(probability)
Output:
[1] 0.5
Probability Distributions in R
R provides extensive support for probability distributions, which are mathematical
functions that describe the likelihood of different outcomes in a random experiment.
Common probability distributions include:
Let’s visualize the normal distribution with a mean of 0 and standard deviation of 1.
R
library(ggplot2)

# Sequence of x values
x <- seq(-4, 4, length.out = 200)

# Calculate the corresponding densities for normal distribution
y <- dnorm(x, mean = 0, sd = 1)

# Plot the density curve
ggplot(data.frame(x, y), aes(x, y)) + geom_line()
Output:
Normal Distribution
R
# Simulating coin flips with a binomial distribution
num_flips <- 1000
num_heads <- sum(rbinom(num_flips, size = 1, prob = 0.5))
probability_heads <- num_heads / num_flips
print(probability_heads)
Output:
[1] 0.494
Visualizing Probabilities in R
Visualization is essential for gaining insights from Probability in R and it offers numerous
packages such as ggplot2, lattice, and base graphics for creating visualizations. Common
plots include histograms, density plots, boxplots, and scatter plots, which help in
understanding the shape and characteristics of probability distributions.
R
# Visualizing the binomial distribution of coin flips
flips <- rbinom(1000, size = 10, prob = 0.5)
hist(flips, breaks = seq(-0.5, 10.5, by = 1), col = "lightgreen",
main = "Binomial Distribution of Coin Flips", xlab = "Number of Heads",
ylab = "Frequency")
Output:
Basic Plots
1. Histogram:
Useful for understanding the shape, center, and spread of the data.
Code snippet
hist(x, breaks = "Sturges", main = "Histogram of X", xlab = "X")
2. Density Plot:
Code snippet
plot(density(x), main = "Density Plot of X", xlab = "X")
Advanced Plots
1. QQ-Plot (Quantile-Quantile Plot):
Used to assess whether the data follows a specific distribution (e.g., normal,
exponential).
Code snippet
qqnorm(x)
qqline(x)
2. ECDF Plot (Empirical Cumulative Distribution Function):
Shows the cumulative probability of a random variable being less than or equal to a
certain value.
Code snippet
plot(ecdf(x), main = "ECDF of X", xlab = "X", ylab = "Cumulative Probability")
Code snippet
library(ggplot2)
# QQ-Plot
ggplot(data.frame(x), aes(sample = x)) +
stat_qq() +
stat_qq_line() +
labs(title = "QQ-Plot of X")
# Box Plot
ggplot(data.frame(x), aes(y = x)) +
geom_boxplot() +
labs(title = "Boxplot of X", y = "X")
Key Considerations:
Data Cleaning and Preparation: Ensure data is clean and free from outliers or
missing values.
Choice of Plot: Select the appropriate plot type based on the data type (continuous
or discrete) and the desired insights.
Customization: Use R's plotting functions and ggplot2 to customize the appearance
of plots (colors, labels, themes).
What is sampling?
A sample is a subset of individuals from a larger population. Sampling means selecting
the group that you will actually collect data from in your research. For example, if you are
researching the opinions of students in your university, you could survey a sample of 100
students.
In statistics, sampling allows you to test a hypothesis about the characteristics of a
population.
Sampling Distribution
Sampling distribution is essential in various aspects of real life. Sampling distributions are
important for inferential statistics. A sampling distribution represents the distribution of a
statistic, like the mean or standard deviation, which is calculated from multiple samples of
a population. It shows how these statistics vary across different samples drawn from the
same population.
In this article, we will discuss the Sampling Distribution in detail and its types along with
examples and go through some practice questions too.
Confidence Interval: A range of values calculated from sample data that is likely to
contain the population parameter with a certain level of confidence.
1. Number Observed in a Population: The symbol for this variable is "N." It is the
measure of observed activity in a given group of data.
2. Number Observed in the Sample: The symbol for this variable is "n." It is the size of
each sample drawn from the population.
3. Method of Choosing Sample: How you chose the samples can account for variability
in some cases.
Z = (x - μ)/(σ/√n)
where,
z is z-score
x is Value being Standardized (either an individual data point or the sample mean)
μ is Population Mean
σ is Population Standard Deviation
n is Sample Size
This formula quantifies how many standard deviations a data point (or sample mean) is
away from the population mean. Positive z-scores indicate values above the mean, while
negative z-scores indicate values below the mean. The variate Z follows the standard
normal distribution, with mean 0 and unit variance.
According to the central limit theorem, the sampling distribution of the sample means
tends to normal distribution as sample size tends to large (n > 30).
Syntax: sd(data)/sqrt(length(data))
print(sd(a)/sqrt(length(a)))
Output:
[1] 26.20274
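With a concrete vector the formula reads as follows (the data here are hypothetical; the output above came from a different, unshown vector):

```r
# Hypothetical sample
a <- c(12, 15, 9, 20, 14, 17, 11, 18)

# Standard error of the mean: sd / sqrt(n)
se <- sd(a) / sqrt(length(a))
se   # about 1.32
```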
Confidence Intervals
1. Definition and Purpose of Confidence Intervals
A confidence interval (CI) is a range of values derived from sample data that estimates
an unknown population parameter with a given level of confidence.
Common Confidence Levels: Common confidence levels include 90%, 95%, and
99%. For example, a 95% CI means that if we were to take multiple samples and
calculate the CI for each, about 95% of those intervals would contain the true
population parameter.
Interpretation: A 95% confidence interval of (a, b) for a population mean implies that we
are 95% confident the true mean lies between a and b. Note that this does not mean
there's a 95% probability that the true mean is within the interval—it means that if
repeated samples were taken, 95% of those intervals would capture the true mean.
For calculating confidence intervals for the population mean, we use the t-distribution
when the sample size is small (n < 30) or when the population standard deviation is
unknown.
Formula:
\[
\text{CI} = \bar{x} \pm t_{\alpha/2, df} \times \frac{s}{\sqrt{n}}
\]
Where:
\(\bar{x}\): sample mean
\(s\): sample standard deviation
\(n\): sample size
\(t_{\alpha/2, df}\): t-value for the desired confidence level and degrees of freedom
(df = n - 1)
Example Calculation
Suppose we have a sample of 20 observations with a sample mean (\(\bar{x}\)) of 50
and a standard deviation (s) of 10. For a 95% CI:
qt() : Gets the critical t-value based on confidence level and degrees of freedom.
Manual Calculation: You can also calculate CIs manually in R using the formula.
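For the sample described above (x̄ = 50, s = 10, n = 20), the manual calculation with qt() is:

```r
x_bar <- 50; s <- 10; n <- 20

# Two-sided 95% critical value from the t-distribution, ~2.093
t_crit <- qt(0.975, df = n - 1)

# Margin of error and interval bounds
margin <- t_crit * s / sqrt(n)
c(lower = x_bar - margin, upper = x_bar + margin)   # roughly (45.32, 54.68)
```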
Example Calculation
Suppose 60 out of 100 people in a survey favor a policy. Here, \(\hat{p} = 0.60\) and \(n =
100\). For a 95% CI:
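In R, the corresponding normal-approximation (Wald) interval for this proportion is:

```r
p_hat <- 0.60; n <- 100

z <- qnorm(0.975)                        # ~1.96 for a 95% CI
se <- sqrt(p_hat * (1 - p_hat) / n)      # standard error of the proportion
c(lower = p_hat - z * se, upper = p_hat + z * se)   # roughly (0.504, 0.696)
```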
1. Bell Curve with Shaded Confidence Region: A normal distribution curve showing the
confidence interval range.
The blue shaded area represents the 95% confidence interval; under repeated
sampling, 95% of intervals constructed this way would contain the true mean.
The red dashed lines mark the lower and upper bounds of this confidence interval.
The blue dashed line at the center represents the sample mean.
Hypothesis Testing
The results of the analysis are used to decide whether the claim is true or not.
By employing hypothesis testing in data analytics and other fields, practitioners can
rigorously evaluate their assumptions and derive meaningful insights from their
analyses.
Defining Hypotheses
Null hypothesis (H0): In statistics, the null hypothesis is a general statement or
default position that there is no relationship between two measured cases or no
relationship among groups. In other words, it is a basic assumption made based on
problem knowledge. Example: A company’s mean production is 50 units per day, i.e.
H0: μ = 50.
One-Tailed Test
There are two types of one-tailed test:
Left-Tailed (Left-Sided) Test: The alternative hypothesis asserts that the true
parameter value is less than the null hypothesis value. Example: H0: μ ≥ 50 and H1: μ < 50
Right-Tailed (Right-Sided) Test: The alternative hypothesis asserts that the true
parameter value is greater than the null hypothesis value. Example: H0: μ ≤ 50 and H1: μ > 50
Two-Tailed Test
A two-tailed test considers both directions, greater than and less than a specified
value. We use a two-tailed test when there is no specific directional expectation and we
want to detect any significant difference.
Example: H0: μ = 50 and H1: μ ≠ 50
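These three hypothesis shapes map directly onto the alternative argument of t.test(); the sample below is simulated purely for illustration:

```r
set.seed(1)
x <- rnorm(25, mean = 52, sd = 4)   # hypothetical daily production figures

t.test(x, mu = 50, alternative = "two.sided")  # H0: mu = 50 vs H1: mu != 50
t.test(x, mu = 50, alternative = "less")       # left-tailed test
t.test(x, mu = 50, alternative = "greater")    # right-tailed test
```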
Type I error: When we reject the null hypothesis, although that hypothesis was true.
Type I error is denoted by alpha(α).
Type II error: When we fail to reject the null hypothesis although it is false. Type II
errors are denoted by beta(β).
Chi-Square Test
The Chi-Square Test for Independence is used for categorical data (non-normally distributed):
χ² = Σ (Oᵢⱼ − Eᵢⱼ)² / Eᵢⱼ
where,
Oᵢⱼ is the observed frequency in cell (i, j)
Eᵢⱼ = (Row total × Column total) / Total observations
T-Statistics
A t-test is used when n < 30. The t-statistic is given by:
t = (x̄ − μ) / (s / √n)
where,
t = t-score
x̄ = sample mean
μ = population mean
s = sample standard deviation
n = sample size
Output:
The chi-square statistic here is 1.33, which measures the discrepancy between the observed
frequencies and the expected frequencies under the null hypothesis. The value is small,
indicating little difference between them.
Degrees of Freedom: This shows the number of independent pieces available for
estimation. The formula for calculating this is = number of categories -1. Here, 6
categories are present therefore, df will be 5 which is enough to make a decision.
P-value: A high p-value suggests that the observed frequencies are consistent with
the expected frequencies, and we fail to reject the null hypothesis.
We can also plot the graph to see the difference between the values. To plot graph we
need to load "dplyr" package in R programming language.
R
# install packages
install.packages("dplyr")

# Calculate deviations between observed and expected frequencies
data_plot <- data_plot %>%
mutate(deviation = Observed - Expected)
Output:
R
# load dataset
data <- read.csv('path/to/your/file.csv')
Output:
data: cont_table
X-squared = 60493, df = 5, p-value < 2.2e-16
We can also plot these values with the help of ggplot2 library in R
R
# Extract observed and expected frequencies from the contingency table
observed <- as.vector(cont_table)
expected <- chi_sq_result$expected
Output:
As we saw our chi square test shows high discrepancy and our graph shows wide
difference between the expected and observed frequencies too.
# Define age groups and smartphone brands
age_groups <- c("Teenager", "Adult", "Senior")
smartphone_brands <- c("Samsung", "Apple", "Xiaomi", "Huawei", "Google")

# Generate a fictional dataset
n <- 1000 # Number of observations
age_sample <- sample(age_groups, n, replace = TRUE)
smartphone_sample <- sample(smartphone_brands, n, replace = TRUE)
Output:
Phi-Coefficient : NA
Contingency Coeff.: 0.104
Cramer's V : 0.074
Output:
data: cont_table_2x2
X-squared = 6.9689, df = 1, p-value = 0.008294
alternative hypothesis: two.sided
95 percent confidence interval:
0.06596955 0.43403045
sample estimates:
prop 1 prop 2
0.75 0.50
We created a 2x2 contingency table where the rows represent treatment outcomes
(success or failure) and the columns represent the two groups (new drug treatment vs.
standard drug treatment).
We used the prop.test() function to perform a Chi-Square test for proportions on this 2x2
table.
Chi-Square Test in R
We will take the survey data in the MASS library which represents the data from a survey
conducted on students.
R
# load the MASS package
library(MASS)
print(str(survey))
Output:
R
# Create a data frame from the main data set.
stu_data = data.frame(survey$Smoke,survey$Exer)
stu_data = table(survey$Smoke,survey$Exer)
print(stu_data)
Output:
And finally we apply the chisq.test() function to the contingency table stu_data.
R
# applying chisq.test() function
print(chisq.test(stu_data))
Output:
data: stu_data
X-squared = 5.4885, df = 6, p-value = 0.4828
As the p-value 0.4828 is greater than the .05, we conclude that the smoking habit is
independent of the exercise level of the student and hence there is a weak or no
correlation between the two variables. The complete R code is given below.
So, in summary, it can be said that it is very easy to perform a Chi-square test using R.
One can perform this task using chisq.test() function in R.
library(MASS)
print(str(survey))
stu_data = table(survey$Smoke, survey$Exer)
print(stu_data)
chi_result <- chisq.test(stu_data)
print(chi_result)
Output:
Chi-Square Test in R
In this code we use the MASS library to conduct a Chi-Square Test on the ‘survey’ dataset,
focusing on the relationship between smoking habits and exercise levels.
y = β0 + β1x + ϵ
Where:
y is the response (dependent) variable
x is the predictor (independent) variable
β0 is the intercept, β1 is the slope, and ϵ is the error term
Intercept β0: Represents the predicted value of y when x = 0 (often not
meaningful unless x = 0 is within the data range).
y = ax + b
Carry out the experiment of gathering a sample of observed values of height and
corresponding weight.
Find the coefficients from the model created and create the mathematical
equation using these
Get a summary of the relationship model to know the average error in prediction
(these errors are also called residuals).
Input Data
Below is the sample data representing the observations −
# Values of height
151, 174, 138, 186, 128, 136, 179, 163, 152, 131

# Values of weight
63, 81, 56, 91, 47, 57, 76, 72, 62, 48
lm() Function
This function creates the relationship model between the predictor and the response
variable.
Syntax
The basic syntax for lm() function in linear regression is −
lm(formula,data)
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function
relation <- lm(y ~ x)

print(relation)
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
-38.4551 0.6746
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

relation <- lm(y ~ x)

print(summary(relation))
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-6.3002 -1.6629 0.0412 1.8944 3.9775
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -38.45509 8.04901 -4.778 0.00139 **
x 0.67461 0.05191 12.997 1.16e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
predict() Function
Syntax
The basic syntax for predict() in linear regression is −
predict(object, newdata)
newdata is the data frame containing the new values for the predictor variable.
Example (predicting the weight of a person with height 170, using the model fitted above):
a <- data.frame(x = 170)
result <- predict(relation, a)
print(result)
Output:
1
76.22869
Histograms in R language
A histogram displays rectangular bars whose areas are proportional to the frequency of a
variable over successive numerical intervals. It is a graphical representation that groups
data points into specified ranges. Unlike a bar graph, it shows no gaps between the bars,
though it otherwise resembles a vertical bar chart.
R – Histograms
We can create histograms in R Programming Language using the hist() function.
Syntax: hist(v, main, xlab, xlim, ylim, breaks, col, border)
R
# Create data for the graph.
v <- c(19, 23, 11, 5, 16, 21, 32, 14, 19, 27, 39)

# Draw the histogram
hist(v)
Output:
1. We can use the xlim and ylim parameters in X-axis and Y-axis.
Example
R
# Create data for the graph.
v <- c(19, 23, 11, 5, 16, 21, 32, 14, 19, 27, 39)

# Use xlim and ylim to set the axis ranges
hist(v, xlim = c(0, 40), ylim = c(0, 5))
Output:
Output:
Example
R
# Creating data for the graph.
v <- c(19, 23, 11, 5, 16, 21, 32, 14, 19, 27, 39, 120, 40, 70, 90)

# Draw the histogram
hist(v)
Output:
R uses the barplot() function to create bar charts. Both vertical and horizontal bars
can be drawn.
Syntax:
Parameters:
H: This parameter is a vector or matrix containing numeric values which are used in
bar chart.
names.arg: This parameter is a vector of names appearing under each bar in bar
chart.
col: This parameter is used to give colors to the bars in the graph.
R
# Create the data for the chart
Output:
1. Take all parameters which are required to make a simple bar chart.
barplot(A, horiz=TRUE )
R
# Create the data for the chart
Output:
2. X-axis and Y-axis can be labeled in bar chart. To add the label in bar chart.
barplot( A, col=color_name)
Implementations
R
# Create the data for the chart
Output:
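Putting the pieces above together, a complete bar chart might look like the sketch below (the data values, labels, and colours are hypothetical):

```r
# Create the data for the chart (hypothetical values)
A <- c(17, 32, 8, 53, 26)
labels <- c("Mon", "Tue", "Wed", "Thu", "Fri")

# Vertical bar chart with axis labels and coloured bars
barplot(A, names.arg = labels, col = "steelblue",
        xlab = "Day", ylab = "Count", main = "Simple Bar Chart")

# Horizontal variant
barplot(A, names.arg = labels, horiz = TRUE, col = "orange")
```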
R – GeeksforGeeks-Article chart
Output:
cex.main , cex.lab , and cex.axis : These arguments control the font size of the chart
title, x-axis label, and y-axis label, respectively. They are set to 1.5, 1.2, and 1.1 to
increase the font size for better readability.
text() : We use the text() function to add data labels on top of each bar.
The x argument specifies the x-coordinates of the labels (same as the barplot() x-
coordinates), and the y argument places each label slightly above its bar (A + 1).
1. Take a vector value and make it matrix M which to be grouped or stacked. Making of
matrix can be done by.
Output:
R – Total Revenue
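The matrix step described above, together with the stacked and grouped variants, can be sketched as follows (the revenue figures are hypothetical):

```r
# Revenue for two products over four quarters, as a 2 x 4 matrix
M <- matrix(c(10, 20, 30, 40,
              15, 25, 35, 45), nrow = 2, byrow = TRUE)
colnames(M) <- c("Q1", "Q2", "Q3", "Q4")

# Stacked bars (the default) with a legend
barplot(M, col = c("steelblue", "orange"),
        legend.text = c("Product A", "Product B"), main = "Total Revenue")

# Grouped (side-by-side) bars
barplot(M, beside = TRUE, col = c("steelblue", "orange"))
```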
Boxplots in R Language
A box graph is a chart used to display the distribution of data by drawing a boxplot for
each group. The distribution is summarized by five numbers: the minimum, first quartile,
median, third quartile, and maximum.
varwidth: This parameter is a logical value. Set it to TRUE to draw the width of each box
proportional to the sample size.
names: This parameter gives the group labels that will be shown under each boxplot.
Creating a Dataset
We use the data set “mtcars”.
R
input <- mtcars[, c('mpg', 'cyl')]
print(head(input))
R
data(mtcars)

# Boxplot of displacement grouped by number of gears
boxplot(disp ~ gear, data = mtcars,
xlab = "Number of Gears",
ylab = "Displacement")
UCI Machine Learning Repository: Known for datasets like the Iris, Wine, and Heart
Disease datasets, which are commonly used in academic research and training.
Government Websites: Datasets from sources like data.gov (U.S.), data.gov.uk (UK),
and World Bank provide reliable and extensive datasets on topics like public health,
economics, and social statistics.
Choosing Criteria:
Domain Relevance: Pick datasets related to your field of interest (e.g., finance,
healthcare, social sciences).
Size and Complexity: Start with smaller datasets and move to larger, more complex
datasets as you advance.
Shape and Structure: Use functions like head() , tail() , str() , and summary() (in R) to
understand dataset dimensions, column names, and data types.
Identify Missing Data: Determine which columns have missing values and how
many. Missing data can lead to biased results if not addressed.
Strategies:
Imputation: Replace missing values with the mean, median, or mode (for
numeric variables), or with the most common category (for categorical
variables).
Outliers can skew analysis. Identify them through methods like IQR (interquartile
range) or z-score analysis.
Handling Strategies:
Capping: Set a reasonable threshold and cap values that exceed it.
4. Data Transformation:
Ensure that all variables are on a comparable scale if they’ll be used together.
Standardization (z-scores) or normalization (scaling between 0 and 1) are
common transformations.
B. Descriptive Statistics
Descriptive statistics help summarize data, providing a clear snapshot of key features.
1. Measures of Central Tendency: Mean, median, and mode provide insights into the
central values in the data.
3. Visualizing Data:
Box Plots: Useful for spotting outliers and comparing distributions across
categories.
C. Inferential Statistics
Once you have a solid understanding of the data, inferential statistics can help you draw
conclusions and make predictions.
T-tests: Compare means across two groups (e.g., male vs. female test scores).
Logistic Regression: Used when the outcome variable is binary (e.g., pass/fail).
Effect Size: Consider the strength of relationships and effect sizes, not just
statistical significance.
Practical Significance: Ask whether the results are meaningful and actionable in
real-world terms.
Use graphs, tables, and summaries to make complex results accessible. Bar
charts, line graphs, and heatmaps are popular choices depending on the data.
In reports, summarize findings succinctly, highlight key statistics, and discuss any
limitations (like sample size or biases).
# Load dataset
data <- read.csv("your_dataset.csv")
# Data Transformation
data$scaled_column <- scale(data$column) # Standardization
# Descriptive Statistics
mean(data$column)
sd(data$column)
# Visualizations
hist(data$column, main = "Histogram of Column", xlab = "Column Values")
plot(data$x, data$y, main = "Scatter Plot", xlab = "X", ylab = "Y")
# Inferential Statistics
# T-test
t.test(data$group1, data$group2)
# Linear Regression
model <- lm(y ~ x, data = data)
summary(model) # Model summary
# Prediction
predict(model, newdata = data.frame(x = c(5, 10)))
Using this framework, you can approach any dataset with confidence, knowing each step
builds toward a comprehensive understanding and actionable insights.