Statistics Lab
Rodolfo Metulini
IMT Institute for Advanced Studies, Lucca, Italy
Lesson 3 - Point Estimate, Confidence Interval and Hypothesis
Tests - 16.01.2014
Introduction
Let’s start with empirical data (one variable of length N)
read from an external file, and suppose we consider it to be the
population. We define a sample of size n.
Suppose we do not have information on the population (or, better, we
want to check whether and how well the sample can represent the
population).
In other words, we want to make inference using the
information contained in the sample, in order to obtain an
estimate for the population.
That sample is one of several samples we could randomly draw from
the population (the sample space).
What are the instruments to obtain information about the population?
(1) Sample mean (point estimation) (2) Confidence interval (3)
Hypothesis tests
Sample space
In probability theory, the sample space of an experiment or random
trial is the set of all possible outcomes or results of that
experiment.
It is common to refer to a sample space by the labels S, Ω, or
U.
For example, for tossing two coins, the corresponding sample space
would be {(head,head), (head,tail), (tail,head), (tail,tail)}, so that
the dimension is 4. dim(Ω) = 4. It means that we can obtain 4
different samples with corresponding 4 different sample
means.
In practice, we deal with only one sample, taken at random from
the sample space.
Point estimate
Point estimation permits us to summarize the information contained
in the population (dimension N) through only one value,
constructed using the n sample values.
The most used, unbiased point estimator is the sample mean:
X̂n = (Σi xi) / n
Other point estimators are: (1) Sample Median (2) Sample Mode
(3) Geometric Mean.
Geometric Mean = Mg = (Πi xi)^(1/n) = exp[ (1/n) Σi ln xi ]
An example of what is not an estimator is the
sample mean computed after truncating the sample at a certain
value.
P.S. A naive definition of an estimator: an estimator is
computed using all the n pieces of information in the sample.
Efficient estimators
The BLUE (Best Linear Unbiased Estimator) is defined as
follows:
1. is a linear function of all the sample values
2. is unbiased (E (X̂n ) = θ)
3. has the smallest sample variance among all unbiased
estimators.
The sample mean is BLUE for the parameter µ
Some estimators are biased but consistent: an estimator is
consistent when it becomes unbiased for n −→ ∞
Point estimators - cases
◮ Normal samples: X̂n is the BLUE estimator for the µ parameter
(mean)
◮ Bernoulli samples, f(x) = ρ^x (1 − ρ)^(1−x): X̂n is an unbiased
estimator for the ρ parameter (frequency)
◮ Poisson samples, f(x) = e^(−k) k^x / x!: X̂n is an unbiased
estimator for the k parameter (which represents both the mean and
the variance of the distribution)
◮ Exponential samples, f(x) = λ e^(−λx): 1/X̂n is a (consistent)
estimator for the λ parameter (the density at value 0)
Confidence interval theory
With point estimators we make use of only one value to infer
about population.
With a confidence interval we define a minimum and a maximum
value within which we expect the population parameter to lie.
Formally, we need to calculate:
µ1 = X̂n − z ∗ σ/√n
µ2 = X̂n + z ∗ σ/√n
and we end up with the interval µ̂ = {µ1 ; µ2 }
Here: X̂n is the sample mean; z is the upper (or lower) critical
value of the theoretical distribution. σ is the standard deviation of
the theoretical distribution. n the sample size.
(See the graph)
Confidence interval theory - Gaussian
We will make some assumptions for what we might find in an
experiment and find the resulting confidence interval using a
normal distribution.
Let us assume that the sample mean is 5, the standard deviation in the
population is known and equal to 2, and the sample size is
n = 20. In the example below we use a 95 per cent confidence
level and wish to find the confidence interval.
N.B. Here, since the confidence level is 95 per cent, the z (the critical
value) to consider is the one whose CDF (pnorm) equals 0.975, i.e.
qnorm(0.975).
We also can speak of α = 0.05, or 1 − α = 0.95, or
1 − α/2 = 0.975
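A minimal R sketch of this interval, using the values assumed above (sample mean 5, known sd 2, n = 20):
a <- 5      # sample mean
s <- 2      # known population sd
n <- 20
z <- qnorm(0.975)                      # critical value for 1 - alpha/2 = 0.975
c(a - z*s/sqrt(n), a + z*s/sqrt(n))    # {mu1, mu2}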
Confidence interval theory - T-student
We use the t (Student) distribution when n is small and the population
sd is unknown.
We need to use a sample variance
estimate:
σ̂ = √( Σi (xi − X̂n)² / (n − 1) )
The t (Student) distribution is more spread out.
In simple words, since we do not know the population sd, we need
larger intervals (a cautious approach).
The only difference with normal distribution, is that we use the
command associated with the t-distribution rather than the normal
distribution. Here we repeat the procedures above, but we will
assume that we are working with a sample standard deviation
rather than an exact standard deviation.
N.B. The t distribution is characterized by its degrees of freedom. In
this case the degrees of freedom equal n − 1, because we use one
estimated quantity (one constraint)
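The same sketch with the t quantile in place of the normal one (same assumed values, but with the sd treated as estimated from the sample):
a <- 5; s <- 2; n <- 20
error <- qt(0.975, df = n - 1) * s / sqrt(n)   # t critical value with n - 1 degrees of freedom
c(a - error, a + error)                        # slightly wider than the Gaussian interval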
Confidence interval theory - comparison of two means
In some cases we have an experiment called (for example)
case-control.
Let us imagine the population split in two: one is the
treated group, the other is the non-treated group.
Suppose we extract two samples from them, with the aim of testing whether
the two samples come from populations with the same mean
parameter (is the treatment effective?)
The output of this test will be a confidence interval representing the
difference between the two means.
N.B. Here, the degree of freedom of the t-distribution are equal to
min(n1 , n2 ) − 1
Formulas
◮ Gaussian confidence interval:
µ̂ = {µ1, µ2} = X̂n ± z ∗ σ/√n
◮ t (Student) confidence interval:
µ̂ = {µ1, µ2} = X̂n ± tn−1 ∗ σ̂/√n
◮ t (Student) confidence interval for the difference of two sample means:
µ̂diff = {µdiff1, µdiff2} = (X̂1 − X̂2) ± t ∗ sd,
where sd = √(sd1²/n1 + sd2²/n2) and the degrees of freedom are min(n1, n2) − 1
◮ Gaussian confidence interval for a proportion (Bernoulli
distribution):
ρ̂ = {ρ1, ρ2} = f̂ ± z ∗ sd,
where sd = √(ρ̂(1 − ρ̂)/n)
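A hedged sketch of the proportion interval, with purely illustrative numbers (12 successes out of n = 30; they are not taken from the slides):
n    <- 30
phat <- 12/n                           # sample proportion (illustrative)
se   <- sqrt(phat*(1 - phat)/n)        # estimated sd of the sample proportion
z    <- qnorm(0.975)
c(phat - z*se, phat + z*se)            # {rho1, rho2}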
Hypothesis testing
Researchers retain or reject hypotheses based on measurements of
observed samples.
The decision is often based on a statistical mechanism called
hypothesis testing.
A type I error is the mishap of falsely rejecting a null hypothesis
when the null hypothesis is true.
The probability of committing a type I error is called the
significance level of the hypothesis testing, and is denoted by the
Greek letter α (the same used in the confidence intervals).
We demonstrate the procedure of hypothesis testing in R first with
the intuitive critical value approach.
Then we discuss the popular (and very quick) p-value approach as
an alternative.
Hypothesis testing - lower tail
The null hypothesis of the lower tail test of the population mean
can be expressed as follows:
µ ≥ µ0 ; where µ0 is a hypothesized lower bound of the true
population mean µ.
Let us define the test statistic z in terms of the sample mean, the
sample size and the population standard deviation σ:
z = (X̂n − µ0) / (σ/√n)
Then the null hypothesis of the lower tail test is to be rejected if
z ≤ zα , where zα is the 100(α) percentile of the standard normal
distribution.
Hypothesis testing - upper tail
The null hypothesis of the upper tail test of the population mean
can be expressed as follows:
µ ≤ µ0 ; where µ0 is a hypothesized upper bound of the true
population mean µ.
Let us define the test statistic z in terms of the sample mean, the
sample size and the population standard deviation σ:
z = (X̂n − µ0) / (σ/√n)
Then the null hypothesis of the upper tail test is to be rejected if
z ≥ z1−α , where z1−α is the 100(1 − α) percentile of the
standard normal distribution.
Hypothesis testing - two tailed
The null hypothesis of the two-tailed test of the population mean
can be expressed as follows:
µ = µ0 ; where µ0 is a hypothesized value of the true population
mean µ. Let us define the test statistic z in terms of the sample
mean, the sample size and the population standard deviation
σ:
z = (X̂n − µ0) / (σ/√n)
Then the null hypothesis of the two-tailed test is to be rejected if
z ≤ zα/2 or z ≥ z1−α/2 , where zα/2 is the 100(α/2) percentile of
the standard normal distribution.
Hypothesis testing - lower tail with unknown variance
The null hypothesis of the lower tail test of the population mean
can be expressed as follows:
µ ≥ µ0 ; where µ0 is a hypothesized lower bound of the true
population mean µ.
Let us define the test statistic t in terms of the sample mean, the
sample size and the sample standard deviation σ̂:
t = (X̂n − µ0) / (σ̂/√n)
Then the null hypothesis of the lower tail test is to be rejected if
t ≤ tα , where tα is the 100(α) percentile of the Student t
distribution with n − 1 degrees of freedom.
Hypothesis testing - upper tail with unknown variance
The null hypothesis of the upper tail test of the population mean
can be expressed as follows:
µ ≤ µ0 ; where µ0 is a hypothesized upper bound of the true
population mean µ.
Let us define the test statistic t in terms of the sample mean, the
sample size and the sample standard deviation σ̂:
t = (X̂n − µ0) / (σ̂/√n)
Then the null hypothesis of the upper tail test is to be rejected if
t ≥ t1−α , where t1−α is the 100(1 − α) percentile of the Student
t distribution with n − 1 degrees of freedom.
Hypothesis testing - two tailed with unknown variance
The null hypothesis of the two-tailed test of the population mean
can be expressed as follows:
µ = µ0 ; where µ0 is a hypothesized value of the true population
mean µ. Let us define the test statistic t in terms of the sample
mean, the sample size and the sample standard deviation σ̂:
t = (X̂n − µ0) / (σ̂/√n)
Then the null hypothesis of the two-tailed test is to be rejected if
t ≤ tα/2 or t ≥ t1−α/2 , where tα/2 is the 100(α/2) percentile of
the Student t distribution with n − 1 degrees of freedom.
Lower Tail Test of Population Proportion
The null hypothesis of the lower tail test about population
proportion can be expressed as follows:
ρ ≥ ρ0 ; where ρ0 is a hypothesized lower bound of the true
population proportion ρ.
Let us define the test statistic z in terms of the sample proportion
and the sample size:
z = (ρ̂ − ρ0) / √(ρ0(1 − ρ0)/n)
Then the null hypothesis of the lower tail test is to be rejected if
z ≤ zα , where zα is the 100(α) percentile of the standard normal
distribution.
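A sketch of the critical-value computation in R; the numbers (rho0 = 0.6, sample proportion 0.52, n = 100) are illustrative, not from the slides:
pbar <- 0.52; p0 <- 0.6; n <- 100      # illustrative values
z <- (pbar - p0)/sqrt(p0*(1 - p0)/n)   # test statistic
alpha <- 0.05
z.alpha <- qnorm(1 - alpha)
z <= -z.alpha                          # reject H0 if this is TRUE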
Upper Tail Test of Population Proportion
The null hypothesis of the upper tail test about population
proportion can be expressed as follows:
ρ ≤ ρ0 ; where ρ0 is a hypothesized upper bound of the true
population proportion ρ.
Let us define the test statistic z in terms of the sample proportion
and the sample size:
z = (ρ̂ − ρ0) / √(ρ0(1 − ρ0)/n)
Then the null hypothesis of the upper tail test is to be rejected if
z ≥ z1−α , where z1−α is the 100(1 − α) percentile of the standard
normal distribution.
Two Tailed Test of Population Proportion
The null hypothesis of the two-tailed test about the population
proportion can be expressed as follows:
ρ = ρ0 ; where ρ0 is a hypothesized value of the true population
proportion.
Let us define the test statistic z in terms of the sample proportion
and the sample size:
z = (ρ̂ − ρ0) / √(ρ0(1 − ρ0)/n)
Then the null hypothesis of the two-tailed test is to be rejected if
z ≤ zα/2 or z ≥ z1−α/2
Sample size definition
The quality of a sample survey can be improved (worsened) by
increasing (decreasing) the sample size.
The formulas below provide the sample size needed for a population
(mean or proportion) interval estimate at the (1 − α)
confidence level, with margin of error E and a planned parameter
estimate.
Here, z1−α/2 is the 100(1 − α/2) percentile of the standard normal
distribution.
◮ For the mean: n = (z1−α/2)² ∗ σ² / E²
◮ For the proportion: n = (z1−α/2)² ∗ ρ(1 − ρ) / E²
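A sketch of both formulas in R; sigma, E and the planned proportion below are illustrative placeholders:
z <- qnorm(0.975)                 # 95 per cent confidence level
sigma <- 10; E <- 1.5             # illustrative values
z^2 * sigma^2 / E^2               # required n for the mean (round up)
p <- 0.5; E <- 0.05
z^2 * p*(1 - p) / E^2             # required n for the proportion (round up)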
Sample size definition - Exercises
◮ Mean: Assume the population standard deviation σ of the
student height in the survey is 9.48. Find the sample size needed
to achieve a 1.2 centimeters margin of error at the 95 per cent
confidence level.
Since there are two tails of the normal distribution, the 95 per
cent confidence level implies the 97.5th percentile of the
normal distribution at the upper tail. Therefore, z1−α/2 is
given by qnorm(.975).
◮ Proportion: Using a 50 per cent planned proportion estimate,
find the sample size needed to achieve a 5 per cent margin of
error for the female student survey at the 95 per cent confidence
level.
Since there are two tails of the normal distribution, the 95 per
cent confidence level implies the 97.5th percentile of the
normal distribution at the upper tail. Therefore, z1−α/2 is
given by qnorm(.975).
Homeworks
1: Confidence interval for the proportion. Suppose we have a
sample of n = 25 births, 15 of which are female. Define the
interval (at the 99 per cent level) for the proportion of females in the
population. HINT: apply, with the proper functions in R, the
formula in slide 11.
2: Hypothesis test to compare two proportions. Suppose we have
two schools. Sampling from the first, n = 20 and the Hispanic
students are 8. Sampling from the second, n = 18 and the Hispanic
students are 4. Can we state (at the 95 per cent level) that the frequency of
Hispanics is the same in the two schools? N.B.: the test here is
two-tailed.
The hypothesis test here is:
z = (ρ̂1 − ρ̂2)/sd, where sd = √( ρ(1 − ρ) [1/n1 + 1/n2] ) and
ρ = (ρ̂1 ∗ n1 + ρ̂2 ∗ n2)/(n1 + n2)
Charts - 1
Figure: Representation of the critical point for the upper tail hypothesis
test
Charts - 2
Figure: Representation of the critical point for the lower tail hypothesis
test
Charts - 3
Figure: Representation of the critical point for the two-tailed hypothesis
test
Charts - 4
Figure: Type I and Type II errors in hypothesis testing
Statistics Lab
Rodolfo Metulini
IMT Institute for Advanced Studies, Lucca, Italy
Lesson 5 - Introduction to Bootstrap and Introduction to
Markov Chains - 23.01.2014
Introduction
Let’s assume, for a moment, the CLT:
If random samples of n observations y1, y2, ..., yn are drawn from a
population with mean µ and sd σ, then, for n sufficiently large, the
sampling distribution of the sample mean can be approximated by
a normal density with mean µ and variance σ²/n.
◮ Averages taken from any distribution will have a normal
distribution
◮ The standard deviation decreases as the number of
observations increases
But... nobody tells us exactly how big the sample has to be.
Why Bootstrap?
Sometimes we can not make use of the CLT, because:
1. Nobody tells us exactly how big the sample has to be
2. The sample can be really small.
Therefore, we are not encouraged to make any distributional
assumption. We just have the data, and we let the raw data
speak.
The bootstrap method attempts to determine the probability
distribution from the data itself, without recourse to CLT.
N.B. The bootstrap method is not a way of reducing the error! It
only tries to estimate it.
Basic Idea of Bootstrap
Use the original sample as the population, and draw samples
from the original sample (these are the bootstrap samples).
Then define the estimator using the bootstrap samples.
Figure: Real World versus Bootstrap World
Structure of Bootstrap
1. Originally, from a list of data (the sample), one computes a
statistic (an estimate)
2. Create an artificial list of data (a new sample) by randomly
drawing elements from the original list
3. Compute a new statistic (estimate) from the new sample
4. Repeat steps 2) and 3), say, 1000 times and look at
the distribution of these 1000 statistics.
Sample mean
Suppose we extracted a sample x = (x1 , x2 , ..., xn ) from the
population X . Let’s say the sample size is small: n = 10.
We can compute the sample mean x̂ using the values of the
sample x. But, since n is small, the CLT does not hold, so we
cannot say anything about the sample mean distribution.
APPROACH: We extract M samples (or sub-samples) of dimension
n from the sample x (with replacement).
We can define the bootstrap sample means xbi, ∀i = 1, ..., M. These
become the new sample, of dimension M.
Bootstrap sample mean:
Mb(X) = Σi xbi / M
Bootstrap sample variance:
Vb(X) = Σi (xbi − Mb(X))² / (M − 1)
Bootstrap Confidence interval with variance
estimation
Let’s take a random sample of size 25 from a normal distribution
with mean 10 and standard deviation 3.
We can consider the sampling distribution of the sample mean.
From that, we estimate the intervals.
The bootstrap estimates the standard error by resampling the data in
our original sample. Instead of repeatedly drawing samples
from the population, we will repeatedly draw new samples of
the same size from our original sample, resampling with
replacement.
We can estimate the standard error of the sample mean using the
standard deviation of the bootstrapped sample means.
Confidence interval with quantiles
Suppose we have a sample of data from an exponential distribution
with parameter λ:
f (x|λ) = λe −λx (remember the estimation of λ̂ = 1/x̂n ).
An alternative solution to the use of bootstrap estimated standard
errors (the estimation of the sd from an exponential is not
straightforward) is the use of bootstrap quantiles.
We can obtain M bootstrap estimates λ̂b and define q ∗ (α) the α
quantile of the bootstrap distribution.
The new bootstrap confidence interval for λ will be:
[2 ∗ λ̂ − q ∗ (1 − α/2); 2 ∗ λ̂ − q ∗ (α/2)]
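A hedged sketch of this quantile interval; the exponential sample and the 95 per cent level are illustrative:
set.seed(2)
x <- rexp(30, rate = 0.25)                   # illustrative exponential sample
lambda.hat <- 1/mean(x)                      # point estimate of lambda
M <- 1000; lb <- rep(0, M)
for (i in 1:M) lb[i] <- 1/mean(sample(x, replace = TRUE))
q <- quantile(lb, c(0.025, 0.975))           # bootstrap quantiles q*(alpha/2), q*(1 - alpha/2)
c(2*lambda.hat - q[2], 2*lambda.hat - q[1])  # the interval from the formula above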
Regression model coefficient estimate with Bootstrap
Now we will consider the situation where we have data on two
variables. This is the type of data that arises in a linear regression
situation. It doesn't make sense to bootstrap the two variables
separately, so they must remain linked when bootstrapped.
For example, if our original data contains the observations (1,3),
(2,6), (4,3), and (6, 2), we re-sample this original sample in
pairs.
Recall that the linear regression model is: y = β0 + β1 x
We are going to construct a bootstrap interval for the slope
coefficient β1
1. Draw M bootstrap samples
2. Compute the OLS β1 coefficient for each bootstrap sample
3. Define the bootstrap quantiles, and use the 0.025 and the
0.975 to define the confidence interval for β1
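A sketch of this pairs bootstrap; the data frame dat (columns x and y) is an illustrative placeholder, not the course data:
set.seed(3)
dat <- data.frame(x = 1:30, y = 2 + 0.5*(1:30) + rnorm(30))   # illustrative data
M <- 1000; b1 <- rep(0, M)
for (i in 1:M) {
  idx <- sample(nrow(dat), replace = TRUE)        # resample the (x, y) pairs together
  b1[i] <- coef(lm(y ~ x, data = dat[idx, ]))[2]  # OLS slope on the bootstrap sample
}
quantile(b1, c(0.025, 0.975))                     # bootstrap confidence interval for beta1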
Regression model coefficient estimate with Bootstrap:
sampling the residuals
An alternative solution for the regression coefficient is a two-stage
method in which:
1. You fit the regression on the original data and draw M bootstrap
samples of the estimated residuals (each of dimension n)
2. You add each draw of resampled residuals to the fitted dependent
variable, re-estimate the model, and so define M bootstrapped β1
The method consists in using the 0.025 and the 0.975 quantiles of
bootstrapped β1 to define the confidence interval.
References
Efron, B., Tibshirani, R. (1993). An introduction to the
bootstrap (Vol. 57). CRC press
Figure: Efron and Tibshirani foundational book
Routines in R
1. boot, by Brian Ripley.
Functions and datasets for bootstrapping from the book
Bootstrap Methods and Their Applications by A. C. Davison
and D. V. Hinkley (1997, CUP).
2. bootstrap, by Rob Tibshirani.
Software (bootstrap, cross-validation, jackknife) and data for
the book An Introduction to the Bootstrap by B. Efron and
R. Tibshirani, 1993, Chapman and Hall
Markov Chain
Markov Chains are an important concept in probability and many other
areas of research.
They are used to model the probability of being in a certain state
in a certain period, given the state in the previous period.
Weather example: what is the probability that the state
tomorrow will be sunny, given that today is rainy?
The main properties of Markov Chain processes are:
◮ Memory of the process (usually the memory is fixed to 1)
◮ Stationarity of the distribution
Chart 1
A picture of an easy example of markov chain with 2 possible
states and transition probabilities.
Figure: An example of 2 states markov chain
Notation
We define a stochastic process {Xn , n = 0, 1, 2, ...} that takes on a
finite or countable number of possible values.
Let the possible values be non negative integers (i.e.Xn ∈ Z+ ). If
Xn = i, then the process is said to be in state i at time n.
The Markov process (in discrete time) is defined as follows:
Pij = P[Xn+1 = j|Xn = in , Xn−1 = in−1 , ..., X0 = i0 ] = P[Xn+1 =
j|Xn = in ], ∀i, j ∈ Z+
We call Pij a 1-step transition probability because we moved from
time n to time n + 1.
It is a first order Markov Chain (memory = 1) because the
probability of being in state j at time (n + 1) only depends on the
state at time n.
Notation - 2
The n-step transition probability is
Pⁿij = P[Xn+k = j | Xk = i], ∀n ≥ 0, i, j ≥ 0
The Chapman-Kolmogorov equations allow us to compute these
n-step transition probabilities. They state that:
Pⁿ⁺ᵐij = Σk Pⁿik Pᵐkj, ∀n, m ≥ 0, ∀i, j ≥ 0
N.B. Basic probability properties:
1. Pij ≥ 0, ∀i, j ≥ 0
2. Σj≥0 Pij = 1, i = 0, 1, 2, ...
Example: conditional probability
Consider two states: 0 = rain and 1 = no rain.
Define two probabilities:
α = P00 = P[Xn+1 = 0 | Xn = 0], the probability it will rain
tomorrow given it rained today;
β = P10 = P[Xn+1 = 0 | Xn = 1], the probability it will rain
tomorrow given it did not rain today.
What is the probability it will rain the day after tomorrow given it
rained today?
The transition probability matrix (rows: today rain / today no rain) is:
P = [α, 1 − α; β, 1 − β]
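A base R sketch of this two-state chain; alpha and beta are illustrative values, not from the slides:
alpha <- 0.7; beta <- 0.4                                   # illustrative values
P  <- matrix(c(alpha, beta, 1 - alpha, 1 - beta), nrow = 2) # rows: from rain / from no rain
P2 <- P %*% P                                               # 2-step transition probabilities
P2[1, 1]                                                    # P[rain the day after tomorrow | rain today]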
Example: unconditional probability
What is the unconditional probability it will rain the day after
tomorrow?
We need to define the unconditional (or marginal) distribution of the
state at time n:
P[Xn = j] = Σi P[Xn = j | X0 = i] P[X0 = i] = Σi Pⁿij ∗ αi,
where αi = P[X0 = i], ∀i ≥ 0,
and P[Xn = j | X0 = i] is the conditional probability computed just
before.
Stationary distributions
A stationary distribution π is the probability distribution such that,
when the Markov chain reaches the stationary distribution, it
remains in that distribution forever.
It means we are asking this question: What is the probability to be
in a particular state in the long-run?
Let’s define πj as the limiting probability that the process will be in
state j at time n, or
πj = limn→∞ Pnij
Using Fubini’s theorem, we can define the stationary distribution
as:
πj = Σi Pij πi
Example: stationary distribution
Back to our example.
We can compute the 2-step, 3-step, ..., n-step transition
distributions, and look at WHEN they reach
convergence.
An alternative method to compute the stationary
distribution consists in using this simple closed form:
π0 = β / (1 − α + β)
π1 = (1 − α) / (1 − α + β)
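A sketch that checks the convergence numerically and compares it with the closed form; alpha and beta are the same illustrative values used in the sketch above:
alpha <- 0.7; beta <- 0.4
P  <- matrix(c(alpha, beta, 1 - alpha, 1 - beta), nrow = 2)
Pn <- P
for (k in 1:50) Pn <- Pn %*% P              # high power of P: each row converges to the stationary distribution
Pn[1, ]
c(beta, 1 - alpha) / (1 - alpha + beta)     # closed-form pi0, pi1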
References
Ross, S. M. (2006). Introduction to probability models. Access
Online via Elsevier.
Figure: Cover of the 10th edition
Routines in R
◮ markovchain, by Giorgio Alfredo Spedicato.
A package for easily handling discrete Markov chains.
◮ MCMCpack, by Andrew D. Martin, Kevin M. Quinn, and
Jong Hee Park.
Performs Monte Carlo simulations based on the Markov Chain
approach.
Statistics Lab
Rodolfo Metulini
IMT Institute for Advanced Studies, Lucca, Italy
Lesson 2 - Application to the Central Limit Theorem - 14.01.2014
Introduction
Modern statistics was built and developed around the normal
distribution.
The academic world likes to say that, if the empirical distribution is
normal (or approximately normal), everything works well. This
depends mainly on the sample size.
That said, it is important to understand in which circumstances we
can state that the distribution is normal.
Two founding statistical theorems are helpful: The Central Limit
Theorem and The Law of Large Numbers.
The Law of Large Numbers (LLN)
Suppose we have a random variable X with expected value
E (X ) = µ.
We extract n observation from X (say {x = x1 , x2 , ..., xn }).
If we define X̂n = (Σi xi)/n = (x1 + x2 + ... + xn)/n, the LLN states that, for
n −→ ∞,
X̂n −→ µ
The Central Limit Theorem (CLT)
Suppose we have a random variable X with expected value
E(X) = µ and V(X) = σ².
We extract n observations from X (say {x = x1, x2, ..., xn}).
Let us define X̂n = (Σi xi)/n = (x1 + x2 + ... + xn)/n.
X̂n is distributed with expected value µ and variance σ²/n.
When n −→ ∞ (in practice, n > 30),
X̂n ∼ N(µ, σ²/n), whatever the distribution of X may be.
N.B. If X is normally distributed, X̂n ∼ N(µ, σ²/n) even if
n < 30
CLT: Empiricals
To better understand the CLT, it is recommended to examine the
theorem empirically and step by step, introducing some new
commands of the R programming language along the way.
In the first part, we will show how to draw and visualize a sample
of random numbers from a distribution.
Then, we will examine the mean and standard deviation of the
sample, then the distribution of the sample means.
Drawing random numbers - 1
We already introduced the use of the letters d, p and q in relations
to the various distributions (e.g. normal, uniform, exponential). A
reminder of their use follows:
◮ d is for density: it is used to find values of the probability
density function.
◮ p is for probability: it is used to find the probability that the
random variable lies to the left of a given number.
◮ q is for quantile: it is used to find the quantiles of a given
distribution.
There is a fourth letter, namely r, used to draw random numbers
from a distribution. For example runif and rexp would be used to
draw random numbers from the uniform and exponential
distributions, respectively.
Drawing random numbers - 2
Let us use the rnorm command to draw 500 numbers at random from
a normal distribution having mean 100 and standard deviation (sd)
10.
> x = rnorm(500, mean=100, sd=10)
The result, typing x in the R console, is a list of 500 numbers
extracted at random from a normal distribution with mean 100 and
sd 10.
When you examine the numbers stored in the variable x, there is
a sense that you are pulling random numbers that are clumped
about a mean of 100. However, a histogram of this selection
provides a different picture of the data stored.
> hist(x,prob=TRUE)
Drawing random numbers - Comments
Several comments are in order regarding the histogram in the
figure.
1. The histogram is approximately normal in shape.
2. The balance point of the histogram appears to be located
near 100, suggesting that the random numbers were drawn
from a distribution having mean 100.
3. Almost all of the values are within 3 increments of 10 from
the mean, suggesting that random numbers were drawn from
a normal distribution having standard deviation 10.
Drawing random numbers - a new drawing
Lets try the experiment again, drawing a new set of 500 random
numbers from the normal distribution having mean 100 and
standard deviation 10:
> x = rnorm(500, mean = 100, sd = 10)
> hist(x, prob = TRUE , ylim = c(0, 0.04))
Take a look at the histogram... It is different from the first one;
however, it shares some common traits: (1) it appears normal in
shape; (2) it appears to be balanced around 100; (3) all values
appear to occur within 3 increments of 10 of the mean.
This is strong evidence that the random numbers have been
drawn from a normal distribution having mean 100 and sd 10. We
can support this claim by superimposing a normal density
curve:
> curve(dnorm(x, mean = 100, sd = 10), 70, 130, add = TRUE, lwd = 2, col = "red")
The curve command
The curve command is new. Some comments on its use
follow:
1. In its simplest form, the syntax curve(f(x), from =, to =)
draws the function defined by f(x) on the interval (from, to).
Our function is dnorm(x, mean = 100, sd = 10). The curve
command sketches this function of X on the interval
(from,to).
2. The notation from = and to = may be omitted if the
arguments are in the proper order to the curve command:
function first, value of from second, value of to third. That is
what we have done.
3. If the argument add is set to TRUE, then the curve is added
to the existing figure. If the argument is omitted (or FALSE),
then a new plot is drawn, erasing the previous graph.
The distribution of X̂n (sample mean)
In our prevous example we drew 500 random numbers from a
normal distribution with mean 100 and standard deviation 10. This
leads to ONE sample of n = 500. Now the question is: what is
the mean of our sample?
> mean(x)
[1]100.14132
If we take another sample of 500 random numbers from the SAME
distribution, we get a new sample with different mean.
> x = rnorm(500, mean = 100, sd = 10)
mean(x)
[1]100.07884
What happens if we draw a sample several times?
Producing a vector of sample means
We will repeatedly sample from the normal distribution. Each of
the 500 samples will select 5 random numbers (instead of 500)
from the normal distrib. having mean 100 and sd 10. We will then
find the mean of those samples.
We begin by declaring the mean and the standard deviation. Then,
we declare the sample size.
> mu = 100; sigma = 10
> n = 5
We need some place to store the means of the samples. We initialize
a vector xbar to initially contain 500 zeros.
> xbar = rep(0, 500)
Producing a vector of sample means - cycle for
It is easy to draw a sample of size n = 5 from the normal
distribution having mean µ = 100 and standard deviation σ = 10.
We simply issue the command
rnorm(n, mean = mu, sd = sigma).
To find the mean of this result, we simply add the
adjustment
mean(rnorm(n, mean = mu, sd = sigma)).
The final step is to store this result in the vector xbar. Then we
must repeat this same process an additional 499 times. This
requires the use of a for loop.
> for (i in 1:500) {xbar[i] = mean(rnorm(n, mean = mu, sd = sigma))}
Cycle for
◮ The i in for (i in 1:500) is called the index of the for loop.
◮ The index i is first set equal to 1, then the body of the for
loop is executed. On the next iteration, i is set equal to 2 and
the body of the loop is executed again. The loop continues in
this manner, incrementing by 1, finally setting the index i to
500. After executing the last iteration, the for cycle terminates.
◮ In the body of the for loop, we have
xbar[i] = mean(rnorm(n, mean = mu, sd = sigma)). This draws a
sample of size 5 from the normal distribution, calculates the
mean of the sample, and stores the result in xbar[i].
◮ When the for loop completes 500 iterations, the vector xbar
contains the means of 500 samples of size 5 drawn from the
normal distribution having µ = 100 and σ = 10
> hist(xbar, prob = TRUE, breaks = 12, xlim = c(70, 130), ylim = c(0, 0.1))
Distribution of X̂n - observations
1. The previous histograms described the shape of 500 randomly
selected numbers; here, the histogram describes the distribution
of 500 different sample means, each of which is found by
selecting n = 5 random numbers from the
normal distribution.
2. The distribution of xbar appears normal in shape. This is so
even though the sample size is relatively small ( n = 5).
3. It appears that the balance point occurs near 100. This can
be checked with the following command:
> mean(xbar )
That is the mean of the sample means, which is almost equal to
the mean of the drawn random numbers.
4. The distribution of the sample means appears to be narrower
than the random number distribution.
Increasing the sample size
Let's repeat the last experiment, but this time let's draw samples
of size n = 10 from the same distribution (µ = 100, σ = 10)
> mu = 100; sigma = 10
> n = 10
> xbar = rep(0, 500)
> for (i in 1:500) {xbar[i] = mean(rnorm(n, mean = mu, sd = sigma))}
> hist(xbar, prob = TRUE, breaks = 12, xlim = c(70, 130), ylim = c(0, 0.1))
The histogram produced is even narrower than the one obtained with
n = 5
Key Ideas
1. When we select samples from a normal distribution, then the
distribution of sample means is also normal in shape
2. The mean of the distribution of sample means appears to be
the same as the mean of the random numbers
(parent population) (compare the balance points)
3. By increasing the size of our samples, the histograms
become narrower. In fact, we would expect a more accurate
estimate of the mean of the parent population if we take the
mean from a larger sample size.
4. Imagine drawing sample means from samples of n = ∞. The
histogram would be exactly concentrated (P = 1) at Xbar = µ
Summarise
We finish by restating the CLT:
1. If you draw samples from a normal distribution, then the
distribution of the sample means is also normal
2. The mean of the distribution of the sample means is identical
to the mean of the parent population
3. The higher the sample size that is drawn, the narrower will be
the spread of the distribution of the sample means.
Homeworks
Experiment 1: Draw the Xbar histogram for n = 1000. What is
the histogram's shape?
Experiment 2: Repeat the full experiment drawing random
numbers and sample means from (1) a uniform and (2) a
Poisson distribution. Is the histogram of Xbar normal in shape for
n = 5 and for n = 30?
Experiment 3: Repeat the full experiment using real data instead
of random numbers. (HINT: select samples of dimension n = 5
from the real data, not using rnorm)
Recommended: try to evaluate the agreement of the sample mean
histogram with the normal distribution by means of the qq-plot and
the Shapiro-Wilk test.
Application to Large Number Law
Experiment: toss the coin 100 times.
This experiment is like repeating 100 times a random draw from a
Bernoulli distribution with parameter ρ = 0.5.
We expect to get heads (value = 1) 50 times and tails
(value = 0) 50 times, if the coin is fair.
But, in practice, this does not always happen: repeating the experiment we
obtain a distribution centered at 50, but spread out.
Let us define X̂n as the mean number of heads
across n repetitions of the experiment. For n −→ ∞, X̂n −→ 50
Statistics Lab
Rodolfo Metulini
IMT Institute for Advanced Studies, Lucca, Italy
Introduction to R - 09.01.2014
Getting help with functions
To get more information on any specific named function, for
example solve, the command is
> help(solve)
An alternative is:
> ?solve
Running:
> help.start()
we launch a Web browser that opens the help
home page.
The ?? command allows searching the help pages in a different way. For
example, it is useful for getting help on functions from packages that are not loaded.
objects and saving data
The entities that R creates and manipulates are known as
objects.
During an R session, objects are created and stored by name.
> objects() can be used to display the name of the objects which
are currently stored within R.
> rm() can be used to remove objects.
At the end of each R session, you are given the opportunity to save
all the currently available objects. You can save the objects (the
workspace) in .RData format in the current directory. You also can
save command lines in .Rhistory format.
scalars and vectors: manipulation
To set up a vector named x, namely 1, 2, 3, 4 and 5, let use the R
command:
> x = c(1,2,3,4,5)
or, identically, the assign function could be used.
> assign(”x”, c(1,2,3,4,5))
x is a vector of length 5. To check it we can use the following
function:
> length(x)
>1/x gives the reciprocal of x.
> y = c(x,0,x) would create a vector with 11 entries consisting of
two copies of x with a 0 in the middle.
scalars and vectors: manipulation
Vectors can be used in arithmetic expressions.
Vectors in the same expression need not all be of the same
length. If they are not, the output has the length of the longest
vector in the expression.
For example:
> v = 2*x + y + 1
generates a new vector of length 11 constructed by adding together,
element by element, 2*x repeated 2.2 times, y repeated just once,
and 1 repeated 11 times.
So, WARNING: R computes that kind of expression even if the
lengths do not match (the shorter vectors are recycled).
scalars and vectors: manipulation - 2
In addition, log, exp, sin, cos, tan and sqrt are available and, of
course, the classical arithmetic operators.
min(x) and max(x) select the smallest and the largest element of
the vector.
sum(x) and prod(x) display the sum and the product, respectively,
of the numbers within the vector.
mean(x) calculates the sample (arithmetic) mean, which is the same
as sum(x)/length(x); and var(x) gives the sample variance:
sum((x − mean(x))^2)/(length(x) − 1)
sort(x) returns a vector of the same size as x, with the elements in
increasing order.
seq and rep
There are facilities to generate commonly used sequences of
numbers.
> 1:30 is the vector c(1,2, ..., 29,30)
> 2*1:15 is the vector c(2,4, ..., 28,30) of length 15.
In addition, seq() can be used: seq(2, 10) gives the same result as the
vector 2:10
by =, from =, to = are useful arguments:
> seq(from = 30, to = 1)
> seq(-10, 10, by = 0.5)
rep() can be used for replicating an object.
> rep(x, times=5) > rep(x, each=5)
logical vectors
As well as numerical vectors, R allows manipulation of logical
quantities.
The elements of a logical vector can have the value TRUE, FALSE
and NA (”not available”)
Logical vectors are generated by conditions. Example:
> temp = x > 3
The logical operators are: <, <=, >, >=, == for equality and != for inequality. In
addition, if c1 and c2 are logical expressions, then c1 & c2 is the
intersection ("and"), c1 | c2 is the union ("or"), and !c1 is the
negation of c1
missing Values
In some cases the components of a vector may not be completely
known: in this case we assign the value ”NA”
The function is.na(x) gives a logical vector of the same size as x
with value TRUE if the corresponding element in x is NA. > z =
c(1:3, NA); ind = is.na(z)
There is a second kind of ”missing” values that are produced by
numerical computation, the so-called Not a Number, NaN, values.
Example:
> 0/0
> Inf/Inf
index vectors: subsets of a vector
Subsets of the elements of a vector may be selected by appending to
the name of the vector an index vector in square brackets.
1. A logical vector: Values corresponding to TRUE in the index
vector are selected: > y = x[!is.na(x)]
2. A vector of positive (negative) integer quantities: in this
case the values in the index vector must lie in the set
{1, 2, ..., length(x)}. In the second case the selected values will
be excluded. > x[2:3]; x[-(2:3)]
3. A vector of character strings: this is possible only after
assigning names to the elements.
> cars = c(1,2,3)
> names(cars) = c("ferrari","lamborghini","bugatti")
> pref = cars[c("ferrari","bugatti")]
Objects and attribute
To each object one (and only one) mode is associated (this is
the reason why we call them "atomic").
The objects can be: numeric, logical, complex, character and
raw.
Useful commands: mode(), as.numeric(), is.numeric()
For example, create a numeric vector:
> z = 0:9
change it into character: > digits = as.character(z);
and coerce it back to numeric: > d = as.integer(digits)
d and z are the same!
arrays, matrices and data.frame
Vectors are the most important type of objects in R, but there are
several others. Between the others:
◮ matrix: a multidimensional generalization of the vector
◮ data.frame: a matrix-like structure, in which the columns can be of
different types. It is used when we deal with both
numerical and categorical data.
How do we transform a vector into a matrix?
> v = 1:50
> dim(v) = c(10,5)
arrays, matrices and data.frame (2)
How do we create a matrix from scratch?
> m = array(1:20, dim= c(4,5))
Subsetting a matrix, or replacing a subset of a matrix with zeros?
Let's look at the examples below.
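A few illustrative subsetting operations (a sketch, reusing the matrix m defined above):
m <- array(1:20, dim = c(4, 5))    # the 4 x 5 matrix from the example above
m[2, 3]                            # single element (row 2, column 3)
m[, 2]                             # the whole second column
m[1:2, ]                           # the first two rows
m[m > 15] <- 0                     # replace a subset of the matrix with zeros
m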
matrix manipulation
The operator %*% is used for matrix multiplication.
An n x 1 or 1 x n matrix is also a valid matrix.
If, for example, A and B are square matrices of the same size,
then:
> A * B
is the matrix of element-by-element products (it does not work for
matrices with different dimensions), and
> A %*% t(B)
is the matrix product.
diag(A) returns the elements on the main diagonal of A. ginv(A)
and t(A) return the (generalized) inverse and the transposed matrix.
ginv() requires the MASS package.
lists and data frames
An R list is an object consisting of an ordered collection of objects
known as its components.
Here is a simple example of how to make a list:
> Lst = list(name=”Rodolfo”, surname=”Metulini”, age =
”30”)
It is possible to concatenate two or more lists:
> list.ABC = c(list.A, list.B, list.C)
A data.frame is a list with a specific class ”data.frame”.
We can convert a matrix object in a data.frame objects with the
command as.data.frame(matrix)
The easiest way to create a data.frame object is by means of the
read.table() function.
reading data
Large data objects will usually be read as values from external files
rather than entered during an R session at the keyboard.
There are basically two similar commands to upload data.
1. read.table() (and read.csv()): typically used for .csv files.
2. read.delim(): typically used for tab-delimited .txt files
Useful arguments:
sep = "...": to specify whether values in the dataset are separated by ";", ",",
or are tab delimited.
header = TRUE: to specify that the first row in the dataset contains the
variable names
Moreover, read.dta() (in the foreign package) is used to import data from STATA :)
distributions and co.
One convenient use of R is to provide a comprehensive set of
statistical tables. Functions are provided to evaluate the
cumulative distribution P(X ≤ x), the probability density function
and the quantile function (given q, the smallest x such that
P(X ≤ x) ≥ q), and to simulate from the distribution.
Here, "d" is for the density, "p" (pnorm, punif, pexp, etc.) for
the CDF, "q" for the quantile function, and "r" for
simulation.
Let us empirically examine the distribution of a variable (see the
codes).
covar and concentration indices
The covariance and the correlation measure the degree to which
two variables change together.
The correlation is an index in [-1, 1]; the covariance is not normalized
(its scale depends on the values assumed by the variables).
> Cov = cov(A,B); Cor = cor(A,B)
We can also calculate the correlation between A and B as
follows:
> CorAB = Cov / sqrt(var(A)*var(B))
Gini index: the most popular concentration index; we need to
install the ineq package.
Mode: the most frequent value in the distribution; we need to
install the modeest package (mfv command).
homeworks
◮ For those of us familiar with STATA, let's try to upload a .dta
file with the read.dta() function.
◮ Study the agreement of the eruption data with other distributions
(exponential? uniform? it is up to you).
Statistics Lab
Rodolfo Metulini
IMT Institute for Advanced Studies, Lucca, Italy
Lesson 4 - The linear Regression Model: Theory and
Application - 21.01.2014
Introduction
In the past practicals we analyzed one variable at a time.
For many purposes, it is also useful to analyze two or more
variables together.
The question we want to answer concerns the relations and
the causal effects determining changes in a variable, and whether a
certain phenomenon is endogenous or exogenous.
In symbols, the idea can be represented as follows:
y = f(x1, x2, ...)
Y is the response, which is a function of (depends on) one or more
variables.
Objectives
All in all, the regression model is the instrument used to:
◮ measure the strength of the relation between two or more
variables: △Y / △X,
and the causal direction (△X −→ △Y or
vice versa?)
◮ forecast the value of the variable Y in response to some
changes in the others X1, X2, ... (called explanatories),
or for some cases that are not included in the sample.
Simple linear regression model
The regression model is stochastic, not deterministic.
Given two sets of values (two variables) from a random sample of
length n: x = {x1, x2, ..., xi, ..., xn}; y = {y1, y2, ..., yi, ..., yn}:
Deterministic formula:
yi = β0 + β1 xi, ∀i = 1, ..., n
Stochastic formula:
yi = β0 + β1 xi + εi, ∀i = 1, ..., n
where εi is the stochastic component.
β1 define the slope in the relations between X and Y (See graph in
chart 1)
Simple linear regression model - 2
We need to find β̂ = {β̂0 , β̂1 } as estimators of β0 and β1 .
After β is estimated, we can draw the estimated regression line,
which corresponds to the estimated regression model, as
follow:
ŷi = β̂0 + β̂1 xi
Here, ε̂i = yi − ŷi,
where ŷi is the i-th element of the estimated Y vector and yi is the
i-th element of the observed Y vector. (see graph in chart 2)
Steps in the Analysis
1. Study the relations (scatterplot, correlations) between two or
more variables.
2. Estimation of the parameters of the model β̂ = {β̂0 , β̂1 }.
3. Hypothesis tests on the estimated β̂1 to verify the causal
effect between X and Y
4. Robustness check of the model.
5. Use the model to analyze the causal effect and/or to do
forecasting.
Why linear?
◮ It is simple to estimate, to analyze and to interpret
◮ It likely fits most empirical cases, in which the
relation between two phenomena is linear.
◮ There are many implemented methods to transform variables
in order to obtain a linear relationship (log transformation,
normalization, etc.)
Model Hypotesis
In order for the estimation and the use of the model to be
correct, certain hypotheses must hold:
◮ E(εi) = 0, ∀i −→ E(yi) = β0 + β1 xi
◮ Homoscedasticity: V(εi) = σi² = σ², ∀i
◮ Null covariance: Cov(εi, εj) = 0, ∀i ≠ j
◮ Null covariance between residuals and explanatories:
Cov(xi, εi) = 0, ∀i, since X is deterministic (known)
◮ Normality assumption: εi ∼ N(0, σ²)
Model Hypotesis - 2
From the hypotheses above, it follows that:
◮ V(yi) = σ², ∀i. Y is stochastic only through the ε component.
◮ Cov(yi, yj) = 0, ∀i ≠ j, since the residuals are uncorrelated.
◮ yi ∼ N[(β0 + β1 xi), σ²], since the residuals are also normal in
shape.
Ordinary Least Squares (OLS) Estimation
The OLS is the estimation method used to estimate the vector β.
The idea is to minimize the value of the residuals.
Since ei = yi − ŷi, we are interested in minimizing the component
yi − β̂0 − β̂1 xi.
N.B. εi = yi − β0 − β1 xi, while ei = yi − β̂0 − β̂1 xi
The method consists in minimizing the sum of the squared
differences:
Σi (yi − ŷi)² = Σi ei² = min,
which is equivalent to solving the system of 2 equations obtained using
derivatives.
Ordinary Least Squares (OLS) Estimation - 2
δ/δβ̂0 Σi ei² = 0   (1)
δ/δβ̂1 Σi ei² = 0   (2)
After some arithmetic, we end up with these estimators for the
vector β:
β̂0 = ȳ − β̂1 x̄   (3)
β̂1 = Σi (yi − ȳ)(xi − x̄) / Σi (xi − x̄)²   (4)
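A sketch of formulas (3) and (4) computed directly in R; x and y are illustrative vectors, not the course data:
x <- c(1, 2, 4, 6, 8); y <- c(3, 6, 3, 2, 9)                     # illustrative data
b1 <- sum((y - mean(y))*(x - mean(x))) / sum((x - mean(x))^2)    # slope, formula (4)
b0 <- mean(y) - b1*mean(x)                                       # intercept, formula (3)
c(b0, b1)
coef(lm(y ~ x))                                                  # the same values via lm()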
OLS estimators
◮ The OLS β̂0 and β̂1 are stochastic estimators (they have a
distribution over the sample space of all the possible estimates
defined by different samples)
◮ β̂1 measures the estimated variation in Y determined by a
unitary variation in X (δY/δX)
◮ The OLS estimators are correct (E(β̂1) = β1),
and they are BLUE (correct and with the lowest variance among
unbiased linear estimators)
Linear dependency index (R 2 )
The R² index is the most used index to measure the linear fit
of the model.
R² lies in the interval [0, 1]; values near 1 mean that the explanatories
are useful in describing the changes in Y.
Let us define
SQT = SQR + SQE, or
Σi (yi − ȳ)² = Σi (ŷi − ȳ)² + Σi (yi − ŷi)²
The R² is defined as
R² = Σi (ŷi − ȳ)² / Σi (yi − ȳ)² = SQR/SQT, or, equivalently, 1 − SQE/SQT.
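A sketch of this decomposition in R (the illustrative x and y are the same as in the previous sketch):
x <- c(1, 2, 4, 6, 8); y <- c(3, 6, 3, 2, 9)   # illustrative data
fit  <- lm(y ~ x)
yhat <- fitted(fit)
SQT <- sum((y - mean(y))^2); SQR <- sum((yhat - mean(y))^2); SQE <- sum((y - yhat)^2)
c(SQR/SQT, 1 - SQE/SQT, summary(fit)$r.squared)   # three equal ways to obtain R^2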
Hypotesis testing on β1
The estimated slope parameter β̂1 is stochastic. It is distributed as a
Gaussian:
β̂1 ∼ N[β1, σ²/SSx]
We can use the hypothesis testing approach to investigate the
causal relation between Y and X:
H0: β1 = 0
H1: β1 ≠ 0,
where the alternative hypothesis means a causal relation.
The test is:
z = (β̂1 − β1) / √(σ²/SSx) ∼ N(0, 1),
where SSx = Σi (xi − x̄)².
When σ² is unknown, we estimate it from the residuals and use a
t-test (with n − 2 degrees of freedom, since both β0 and β1 are estimated).
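In practice the test on β1 is read from the regression summary; a minimal sketch with the same illustrative data:
x <- c(1, 2, 4, 6, 8); y <- c(3, 6, 3, 2, 9)   # illustrative data
summary(lm(y ~ x))$coefficients                 # estimate, std. error, t value and p-value for beta1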
Forecasting within the regression model
The question we want to answer is the following: Which is the
expected value of Y (say yn+1 ), for a certain observation that is
not in the sample?.
Suppose we have, for that observation, the value for the variable X
(say xn+1 )
We make use of the estimated β to determine:
ŷn+1 = β̂0 + β̂1 xn+1
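A sketch of this forecast with predict(); the new value x = 10 is illustrative:
x <- c(1, 2, 4, 6, 8); y <- c(3, 6, 3, 2, 9)   # illustrative data
fit <- lm(y ~ x)
predict(fit, newdata = data.frame(x = 10))      # expected y for an x not in the sample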
Model Checking
Several methods are used to test the robustness of the model,
most of them based on the stochastic part of the model: the
estimated residuals.
◮ Graphical checks: plot of residuals versus fitted values
◮ qq-plot for normality
◮ Shapiro-Wilk test for normality
◮ Durbin-Watson test for serial correlation
◮ Breusch-Pagan test for heteroscedasticity
Moreover, the leverage is used to evaluate the importance of each
observation in determining the estimated coefficients β.
The stepwise procedure is used to choose between different model
specifications.
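A hedged sketch of these checks; dwtest() and bptest() are in the lmtest package, which is assumed to be installed:
x <- c(1, 2, 4, 6, 8); y <- c(3, 6, 3, 2, 9)   # illustrative data
fit <- lm(y ~ x)
plot(fitted(fit), resid(fit))                  # residuals versus fitted values
qqnorm(resid(fit)); qqline(resid(fit))         # qq-plot for normality
shapiro.test(resid(fit))                       # Shapiro-Wilk test
library(lmtest)                                # assumed installed
dwtest(fit)                                    # Durbin-Watson test for serial correlation
bptest(fit)                                    # Breusch-Pagan test for heteroscedasticity
hatvalues(fit)                                 # leverage of each observation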
Model Checking using estimated residuals - Linearity
An example of departure from the linearity assumption: we can
draw a curve (not a horizontal line) to interpolate the points
Figure: residuals (Y) versus estimated (X) values
Model Checking using estimated residuals - Homoscedasticity
An example of departure from the homoscedasticity assumption
(the spread of the estimated residuals increases as the predicted values
increase)
Model Checking using estimated residuals - Normality
An example of departure from the normality assumption: the
qq-points do not follow the qq-line
Figure: residuals (Y) versus estimated (X) values
Model Checking using estimated residuals - Serial
correlation
An example of departure from the serial incorrelation assumption:
the residual at i depend on the value at i − 1
Homeworks
1. Using cement data (n = 13), determine the β0 and β1
coefficients manually, using OLS formula at page 11, of the
model y = β0 + β1 x1
2. Using cement data, estimate the R 2 index of the model
y = β0 + β1 x1 , using formula at page 13.
Charts - 1
Figure: Slope coefficient in the linear model
Charts - 2
Figure: Fitted (line) versus real (points) values
#Point Estimate of Population Mean
#For any particular random sample, we can always compute its sample mean.
#Although most often it is not the actual population mean, it does serve as a good point estimate.
#For example, in the data set survey, the survey is performed on a sample of the student population.
#We can compute the sample mean and use it as an estimate of the corresponding population parameter.
##### CHUNK 1
#Problem
#Find a point estimate of mean university student height with the sample data from survey.
#For convenience, we begin with saving the survey data of student heights in a variable height.survey.
library(MASS) # load the MASS package
height.survey = survey$Height
height.survey
summary(height.survey)
str(height.survey)
#It turns out not all students have answered the question, and we must filter out the missing values.
#Hence we apply the mean function with the "na.rm" argument as TRUE.
mean(height.survey, na.rm=TRUE) # skip missing values
#A point estimate of the mean student height is 172.38 centimeters.
##Alternative (consistent) estimator of the mu parameter?
##geometric mean?
height.survey = na.omit(height.survey)
exp(mean(log(height.survey)))
##median?
median(height.survey)
##mode
require(modeest)
mfv(height.survey)
##What is an inconsistent estimator?
##Selecting a subset by truncating the sample at a certain value
height.survey.bias = height.survey[height.survey > 170]
mean(height.survey.bias)
##CHUNK 2
#Point Estimate of Population Proportion
#Multiple choice questionnaires in a survey are often used to determine the proportion of a population with a certain characteristic.
#For example, we can estimate the proportion of female students in the university based on the result in the sample data set survey.
#Problem
#Find a point estimate of the female student proportion from survey.
#We first filter out missing values in survey$Sex with the na.omit function,
#and save it in gender.response.
library(MASS) # load the MASS package
gender.response = na.omit(survey$Sex)
gender.response
summary(gender.response)
str(gender.response)
n = length(gender.response) # valid responses count
n
#To find out the number of female students, we compare gender.response with the factor’Female’, and compute the sum. Dividing it by n gives the female student proportion in the sample survey.
k = sum(gender.response == "Female")
pbar = k/n
pbar
#The point estimate of the female student proportion in survey is 50%.
##CHUNK 3
#Problem
#Find a point estimate of the number of civil wars within countries (poisson count distribution)
civilwars = read.delim("civilwars.txt")
summary(civilwars)
#the poisson parameter is lambda, that is both the mean and the variance estimate
#it represents the mean number of occurrences of civil wars around the world
mean(civilwars$cases) ##mean (lambda) estimate
mean(civilwars$cases) ##for a Poisson the same value also estimates the variance (mean = variance)
##CHUNK 4
#Problem
#find a point estimation for the parameter of the exponential distribution
## it is 1/mean
##the file contains the duration (in years) of a sample of washing machines (until each breaks for the first time)
wasmach = read.delim("wasmach.txt")
summary(wasmach)
mean(wasmach$dur)
freq = 1/mean(wasmach$dur)
freq
##we can state the sample comes from an exponential distribution with expected parameter 0.246
##CHUNK 5
#Calculating Confidence Intervals
#Here we look at some examples of calculating confidence intervals.
#The examples are for both normal and t distributions.
#Note that an easier way to calculate confidence intervals using the t.test command is discussed in section The Easy Way.
#Calculating a Confidence Interval From a Normal Distribution
#We will make some assumptions for what we might find in an experiment and find the resulting confidence interval using a normal distribution.
#Here we assume that the sample mean is 5, the standard deviation is 2, and the sample size is 20.
#In the example below we will use a 95% confidence level and wish to find the confidence interval.
#The commands to find the confidence interval in R are the following:
a <- 5 #mean
s <- 2 #sd
n <- 20 #sample
error <- qnorm(0.975)*s/sqrt(n) ##quantile corresponding to CDF=0.975 of the normal
left <- a-error
right <- a+error
left
right
#The true mean has a probability of 95% of being in the interval between 4.12 and 5.88 assuming that the original random variable is normally distributed, and the samples are independent.
##CHUNK 6
#Calculating a Confidence Interval From a t Distribution
#Calculating the confidence interval when using a t-test is similar to using a normal distribution.
#The only difference is that we use the command associated with the t-distribution rather than the normal distribution.
#Here we repeat the procedures above, but we will assume that we are working with a sample standard deviation rather than an exact standard deviation.
#Again we assume that the sample mean is 5, the ###sample#### standard deviation is 2, and the sample size is 20. We use a 95% confidence level and wish to find the confidence interval. The commands to find the confidence interval in R are the following:
##we use the t (Student) distribution when n is small (so the CLT does not help) and sd is estimated.
a <- 5
s <- 2
n <- 20
error <- qt(0.975,df=n-1)*s/sqrt(n) ##quantile corresponding to CDF=0.975 of the t student
left <- a-error
right <- a+error
left
right
#The true mean has a probability of 95% of being in the interval between 4.06 and 5.94 assuming that the original random variable is normally distributed, and the samples are independent.
##the interval is bigger, because t student is less narrow.
#We have less detailed information because the sd is estimated from the sample. So, we use the more cautious method.
##CHUNK 7
#Calculating Many Confidence Intervals From a t Distribution.
##imagine have the population splitted in 2, and an extraction of samples from them.
#Suppose that you want to find the confidence intervals for many tests.
#This is a common task and most software packages will allow you to do this.
#We have three different sets of results:
#Comparison 1
# Mean Std. Dev. Number (pop.)
#Group I 10 3 300
#Group II 10.5 2.5 230
#Comparison 2
# Mean Std. Dev. Number (pop.)
#Group I 12 4 210
#Group II 13 5.3 340
#Comparison 3
# Mean Std. Dev. Number (pop.)
#Group I 30 4.5 420
#Group II 28.5 3 400
#For each of these comparisons we want to calculate the associated confidence interval for the difference of the means.
#For each comparison there are two groups. We will refer to group one as the group whose results are in the first row of each comparison above.
#We will refer to group two as the group whose results are in the second row of each comparison above.
#Before we can do that we must first compute a standard error and a t-score.
#We will find general formulae which is necessary in order to do all three calculations at once.
#We assume that the means for the first group are defined in a variable called m1.
#The means for the second group are defined in a variable called m2.
#The standard deviations for the first group are in a variable called sd1.
#The standard deviations for the second group are in a variable called sd2.
#The number of samples for the first group are in a variable called num1.
#Finally, the number of samples for the second group are in a variable called num2.
#With these definitions the (global) standard error is the square root of (sd1^2)/num1+(sd2^2)/num2.
#The R commands to do this can be found below:
m1 <- c(10,12,30)
m2 <- c(10.5,13,28.5)
sd1 <- c(3,4,4.5)
sd2 <- c(2.5,5.3,3)
num1 <- c(300,210,420)
num2 <- c(230,340,400)
se <- sqrt(sd1*sd1/num1+sd2*sd2/num2)
error <- qt(0.975,df=pmin(num1,num2)-1)*se ### degrees of freedom: min(n1, n2) - 1 for each comparison
#Now we need to define the confidence interval around the assumed differences.
#Just as in the case of finding the p values in previous chapter we have to use the pmin command to get the number of degrees of freedom.
#In this case the null hypotheses are for a difference of zero, and we use a 95% confidence interval:
left <- (m1-m2)-error
right <- (m1-m2)+error
left
right
#This gives the confidence intervals for each of the three tests
###CONFIDENCE INTERVAL FOR PROPORTION??
#Hypothesis testing
#Researchers retain or reject hypothesis based on measurements of observed samples.
#The decision is often based on a statistical mechanism called hypothesis testing.
#A type I error is the mishap of falsely rejecting a null hypothesis when the null hypothesis is true.
#The probability of committing a type I error is called the significance level of the hypothesis testing, and is denoted by the Greek letter alpha.
#We demonstrate the procedure of hypothesis testing in R first with the intuitive critical value approach ...
#Then we discuss the popular p-value approach as an alternative.
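##A small added illustration: with the critical value approach, the one-sided critical value
##depends only on the chosen significance level alpha, e.g. for a z test:
sapply(c(0.01, 0.05, 0.10), function(alpha) qnorm(1 - alpha))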
## CHUNK 8
############Lower Tail Test of Population Mean with Known Variance #############
xbar = 9900 # sample mean
mu0 = 10000 # hypothesized value
sigma = 120 # population standard deviation
n = 30 # sample size
z = (xbar-mu0)/(sigma/sqrt(n))
z # test statistic
# We then compute the critical value at .05 significance level.
alpha = .05
z.alpha = qnorm(1-alpha)
-z.alpha # critical value
#Answer
#The test statistic -4.5644 is less than the critical value of -1.6449. Hence, at .05 significance level, we reject the claim that mean lifetime of a light bulb is above 10,000 hours.
#Alternative Solution
#Instead of using the critical value, we apply the pnorm function to compute the lower tail p-value of the test statistic. As it turns out to be less than the .05 significance level, we reject the null hypothesis that µ = 10000.
pval = pnorm(z)
pval # lower tail p-value
##CHUNK 9 # Upper Tail Test of Population Mean with Known Variance
#Problem
#Suppose the food label on a cookie bag states that there is at most 2 grams of saturated fat in a single cookie. In a sample of 35 cookies, it is found that the mean amount of saturated fat per cookie is 2.1 grams. Assume that the population standard deviation is 0.25 grams. At .05 significance level, can we reject the claim on food label?
#The null hypothesis is that µ = 2. We begin with computing the test statistic.
xbar = 2.1 # sample mean
mu0 = 2 # hypothesized value
sigma = 0.25 # population standard deviation
n = 35 # sample size
z = (xbar-mu0)/(sigma/sqrt(n))
z # test statistic
#We then compute the critical value at .05 significance level.
alpha = .05
z.alpha = qnorm(1-alpha)
z.alpha # critical value
#Answer
#The test statistic 2.3664 is greater than the critical value of 1.6449. Hence, at .05 significance level, we reject the claim that there is at most 2 grams of saturated fat in a cookie.
#Alternative Solution
#Instead of using the critical value, we apply the pnorm function to compute the upper tail p-value of the test statistic. As it turns out to be less than the .05 significance level, we reject the null hypothesis that µ = 2.
pval = pnorm(z, lower.tail=FALSE)
pval # upper tail p-value
##values at which you can set the alpha coefficient: 0.01, 0.05, 0.1.
##CHUNK 10 #Two-Tailed Test of Population Mean with Known Variance
#the null hypothesis of the two-tailed test is rejected if z <= -z_(a/2) or z >= z_(a/2), where z_(a/2) is the 100(1 - a/2) percentile of the standard normal distribution.
#Problem
#Suppose the mean weight of King Penguins found in an Antarctic colony last year was 15.4 kg. In a sample of 35 penguins same time this year in the same colony, the mean penguin weight is 14.6 kg. Assume the population standard deviation is 2.5 kg. At .05 significance level, can we reject the null hypothesis that the mean penguin weight does not differ from last year?
#The null hypothesis is that µ = 15.4. We begin with computing the test statistic.
xbar = 14.6 # sample mean
mu0 = 15.4 # hypothesized value
sigma = 2.5 # population standard deviation
n = 35 # sample size
z = (xbar-mu0)/(sigma/sqrt(n))
z # test statistic
#We then compute the critical values at .05 significance level.
alpha = .05
z.half.alpha = qnorm(1-alpha/2)
c(-z.half.alpha, z.half.alpha)
#The test statistic -1.8931 lies between the critical values -1.9600 and 1.9600. Hence, at .05 significance level, we do not reject the null hypothesis that the mean penguin weight does not differ from last year.
#Alternative Solution
#Instead of using the critical value, we apply the pnorm function to compute the two-tailed p-value of the test statistic. It doubles the lower tail p-value as the sample mean is less than the hypothesized value. Since it turns out to be greater than the .05 significance level, we do not reject the null hypothesis that µ = 15.4.
pval = 2 * pnorm(z) # lower tail
pval # two-tailed p-value
##CHUNK 11 #Lower Tail Test of Population Mean with Unknown Variance
#the null hypothesis of the lower tail test is rejected if t <= -t_a, where t_a is the 100(1 - a) percentile of the Student t distribution with n - 1 degrees of freedom.
#Suppose the manufacturer claims that the mean lifetime of a light bulb is more than 10,000 hours. In a sample of 30 light bulbs, it was found that they only last 9,900 hours on average. Assume the sample standard deviation is 125 hours. At .05 significance level, can we reject the claim by the manufacturer?
#The null hypothesis is that µ = 10000. We begin with computing the test statistic.
xbar = 9900 # sample mean
mu0 = 10000 # hypothesized value
s = 125 # sample standard deviation
n = 30 # sample size
t = (xbar-mu0)/(s/sqrt(n))
t # test statistic
#We then compute the critical value at .05 significance level.
alpha = .05
t.alpha = qt(1-alpha, df=n-1)
-t.alpha # critical value
#The test statistic -4.3818 is less than the critical value of -1.6991. Hence, at .05 significance level, we can reject the claim that mean lifetime of a light bulb is above 10,000 hours.
#Alternative Solution
#Instead of using the critical value, we apply the pt function to compute the lower tail p-value of the test statistic. As it turns out to be less than the .05 significance level, we reject the null hypothesis that µ = 10000.
pval = pt(t, df=n-1)
pval # lower tail p-value
##CHUNK 12 #upper Tail Test of Population Mean with Unknown Variance
#the null hypothesis of the upper tail test is rejected if t >= t_a, where t_a is the 100(1 - a) percentile of the Student t distribution with n - 1 degrees of freedom.
#Suppose the food label on a cookie bag states that there is at most 2 grams of saturated fat in a single cookie. In a sample of 35 cookies, it is found that the mean amount of saturated fat per cookie is 2.1 grams. Assume that the sample standard deviation is 0.3 gram. At .05 significance level, can we reject the claim on food label?
#The null hypothesis is that µ = 2. We begin with computing the test statistic.
xbar = 2.1 # sample mean
mu0 = 2 # hypothesized value
s = 0.3 # sample standard deviation
n = 35 # sample size
t = (xbar-mu0)/(s/sqrt(n))
t # test statistic
#We then compute the critical value at .05 significance level.
alpha = .05
t.alpha = qt(1-alpha, df=n-1)
t.alpha # critical value
#The test statistic 1.9720 is greater than the critical value of 1.6909 (qt(.95, df=34)). Hence, at .05 significance level, we can reject the claim that there is at most 2 grams of saturated fat in a cookie.
#Alternative Solution
#Instead of using the critical value, we apply the pt function to compute the upper tail p-value of the test statistic. As it turns out to be less than the .05 significance level, we reject the null hypothesis that µ = 2.
pval = pt(t, df=n-1, lower.tail=FALSE)
pval # upper tail p-value
##CHUNK 13 #Two-Tailed Test of Population Mean with Unknown Variance
#the null hypothesis of the two-tailed test is rejected if t <= -t_(a/2) or t >= t_(a/2), where t_(a/2) is the 100(1 - a/2) percentile of the Student t distribution with n - 1 degrees of freedom.
#Suppose the mean weight of King Penguins found in an Antarctic colony last year was 15.4 kg. In a sample of 35 penguins same time this year in the same colony, the mean penguin weight is 14.6 kg. Assume the sample standard deviation is 2.5 kg. At .05 significance level, can we reject the null hypothesis that the mean penguin weight does not differ from last year?
#The null hypothesis is that µ = 15.4. We begin with computing the test statistic.
xbar = 14.6 # sample mean
mu0 = 15.4 # hypothesized value
s = 2.5 # sample standard deviation
n = 35 # sample size
t = (xbar-mu0)/(s/sqrt(n))
t # test statistic
#We then compute the critical values at .05 significance level.
alpha = .05
t.half.alpha = qt(1-alpha/2, df=n-1)
c(-t.half.alpha, t.half.alpha)
#The test statistic -1.8931 lies between the critical values -2.0322, and 2.0322. Hence, at .05 significance level, we do not reject the null hypothesis that the mean penguin weight does not differ from last year.
#Alternative Solution
#Instead of using the critical value, we apply the pt function to compute the two-tailed p-value of the test statistic. It doubles the lower tail p-value as the sample mean is less than the hypothesized value. Since it turns out to be greater than the .05 significance level, we do not reject the null hypothesis that µ = 15.4.
pval = 2 * pt(t, df=n-1) # lower tail
pval # two-tailed p-value
##CHUNK 14 #Lower Tail Test of Population Proportion
#the null hypothesis of the lower tail test is rejected if z <= -z_a, where z_a is the 100(1 - a) percentile of the standard normal distribution.
#Suppose 60% of citizens voted in the last election. 85 out of 148 people in a telephone survey said that they voted in the current election. At .05 significance level, can we reject the null hypothesis that the proportion of voters in the population is above 60% this year?
#The null hypothesis is that p = 0.6. We begin with computing the test statistic.
pbar = 85/148 # sample proportion
p0 = .6 # hypothesized value
n = 148 # sample size
z = (pbar-p0)/sqrt(p0*(1-p0)/n)
z # test statistic
#We then compute the critical value at .05 significance level.
alpha = .05
z.alpha = qnorm(1-alpha)
-z.alpha # critical value
#The test statistic -0.6376 is not less than the critical value of -1.6449. Hence, at .05 significance level, we do not reject the null hypothesis that the proportion of voters in the population is above 60% this year.
#Alternative Solution 1
#Instead of using the critical value, we apply the pnorm function to compute the lower tail p-value of the test statistic. As it turns out to be greater than the .05 significance level, we do not reject the null hypothesis that p = 0.6.
pval = pnorm(z)
pval # lower tail p-value
#Alternative Solution 2
#We apply the prop.test function to compute the p-value directly. The Yates continuity correction is disabled for pedagogical reasons.
prop.test(85, 148, p=.6, alt="less", correct=FALSE)
##CHUNK 15 #Upper Tail Test of Population Proportion
#the null hypothesis of the upper tail test is rejected if z >= z_a, where z_a is the 100(1 - a) percentile of the standard normal distribution.
#Suppose that 12% of apples harvested in an orchard last year were rotten. 30 out of 214 apples in a harvest sample this year turn out to be rotten. At .05 significance level, can we reject the null hypothesis that the proportion of rotten apples in the harvest stays below 12% this year?
#The null hypothesis is that p = 0.12. We begin with computing the test statistic.
pbar = 30/214 # sample proportion
p0 = .12 # hypothesized value
n = 214 # sample size
z = (pbar-p0)/sqrt(p0*(1-p0)/n)
z # test statistic
#We then compute the critical value at .05 significance level.
alpha = .05
z.alpha = qnorm(1-alpha)
z.alpha # critical value
#The test statistic 0.90875 is not greater than the critical value of 1.6449. Hence, at .05 significance level, we do not reject the null hypothesis that the proportion of rotten apples in harvest stays below 12% this year.
#Alternative Solution 1
#Instead of using the critical value, we apply the pnorm function to compute the upper tail p-value of the test statistic. As it turns out to be greater than the .05 significance level, we do not reject the null hypothesis that p = 0.12.
pval = pnorm(z, lower.tail=FALSE)
pval # upper tail p-value
#Alternative Solution 2
#We apply the prop.test function to compute the p-value directly. The Yates continuity correction is disabled for pedagogical reasons.
prop.test(30, 214, p=.12, alt="greater", correct=FALSE)
##CHUNK 16 #Two-Tailed Test of Population Proportion
#the null hypothesis of the two-tailed test is rejected if z <= -z_(a/2) or z >= z_(a/2), where z_(a/2) is the 100(1 - a/2) percentile of the standard normal distribution.
#Suppose a coin toss turns up 12 heads out of 20 trials. At .05 significance level, can one reject the null hypothesis that the coin toss is fair?
#The null hypothesis is that p = 0.5. We begin with computing the test statistic.
pbar = 12/20 # sample proportion
p0 = .5 # hypothesized value
n = 20 # sample size
z = (pbar-p0)/sqrt(p0*(1-p0)/n)
z # test statistic
#We then compute the critical values at .05 significance level.
alpha = .05
z.half.alpha = qnorm(1-alpha/2)
c(-z.half.alpha, z.half.alpha)
#The test statistic 0.89443 lies between the critical values -1.9600 and 1.9600. Hence, at .05 significance level, we do not reject the null hypothesis that the coin toss is fair.
#Alternative Solution 1
#Instead of using the critical value, we apply the pnorm function to compute the two-tailed p-value of the test statistic. It doubles the upper tail p-value as the sample proportion is greater than the hypothesized value. Since it turns out to be greater than the .05 significance level, we do not reject the null hypothesis that p = 0.5.
pval = 2 * pnorm(z, lower.tail=FALSE) # upper tail
pval # two-tailed p-value
#Alternative Solution 2
#We apply the prop.test function to compute the p-value directly. The Yates continuity correction is disabled for pedagogical reasons.
prop.test(12, 20, p=0.5, correct=FALSE)
##CHUNK 17 ##Defining a sample size based on mean
#Problem
#Assume the population standard deviation of student height in the survey is 9.48 centimeters. Find the sample size needed to achieve a 1.2 centimeter margin of error at the 95% confidence level.
#Since there are two tails of the normal distribution,
#the 95% confidence level would imply the 97.5th percentile of the normal distribution at the upper tail. Therefore, za/2 is given by qnorm(.975).
zstar = qnorm(.975)
sigma = 9.48
E = 1.2
zstar^2 * sigma^2/ E^2
#Based on the assumption of a population standard deviation of 9.48, we need a sample size of at least 240
#to achieve a 1.2 centimeter margin of error at the 95% confidence level.
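##Since a sample size must be a whole number, a common convention (added here) is to round up:
ceiling(zstar^2 * sigma^2 / E^2)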
##CHUNK 18 ##Defining a sample size based on proportion
#The quality of a sample survey can be improved by increasing the sample size.
#The formula below provides the sample size needed for a population proportion interval estimate at (1 - a) confidence level,
#margin of error E, and planned proportion estimate p. Here, z_(a/2) is the 100(1 - a/2) percentile of the standard normal distribution.
#Since there are two tails of the normal distribution,
#the 95% confidence level would imply the 97.5th percentile of the normal distribution at the upper tail. Therefore, za/2 is given by qnorm(.975).
zstar = qnorm(.975)
p = 0.5
E = 0.05
zstar^2 *p * (1-p) / E^2
#With a planned proportion estimate of 50% at the 95% confidence level,
#we need a sample size of at least 385 to achieve a 5% margin of error for the survey of the female student proportion.
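##A brief added check (not in the original): p = 0.5 is the worst case; any other planned proportion needs fewer observations
sapply(c(0.3, 0.4, 0.5), function(p) ceiling(zstar^2 * p * (1-p) / E^2))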
##CHUNK 1### set the directory
setwd("C:/Users/Rudolph/Desktop/lesson1")
##CHUNK 2 ### help
help(solve)
?solve
help.start()
##asking for help on a non-installed package
?spdep
??spdep
##CHUNK 3 ## objects and saving
##create objects
a = 1
b = 2
a
b
##display objects
objects()
##remove a
rm(a)
objects()
##remove all objects
rm(list=ls())
objects()
##CHUNK 4 ### scalars and vectors : manipulation
x = c(1,2,3,4,5)
assign("x", c(1,2,3,4,5))
x
length(x)
str(x)
##reciprocal
1/x
##
y = c(x,0,x)
y
length(y)
str(y)
###CHUNK 5 ### vector arithmetics
v = 2*x + y + 1
v
length(v)
z1 = x + x
z1
z2 = x - x
z2
z3 = x*x
z3
z4 = x/x
z4
z5 = x^3
z5
##logs
logx = log(x)
logx
log10x = log10(x)
log10x
##exp of a log returns the initial vector
xnew = exp(logx)
xnew2 = 10^log10x
xnew
xnew2
##min / max
min(x)
max(x)
##sum, prod, mean, var
sum(x)
prod(x)
mean(x)
sum(x)/length(x)
var(x)
sum((x-mean(x))^2) / (length(x) - 1)
##CHUNK 6 ### : sequences
c = 1:30
d = 2*1:15 ##note ":" has higher precedence than "*", so this is 2*(1:15)
##to practice
n = 10
1:n - 1
1:(n-1)
e = 1:30
f = 2*1:15
#In addition, seq() can be used. seq(2, 10) generates the same vector as 2:10.
#from=, to= and by= are useful arguments:
seq(from= 30, to = 1)
seq(-10, 10, by = 0.5)
#rep() can be used for replicating an object.
rep(x, times=5)
rep(x, each=5)
## CHUNK 7 ## logicals and NA's
temp = x > 3
temp
##it is logical!
str(temp)
#it is the same length as x!
length(temp)
c1 = x > 3
c1
c2 = x < 3
c2
c3 = x != 3
c3
c1&c2
c1|c2
##NA's
z = c(1:3, NA)
z
ind = is.na(z)
ind
0/0
Inf
Inf - Inf
##CHUNK 8 ## subsetting
##1
z = c(1:3, NA)
y = z[!is.na(z)]
y
##2
x = seq(1,20)
z1 = x[1:10]
z2 = x[-(1:10)]
##3
cars = c(1,2,3)
names(cars)=c("ferrari","lamborghini","bugatti")
pref = cars[c("ferrari","bugatti")]
## CHUNK 9 ## attributes
z = 0:9
z
mode(z)
is.character(z)
digits = as.character(z)
digits
mode(digits)
is.numeric(digits)
d = as.integer(digits)
d
mode(d)
is.numeric(d)
##CHUNK 10 ### matrix and data frame, arrays
v = 1:50
is.vector(v)
is.numeric(v)
##change vector in matrix, defining the number of rows and columns
dim(v) = c(10,5)
is.matrix(v)
dim(v)
v
##creating a 4 x 5 array
m = array(1:20, dim = c(4,5))
is.matrix(m)
m
##subsetting an array
##only the first 2 rows
m1 = m[1:2, ]
m1
##only the first 3 columns
m2 = m[ ,1:3]
m2
##only the first 2 rows and the first 3 columns
m3 = m[1:2,1:3]
m3
##replace m with 0 in the first 2 rows and the first 3 columns
m[1:2, 1:3] = 0
m
##CHUNK 11## Matrix manipulation
A = array(1:20, dim=c(4,5))
B = array(seq(20,1), dim=c(4,5))
A
B
C = A*B
dim(C)
D= A%*%t(B)
dim(D)
D = diag(A)
D
is.matrix(D)
is.vector(D)
##transpose (avoid the name T, which is the built-in shorthand for TRUE)
At = t(A)
At
##generalized (Moore-Penrose) inverse from MASS, since A is not square
require(MASS)
ginv(A)
##CHUNK 12 ### lists and data frames
Lst1 = list(name="Rodolfo", surname="Metulini", age = "30", sister.age = c(12,16,20,25))
Lst1
Lst2 = list(name = "Carlo", surname = "Rossi", age= "34", sister.age = c(30,38))
Lst2
##concatenating
list.12 = c(Lst1, Lst2)
list.12
##data.frame
m = array(seq(1:20), dim=c(4,5))
is.matrix(m)
df = as.data.frame(m)
is.data.frame(df)
df
str(df)
##CHUNK 13 ### read tables
ds = read.table("players.csv", sep=";", header=TRUE)
ds
ds = read.delim("players.txt", sep="", header=TRUE)
ds
summary(ds)
str(ds)
capture.output(summary(ds), file="output.txt")
##chunk 14 ### examine a distribution
attach(faithful)
summary(eruptions)
hist(eruptions)
##mmm... make the bins smaller (0.2)
hist(eruptions, seq(1.6,5.2,0.2), prob=TRUE)
##draw lines of the empirical density
density(eruptions)
lines(density(eruptions, bw=0.1))
### cumulative distribution function?
ecdf(eruptions)
plot(ecdf(eruptions))
##it is far from a normal distribution. Let's try to split it in two!
long = eruptions[eruptions > 3]
ecdf(long)
plot(ecdf(long))
##compare with the cumulative distribution of a normal distribution
##define the points in x in which to evaluate the ecdf.
x = seq(3, 5.4, 0.01)
lines(x, pnorm(x, mean= mean(long), sd = sqrt(var(long))), lty=3, col="red")
##with the QQ plot we can check normality graphically.
par(pty="s")
qqnorm(long)
qqline(long)
##the Shapiro-Wilk test gives a more formal test of agreement with normality
shapiro.test(long)
##the p-value is below 0.05, so normality is rejected at the 5% level
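##An additional check (an aside, not in the original lab): a Kolmogorov-Smirnov test
##against the normal distribution with the fitted mean and sd
ks.test(long, "pnorm", mean = mean(long), sd = sd(long))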
###CHUNK 15 ### covariances and concentration indices
A = c(1,3,7,12,15)
B = c(8,7,6,10,18)
Cov = cov(A,B)
Cov
Cor = cor(A,B)
Cor
##We can also calculate the correlation between A and B as follows:
CorAB = Cov / sqrt(var(A)*var(B))
CorAB
##GINI index
require(ineq)
x = c(1,1,1,1,1)
y = c(0,0,200,1,4,2)
ineq(x, type="Gini")
ineq(y, type="Gini")
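##The Lorenz curve underlying the Gini index can also be drawn (Lc is provided by the ineq package):
plot(Lc(y))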
##mode
require(modeest)
x = c(1,3,7,7,7,8,9,12,14,15)
mfv(x)
##CHUNK 1 ###
## Estimating a population mean
mysample = rnorm(25, 10, 3)
mean(mysample)
##create a vector to store the 1000 sample means
samplemeans = rep(NA,1000)
for (i in 1:1000) {
mysample.i = rnorm(25, 10, 3)
samplemeans[i] = mean(mysample.i)
}
hist(samplemeans)
#Investigate the sampling distribution of the sample mean further by taking
#1000 samples from this normal distribution. Make a histogram of the means
#of these 1000 samples. Verify that about 95% of the time the sample mean
#will be within ±2 standard errors (2*sigma/sqrt(n)) of the true population mean
sum(samplemeans >= 10-2*3/sqrt(25) & samplemeans <= 10+2*3/sqrt(25))/1000
##normal confidence interval
x.bar = mean(mysample)
sd = sd(mysample)
x.bar - 2*sd/sqrt(25)
x.bar + 2*sd/sqrt(25)
##bootstrap Confidence interval
#The bootstrap estimates the standard error by resampling the data in our original sample. The idea is to treat
#the sample as a substitute for the population. Instead of repeatedly drawing samples of size 25 from the
#population, we repeatedly draw new samples of size 25 from our original sample. In order not to end
#up with exactly the same sample every time, we resample with replacement: after a
#value is drawn, we replace it before drawing again. This way the same value can be drawn more than
#once. The following lines demonstrate how this resampling process works.
bootsample = sample(mysample, 25, replace=T)
bootsample
bootmean = mean(bootsample)
bootmean
#To do a full bootstrap, we take a large number of bootstrap samples (at least 1000) from the original
#sample. For each bootstrap sample, we calculate its mean. The variation
#in the means of the bootstrap samples gives us an estimate of the variability of the sample mean. That is,
#we can estimate the standard error of the sample mean using the standard deviation of the bootstrapped
#sample means. We can then use this to construct a confidence interval for the true population mean.
#Take 1000 bootstrap samples from the original sample and calculate the mean of each of these
#bootstrap samples. Make a histogram of these bootstrapped means.
bootmeans = rep(NA, 1000)
for (i in 1:1000) {
bootsample = sample(mysample, 25, replace=T)
bootmeans[i] = mean(bootsample)
}
hist(bootmeans)
sdb = sd(bootmeans)
x.bar - 2*sdb/sqrt(25)
x.bar + 2*sdb/sqrt(25)
##CHUNK 2 ###
##bootstrap confidence interval for the rate of an EXPONENTIAL (using the quantile method)
my.Test.Data = rexp(100,3)
mean(my.Test.Data)
lambda_hat = 1/mean(my.Test.Data) ##THE ESTIMATED PARAMETER
##normal quantiles (because CLT)
CIlow = lambda_hat - qnorm(0.975)*(lambda_hat/sqrt(100))
CIhigh = lambda_hat + qnorm(0.975)*(lambda_hat/sqrt(100))
CI =cbind(CIlow,CIhigh)
CI
##how to find a bootstrap confidence interval?
## initialize a vector that will receive the value of the estimate
#for each bootstrap sample
boot.sampling.dist = rep(NA, 2000)
##compute 2000 bootstrap samples and compute the statistic for each of them
for (i in 1:2000) {
boot.sampling.dist[i] = 1/mean(rexp(100,lambda_hat))
}
## look at the sample distribution
windows() ##open a new plotting device (Windows only; use x11() or quartz() on other systems)
hist(boot.sampling.dist, main = "Estimate of sampling distribution of lambda", breaks = 50)
##find the quantile of the new distribution
my.quantiles = quantile(boot.sampling.dist, c(.025,.975))
##calculate the bootstrap confidence interval boundaries (basic/reflected interval built from the quantiles)
CIbootlow = 2*lambda_hat - my.quantiles[2]
CIboothigh = 2*lambda_hat - my.quantiles[1]
CIboot =cbind(CIbootlow,CIboothigh)
CIboot
###CHUNK 3 ###
#bootstrap in regression models
#We can now form a confidence interval using the bootstrap. To do this, we need to implement a bivariate
#bootstrap sampling (resampling whole rows of the data). We will use the percentile bootstrap method. Consider the script below:
### create our data and calculate thetahat, the slope of the regression line
library(MASS)
data(cats)
dim(cats)
thetahat = lm(Hwt ~ Bwt, data=cats)$coeff[2]
thetahat
##construct 1000 bootstrap samples of size n = 144 and, for each one,
##compute the estimated slope (thetahat).
thetahat.b = rep(NA,1000)
for (i in 1:1000) {
### draw the bootstrap sample and calculate thetahat.b
index = 1:144
bootindex = sample(index, 144, replace=T)
bootsample = cats[bootindex,]
thetahat.b[i] = lm(Hwt ~ Bwt, data=bootsample)$coeff[2]
}
hist(thetahat.b)
quantile(thetahat.b, .025)
quantile(thetahat.b, .975)
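##For comparison (an added aside): the classical normal-theory 95% interval for the slope from lm
confint(lm(Hwt ~ Bwt, data=cats))[2,]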
##CHUNK 4 ###
##Alternative method: resampling the residuals
##fit the model once on the original data and keep the fitted values and residuals
fit = lm(Hwt ~ Bwt, data=cats)
fitted.vals = fitted(fit)
resid.vals = resid(fit)
##to store the 1000 bootstrap estimates of the slope
thetahat.b_a = rep(NA,1000)
##in each replication: resample the residuals with replacement, rebuild the response,
##and re-estimate the slope on the original Bwt values
for (i in 1:1000) {
boot.res = sample(resid.vals, 144, replace=T)
Hwt.star = fitted.vals + boot.res
thetahat.b_a[i] = lm(Hwt.star ~ cats$Bwt)$coeff[2]
}
hist(thetahat.b_a)
quantile(thetahat.b_a, .025)
quantile(thetahat.b_a, .975)
hist(thetahat.b)
quantile(thetahat.b, .025)
quantile(thetahat.b, .975)
## MARKOV CHAINS
### CHUNK 5:
## what is the probability the day after tomorrow will rain,
## given it rained today?
##Construct transition matrix
trans.mat = matrix(c(0.7,0.3,0.4,0.6), 2, 2, byrow=TRUE)
##2step transition matrix
trans.mat.2 = trans.mat %*% trans.mat
trans.mat.2
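##Assuming state 1 = "rain" (an assumed labeling), the (1,1) entry of the 2-step matrix answers the question:
##P(rain the day after tomorrow | rain today)
trans.mat.2[1,1]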
### CHUNK 6:
## what is the unconditional probability of rain at time n?
trans.mat = matrix(c(0.7,0.3,0.4,0.6), 2, 2, byrow=TRUE)
##2step transition matrix
trans.mat.2 = trans.mat %*% trans.mat
init.dist = matrix (c(0.4,0.6), 1, 2)
init.dist
unc.prob.n = init.dist %*%trans.mat.2
unc.prob.n
##CHUNK 7:
##stationary transition distribution
trans.mat = matrix(c(0.7,0.3,0.4,0.6), 2, 2, byrow=TRUE)
##2step transition matrix
trans.mat.2 = trans.mat %*% trans.mat
trans.mat.2
##4step transition matrix
trans.mat.4 = trans.mat %*% trans.mat %*% trans.mat %*% trans.mat
trans.mat.4
##6 step transition matrix
trans.mat.6 = trans.mat %*% trans.mat %*% trans.mat %*% trans.mat%*% trans.mat %*% trans.mat
trans.mat.6
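##A sketch of another route (added aside): the stationary distribution is the eigenvector of t(trans.mat)
##associated with eigenvalue 1, rescaled to sum to 1; the rows of the high-power matrices above converge to it
ev = eigen(t(trans.mat))
stat.dist = Re(ev$vectors[,1]) / sum(Re(ev$vectors[,1]))
stat.dist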
##CHUNK 8
## alternative method: closed-form stationary distribution
trans.mat
##for a 2-state chain with p = P(1->2) and q = P(2->1), the stationary distribution is (q/(p+q), p/(p+q))
alpha_stat = trans.mat[2,1] / (trans.mat[1,2] + trans.mat[2,1])
alpha_stat
beta_stat = trans.mat[1,2] / (trans.mat[1,2] + trans.mat[2,1])
beta_stat
##CHUNK 1 ##
###correlation and plot
library("MASS")
data(cats)
str(cats)
summary(cats)
# "Bwt" is the body weight in kilograms, "Hwt" is the heart weight in grams,
# and "Sex" should be obvious.
# There are no missing values in any of the variables,
# so we are ready to begin by looking at a scatterplot...
##DRAW THE SCATTERPLOT
with(cats, plot(Bwt, Hwt))
##ADD TITLE
title(main="Heart Weight (g) vs. Body Weight (kg)\nof Domestic Cats")
##ALTERNATIVELY ...
with(cats, plot(Hwt ~ Bwt))
# The scatterplot shows a fairly strong and reasonably linear relationship between the two variables. A Pearson product-moment correlation coefficient can be calculated using the cor( ) function...
#CORRELATION
with(cats, cor(Bwt, Hwt))
##pearson test for correlation bigger (or lower) than 0.
with(cats, cor.test(Bwt, Hwt))
#Since we would expect a positive correlation here,
#we might have set the alternative to "greater"
#(we "forgot" the lower tail and perform a one-side "upper -tail" test)
with(cats, cor.test(Bwt, Hwt, alternative="greater", conf.level=.95))
#Using the formula interface makes it easy to subset the data
#by rows of the data frame, to analyze female cats only
with(cats, cor.test(~ Bwt + Hwt, subset=(Sex=="F")))
##here, for the females, correlation is lower, but still significant
###suppose a dataset with more than 2 variables.
data(cement) # also in the MASS library
str(cement)
cor(cement) ##pairwise correlations among all the variables
cov(cement)
# If you have a covariance matrix and want a correlation matrix...
cov.matr = cov(cement)
cov2cor(cov.matr)
# If you want a visual representation of the correlation matrix
pairs(cement) ##multidimensional scatterplot
##CHUNK 2 ###
###linear regre model
#Simple Linear Regression
#There is little question that R was written with regression analysis in mind!
#The number of functions and add-on packages devoted to regression analysis of one kind or another is staggering!
#Let's go back to our cats data...
data(cats) # still in the "MASS" library
#Simple (as well as multiple) linear regression is done using the lm( ) function.
#This function requires a formula interface; for simple regression the formula takes the form DV ~ IV.
#It's critical to remember that the DV (response variable) must come before the tilde and the IV (explanatory variable) after it.
#Getting the regression equation for our "cats" data is simple
olscats = lm(Hwt ~ Bwt, data=cats)
##summary
summary(olscats)
#So the regression equation is: Hwt = 4.0341 (Bwt) - 0.3567.
##draw the real points
plot(Hwt ~ Bwt, main="Kitty Cat Plot", data=cats)
##add the regression (predicted values) line
abline(olscats, col="red")
###regression diagnostics
par(mfrow=c(2,2))
plot(olscats)
str(olscats, max.level=1)
##leverage
leverage = influence(olscats)
leverage$coefficients
##each row measures the change in the coefficients when observation i is omitted from the sample
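##Equivalently (a quick added cross-check): dfbeta() reports the per-observation coefficient changes,
##which should match the values above
head(dfbeta(olscats))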
##CHUNK 3 ###
###prediction
#Problem
#Apply the simple linear regression model to the data set cats,
#and estimate the heart weight of a cat with a body weight of 3 kg.
#Solution
#We apply the lm function to a formula that describes the variable Hwt as a function of Bwt,
#and save the linear regression model in a new variable olscats
olscats = lm(Hwt ~ Bwt, data=cats)
#Then we extract the parameters of the estimated regression equation with the coefficients function.
coeffs = coefficients(olscats)
coeffs
#We now predict the heart weight of a 3 kg cat using the estimated regression equation.
Bwt = 3
Hwt = coeffs[1] + coeffs[2]*Bwt
Hwt
#Answer
#Alternative Solution
#We wrap the Bwt value inside a new data frame named newdata.
newdata = data.frame(Bwt=3) # wrap the parameter
#Then we apply the predict function to olscats along with newdata.
predict(olscats, newdata) # apply predict
##CHUNK 4 ###
###stepwise ###
summary(cement)
olscement = lm(y ~ x1 + x2 + x3 + x4,data=cement)
summary(olscement)
require(Rcmdr)
stepwise(olscement)
##AIC, BIC
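##The information criteria mentioned above can also be inspected directly:
AIC(olscement)
BIC(olscement)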
##CHUNK 5 ###
##heteroscedasticity
require(car) ##ncvTest is provided by the car package
ncvTest(olscement) ##score test for non-constant error variance (heteroscedasticity)
robustcement = rlm(y ~ x1 + x2 + x3 + x4,data=cement) ##robust (M-estimation) regression, less sensitive to outliers
###robustness checks
plot(robustcement)
coeffs1 = coefficients(olscement)
coeffs1
coeffs2 = coefficients(robustcement)
coeffs2
all.equal(coeffs1, coeffs2)
##chunk 6 ###
### glm #####
library(pscl)
data(AustralianElections)
str(AustralianElections)
summary(AustralianElections)
poiselections = glm(Seats ~ Informal + Turnout,data=AustralianElections, family=poisson)
summary(poiselections)
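##A common follow-up (added aside): with the default log link, exponentiating the Poisson coefficients
##reads them as multiplicative effects on the expected count
exp(coefficients(poiselections))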
##CHUNK 7 ##HOMEWORKS
summary(cement)
olscement = lm(y ~ x1 + x2 + x3 + x4,data=cement)
summary(olscement)
ols2 = lm(y ~ x1 ,data=cement)
summary(ols2)
## 1. DEFINE THE COEFFICIENTS
Y = cement$y ##vector of the dependent variable values
X = cement$x1 ##vector of the independent variable values (variable x1)
str(X)
str(Y)
is.vector(X)
is.vector(Y)
Xm = mean(X) ##mean of the vector X
Ym = mean(Y) ##mean of the vector Y
Xd = X - Xm ##diff from Xm vector
Yd = Y - Ym ##diff from Ym vector
Xd
Yd
num = sum(Yd*Xd)
den = sum(Xd*Xd)
num
den
beta1hat = num / den
beta1hat
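##cross-check with the slope estimated by lm:
coefficients(ols2)[2]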
## 2. DEFINE THE R^2
Yhat = ols2$fitted ###fitted values vector
Y = cement$y ##vector of the dependent values
Ym = mean(Y) ##mean vector
num = sum((Yhat - Ym)^2) ##numerator: regression (explained) sum of squares
den = sum((Y - Ym)^2) ##denominator: total sum of squares
R2 = num / den ##R^2
R2
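##cross-check against the value reported by lm:
summary(ols2)$r.squared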
##drawing random numbers
x = rnorm(500, mean=100, sd=10)
x
hist1 = hist(x, prob=TRUE)
x = rnorm(500, mean=100, sd=10)
hist2 = hist(x,prob=TRUE,ylim=c(0,0.04))
####
curve(dnorm(x,mean=100,sd=10),70,130,add=TRUE,lwd=2,col="red")
##distribution of sample means
mean(x)
x=rnorm(500,mean=100,sd=10)
mean(x)
### producing a vector of sample means
mu=100
sigma=10
n=5
xbar= rep(0,500)
for (i in 1:500) { xbar[i] = mean(rnorm(n, mean=mu,sd=sigma))}
hist(xbar,prob=TRUE, breaks=12,xlim=c(80,120),ylim = c(0,0.1))
mean(xbar)
###increasing the sample size
mu=100
sigma=10
n=10
xbar=rep(0,500)
for (i in 1:500) { xbar[i] = mean(rnorm(n, mean=mu,sd=sigma))}
hist(xbar,prob=TRUE, breaks=12,xlim=c(80,120),ylim = c(0,0.15))
###law of large numbers
##drawing random numbers
x = rbinom(100,1,0.5)
#x2 = rbinom(100,2,0.5)
##hist random numbers
hist(x)
##define the empirical frequency
sum(x)
###define the empirical frequency for the sample mean
freq= rep(0,1000)
freq
##for loop
##define the number i
N = rep(0,1000)
for (i in 1:1000) {N[i] = i}
N
#define the cumulative frequency (running total of successes across experiments)
freq[1] = sum(rbinom(100,1,0.5))
for (i in 2:1000) { freq[i] = (sum(rbinom(100,1,0.5)) + freq[i-1])}
freq
##define the sample mean
freq2 = rep(0,1000)
for (i in 1:1000) {freq2[i] = freq[i]/i}
freq2
plot(freq2, ylim= c(48,52))
mu = rep(50,1000)
mu
points(mu,col="red")
#are equal?
all.equal(freq2[1], mu[1])
all.equal(freq2[10], mu[10])
all.equal(freq2[100], mu[100])
all.equal(freq2[1000],mu[1000])