
Laboratory of Statistics with R

Statistics Lab
Rodolfo Metulini, IMT Institute for Advanced Studies, Lucca, Italy
Lesson 3 - Point Estimate, Confidence Interval and Hypothesis Tests - 16.01.2014

Introduction
Let's start with empirical data (one variable of length N) read from an external file, and suppose we consider it to be the population. We define a sample of size n. Suppose we do not have information on the population (or, better, we want to check if and how the sample can represent the population). In other words, we want to make inference using the information contained in the sample, in order to obtain an estimate for the population. That sample is one of several samples we can randomly draw from the population (the sample space). What are the instruments to obtain information about the population? (1) Sample mean (point estimation), (2) Confidence interval, (3) Hypothesis tests.

Sample space
In probability theory, the sample space of an experiment or random trial is the set of all possible outcomes or results of that experiment. It is common to refer to a sample space by the labels S, Ω, or U. For example, for tossing two coins, the corresponding sample space would be {(head,head), (head,tail), (tail,head), (tail,tail)}, so that the dimension is 4: dim(Ω) = 4. It means that we can obtain 4 different samples with 4 corresponding different sample means. In practice, we deal with only one sample taken at random from the sample space.

Point estimate
A point estimate lets us summarize the information contained in the population (dimension N) through only 1 value constructed using the n sample values. The most used, unbiased point estimator is the sample mean:
X̂n = (1/n) Σ_{i=1}^{n} xi
Other point estimators are: (1) the sample median, (2) the sample mode, (3) the geometric mean:
Mg = (Π_{i=1}^{n} xi)^(1/n) = exp[ (1/n) Σ_{i=1}^{n} ln xi ]
An example of what is not an estimator is the sample mean computed after truncating the sample at a certain value. P.S. A naive definition of estimator: an estimator is computed using all the n pieces of information in the sample.

Efficient estimators
The BLUE (Best Linear Unbiased Estimator) is defined as follows:
1. it is a linear function of all the sample values;
2. it is unbiased (E(X̂n) = θ);
3. it has the smallest sampling variance among all unbiased estimators.
The sample mean is BLUE for the parameter µ. Some estimators are biased but consistent: an estimator is consistent when it becomes unbiased as n −→ ∞.

Point estimators - cases
◮ Normal samples: X̂n is the BLUE estimator for the µ parameter (mean).
◮ Bernoulli samples, f(x) = ρ^x (1 − ρ)^(1−x): X̂n is an unbiased estimator for the ρ parameter (frequency).
◮ Poisson samples, f(x) = e^(−k) k^x / x!: X̂n is an unbiased estimator for the k parameter (which represents both the mean and the variance of the distribution).
◮ Exponential samples, f(x) = λ e^(−λx): 1/X̂n is an estimator for the λ parameter (the density at value 0).

Confidence interval theory
With point estimators we make use of only one value to infer about the population. With a confidence interval we define a minimum and a maximum value within which we expect the population parameter to lie. Formally, we need to calculate:
µ1 = X̂n − z · σ/√n
µ2 = X̂n + z · σ/√n
and we end up with the interval µ̂ = {µ1; µ2}. Here: X̂n is the sample mean; z is the upper (or lower) critical value of the theoretical distribution; σ is the standard deviation of the theoretical distribution; n is the sample size.
(See the graph.)

Confidence interval theory - Gaussian
We will make some assumptions for what we might find in an experiment and find the resulting confidence interval using a normal distribution. Let us assume that the sample mean is 5, the standard deviation in the population is known and equal to 2, and the sample size is n = 20. In the example below we will use a 95 per cent confidence level and wish to find the confidence interval. N.B. Here, since the confidence level is 95 per cent, the z (the critical value) to consider is the one whose CDF (i.e. pnorm) equals 0.975. We can equivalently speak of α = 0.05, or 1 − α = 0.95, or 1 − α/2 = 0.975.

Confidence interval theory - t-Student
We use the t-Student distribution when n is small and the sd is unknown in the population. We need to use a sample variance estimate:
σ̂ = sqrt[ Σ_i (xi − X̂n)² / (n − 1) ]
The t-Student distribution is more spread out. In simple words, since we do not know the population sd, we need wider intervals (a cautious approach). The only difference with the normal distribution is that we use the command associated with the t-distribution rather than the normal distribution. Here we repeat the procedures above, but we assume that we are working with a sample standard deviation rather than an exact standard deviation. N.B. The t distribution is characterized by its degrees of freedom. In this case the degrees of freedom are equal to n − 1, because we use 1 estimation (1 constraint).

Confidence interval theory - comparison of two means
In some cases we can have an experiment called (for example) case-control. Let's imagine the population is split in 2: one is the treated group, the other is the non-treated group. Suppose we extract two samples from them with the aim of testing whether the two samples come from populations with the same mean parameter (is the treatment effective?). The output of this test will be a confidence interval representing the difference between the two means. N.B. Here, the degrees of freedom of the t-distribution are equal to min(n1, n2) − 1.

Formulas
◮ Gaussian confidence interval: µ̂ = {µ1, µ2} = X̂n ± z · σ/√n
◮ t-Student confidence interval: µ̂ = {µ1, µ2} = X̂n ± t_{n−1} · σ̂/√n
◮ t-Student confidence interval for the difference of two sample means: µ̂diff = {µdiff1, µdiff2} = (X̂1 − X̂2) ± t_{min(n1,n2)−1} · sd, where sd = sqrt( sd1²/n1 + sd2²/n2 )
◮ Gaussian confidence interval for a proportion (Bernoulli distribution): ρ̂ = {ρ1, ρ2} = f̂ ± z · sd, where sd = sqrt( ρ(1 − ρ)/n )

Hypothesis testing
Researchers retain or reject hypotheses based on measurements of observed samples. The decision is often based on a statistical mechanism called hypothesis testing. A type I error is the mishap of falsely rejecting a null hypothesis when the null hypothesis is true. The probability of committing a type I error is called the significance level of the hypothesis test, and is denoted by the Greek letter α (the same used in the confidence intervals). We demonstrate the procedure of hypothesis testing in R first with the intuitive critical value approach. Then we discuss the popular (and very quick) p-value approach as an alternative.

Hypothesis testing - lower tail
The null hypothesis of the lower tail test of the population mean can be expressed as follows: µ ≥ µ0, where µ0 is a hypothesized lower bound of the true population mean µ.
Let us define the test statistic z in terms of the sample mean, the sample size and the population standard deviation σ:
z = (X̂n − µ0) / (σ/√n)
Then the null hypothesis of the lower tail test is to be rejected if z ≤ z_α, where z_α is the 100(α) percentile of the standard normal distribution.

Hypothesis testing - upper tail
The null hypothesis of the upper tail test of the population mean can be expressed as follows: µ ≤ µ0, where µ0 is a hypothesized upper bound of the true population mean µ. Let us define the test statistic z in terms of the sample mean, the sample size and the population standard deviation σ:
z = (X̂n − µ0) / (σ/√n)
Then the null hypothesis of the upper tail test is to be rejected if z ≥ z_{1−α}, where z_{1−α} is the 100(1 − α) percentile of the standard normal distribution.

Hypothesis testing - two tailed
The null hypothesis of the two-tailed test of the population mean can be expressed as follows: µ = µ0, where µ0 is a hypothesized value of the true population mean µ. Let us define the test statistic z in terms of the sample mean, the sample size and the population standard deviation σ:
z = (X̂n − µ0) / (σ/√n)
Then the null hypothesis of the two-tailed test is to be rejected if z ≤ z_{α/2} or z ≥ z_{1−α/2}, where z_{α/2} is the 100(α/2) percentile of the standard normal distribution.

Hypothesis testing - lower tail with unknown variance
The null hypothesis of the lower tail test of the population mean can be expressed as follows: µ ≥ µ0, where µ0 is a hypothesized lower bound of the true population mean µ. Let us define the test statistic t in terms of the sample mean, the sample size and the sample standard deviation σ̂:
t = (X̂n − µ0) / (σ̂/√n)
Then the null hypothesis of the lower tail test is to be rejected if t ≤ t_α, where t_α is the 100(α) percentile of the Student t distribution with n − 1 degrees of freedom.

Hypothesis testing - upper tail with unknown variance
The null hypothesis of the upper tail test of the population mean can be expressed as follows: µ ≤ µ0, where µ0 is a hypothesized upper bound of the true population mean µ. Let us define the test statistic t in terms of the sample mean, the sample size and the sample standard deviation σ̂:
t = (X̂n − µ0) / (σ̂/√n)
Then the null hypothesis of the upper tail test is to be rejected if t ≥ t_{1−α}, where t_{1−α} is the 100(1 − α) percentile of the Student t distribution with n − 1 degrees of freedom.

Hypothesis testing - two tailed with unknown variance
The null hypothesis of the two-tailed test of the population mean can be expressed as follows: µ = µ0, where µ0 is a hypothesized value of the true population mean µ. Let us define the test statistic t in terms of the sample mean, the sample size and the sample standard deviation σ̂:
t = (X̂n − µ0) / (σ̂/√n)
Then the null hypothesis of the two-tailed test is to be rejected if t ≤ t_{α/2} or t ≥ t_{1−α/2}, where t_{α/2} is the 100(α/2) percentile of the Student t distribution with n − 1 degrees of freedom.

Lower Tail Test of Population Proportion
The null hypothesis of the lower tail test about a population proportion can be expressed as follows: ρ ≥ ρ0, where ρ0 is a hypothesized lower bound of the true population proportion ρ. Let us define the test statistic z in terms of the sample proportion and the sample size:
z = (ρ̂ − ρ0) / sqrt( ρ0(1 − ρ0)/n )
Then the null hypothesis of the lower tail test is to be rejected if z ≤ z_α, where z_α is the 100(α) percentile of the standard normal distribution.
Upper Tail Test of Population Proportion
The null hypothesis of the upper tail test about a population proportion can be expressed as follows: ρ ≤ ρ0, where ρ0 is a hypothesized upper bound of the true population proportion ρ. Let us define the test statistic z in terms of the sample proportion and the sample size:
z = (ρ̂ − ρ0) / sqrt( ρ0(1 − ρ0)/n )
Then the null hypothesis of the upper tail test is to be rejected if z ≥ z_{1−α}, where z_{1−α} is the 100(1 − α) percentile of the standard normal distribution.

Two Tailed Test of Population Proportion
The null hypothesis of the two-tailed test about a population proportion can be expressed as follows: ρ = ρ0, where ρ0 is a hypothesized true population proportion. Let us define the test statistic z in terms of the sample proportion and the sample size:
z = (ρ̂ − ρ0) / sqrt( ρ0(1 − ρ0)/n )
Then the null hypothesis of the two-tailed test is to be rejected if z ≤ z_{α/2} or z ≥ z_{1−α/2}.

Sample size definition
The quality of a sample survey can be improved (worsened) by increasing (decreasing) the sample size. The formulas below provide the sample size needed under the requirement of a population interval estimate at the (1 − α) confidence level, with margin of error E and a planned parameter estimate. Here, z_{1−α/2} is the 100(1 − α/2) percentile of the standard normal distribution.
◮ For the mean: n = z²_{1−α/2} · σ² / E²
◮ For the proportion: n = z²_{1−α/2} · ρ(1 − ρ) / E²

Sample size definition - Exercises
◮ Mean: assume the population standard deviation σ of the student height in the survey is 9.48. Find the sample size needed to achieve a 1.2 centimeter margin of error at the 95 per cent confidence level. Since there are two tails of the normal distribution, the 95 per cent confidence level implies the 97.5th percentile of the normal distribution at the upper tail. Therefore, z_{1−α/2} is given by qnorm(.975).
◮ Proportion: using a 50 per cent planned proportion estimate, find the sample size needed to achieve a 5 per cent margin of error for the female student survey at the 95 per cent confidence level. Again, z_{1−α/2} is given by qnorm(.975).

Homeworks
1: Confidence interval for the proportion. Suppose we have a sample of n = 25 births, 15 of which are female. Define the interval (at 99 per cent) for the proportion of females in the population. HINT: apply, with the proper functions in R, the proportion formula given above. (An R sketch is given below, after the charts.)
2: Hypothesis test to compare two proportions. Suppose we have two schools. Sampling from the first, n = 20 and the Hispanic students are 8. Sampling from the second, n = 18 and the Hispanic students are 4. Can we state (at 95 per cent) that the frequency of Hispanics is the same in the two schools? N.B.: the test here is two-tailed. The test statistic here is:
z = (ρ̂1 − ρ̂2) / sd, where sd = sqrt( ρ(1 − ρ)[1/n1 + 1/n2] ) and ρ = (ρ1 · n1 + ρ2 · n2) / (n1 + n2)

Charts - 1
Figure: Representation of the critical point for the upper tail hypothesis test
Charts - 2
Figure: Representation of the critical point for the lower tail hypothesis test
Charts - 3
Figure: Representation of the critical point for the two-tailed hypothesis test
Charts - 4
Figure: Type I and Type II errors in hypothesis testing
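The following base-R sketch works through the two sample-size exercises and the two homework problems using only the formulas given above. The numbers are taken from the slides; plugging the sample proportion into the interval formula (homework 1) is one common choice rather than the only one.

## Sample size for a mean: sigma = 9.48, margin of error E = 1.2, 95% level
sigma <- 9.48
E <- 1.2
z <- qnorm(0.975)                       # two-tailed 95% critical value
ceiling(z^2 * sigma^2 / E^2)            # round up to the next integer

## Sample size for a proportion: planned p = 0.5, E = 0.05, 95% level
p <- 0.5
E <- 0.05
ceiling(z^2 * p * (1 - p) / E^2)

## Homework 1: 99% confidence interval for a proportion (15 females out of 25)
k <- 15; n <- 25
pbar <- k / n
se <- sqrt(pbar * (1 - pbar) / n)
z99 <- qnorm(0.995)                     # 1 - alpha/2 with alpha = 0.01
c(pbar - z99 * se, pbar + z99 * se)

## Homework 2: two-tailed test for equality of two proportions
k1 <- 8;  n1 <- 20
k2 <- 4;  n2 <- 18
p1 <- k1 / n1; p2 <- k2 / n2
p.pool <- (k1 + k2) / (n1 + n2)         # pooled proportion
z.stat <- (p1 - p2) / sqrt(p.pool * (1 - p.pool) * (1/n1 + 1/n2))
z.stat
c(-qnorm(0.975), qnorm(0.975))          # rejection region boundaries at 5%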
Statistics Lab
Rodolfo Metulini, IMT Institute for Advanced Studies, Lucca, Italy
Lesson 5 - Introduction to Bootstrap and Introduction to Markov Chains - 23.01.2014

Introduction
Let's assume, for a moment, the CLT: if random samples of n observations y1, y2, ..., yn are drawn from a population with mean µ and sd σ, then, for n sufficiently large, the sampling distribution of the sample mean can be approximated by a normal density with mean µ and standard deviation σ/√n.
◮ Averages taken from any distribution will have a normal distribution.
◮ The standard deviation decreases as the number of observations increases.
But... nobody tells us exactly how big the sample has to be.

Why Bootstrap?
Sometimes we cannot make use of the CLT, because:
1. Nobody tells us exactly how big the sample has to be.
2. The sample can be really small.
In those cases we are not encouraged to make any distributional assumption. We just have the data and let the raw data speak. The bootstrap method attempts to determine the probability distribution from the data itself, without recourse to the CLT. N.B. The bootstrap method is not a way of reducing the error! It only tries to estimate it.

Basic Idea of Bootstrap
Use the original sample as the population, and draw N samples from the original sample (the bootstrap samples). Then define the estimator using the bootstrap samples.
Figure: Real World versus Bootstrap World

Structure of Bootstrap
1. Originally, from a list of data (the sample), one computes a statistic (an estimate).
2. Create an artificial list of data (a new sample) by randomly drawing elements from the original list.
3. Compute a new statistic (estimate) from the new sample.
4. Repeat, let's say, 1000 times steps 2) and 3) and look at the distribution of these 1000 statistics.

Sample mean
Suppose we extracted a sample x = (x1, x2, ..., xn) from the population X. Let's say the sample size is small: n = 10. We can compute the sample mean x̂ using the values of the sample x. But, since n is small, the CLT does not hold, so we cannot say anything about the sample mean distribution.
APPROACH: we extract M samples (or sub-samples) of dimension n from the sample x (with replacement). We can define the bootstrap sample means xb_i, ∀ i = 1, ..., M. These become the new sample, with dimension M.
Bootstrap sample mean: Mb(X) = Σ_{i=1}^{M} xb_i / M
Bootstrap sample variance: Vb(X) = Σ_{i=1}^{M} (xb_i − Mb(X))² / (M − 1)

Bootstrap confidence interval with variance estimation
Let's take a random sample of size 25 from a normal distribution with mean 10 and standard deviation 3, and consider the sampling distribution of the sample mean. From that, we estimate the intervals. The bootstrap estimates the standard error by resampling the data in our original sample: instead of repeatedly drawing new samples from the population, we repeatedly draw new samples of the same size from our original sample, resampling with replacement. We can then estimate the standard error of the sample mean using the standard deviation of the bootstrapped sample means.
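Before moving to the quantile-based interval, here is a minimal base-R sketch of the resampling scheme just described. The original sample is assumed to be 25 draws from N(10, 3), as in the slide, with M = 1000 bootstrap resamples.

set.seed(1)                              # for reproducibility (not in the slides)
x <- rnorm(25, mean = 10, sd = 3)        # the original sample
M <- 1000                                # number of bootstrap resamples

xbar.boot <- replicate(M, mean(sample(x, size = length(x), replace = TRUE)))

mean(xbar.boot)                          # bootstrap estimate of the mean
sd(xbar.boot)                            # bootstrap estimate of the standard error

# Standard-error-based 95% interval
mean(x) + c(-1, 1) * qnorm(0.975) * sd(xbar.boot)

# Percentile (quantile) 95% interval
quantile(xbar.boot, c(0.025, 0.975))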
Confidence interval with quantiles
Suppose we have a sample of data from an exponential distribution with parameter λ: f(x|λ) = λ e^(−λx) (remember the estimate λ̂ = 1/x̂n). An alternative to using bootstrap-estimated standard errors (estimating the sd of an exponential is not straightforward) is the use of bootstrap quantiles. We can obtain M bootstrap estimates λ̂b and define q*(α) as the α quantile of the bootstrap distribution. The bootstrap confidence interval for λ will then be:
[ 2λ̂ − q*(1 − α/2) ; 2λ̂ − q*(α/2) ]

Regression model coefficient estimate with Bootstrap
Now we consider the situation where we have data on two variables. This is the type of data that arises in a linear regression situation. It doesn't make sense to bootstrap the two variables separately: the observations must remain linked when bootstrapped. For example, if our original data contain the observations (1,3), (2,6), (4,3) and (6,2), we re-sample this original sample in pairs. Recall that the linear regression model is y = β0 + β1 x. We are going to construct a bootstrap interval for the slope coefficient β1:
1. Draw M bootstrap samples.
2. Compute the OLS β1 coefficient for each bootstrap sample.
3. Compute the bootstrap quantiles, and use the 0.025 and 0.975 quantiles to define the confidence interval for β1.

Regression model coefficient estimate with Bootstrap: sampling the residuals
An alternative solution for the regression coefficient is a two-stage method in which:
1. You draw M samples; for each one you run a regression and obtain M sets of bootstrap regression residuals (dim = n).
2. You add those residuals to each of the M sampled dependent variables and re-estimate, obtaining M bootstrapped β1.
The method consists in using the 0.025 and 0.975 quantiles of the bootstrapped β1 to define the confidence interval.

References
Efron, B., Tibshirani, R. (1993). An Introduction to the Bootstrap (Vol. 57). CRC Press.
Figure: Efron and Tibshirani's foundational book

Routines in R
1. boot, by Brian Ripley. Functions and datasets for bootstrapping from the book Bootstrap Methods and Their Applications by A. C. Davison and D. V. Hinkley (1997, CUP).
2. bootstrap, by Rob Tibshirani. Software (bootstrap, cross-validation, jackknife) and data for the book An Introduction to the Bootstrap by B. Efron and R. Tibshirani (1993, Chapman and Hall).

Markov Chains
Markov chains are an important concept in probability and many other areas of research. They are used to model the probability of belonging to a certain state in a certain period, given the state in the past period. Example of weather: what is the Markov probability that tomorrow will be sunny, given that today is rainy? The main properties of Markov chain processes are:
◮ Memory of the process (usually the memory is fixed to 1)
◮ Stationarity of the distribution

Chart 1
A picture of an easy example of a Markov chain with 2 possible states and transition probabilities.
Figure: An example of a 2-state Markov chain

Notation
We define a stochastic process {Xn, n = 0, 1, 2, ...} that takes on a finite or countable number of possible values. Let the possible values be non-negative integers (i.e. Xn ∈ Z+). If Xn = i, the process is said to be in state i at time n. The Markov process (in discrete time) is defined as follows:
Pij = P[X_{n+1} = j | Xn = i_n, X_{n−1} = i_{n−1}, ..., X0 = i0] = P[X_{n+1} = j | Xn = i_n], ∀ i, j ∈ Z+
We call Pij a 1-step transition probability because we move from time n to time n + 1. It is a first-order Markov chain (memory = 1) because the probability of being in state j at time n + 1 only depends on the state at time n.

Notation - 2
The n-step transition probability is
P^n_{ij} = P[X_{n+k} = j | X_k = i], ∀ n ≥ 0, i, j ≥ 0
The Chapman-Kolmogorov equations allow us to compute these n-step transition probabilities. They state that:
P^{n+m}_{ij} = Σ_k P^n_{ik} P^m_{kj}, ∀ n, m ≥ 0, ∀ i, j ≥ 0
N.B. Basic probability properties:
1. Pij ≥ 0, ∀ i, j ≥ 0
2. Σ_{j≥0} Pij = 1, i = 0, 1, 2, ...
Example: conditional probability
Consider two states: 0 = rain and 1 = no rain. Define two probabilities:
α = P00 = P[X_{n+1} = 0 | Xn = 0], the probability it will rain tomorrow given it rained today;
β = P10 = P[X_{n+1} = 0 | Xn = 1], the probability it will rain tomorrow given it did not rain today.
What is the probability it will rain the day after tomorrow given it rained today? The transition probability matrix will be:
P = [ α, 1 − α ; β, 1 − β ]

Example: unconditional probability
What is the unconditional probability it will rain the day after tomorrow? We need the unconditional (or marginal) distribution of the state at time n:
P[Xn = j] = Σ_i P[Xn = j | X0 = i] · P[X0 = i] = Σ_i P^n_{ij} · α_i,
where α_i = P[X0 = i], ∀ i ≥ 0, and P[Xn = j | X0 = i] is the conditional probability just computed before.

Stationary distributions
A stationary distribution π is the probability distribution such that, once the Markov chain reaches it, the chain remains in that distribution forever. It means we are asking this question: what is the probability of being in a particular state in the long run? Let's define π_j as the limiting probability that the process will be in state j at time n, or
π_j = lim_{n→∞} P^n_{ij}
Using Fubini's theorem, we can define the stationary distribution as:
π_j = Σ_i π_i P_{ij}

Example: stationary distribution
Back to our example. We can compute the 2-step, 3-step, ..., n-step transition distributions and see WHEN they reach convergence. An alternative method to compute the stationary distribution consists in using the simple closed form for a two-state chain:
π0 = β / (1 − α + β), π1 = (1 − α) / (1 − α + β)

References
Ross, S. M. (2006). Introduction to Probability Models. Access online via Elsevier.
Figure: Cover of the 10th edition

Routines in R
◮ markovchain, by Giorgio Alfredo Spedicato. A package for easily handling discrete Markov chains.
◮ MCMCpack, by Andrew D. Martin, Kevin M. Quinn, and Jong Hee Park. Performs Monte Carlo simulations based on the Markov chain approach.
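A minimal base-R sketch of the two-state rain example. The numerical values of alpha and beta below are made up for illustration (they are not given in the slides); the markovchain package mentioned above could be used instead of the hand-rolled matrix powers.

alpha <- 0.7    # P(rain tomorrow | rain today), illustrative value
beta  <- 0.4    # P(rain tomorrow | no rain today), illustrative value

P <- matrix(c(alpha, 1 - alpha,
              beta,  1 - beta),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("rain", "no rain"), c("rain", "no rain")))

# 2-step transition probabilities (Chapman-Kolmogorov): P %*% P
P2 <- P %*% P
P2["rain", "rain"]    # P(rain the day after tomorrow | rain today)

# Approximate the stationary distribution by taking a high matrix power
Pn <- P
for (i in 1:50) Pn <- Pn %*% P
Pn[1, ]               # both rows converge to the stationary distribution

# Closed form for the two-state chain
c(rain = beta / (1 - alpha + beta), no.rain = (1 - alpha) / (1 - alpha + beta))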
Statistics Lab
Rodolfo Metulini, IMT Institute for Advanced Studies, Lucca, Italy
Lesson 2 - Application to the Central Limit Theorem - 14.01.2014

Introduction
Modern statistics was built and developed around the normal distribution. The academic world likes to say that, if the empirical distribution is normal (or approximately normal), everything works well. This depends mainly on the sample dimension. That said, it is important to understand in which circumstances we can state that the distribution is normal. Two founding statistical theorems are helpful: the Central Limit Theorem and the Law of Large Numbers.

The Law of Large Numbers (LLN)
Suppose we have a random variable X with expected value E(X) = µ. We extract n observations from X (say x = {x1, x2, ..., xn}). If we define
X̂n = Σ_i xi / n = (x1 + x2 + ... + xn) / n,
the LLN states that, for n −→ ∞, X̂n −→ µ.

The Central Limit Theorem (CLT)
Suppose we have a random variable X with expected value E(X) = µ and V(X) = σ². We extract n observations from X (say x = {x1, x2, ..., xn}) and define X̂n = Σ_i xi / n = (x1 + x2 + ... + xn) / n. X̂n is distributed with expected value µ and variance σ²/n. In case n −→ ∞ (in practice n > 30), X̂n ∼ N(µ, σ²/n), whatever the distribution of X. N.B. If X is normally distributed, X̂n ∼ N(µ, σ²/n) even if n < 30.

CLT: Empiricals
To better understand the CLT, it is recommended to examine the theorem empirically and step by step, introducing new commands of the R programming language along the way. In the first part, we will show how to draw and visualize a sample of random numbers from a distribution. Then we will examine the mean and standard deviation of the sample, and then the distribution of the sample means.

Drawing random numbers - 1
We already introduced the use of the letters d, p and q in relation to the various distributions (e.g. normal, uniform, exponential). A reminder of their use follows:
◮ d is for density: it is used to find values of the probability density function.
◮ p is for probability: it is used to find the probability that the random variable lies to the left of a given number.
◮ q is for quantile: it is used to find the quantiles of a given distribution.
There is a fourth letter, namely r, used to draw random numbers from a distribution. For example, runif and rexp would be used to draw random numbers from the uniform and exponential distributions, respectively.

Drawing random numbers - 2
Let's use the rnorm command to draw 500 numbers at random from a normal distribution having mean 100 and standard deviation (sd) 10.
> x = rnorm(500, mean = 100, sd = 10)
The result, typing x in the R console, is a list of 500 numbers extracted at random from a normal distribution with mean 100 and sd 10. When you examine the numbers stored in the variable x, there is a sense that you are pulling random numbers clumped about a mean of 100. However, a histogram of this selection provides a different picture of the data stored.
> hist(x, prob = TRUE)

Drawing random numbers - Comments
Several comments are in order regarding the histogram in the figure.
1. The histogram is approximately normal in shape.
2. The balance point of the histogram appears to be located near 100, suggesting that the random numbers were drawn from a distribution having mean 100.
3. Almost all of the values are within 3 increments of 10 from the mean, suggesting that the random numbers were drawn from a normal distribution having standard deviation 10.
Drawing random numbers - a new drawing
Let's try the experiment again, drawing a new set of 500 random numbers from the normal distribution having mean 100 and standard deviation 10:
> x = rnorm(500, mean = 100, sd = 10)
> hist(x, prob = TRUE, ylim = c(0, 0.04))
Give a look at the histogram... It is different from the first one; however, it shares some common traits: (1) it appears normal in shape; (2) it appears to be balanced around 100; (3) almost all values appear to occur within 3 increments of 10 of the mean. This is strong evidence that the random numbers have been drawn from a normal distribution having mean 100 and sd 10. We can provide evidence of this claim by imposing a normal density curve:
> curve(dnorm(x, mean = 100, sd = 10), 70, 130, add = TRUE, lwd = 2, col = "red")

The curve command
The curve command is new. Some comments on its use follow:
1. In its simplest form, the syntax curve(f(x), from =, to =) draws the function defined by f(x) on the interval (from, to). Our function is dnorm(x, mean = 100, sd = 10). The curve command sketches this function of x on the interval (from, to).
2. The notation from = and to = may be omitted if the arguments are in the proper order for the curve command: function first, value of from second, value of to third. That is what we have done.
3. If the argument add is set to TRUE, then the curve is added to the existing figure. If the argument is omitted (or FALSE), then a new plot is drawn, erasing the previous graph.

The distribution of X̂n (sample mean)
In our previous example we drew 500 random numbers from a normal distribution with mean 100 and standard deviation 10. This leads to ONE sample of n = 500. Now the question is: what is the mean of our sample?
> mean(x)
[1] 100.14132
If we take another sample of 500 random numbers from the SAME distribution, we get a new sample with a different mean.
> x = rnorm(500, mean = 100, sd = 10)
> mean(x)
[1] 100.07884
What happens if we draw a sample several times?

Producing a vector of sample means
We will repeatedly sample from the normal distribution. Each of the 500 samples will select 5 random numbers (instead of 500) from the normal distribution having mean 100 and sd 10. We will then find the mean of each of those samples. We begin by declaring the mean and the standard deviation. Then, we declare the sample size.
> mu = 100; sigma = 10
> n = 5
We need some place to store the means of the samples. We initialize a vector xbar to initially contain 500 zeros.
> xbar = rep(0, 500)

Producing a vector of sample means - the for cycle
It is easy to draw a sample of size n = 5 from the normal distribution having mean µ = 100 and standard deviation σ = 10: we simply issue the command rnorm(n, mean = mu, sd = sigma). To find the mean of this result, we simply wrap it as mean(rnorm(n, mean = mu, sd = sigma)). The final step is to store this result in the vector xbar. Then we must repeat this same process an additional 499 times. This requires the use of a for loop.
> for (i in 1:500) { xbar[i] = mean(rnorm(n, mean = mu, sd = sigma)) }

The for cycle
◮ The i in for (i in 1:500) is called the index of the for loop.
◮ The index i is first set equal to 1, then the body of the for loop is executed. On the next iteration, i is set equal to 2 and the body of the loop is executed again. The loop continues in this manner, incrementing by 1, finally setting the index i to 500. After executing the last iteration, the for cycle is terminated.
◮ In the body of the for loop, we have xbar[i] = mean(rnorm(n, mean = mu, sd = sigma)).
This draws a sample of size 5 from the normal distribution, calculates the mean of the sample, and stores the result in xbar[i].
◮ When the for loop completes its 500 iterations, the vector xbar contains the means of 500 samples of size 5 drawn from the normal distribution having µ = 100 and σ = 10.
> hist(xbar, prob = TRUE, breaks = 12, xlim = c(70, 130), ylim = c(0, 0.1))

Distribution of X̂n - observations
1. The previous histograms described the shape of 500 randomly drawn numbers; here, the histogram describes the distribution of 500 different sample means, each of which is found by selecting n = 5 random numbers from the normal distribution.
2. The distribution of xbar appears normal in shape. This is so even though the sample size is relatively small (n = 5).
3. It appears that the balance point occurs near 100. This can be checked with the following command:
> mean(xbar)
That is the mean of the sample means, and it is almost equal to the mean of the draw of random numbers.
4. The distribution of the sample means appears to be narrower than the random number distributions.

Increasing the sample size
Let's repeat the last experiment, but this time let's draw samples of size n = 10 from the same distribution (µ = 100, σ = 10):
> mu = 100; sigma = 10
> n = 10
> xbar = rep(0, 500)
> for (i in 1:500) { xbar[i] = mean(rnorm(n, mean = mu, sd = sigma)) }
> hist(xbar, prob = TRUE, breaks = 12, xlim = c(70, 130), ylim = c(0, 0.1))
The histogram produced is even narrower than the one using n = 5.

Key Ideas
1. When we select samples from a normal distribution, the distribution of the sample means is also normal in shape.
2. The mean of the distribution of the sample means appears to be the same as the mean of the random numbers (the parent population) (compare the balance points).
3. By increasing the size of our samples, the histogram becomes narrower. In fact, we would expect a more accurate estimate of the mean of the parent population if we take the mean from a larger sample size.
4. Imagine drawing sample means from samples of n = ∞: the histogram would be exactly concentrated (P = 1) at xbar = µ.

Summary
We finish by restating the CLT:
1. If you draw samples from a normal distribution, then the distribution of the sample means is also normal.
2. The mean of the distribution of the sample means is identical to the mean of the parent population.
3. The higher the sample size that is drawn, the narrower will be the spread of the distribution of the sample means.

Homeworks
Experiment 1: draw the xbar histogram for n = 1000. What is the histogram shape?
Experiment 2: repeat the full experiment drawing random numbers and sample means from (1) a uniform and (2) a Poisson distribution. Is the histogram of xbar normal in shape for n = 5 and for n = 30?
Experiment 3: repeat the full experiment using real data instead of random numbers. (HINT: select samples of dimension n = 5 from the real data, not using rnorm.)
Recommended: try to evaluate the agreement of the sample mean histogram with the normal distribution by means of the qq-plot and the Shapiro-Wilk test.

Application to the Law of Large Numbers
Experiment: toss a coin 100 times. This experiment is like repeating 100 times a random draw from a Bernoulli distribution with parameter ρ = 0.5. We expect to get heads (value = 1) 50 times and tails (value = 0) 50 times, if the coin is not distorted. But, in practice, this does not happen: repeating the experiment we are going to get a distribution centered on 50, but spread out.
Let's define X̂n as the mean of the number of heads across n repetitions of the experiment. For n −→ ∞, X̂n −→ 50. (An R sketch of this experiment follows.)
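A minimal sketch of the coin-toss experiment in R: each experiment is 100 Bernoulli(0.5) draws, and the running mean of the head counts approaches 50 as the number of experiments grows.

set.seed(1)
heads.per.exp <- function() sum(rbinom(100, size = 1, prob = 0.5))

n.exp <- 10000
heads <- replicate(n.exp, heads.per.exp())

hist(heads, prob = TRUE)                   # centred on 50, but spread out
cumsum(heads)[c(10, 100, 1000, n.exp)] /
  c(10, 100, 1000, n.exp)                  # running mean approaches 50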
Statistics Lab
Rodolfo Metulini, IMT Institute for Advanced Studies, Lucca, Italy
Introduction to R - 09.01.2014

Getting help with functions
To get more information on any specific named function, for example solve, the command is
> help(solve)
An alternative is:
> ?solve
Running:
> help.start()
launches a Web browser that gives access to the help home page. The ?? command allows searching for help in a different way. For example, it is useful to get help on non-installed packages.

Objects and saving data
The entities that R creates and manipulates are known as objects. During an R session, objects are created and stored by name.
> objects()
can be used to display the names of the objects which are currently stored within R.
> rm()
can be used to remove objects. At the end of each R session, you are given the opportunity to save all the currently available objects. You can save the objects (the workspace) in .RData format in the current directory. You can also save the command lines in .Rhistory format.

Scalars and vectors: manipulation
To set up a vector named x containing 1, 2, 3, 4 and 5, we use the R command:
> x = c(1,2,3,4,5)
or, identically, the assign function could be used:
> assign("x", c(1,2,3,4,5))
x is a vector of length 5. To check it we can use the following function:
> length(x)
> 1/x
gives the reciprocal of each element of x.
> y = c(x, 0, x)
would create a vector with 11 entries consisting of two copies of x with a 0 in the middle.

Scalars and vectors: manipulation
Vectors can be used in arithmetic expressions. Vectors in the same expression need not all be of the same length. If they are not, the output has the length of the longest vector in the expression. For example:
> v = 2*x + y + 1
generates a new vector of length 11 constructed by adding together, element by element, 2*x repeated 2.2 times, y repeated just once, and 1 repeated 11 times. So, WARNING: R computes that kind of expression even if it is wrongly defined.

Scalars and vectors: manipulation - 2
In addition, log, exp, sin, cos, tan, sqrt and, of course, the classical arithmetic operators are also available. min(x) and max(x) select the smallest and the largest element of the vector. sum(x) and prod(x) give the sum and the product, respectively, of the numbers within the vector. mean(x) calculates the sample (arithmetic) mean, which is the same as sum(x)/length(x); and var(x) gives the sample variance: sum((x − mean(x))^2)/(length(x) − 1). sort(x) returns a vector of the same size as x, with the elements in increasing order.

seq and rep
There are facilities to generate commonly used sequences of numbers.
> 1:30
is the vector c(1, 2, ..., 29, 30).
> 2*1:15
is the vector c(2, 4, ..., 28, 30) of length 15. In addition, seq() can be used: seq(2, 10) is the same as the vector 2:10. from=, to= and by= are useful arguments:
> seq(from = 30, to = 1)
> seq(-10, 10, by = 0.5)
rep() can be used for replicating an object:
> rep(x, times = 5)
> rep(x, each = 5)

Logical vectors
As well as numerical vectors, R allows manipulation of logical quantities. The elements of a logical vector can have the values TRUE, FALSE and NA ("not available"). Logical vectors are generated by conditions. Example:
> temp = x > 3
The logical operators are: <, <=, >, >=, == for equality and != for inequality.
In addition, if c1 and c2 are logical expressions, then c1 & c2 is the intersection ("and"), c1 | c2 is the union ("or"), and !c1 is the negation of c1.

Missing values
In some cases the components of a vector may not be completely known: in this case we assign the value NA. The function is.na(x) gives a logical vector of the same size as x with value TRUE if the corresponding element in x is NA.
> z = c(1:3, NA); ind = is.na(z)
There is a second kind of "missing" value, produced by numerical computation: the so-called Not a Number, NaN, values. Examples:
> 0/0
> Inf/Inf

Index vectors: subsets of a vector
Subsets of the elements of a vector may be selected by appending to the name of the vector an index vector in square brackets.
1. A logical vector: values corresponding to TRUE in the index vector are selected:
> y = x[!is.na(x)]
2. A vector of positive (or negative) integer quantities: in this case the values in the index vector must lie in the set {1, 2, ..., length(x)}. In the negative case the selected values will be excluded.
> x[2:3]; x[-(2:3)]
3. A vector of character strings: this is possible only after applying names to the object.
> cars = c(1,2,3)
> names(cars) = c("ferrari", "lamborghini", "bugatti")
> pref = cars[c("ferrari", "bugatti")]

Objects and attributes
To each object is associated one (and only one) attribute, its mode (that is the reason why we call them "atomic"). The objects can be: numeric, logical, complex, character and raw. Useful commands: mode(), as.numeric(), is.numeric(). For example, create a numeric vector:
> z = 0:9
change it into character:
> digits = as.character(z)
and coerce it back to numeric:
> d = as.integer(digits)
d and z are the same!

Arrays, matrices and data.frame
Vectors are the most important type of object in R, but there are several others. Among the others:
◮ matrix: a multidimensional generalization of the vector.
◮ data.frame: a matrix-like structure, in which the columns can be of different types. It is used when we deal with both numerical and categorical data.
How do we transform a vector into a matrix?
> v = 1:50
> dim(v) = c(10,5)

Arrays, matrices and data.frame (2)
How do we create a matrix from scratch?
> m = array(1:20, dim = c(4,5))
Subsetting a matrix, or replacing a subset of a matrix with zeros? Let's give a look at the examples in the codes.

Matrix manipulation
The operator %*% is used for matrix multiplication. An n x 1 or 1 x n matrix is also a valid matrix. If, for example, A and B are square matrices of the same size, then:
> A * B
is the matrix of element-by-element products (it does not work for matrices with different dimensions), and
> A %*% t(B)
is the matrix product. diag(A) returns the elements on the main diagonal of A. ginv(A) and t(A) return the inverse and the transposed matrix, respectively. ginv() requires the MASS package.

Lists and data frames
An R list is an object consisting of an ordered collection of objects known as its components. Here is a simple example of how to make a list:
> Lst = list(name = "Rodolfo", surname = "Metulini", age = "30")
It is possible to concatenate two or more lists:
> list.ABC = c(list.A, list.B, list.C)
A data.frame is a list with the specific class "data.frame". We can convert a matrix object into a data.frame object with the command as.data.frame(matrix). The easiest way to create a data.frame object is by means of the read.table() function.

Reading data
Large data objects will usually be read as values from external files rather than entered during an R session at the keyboard. There are basically two similar commands to upload data.
1. read.table(): typically used for .csv files.
2. read.delim(): typically used for .txt files.
Useful arguments: sep, to specify whether the values in the dataset are separated by ";", ",", "." or are tab delimited; header = TRUE, to specify that the first row of the dataset contains the variable names. Moreover, read.dta() is used to upload data from STATA :)

Distributions and co.
One convenient use of R is to provide a comprehensive set of statistical tables. Functions are provided to evaluate the cumulative distribution function P(X < x), the probability density function and the quantile function (given q, the smallest x such that P(X < x) > q), and to simulate from the distribution. Here, "d" is for the density, "p" (pnorm, punif, pexp, etc.) for the CDF, "q" for the quantile function, and "r" for simulation. Let's empirically examine the distribution of a variable (see the codes).

Covariance, correlation and concentration indices
The covariance and the correlation measure the degree to which two variables change together. The correlation is an index in [−1, 1]; the covariance is an unbounded number whose magnitude depends on the values assumed by the variables.
> Cov = cov(A, B)
> Cor = cor(A, B)
We can also calculate the correlation between A and B as follows:
> CorAB = Cov / sqrt(var(A) * var(B))
Gini index: it is the most popular concentration index; we need to install the ineq package. Mode: the most frequent value within the distribution; we need to install the modeest package and use the mfv command. (A short example is given below, after the homework.)

Homeworks
◮ For those of us who are familiar with STATA, let's try to upload a .dta file with the read.dta() function.
◮ Study the agreement of the eruption data with other distributions (exponential? uniform? it is up to you).
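A short illustration of cov(), cor() and the manual identity above, on two made-up vectors; the Gini index and the mode would use the ineq and modeest packages named in the slide.

A <- c(2, 4, 6, 8, 10)
B <- c(1, 3, 2, 5, 4)

Cov <- cov(A, B)
Cor <- cor(A, B)
Cov; Cor

# Correlation recomputed from the covariance and the variances
Cov / sqrt(var(A) * var(B))               # identical to cor(A, B)

# If the add-on packages are installed:
# library(ineq);    Gini(A)               # Gini concentration index
# library(modeest); mfv(B)                # most frequent value (mode)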
Statistics Lab
Rodolfo Metulini, IMT Institute for Advanced Studies, Lucca, Italy
Lesson 4 - The Linear Regression Model: Theory and Application - 21.01.2014

Introduction
In the past practicals we analyzed one variable. For certain purposes, it is also useful to analyze two or more variables together. The question we want to answer concerns the relations and the causal effects determining changes in a variable, i.e. whether a certain phenomenon is endogenous or exogenous. In symbols, the idea can be represented as follows:
y = f(x1, x2, ...)
Y is the response, which is a function of (it depends on) one or more variables.

Objectives
All in all, the regression model is the instrument used to:
◮ measure the strength of the relation between two or more variables, △Y / △X, and the causal direction (△X −→ △Y or vice versa?);
◮ forecast the value of the variable Y in response to some change in the others X1, X2, ... (called explanatories), or for some cases that are not in the sample.

Simple linear regression model
The regression model is stochastic, not deterministic. Given two sets of values (two variables) from a random sample of length n, x = {x1, x2, ..., xi, ..., xn} and y = {y1, y2, ..., yi, ..., yn}:
Deterministic formula: yi = β0 + β1 xi, ∀ i = 1, ..., n
Stochastic formula: yi = β0 + β1 xi + εi, ∀ i = 1, ..., n
where εi is the stochastic component. β1 defines the slope of the relation between X and Y. (See graph in chart 1.)

Simple linear regression model - 2
We need to find β̂ = {β̂0, β̂1} as estimators of β0 and β1. After β is estimated, we can draw the estimated regression line, which corresponds to the estimated regression model, as follows:
ŷi = β̂0 + β̂1 xi
Here, ε̂i = yi − ŷi, where ŷi is the i-th element of the estimated Y vector and yi is the i-th element of the observed Y vector. (See graph in chart 2.)

Steps in the analysis
1. Study the relations (scatterplot, correlations) between two or more variables.
2. Estimate the parameters of the model, β̂ = {β̂0, β̂1}.
3. Run hypothesis tests on the estimated β̂1 to verify the causal effect between X and Y.
4. Check the robustness of the model.
5. Use the model to analyze the causal effect and/or to forecast.

Why linear?
◮ It is simple to estimate, to analyze and to interpret.
◮ It fits most empirical cases, in which the relation between two phenomena is linear.
◮ There are many implemented methods to transform variables in order to obtain a linear relationship (log transformation, normalization, etc.).

Model hypotheses
In order for the estimation and the use of the model to be correct, certain hypotheses must hold:
◮ E(εi) = 0, ∀ i −→ E(yi) = β0 + β1 xi
◮ Homoscedasticity: V(εi) = σi² = σ², ∀ i
◮ Null covariance: Cov(εi, εj) = 0, ∀ i ≠ j
◮ Null covariance between residuals and explanatories: Cov(xi, εi) = 0, ∀ i, since X is deterministic (known)
◮ Normality assumption: εi ∼ N(0, σ²)

Model hypotheses - 2
From the hypotheses above, it follows that:
◮ V(yi) = σ², ∀ i: Y is stochastic only through the ε component.
◮ Cov(yi, yj) = 0, ∀ i ≠ j, since the residuals are uncorrelated.
◮ yi ∼ N[(β0 + β1 xi), σ²], since the residuals are normal in shape.

Ordinary Least Squares (OLS) estimation
OLS is the estimation method used to estimate the vector β. The idea is to minimize the value of the residuals. Since ei = yi − ŷi, we are interested in minimizing the component yi − β̂0 − β̂1 xi. N.B.:
εi = yi − β0 − β1 xi, while ei = yi − β̂0 − β̂1 xi.
The method consists in minimizing the sum of the squared differences:
Σ_{i=1}^{n} (yi − ŷi)² = Σ_{i=1}^{n} ei² = min,
which is equivalent to solving the system of two equations obtained by taking derivatives.

Ordinary Least Squares (OLS) estimation - 2
δ/δβ0 Σ_i ei² = 0
δ/δβ1 Σ_i ei² = 0
After some algebra, we end up with these estimators for the vector β:
β̂0 = ȳ − β̂1 x̄
β̂1 = Σ_{i=1}^{n} (yi − ȳ)(xi − x̄) / Σ_{i=1}^{n} (xi − x̄)²

OLS estimators
◮ The OLS β̂0 and β̂1 are stochastic estimators (they have a distribution over the sample space of all the possible estimates obtained from different samples).
◮ β̂1 measures the estimated variation in Y determined by a unitary variation in X (δY/δX).
◮ The OLS estimators are correct (E(β̂1) = β1), and they are BLUE (correct and with the lowest variance).

Linear dependency index (R²)
The R² index is the most used index to measure the linear fit of the model. R² lies in the interval [0, 1], where values near 1 mean that the explanatories are useful in describing the changes in Y. Let us define SQT = SQR + SQE, or
Σ_{i=1}^{n} (yi − ȳ)² = Σ_{i=1}^{n} (ŷi − ȳ)² + Σ_{i=1}^{n} (yi − ŷi)²
The R² is defined as R² = SQR/SQT = Σ_i (ŷi − ȳ)² / Σ_i (yi − ȳ)², or, equivalently, 1 − SQE/SQT.

Hypothesis testing on β1
The estimated slope parameter β̂1 is stochastic. It is distributed as a Gaussian: β̂1 ∼ N[β1, σ²/SSx]. We can make use of the hypothesis test approach to investigate the causal relation between Y and X:
H0: β1 = 0
H1: β1 ≠ 0,
where the alternative hypothesis means there is a causal relation. The test is:
z = (β̂1 − β1) / sqrt(σ²/SSx) ∼ N(0, 1),
where SSx = Σ_{i=1}^{n} (xi − x̄)². When σ² is unknown, we estimate it from the residuals and we use a t-test with n − 2 degrees of freedom (two parameters are estimated).

Forecasting within the regression model
The question we want to answer is the following: what is the expected value of Y (say y_{n+1}) for an observation that is not in the sample? Suppose we have, for that observation, the value of the variable X (say x_{n+1}). We make use of the estimated β to determine:
ŷ_{n+1} = β̂0 + β̂1 x_{n+1}

Model checking
Several methods are used to test the robustness of the model, most of them based on the stochastic part of the model: the estimated residuals.
◮ Graphical checks: plot of residuals versus fitted values
◮ qq-plot for normality
◮ Shapiro-Wilk test for normality
◮ Durbin-Watson test for serial correlation
◮ Breusch-Pagan test for heteroscedasticity
Moreover, the leverage is used to evaluate the importance of each observation in determining the estimated coefficients β. The stepwise procedure is used to choose between different model specifications.

Model checking using estimated residuals - Linearity
An example of departure from the linearity assumption: we can draw a curve (not a horizontal line) to interpolate the points.
Figure: residuals (Y) versus estimated (X) values

Model checking using estimated residuals - Homoscedasticity
An example of departure from the homoscedasticity assumption: the estimated residuals increase as the predicted values increase.

Model checking using estimated residuals - Normality
An example of departure from the normality assumption: the qq-points do not follow the qq-line.
Figure: residuals (Y) versus estimated (X) values

Model checking using estimated residuals - Serial correlation
An example of departure from the no-serial-correlation assumption: the residual at i depends on the value at i − 1.

Homeworks
1. Using the cement data (n = 13), determine the β0 and β1 coefficients manually, using the OLS formulas given above, for the model y = β0 + β1 x1.
2. Using the cement data, estimate the R² index of the model y = β0 + β1 x1, using the formula given above.
(A sketch in R is given below, after the charts.)

Charts - 1
Figure: Slope coefficient in the linear model
Charts - 2
Figure: Fitted (line) versus real (points) values
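A sketch of the two homework exercises in R. The cement data set is not reproduced here, so the vectors x and y below are made-up placeholders; with the real data they would be replaced by the corresponding columns.

x <- c(1.1, 2.3, 3.0, 4.8, 5.2, 6.9, 7.4, 8.1, 9.5, 10.2, 11.7, 12.3, 13.8)
y <- c(2.0, 3.1, 3.9, 5.5, 6.1, 7.8, 8.0, 9.2, 10.1, 11.0, 12.4, 12.9, 14.6)

# OLS estimates "by hand", using the formulas from the slides
b1 <- sum((y - mean(y)) * (x - mean(x))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0 = b0, b1 = b1)

# R^2 from the decomposition SQT = SQR + SQE
yhat <- b0 + b1 * x
sum((yhat - mean(y))^2) / sum((y - mean(y))^2)

# Cross-check against R's built-in estimator
fit <- lm(y ~ x)
coef(fit)
summary(fit)$r.squared

# A few of the residual checks listed above
plot(fitted(fit), resid(fit))             # residuals versus fitted values
qqnorm(resid(fit)); qqline(resid(fit))    # qq-plot for normality
shapiro.test(resid(fit))                  # Shapiro-Wilk normality test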
#Point Estimate of Population Mean #For any particular random sample, we can always compute its sample mean. #Although most often it is not the actual population mean, it does serve as a good point estimate. #For example, in the data set survey, the survey is performed on a sample of the student population. #We can compute the sample mean and use it as an estimate of the corresponding population parameter. ##### CHUNK 1 #Problem #Find a point estimate of mean university student height with the sample data from survey. #For convenience, we begin with saving the survey data of student heights in a variableheight.survey. library(MASS) # load the MASS package height.survey = survey$Height height.survey summary(height.survey) str(height.survey) #It turns out not all students have answered the question, and we must filter out the missing values. #Hence we apply the mean function with the "na.rm" argument as TRUE. mean(height.survey, na.rm=TRUE) # skip missing values #A point estimate of the mean student height is 172.38 centimeters. ##Alternative (consistent) estimator of the mu parameter? ##geometric mean? height.survey = na.omit(height.survey) exp(mean(log(height.survey))) ##median? median(height.survey) ##mode require(modeest) mfv(height.survey) ##What is a unconsistent estimator? ##selecting a a subset truncating the sample height.survey.bias = height.survey[height.survey > 170] mean(height.survey.bias) ##CHUNK 2 #Point Estimate of Population Proportion #Multiple choice questionnaires in a survey are often used to determine the the proportion of a population with certain characteristic. #For example, we can estimate the proportion of female students in the university based on the result in the sample data set survey. #Problem #Find a point estimate of the female student proportion from survey. #We first filter out missing values in survey$Sex with the na.omit function, #and save it ingender.response. library(MASS) # load the MASS package gender.response = na.omit(survey$Sex) gender.response summary(gender.response) str(gender.response) n = length(gender.response) # valid responses count n #To find out the number of female students, we compare gender.response with the factor’Female’, and compute the sum. Dividing it by n gives the female student proportion in the sample survey. k = sum(gender.response == "Female") pbar = k/n pbar #The point estimate of the female student proportion in survey is 50%. ##CHUNK 3 #Problem #Find a point estimate of the number of civil wars within countries (poisson count distribution) civilwars = read.delim("civilwars.txt") summary(civilwars) #the poisson parameter is lambda, that is both the mean and the variance estimate #it represent the mean number of occurrence of civil wars around the world mean(civilwars$cases) ##mean estimator mean(civilwars$cases) ##standard error estimator ##CHUNK 4 #Problem #find a point estimation for the parameter of the exponential distribution ## it is 1/mean ##the file contains the duration (in years) of a sample of washing machines (until it brokes for the first time) wasmach = read.delim("wasmach.txt") summary(wasmach) mean(wasmach$dur) freq = 1/mean(wasmach$dur) freq ##we can state the sample comes from an exponential distribution with expected parameter 0.246 ##CHUNK 5 #Calculating Confidence Intervals #Here we look at some examples of calculating confidence intervals. #The examples are for both normal and t distributions. 
#Note that an easier way to calculate confidence intervals using thet.test command is discussed in section The Easy Way. #Calculating a Confidence Interval From a Normal Distribution #We will make some assumptions for what we might find in an experiment and find the resulting confidence interval using a normal distribution. #Here we assume that the sample mean is 5, the standard deviation is 2, and the sample size is 20. #In the example below we will use a 95% confidence level and wish to find the confidence interval. #The commands to find the confidence interval in R are the following: a <- 5 #mean s <- 2 #sd n <- 20 #sample error <- qnorm(0.975)*s/sqrt(n) ##quantile corresponding to CDF=0.975 of the normal left <- a-error right <- a+error left right #The true mean has a probability of 95% of being in the interval between 4.12 and 5.88 assuming that the original random variable is normally distributed, and the samples are independent. ##CHUNK 6 #Calculating a Confidence Interval From a t Distribution #Calculating the confidence interval when using a t-test is similar to using a normal distribution. #The only difference is that we use the command associated with the t-distribution rather than the normal distribution. #Here we repeat the procedures above, but we will assume that we are working with a sample standard deviation rather than an exact standard deviation. #Again we assume that the sample mean is 5, the ###sample#### standard deviation is 2, and the sample size is 20. We use a 95% confidence level and wish to find the confidence interval. The commands to find the confidence interval in R are the following: ##we use t-students when n is small (CLT) and sd is estimated. a <- 5 s <- 2 n <- 20 error <- qt(0.975,df=n-1)*s/sqrt(n) ##quantile corresponding to CDF=0.975 of the t student left <- a-error right <- a+error left right #The true mean has a probability of 95% of being in the interval between 4.06 and 5.94 assuming that the original random variable is normally distributed, and the samples are independent. ##the interval is bigger, because t student is less narrow. #We have less detailed information beacause the sample estimation of sd. So, we use the caution-method. ##CHUNK 7 #Calculating Many Confidence Intervals From a t Distribution. ##imagine have the population splitted in 2, and an extraction of samples from them. #Suppose that you want to find the confidence intervals for many tests. #This is a common task and most software packages will allow you to do this. We have three different sets of results: #Comparison 1 # Mean Std. Dev. Number (pop.) #Group I 10 3 300 #Group II 10.5 2.5 230 #Comparison 2 # Mean Std. Dev. Number (pop.) #Group I 12 4 210 #Group II 13 5.3 340 #Comparison 3 # Mean Std. Dev. Number (pop.) #Group I 30 4.5 420 #Group II 28.5 3 400 #For each of these comparisons we want to calculate the associated confidence interval for the difference of the means. #For each comparison there are two groups. We will refer to group one as the group whose results are in the first row of each comparison above. #We will refer to group two as the group whose results are in the second row of each comparison above. #Before we can do that we must first compute a standard error and a t-score. #We will find general formulae which is necessary in order to do all three calculations at once. #We assume that the means for the first group are defined in a variable called m1. #The means for the second group are defined in a variable called m2. 
#The standard deviations for the first group are in a variable called sd1. #The standard deviations for the second group are in a variable called sd2. #The number of samples for the first group are in a variable called num1. #Finally, the number of samples for the second group are in a variable called num2. #With these definitions the (global) standard error is the square root of (sd1^2)/num1+(sd2^2)/num2. #The R commands to do this can be found below: m1 <- c(10,12,30) m2 <- c(10.5,13,28.5) sd1 <- c(3,4,4.5) sd2 <- c(2.5,5.3,3) num1 <- c(300,210,420) num2 <- c(230,340,400) se <- sqrt(sd1*sd1/num1+sd2*sd2/num2) error <- qt(0.975,df=pmin(num1,num2)-1)*se ###dregrees, spiegare #Now we need to define the confidence interval around the assumed differences. #Just as in the case of finding the p values in previous chapter we have to use the pmin command to get the number of degrees of freedom. #In this case the null hypotheses are for a difference of zero, and we use a 95% confidence interval: left <- (m1-m2)-error right <- (m1-m2)+error left right #This gives the confidence intervals for each of the three tests ###CONFIDENCE INTERVAL FOR PROPORTION?? #Hipotesis testing #Researchers retain or reject hypothesis based on measurements of observed samples. #The decision is often based on a statistical mechanism called hypothesis testing. #A type I error is the mishap of falsely rejecting a null hypothesis when the null hypothesis is true. #The probability of committing a type I error is called the significance level of the hypothesis testing, and is denoted by the Greek letter alpha. #We demonstrate the procedure of hypothesis testing in R first with the intuitive critical value approach ... #Then we discuss the popular p-value approach as alternative. ## CHUNK 8 ############Lower Tail Test of Population Mean with Known Variance ############# xbar = 9900 # sample mean mu0 = 10000 # hypothesized value sigma = 120 # population standard deviation n = 30 # sample size z = (xbar-mu0)/(sigma/sqrt(n)) z # test statistic # We then compute the critical value at .05 significance level. alpha = .05 z.alpha = qnorm(1-alpha) -z.alpha # critical value #Answer #The test statistic -4.5644 is less than the critical value of -1.6449. Hence, at .05 significance level, we reject the claim that mean lifetime of a light bulb is above 10,000 hours. #Alternative Solution #Instead of using the critical value, we apply the pnorm function to compute the lower tail p-value of the test statistic. As it turns out to be less than the .05 significance level, we reject the null hypothesis that µ = 10000. pval = pnorm(z) pval # lower tail p-value ##CHUNK 9 # Upper Tail Test of Population Mean with Known Variance #Problem #Suppose the food label on a cookie bag states that there is at most 2 grams of saturated fat in a single cookie. In a sample of 35 cookies, it is found that the mean amount of saturated fat per cookie is 2.1 grams. Assume that the population standard deviation is 0.25 grams. At .05 significance level, can we reject the claim on food label? #The null hypothesis is that µ = 2. We begin with computing the test statistic. xbar = 2.1 # sample mean mu0 = 2 # hypothesized value sigma = 0.25 # population standard deviation n = 35 # sample size z = (xbar-mu0)/(sigma/sqrt(n)) z # test statistic #We then compute the critical value at .05 significance level. alpha = .05 z.alpha = qnorm(1-alpha) z.alpha # critical value #Answer #The test statistic 2.3664 is greater than the critical value of 1.6449. 
#Hence, at the .05 significance level, we reject the claim that there is at most 2 grams of saturated fat in a cookie.
#Alternative Solution
#Instead of using the critical value, we apply the pnorm function to compute the upper tail p-value of the test statistic.
#As it turns out to be less than the .05 significance level, we reject the null hypothesis that µ = 2.
pval = pnorm(z, lower.tail=FALSE)
pval           # upper tail p-value
##typical values at which the alpha level is set: 0.01, 0.05, 0.1.

##CHUNK 10
#Two-Tailed Test of Population Mean with Known Variance
#The null hypothesis of the two-tailed test is rejected if z <= -z(a/2) or z >= z(a/2),
#where z(a/2) is the 100(1 - a/2)th percentile of the standard normal distribution.
#Problem
#Suppose the mean weight of King Penguins found in an Antarctic colony last year was 15.4 kg.
#In a sample of 35 penguins taken at the same time this year in the same colony, the mean penguin weight is 14.6 kg.
#Assume the population standard deviation is 2.5 kg. At the .05 significance level,
#can we reject the null hypothesis that the mean penguin weight does not differ from last year?
#The null hypothesis is that µ = 15.4. We begin by computing the test statistic.
xbar = 14.6    # sample mean
mu0 = 15.4     # hypothesized value
sigma = 2.5    # population standard deviation
n = 35         # sample size
z = (xbar-mu0)/(sigma/sqrt(n))
z              # test statistic
#We then compute the critical values at the .05 significance level.
alpha = .05
z.half.alpha = qnorm(1-alpha/2)
c(-z.half.alpha, z.half.alpha)
#The test statistic -1.8931 lies between the critical values -1.9600 and 1.9600. Hence, at the .05 significance level,
#we do not reject the null hypothesis that the mean penguin weight does not differ from last year.
#Alternative Solution
#Instead of using the critical value, we apply the pnorm function to compute the two-tailed p-value of the test statistic.
#It doubles the lower tail p-value, as the sample mean is less than the hypothesized value.
#Since it turns out to be greater than the .05 significance level, we do not reject the null hypothesis that µ = 15.4.
pval = 2 * pnorm(z)  # twice the lower tail
pval           # two-tailed p-value

##CHUNK 11
#Lower Tail Test of Population Mean with Unknown Variance
#The null hypothesis of the lower tail test is rejected if t <= -t(a), where t(a) is the
#100(1 - a)th percentile of the Student t distribution with n - 1 degrees of freedom.
#Suppose the manufacturer claims that the mean lifetime of a light bulb is more than 10,000 hours.
#In a sample of 30 light bulbs, it was found that they only last 9,900 hours on average.
#Assume the sample standard deviation is 125 hours. At the .05 significance level, can we reject the claim by the manufacturer?
#The null hypothesis is that µ = 10000. We begin by computing the test statistic.
xbar = 9900    # sample mean
mu0 = 10000    # hypothesized value
s = 125        # sample standard deviation
n = 30         # sample size
t = (xbar-mu0)/(s/sqrt(n))
t              # test statistic
#We then compute the critical value at the .05 significance level.
alpha = .05
t.alpha = qt(1-alpha, df=n-1)
-t.alpha       # critical value
#The test statistic -4.3818 is less than the critical value of -1.6991. Hence, at the .05 significance level,
#we reject the claim that the mean lifetime of a light bulb is above 10,000 hours.
#Alternative Solution
#Instead of using the critical value, we apply the pt function to compute the lower tail p-value of the test statistic.
#As it turns out to be less than the .05 significance level, we reject the null hypothesis that µ = 10000.
pval = pt(t, df=n-1)
pval           # lower tail p-value

##CHUNK 12
#Upper Tail Test of Population Mean with Unknown Variance
#The null hypothesis of the upper tail test is rejected if t >= t(a), where t(a) is the
#100(1 - a)th percentile of the Student t distribution with n - 1 degrees of freedom.
#Suppose the food label on a cookie bag states that there is at most 2 grams of saturated fat in a single cookie.
#In a sample of 35 cookies, the mean amount of saturated fat per cookie is 2.1 grams.
#Assume that the sample standard deviation is 0.3 grams. At the .05 significance level, can we reject the claim on the food label?
#The null hypothesis is that µ = 2. We begin by computing the test statistic.
xbar = 2.1     # sample mean
mu0 = 2        # hypothesized value
s = 0.3        # sample standard deviation
n = 35         # sample size
t = (xbar-mu0)/(s/sqrt(n))
t              # test statistic
#We then compute the critical value at the .05 significance level.
alpha = .05
t.alpha = qt(1-alpha, df=n-1)
t.alpha        # critical value
#The test statistic 1.9720 is greater than the critical value of 1.6991. Hence, at the .05 significance level,
#we reject the claim that there is at most 2 grams of saturated fat in a cookie.
#Alternative Solution
#Instead of using the critical value, we apply the pt function to compute the upper tail p-value of the test statistic.
#As it turns out to be less than the .05 significance level, we reject the null hypothesis that µ = 2.
pval = pt(t, df=n-1, lower.tail=FALSE)
pval           # upper tail p-value

##CHUNK 13
#Two-Tailed Test of Population Mean with Unknown Variance
#The null hypothesis of the two-tailed test is rejected if t <= -t(a/2) or t >= t(a/2), where t(a/2) is the
#100(1 - a/2)th percentile of the Student t distribution with n - 1 degrees of freedom.
#Suppose the mean weight of King Penguins found in an Antarctic colony last year was 15.4 kg.
#In a sample of 35 penguins taken at the same time this year in the same colony, the mean penguin weight is 14.6 kg.
#Assume the sample standard deviation is 2.5 kg. At the .05 significance level,
#can we reject the null hypothesis that the mean penguin weight does not differ from last year?
#The null hypothesis is that µ = 15.4. We begin by computing the test statistic.
xbar = 14.6    # sample mean
mu0 = 15.4     # hypothesized value
s = 2.5        # sample standard deviation
n = 35         # sample size
t = (xbar-mu0)/(s/sqrt(n))
t              # test statistic
#We then compute the critical values at the .05 significance level.
alpha = .05
t.half.alpha = qt(1-alpha/2, df=n-1)
c(-t.half.alpha, t.half.alpha)
#The test statistic -1.8931 lies between the critical values -2.0322 and 2.0322. Hence, at the .05 significance level,
#we do not reject the null hypothesis that the mean penguin weight does not differ from last year.
#Alternative Solution
#Instead of using the critical value, we apply the pt function to compute the two-tailed p-value of the test statistic.
#It doubles the lower tail p-value, as the sample mean is less than the hypothesized value.
#Since it turns out to be greater than the .05 significance level, we do not reject the null hypothesis that µ = 15.4.
pval = 2 * pt(t, df=n-1)  # twice the lower tail
pval           # two-tailed p-value

##CHUNK 14
#Lower Tail Test of Population Proportion
#The null hypothesis of the lower tail test is rejected if z <= -z(a), where z(a) is the
#100(1 - a)th percentile of the standard normal distribution.
#Suppose 60% of citizens voted in the last election. 85 out of 148 people in a telephone survey said that they voted in the current election.
#At the .05 significance level, can we reject the null hypothesis that the proportion of voters in the population is above 60% this year?
#The null hypothesis is that p = 0.6. We begin by computing the test statistic.
pbar = 85/148  # sample proportion
p0 = .6        # hypothesized value
n = 148        # sample size
z = (pbar-p0)/sqrt(p0*(1-p0)/n)
z              # test statistic
#We then compute the critical value at the .05 significance level.
alpha = .05
z.alpha = qnorm(1-alpha)
-z.alpha       # critical value
#The test statistic -0.6376 is not less than the critical value of -1.6449. Hence, at the .05 significance level,
#we do not reject the null hypothesis that the proportion of voters in the population is above 60% this year.
#Alternative Solution 1
#Instead of using the critical value, we apply the pnorm function to compute the lower tail p-value of the test statistic.
#As it turns out to be greater than the .05 significance level, we do not reject the null hypothesis that p = 0.6.
pval = pnorm(z)
pval           # lower tail p-value
#Alternative Solution 2
#We apply the prop.test function to compute the p-value directly. The Yates continuity correction is disabled for pedagogical reasons.
prop.test(85, 148, p=.6, alt="less", correct=FALSE)

##CHUNK 15
#Upper Tail Test of Population Proportion
#The null hypothesis of the upper tail test is rejected if z >= z(a), where z(a) is the
#100(1 - a)th percentile of the standard normal distribution.
#Suppose that 12% of apples harvested in an orchard last year were rotten. 30 out of 214 apples in a harvest sample this year turn out to be rotten.
#At the .05 significance level, can we reject the null hypothesis that the proportion of rotten apples in the harvest stays below 12% this year?
#The null hypothesis is that p = 0.12. We begin by computing the test statistic.
pbar = 30/214  # sample proportion
p0 = .12       # hypothesized value
n = 214        # sample size
z = (pbar-p0)/sqrt(p0*(1-p0)/n)
z              # test statistic
#We then compute the critical value at the .05 significance level.
alpha = .05
z.alpha = qnorm(1-alpha)
z.alpha        # critical value
#The test statistic 0.90875 is not greater than the critical value of 1.6449. Hence, at the .05 significance level,
#we do not reject the null hypothesis that the proportion of rotten apples in the harvest stays below 12% this year.
#Alternative Solution 1
#Instead of using the critical value, we apply the pnorm function to compute the upper tail p-value of the test statistic.
#As it turns out to be greater than the .05 significance level, we do not reject the null hypothesis that p = 0.12.
pval = pnorm(z, lower.tail=FALSE)
pval           # upper tail p-value
#Alternative Solution 2
#We apply the prop.test function to compute the p-value directly. The Yates continuity correction is disabled for pedagogical reasons.
prop.test(30, 214, p=.12, alt="greater", correct=FALSE)

##CHUNK 16
#Two-Tailed Test of Population Proportion
#The null hypothesis of the two-tailed test is rejected if z <= -z(a/2) or z >= z(a/2), where z(a/2) is the
#100(1 - a/2)th percentile of the standard normal distribution.
#Suppose a coin toss turns up 12 heads out of 20 trials. At the .05 significance level, can one reject the null hypothesis that the coin toss is fair?
#The null hypothesis is that p = 0.5. We begin by computing the test statistic.
pbar = 12/20   # sample proportion
p0 = .5        # hypothesized value
n = 20         # sample size
z = (pbar-p0)/sqrt(p0*(1-p0)/n)
z              # test statistic
#We then compute the critical values at the .05 significance level.
alpha = .05
z.half.alpha = qnorm(1-alpha/2)
c(-z.half.alpha, z.half.alpha)
#The test statistic 0.89443 lies between the critical values -1.9600 and 1.9600. Hence, at the .05 significance level,
#we do not reject the null hypothesis that the coin toss is fair.
#Alternative Solution 1
#Instead of using the critical value, we apply the pnorm function to compute the two-tailed p-value of the test statistic.
#It doubles the upper tail p-value, as the sample proportion is greater than the hypothesized value.
#Since it turns out to be greater than the .05 significance level, we do not reject the null hypothesis that p = 0.5.
pval = 2 * pnorm(z, lower.tail=FALSE)  # twice the upper tail
pval           # two-tailed p-value
#Alternative Solution 2
#We apply the prop.test function to compute the p-value directly. The Yates continuity correction is disabled for pedagogical reasons.
prop.test(12, 20, p=0.5, correct=FALSE)

##CHUNK 17
##Defining a sample size based on the mean
#Problem
#Assume the population standard deviation of the student height in a survey is 9.48.
#Find the sample size needed to achieve a 1.2 centimeter margin of error at the 95% confidence level.
#Since there are two tails of the normal distribution, the 95% confidence level implies the 97.5th percentile
#of the normal distribution at the upper tail. Therefore, z(a/2) is given by qnorm(.975).
zstar = qnorm(.975)
sigma = 9.48
E = 1.2
zstar^2 * sigma^2 / E^2
#Based on the assumption of the population standard deviation being 9.48, a sample size of 240 is needed
#to achieve a 1.2 centimeter margin of error at the 95% confidence level.

##CHUNK 18
##Defining a sample size based on a proportion
#The quality of a sample survey can be improved by increasing the sample size.
#The formula below provides the sample size needed under the requirement of a population proportion interval estimate
#at the (1 - a) confidence level, margin of error E, and planned proportion estimate p.
#Here, z(a/2) is the 100(1 - a/2)th percentile of the standard normal distribution.
#Since there are two tails of the normal distribution, the 95% confidence level implies the 97.5th percentile
#of the normal distribution at the upper tail. Therefore, z(a/2) is given by qnorm(.975).
zstar = qnorm(.975)
p = 0.5
E = 0.05
zstar^2 * p * (1-p) / E^2
#With a planned proportion estimate of 50% at the 95% confidence level, a sample size of 385 is needed
#to achieve a 5% margin of error for the survey of the female student proportion.
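#As anticipated above (see the open note after CHUNK 7), here is a minimal sketch of a confidence interval for a
#proportion, using the normal (Wald) approximation with the sample proportion plugged into the standard error.
#The data (85 voters out of 148) are reused from CHUNK 14; the construction itself is an addition of this sketch,
#not part of the original script.
pbar = 85/148                         # sample proportion
n = 148
se = sqrt(pbar*(1-pbar)/n)            # estimated standard error of the proportion
zstar = qnorm(.975)                   # 95% confidence level
c(pbar - zstar*se, pbar + zstar*se)   # approximate 95% confidence interval
#For comparison, prop.test(85, 148, correct=FALSE) also reports a 95% confidence interval;
#it is obtained by inverting the score test, so it is close to, but not identical with, the Wald interval above.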
##CHUNK 1
### set the working directory
setwd("C:/Users/Rudolph/Desktop/lesson1")

##CHUNK 2
### help
help(solve)
?solve
help.start()
##asking for help on a non-installed package
?spdep
??spdep

##CHUNK 3
## objects and saving
##create objects
a = 1
b = 2
a
b
##display objects
objects()
##remove a
rm(a)
objects()
##remove all objects
rm(list=ls())
objects()

##CHUNK 4
### scalars and vectors: manipulation
x = c(1,2,3,4,5)
assign("x", c(1,2,3,4,5))   ##equivalent to the assignment above
x
length(x)
str(x)
##reciprocal
1/x
##
y = c(x,0,x)
y
length(y)
str(y)

###CHUNK 5
### vector arithmetic
v = 2*x + y + 1   ##note: x is recycled to the length of y (a warning is issued)
v
length(v)
z1 = x + x
z1
z2 = x - x
z2
z3 = x*x
z3
z4 = x/x
z4
z5 = x^3
z5
##logs
logx = log(x)
logx
log10x = log10(x)
log10x
##exp of a log returns the initial vector
xnew = exp(logx)
xnew2 = 10^log10x
xnew
xnew2
##min / max
min(x)
max(x)
##sum, prod, mean, var
sum(x)
prod(x)
mean(x)
sum(x)/length(x)                       ##same as mean(x)
var(x)
sum((x-mean(x))^2) / (length(x) - 1)   ##same as var(x)

##CHUNK 6
### sequences
c = 1:30
d = 2*1*15
##to practice
n = 10
1:n - 1     ## 0 1 ... 9 (the "- 1" is applied after 1:n)
1:(n-1)     ## 1 2 ... 9
e = 1:30
f = 2*1:15
#In addition, seq() can be used: seq(2, 10) is the same as the vector c(2:10).
#from=, to=, by= are useful arguments:
seq(from = 30, to = 1)
seq(-10, 10, by = 0.5)
#rep() can be used for replicating an object:
rep(x, times=5)
rep(x, each=5)

## CHUNK 7
## logicals and NA's
temp = x > 3
temp           ##it is logical!
str(temp)
length(temp)   #it has the same length as x!
c1 = x > 3
c1
c2 = x < 3
c2
c3 = x != 3
c3
c1&c2
c1|c2
##NA's
z = c(1:3, NA)
z
ind = is.na(z)
ind
0/0          ##NaN
Inf
Inf - Inf    ##NaN

##CHUNK 8
## subsetting
##1
z = c(1:3, NA)
y = z[!is.na(z)]
y
##2
x = seq(1,20)
z1 = x[1:10]
z2 = x[-(1:10)]
##3
cars = c(1,2,3)
names(cars) = c("ferrari","lamborghini","bugatti")
pref = cars[c("ferrari","bugatti")]

## CHUNK 9
## attributes
z = 0:9
z
mode(z)
is.character(z)
digits = as.character(z)
digits
mode(digits)
is.numeric(digits)
d = as.integer(digits)
d
mode(d)
is.numeric(d)

##CHUNK 10
### matrices, data frames, arrays
v = 1:50
is.vector(v)
is.numeric(v)
##turn the vector into a matrix, defining the number of rows and columns
dim(v) = c(10,5)
is.matrix(v)
dim(v)
v
##creating a 4 x 5 array
m = array(1:20, dim = c(4,5))
is.matrix(m)
m
##subsetting an array
##only the first 2 rows
m1 = m[1:2, ]
m1
##only the first 3 columns
m2 = m[ ,1:3]
m2
##only the first 2 rows and the first 3 columns
m3 = m[1:2,1:3]
m3
##replace the first 2 rows of the first 3 columns of m with 0
##(this works here because m contains 1:20, so the values in m3 coincide with the linear indices of those cells;
## m[1:2,1:3] = 0 is the more direct form)
m[m3] = 0
m

##CHUNK 11
## matrix manipulation
A = array(1:20, dim=c(4,5))
B = array(seq(20,1), dim=c(4,5))
A
B
C = A*B          ##element-wise product
dim(C)
D = A%*%t(B)     ##matrix product
dim(D)
D = diag(A)      ##main diagonal of A
D
is.matrix(D)
is.vector(D)
##transpose
T = t(A)         ##(note: T also abbreviates TRUE, so a different name would be safer)
T
##generalized inverse
require(MASS)
ginv(A)

##CHUNK 12
### lists and data frames
Lst1 = list(name="Rodolfo", surname="Metulini", age = "30", sister.age = c(12,16,20,25))
Lst1
Lst2 = list(name = "Carlo", surname = "Rossi", age = "34", sister.age = c(30,38))
Lst2
##concatenating
list.12 = c(Lst1, Lst2)
list.12
##data.frame
m = array(seq(1:20), dim=c(4,5))
is.matrix(m)
df = as.data.frame(m)
is.data.frame(df)
df
str(df)

##CHUNK 13
### read tables
ds = read.table("players.csv", sep=";", header=TRUE)
ds
ds = read.delim("players.txt", sep="", header=TRUE)
ds
summary(ds)
str(ds)
capture.output(summary(ds), file="output.txt")

##CHUNK 14
### examine a distribution
attach(faithful)
summary(eruptions)
hist(eruptions)
##mmm... make the bins smaller (0.2)
hist(eruptions, seq(1.6,5.2,0.2), prob=TRUE)
##draw the empirical density
density(eruptions)
lines(density(eruptions, bw=0.1))
### cumulative distribution function?
ecdf(eruptions)
plot(ecdf(eruptions))
##it is far from the normal distribution. Let's try to split it in two!
long = eruptions[eruptions > 3]
ecdf(long)
plot(ecdf(long))
##compare with the cumulative distribution function of a normal distribution
##define the points x at which to evaluate the cdf
x = seq(3, 5.4, 0.01)
lines(x, pnorm(x, mean = mean(long), sd = sqrt(var(long))), lty=3, col="red")
##with the QQ plot we can check normality graphically
par(pty="s")
qqnorm(long)
qqline(long)
##the Shapiro-Wilk test provides a more formal test of agreement with normality
shapiro.test(long)
##the p-value is less than 0.05

###CHUNK 15
### covariances and concentration indices
A = c(1,3,7,12,15)
B = c(8,7,6,10,18)
Cov = cov(A,B)
Cov
Cor = cor(A,B)
Cor
##We can also calculate the correlation between A and B as follows:
CorAB = Cov / sqrt(var(A)*var(B))
CorAB
##GINI index
require(ineq)
x = c(1,1,1,1,1)
y = c(0,0,200,1,4,2)
ineq(x, type="Gini")
ineq(y, type="Gini")
##mode
require(modeest)
x = c(1,3,7,7,7,8,9,12,14,15)
mfv(x)
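#A minimal cross-check of the mode computed above, using only base R (this snippet is an addition, not part of
#the original script): the most frequent value(s) can be read off a frequency table, and should agree with mfv(x).
x = c(1,3,7,7,7,8,9,12,14,15)
tab = table(x)                             # frequency of each distinct value
as.numeric(names(tab)[tab == max(tab)])    # value(s) with the highest frequency; here 7, as mfv(x) reports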
##CHUNK 1
## Estimating a population mean
mysample = rnorm(25, 10, 3)
mean(mysample)
##create a repository to store the sample means
samplemeans = rep(NA,1000)
for (i in 1:1000) {
  mysample.i = rnorm(25, 10, 3)
  samplemeans[i] = mean(mysample.i)
}
hist(samplemeans)
#Investigate the sampling distribution of the sample mean further by taking
#1000 samples from this normal distribution. Make a histogram of the means
#of these 1000 samples. Verify that about 95% of the time the sample mean
#lies within +/- 2*sigma/sqrt(n) of the true mean.
sum(samplemeans >= 10-2*3/sqrt(25) & samplemeans <= 10+2*3/sqrt(25))/1000

##normal confidence interval
x.bar = mean(mysample)
sd = sd(mysample)
x.bar - 2*sd/sqrt(25)
x.bar + 2*sd/sqrt(25)

##bootstrap confidence interval
#The bootstrap estimates the standard error by resampling the data in our original sample. The idea is to treat
#the sample as a substitute for the population. Instead of repeatedly drawing samples from the
#population, we repeatedly draw new samples of the same size (here 25) from our original sample. In order not to end
#up with exactly the same sample every time, we resample with replacement. This means that after a
#value is drawn, we replace it before drawing again. This way we can draw the same value more than
#once! The following lines demonstrate how this resampling process works.
bootsample = sample(mysample, 25, replace=T)
bootsample
bootmean = mean(bootsample)
bootmean
#To do a full bootstrap, we take a large number of bootstrap samples (at least 1000) from the original
#sample and calculate the mean of each bootstrap sample. The variation
#in the means of the bootstrap samples gives us an estimate of the variability of the sample mean. That is,
#we can estimate the standard error of the sample mean using the standard deviation of the bootstrapped
#sample means. We can then use this to construct a confidence interval for the true population mean.
#Exercise: take 1000 bootstrap samples from your original sample, calculate the mean of each of these
#bootstrap samples, and make a histogram of these bootstrapped means.
bootmeans = rep(NA, 1000)
for (i in 1:1000) {
  bootsample = sample(mysample, 25, replace=T)
  bootmeans[i] = mean(bootsample)
}
hist(bootmeans)
sdb = sd(bootmeans)
##sd(bootmeans) already estimates the standard error of the sample mean, so it is not divided by sqrt(n) again
x.bar - 2*sdb
x.bar + 2*sdb

##CHUNK 2
##bootstrap confidence interval for an EXPONENTIAL rate (using the quantile method)
my.Test.Data = rexp(100,3)
mean(my.Test.Data)
lambda_hat = 1/mean(my.Test.Data)   ##THE ESTIMATED PARAMETER
##normal quantiles (justified by the CLT)
CIlow = lambda_hat - qnorm(0.975)*(lambda_hat/sqrt(100))
CIhigh = lambda_hat + qnorm(0.975)*(lambda_hat/sqrt(100))
CI = cbind(CIlow,CIhigh)
CI
##how do we find a bootstrap confidence interval?
## initialize a vector that will receive the value of the estimate for each bootstrap sample
boot.sampling.dist = rep(NA, 2000)
##compute 2000 bootstrap samples and compute the statistic for each of them
##(parametric bootstrap: new samples are drawn from the fitted exponential distribution)
for (i in 1:2000) {
  boot.sampling.dist[i] = 1/mean(rexp(100,lambda_hat))
}
## look at the sampling distribution
windows()   ## opens a new plotting window on Windows (use x11() or quartz() on other systems)
hist(boot.sampling.dist, main = "Estimate of sampling distribution of lambda", breaks = 50)
##find the quantiles of the bootstrap distribution
my.quantiles = quantile(boot.sampling.dist, c(.025,.975))
##calculate the (basic) bootstrap confidence interval boundaries
CIbootlow = 2*lambda_hat - my.quantiles[2]
CIboothigh = 2*lambda_hat - my.quantiles[1]
CIboot = cbind(CIbootlow,CIboothigh)
CIboot

###CHUNK 3
#bootstrap in regression models
#We can now form a confidence interval for a regression slope using the bootstrap. To do this, we need to implement
#bivariate bootstrap sampling (resampling pairs of observations). We use the percentile bootstrap method. Consider the script below:
### create our data and calculate thetahat, the slope of the regression line
library(MASS)
data(cats)
dim(cats)
thetahat = lm(Hwt ~ Bwt, data=cats)$coeff[2]
thetahat
##construct 1000 bootstrap samples of n = 144 rows and, for each one, compute the slope estimate
thetahat.b = rep(NA,1000)
for (i in 1:1000) {
  ### draw the bootstrap sample and calculate thetahat.b
  index = 1:144
  bootindex = sample(index, 144, replace=T)
  bootsample = cats[bootindex,]
  thetahat.b[i] = lm(Hwt ~ Bwt, data=bootsample)$coeff[2]
}
hist(thetahat.b)
quantile(thetahat.b, .025)
quantile(thetahat.b, .975)

##CHUNK 4
##Alternative method: resampling the residuals
res = matrix(NA,1000,144)     ##a 1000 x 144 matrix to store the 1000 replications of the 144 residuals
thetahat.b_a = rep(NA,1000)   ##to store the slope estimates
##first step: obtain the residuals from each bootstrap fit
for (i in 1:1000) {
  index = 1:144
  bootindex = sample(index, 144, replace=T)
  bootsample = cats[bootindex,]
  res[i,] = lm(Hwt ~ Bwt, data=bootsample)$res
}
##second step: add each set of residuals to Hwt and re-estimate the slope
for (i in 1:1000) {
  thetahat.b_a[i] = lm(Hwt + res[i,] ~ Bwt, data=bootsample)$coeff[2]
}
hist(thetahat.b_a)
quantile(thetahat.b_a, .025)
quantile(thetahat.b_a, .975)
hist(thetahat.b)
quantile(thetahat.b, .025)
quantile(thetahat.b, .975)

## MARKOV CHAINS
### CHUNK 5
## what is the probability that the day after tomorrow will be rainy, given that it rained today?
##construct the transition matrix
trans.mat = matrix(c(0.7,0.3,0.4,0.6), 2, 2, byrow=TRUE)
##2-step transition matrix
trans.mat.2 = trans.mat %*% trans.mat
trans.mat.2

### CHUNK 6
## what is the unconditional probability of rain at time n?
trans.mat = matrix(c(0.7,0.3,0.4,0.6), 2, 2, byrow=TRUE)
##2-step transition matrix
trans.mat.2 = trans.mat %*% trans.mat
init.dist = matrix(c(0.4,0.6), 1, 2)
init.dist
unc.prob.n = init.dist %*% trans.mat.2
unc.prob.n

##CHUNK 7
##stationary transition distribution
trans.mat = matrix(c(0.7,0.3,0.4,0.6), 2, 2, byrow=TRUE)
##2-step transition matrix
trans.mat.2 = trans.mat %*% trans.mat
trans.mat.2
##4-step transition matrix
trans.mat.4 = trans.mat %*% trans.mat %*% trans.mat %*% trans.mat
trans.mat.4
##6-step transition matrix
trans.mat.6 = trans.mat %*% trans.mat %*% trans.mat %*% trans.mat %*% trans.mat %*% trans.mat
trans.mat.6

##CHUNK 8
## alternative method
trans.mat
alpha_stat = trans.mat[2,1] / trans.mat[1,1]
alpha_stat
beta_stat = trans.mat[1,2] / trans.mat[1,1]
beta_stat
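#A minimal cross-check of the stationary distribution (this snippet is an addition, not part of the original
#script): for this chain the rows of a high power of the transition matrix converge to the stationary
#distribution, so raising trans.mat to a large power should reproduce the values obtained above.
trans.mat = matrix(c(0.7,0.3,0.4,0.6), 2, 2, byrow=TRUE)
trans.mat.100 = Reduce(`%*%`, rep(list(trans.mat), 100))   # 100-step transition matrix
trans.mat.100   # both rows are approximately (0.5714, 0.4286), i.e. (4/7, 3/7)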
##CHUNK 1
###correlation and plots
library("MASS")
data(cats)
str(cats)
summary(cats)
# "Bwt" is the body weight in kilograms, "Hwt" is the heart weight in grams, and "Sex" should be obvious.
# There are no missing values in any of the variables, so we are ready to begin by looking at a scatterplot...
##DRAW THE SCATTERPLOT
with(cats, plot(Bwt, Hwt))
##ADD A TITLE
title(main="Heart Weight (g) vs. Body Weight (kg)\nof Domestic Cats")
##ALTERNATIVELY ...
with(cats, plot(Hwt ~ Bwt))
# The scatterplot shows a fairly strong and reasonably linear relationship between the two variables.
# A Pearson product-moment correlation coefficient can be calculated using the cor() function...
#CORRELATION
with(cats, cor(Bwt, Hwt))
##Pearson test for a correlation greater (or lower) than 0
with(cats, cor.test(Bwt, Hwt))
#Since we would expect a positive correlation here, we might have set the alternative to "greater"
#(we "forget" the lower tail and perform a one-sided "upper-tail" test)
with(cats, cor.test(Bwt, Hwt, alternative="greater", conf.level=.95))
#Using the formula interface makes it easy to subset the data by rows of the data frame, to analyze female cats only
with(cats, cor.test(~ Bwt + Hwt, subset=(Sex=="F")))
##here, for the females, the correlation is lower, but still significant

###suppose a dataset with more than 2 variables
data(cement)   # also in the MASS library
str(cement)
cor(cement)    ##pairwise correlations among all variables
cov(cement)
# If you have a covariance matrix and want a correlation matrix...
cov.matr = cov(cement)
cov2cor(cov.matr)
# If you want a visual representation of the correlation matrix
pairs(cement)  ##scatterplot matrix

##CHUNK 2
###linear regression model
#Simple Linear Regression
#There is little question that R was written with regression analysis in mind!
#The number of functions and add-on packages devoted to regression analysis of one kind or another is staggering!
#Let's go back to our cats data...
data(cats)     # still in the "MASS" library
#Simple (as well as multiple) linear regression is done using the lm() function.
#This function requires a formula interface, and for simple regression the formula takes the form DV ~ IV.
#It's critical to remember that the DV (response variable) must come before the tilde and the IV (explanatory variables) after.
#Getting the regression equation for our "cats" data is simple:
olscats = lm(Hwt ~ Bwt, data=cats)
##summary
summary(olscats)
#So the regression equation is: Hwt = 4.0341 (Bwt) - 0.3567.
##draw the observed points
plot(Hwt ~ Bwt, main="Kitty Cat Plot", data=cats)
##add the regression (predicted values) line
abline(olscats, col="red")
###regression diagnostics
par(mfrow=c(2,2))
plot(olscats)
str(olscats, max.level=1)
##influence measures
leverage = influence(olscats)
leverage$coefficients   ##change in the coefficients when the i-th observation is omitted from the sample

##CHUNK 3
###prediction
#Problem
#Apply the simple linear regression model to the data set cats, and estimate the heart weight of a cat with a 3 kg body.
#Solution
#We apply the lm function to a formula that describes the variable Hwt in terms of Bwt,
#and save the linear regression model in a new variable olscats
olscats = lm(Hwt ~ Bwt, data=cats)
#Then we extract the parameters of the estimated regression equation with the coefficients function.
coeffs = coefficients(olscats)
coeffs
#We now predict the heart weight of a 3 kg cat using the estimated regression equation.
Bwt = 3
Hwt = coeffs[1] + coeffs[2]*Bwt
Hwt
#Answer: the predicted heart weight of a 3 kg cat.
#Alternative Solution
#We wrap the Bwt value inside a new data frame named newdata.
newdata = data.frame(Bwt=3)   # wrap the parameter
#Then we apply the predict function to olscats along with newdata.
predict(olscats, newdata)     # apply predict

##CHUNK 4
###stepwise regression
summary(cement)
olscement = lm(y ~ x1 + x2 + x3 + x4, data=cement)
summary(olscement)
require(Rcmdr)
stepwise(olscement)   ##stepwise model selection (AIC, BIC)

##CHUNK 5
##heteroscedasticity
require(car)          ##ncvTest is in the car package
ncvTest(olscement)    ##test for non-constant error variance
robustcement = rlm(y ~ x1 + x2 + x3 + x4, data=cement)   ##robust (M-estimation) regression from MASS
###robustness checks
plot(robustcement)
coeffs1 = coefficients(olscement)
coeffs1
coeffs2 = coefficients(robustcement)
coeffs2
all.equal(coeffs1, coeffs2)

##CHUNK 6
### glm #####
library(pscl)
data(AustralianElections)
str(AustralianElections)
summary(AustralianElections)
poiselections = glm(Seats ~ Informal + Turnout, data=AustralianElections, family=poisson)
summary(poiselections)

##CHUNK 7
##HOMEWORK
summary(cement)
olscement = lm(y ~ x1 + x2 + x3 + x4, data=cement)
summary(olscement)
ols2 = lm(y ~ x1, data=cement)
summary(ols2)
## 1. DEFINE THE COEFFICIENTS (by hand)
Y = cement$y        ##vector of the dependent values
X = cement$x1       ##vector of the independent values (variable x1)
str(X)
str(Y)
is.vector(X)
is.vector(Y)
Xm = mean(X)        ##mean of the vector X
Ym = mean(Y)        ##mean of the vector Y
Xd = X - Xm         ##deviations from Xm
Yd = Y - Ym         ##deviations from Ym
Xd
Yd
num = sum(Yd*Xd)    ##numerator
den = sum(Xd*Xd)    ##denominator
num
den
beta1hat = num / den
beta1hat
## 2. DEFINE THE R^2 (by hand)
Yhat = ols2$fitted  ###vector of fitted values
Y = cement$y        ##vector of the dependent values
Ym = mean(Y)        ##mean of Y
num = sum((Yhat - Ym)^2)   ##numerator: regression (explained) sum of squares
den = sum((Y - Ym)^2)      ##denominator: total sum of squares
R2 = num / den             ##R^2
R2
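#A small extension of the homework above (this snippet is an addition, not part of the original script):
#once the slope is available, the intercept follows from the sample means, and both can be checked against lm().
beta0hat = Ym - beta1hat*Xm          ##intercept: the fitted line passes through (Xm, Ym)
beta0hat
coefficients(ols2)                   ##should match c(beta0hat, beta1hat)
summary(ols2)$r.squared              ##should match the R2 computed by hand above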
##drawing random numbers
x = rnorm(500, mean=100, sd=10)
x
hist1 = hist(x, prob=TRUE)
x = rnorm(500, mean=100, sd=10)
hist2 = hist(x, prob=TRUE, ylim=c(0,0.04))
####overlay the theoretical density
curve(dnorm(x, mean=100, sd=10), 70, 130, add=TRUE, lwd=2, col="red")

##distribution of sample means
mean(x)
x = rnorm(500, mean=100, sd=10)
mean(x)

### producing a vector of sample means
mu = 100
sigma = 10
n = 5
xbar = rep(0,500)
for (i in 1:500) {
  xbar[i] = mean(rnorm(n, mean=mu, sd=sigma))
}
hist(xbar, prob=TRUE, breaks=12, xlim=c(80,120), ylim=c(0,0.1))
mean(xbar)

###increasing the sample size
mu = 100
sigma = 10
n = 10
xbar = rep(0,500)
for (i in 1:500) {
  xbar[i] = mean(rnorm(n, mean=mu, sd=sigma))
}
hist(xbar, prob=TRUE, breaks=12, xlim=c(80,120), ylim=c(0,0.15))

###law of large numbers
##drawing random numbers
x = rbinom(100,1,0.5)
#x2 = rbinom(100,2,0.5)
##histogram of the random numbers
hist(x)
##the empirical frequency
sum(x)
###the empirical frequency for the sample mean
freq = rep(0,1000)
freq
##the index i (equivalently, N = 1:1000)
N = rep(0,1000)
for (i in 1:1000) {N[i] = i}
N
#the cumulated frequency (running total of the sums of 100 Bernoulli draws, each sum having expectation 50)
freq[1] = sum(rbinom(100,1,0.5))
for (i in 2:1000) {
  freq[i] = sum(rbinom(100,1,0.5)) + freq[i-1]
}
freq
##the sample mean of the first i draws
freq2 = rep(0,1000)
for (i in 1:1000) {freq2[i] = freq[i]/i}
freq2
plot(freq2, ylim=c(48,52))
mu = rep(50,1000)
mu
points(mu, col="red")
#are they equal?
all.equal(freq2[1], mu[1])
all.equal(freq2[10], mu[10])
all.equal(freq2[100], mu[100])
all.equal(freq2[1000], mu[1000])
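#A more compact way to display the same law-of-large-numbers behaviour (this snippet is an addition, not part of
#the original script): cumulative means of a single stream of draws, computed with cumsum() instead of a loop.
draws = rbinom(1000, 100, 0.5)                   # each draw is a sum of 100 Bernoulli(0.5), expectation 50
running.mean = cumsum(draws) / seq_along(draws)  # running sample mean after 1, 2, ..., 1000 draws
plot(running.mean, type="l", ylim=c(48,52))
abline(h=50, col="red")                          # the running mean settles around 50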