My Little Stats Book


Basics of statistics

Introduction
1. NOIR – Nominal, Ordinal, Interval, Ratio
Nominal – e.g. countries of football players. There is a difference but no order. Basically you
can’t say one is better than the other.
Ordinal – e.g. movie ratings. There is a difference and an order, but you can’t measure the
difference accurately. Basically you can say Movie 1 (8/10) is better than Movie 2 (4/10) but
you can’t say it is 100% better.
Interval – e.g. temperature. There is difference, order, and measurement, but no true zero point.
Basically, zero doesn’t mean non-existence. For this reason, you can add or subtract values
but you can’t divide or multiply them (hence no ratios).
Ratio – e.g. runs scored by a batsman. This one has a natural zero. Zero runs means the
batsman has scored nothing (notice that zero temperature doesn’t mean there is no
temperature. That would be absurd).

2. A data matrix is simply a table/dataset. A frequency table is used to summarize a variable
(country of IPL players, or runs scored by them) in the form of a table. For a quantitative
(interval or ratio) variable you’ll need to recode it to ordinal. For instance, for runs scored
you’ll make categories such as 0-50, 50-100, 100-150, etc. If you don’t, you’ll end up with a
huge table that won’t make sense, as there will be no summary.
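The recoding step above can be sketched in a few lines of Python. The run totals below are made-up numbers, and the bin width of 50 just mirrors the example categories in the text:

```python
from collections import Counter

# Hypothetical runs scored by 12 IPL batsmen (invented numbers).
runs = [12, 48, 55, 73, 101, 5, 149, 88, 34, 120, 67, 92]

def bin_label(r, width=50):
    """Recode a ratio value into an ordinal category like '0-49' or '50-99'."""
    lo = (r // width) * width
    return f"{lo}-{lo + width - 1}"

# The frequency table: category -> count, instead of one row per raw value.
freq = Counter(bin_label(r) for r in runs)
print(dict(freq))
```

Without the binning, `Counter(runs)` would give one entry per distinct score, which is the huge, unsummarized table the note warns about.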

3. Pie charts and bar graphs are feasible for categorical data, and a histogram (or dot plot) is
feasible for quantitative data.

4. Mode is feasible for categorical data, and mean and median are feasible for quantitative
data. If the mean is abnormally distant from the median, there is a possibility of an outlier.
5. Range is the difference between the highest and lowest value. The higher the range, the
greater the dispersion. Its disadvantage is that it uses only the extreme values. The
interquartile range is found by calculating the median of the set and then calculating the
median on either side of it. These three medians are called quartiles 1, 2, and 3.
Interquartile range (IQR) = Q3 – Q1.
Any value lower than (Q1 – 1.5*IQR) or higher than (Q3 + 1.5*IQR) is termed an outlier.
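The quartile/IQR recipe above can be sketched with the median-of-halves method on a small made-up dataset:

```python
import statistics

def quartiles(data):
    """Q1, Q2, Q3 via the median-of-halves method described in the note."""
    s = sorted(data)
    n = len(s)
    q2 = statistics.median(s)
    half = n // 2                     # exclude the middle element when n is odd
    lower, upper = s[:half], s[n - half:]
    return statistics.median(lower), q2, statistics.median(upper)

data = [3, 7, 8, 5, 12, 14, 21, 13, 18]   # invented values
q1, q2, q3 = quartiles(data)
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(q1, q2, q3, iqr)            # 6.0 12 16.0 10.0
print(low_fence, high_fence)      # anything outside these fences is an outlier
```

Note that textbooks and libraries differ slightly in how they interpolate quartiles; this sketch follows the median-of-halves definition used in the note.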

6. Box Plot – The ends of the whiskers are the minimum and maximum values after weeding
out outliers. Q2 is the median and the box represents the middle 50% of the data.
7. Variance

s² = ∑(x − x̄)² / (n − 1)

Take the mean x̄ and subtract it from each value. Square these differences and add them up.
Finally, divide by (n − 1). The larger the variance, the more the values are spread out or
dispersed. The unit of variance is the squared unit of the variable.
Why (n − 1): In short, when you take a sample of a population, you most likely underestimate
the population’s variability. So by reducing the denominator you increase the size of the
variance estimate. For a more detailed explanation, go to Khan Academy and watch the three
consecutive videos on this.
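The recipe above, as a minimal Python sketch (the data values are arbitrary); the stdlib `statistics.variance` also divides by (n − 1), so it should agree:

```python
import statistics

def sample_variance(xs):
    n = len(xs)
    mean = sum(xs) / n
    # Bessel's correction: divide by (n - 1), not n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_variance(data))        # 32 / 7 ≈ 4.571
print(statistics.variance(data))    # stdlib sample variance agrees
```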

8. Standard Deviation

s = √( ∑(x − x̄)² / (n − 1) )

Standard deviation is the square root of variance. Its unit is the same as that of the variable. It
can be thought of as the average distance of an observation from the mean. In a normal
distribution, about 68% of the data falls within one SD of the mean and about 95% falls
within two SDs.

9. Z-score

z = (x − x̄) / s
The z-score gives the number of standard deviations a particular observation is removed from
the mean. This helps us determine whether the observation is common or exceptional. It can
also help us determine whether a particular observation is more common in group 1 or group 2.
Generally, |z| > 3 is considered exceptional (but it could be otherwise if the histogram is
skewed)
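A small sketch of the z-score computation; the scores below are invented:

```python
import statistics

def z_score(x, xs):
    """How many sample standard deviations x is from the mean of xs."""
    mean = statistics.mean(xs)
    s = statistics.stdev(xs)   # sample SD, (n - 1) denominator
    return (x - mean) / s

scores = [35, 40, 45, 50, 55, 60, 65]
print(round(z_score(65, scores), 2))   # well inside |z| <= 3, so not exceptional
```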

10. Bessel’s Correction (Descriptive Statistics Chp 4)


Samples underestimate the amount of variability in a population because samples tend to
come from the middle of the distribution, especially in a normal distribution. Hence, we use
Bessel’s correction and divide by (n − 1) rather than n.

11. Probability Density Function


The PDF is the curve of a continuous distribution. The area under this curve between two
points gives the probability that a continuous random variable falls between those points,
which is why it is called a probability density function.

12. Sampling Distribution (Lesson 8 Summary)


The sampling distribution is the distribution of the means of all possible samples. Its mean is
equal to the population mean and its SD is equal to σ/√n.
This is due to the central limit theorem. The CLT says that if we take samples from a
population and draw the distribution of the means of these samples, it will be a normal
distribution, whatever the shape of the population distribution.
This can be used to see where, on the distribution of means for a specific sample size, a
specific sample mean lies.
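The CLT claim can be checked empirically with a quick simulation; the population (uniform, clearly non-normal) and the sizes are arbitrary choices:

```python
import random
import statistics

random.seed(0)
# A clearly non-normal population: uniform integers from 0 to 100.
population = [random.randint(0, 100) for _ in range(10_000)]
mu = statistics.mean(population)
sigma = statistics.pstdev(population)   # population SD

# Draw many samples of size n and record each sample's mean.
n = 30
sample_means = [statistics.mean(random.sample(population, n))
                for _ in range(2_000)]

# CLT: the distribution of sample means has mean ≈ µ and SD ≈ σ/√n.
print(mu, statistics.mean(sample_means))
print(sigma / n ** 0.5, statistics.stdev(sample_means))
```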

//////////////////////////////////////////////////////////////////////////////////////////////////////////
Confidence Interval
13. The Bieber Tweeter Example (Lesson 8)
Let’s say we have a population of Klout scores, which assigns a score to a person based on
the tweets they do. The population is N = 1048, µ = 37.72, σ = 16.04. Now we take a sample
of n = 35 people who use an app called Bieber Tweeter and find their mean score is 40.
We draw the sampling distribution for this; for this we calculate the mean and SD (which is
also known as standard error in the case of samples), where m = µ and SE = σ/√35.

Now we want to know: what if everyone started using this app?

What will the mean of the population be now?
Well, we could say 40, but it won’t be exactly that. Let’s call this number the point estimate.

Now assume that the population mean after using the app would lie in some range. We know
that 40 will be in that range. Say we want to be 95% sure, so we look for the values of µBT on
either side of 40 such that 40 lies within the middle 95% of the sampling distribution. In
practice, if we calculate the 95% range around 40, we get the plausible values for µBT. This
range is called the confidence interval.
We can say that we are 95% sure that if everyone started using the Bieber Tweeter app, then
the mean will lie within this range.
Point estimate = x̄ = the sample mean we get.
95% confidence interval = x̄ − 1.96(σ/√n) to x̄ + 1.96(σ/√n)
98% confidence interval = x̄ − 2.33(σ/√n) to x̄ + 2.33(σ/√n)
Y% confidence interval = x̄ − z*(σ/√n) to x̄ + z*(σ/√n)
1.96 and 2.33 are known as the critical values for z-scores.
The bigger the sample size, the narrower the confidence interval.
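Plugging the Bieber Tweeter numbers into the formulas above gives a concrete 95% interval (a minimal sketch):

```python
import math

# Numbers from the Bieber Tweeter example.
sigma, n, x_bar = 16.04, 35, 40.0
se = sigma / math.sqrt(n)          # standard error

def confidence_interval(x_bar, se, z_crit):
    """Interval x_bar ± z_crit * SE."""
    moe = z_crit * se              # margin of error
    return x_bar - moe, x_bar + moe

lo, hi = confidence_interval(x_bar, se, 1.96)   # 95% -> z* = 1.96
print(round(lo, 2), round(hi, 2))               # 34.69 45.31
```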

14. Margin of error


z*(σ/√n) is called the margin of error. The margin of error is half the width of the confidence
interval.

//////////////////////////////////////////////////////////////////////////////////////////////////////////

Hypothesis Testing

15. Alpha Levels and Critical Region


There are three important alpha levels:
α = 0.05, α = 0.01, and α = 0.001, and the z-critical values related to them are z = 1.65,
z = 2.32, and z = 3.08. This is for a one-tailed test.
If we consider a two-tailed test, then the alpha level in each tail is halved, i.e. 0.025, 0.005,
and 0.0005, and the corresponding z-critical values are z = ±1.96, z = ±2.57, and z = ±3.30.

If the mean of a sample under some intervention or treatment (e.g. the population using
Bieber Tweeter) falls beyond these critical values, then we say that the value is significant,
i.e. highly unlikely under the null hypothesis. This would mean that the treatment had an
effect.

16. Hypothesis
H0 = Null hypothesis, which states that there is no significant difference between the
current population parameters and the new population parameter after the intervention.
Ha = H1 = Alternative hypothesis, which states there is a significant difference.
H0 : µ = µI
Ha : µ < µI, or µ > µI, or µ ≠ µI (where µI is the mean after the intervention)
We use the alpha levels and critical region to reject (or fail to reject) the null hypothesis.
To reject the null hypothesis means the following
1) Our sample mean falls within the critical region
2) The z score of the sample mean is greater than the z critical value
3) The probability of getting the sample mean is less than the alpha level

Remember, the greater the sample size (while keeping the mean constant), the greater will be
the z value and the greater the chance of rejecting the null. This is because

z = (x̄ − µ) / (σ/√n)

so as n grows, the denominator σ/√n shrinks and z grows.
Note: You may think it absurd that increasing the sample size will shift our result from failing
to reject to rejecting the null. However, in a real scenario, as you increase the sample size you
get closer to the real mean, i.e. things take their natural course. Here we keep the mean
constant just to understand theoretically what happens if we increase the sample size.
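The effect of sample size on z can be seen by holding x̄ constant and growing n, reusing the Bieber Tweeter numbers from the earlier example:

```python
import math

# Bieber Tweeter numbers: population mean/SD, and a fixed sample mean.
mu, sigma, x_bar = 37.72, 16.04, 40.0

def z_stat(x_bar, mu, sigma, n):
    """z for a sample mean: (x_bar - mu) / (sigma / sqrt(n))."""
    return (x_bar - mu) / (sigma / math.sqrt(n))

# Same x_bar, bigger n -> bigger z.
for n in (35, 100, 500):
    print(n, round(z_stat(x_bar, mu, sigma, n), 2))
```

At n = 35 the z value is well below the one-tailed critical value of 1.65, but by n = 500 it exceeds even 2.32, illustrating the point in the note.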

17. Type I and Type II Error (Statistical Decision Error)


Type I error is when we reject the null hypothesis when in reality it is true.
Type II error is when we retain the null when in reality it is false.
We can make these errors for numerous reasons, for instance, biased samples. Note, however,
that this is not due to calculation error.

//////////////////////////////////////////////////////////////////////////////////////////////////////////

T-Tests

18. The t Distribution


You might have noticed that while using the sampling distribution we generally find a z-score
to see where our sample mean lies. This z-score depends on σ, which in turn comes from the
population distribution. However, we don’t always have µ and σ. Therefore, statisticians have
designed the t-distribution. The t-distribution carries more uncertainty and hence is more
spread out and thicker in the tails. It is also called Student’s t because someone under that
pen name devised the t-statistic to measure the quality of beer at the Guinness Brewery in
Dublin.

19. Degrees of Freedom


In short, if you have n numbers in your sample then you have (n − 1) degrees of freedom. If
you really want to know why, watch Lesson 10a in Inferential Statistics at Udacity. However,
it is not essential.
20. T-Statistic
The t-statistic uses s, the standard deviation of the sample, rather than σ, the population
standard deviation.

t = (x̄ − µ0) / (s/√n)

The larger x̄ is relative to µ0, the larger the value of t, which is evidence that µ > µ0.
The further x̄ is from µ0 in either direction, the stronger the evidence that µ ≠ µ0.
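A sketch of the t-statistic on a made-up sample; µ0 and the data are invented for illustration:

```python
import math
import statistics

mu_0 = 50                           # hypothetical null-hypothesis mean
sample = [52, 55, 49, 61, 53, 58, 50, 56]

x_bar = statistics.mean(sample)
s = statistics.stdev(sample)        # sample SD, (n - 1) denominator
n = len(sample)

t = (x_bar - mu_0) / (s / math.sqrt(n))
df = n - 1                          # degrees of freedom for the t-table
print(round(t, 2), df)
```

With t and df in hand, you would look up the t-table (or use software) to get the p-value.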

21. P value
The p-value is the probability of obtaining a t-statistic at least as extreme as the one
observed. Notice that from the z-table we get an exact probability, whereas from the t-table
we only get a range of probabilities; software can give us the exact p-value.
In a two-tailed test we consider both tails, and hence we add the probabilities in the two tails
to get the total p-value. If you think about it, we are doing the reverse of the alpha level: when
we want a 0.05 alpha level in a two-tailed test, we halve it between the tails; here we are
adding the two tails so that we get the total probability on both sides.

Correlation and Regression


The major focus here is to compare two different variables. For instance, chocolate consumption and
body weight. We might be interested whether greater chocolate consumption results in increased
body weight.

Contingency Table and Scatterplot

A contingency table is similar to a frequency table but it considers two variables. It can be
represented with raw numbers or column percentages. A contingency table is used for
categorical variables. Scatterplots are X-Y plots with the dependent variable on the Y axis (to
identify the dependent variable, just think about the conclusion you want: chocolate
consumption increases body weight, or body weight increases chocolate consumption). A
scatterplot is used for quantitative variables.

A scatterplot can show whether the relationship is linear or curvilinear. Further, it can
describe the relationship as strong if observations are tightly clustered, or weak if
observations are spread out.

Pearson’s R

r = ∑(Zx · Zy) / (n − 1)
Pearson’s r can be calculated by first calculating the z-scores for each observation of both
variables (chocolate consumption and body weight), multiplying each pair of z-scores, and
summing these products. Then divide this sum by one less than the total number of
observations. This is a tedious calculation by hand, and it might take a while to internalize.

The advantage of Pearson’s r is that it expresses the direction and strength of a linear
correlation with one single number. It is always between −1 and +1. It should only be used on
linear relations, so it would be wise to check the scatterplot to determine whether the
relationship is linear or curvilinear.
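The z-score recipe for Pearson’s r, sketched in Python; the chocolate/weight numbers are invented:

```python
import statistics

def pearson_r(xs, ys):
    """r = sum of products of paired z-scores, divided by (n - 1)."""
    n = len(xs)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    zx = [(x - mx) / sx for x in xs]
    zy = [(y - my) / sy for y in ys]
    return sum(a * b for a, b in zip(zx, zy)) / (n - 1)

choc = [1, 2, 3, 4, 5]            # bars of chocolate per week (invented)
weight = [61, 63, 66, 69, 72]     # body weight in kg (invented)
print(round(pearson_r(choc, weight), 3))   # close to +1: strong positive linear relation
```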

Regression

A regression line is computed to minimize the sum of the squared residuals. A regression line
has the formula

ŷ = a + bx,  where b = r(Sy/Sx) and a = ȳ − b(x̄)
The intercept or a represents the expected value of y when x is zero. The regression coefficient or b
represents the change in y per unit change in x.
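A minimal least-squares sketch: b is computed here as covariance over the variance of x, which is algebraically the same as r(Sy/Sx). The chocolate/weight numbers are invented:

```python
import statistics

def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    # b = covariance(x, y) / variance(x), equivalent to r * (Sy / Sx)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx                 # intercept: expected y when x = 0
    return a, b

choc = [1, 2, 3, 4, 5]            # bars of chocolate per week (invented)
weight = [61, 63, 66, 69, 72]     # body weight in kg (invented)
a, b = fit_line(choc, weight)
print(round(a, 2), round(b, 2))   # 57.8 2.8, i.e. y-hat = 57.8 + 2.8x
```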

Estimation (Refer Udacity > Descriptive Statistics > 8. Estimation)

95% of values fall within 2 standard deviations of the mean (for sample means, within 2
standard errors, i.e. 2σ/√n).
If 40 is one of those values, then µ + 2σ/√n > 40 > µ − 2σ/√n.
Now solve this inequality for µ.
We will get the range for the population mean.
