2. My Little Stats Book
Introduction
1. NOIR – Nominal, Ordinal, Interval, Ratio
Nominal – e.g. nationalities of football players. The categories differ but have no order.
Basically you can’t say one is better than another.
Ordinal – e.g. movie ratings. There is a difference and an order, but the size of the difference
is not well defined. Basically you can say Movie 1 (8/10) is better than Movie 2 (4/10), but you
can’t say it is 100% better.
Interval – e.g. temperature in °C or °F. There is difference, order, and measurable distance, but
no true zero point. Basically, zero doesn’t mean the quantity is absent. For this reason, you can
add or subtract values but you can’t meaningfully divide or multiply them (hence no ratio).
Ratio – e.g. runs scored by a batsman. This one has a natural zero. Zero runs means the
batsman has scored nothing (notice that zero degrees doesn’t mean there is no temperature.
That’s absurd).
3. Pie charts and bar graphs are suitable for categorical data; histograms (or dot plots) are
suitable for quantitative data.
4. Mode is suitable for categorical data, and mean and median are suitable for quantitative
data. If the mean is abnormally far from the median, an outlier (or a skewed distribution) may
be pulling it.
5. Range is the difference between the highest and lowest values. The higher the range, the
greater the dispersion. Its disadvantage is that it uses only the two extreme values. The
interquartile range (IQR) is found by taking the median of the set and then the median of each
half on either side of it. These three medians are called quartiles 1, 2, and 3 (Q1, Q2, Q3).
Interquartile range = Q3 – Q1. Any value lower than (Q1 – 1.5*IQR) or higher than
(Q3 + 1.5*IQR) is termed an outlier.
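The quartile and outlier rule above can be sketched in Python. This is only a minimal sketch of the median-of-halves method described in these notes; the function names and the data are made up for illustration, and other software may use slightly different quartile conventions.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method from the notes."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]            # values below the median position
    upper = data[half + n % 2:]    # values above it (middle value skipped when n is odd)
    return median(lower), median(data), median(upper)

def outliers(values):
    """Return the values outside the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

# Hypothetical data, invented for this example
scores = [2, 4, 5, 7, 8, 9, 11, 12, 40]
q1, q2, q3 = quartiles(scores)
print("Q1, Q2, Q3:", q1, q2, q3)
print("IQR:", q3 - q1)
print("Outliers:", outliers(scores))
```

Here only 40 lies beyond the upper fence, so it is the only value flagged.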
6. Box Plot - The ends of the whiskers are the minimum and maximum values after weeding out
outliers. Q2 is the median, and the box (from Q1 to Q3) represents the middle 50% of the data.
7. Variance
$$ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} $$
Compute the mean x̄ and subtract it from each value. Square these differences and add them up.
Finally divide the sum by (n-1). The larger the variance, the more the values are spread out or
dispersed. The unit of variance is the square of the unit of the variable.
Why (n-1): In short, the values in a sample sit closer to their own sample mean than to the
population mean, so dividing by n tends to underestimate the population variance. Reducing the
denominator to (n-1) increases the estimate to compensate. For a more detailed explanation go to
Khan Academy and watch the 3 consecutive videos.
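The steps above can be sketched as follows. The data and the helper name `sample_variance` are made up for illustration; `statistics.variance` (n-1 denominator) and `statistics.pvariance` (n denominator) are from the Python standard library and are used here only as a cross-check.

```python
from statistics import mean, variance, pvariance

def sample_variance(values):
    """Sample variance: sum of squared deviations from the mean, divided by n - 1."""
    x_bar = mean(values)
    squared_deviations = [(x - x_bar) ** 2 for x in values]
    return sum(squared_deviations) / (len(values) - 1)

# Hypothetical data, invented for this example
data = [4, 8, 6, 5, 3, 7]

print("by hand (n-1):  ", sample_variance(data))
print("statistics (n-1):", variance(data))
print("statistics (n):  ", pvariance(data))  # smaller, because it divides by n
```

Dividing by n gives the smaller number, which is the underestimate that the (n-1) correction is meant to offset.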
8. Standard Deviation
$$ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} $$
Standard deviation is the square root of the variance. Its unit is the same as that of the
variable. It can be thought of as roughly the typical distance of an observation from the mean.
For a normal distribution, about 68% of the data falls within one SD of the mean and about 95%
within two SD.
9. Z-score
$$ z = \frac{x - \bar{x}}{s} $$
The z-score gives the number of standard deviations by which a particular observation is
removed from the mean. This helps us determine whether the observation is common or
exceptional. It can also help us determine whether a particular observation is more typical of
group 1 or group 2. Generally, a z-score beyond ±3 (|z| > 3) is considered exceptional (but it
could be otherwise if the histogram is skewed).
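A minimal sketch of the standard deviation and z-score formulas above, using Python's standard `statistics` module. The data and the helper name `z_score` are assumptions made up for this example.

```python
from statistics import mean, stdev

def z_score(x, values):
    """Number of sample standard deviations that x lies from the mean of values."""
    return (x - mean(values)) / stdev(values)

# Hypothetical data, invented for this example
data = [4, 8, 6, 5, 3, 7]

print("mean:", mean(data))
print("SD (n-1 denominator):", stdev(data))
for x in (3, 5.5, 8):
    print(f"z-score of {x}: {z_score(x, data):.3f}")
```

An observation equal to the mean gets a z-score of 0, and values above the mean get positive z-scores.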
//////////////////////////////////////////////////////////////////////////////////////////////////////////
Confidence Interval
13. The Bieber Tweeter Example (Lesson 8)
Let’s say we have a population of Klout scores; Klout assigns a score to a person based on
their tweets. The population has N = 1048, µ = 37.72, σ = 16.04. Now we take a sample of
n = 35 people who use an app called Bieber Tweeter and find that their mean score is 40.
We draw the sampling distribution for samples of this size. For this we need its mean and SD
(the SD of a sampling distribution is also known as the standard error), where
mean = µ and SE = σ/√n = σ/√35
Now assume that the population mean after everyone uses the app, µBT, lies in some range. Our
sample mean of 40 comes from a sampling distribution centred on µBT. Let’s say we want to be
95% sure, so we look for the values of µBT on either side of 40 for which 40 still falls within
the middle 95% of the sampling distribution. Use your imagination and look at the image below.
If we calculate the 95% range around 40 we get the plausible values for µBT. This range is
called the confidence interval.
We can say that we are 95% sure that if everyone started using the Bieber Tweeter app, the
population mean would lie within this range.
Point estimate = $\bar{x}_I$ = the sample mean we get.
95% confidence interval = $\bar{x}_I - 1.96\,\sigma/\sqrt{n}$ to $\bar{x}_I + 1.96\,\sigma/\sqrt{n}$
98% confidence interval = $\bar{x}_I - 2.33\,\sigma/\sqrt{n}$ to $\bar{x}_I + 2.33\,\sigma/\sqrt{n}$
Y% confidence interval = $\bar{x}_I - z^{*}\,\sigma/\sqrt{n}$ to $\bar{x}_I + z^{*}\,\sigma/\sqrt{n}$
1.96 and 2.33 are known as the critical values of z for these confidence levels.
The bigger the sample size, the narrower the confidence interval.
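The interval for the Bieber Tweeter numbers can be sketched as below. This is only an illustration using the values from these notes (sample mean 40, σ = 16.04, n = 35); the function name `z_confidence_interval` is made up, and the critical value is taken from `statistics.NormalDist` rather than a z-table, so it is slightly more precise than 1.96 or 2.33.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Two-sided confidence interval for a mean when the population sigma is known."""
    # Critical z value that leaves (1 - confidence) / 2 in each tail
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * sigma / sqrt(n)   # z* times the standard error
    return sample_mean - margin, sample_mean + margin

# Values from the Bieber Tweeter example in these notes
low, high = z_confidence_interval(40, 16.04, 35, confidence=0.95)
print(f"95% CI: {low:.2f} to {high:.2f}")

low98, high98 = z_confidence_interval(40, 16.04, 35, confidence=0.98)
print(f"98% CI: {low98:.2f} to {high98:.2f}")

# A larger (hypothetical) sample with the same mean gives a narrower interval
low_big, high_big = z_confidence_interval(40, 16.04, 250)
print(f"95% CI with n = 250: {low_big:.2f} to {high_big:.2f}")
```

The 98% interval comes out wider than the 95% one, and the interval for n = 250 is narrower than the one for n = 35.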
//////////////////////////////////////////////////////////////////////////////////////////////////////////
Hypothesis Testing
If the probability of getting the mean of a sample under some intervention or treatment (e.g.
the population using Bieber Tweeter) falls below the chosen alpha level, then we say that the
value is significant, i.e. highly unlikely to occur by chance. This would suggest that the
treatment has an effect.
16. Hypothesis
H0 = Null hypothesis: there is no significant difference between the current population
parameter and the new population parameter after the intervention
Ha = H1 = Alternative hypothesis: there is a significant difference
H0 : µ = µI
Ha : µ < µI
or µ > µI
or µ ≠ µI (mu sub I)
We use the alpha level and the critical region to reject (or fail to reject) the null hypothesis.
Rejecting the null hypothesis means the following:
1) Our sample mean falls within the critical region
2) The z-score of the sample mean is beyond the z critical value (greater in magnitude)
3) The probability of getting a sample mean at least this extreme is less than the alpha level
Remember the greater the sample size (while keeping mean constant), the greater will be
the z value and the greater the chances to reject the null. This is because
$$ z = \frac{x - \bar{x}}{s} \quad\longrightarrow\quad z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} $$
Note: You may think it is absurd that increasing the sample size can shift our result from
"fail to reject" to "reject the null". However, in a real scenario, as you increase the sample
size the sample mean gets closer to the true mean, i.e. things take their natural course. Here
we keep the mean constant just to understand theoretically what happens if we increase the
sample size.
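The effect of sample size on z can be sketched with the Bieber Tweeter numbers from these notes (µ = 37.72, σ = 16.04, sample mean held at 40). The sample sizes 100 and 250 are hypothetical, the function names are made up, and a two-tailed test at alpha = 0.05 is assumed purely for illustration.

```python
from math import sqrt
from statistics import NormalDist

MU = 37.72      # population mean from the notes
SIGMA = 16.04   # population standard deviation from the notes
ALPHA = 0.05    # assumed alpha level for this illustration

def z_statistic(sample_mean, n):
    """z-score of a sample mean: (x_bar - mu) / (sigma / sqrt(n))."""
    return (sample_mean - MU) / (SIGMA / sqrt(n))

def two_tailed_p(z):
    """Probability of a z at least this far from 0 in either direction."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Keep the sample mean fixed at 40 and only change n
for n in (35, 100, 250):
    z = z_statistic(40, n)
    p = two_tailed_p(z)
    decision = "reject H0" if p < ALPHA else "fail to reject H0"
    print(f"n = {n:3d}  z = {z:.3f}  p = {p:.4f}  ->  {decision}")
```

With the mean held constant, z grows with the square root of n, so the same difference of 2.28 points moves from "fail to reject" at n = 35 to "reject" at n = 250.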
//////////////////////////////////////////////////////////////////////////////////////////////////////////
T-Tests
21. P value
The p-value is the probability, assuming the null hypothesis is true, of getting a t-statistic
at least as extreme as the one observed. If you notice, in the z-table we get the exact
probability. However, in the t-table we only get a range for the probability. Hence we compute
the p-value to get the exact probability. (This needs review)
In a two-tailed test we consider two t-statistics (+t and -t), and hence we add the p-values
of both tails to get the total probability. If you think about it, we are doing the reverse of
what we do with the alpha level. When we want a 0.05 alpha level in a two-tailed test, we halve
it between the tails. In this case, we are adding, so that we get the total probability in both
tails.
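As a sketch of the two-tailed case, written out in symbols: T here stands for a random variable following the t-distribution with n - 1 degrees of freedom under the null hypothesis, and t is the observed statistic. This notation is an assumption added for illustration and does not come from the lesson itself.

```latex
p_{\text{two-tailed}} = P(T \le -|t|) + P(T \ge |t|) = 2\,P(T \ge |t|)
```

The second equality holds because the t-distribution is symmetric about zero, which is why the two-tailed p-value is double the one-tailed one.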
A contingency table is similar to a frequency table but it considers two variables. It can be
represented by raw numbers or column percentages. Contingency tables are used for categorical
variables. Scatterplots are X-Y plots with the dependent variable on the Y axis (to find out the
dependent variable, just think about the conclusion you want: chocolate consumption increases
body weight, or body weight increases chocolate consumption). Scatterplots are used for
quantitative variables.
A scatterplot can show whether the relationship is linear or curvilinear. Further, it can
describe the relationship as strong if the observations are tightly knit or weak if they are
spread out.
Pearson’s R
$$ r = \frac{\sum z_x z_y}{n - 1} $$
Pearson’s r can be calculated by first calculating the z-score of each observation for both
variables (chocolate consumption and body weight). Multiply the two z-scores of each
observation, add up these products, and then divide the sum by one less than the total number
of observations. This is a tedious calculation by hand and it might take a while to
understand it.
The advantage of Pearson’s r is that it expresses the direction and strength of a linear
correlation in one single number. It is always between -1 and +1. However, it should only be
used on linear relationships, so it is wise to check the scatterplot first to determine whether
the relationship is linear or curvilinear.
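The z-score recipe for Pearson's r can be sketched as follows. The chocolate and weight numbers are invented purely for illustration, and the function name `pearson_r` is hypothetical; the standard deviations use the (n-1) denominator, matching the formula above.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson's r as the sum of z-score products divided by n - 1."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)      # sample SDs (n - 1 denominator)
    z_products = [
        ((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(xs, ys)
    ]
    return sum(z_products) / (len(xs) - 1)

# Hypothetical data, invented for this example
chocolate = [1, 2, 3, 4, 5]       # bars per week
weight = [60, 62, 65, 64, 70]     # kg

print(f"r = {pearson_r(chocolate, weight):.3f}")
```

A perfectly straight increasing relationship gives r = 1 and a perfectly straight decreasing one gives r = -1.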
Regression
A regression line is computed so as to minimize the sum of the squared residuals. A regression
line has the formula
$$ \hat{y} = a + bx, \qquad b = r\left(\frac{s_y}{s_x}\right), \qquad a = \bar{y} - b\,\bar{x} $$
The intercept or a represents the expected value of y when x is zero. The regression coefficient or b
represents the change in y per unit change in x.
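A minimal sketch of fitting the line from r and the two standard deviations, following the formulas above. The data and the function name `fit_line` are made up for illustration.

```python
from statistics import mean, stdev

def fit_line(xs, ys):
    """Return (a, b) for y_hat = a + b*x, using b = r * s_y / s_x and a = y_bar - b * x_bar."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    n = len(xs)
    # Pearson's r from the z-score products
    r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)) / (n - 1)
    b = r * s_y / s_x          # slope: change in y per unit change in x
    a = y_bar - b * x_bar      # intercept: expected y when x is zero
    return a, b

# Hypothetical data, invented for this example
chocolate = [1, 2, 3, 4, 5]       # bars per week
weight = [60, 62, 65, 64, 70]     # kg

a, b = fit_line(chocolate, weight)
print(f"y_hat = {a:.2f} + {b:.2f} * x")
print(f"predicted weight at 6 bars per week: {a + b * 6:.2f}")
```

The last line extrapolates beyond the observed x values, which is only reasonable if the linear pattern is believed to continue there.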
About 95% of values fall within 2 standard deviations of the mean (for sample means, 2 standard
errors, i.e. the margin of error).
If 40 is one of those values, then mu - 2*sigma/root(n) < 40 < mu + 2*sigma/root(n)
Now solve this inequality for mu.
We will get the range for the population mean.