What is statistics?
INTRODUCTION TO STATISTICS IN PYTHON
Maggie Matsui
Content Developer, DataCamp
The field of statistics - the practice and study of collecting and analyzing data
A summary statistic - a fact about or summary of some data
What can statistics do?
How likely is someone to purchase a product? Are people more likely to purchase it if they
can use a different payment system?
How many occupants will your hotel have? How can you optimize occupancy?
How many sizes of jeans need to be manufactured so they can fit 95% of the population?
Should the same number of each size be produced?
A/B tests: Which ad is more effective in getting people to purchase a product?
What can't statistics do?
Why is Game of Thrones so popular?
Instead...
Are series with more violent scenes viewed by more people?
But...
Even so, this can't tell us if more violent scenes lead to more views
Types of statistics
Descriptive statistics: describe and summarize data
Example: 50% of friends drive to work, 25% take the bus, 25% bike
Inferential statistics: use a sample of data to make inferences about a larger population
Example: What percent of people drive to work?
Types of data
Numeric (Quantitative)
Continuous (Measured): airplane speed, time spent waiting in line
Discrete (Counted): number of pets, number of packages shipped
Categorical (Qualitative)
Nominal (Unordered): married/unmarried, country of residence
Ordinal (Ordered)
Categorical data can be represented as numbers
Nominal (Unordered)
Married/unmarried (1/0)
Country of residence (1, 2, ...)
Ordinal (Ordered)
Strongly disagree (1)
Somewhat disagree (2)
Neither agree nor disagree (3)
Somewhat agree (4)
Strongly agree (5)
Why does data type matter?
Summary statistics Plots
import numpy as np
np.mean(car_speeds['speed_mph'])
40.09062
demographics['marriage_status'].value_counts()
single 188
married 143
divorced 124
dtype: int64
Measures of center
Mammal sleep data
print(msleep)
name genus vore order ... sleep_cycle awake brainwt bodywt
1 Cheetah Acinonyx carni Carnivora ... NaN 11.9 NaN 50.000
2 Owl monkey Aotus omni Primates ... NaN 7.0 0.01550 0.480
3 Mountain beaver Aplodontia herbi Rodentia ... NaN 9.6 NaN 1.350
4 Greater short-ta... Blarina omni Soricomorpha ... 0.133333 9.1 0.00029 0.019
5 Cow Bos herbi Artiodactyla ... 0.666667 20.0 0.42300 600.000
.. ... ... ... ... ... ... ... ... ...
79 Tree shrew Tupaia omni Scandentia ... 0.233333 15.1 0.00250 0.104
80 Bottle-nosed do... Tursiops carni Cetacea ... NaN 18.8 NaN 173.330
81 Genet Genetta carni Carnivora ... NaN 17.7 0.01750 2.000
82 Arctic fox Vulpes carni Carnivora ... NaN 11.5 0.04450 3.380
83 Red fox Vulpes carni Carnivora ... 0.350000 14.2 0.05040 4.230
Histograms
How long do mammals in this dataset typically sleep?
What's a typical value?
Where is the center of the data?
Mean
Median
Mode
Measures of center: mean
name sleep_total
1 Cheetah 12.1
2 Owl monkey 17.0
3 Mountain beaver 14.4
4 Greater short-t... 14.9
5 Cow 4.0
.. ... ...
Mean sleep time = (12.1 + 17.0 + 14.4 + 14.9 + ...) / 83 = 10.43
import numpy as np
np.mean(msleep['sleep_total'])
10.43373
Measures of center: median
msleep['sleep_total'].sort_values()
29 1.9
30 2.7
22 2.9
9 3.0
23 3.1
...
19 18.0
61 18.1
36 19.4
21 19.7
42 19.9
msleep['sleep_total'].sort_values().iloc[41]
10.1
np.median(msleep['sleep_total'])
10.1
Measures of center: mode
Most frequent value
msleep['sleep_total'].value_counts()
12.5 4
10.1 3
14.9 2
11.0 2
8.4 2
...
14.3 1
17.0 1
Name: sleep_total, Length: 65, dtype: int64
msleep['vore'].value_counts()
herbi 32
omni 20
carni 19
insecti 5
Name: vore, dtype: int64
import statistics
statistics.mode(msleep['vore'])
'herbi'
Adding an outlier
msleep[msleep['vore'] == 'insecti']
name genus vore order sleep_total
22 Big brown bat Eptesicus insecti Chiroptera 19.7
43 Little brown bat Myotis insecti Chiroptera 19.9
62 Giant armadillo Priodontes insecti Cingulata 18.1
67 Eastern american mole Scalopus insecti Soricomorpha 8.4
msleep[msleep['vore'] == "insecti"]['sleep_total'].agg([np.mean, np.median])
mean 16.53
median 18.9
Name: sleep_total, dtype: float64
msleep[msleep['vore'] == 'insecti']
name genus vore order sleep_total
22 Big brown bat Eptesicus insecti Chiroptera 19.7
43 Little brown bat Myotis insecti Chiroptera 19.9
62 Giant armadillo Priodontes insecti Cingulata 18.1
67 Eastern american mole Scalopus insecti Soricomorpha 8.4
84 Mystery insectivore ... insecti ... 0.0
msleep[msleep['vore'] == "insecti"]['sleep_total'].agg([np.mean, np.median])
mean 13.22
median 18.1
Name: sleep_total, dtype: float64
Mean: 16.5 → 13.2
Median: 18.9 → 18.1
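The shift in mean and median can be reproduced from the four sleep_total values in the insectivore table above. This is a small sketch; the list names are chosen here for illustration.

```python
import numpy as np

# sleep_total values for the four insectivores shown in the table above
insecti_sleep = [19.7, 19.9, 18.1, 8.4]
# The same values plus the mystery insectivore that sleeps 0.0 hours
with_outlier = insecti_sleep + [0.0]

mean_before, median_before = np.mean(insecti_sleep), np.median(insecti_sleep)
mean_after, median_after = np.mean(with_outlier), np.median(with_outlier)

print(f"Mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"Median: {median_before:.2f} -> {median_after:.2f}")
```

The mean drops by more than 3 hours while the median moves by less than 1, which is why the median is usually preferred when data contain outliers or are skewed.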
Which measure to use?
Skew
Left-skewed Right-skewed
Measures of spread
What is spread?
Variance
Average squared distance from each data point to the data's mean
Calculating variance
1. Subtract mean from each data point
dists = msleep['sleep_total'] - np.mean(msleep['sleep_total'])
print(dists)
0 1.666265
1 6.566265
2 3.966265
3 4.466265
4 -6.433735
...
2. Square each distance
sq_dists = dists ** 2
print(sq_dists)
0 2.776439
1 43.115837
2 15.731259
3 19.947524
4 41.392945
...
3. Sum squared distances
sum_sq_dists = np.sum(sq_dists)
print(sum_sq_dists)
1624.065542
4. Divide by number of data points - 1
variance = sum_sq_dists / (83 - 1)
print(variance)
19.805677
Use np.var()
np.var(msleep['sleep_total'], ddof=1)
19.805677
Without ddof=1, population variance is calculated instead of sample variance:
np.var(msleep['sleep_total'])
19.567055
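The four steps can be checked against np.var() on a small sample. The sketch below uses made-up values (not msleep data) so that the arithmetic is easy to follow by hand.

```python
import numpy as np

# Hypothetical data points (illustrative values, not from msleep)
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

dists = values - np.mean(values)               # 1. subtract the mean (5.0)
sq_dists = dists ** 2                          # 2. square each distance
sum_sq_dists = np.sum(sq_dists)                # 3. sum the squared distances
sample_var = sum_sq_dists / (len(values) - 1)  # 4. divide by n - 1

# Sample variance computed both ways
print(sample_var, np.var(values, ddof=1))
# Population variance divides by n instead of n - 1
print(sum_sq_dists / len(values), np.var(values))
```

Dividing by n - 1 rather than n matters most for small samples; as n grows, the two results converge.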
Standard deviation
np.sqrt(np.var(msleep['sleep_total'], ddof=1))
4.450357
np.std(msleep['sleep_total'], ddof=1)
4.450357
Mean absolute deviation
dists = msleep['sleep_total'] - np.mean(msleep['sleep_total'])
np.mean(np.abs(dists))
3.566701
Standard deviation vs. mean absolute deviation
Standard deviation squares distances, penalizing longer distances more than shorter ones.
Mean absolute deviation penalizes each distance equally.
One isn't better than the other, but SD is more common than MAD.
Quantiles
np.quantile(msleep['sleep_total'], 0.5)
10.1
0.5 quantile = median
Quartiles:
np.quantile(msleep['sleep_total'], [0, 0.25, 0.5, 0.75, 1])
array([ 1.9 , 7.85, 10.1 , 13.75, 19.9 ])
Boxplots use quartiles
import matplotlib.pyplot as plt
plt.boxplot(msleep['sleep_total'])
plt.show()
Quantiles using np.linspace()
np.quantile(msleep['sleep_total'], [0, 0.2, 0.4, 0.6, 0.8, 1])
array([ 1.9 , 6.24, 9.48, 11.14, 14.4 , 19.9 ])
np.linspace(start, stop, num)
np.quantile(msleep['sleep_total'], np.linspace(0, 1, 5))
array([ 1.9 , 7.85, 10.1 , 13.75, 19.9 ])
Interquartile range (IQR)
Height of the box in a boxplot
np.quantile(msleep['sleep_total'], 0.75) - np.quantile(msleep['sleep_total'], 0.25)
5.9
from scipy.stats import iqr
iqr(msleep['sleep_total'])
5.9
Outliers
Outlier: data point that is substantially different from the others
How do we know what a substantial difference is? A data point is an outlier if:
data < Q1 − 1.5 × IQR or
data > Q3 + 1.5 × IQR
Finding outliers
from scipy.stats import iqr
# Store the result under a new name so the imported iqr() function is not overwritten
bodywt_iqr = iqr(msleep['bodywt'])
lower_threshold = np.quantile(msleep['bodywt'], 0.25) - 1.5 * bodywt_iqr
upper_threshold = np.quantile(msleep['bodywt'], 0.75) + 1.5 * bodywt_iqr
msleep[(msleep['bodywt'] < lower_threshold) | (msleep['bodywt'] > upper_threshold)]
name vore sleep_total bodywt
4 Cow herbi 4.0 600.000
20 Asian elephant herbi 3.9 2547.000
22 Horse herbi 2.9 521.000
...
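The same 1.5 × IQR rule can be tried on a plain NumPy array. The weights below are hypothetical and only meant to make the thresholds easy to inspect.

```python
import numpy as np

# Hypothetical body weights in kg (illustrative values, not from msleep)
weights = np.array([0.5, 1.2, 2.0, 3.4, 4.2, 5.0, 50.0, 600.0])

q1, q3 = np.quantile(weights, [0.25, 0.75])
iqr_value = q3 - q1  # interquartile range
lower_threshold = q1 - 1.5 * iqr_value
upper_threshold = q3 + 1.5 * iqr_value

# Keep only the points that fall outside the two thresholds
outliers = weights[(weights < lower_threshold) | (weights > upper_threshold)]
print(lower_threshold, upper_threshold)
print(outliers)
```

Because all weights are positive and the lower threshold is negative here, only unusually large values can be flagged, which is typical for right-skewed data such as body weight.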
All in one go
msleep['bodywt'].describe()
count 83.000000
mean 166.136349
std 786.839732
min 0.005000
25% 0.174000
50% 1.670000
75% 41.750000
max 6654.000000
Name: bodywt, dtype: float64
What are the chances?
Measuring chance
What's the probability of an event?
P(event) = (# ways event can happen) / (total # of possible outcomes)
Example: a coin flip
P(heads) = (1 way to get heads) / (2 possible outcomes) = 1/2 = 50%
Assigning salespeople
P(Brian) = 1/4 = 25%
Sampling from a DataFrame
print(sales_counts)
name n_sales
0 Amir 178
1 Brian 128
2 Claire 75
3 Damian 69
sales_counts.sample()
name n_sales
1 Brian 128
sales_counts.sample()
name n_sales
2 Claire 75
Setting a random seed
np.random.seed(10)
sales_counts.sample()
name n_sales
1 Brian 128
np.random.seed(10)
sales_counts.sample()
name n_sales
1 Brian 128
np.random.seed(10)
sales_counts.sample()
name n_sales
1 Brian 128
A second meeting
Sampling without replacement
P(Claire) = 1/3 = 33%
Sampling twice in Python
sales_counts.sample(2)
name n_sales
1 Brian 128
2 Claire 75
Sampling with replacement
P(Claire) = 1/4 = 25%
Sampling with/without replacement in Python
sales_counts.sample(5, replace = True)
name n_sales
1 Brian 128
2 Claire 75
1 Brian 128
3 Damian 69
0 Amir 178
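The two sampling modes can be compared side by side. This sketch rebuilds the sales_counts table from the printout earlier in this section and passes random_state to sample(), an alternative to np.random.seed() that is used here only to make the result repeatable.

```python
import pandas as pd

# The sales_counts table as printed earlier in this section
sales_counts = pd.DataFrame({
    "name": ["Amir", "Brian", "Claire", "Damian"],
    "n_sales": [178, 128, 75, 69],
})

# Without replacement (the default): each row can be picked at most once
without_repl = sales_counts.sample(2, random_state=10)

# With replacement: rows can repeat, so the sample can exceed the table size
with_repl = sales_counts.sample(5, replace=True, random_state=10)

print(without_repl)
print(with_repl)
```

Asking for 5 rows from a 4-row table without replace=True raises an error, since there are not enough distinct rows to draw.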
Independent events
Two events are independent if the probability
of the second event isn't affected by the
outcome of the first event.
Sampling with replacement = each pick is
independent
Dependent events
Two events are dependent if the probability
of the second event is affected by the
outcome of the first event.
Sampling without replacement = each pick is
dependent
Discrete distributions
Rolling the dice
Choosing salespeople
Probability distribution
Describes the probability of each possible outcome in a scenario
Expected value: mean of a probability distribution
Expected value of a fair die roll =
(1 × 1/6) + (2 × 1/6) + (3 × 1/6) + (4 × 1/6) + (5 × 1/6) + (6 × 1/6) = 3.5
Visualizing a probability distribution
Probability = area
P(die roll ≤ 2) = 1/3
Uneven die
Expected value of uneven die roll =
(1 × 1/6) + (2 × 0) + (3 × 1/3) + (4 × 1/6) + (5 × 1/6) + (6 × 1/6) = 3.67
Visualizing uneven probabilities
Adding areas
P(uneven die roll ≤ 2) = 1/6
Discrete probability distributions
Describe probabilities for discrete outcomes
Fair die: discrete uniform distribution
Uneven die
Sampling from discrete distributions
print(die)
number prob
0 1 0.166667
1 2 0.166667
2 3 0.166667
3 4 0.166667
4 5 0.166667
5 6 0.166667
np.mean(die['number'])
3.5
rolls_10 = die.sample(10, replace = True)
rolls_10
number prob
0 1 0.166667
0 1 0.166667
4 5 0.166667
1 2 0.166667
0 1 0.166667
0 1 0.166667
5 6 0.166667
5 6 0.166667
...
Visualizing a sample
rolls_10['number'].hist(bins=np.linspace(1,7,7))
plt.show()
Sample distribution vs. theoretical distribution
Sample of 10 rolls: np.mean(rolls_10['number']) = 3.0
Theoretical probability distribution: np.mean(die['number']) = 3.5
A bigger sample
Sample of 100 rolls: np.mean(rolls_100['number']) = 3.4
Theoretical probability distribution: np.mean(die['number']) = 3.5
An even bigger sample
Sample of 1000 rolls: np.mean(rolls_1000['number']) = 3.48
Theoretical probability distribution: np.mean(die['number']) = 3.5
Law of large numbers
As the size of your sample increases, the sample mean will approach the expected value.
Sample size Mean
10 3.00
100 3.40
1000 3.48
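The law of large numbers can be observed by simulating rolls at increasing sample sizes. The sketch below uses NumPy's random generator rather than die.sample(), and both the seed and the largest sample size are arbitrary choices made here.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed so the run is repeatable
faces = np.arange(1, 7)               # outcomes of a fair six-sided die
expected_value = np.mean(faces)       # (1 + 2 + ... + 6) / 6 = 3.5

sample_means = {}
for n_rolls in [10, 100, 1000, 100_000]:
    rolls = rng.choice(faces, size=n_rolls, replace=True)
    sample_means[n_rolls] = np.mean(rolls)

print(expected_value)
print(sample_means)
```

Individual small samples can land well away from 3.5, but the gap tends to shrink as the number of rolls grows.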
Continuous distributions
Waiting for the bus
Continuous uniform distribution
Probability still = area
P (4 ≤ wait time ≤ 7) = 3 × 1/12 = 3/12
Uniform distribution in Python
P (wait time ≤ 7)
from scipy.stats import uniform
uniform.cdf(7, 0, 12)
0.5833333
"Greater than" probabilities
P (wait time ≥ 7) = 1 − P (wait time ≤ 7)
from scipy.stats import uniform
1 - uniform.cdf(7, 0, 12)
0.4166667
P (4 ≤ wait time ≤ 7)
from scipy.stats import uniform
uniform.cdf(7, 0, 12) - uniform.cdf(4, 0, 12)
0.25
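One detail worth spelling out is what the second and third arguments mean. In scipy.stats.uniform they are loc (the lower bound) and scale (the width), so the distribution covers loc to loc + scale. The sketch below names them explicitly; the variable names are chosen here for illustration.

```python
from scipy.stats import uniform

# loc is the lower bound and scale is the width of the interval,
# so waits between 0 and 12 minutes use loc=0, scale=12
low, width = 0, 12

p_at_most_7 = uniform.cdf(7, low, width)       # P(wait <= 7)
p_at_least_7 = 1 - uniform.cdf(7, low, width)  # P(wait >= 7)
p_4_to_7 = uniform.cdf(7, low, width) - uniform.cdf(4, low, width)

print(p_at_most_7, p_at_least_7, p_4_to_7)
```

Since the lower bound here is 0, the width and the upper bound are both 12. For waits between 2 and 10 minutes, the call would be uniform.cdf(x, 2, 8), not uniform.cdf(x, 2, 10).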
Total area = 1
P(0 ≤ wait time ≤ 12) = 12 × 1/12 = 1
Generating random numbers according to uniform distribution
from scipy.stats import uniform
uniform.rvs(0, 5, size=10)
array([1.89740094, 4.70673196, 0.33224683, 1.0137103 , 2.31641255,
3.49969897, 0.29688598, 0.92057234, 4.71086658, 1.56815855])
Other continuous distributions
Other special types of distributions
Normal distribution Exponential distribution
The binomial distribution
Coin flipping
Binary outcomes
A single flip
binom.rvs(# of coins, probability of heads/success, size=# of trials)
1 = heads, 0 = tails
from scipy.stats import binom
binom.rvs(1, 0.5, size=1)
array([1])
One flip many times
binom.rvs(1, 0.5, size=8)
array([0, 1, 1, 0, 1, 0, 1, 1])
Many flips one time
binom.rvs(8, 0.5, size=1)
array([5])
Many flips many times
binom.rvs(3, 0.5, size=10)
array([0, 3, 2, 1, 3, 0, 2, 2, 0, 0])
Other probabilities
binom.rvs(3, 0.25, size=10)
array([1, 1, 1, 1, 0, 0, 2, 0, 1, 0])
Binomial distribution
Probability distribution of the number of
successes in a sequence of independent
trials
E.g. Number of heads in a sequence of coin
flips
Described by n and p
n: total number of trials
p: probability of success
What's the probability of 7 heads?
P (heads = 7)
# binom.pmf(num heads, num trials, prob of heads)
binom.pmf(7, 10, 0.5)
0.1171875
What's the probability of 7 or fewer heads?
P (heads ≤ 7)
binom.cdf(7, 10, 0.5)
0.9453125
What's the probability of more than 7 heads?
P (heads > 7)
1 - binom.cdf(7, 10, 0.5)
0.0546875
Expected value
Expected value = n × p
Expected number of heads out of 10 flips = 10 × 0.5 = 5
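The binomial results above can be cross-checked by hand. The sketch below uses the standard counting formula C(n, k) × p^k × (1 − p)^(n − k), which the slides do not state; the variable names are illustrative.

```python
from math import comb

from scipy.stats import binom

n, p = 10, 0.5  # 10 flips of a fair coin

# P(heads = 7) from the counting formula: C(n, k) * p^k * (1 - p)^(n - k)
pmf_by_hand = comb(n, 7) * p**7 * (1 - p) ** (n - 7)
pmf_scipy = binom.pmf(7, n, p)

# P(heads <= 7) is the pmf summed over 0, 1, ..., 7 heads
cdf_by_sum = sum(binom.pmf(k, n, p) for k in range(8))

expected_heads = binom.mean(n, p)  # equals n * p

print(pmf_by_hand, pmf_scipy, cdf_by_sum, expected_heads)
```

Summing the pmf reproduces binom.cdf(7, 10, 0.5), which shows that the cdf is a running total of the individual probabilities.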
Independence
The binomial distribution is a probability
distribution of the number of successes in a
sequence of independent trials
Probabilities of second trial are altered due to
outcome of the first
If trials are not independent, the binomial
distribution does not apply!
The normal distribution
What is the normal distribution?
Symmetrical
Area = 1
Curve never hits 0
Described by mean and standard deviation
Mean: 20
Standard deviation: 3
Standard normal distribution
Mean: 0
Standard deviation: 1
Areas under the normal distribution
68% falls within 1 standard deviation
95% falls within 2 standard deviations
99.7% falls within 3 standard deviations
Lots of histograms look normal
Normal distribution Women's heights from NHANES
Mean: 161 cm Standard deviation: 7 cm
Approximating data with the normal distribution
What percent of women are shorter than 154 cm?
from scipy.stats import norm
norm.cdf(154, 161, 7)
0.158655
16% of women in the survey are shorter than
154 cm
What percent of women are taller than 154 cm?
from scipy.stats import norm
1 - norm.cdf(154, 161, 7)
0.841345
What percent of women are 154-157 cm?
norm.cdf(157, 161, 7) - norm.cdf(154, 161, 7)
0.1252
What height are 90% of women shorter than?
norm.ppf(0.9, 161, 7)
169.97086
What height are 90% of women taller than?
norm.ppf((1-0.9), 161, 7)
152.029
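norm.cdf() and norm.ppf() are inverses of each other: one maps a height to a share of the distribution, the other maps a share back to a height. The sketch below shows the round trip with the same mean and standard deviation, and also checks the 68% rule; the variable names are illustrative.

```python
from scipy.stats import norm

mean_cm, sd_cm = 161, 7  # women's heights, as given above

# cdf maps a height to the share of the distribution below it
share_below_154 = norm.cdf(154, mean_cm, sd_cm)

# ppf is the inverse of cdf: it maps a share back to a height
height_back = norm.ppf(share_below_154, mean_cm, sd_cm)

# Share within 1 standard deviation of the mean (the 68% rule)
within_1_sd = (norm.cdf(mean_cm + sd_cm, mean_cm, sd_cm)
               - norm.cdf(mean_cm - sd_cm, mean_cm, sd_cm))

print(share_below_154, height_back, within_1_sd)
```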
Generating random numbers
# Generate 10 random heights
norm.rvs(161, 7, size=10)
array([155.5758223 , 155.13133235, 160.06377097, 168.33345778,
165.92273375, 163.32677057, 165.13280753, 146.36133538,
149.07845021, 160.5790856 ])
The central limit theorem
Rolling the dice 5 times
import pandas as pd
die = pd.Series([1, 2, 3, 4, 5, 6])
# Roll 5 times
samp_5 = die.sample(5, replace=True)
print(samp_5)
array([3, 1, 4, 1, 1])
np.mean(samp_5)
2.0
# Roll 5 times and take mean
samp_5 = die.sample(5, replace=True)
np.mean(samp_5)
4.4
samp_5 = die.sample(5, replace=True)
np.mean(samp_5)
3.8
Rolling the dice 5 times 10 times
Repeat 10 times:
Roll 5 times
Take the mean
sample_means = []
for i in range(10):
    samp_5 = die.sample(5, replace=True)
    sample_means.append(np.mean(samp_5))
print(sample_means)
[3.8, 4.0, 3.8, 3.6, 3.2, 4.8, 2.6, 3.0, 2.6, 2.0]
Sampling distributions
Sampling distribution of the sample mean
100 sample means
sample_means = []
for i in range(100):
    sample_means.append(np.mean(die.sample(5, replace=True)))
1000 sample means
sample_means = []
for i in range(1000):
    sample_means.append(np.mean(die.sample(5, replace=True)))
Central limit theorem
The sampling distribution of a statistic becomes closer to the normal distribution as the
number of trials increases.
* Samples should be random and independent
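A repeatable version of the 1000-sample simulation might look like the sketch below. Passing random_state=i to sample() is an addition made here so that every run gives the same result; it is not part of the code above.

```python
import numpy as np
import pandas as pd

die = pd.Series([1, 2, 3, 4, 5, 6])

# Draw 1000 samples of 5 rolls each and record every sample mean.
# random_state=i gives each sample its own fixed seed.
sample_means = [
    np.mean(die.sample(5, replace=True, random_state=i)) for i in range(1000)
]

print(np.mean(sample_means))  # center of the sampling distribution
print(np.std(sample_means))   # spread of the sampling distribution
```

A histogram of sample_means should look roughly bell-shaped and centered near 3.5, even though a single die roll follows a flat, uniform distribution.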
Standard deviation and the CLT
sample_sds = []
for i in range(1000):
    sample_sds.append(np.std(die.sample(5, replace=True)))
Proportions and the CLT
sales_team = pd.Series(["Amir", "Brian", "Claire", "Damian"])
sales_team.sample(10, replace=True)
array(['Claire', 'Damian', 'Brian', 'Damian', 'Damian', 'Amir', 'Amir', 'Amir',
'Amir', 'Damian'], dtype=object)
sales_team.sample(10, replace=True)
array(['Brian', 'Amir', 'Brian', 'Claire', 'Brian', 'Damian', 'Claire', 'Brian',
'Claire', 'Claire'], dtype=object)
Sampling distribution of proportion
Mean of sampling distribution
# Estimate expected value of die
np.mean(sample_means)
3.48
# Estimate proportion of "Claire"s
np.mean(sample_props)
0.26
Estimate characteristics of unknown underlying distribution
More easily estimate characteristics of large populations
The Poisson distribution
Poisson processes
Events appear to happen at a certain rate,
but completely at random
Examples
Number of animals adopted from an
animal shelter per week
Number of people arriving at a
restaurant per hour
Number of earthquakes in California per
year
Time unit is irrelevant, as long as you use
the same unit when talking about the same
situation
Poisson distribution
Probability of some # of events occurring over a fixed period of time
Examples
Probability of ≥ 5 animals adopted from an animal shelter per week
Probability of 12 people arriving at a restaurant per hour
Probability of < 20 earthquakes in California per year
Lambda (λ)
λ = average number of events per time interval
Average number of adoptions per week = 8
Lambda is the distribution's peak
Probability of a single value
If the average number of adoptions per week is 8, what is P (# adoptions in a week = 5)?
from scipy.stats import poisson
poisson.pmf(5, 8)
0.09160366
Probability of less than or equal to
If the average number of adoptions per week is 8, what is P (# adoptions in a week ≤ 5)?
from scipy.stats import poisson
poisson.cdf(5, 8)
0.1912361
Probability of greater than
If the average number of adoptions per week is 8, what is P (# adoptions in a week > 5)?
1 - poisson.cdf(5, 8)
0.8087639
If the average number of adoptions per week is 10, what is P (# adoptions in a week > 5)?
1 - poisson.cdf(5, 10)
0.932914
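The Poisson probabilities above can be verified against the textbook formula λ^k × e^(−λ) / k!, which the slides do not show. The sketch below uses λ = 8 as in the adoption example; the variable names are illustrative.

```python
from math import exp, factorial

from scipy.stats import poisson

lam = 8  # average number of adoptions per week, as above

# pmf from the Poisson formula: lam^k * e^(-lam) / k!
pmf_by_hand = lam**5 * exp(-lam) / factorial(5)
pmf_scipy = poisson.pmf(5, lam)

p_at_most_5 = poisson.cdf(5, lam)
p_more_than_5 = 1 - p_at_most_5  # complement of "5 or fewer"

print(pmf_by_hand, pmf_scipy, p_at_most_5, p_more_than_5)
```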
Sampling from a Poisson distribution
from scipy.stats import poisson
poisson.rvs(8, size=10)
array([ 9, 9, 8, 7, 11, 3, 10, 6, 8, 14])
The CLT still applies!
More probability distributions
Exponential distribution
Probability of time between Poisson events
Examples
Probability of > 1 day between adoptions
Probability of < 10 minutes between restaurant arrivals
Probability of 6-8 months between earthquakes
Also uses lambda (rate)
Continuous (time)
Customer service requests
On average, one customer service ticket is created every 2 minutes
λ = 0.5 customer service tickets created each minute
Lambda in exponential distribution
Expected value of exponential distribution
In terms of rate (Poisson):
λ = 0.5 requests per minute
In terms of time between events (exponential):
1/λ = 1 request per 2 minutes
1/0.5 = 2
How long until a new request is created?
from scipy.stats import expon
scale = 1/λ = 1/0.5 = 2
P(wait < 1 min) =
expon.cdf(1, scale=2)
0.3934693402873666
P(wait > 4 min) =
1 - expon.cdf(4, scale=2)
0.1353352832366127
P(1 min < wait < 4 min) =
expon.cdf(4, scale=2) - expon.cdf(1, scale=2)
0.4711953764760207
(Student's) t-distribution
Similar shape as the normal distribution
Degrees of freedom
Has parameter degrees of freedom (df) which affects the thickness of the tails
Lower df = thicker tails, higher standard deviation
Higher df = closer to normal distribution
Log-normal distribution
Variable whose logarithm is normally
distributed
Examples:
Length of chess games
Adult blood pressure
Number of hospitalizations in the 2003
SARS outbreak
Correlation
Relationships between two variables
x = explanatory/independent variable
y = response/dependent variable
Correlation coefficient
Quantifies the linear relationship between two variables
Number between -1 and 1
Magnitude corresponds to strength of relationship
Sign (+ or -) corresponds to direction of relationship
Magnitude = strength of relationship
0.99 (very strong relationship)
0.75 (strong relationship)
0.56 (moderate relationship)
0.21 (weak relationship)
0.04 (no relationship): knowing the value of x doesn't tell us anything about y
Sign = direction
0.75: as x increases, y increases
-0.75: as x increases, y decreases
Visualizing relationships
import seaborn as sns
sns.scatterplot(x="sleep_total", y="sleep_rem", data=msleep)
plt.show()
Adding a trendline
import seaborn as sns
sns.lmplot(x="sleep_total", y="sleep_rem", data=msleep, ci=None)
plt.show()
Computing correlation
msleep['sleep_total'].corr(msleep['sleep_rem'])
0.751755
msleep['sleep_rem'].corr(msleep['sleep_total'])
0.751755
Many ways to calculate correlation
Used in this course: Pearson product-moment correlation (r )
Most common
x̄ = mean of x
σx = standard deviation of x
$$r = \sum_{i=1}^{n} \frac{(x_i - \bar{x})(y_i - \bar{y})}{\sigma_x \times \sigma_y}$$
Variations on this formula:
Kendall's tau
Spearman's rho
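The Pearson coefficient can be computed by hand and compared with pandas. This sketch assumes the sum in the formula above is divided by n − 1 and that the sample standard deviation is used, which is the normalization that matches .corr(). The data are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical paired observations (illustrative values, not from msleep)
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.1, 3.9, 6.2, 7.8, 10.1])

n = len(x)
# Sum of products of deviations, scaled by the sample standard deviations
# and by n - 1 (the normalization assumed here)
r_by_hand = np.sum((x - x.mean()) * (y - y.mean())) / ((n - 1) * x.std() * y.std())

r_pearson = x.corr(y)                      # pandas default: Pearson
r_spearman = x.corr(y, method="spearman")  # rank-based alternative
r_kendall = x.corr(y, method="kendall")    # concordance-based alternative

print(r_by_hand, r_pearson, r_spearman, r_kendall)
```

Since y increases every time x does, the rank-based coefficients are exactly 1, while Pearson's r is slightly below 1 because the points do not fall on a perfectly straight line.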
Correlation caveats
Non-linear relationships
r = 0.18
What we see: What the correlation coefficient sees:
Correlation only accounts for linear relationships
Correlation shouldn't be used blindly
Always visualize your data
df['x'].corr(df['y'])
0.081094
Mammal sleep data
print(msleep)
name genus vore order ... sleep_cycle awake brainwt bodywt
1 Cheetah Acinonyx carni Carnivora ... NaN 11.9 NaN 50.000
2 Owl monkey Aotus omni Primates ... NaN 7.0 0.01550 0.480
3 Mountain beaver Aplodontia herbi Rodentia ... NaN 9.6 NaN 1.350
4 Greater short-ta... Blarina omni Soricomorpha ... 0.133333 9.1 0.00029 0.019
5 Cow Bos herbi Artiodactyla ... 0.666667 20.0 0.42300 600.000
.. ... ... ... ... ... ... ... ... ...
79 Tree shrew Tupaia omni Scandentia ... 0.233333 15.1 0.00250 0.104
80 Bottle-nosed do... Tursiops carni Cetacea ... NaN 18.8 NaN 173.330
81 Genet Genetta carni Carnivora ... NaN 17.7 0.01750 2.000
82 Arctic fox Vulpes carni Carnivora ... NaN 11.5 0.04450 3.380
83 Red fox Vulpes carni Carnivora ... 0.350000 14.2 0.05040 4.230
Body weight vs. awake time
msleep['bodywt'].corr(msleep['awake'])
0.3119801
Distribution of body weight
Log transformation
msleep['log_bodywt'] = np.log(msleep['bodywt'])
sns.lmplot(x='log_bodywt',
y='awake',
data=msleep,
ci=None)
plt.show()
msleep['log_bodywt'].corr(msleep['awake'])
0.5687943
Other transformations
Log transformation ( log(x) )
Square root transformation ( sqrt(x) )
Reciprocal transformation ( 1 / x )
Combinations of these, e.g.:
log(x) and log(y)
sqrt(x) and 1 / y
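The transformations listed above can be compared side by side. The sketch below uses made-up, heavily skewed data in which y is linear in log(x) by construction, so the log transformation should give the strongest linear correlation. The data and the `candidates` dictionary are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed data: y is linear in log(x), not in x
x = pd.Series(np.exp(np.linspace(0, 10, 50)))
y = 2 * np.log(x) + 3

# Candidate transformations of x to compare
candidates = {
    'x': x,
    'log(x)': np.log(x),
    'sqrt(x)': np.sqrt(x),
    '1 / x': 1 / x,
}

# Correlation of each transformed x with y
correlations = {name: values.corr(y) for name, values in candidates.items()}
for name, r in correlations.items():
    print(f"{name:>8}: {r:.3f}")
```

Which transformation works best depends on the data; here the answer is known only because the data was generated that way.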
Why use a transformation?
Certain statistical methods rely on variables having a linear relationship
Correlation coefficient
Linear regression
Introduction to Linear Modeling in Python
Correlation does not imply causation
x is correlated with y does not mean x causes y
Confounding
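Confounding can be simulated directly. In the sketch below, a hypothetical third variable z drives both x and y, while x has no effect on y. The two still end up strongly correlated, and the correlation vanishes once z's contribution is removed. All variables and coefficients are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical confounder z drives both x and y; x never affects y
z = rng.normal(size=n)
x = z + rng.normal(scale=0.5, size=n)
y = z + rng.normal(scale=0.5, size=n)

# x and y are strongly correlated even though neither causes the other
r_xy = np.corrcoef(x, y)[0, 1]

# Removing z's contribution (known here by construction) leaves only noise
r_after = np.corrcoef(x - z, y - z)[0, 1]

print(f"corr(x, y) = {r_xy:.2f}, after accounting for z = {r_after:.2f}")
```

In real data the confounder's contribution is not known exactly and has to be estimated, so this is a best-case picture.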
Design of experiments
Maggie Matsui
Content Developer, DataCamp
Vocabulary
Experiment aims to answer: What is the effect of the treatment on the response?
Treatment: explanatory/independent variable
Response: response/dependent variable
E.g.: What is the effect of an advertisement on the number of products purchased?
Treatment: advertisement
Response: number of products purchased
Controlled experiments
Participants are assigned by researchers to either treatment group or control group
Treatment group sees advertisement
Control group does not
Groups should be comparable so that causation can be inferred
If groups are not comparable, this could lead to confounding (bias)
Treatment group average age: 25
Control group average age: 50
Age is a potential confounder
The gold standard of experiments will use...
Randomized controlled trial
Participants are assigned to treatment/control randomly, not based on any other
characteristics
Choosing randomly helps ensure that groups are comparable
Placebo
Resembles treatment, but has no effect
Participants will not know which group they're in
In clinical trials, a sugar pill ensures that the effect of the drug is actually due to the drug
itself and not the idea of receiving the drug
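The random assignment described above can be sketched as follows. Participants with made-up ages are split at random into treatment and control, and the two groups end up with similar average ages, unlike the 25 versus 50 example of non-comparable groups. The ages, group sizes, and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# Hypothetical participant ages
ages = rng.integers(18, 80, size=n)

# Randomly assign half of the participants to treatment
in_treatment = np.zeros(n, dtype=bool)
in_treatment[rng.choice(n, size=n // 2, replace=False)] = True

# Compare the average age of the two groups
treatment_mean_age = ages[in_treatment].mean()
control_mean_age = ages[~in_treatment].mean()
print(f"treatment: {treatment_mean_age:.1f}, control: {control_mean_age:.1f}")
```

Randomization balances groups on average; with very small samples, noticeable imbalances can still occur by chance.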
The gold standard of experiments will use...
Double-blind trial
Person administering the treatment/running the study doesn't know whether the
treatment is real or a placebo
Prevents bias in the response and/or analysis of results
Fewer opportunities for bias = more reliable conclusion about causation
Observational studies
Participants are not assigned randomly to groups
Participants assign themselves, usually based on pre-existing characteristics
Many research questions are not conducive to a controlled experiment
You can't force someone to smoke or have a disease
You can't make someone have certain past behavior
Establish association, not causation
Effects can be confounded by factors that got certain people into the control or
treatment group
There are ways to control for confounders to get more reliable conclusions about
association
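One common way to control for a confounder is stratification: comparing treated and untreated participants within groups that share the confounder's value. The sketch below is a simulation under stated assumptions, not a method taken from the slides. Older participants opt into treatment more often, the outcome depends on age only, and the treatment has no effect. All names and numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000

# Hypothetical observational data: older people opt into treatment more often
older = rng.random(n) < 0.5
treated = rng.random(n) < np.where(older, 0.8, 0.2)

# The outcome depends on age only; the treatment has no effect at all
outcome = 10 + 5 * older + rng.normal(size=n)

def diff_in_means(values, group):
    """Mean of values in the group minus mean outside it."""
    return values[group].mean() - values[~group].mean()

# Naive comparison mixes the age effect into the treatment effect
naive = diff_in_means(outcome, treated)

# Comparing within each age stratum holds the confounder fixed
within = [diff_in_means(outcome[older == g], treated[older == g]) for g in (False, True)]
stratified = np.mean(within)

print(f"naive: {naive:.2f}, stratified: {stratified:.2f}")
```

The naive difference suggests a large treatment effect that does not exist, while the stratified difference is close to zero. This only works for confounders that were measured.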
Longitudinal vs. cross-sectional studies
Longitudinal study
Participants are followed over a period of time to examine effect of treatment on response
Effect of age on height is not confounded by generation
More expensive, results take longer
Cross-sectional study
Data on participants is collected from a single snapshot in time
Effect of age on height is confounded by generation
Cheaper, faster, more convenient
Congratulations!
Maggie Matsui
Content Developer, DataCamp
Overview
Chapter 1
What is statistics?
Measures of center
Measures of spread
Chapter 2
Measuring chance
Probability distributions
Binomial distribution
Chapter 3
Normal distribution
Central limit theorem
Poisson distribution
Chapter 4
Correlation
Controlled experiments
Observational studies
Build on your skills
Introduction to Linear Modeling in Python