@@ -23,7 +23,7 @@ kernelspec:
23
23
24
24
## Outline
25
25
26
- In this lecture we give a quick introduction to data and probability distributions using Python
26
+ In this lecture we give a quick introduction to data and probability distributions using Python.
27
27
28
28
``` {code-cell} ipython3
29
29
:tags: [hide-output]
@@ -42,7 +42,7 @@ import seaborn as sns
42
42
43
43
## Common distributions
44
44
45
- In this section we recall the definitions of some well-known distributions and show how to manipulate them with SciPy.
45
+ In this section we recall the definitions of some well-known distributions and explore how to manipulate them with SciPy.
46
46
47
47
### Discrete distributions
48
48
@@ -61,7 +61,7 @@ $$ \mathbb P\{X = x_i\} = p(x_i) \quad \text{for } i= 1, \ldots, n $$
61
61
The **mean** or **expected value** of a random variable $X$ with distribution $p$ is
62
62
63
63
$$
64
- \mathbb E X = \sum_{i=1}^n x_i p(x_i)
64
+ \mathbb{E}[X] = \sum_{i=1}^n x_i p(x_i)
65
65
$$
66
66
67
67
Expectation is also called the *first moment* of the distribution.
@@ -71,15 +71,15 @@ We also refer to this number as the mean of the distribution (represented by) $p
71
71
The **variance** of $X$ is defined as
72
72
73
73
$$
74
- \mathbb V X = \sum_{i=1}^n (x_i - \mathbb E X )^2 p(x_i)
74
+ \mathbb{V}[X] = \sum_{i=1}^n (x_i - \mathbb{E}[X] )^2 p(x_i)
75
75
$$
76
76
77
77
Variance is also called the *second central moment* of the distribution.
78
78
79
79
The **cumulative distribution function** (CDF) of $X$ is defined by
80
80
81
81
$$
82
- F(x) = \mathbb P \{X \leq x\}
82
+ F(x) = \mathbb{P} \{X \leq x\}
83
83
= \sum_{i=1}^n \mathbb 1\{x_i \leq x\} p(x_i)
84
84
$$
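
The definitions of the mean, variance, and CDF above translate almost line by line into code. The following is a minimal sketch in plain Python, assuming a hypothetical distribution on the values 1, 2, 3; the numbers and the names `xs`, `ps`, and `cdf` are illustrative and do not come from the lecture.

```python
# Hypothetical discrete distribution (illustrative values, not from the lecture)
xs = [1, 2, 3]          # possible outcomes x_i
ps = [0.2, 0.5, 0.3]    # probabilities p(x_i), which sum to one

# Mean: sum of x_i * p(x_i)
mean = sum(x * p for x, p in zip(xs, ps))

# Variance: sum of (x_i - mean)^2 * p(x_i)
var = sum((x - mean) ** 2 * p for x, p in zip(xs, ps))

def cdf(x):
    """F(x): total probability of all outcomes x_i with x_i <= x."""
    return sum(p for x_i, p in zip(xs, ps) if x_i <= x)

print(mean, var, cdf(2))
```

The generator expression inside `cdf` plays the role of the indicator $\mathbb 1\{x_i \leq x\}$: only outcomes at or below $x$ contribute their probability.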
85
85
@@ -157,6 +157,75 @@ Check that your answers agree with `u.mean()` and `u.var()`.
157
157
```
158
158
159
159
160
+ #### Bernoulli distribution
161
+
162
+ Another useful (and more interesting) distribution is the Bernoulli distribution.
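
A Bernoulli random variable takes the value 1 with probability $\theta$ and the value 0 with probability $1 - \theta$, so its mean is $\theta$ and its variance is $\theta(1 - \theta)$. As a minimal sketch in plain Python, assuming a hypothetical $\theta = 0.4$ (the value and the variable names are illustrative only), these moments can be recovered from the PMF:

```python
# Hypothetical success probability (an illustrative assumption)
theta = 0.4

# PMF of a Bernoulli random variable: p(0) = 1 - theta, p(1) = theta
pmf = {0: 1 - theta, 1: theta}

# Mean and variance computed directly from the PMF
mean = sum(x * p for x, p in pmf.items())
var = sum((x - mean) ** 2 * p for x, p in pmf.items())

# Closed-form expressions for comparison
mean_formula = theta
var_formula = theta * (1 - theta)

print(mean, mean_formula)
print(var, var_formula)
```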
163
+
164
+ We can import the uniform distribution on $S = \{ 1, \ldots, n\} $ from SciPy like so:
165
+
166
+ ``` {code-cell} ipython3
167
+ n = 10
168
+ u = scipy.stats.randint(1, n+1)
169
+ ```
170
+
171
+
172
+ Here's the mean and variance:
173
+
174
+ ``` {code-cell} ipython3
175
+ u.mean(), u.var()
176
+ ```
177
+
178
+ The formula for the mean is $(n+1)/2$, and the formula for the variance is $(n^2 - 1)/12$.
179
+
180
+
181
+ Now let's evaluate the PMF:
182
+
183
+ ``` {code-cell} ipython3
184
+ u.pmf(1)
185
+ ```
186
+
187
+ ``` {code-cell} ipython3
188
+ u.pmf(2)
189
+ ```
190
+
191
+
192
+ Here's a plot of the probability mass function:
193
+
194
+ ``` {code-cell} ipython3
195
+ fig, ax = plt.subplots()
196
+ S = np.arange(1, n+1)
197
+ ax.plot(S, u.pmf(S), linestyle='', marker='o', alpha=0.8, ms=4)
198
+ ax.vlines(S, 0, u.pmf(S), lw=0.2)
199
+ ax.set_xticks(S)
200
+ plt.show()
201
+ ```
202
+
203
+
204
+ Here's a plot of the CDF:
205
+
206
+ ``` {code-cell} ipython3
207
+ fig, ax = plt.subplots()
208
+ S = np.arange(1, n+1)
209
+ ax.step(S, u.cdf(S))
210
+ ax.vlines(S, 0, u.cdf(S), lw=0.2)
211
+ ax.set_xticks(S)
212
+ plt.show()
213
+ ```
214
+
215
+
216
+ The CDF jumps up by $p(x_i)$ at $x_i$.
217
+
218
+
219
+ ``` {exercise}
220
+ :label: prob_ex2
221
+
222
+ Calculate the mean and variance for this parameterization (i.e., $n=10$)
223
+ directly from the PMF, using the expressions given above.
224
+
225
+ Check that your answers agree with `u.mean()` and `u.var()`.
226
+ ```
227
+
228
+
160
229
161
230
#### Binomial distribution
162
231
@@ -170,7 +239,7 @@ Here $\theta \in [0,1]$ is a parameter.
170
239
171
240
The interpretation of $p(i)$ is: the probability of $i$ successes in $n$ independent trials with success probability $\theta$.
172
241
173
- (If $\theta=0.5$, this is "how many heads in $n$ flips of a fair coin")
242
+ (If $\theta=0.5$, $p(i)$ is the probability of $i$ heads in $n$ flips of a fair coin.)
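
To make the coin-flip reading concrete, here is a minimal sketch that evaluates the binomial PMF with the standard formula $\binom{n}{i} \theta^i (1-\theta)^{n-i}$, using only the Python standard library. The function name `binom_pmf` and the values $n = 10$, $\theta = 0.5$ are illustrative assumptions rather than part of the lecture's code.

```python
from math import comb

def binom_pmf(i, n, theta):
    """Probability of exactly i successes in n independent trials."""
    return comb(n, i) * theta**i * (1 - theta)**(n - i)

# Hypothetical parameters: ten flips of a fair coin
n, theta = 10, 0.5

# Probability of exactly five heads
p_five = binom_pmf(5, n, theta)

# The probabilities over i = 0, ..., n should sum to one
total = sum(binom_pmf(i, n, theta) for i in range(n + 1))

# Mean computed directly from the PMF
mean = sum(i * binom_pmf(i, n, theta) for i in range(n + 1))

print(p_five, total, mean)
```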
174
243
175
244
The mean and variance are
176
245
@@ -215,12 +284,12 @@ plt.show()
215
284
216
285
217
286
``` {exercise}
218
- :label: prob_ex2
287
+ :label: prob_ex3
219
288
220
289
Using `u.pmf`, check that our definition of the CDF given above calculates the same function as `u.cdf`.
221
290
```
222
291
223
- ``` {solution-start} prob_ex2
292
+ ``` {solution-start} prob_ex3
224
293
:class: dropdown
225
294
```
226
295
@@ -304,7 +373,7 @@ The definition of the mean and variance of a random variable $X$ with distributi
304
373
For example, the mean of $X$ is
305
374
306
375
$$
307
- \mathbb E X = \int_{-\infty}^\infty x p(x) dx
376
+ \mathbb{E}[X] = \int_{-\infty}^\infty x p(x) dx
308
377
$$
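
When the integral has no convenient closed form, it can be approximated numerically. The following is a rough sketch using the midpoint rule, assuming a hypothetical density $p(x) = 2x$ on $[0, 1]$ (zero elsewhere), whose exact mean is $2/3$. The density, the grid size `N`, and all names are illustrative choices, not taken from the lecture.

```python
def p(x):
    """Hypothetical density: p(x) = 2x on [0, 1], zero elsewhere."""
    return 2 * x if 0 <= x <= 1 else 0.0

# Midpoint rule on [0, 1]; the density vanishes outside this interval
N = 10_000                                   # number of subintervals (assumed)
h = 1.0 / N                                  # width of each subinterval
midpoints = [(k + 0.5) * h for k in range(N)]

# Check that the density integrates to one
mass = sum(p(x) for x in midpoints) * h

# Approximate the mean, i.e. the integral of x * p(x)
mean = sum(x * p(x) for x in midpoints) * h

print(mass, mean)
```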
309
378
310
379
The **cumulative distribution function** (CDF) of $X$ is defined by
@@ -328,7 +397,7 @@ This distribution has two parameters, $\mu$ and $\sigma$.
328
397
329
398
It can be shown that, for this distribution, the mean is $\mu$ and the variance is $\sigma^2$.
330
399
331
- We can obtain the moments, PDF, and CDF of the normal density as follows:
400
+ We can obtain the moments, PDF and CDF of the normal density as follows:
332
401
333
402
``` {code-cell} ipython3
334
403
μ, σ = 0.0, 1.0
@@ -659,7 +728,7 @@ x.mean(), x.var()
659
728
660
729
661
730
``` {exercise}
662
- :label: prob_ex3
731
+ :label: prob_ex4
663
732
664
733
Check that the formulas given above produce the same numbers.
665
734
```
@@ -700,6 +769,7 @@ The monthly return is calculated as the percent change in the share price over e
700
769
So we will have one observation for each month.
701
770
702
771
``` {code-cell} ipython3
772
+ :tags: [hide-output]
703
773
df = yf.download('AMZN', '2000-1-1', '2023-1-1', interval='1mo' )
704
774
prices = df['Adj Close']
705
775
data = prices.pct_change()[1:] * 100
@@ -777,6 +847,7 @@ Violin plots are particularly useful when we want to compare different distribut
777
847
For example, let's compare the monthly returns on Amazon shares with the monthly return on Apple shares.
778
848
779
849
``` {code-cell} ipython3
850
+ :tags: [hide-output]
780
851
df = yf.download('AAPL', '2000-1-1', '2023-1-1', interval='1mo' )
781
852
prices = df['Adj Close']
782
853
data = prices.pct_change()[1:] * 100