Lecture 5: Stats & Probability
Population vs Sample
population: all possible values that could've been collected
sample: each singular data point actually collected
rand num gen: pop = range of values that could've been generated, sample = values actually generated
Calculate Stats & Discuss their Meaning
if np.mean & np.median are similar → distribution is not skewed
np.std(name, ddof=1): measurements fall +/- one std away from the mean
range: np.max() - np.min(); if large relative to the mean → possible outliers
scipy.stats.mode: helpful if data = discrete values, unhelpful if data = decimals
scipy.stats.skew: negative = tail to left, positive = tail to right
scipy.stats.kurtosis(name, fisher=False): 3 = normal, <3 = flatter (platykurtic), >3 = peaked (leptokurtic)
Plotting Histogram w/ Correct Bins
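The original histogram code isn't shown; a minimal sketch (dataset and bin count here are hypothetical):
import numpy as np
import matplotlib.pyplot as plt
data = np.random.normal(6.5, 1.0, 500)  # hypothetical dataset
bins = np.linspace(np.min(data), np.max(data), 20)  # bin edges span the full data range
plt.hist(data, bins=bins)
plt.xlabel("value")
plt.ylabel("count")
plt.show()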
Lecture 7: Hypothesis Testing
Central Limit Theorem
Distro of sample mean as sample size increases → approaches normal
Small N: sampling distro resembles original pop distro
Moderate N (~8): distro smooths, clusters toward true pop mean (bell)
Large N (>30): distro approaches normal
Distro of raw data → approaches original pop distro
Drawing Random Samples
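The original sampling code isn't shown; a minimal sketch using NumPy's uniform and normal generators (parameter values are hypothetical):
import numpy as np
N = 100
uniform_sample = np.random.rand(N)  # uniform draws on [0, 1]
normal_sample = np.random.normal(6.5, 1.0, N)  # normal draws, mean 6.5, std 1.0
print(normal_sample.mean(), normal_sample.std(ddof=1))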
Manipulating Random Samples
np.random.rand(N): draws from uniform distro with default interval [0, 1]
0.5 * np.random.rand(N): multiplying by a decimal makes the interval smaller [0, 0.5]
6.0 + np.random.rand(N): adding a number shifts the interval [6, 7]
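The two manipulations combine, e.g. scaling then shifting maps [0, 1] onto [6.0, 6.5]:
import numpy as np
x = 6.0 + 0.5 * np.random.rand(1000)  # uniform on [6.0, 6.5]
print(x.min(), x.max())  # both stay within [6.0, 6.5]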
Calculate Bounds for 99% Confidence Interval:
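The original code isn't shown; a minimal sketch assuming a t-based interval with df = n - 1 (sample values are hypothetical):
import numpy as np
from scipy import stats
data = np.random.normal(6.5, 1.0, 30)  # hypothetical sample
sem = stats.sem(data)  # sigma / sqrt(n), computed with ddof=1
lower, upper = stats.t.interval(0.99, df=len(data) - 1, loc=np.mean(data), scale=sem)
print(lower, upper)  # bounds of the 99% confidence interval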
Occurrence Probability for Theoretical Distros:
Prob that a sample from a norm distro w/ mean 6.5 will be > 5.5:
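The original worked example isn't shown and no std is given; assuming std = 1.0 for illustration, the survival function gives P(X > 5.5):
from scipy import stats
p = stats.norm.sf(5.5, loc=6.5, scale=1.0)  # P(X > 5.5) = 1 - CDF(5.5)
print(p)  # ≈ 0.84 under these assumed parameters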
Performing Hypothesis Test for 2 Samples: comparing 2 slices within a dataset
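A minimal sketch assuming scipy.stats.ttest_ind and two hypothetical slices (morning vs. afternoon pH, invented values):
import numpy as np
from scipy import stats
morning = np.random.normal(8.0, 0.2, 50)  # hypothetical slice 1
afternoon = np.random.normal(8.1, 0.2, 50)  # hypothetical slice 2
t_stat, p_value = stats.ttest_ind(afternoon, morning)
print(t_stat, p_value)  # compare t_stat to the critical value at the chosen sig level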
Sampling Distribution, Sample Size & Number of Samples:
Population distro: total set of measurements
Sampling distro of the sample mean: distro of means collected from diff samples
Number of samples = # of sets of data → increasing will make the distro converge to normal, no effect on mean
Sample size = # of measurements w/in each set → increasing will make the sampling distro narrower & decrease uncertainty of the mean: SEM = sigma/sqrt(n)
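A quick check of the SEM formula: increasing the sample size n shrinks sigma/sqrt(n) (population values are hypothetical):
import numpy as np
from scipy import stats
for n in (10, 100, 1000):
    sample = np.random.normal(0.0, 1.0, n)
    print(n, stats.sem(sample))  # SEM shrinks roughly as 1/sqrt(n)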
Practice Problems:
select data along specific coordinate values → .sel()
ds = xr.open_dataset("path")
average over lon & lat to get a time series:
timeseries = temp.mean(dim=('lon','lat'))
Best way to select data at specific lon & lat:
ds.temperature.sel(lat=34.05, lon=-118.25, method="nearest")
plot time-averaged spatial heatmap using temp variable from ds:
ds.temperature.mean(dim="time").plot()
"The t-stat x > the crit value y at a 90% significance level. At this sig level, we reject the null hypothesis that noon mean pH is similar to or lower than in the morning and adopt the alt hypo that pH is higher in the afternoon."
Lecture 6: Time Series Analysis
Fitting Polynomial Functions to Data:
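The original fitting code isn't shown; a sketch with np.polyfit at several degrees (hypothetical data), previewing the over/underfitting notes below:
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 10, 30)
y = np.sin(x) + np.random.normal(0, 0.2, x.size)  # hypothetical noisy data
for deg in (1, 3, 15):  # too low → underfit, too high → overfit
    coeffs = np.polyfit(x, y, deg)
    plt.plot(x, np.polyval(coeffs, x), label=f"deg {deg}")
plt.scatter(x, y, s=10)
plt.legend()
plt.show()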
Overfitting: model too complex & captures noise → poor generalization
to new data.
Underfitting: model too simple & fails to capture true pattern
Linear Interpolation:
easy to implement & no extreme oscillations, use on sparse data points (see the interpolation sketch after the spline notes below)
Spline Interpolation:
Same as linear, but pass kind="cubic" to interp1d (the 3rd line of the code changes)
Use when data has natural continuous variation & a smooth curve is needed
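A minimal sketch covering both the linear and cubic cases above (points are hypothetical):
import numpy as np
from scipy import interpolate as interp
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 0.8, 0.9, 0.1, -0.8])  # hypothetical sparse points
f_lin = interp.interp1d(x, y)  # linear is the default
f_cub = interp.interp1d(x, y, kind="cubic")  # cubic spline: one-argument change
print(f_lin(2.5), f_cub(2.5))  # interpolated values between known points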
Global Fit & Applying It to a Value:
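The original code isn't shown; a sketch of fitting all the data globally and then evaluating the fit at one value (data and degree are hypothetical):
import numpy as np
x = np.linspace(0, 10, 50)
y = 0.5 * x**2 + np.random.normal(0, 1.0, x.size)  # hypothetical data
coeffs = np.polyfit(x, y, deg=2)  # global 2nd-order fit over all points
print(np.polyval(coeffs, 4.2))  # fitted value applied at x = 4.2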
Extrapolation:
interp.interp1d(x, y, bounds_error=False, fill_value="extrapolate")
passing fill_value="extrapolate" lets the interpolant be evaluated outside the range of the original x values
How Polynomial Functions Fit Data to Curves (LSR):
1 specify functional form (polynomial, exponential, constant)
2 guess initial values for the constants in the function
3 define a squared-error residual metric quantifying the mismatch between observed data & current function values
4 use an algorithm to change coefficient values to minimize the error metric → finds the least-squares solution best fitting the data (see sketch below)
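A minimal sketch of these four steps using scipy.optimize.curve_fit (model and data are hypothetical):
import numpy as np
from scipy.optimize import curve_fit
def model(x, a, b):  # step 1: specify the functional form
    return a * x + b
x = np.linspace(0, 10, 50)
y = 3.0 * x + 2.0 + np.random.normal(0, 1.0, x.size)  # hypothetical observed data
p0 = [1.0, 0.0]  # step 2: initial guesses for the constants
params, cov = curve_fit(model, x, y, p0=p0)  # steps 3-4: minimizes the squared-error residual
print(params)  # least-squares estimates of a, b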
Quality of Functional Fit:
improves when the quantity of data points increases or noise decreases
Higher-order fits have extreme oscillations between data points, even if the data seem perfectly matched by a higher-order fit → default is to choose the SIMPLEST fit matching the data → less prone to high-frequency oscillations
Calculate Correlation Coefficient between Datasets:
measures only linear relationships; >0.7 strong, 0.3-0.7 moderate, <0.3 weak
2 independent datasets can still have strong correlation, indicating they
are impacted by a common 3rd variable
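A minimal sketch with np.corrcoef (hypothetical datasets; scipy.stats.pearsonr also returns a p-value):
import numpy as np
x = np.random.normal(0, 1, 200)  # hypothetical dataset 1
y = 0.8 * x + np.random.normal(0, 0.5, 200)  # hypothetical dataset 2
r = np.corrcoef(x, y)[0, 1]  # off-diagonal entry is the correlation coefficient
print(r)  # > 0.7 here → strong linear correlation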
Lecture 7: Hypothesis Testing Continued
SubPlot Sample Distr of Sample Mean @ Sample Sizes:
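The original figure code isn't shown; a sketch plotting the sampling distro of the sample mean at several sample sizes on subplots (population and sizes are hypothetical):
import numpy as np
import matplotlib.pyplot as plt
sizes = (2, 8, 30)  # small, moderate, large N
fig, axes = plt.subplots(1, len(sizes), figsize=(12, 3))
for ax, n in zip(axes, sizes):
    # 1000 samples of size n from a skewed (exponential) population
    means = np.random.exponential(1.0, size=(1000, n)).mean(axis=1)
    ax.hist(means, bins=30)
    ax.set_title(f"sample size {n}")
plt.show()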
Lecture 8: Multi-Dimensional Data Analysis
Using xarray .plot(), .contour(), etc.
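A minimal sketch assuming the ds.temperature variable from the practice problems (dims time, lat, lon; "path" is the placeholder from the notes):
import xarray as xr
ds = xr.open_dataset("path")
ds.temperature.mean(dim="time").plot()  # 2-D field → heatmap (pcolormesh) by default
ds.temperature.mean(dim="time").plot.contour()  # contour lines instead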
Other
ddof: if population std → ddof=0 (divide by n); if sample std → ddof=1 (divide by n-1)
-matrices in format (#rows, #columns)
Calculating Degrees of Freedom
For confidence interval → dof = n - 1
For 2-sample t-test → dof = n1 + n2 - 2
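E.g. (hypothetical numbers): two samples with n1 = 12 and n2 = 15 → dof = 12 + 15 - 2 = 25.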