1HY013 - Exercise 4
When we investigate spatial or temporal variations in climate or in water, soil, rock (etc.)
properties, we may find differences in some of their statistical properties, such as their mean
value. But we generally have access to only a limited sample of observations, so how confident
can we be that the observed differences are real, and not simply an artifact of
random sampling? This is where null hypothesis statistical tests (NHST) can help.
In this exercise, you will investigate variations in streamwater properties in a region of northern
Canada. Your goals are to answer specific questions about how water properties vary between
catchments and months, and to verify whether the observed differences are statistically significant.
This file contains water quality data from 255 surface samples collected at the mouth of 11
rivers across the Northwest Territories of Canada (NWT), between 2013 and 2018. The table
is organized as follows:
Notes:
Some cells in the dataset are marked "<LOD", which indicates values below the
limit of detection (left-censored data). These should be replaced by LOD/2 (before importing
into MATLAB), while any missing values in the dataset can be replaced with "NaN".
Tasks
First, choose one water quality property. For the instructions, let's call this property "X".
Next, explore the probability distribution of X. This will help you to choose the most
appropriate methods for testing hypotheses about the data.
In addition to these plots, test whether your sample data are consistent with a normal
distribution. Use a goodness-of-fit test such as the Lilliefors or Anderson-Darling test.
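As a minimal sketch of such a normality check, the lines below apply `lillietest` and `adtest` (both from the Statistics and Machine Learning Toolbox) to a synthetic, right-skewed sample. The variable `X` and its values are assumptions made for illustration only; they are not taken from the NWT dataset.

```matlab
% Synthetic, right-skewed stand-in for property X (NOT the NWT data)
rng(1);                          % fix the seed so the sketch is repeatable
X = lognrnd(1, 0.6, 80, 1);
X(5) = NaN;                      % mimic one missing value

x = X(~isnan(X));                % drop missing values before testing

% H0: the sample comes from a normal distribution (default alpha = 0.05)
[hL, pL] = lillietest(x);        % Lilliefors test
[hA, pA] = adtest(x);            % Anderson-Darling test

% h = 1 means H0 is rejected at the 5 % level, h = 0 means it is not
fprintf('Lilliefors:       h = %d, p = %.4f\n', hL, pL);
fprintf('Anderson-Darling: h = %d, p = %.4f\n', hA, pA);
```

Both functions report p-values only within a tabulated range and issue a warning when the true value falls outside it, so a reported p at the edge of that range should be read as "at most" or "at least" that value.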
1HY013 - Statistics and Data Analysis Methods - HT23
Next, make the appropriate choice of statistical test to answer the questions below. To assist
you, refer to the "decision tree" provided with the instructions (note that you may need to do
some tests of variance to guide your decision, see footnotes on that page).
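One possible test of variance is sketched below, using `vartestn` with Levene's test, which is less sensitive to non-normality than the default Bartlett test. The three groups "A", "B" and "C" are synthetic placeholders, and the choice of Levene's test is an assumption, not a prescription from the decision tree.

```matlab
% Synthetic groups with different spreads (NOT the NWT data)
rng(2);
x = [normrnd(10, 1.0, 20, 1); ...
     normrnd(10, 1.5, 25, 1); ...
     normrnd(10, 3.0, 22, 1)];
g = [repmat({'A'}, 20, 1); repmat({'B'}, 25, 1); repmat({'C'}, 22, 1)];

% H0: all groups have equal variance
% 'LeveneAbsolute' selects Levene's test; 'Display','off' suppresses the figure
[p, stats] = vartestn(x, g, 'TestType', 'LeveneAbsolute', 'Display', 'off');

fprintf('Levene test: F = %.3f, p = %.4f\n', stats.fstat, p);
```

For only two groups, `vartest2` offers the classical two-sample F-test, which assumes normally distributed data.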
3. Identify the rivers in the dataset for which the median value of X during summer months
(June, July and August together) is significantly greater than the summer median value of X
in all the rivers pooled together. Level of confidence: 95 %.
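For a single river, a one-sided comparison against the pooled median could be sketched as follows, assuming a non-parametric sign test (`signtest`) is the method suggested by the decision tree for your data. The samples `pooled` and `river` are synthetic; in practice the test would be repeated for each river in the dataset.

```matlab
% Synthetic summer samples (NOT the NWT data)
rng(3);
pooled = lognrnd(1.0, 0.5, 120, 1);   % all rivers pooled together
river  = lognrnd(1.4, 0.5, 15, 1);    % one individual river

m0 = median(pooled);                  % reference median under H0

% H0: median(river) = m0;  H1: median(river) > m0 (right-tailed test)
[p, h, stats] = signtest(river, m0, 'Tail', 'right', 'Alpha', 0.05);

fprintf('Sign test: sign = %d, p = %.4f, h = %d\n', stats.sign, p, h);
```

Here `m0` is treated as a fixed reference value, although it is itself estimated from the data; this is a simplification worth stating in your Live Script.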
4. For the three listed rivers below, is there a significant difference in the mean (or median)
value of X in summer months (June, July and August together)? Level of confidence: 90 %.
a) Franks Channel
b) Jean Marie River
c) Kakisa River
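A comparison of three groups could be sketched as below, assuming the data call for a non-parametric method (Kruskal-Wallis test); if the data were normal with equal variances, `anova1` accepts the same inputs. The samples are synthetic and only borrow the river names as group labels.

```matlab
% Synthetic summer samples for three rivers (NOT the NWT data)
rng(4);
x = [lognrnd(1.0, 0.5, 18, 1); ...
     lognrnd(1.3, 0.5, 15, 1); ...
     lognrnd(0.8, 0.5, 20, 1)];
g = [repmat({'Franks Channel'},   18, 1); ...
     repmat({'Jean Marie River'}, 15, 1); ...
     repmat({'Kakisa River'},     20, 1)];

% H0: all three samples come from the same distribution
% 'off' suppresses the table and box plot figures
[p, tbl, stats] = kruskalwallis(x, g, 'off');

alpha = 0.10;                         % 90 % level of confidence
fprintf('Kruskal-Wallis: chi2 = %.3f, p = %.4f\n', tbl{2,5}, p);
fprintf('Reject H0 at alpha = %.2f: %d\n', alpha, p < alpha);
```

If H0 is rejected, `multcompare(stats)` can indicate which pairs of rivers differ.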
5. To support the conclusions of your tests from questions 3 and 4, make plots that show the
median values of X that are compared, and also the confidence intervals for these values.
Example: show the medians as circles, with error bars spanning the confidence intervals. The
confidence intervals have to be estimated by bootstrap resampling of the data.
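A plot of this kind could be sketched as follows, using `bootci` for the bootstrap confidence intervals and `errorbar` for the display. The data, the labels "River A" to "River C", the number of resamples (2000) and the percentile interval type are all assumptions made for illustration.

```matlab
% Synthetic summer samples for three rivers (NOT the NWT data)
rng(5);
data  = {lognrnd(1.0, 0.5, 18, 1), ...
         lognrnd(1.3, 0.5, 15, 1), ...
         lognrnd(0.8, 0.5, 20, 1)};
names = {'River A', 'River B', 'River C'};

nboot = 2000;                         % number of bootstrap resamples
med   = zeros(1, 3);                  % sample medians
ci    = zeros(2, 3);                  % row 1: lower bound, row 2: upper bound

for k = 1:3
    med(k)   = median(data{k});
    % 95 % percentile bootstrap confidence interval of the median
    ci(:, k) = bootci(nboot, {@median, data{k}}, ...
                      'Alpha', 0.05, 'Type', 'percentile');
end

% Medians as circles, error bars spanning the confidence intervals
figure;
errorbar(1:3, med, med - ci(1,:), ci(2,:) - med, 'o');
set(gca, 'XTick', 1:3, 'XTickLabel', names);
xlim([0.5 3.5]);
ylabel('Median of X');
```

For the 90 % level of confidence in question 4, `'Alpha'` would be set to 0.10 instead.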
For statistical tests, you should provide, in your Live Script, the following information
(e.g., in the form of comment lines):
Example:
Results:
t = (some value) (this is the value of the test statistic for this particular test; see footnote 1)
p = 0.002 (probability of observing a test statistic at least this extreme under H0: this is the p-value)
Since p << 0.05, H0 can be rejected in favor of H1 with 95 % confidence
1. There may be different ways of computing the test statistic depending, e.g., on the sample size. For this
reason, MATLAB produces, in some cases, more than one output for the test statistic. The help documentation
of each test function can help to understand what these different outputs mean, but ultimately, what is most
important is whether the p-value is below the threshold of significance, and if so, by how much.
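The point made in this footnote can be illustrated with `ranksum`, which returns different fields in its `stats` output depending on how the p-value is computed. The two samples below are synthetic, and `ranksum` is used only as an example of a function with this behaviour.

```matlab
% Two small synthetic samples (NOT the NWT data)
rng(6);
a = lognrnd(1.0, 0.5, 12, 1);
b = lognrnd(1.4, 0.5, 14, 1);

% Exact method: stats contains the rank sum statistic only
[pE, hE, sE] = ranksum(a, b, 'method', 'exact');

% Normal approximation: stats also contains a z-statistic (zval)
[pA, hA, sA] = ranksum(a, b, 'method', 'approximate');

disp(sE);
disp(sA);
fprintf('p (exact) = %.4f, p (approximate) = %.4f\n', pE, pA);
```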
Distribution Plots
Hypothesis Tests
Resampling techniques
Most relevant statistical test functions in MATLAB: see the separate PDF document.