1HY013 - Exercise 4


1HY013 - Statistics and Data Analysis Methods - HT23

Exercise 4 - Comparing stream water quality data

When we investigate spatial or temporal variations in climate or in water, soil, rock (etc.)
properties, we may find differences in some of their statistical properties, such as their mean
value. But we generally have access to a limited sample of observations, so how confident
can we be that the observed differences are real, and not simply an effect of random
sampling? This is where null hypothesis statistical tests (NHST) can help.

In this exercise, you will investigate variations in streamwater properties in a region of northern
Canada. Your goals are to answer specific questions about how water properties vary between
catchments and months, and verify if the observed differences are statistically meaningful.

Data: HY013_ Water_quality_data.xlsx

This file contains water quality data from 255 surface samples collected at the mouth of 11
rivers across the Northwest Territories of Canada (NWT), between 2013 and 2018. The table
is organized as follows:

-Location of sampling (column 1)
-Date of sampling (year and month; columns 2 and 3)
-Measured water quality property names and their units (rows 1 and 2)
-Limit of detection (LOD) for the measured properties (row 3)
-Values of the measured properties (rows 4 to last)

The 10 water quality properties that were measured are:

-Total Alkalinity (mg L-1)
-Dissolved Organic Carbon (mg L-1)
-Specific Conductance (µS cm-1)
-Total Dissolved Solids (mg L-1)
-Turbidity (Nephelometric Turbidity Units, NTU)
-Calcium ions (mg L-1)
-Sulfate ions (mg L-1)
-Total Nitrogen (mg L-1)
-Total Aluminum (mg L-1)
-Total Iron (mg L-1)

Notes:
Some cells in the dataset are marked "<LOD", which indicates values that are below the
limit of detection (left-censored data). These should be replaced by LOD/2 (before importing
into MATLAB), while any missing values in the dataset can be replaced with "NaN".

Tasks

First, choose one water quality property. For the instructions, let's call this property "X".

Next, explore the probability distribution of X. This will help you to choose the most
appropriate methods for testing hypotheses about the data.

1. Produce a histogram of X to determine what type of probability distribution it has. When
doing this, combine data from all rivers together. Use n = 15 bins on the histogram to better
reveal the data structure. (If you find that the probability distribution is highly asymmetrical,
with a long tail to the left or the right, you might consider applying a log-transformation to the
data to better reveal the details of its distribution.)
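
The following is a minimal sketch of task 1, not a model solution. It uses synthetic, right-skewed numbers as a stand-in for X, because the layout of your imported table is not known here; the variable name x and the random data are assumptions to be replaced by your own pooled column.

```matlab
% Synthetic stand-in for property X, all rivers pooled (assumption: replace with your data)
rng(1);                            % fixed seed so the sketch is reproducible
x = exp(1 + 0.8*randn(255, 1));    % hypothetical right-skewed values
x(5) = NaN;                        % a missing value; histogram leaves NaN out of the bins

figure;
subplot(1, 2, 1);
histogram(x, 15);                  % 15 bins, as requested in the task
xlabel('X'); ylabel('Count'); title('Raw data');

subplot(1, 2, 2);
histogram(log10(x), 15);           % same data after a log-transformation
xlabel('log_{10}(X)'); ylabel('Count'); title('Log-transformed data');
```

Placing the raw and log-transformed histograms side by side makes it easier to judge whether the transformation brings the distribution closer to a symmetric shape.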

2. Make either a normal probability plot or a quantile-quantile plot to visually determine if
the data (or the log-transformed data) are normally distributed.

In addition to these plots, test whether your sample data could plausibly come from a normal
distribution. Use a goodness-of-fit test such as the Lilliefors or Anderson-Darling test.
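
As a hedged illustration of task 2 and the goodness-of-fit tests, the sketch below applies normplot, qqplot, lillietest and adtest (all from the Statistics and Machine Learning Toolbox) to synthetic data. The sample and the choice to test the log-transformed values are assumptions made for the example only.

```matlab
rng(2);                             % fixed seed for reproducibility
x = exp(1 + 0.8*randn(200, 1));     % hypothetical right-skewed sample (assumption)
y = log(x);                         % log-transformed values

figure;
subplot(1, 2, 1); normplot(y);      % normal probability plot
subplot(1, 2, 2); qqplot(y);        % quantile-quantile plot against a normal distribution

% Goodness-of-fit tests. H0: the data come from a normal distribution.
% h = 1 means H0 is rejected at the default significance level of 0.05.
[hL, pL] = lillietest(y);           % Lilliefors test
[hA, pA] = adtest(y);               % Anderson-Darling test
fprintf('Lilliefors:       h = %d, p = %.3f\n', hL, pL);
fprintf('Anderson-Darling: h = %d, p = %.3f\n', hA, pA);
```

Note that a large p-value here does not prove normality; it only means the test found no strong evidence against it.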

Next, make the appropriate choice of statistical test to answer the questions below. To assist
you, refer to the "decision tree" provided with the instructions (note that you may need to do
some tests of variance to guide your decision, see footnotes on that page).
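
The decision tree and its footnotes are not reproduced here, so which variance test they recommend is unknown. As one possible sketch, vartest2 compares the variances of two samples and vartestn compares several groups; the data and group labels below are hypothetical.

```matlab
rng(3);                               % fixed seed for reproducibility
a = 10 + 2*randn(30, 1);              % hypothetical sample A
b = 10 + 2*randn(25, 1);              % hypothetical sample B
c = 10 + 2*randn(28, 1);              % hypothetical sample C

% Two samples: F-test, H0: equal variances (assumes normally distributed data)
[h2, p2] = vartest2(a, b);

% Three or more samples: Bartlett's test (assumes normality) and
% Levene's test (less sensitive to departures from normality)
vals = [a; b; c];
grp  = [ones(30, 1); 2*ones(25, 1); 3*ones(28, 1)];
pBartlett = vartestn(vals, grp, 'TestType', 'Bartlett', 'Display', 'off');
pLevene   = vartestn(vals, grp, 'TestType', 'LeveneAbsolute', 'Display', 'off');

fprintf('vartest2: p = %.3f\n', p2);
fprintf('Bartlett: p = %.3f, Levene: p = %.3f\n', pBartlett, pLevene);
```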

3. Identify the rivers in the dataset for which the median value of X during summer months
(June, July and August together) is significantly greater than the summer median value of X
in all the rivers pooled together. Level of confidence: 95 %.
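
One possible way to set up question 3, assuming the data turn out to be non-normal, is a one-sample, right-tailed sign test of each river against the pooled summer median. This is only a sketch: the river names and values are synthetic, the choice of signtest is an assumption (the decision tree may point you to a different test), and the pooled median is treated as a fixed reference value.

```matlab
rng(4);                                        % fixed seed for reproducibility
rivers = {'River A', 'River B', 'River C'};    % hypothetical river names
x = {exp(1.0 + 0.5*randn(20, 1)), ...          % hypothetical summer values of X
     exp(1.6 + 0.5*randn(18, 1)), ...
     exp(0.8 + 0.5*randn(22, 1))};

pooledMedian = median(vertcat(x{:}));          % summer median, all rivers pooled
alpha = 0.05;                                  % 95 % level of confidence

for k = 1:numel(rivers)
    % H0: median of river k = pooled median; H1: median of river k > pooled median
    [p, h, stats] = signtest(x{k}, pooledMedian, 'Tail', 'right', 'Alpha', alpha);
    fprintf('%s: sign statistic = %d, p = %.4f, reject H0 = %d\n', ...
            rivers{k}, stats.sign, p, h);
end
```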

4. For the three rivers listed below, is there a significant difference in the mean (or median)
value of X in summer months (June, July and August together)? Level of confidence: 90 %.

a) Franks Channel
b) Jean Marie River
c) Kakisa River
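
For a comparison of three samples, two commonly used options are one-way ANOVA (parametric, compares means) and the Kruskal-Wallis test (non-parametric, compares mean ranks). The sketch below runs both on synthetic values; only the river names come from the exercise, and which test is appropriate depends on your results from tasks 1 and 2 and on the decision tree.

```matlab
rng(5);                                   % fixed seed for reproducibility
x1 = exp(1.0 + 0.5*randn(15, 1));         % hypothetical summer values, Franks Channel
x2 = exp(1.2 + 0.5*randn(14, 1));         % hypothetical summer values, Jean Marie River
x3 = exp(1.1 + 0.5*randn(16, 1));         % hypothetical summer values, Kakisa River

vals = [x1; x2; x3];
grp  = [repmat({'Franks Channel'},   numel(x1), 1); ...
        repmat({'Jean Marie River'}, numel(x2), 1); ...
        repmat({'Kakisa River'},     numel(x3), 1)];

alpha = 0.10;                             % 90 % level of confidence

% Non-parametric: H0 = all three samples come from the same distribution
[pKW, tblKW] = kruskalwallis(vals, grp, 'off');
chi2 = tblKW{2, 5};                       % chi-square statistic from the output table

% Parametric: H0 = all three group means are equal
[pA, tblA] = anova1(vals, grp, 'off');
F = tblA{2, 5};                           % F statistic from the ANOVA table

fprintf('Kruskal-Wallis: chi2 = %.3f, p = %.4f, reject H0 = %d\n', chi2, pKW, pKW < alpha);
fprintf('One-way ANOVA:  F = %.3f, p = %.4f, reject H0 = %d\n', F, pA, pA < alpha);
```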

5. To support the conclusions of your tests from questions 3 and 4, make plots that show the
median values of X that are compared, and also the confidence intervals for these values.
Example: show the medians as circles, with error bars spanning the confidence intervals. The
confidence intervals have to be estimated by bootstrap resampling of the data.
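
A minimal sketch of such a plot is given below, assuming the percentile method for the bootstrap confidence intervals (other methods exist) and using synthetic values in place of the real summer data. The number of resamples, nBoot, is an arbitrary choice.

```matlab
rng(6);                                        % fixed seed for reproducibility
names = {'Franks Channel', 'Jean Marie River', 'Kakisa River'};
x = {exp(1.0 + 0.5*randn(15, 1)), ...          % hypothetical summer values of X
     exp(1.2 + 0.5*randn(14, 1)), ...
     exp(1.1 + 0.5*randn(16, 1))};

nBoot = 2000;                                  % number of bootstrap resamples (assumption)
conf  = 0.95;                                  % confidence level of the intervals
pLims = 100 * [(1 - conf)/2, 1 - (1 - conf)/2];  % percentiles: 2.5 and 97.5

med = zeros(1, 3); lo = zeros(1, 3); hi = zeros(1, 3);
for k = 1:3
    bootMed = bootstrp(nBoot, @median, x{k});  % medians of the resampled data
    ci      = prctile(bootMed, pLims);         % percentile confidence interval
    med(k)  = median(x{k});
    lo(k)   = ci(1);
    hi(k)   = ci(2);
end

figure;
errorbar(1:3, med, med - lo, hi - med, 'o');   % circles with asymmetric error bars
xlim([0.5 3.5]);
set(gca, 'XTick', 1:3, 'XTickLabel', names);
ylabel('Median of X');
```

The error bar lengths are passed to errorbar as distances below and above each median, which is why the interval limits are converted with med - lo and hi - med.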

For statistical tests, you should provide, in your Live Script, the following information
(e.g., in the form of comment lines):

-What is the test?
-Is it a parametric or a non-parametric test?
-Are you performing a one-sided (one-tailed) or a two-sided (two-tailed) test?
-Is this a one-sample, two-sample, or multi-sample test?
-What is the null hypothesis (H0) that is being tested?
-If the test is one-sided, what is the alternative hypothesis H1?
-What is the level of significance (α) of the test?
-What is the value of the test statistic that was obtained from the data?
-What was the actual p-value for the statistic?

Example:

Test: Two-sample, one-sided (right-tailed) t-test
The test compares the mean values of X in samples A and B
H0: mean value of X in sample A = mean value of X in sample B
A right-tailed test implies that if H0 can be rejected, the favoured alternative hypothesis H1
will be that the mean value of X in sample A > mean value of X in sample B.
The test is being performed at the 95 % level of confidence, so the level (threshold) of
significance is α = 1 - 0.95 = 0.05 (i.e., 5 %).

Results:
t = (some value) (this is the value of the test statistic for this particular test; see footnote 1)
p = 0.002 (probability of observing a value at least as extreme as the one above under H0: this is the p-value)
Since p << 0.05, H0 can be rejected in favor of H1 with 95 % confidence
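
In a script, the same example could look roughly like the sketch below. The samples A and B are synthetic and purely illustrative, and ttest2 is called with its default assumption of equal variances, which you would need to check for real data.

```matlab
% Test: two-sample, one-sided (right-tailed) t-test; parametric
% H0: mean of X in sample A = mean of X in sample B
% H1: mean of X in sample A > mean of X in sample B
% Level of significance: alpha = 1 - 0.95 = 0.05
rng(7);                             % fixed seed for reproducibility
A = 12 + 2*randn(30, 1);            % hypothetical sample A
B = 10 + 2*randn(30, 1);            % hypothetical sample B
alpha = 0.05;

[h, p, ci, stats] = ttest2(A, B, 'Tail', 'right', 'Alpha', alpha);

% Results: test statistic, degrees of freedom and p-value
fprintf('t = %.3f, df = %g, p = %.4g\n', stats.tstat, stats.df, p);
fprintf('Reject H0 at alpha = %.2f: %d\n', alpha, h);
```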

Footnote 1: There may be different ways of computing the test statistic depending, e.g., on the
sample size. For this reason, MATLAB produces, in some cases, more than one output for the test
statistic. The help documentation of each test function can help to understand what these
different outputs mean, but ultimately, what is most important is whether the p-value is below
the threshold of significance, and if so, by how much.

Useful MATLAB resources

Topics in the MATLAB Statistics and Machine Learning Toolbox documentation:

Distribution Plots
Hypothesis Tests
Resampling techniques

Most relevant statistical test functions in MATLAB: See separate PDF document

Other useful MATLAB functions for this exercise:

bootstrp bootstrap sampling

(Note: There is a function for computing confidence intervals by bootstrap sampling (bootci),
but for this exercise I recommend that you use bootstrp instead. Or try to write your own
resampling code if you wish!)

errorbar scatter plot with error bars
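
If you choose to write your own resampling code, one simple approach is to draw indices with replacement using randi and recompute the statistic each time. The sketch below does this for the median of a synthetic sample and takes a 95 % percentile interval from the sorted bootstrap medians; the sample, nBoot and the percentile method are all assumptions.

```matlab
rng(8);                                 % fixed seed for reproducibility
x = exp(1 + 0.5*randn(25, 1));          % hypothetical sample of X
n = numel(x);
nBoot = 2000;                           % number of bootstrap resamples (assumption)

bootMed = zeros(nBoot, 1);
for b = 1:nBoot
    idx = randi(n, n, 1);               % n indices drawn with replacement
    bootMed(b) = median(x(idx));        % median of the resampled data
end

% 95 % percentile interval from the ordered bootstrap medians
sortedMed = sort(bootMed);
ci = [sortedMed(round(0.025*nBoot)), sortedMed(round(0.975*nBoot))];
fprintf('median = %.3f, 95 %% CI = [%.3f, %.3f]\n', median(x), ci(1), ci(2));
```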
