Lecture 7

This document discusses statistical significance and receiver-operator characteristic (ROC) curves. It defines key terms like p-value, sensitivity, specificity, and area under the ROC curve (AROC). A good diagnostic test is represented by an ROC curve that climbs rapidly towards the upper left corner, indicating low false positive and false negative rates across cut-off values. The AROC quantifies test accuracy, with higher values indicating better discrimination. An AROC of 0.5 represents random chance, while values above 0.7 indicate minimum acceptable accuracy.

Uploaded by

Monrell Tai Baba
Copyright
© Attribution Non-Commercial (BY-NC)

Evaluation Measurements

Statistical significance

• In statistics, a result is significant if it is unlikely to have occurred
by chance, given that a presumed null hypothesis is true.
• More precisely, in traditional frequentist statistical hypothesis testing,
the significance level of a test is the maximum probability of
accidentally rejecting a true null hypothesis (a decision known as a
Type I error) and accepting the alternative hypothesis.
• The significance of a result is measured by its p-value: the probability,
under the null hypothesis, of obtaining a result at least as extreme as the
one observed. The smaller the p-value, the more significant the result is
said to be.

Statistical significance

• For example, one may choose a significance level of, say, 5%, and
calculate a critical value of a statistic (such as the mean) so that the
probability of it exceeding that value, given the truth of the
null hypothesis, would be 5%. If the actual, calculated statistic value
exceeds the critical value, then it is significant "at the 5% level".
Symbolically speaking, the significance level is denoted by α.
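
As a minimal sketch of the calculation described above, assume the test statistic follows a standard normal distribution under the null hypothesis and that the test is one-sided. Neither assumption comes from the slide, and the names `critical_value` and `is_significant` are illustrative only.

```python
from statistics import NormalDist


def critical_value(alpha: float) -> float:
    """Upper critical value of a one-sided z-test (assumed setting)."""
    # The statistic exceeds this value with probability alpha under the null.
    return NormalDist().inv_cdf(1.0 - alpha)


def is_significant(statistic: float, alpha: float = 0.05) -> bool:
    """True if the observed statistic is significant at level alpha."""
    return statistic > critical_value(alpha)


print(round(critical_value(0.05), 3))  # about 1.645
print(is_significant(2.0))             # 2.0 exceeds the 5% critical value
```

Under these assumptions, an observed statistic of 2.0 would be significant "at the 5% level" but not at the 1% level, whose critical value is larger.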

Statistical significance

• If the significance level is smaller, a value will be less likely to be
more extreme than the critical value. So a result which is "significant
at the 1% level" is more significant than a result which is "significant
at the 5% level". However a test at the 1% level is more likely to
have a Type II error than a test at the 5% level, and so will have less
statistical power. In devising a hypothesis test, the tester will aim to
maximize power for a given significance, but ultimately have to
recognise that the best which can be achieved is likely to be a
balance between significance and power, in other words between
the risks of Type I and Type II errors. It is important to note that
Type I error is not necessarily any worse than a Type II error, and
vice versa. The severity of an error depends on each individual
case.
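
The trade-off between significance and power can be illustrated with a small calculation. The sketch below assumes a one-sided z-test with known standard deviation; the effect size, standard deviation, and sample size are hypothetical values chosen only for illustration.

```python
from statistics import NormalDist


def power_one_sided_z(alpha: float, effect: float, sigma: float, n: int) -> float:
    """Power of a one-sided z-test when the true mean shift is `effect`."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha)
    # Mean of the z statistic when the alternative hypothesis is true.
    shift = effect * n ** 0.5 / sigma
    return 1.0 - std_normal.cdf(z_crit - shift)


# Hypothetical setting: effect 0.5, sigma 1.0, 25 observations.
for alpha in (0.05, 0.01):
    power = power_one_sided_z(alpha, effect=0.5, sigma=1.0, n=25)
    print(f"alpha={alpha:.2f}  power={power:.3f}  Type II risk={1 - power:.3f}")
```

With everything else held fixed, the stricter 1% level gives lower power, that is, a higher risk of a Type II error.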

Statistical significance

• If the alternative hypothesis is in fact true, then a sufficiently large
sample size is likely to give a highly significant result, even if the
difference between the null hypothesis and the alternative
hypothesis is very small. The statistical significance of a result is
therefore not an indication of how substantial or important the
difference is.
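
To see why significance does not measure importance, consider a fixed, very small true difference and an increasing sample size. The sketch below assumes a one-sided z-test and, for simplicity, that the observed difference equals the true difference exactly; the numbers are hypothetical.

```python
import math


def one_sided_p_value(effect: float, sigma: float, n: int) -> float:
    """P-value of a one-sided z-test when the observed mean shift is `effect`."""
    z = effect * math.sqrt(n) / sigma
    # Upper-tail normal probability; erfc stays accurate for large z.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical tiny difference: 0.01 standard deviations.
for n in (100, 10_000, 1_000_000):
    print(f"n={n:>9}  p-value={one_sided_p_value(0.01, 1.0, n):.3g}")
```

The difference stays the same throughout, yet the p-value shrinks towards zero as the sample grows.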

Receiver-Operator-Characteristic
(ROC)
• A ROC curve shows the relationship of the probability of false alarm
(x-axis) to the probability of detection (y-axis) for a certain test.
• Expressed in medical terms: the probability of a positive test given
no disease (x-axis) against the probability of a positive test given
disease (y-axis).
• The ROC curve may be used to determine an optimal cutoff point
for the test.

Determining significance of
database matches
• When searching a DB, the challenge for
analysis methods is to determine if
matches are related (true-positive, TP) or
unrelated (true-negative, TN)
• At a given scoring threshold, it is likely that
unrelated sequences will be matched
erroneously (false-positives, FP) & some
correct matches will be missed (false-
negative, FN)
• The aim is to improve the resolution
between the score distributions of related
and unrelated matches - in the overlap, it is
difficult or impossible to establish if
matches are significant
• Different methods tackle this problem in
different ways

Practical Problem

• For example, doctors have measured the S100 protein in serum and
found that higher values tend to be associated with Creutzfeldt-
Jakob disease. The median value is 395 pg/ml for the 108 patients
with the disease and only 109 pg/ml for the 74 patients without the
disease.
• The doctors set a cut off of 213 pg/ml, even though they realized
that 22.2% of the diseased patients had values below the cut off and
18.9% of the disease-free patients had values above the cut off.

Practical Problem

• The two percentages listed above are the false negative and false
positive rates, respectively.
• If we lowered the cut off value, we would decrease the false
negative rate, but we would also increase the false
positive rate.
• Similarly, if we raised the cut off, we would decrease the false
positive rate, but we would increase the false negative rate.
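
The S100 figures can be turned into the usual rates with a few lines of arithmetic. In the sketch below, the patient counts behind the two percentages are not stated in the slides; they are inferred by rounding (22.2% of 108 and 18.9% of 74) and should be read as an assumption.

```python
diseased_total = 108   # patients with Creutzfeldt-Jakob disease
healthy_total = 74     # patients without the disease

# Counts inferred from the quoted percentages (an assumption, not slide data).
false_negatives = round(0.222 * diseased_total)  # diseased, below 213 pg/ml
false_positives = round(0.189 * healthy_total)   # disease-free, above 213 pg/ml

false_negative_rate = false_negatives / diseased_total
false_positive_rate = false_positives / healthy_total
sensitivity = 1.0 - false_negative_rate
specificity = 1.0 - false_positive_rate

print(f"FN rate = {false_negative_rate:.1%}, FP rate = {false_positive_rate:.1%}")
print(f"sensitivity = {sensitivity:.1%}, specificity = {specificity:.1%}")
```

At the 213 pg/ml cut off this gives a sensitivity of about 77.8% and a specificity of about 81.1%; moving the cut off trades one against the other.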

Practical Problem: Graphical
Representation

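[Figure: distributions of the number of cases (N) against score for negative and positive cases. A cut-off on the score axis divides them into true negative, false negative, false positive, and true positive regions.]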

ROC

• An ROC curve is a graphical representation of the trade off between
the false negative and false positive rates for every possible cut off.
Equivalently, the ROC curve is the representation of the tradeoffs
between sensitivity (Sn) and specificity (Sp).
• By tradition, the plot shows the FP rate on the X axis and the
(1 - FN rate) on the Y axis.
• You could also describe this as a plot with 1-Sp on the X axis and
Sn on the Y axis.

Sensitivity (Sn)

• The sensitivity of a binary classification test or algorithm, such as a
blood test to determine if a person has a certain disease, is a
parameter that expresses something about the test's performance.
• The sensitivity of such a test is the proportion of those cases having
a positive test result of all positive cases (e.g., people with the
disease, faulty products) tested.
• A sensitivity of 100% means that all sick people or faulty products
were recognized as such.
• Sensitivity alone does not tell us all about the test, because a 100%
sensitivity can be trivially achieved by labeling all test cases
positive. Therefore, we also need to know the specificity of the test.

\[ Sn = \frac{\#TP}{\#TP + \#FN} \]

Specificity (Sp)

• In binary testing, e.g. a medical diagnostic test for a certain disease,
specificity is the proportion of true negatives of all the negative
samples tested.
• For a test to determine who has a certain disease, a specificity of
100% means that all non-sick people are labeled as non-sick.
• Specificity alone does not tell us all about the test, because a 100%
specificity can be trivially achieved by labeling all test cases
negative. Therefore, we also need to know the sensitivity of the test.
\[ Sp = \frac{\#TN}{\#TN + \#FP} \]

Precision

• Also called positive predictive value: the proportion of positive test
results that are true positives.
• It is as much a statement about the proportion of actual
positives in the population being tested as it is about the test.

\[ Precision = \frac{\#TP}{\#TP + \#FP} \]

Accuracy

• The proportion of all data correctly identified as TP or TN

\[ Accuracy = \frac{\#TP + \#TN}{\#TP + \#TN + \#FP + \#FN} \]
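
The four measures defined so far (Sn, Sp, Precision, Accuracy) all derive from the same four counts. The sketch below collects them in one place; the class name `ConfusionCounts` and the example counts are illustrative and do not come from the lecture.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    tp: int  # true positives
    fn: int  # false negatives
    fp: int  # false positives
    tn: int  # true negatives

    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    def precision(self) -> float:
        return self.tp / (self.tp + self.fp)

    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.tn + self.fp + self.fn)


# Illustrative counts only: 100 positive and 100 negative cases.
counts = ConfusionCounts(tp=80, fn=20, fp=10, tn=90)
print(f"Sn={counts.sensitivity():.3f}  Sp={counts.specificity():.3f}")
print(f"Precision={counts.precision():.3f}  Accuracy={counts.accuracy():.3f}")
```

Note that precision, unlike Sn and Sp, changes if the ratio of positive to negative cases in the tested population changes, even when the test itself stays the same.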

How to tell a good ROC curve from a
bad one?
• It is the diagnostic test which can be good or bad.
• A good diagnostic test is one that has small FP and FN rates across
a reasonable range of cut off values.
• A bad diagnostic test is one where the only cut offs that make the
FP rate low have a high FN rate and vice versa.
• Good test: ROC curve climbs rapidly towards upper left hand corner
of the graph. This means that the (1 - FN rate) is high and the FP rate is
low.
• Poor test: ROC curve follows a diagonal path from the lower left
hand corner to the upper right hand corner. This means that every
improvement in the FP rate is matched by a corresponding worsening of
the FN rate.

Area under ROC curve

• You can quantify how quickly the ROC curve rises to the upper left
hand corner by measuring the area under the curve.
• The larger the area, the better the diagnostic test.
– If the area is 1.0, you have an ideal test, because it achieves
both 100% sensitivity and 100% specificity.
– If the area is 0.5, then you have a test which has effectively 50%
sensitivity and 50% specificity. This is a test that is no better
than flipping a coin.
• Area under the curve does have one direct interpretation. If you take
a random healthy patient and get a score of X and a random
diseased patient and get a score of Y, then the area under the curve
is an estimate of P[Y>X] (assuming that large values of the test are
indicative of disease).
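
The P[Y>X] interpretation can be checked directly by comparing every diseased score with every healthy score. The sketch below uses small hypothetical score lists and counts a tie as half a win, which is a common convention but an assumption here.

```python
def auc_by_pairs(healthy_scores, diseased_scores):
    """Estimate P[Y > X] over all (healthy X, diseased Y) pairs; ties count 0.5."""
    wins = 0.0
    for y in diseased_scores:
        for x in healthy_scores:
            if y > x:
                wins += 1.0
            elif y == x:
                wins += 0.5
    return wins / (len(healthy_scores) * len(diseased_scores))


# Hypothetical test scores; larger values are taken to indicate disease.
healthy = [1, 2, 2, 3, 5]
diseased = [2, 4, 5, 6, 7]
print(auc_by_pairs(healthy, diseased))  # 0.82
```

A value of 1.0 would mean every diseased score exceeds every healthy score, while 0.5 would mean the scores carry no information about disease status.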

Area under ROC curve (AROC)

• As a rough guide, the area under ROC (AROC) value 1.0
represents a perfect prediction, values 0.9 to 1.0 represent excellent
accuracy, 0.8 to 0.9 represent good accuracy, 0.7 to 0.8 represent
minimum acceptable accuracy, while 0.5 to 0.7 represent
predictions that are little better than random choice (Swets 1988).

Example of an ROC curve

• Consider a diagnostic test that can only take on five values,
A, B, C, D, and E.
• We classify diseased (D+) and healthy (D-) patients by this
test and get the following results:

Test Value   Diseased   Healthy
A                   2        28
B                   4        14
C                  10         5
D                  14         2
E                  20         1
Total              50        50

• Convert this table to cumulative percentages.
• Add a row (*) to represent the cumulative percentage of 0%,
which will end up corresponding to a diagnostic test where all the
results are considered positive regardless of the test value.
• The last row represents the other extreme, where all the results are
considered negative regardless of the test value.

Test Value   Diseased   Healthy
*                   0         0
A                   4        56
B                  12        84
C                  32        94
D                  60        98
E                 100       100
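
The conversion to cumulative percentages can be reproduced from the counts in the example table. This is a minimal sketch; the function name `cumulative_percentages` is illustrative.

```python
from itertools import accumulate

test_values = ["A", "B", "C", "D", "E"]
diseased = [2, 4, 10, 14, 20]   # counts per test value, 50 patients in total
healthy = [28, 14, 5, 2, 1]     # counts per test value, 50 patients in total


def cumulative_percentages(counts):
    """Running total of the counts, expressed as a percentage of the total."""
    total = sum(counts)
    return [100 * running / total for running in accumulate(counts)]


# Prepend the (*) row, which stands for a cumulative percentage of 0%.
cum_diseased = [0.0] + cumulative_percentages(diseased)
cum_healthy = [0.0] + cumulative_percentages(healthy)

for label, d, h in zip(["*"] + test_values, cum_diseased, cum_healthy):
    print(f"{label:>2} {d:6.0f} {h:6.0f}")
```

Each row gives the percentage of patients in each group whose test value is at or below that row's value.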

• The complementary percentages (100% minus each cumulative
percentage) represent the TP rate (or Sn) and the FP rate (or 1-Sp).

Test Value                          TP    FP
Always positive                    100   100
A is negative, B-E is positive      96    44
A-B is negative, C-E is positive    88    16
A-C is negative, D-E is positive    68     6
A-D is negative, E is positive      40     2
Never positive                       0     0
• Here is information about Area Under the Curve. This area (0.91) is
quite good; it is close to the ideal value of 1.0 and much larger than
the worst case value of 0.5.
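
The quoted area can be reproduced from the TP and FP percentages of the five-valued example. The sketch below joins the ROC points with straight lines and applies the trapezoidal rule; the slides do not say how the area was computed, so the method is an assumption, although it does give the same 0.91.

```python
# (FP rate, TP rate) pairs in percent, one per cut-off in the example.
roc_points = [(100, 100), (44, 96), (16, 88), (6, 68), (2, 40), (0, 0)]


def trapezoid_auc(points_percent):
    """Area under a piecewise-linear ROC curve given in percent units."""
    pts = sorted(points_percent)  # order by increasing FP rate
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2  # one trapezoid per segment
    return area / (100 * 100)  # rescale from percent squared to [0, 1]


auc = trapezoid_auc(roc_points)
print(f"{auc:.2f}")  # 0.91
```

Applying the same function to the two end points alone, (0, 0) and (100, 100), gives the diagonal's area of 0.5.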

• Next is an artificial ROC curve with an area equal to 0.5.
– Notice that each gain in sensitivity is balanced by the exact
same loss in specificity and vice versa. Also notice that this
curve includes the point corresponding to 50% for both
sensitivity and specificity. You could achieve this level of
diagnostic ability by flipping a coin. Clearly, this curve represents
a worst case scenario.

What's a good value for the area
under the curve?
• This is tricky and depends a lot on the context of your individual problem.

• So here is one interpretation of the areas:
0.50 to 0.75 = fair
0.75 to 0.92 = good
0.92 to 0.97 = very good
0.97 to 1.00 = excellent.

• These are very rough guidelines

Reference

• The magnificent ROC (Receiver Operating Characteristic curve).
Jo van Schalkwyk. Accessed on 2003-09-08.
www.anaesthetist.com/mnm/stats/roc/
