Error and Uncertainty: General Statistical Principles
Descriptive statistics: used to describe the nature of your data.
Inductive statistics: the use of descriptive statistics to make a statement, prediction, or decision.
Descriptive statistics are commonly reported, but both are needed to interpret results.
Probability
A characteristic associated with an event: its tendency to take place. To see what we're talking about, let's use an example: a 10-sided die. If a person had a die and gave it a series of rolls, what would be the expected result?
For a single roll, each value is equally likely to come up. What if a person had two dice?
Two dice
[Figure: average of two dice, one roll]
More dice
[Figures: average of three dice and of four dice, one roll; horizontal axis from 0 to 9]
We can continue this trend, using more dice and a single roll.
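To make the trend concrete, here is a minimal sketch that enumerates every possible outcome for one roll of several 10-sided dice and tabulates the exact probability of each average. It assumes the faces are numbered 0 to 9, as the figure axes suggest; the function name `average_distribution` is made up for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

FACES = range(10)  # assumption: faces numbered 0-9, matching the figure axes


def average_distribution(n_dice):
    """Exact probability of each possible average for one roll of n_dice dice."""
    # Count how many of the 10**n_dice equally likely rolls give each sum.
    counts = Counter(sum(roll) for roll in product(FACES, repeat=n_dice))
    total = len(FACES) ** n_dice
    return {
        Fraction(s, n_dice): Fraction(c, total)
        for s, c in sorted(counts.items())
    }


for n in (1, 2, 3, 4):
    dist = average_distribution(n)
    extreme = float(dist[Fraction(0)])  # probability that every die shows 0
    peak = float(max(dist.values()))    # probability of the most likely average
    print(f"{n} dice: P(average = 0) = {extreme:.4f}, peak probability = {peak:.4f}")
```

With one die every value is equally likely; as dice are added, the extreme averages become rare and the probability piles up around the center, which is the bell shape described next.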
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad -\infty < x < +\infty$$
σ: standard deviation
μ: universal mean
An infinite data set is required.
Normal distribution
These terms can be calculated by:
Universal mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
Variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$
Standard deviation: $\sigma = \sqrt{\sigma^2}$
Both variance and standard deviation measure the dispersion of data around the universal mean.
The standard deviation has the same units as the original data and gives relatively more weight to the central data. The variance is a more sensitive measure of dispersion, giving more weight to the outlying data. That is why we'll manipulate data via the variance, not the standard deviation.
These formulas are for infinite data sets but can be used for data sets where N > 100 and any variation is truly random in nature.
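As a rough illustration of how the universal mean, variance, and standard deviation are computed, the sketch below applies the definitions to a short, hypothetical list of masses. The data and the name `population_stats` are assumptions made for the example; a real data set would need N > 100, as noted above.

```python
import math


def population_stats(values):
    """Return (mean, variance, standard deviation), dividing by N."""
    n = len(values)
    mu = sum(values) / n
    variance = sum((x - mu) ** 2 for x in values) / n
    sigma = math.sqrt(variance)
    return mu, variance, sigma


# Hypothetical masses in mg, far too few for real use; arithmetic only.
masses_mg = [8.0, 11.0, 14.0, 11.0, 10.0, 12.0]
mu, variance, sigma = population_stats(masses_mg)
print(f"mean = {mu:.2f} mg, variance = {variance:.2f} mg^2, sd = {sigma:.2f} mg")
```

Note that the variance comes out in squared units (mg^2), while the standard deviation is back in the units of the data.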
±1σ = 68.3%
±2σ = 95.5%
±3σ = 99.7%
[Figure: normal curve with the horizontal axis marked in σ units from -3 to +3, corresponding to 2, 5, 8, 11, 14, 17, and 20 mg]
When determining σ, you are assuming a normal distribution and relating your data to σ units.
The area under any portion of the curve tells you the probability of an event occurring.
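The percentages quoted for one, two, and three standard deviations can be checked numerically. This is a small sketch using the error function from Python's standard library; the helper name `area_within` is made up for illustration.

```python
import math


def area_within(k):
    """Area under the normal curve between -k and +k standard deviations."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within +/-{k} sd: {100 * area_within(k):.2f}%")
```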
Reduced variable:
$$u = \frac{x - \mu}{\sigma}$$
This simply converts your test value from its normal units (mg, hours, ...) to units of standard deviation.
Assuming that your data are normally distributed, you can use u to predict the probability of an event occurring. The probability can be found by looking up u in a table, Form A or Form B. Many spreadsheets also allow for calculation of these values. Which form you use depends on the question being asked.

Form A
This form gives the area under the curve from u to infinity.

|u|    area      |u|     area
0.0    0.5000    2.0     0.0227
0.2    0.4207    2.2     0.0139
0.4    0.3446    2.4     0.0082
0.6    0.2743    2.6     0.0047
0.8    0.2119    2.8     0.0026
1.0    0.1587    3.0     1.3 x 10^-3
1.2    0.1151    4.0     3.2 x 10^-5
1.4    0.0808    6.0     9.9 x 10^-10
1.6    0.0548    8.0     6.2 x 10^-16
1.8    0.0359    10.0    7.6 x 10^-24

Form B
This form gives the area under the curve from 0 to u.

|u|    area      |u|    area
0.0    0.0000    1.6    0.445
0.2    0.0793    1.8    0.464
0.4    0.1554    2.0    0.477
0.6    0.2258    2.2    0.486
0.8    0.2881    2.4    0.491
1.0    0.3413    2.6    0.495
1.2    0.3849    2.8    0.497
1.4    0.4192    3.0    0.498
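Both tables can be reproduced for any value of u, not just the tabulated ones. The following is a minimal sketch using the standard-library error functions; the names `form_a` and `form_b` simply mirror the table labels and are not from any library.

```python
import math


def form_a(u):
    """Form A: area under the standard normal curve from |u| to infinity."""
    return 0.5 * math.erfc(abs(u) / math.sqrt(2))


def form_b(u):
    """Form B: area under the standard normal curve from 0 to |u|."""
    return 0.5 * math.erf(abs(u) / math.sqrt(2))


for u in (0.0, 1.0, 2.0, 4.0):
    print(f"|u| = {u:4.1f}  Form A = {form_a(u):.4g}  Form B = {form_b(u):.4g}")
```

For any u the two areas add up to 0.5, half of the total area under the curve, so either table can be derived from the other.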
Example
A tire is produced with the following statistics regarding usable mileage:
μ = 58000 miles
σ = 10000 miles
What mileage should be guaranteed so that less than 5% of the tires would need to be replaced?
Looking at Form A, we find that an area of 0.05 comes closest to |u| = 1.6. Now use the reduced variable equation with u = -1.6 (we want the value to be less than the mean).
-1.6 = (x - 58000) / 10000
x = 42000 miles
Using Excel
The same problem can be solved with a spreadsheet such as Excel, using the function NORMINV. It provides more accuracy because it is not limited to the resolution of the table.
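For readers without a spreadsheet, the same inverse lookup can be sketched in Python with `statistics.NormalDist` (standard library, Python 3.8+). This is offered as an equivalent calculation for the tire example, not as a description of how Excel computes NORMINV; the variable names are made up for illustration.

```python
from statistics import NormalDist

# Tire life from the example: mean 58000 miles, standard deviation 10000 miles.
tire_life = NormalDist(mu=58000, sigma=10000)

# Mileage below which 5% of tires are expected to fail.
guarantee = tire_life.inv_cdf(0.05)

# The corresponding reduced variable u, for comparison with the table lookup.
u = (guarantee - tire_life.mean) / tire_life.stdev

print(f"u = {u:.3f}, guaranteed mileage = {guarantee:.0f} miles")
```

The result is somewhat lower than the 42000 miles read from the table, because the table only resolves u to the nearest 0.2.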
Another example
You install a pH electrode to monitor a process stream. The manufacturer provided you with the following values regarding electrode lifetime:
μ = 8000 hours
σ = 200 hours
If you needed to replace the electrode after 7200 hours of use, was the electrode bad?
u = (7200 - 8000) / 200 = -4.0
Looking at Form A, we find that the probability at |u| = 4 is 3.2 x 10^-5. That means that only 0.0032% of all electrodes would fail at 7200 hours or earlier. You have a bad electrode.
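The electrode calculation can be checked the same way. This is a brief sketch with `statistics.NormalDist` from the standard library; the variable names are illustrative only.

```python
from statistics import NormalDist

# Electrode life from the example: mean 8000 hours, standard deviation 200 hours.
electrode_life = NormalDist(mu=8000, sigma=200)

# Reduced variable for a failure at 7200 hours.
u = (7200 - 8000) / 200

# Probability that an electrode fails at or before 7200 hours.
p_fail = electrode_life.cdf(7200)

print(f"u = {u:.1f}, P(failure by 7200 h) = {p_fail:.2e} ({100 * p_fail:.4f}%)")
```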
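Sample variance: $s^2 = \frac{\sum_{i} (x_i - \bar{x})^2}{n - 1}$
[Figure: distributions for females and males, with outliers marked; horizontal axis in standard deviation units from -3 to +3]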
Univariate tools
For a normal distribution:
Mean: numerical average of values.
Median: central tendency, the center value of ranked data. It is the actual value if N is odd and the average of the two center values if N is even.
Mode: most frequent value.
Ideally, all three values should be the same. If not the same, they should at least be very close.
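A quick way to compare the three measures is Python's `statistics` module. The readings below are hypothetical and chosen only to show the calls.

```python
import statistics

# Hypothetical replicate readings (N is even, so the median averages
# the two center values of the ranked data).
readings = [4, 5, 5, 6, 5, 4, 6, 5, 7, 3]

mean = statistics.mean(readings)
median = statistics.median(readings)
mode = statistics.mode(readings)

print(f"mean = {mean}, median = {median}, mode = {mode}")
```

For this roughly symmetric set all three agree; a large gap between them would be a first hint that the data are not normally distributed.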
Skewness
This is a test to see if a population is Gaussian.
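$$g = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^3}{N s_x^3}$$
[Figure: example distributions with g < 0 (tail to the left), g = 0 (symmetric), and g > 0 (tail to the right)]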
Kurtosis
Closely related to skewness. While skewness measures a bias in the distribution, kurtosis measures how flat a distribution is.
$$\text{kurtosis} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^4}{N s_x^4} - 3$$
Divide the skew (or kurtosis) figure by the standard error of the skew (kurtosis). If the value is greater than 1.96 or less than -1.96, your data are significantly skewed (kurtotic).
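The skewness and kurtosis tests can be sketched as follows. The kurtosis standard error sqrt(24/N) is the one given in these notes; the skewness standard error sqrt(6/N) is an assumption (a common large-sample approximation) because it does not appear in the extracted text. The function names are made up for illustration, and s is taken as the sample standard deviation with n - 1 in the denominator.

```python
import math


def skew_and_kurtosis(values):
    """Return (g, kurtosis), with 3 subtracted so a normal curve gives 0."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    g = sum((x - mean) ** 3 for x in values) / (n * s ** 3)
    kurt = sum((x - mean) ** 4 for x in values) / (n * s ** 4) - 3
    return g, kurt


def z_scores(values):
    """Return (Z for skewness, Z for kurtosis); |Z| > 1.96 is significant."""
    n = len(values)
    g, kurt = skew_and_kurtosis(values)
    se_skew = math.sqrt(6 / n)   # assumption: not stated in the notes
    se_kurt = math.sqrt(24 / n)  # from the notes
    return g / se_skew, kurt / se_kurt


# Hypothetical data: evenly spread values are symmetric but much flatter
# than a normal distribution.
flat_data = list(range(1, 101))
z_skew, z_kurt = z_scores(flat_data)
print(f"Z(skew) = {z_skew:.2f}, Z(kurtosis) = {z_kurt:.2f}")
```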
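Test statistic for skewness: $Z = \frac{g}{SE_{\text{skew}}}$
Standard error of kurtosis: $SE_{\text{kurtosis}} = \sqrt{\frac{24}{N}}$

Sample statistics:
Variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Standard deviation: $s = \sqrt{s^2}$
Degrees of freedom: $\phi = n - 1$
Relative standard deviation: $RSD = 100\,\frac{s_x}{\bar{x}}$
Coefficient of variation: $CV = \frac{s_x}{\bar{x}}$
Standard error of the mean: $s_{\bar{x}} = \frac{s_x}{\sqrt{n}}$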
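The sample statistics listed here are quick to compute together. A minimal sketch, assuming a small hypothetical set of replicate measurements; `sample_summary` is an illustrative name, and `statistics.variance` is the standard-library function that divides by n - 1.

```python
import math
import statistics


def sample_summary(values):
    """Sample variance, standard deviation, df, RSD, CV, and standard error."""
    n = len(values)
    mean = statistics.fmean(values)
    s2 = statistics.variance(values)  # divides by n - 1
    s = math.sqrt(s2)
    return {
        "mean": mean,
        "s2": s2,
        "s": s,
        "df": n - 1,
        "rsd_percent": 100 * s / mean,
        "cv": s / mean,
        "sem": s / math.sqrt(n),
    }


# Hypothetical replicate measurements.
summary = sample_summary([10.1, 9.9, 10.0, 10.2, 9.8])
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```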
Degrees of freedom
$\phi$ or df = n - (number of parameters)
Example: if you had 10 measurements, they could be used in pairs to obtain 9 independent estimates of the mean. A 10th pairing would reuse a pair that had already been counted. The same applies to the standard deviation.
When doing a linear regression fit of a line using X,Y data pairs, the model used (Y = mX + b) results in two parameters. The degrees of freedom would be N - 2 in this case (or model). So the model used determines the degrees of freedom.
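To show where the N - 2 enters, here is a hedged sketch of an ordinary least-squares line fit that reports the degrees of freedom and uses them to scale the residual scatter. The data points are hypothetical, `fit_line` is an illustrative name, and the residual standard deviation is included only as one place where the degrees of freedom are used; it is not discussed in these notes.

```python
import math


def fit_line(xs, ys):
    """Least-squares fit of Y = mX + b; returns (m, b, df, residual sd)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = y_bar - m * x_bar
    df = n - 2  # two fitted parameters: slope m and intercept b
    residual_ss = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    s_res = math.sqrt(residual_ss / df)
    return m, b, df, s_res


# Hypothetical calibration points.
xs = [0, 1, 2, 3, 4]
ys = [0.1, 1.9, 4.2, 5.8, 8.0]
m, b, df, s_res = fit_line(xs, ys)
print(f"m = {m:.3f}, b = {b:.3f}, df = {df}, residual sd = {s_res:.3f}")
```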
Pooled statistics
In many cases it is necessary to combine results:
- values from separate labs
- data collected on separate days
- a different instrument was used
- a different method of analysis was used
When we combine the data, we refer to this as pooling the data.
We can't simply combine all of the values and calculate the mean and other statistical values. There may have been differences in the results obtained. There might have been different numbers of samples collected in each set. We also would like a way to tell if the results are significantly different.
$$s_p = \sqrt{\frac{df_1 s_1^2 + df_2 s_2^2 + \dots + df_k s_k^2}{df_1 + df_2 + \dots + df_k}}$$
The pooled standard deviation is weighted by the degrees of freedom. This accounts for the number of data points and any parameters used in obtaining the results.
Example
Set    s       n     φ     s^2     φ s^2
1      1.35    10    9     1.82    16.4
2      2.00    7     6     4.00    24.0
3      2.45    6     5     6.00    30.0
4      1.55    12    11    2.40    26.4

$s_p = \sqrt{\frac{16.4 + 24.0 + 30.0 + 26.4}{9 + 6 + 5 + 11}} = \sqrt{\frac{96.8}{31}} = 1.77$
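The pooled calculation in this example can be reproduced with a few lines. This is a minimal sketch; `pooled_std` and its `n_params` argument are made-up names, and the default of one parameter per set (the mean) is an assumption that matches the φ = n - 1 column in the table.

```python
import math


def pooled_std(std_devs, sizes, n_params=1):
    """Pooled standard deviation, weighting each variance by its df."""
    dfs = [n - n_params for n in sizes]  # degrees of freedom per set
    weighted = sum(df * s ** 2 for df, s in zip(dfs, std_devs))
    return math.sqrt(weighted / sum(dfs))


# Values from the four sets in the example table.
s_values = [1.35, 2.00, 2.45, 1.55]
n_values = [10, 7, 6, 12]

s_p = pooled_std(s_values, n_values)
print(f"pooled standard deviation = {s_p:.2f}")
```

Because each variance is weighted by its degrees of freedom, the larger sets (1 and 4) pull the pooled value toward their own standard deviations more than the smaller sets do.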