PSM Lec-2


Statistics & Probability Theory

Engr. Shakir Ahmad


Lecturer
Department of Civil Engineering
COMSATS University Islamabad,
Sahiwal Campus

1
Errors of Measurements
⚫ Measurements of a continuous variable depend on the
method of measurement, the instruments used, etc.
⚫ A weight recorded as 60.00 kg means the true
weight is known to lie between 59.995 and
60.005 kg.

2
Types of Errors
⚫ Random or unbiased errors
   ⚫ Compensating errors
   ⚫ Accidental errors
⚫ Non-random or biased errors
   ⚫ Cumulative errors
   ⚫ Systematic errors

3
Types of Errors
Random or unbiased error:
⚫ This arises from the random selection of the sample. The mean of
such errors is 0, because positive and negative deviations cancel
out. Random error is also referred to as random deviation and is
measured by the standard deviation of the estimator.
Non-random or biased error:
⚫ This arises from several sources, such as human or machine error and
mistakes in copying, punching or recording. Through
careful planning we should try to avoid or minimize this error.

4
Sampling & Non-sampling Errors
⚫ Sampling error: the difference between the
estimated sample value and the true
population value.
\[ \text{Sampling error} = \bar{X} - \mu \]
⚫ Where \bar{X} is the estimated sample value and
µ is the true population value.
⚫ Non-sampling error: arises in the process of data
collection, even if a complete count is carried out.

5
Types of Samples
⚫ Probability (Random) Samples
   ⚫ Systematic random sample
   ⚫ Stratified random sample
   ⚫ Cluster sample

⚫ Non-Probability Samples (Non-random sampling)
   ⚫ Convenience sample
   ⚫ Purposive sample
   ⚫ Quota sample

6
Two general approaches to sampling are used in social
science research. With probability sampling, all
elements (e.g., persons, households) in the population
have some opportunity of being included in the sample,
and the mathematical probability that any one of them
will be selected can be calculated.

With nonprobability sampling, in contrast, population
elements are selected on the basis of their availability
or because of the researcher's personal judgment that
they are representative. The consequence is that an
unknown portion of the population is excluded.

7
One of the most common types of nonprobability
sample is called a convenience sample – not because
such samples are necessarily easy to recruit, but
because the researcher uses whatever individuals are
available rather than selecting from the entire
population.
Because some members of the population have no
chance of being sampled, the extent to which a
convenience sample – regardless of its size – actually
represents the entire population cannot be known.

8
The main difference between stratified and cluster
sampling is that in stratified sampling all the strata need
to be sampled.

In cluster sampling one proceeds by first selecting a
number of clusters at random and then either sampling within
each selected cluster or conducting a census of it. Usually
not all clusters are included.

9
Data Representation
⚫ Frequency
⚫ Bar chart
⚫ Pie chart
⚫ Simple bar chart
⚫ Bi-variate frequency table
⚫ Multiple bar chart
⚫ Relative frequency distribution
⚫ Cumulative frequency distribution

Data Representation
X       Frequency   Relative frequency       Cumulative frequency
3       1           1/45 x 100 = 2.22%       1
4       3           3/45 x 100 = 6.67%       1 + 3 = 4
5       9           9/45 x 100 = 20.00%      4 + 9 = 13
6       13          13/45 x 100 = 28.89%     13 + 13 = 26
7       10          10/45 x 100 = 22.22%     26 + 10 = 36
8       3           3/45 x 100 = 6.67%       36 + 3 = 39
9       6           6/45 x 100 = 13.33%      39 + 6 = 45
Total   45
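
The relative and cumulative frequencies in this table can be reproduced with a few lines of Python. This is only an illustrative sketch; the variable names are assumptions and are not part of the lecture.

```python
from itertools import accumulate

# Values of X and their frequencies, copied from the table.
values = [3, 4, 5, 6, 7, 8, 9]
freqs = [1, 3, 9, 13, 10, 3, 6]

n = sum(freqs)  # total number of observations

# Relative frequency of each value as a percentage of the total.
relative_pct = [round(100 * f / n, 2) for f in freqs]

# Cumulative frequency: running total of the frequencies.
cumulative = list(accumulate(freqs))

for x, f, r, c in zip(values, freqs, relative_pct, cumulative):
    print(f"{x:>2} {f:>3} {r:>6.2f}% {c:>3}")
```

The last cumulative frequency must equal the total number of observations, which is a quick check on the arithmetic.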

14
Types of Frequency curves
⚫ Symmetrical frequency curve
⚫ Moderately skewed frequency curve
⚫ Extremely skewed frequency curve
⚫ Uniform distribution
⚫ U-shaped frequency curve
[Figure: symmetrical frequency curve, f against x]

15
Types of Frequency curves
[Figures, f against X: positively skewed curve; negatively skewed curve; extremely positively skewed curve; extremely negatively skewed curve]
Types of Frequency curves
Extremely negatively skewed:
Age group   No. of deaths per thousand
20-29       2.1
30-39       4.3
40-49       5.7
50-59       8.9
60-69       15
70-79       23

Extremely positively skewed:
No. of 6s   Frequency
0           28
1           17
2           4
3           1
Total       50

17
Types of Frequency curves
[Figures, f against X: U-shaped frequency curve; uniform frequency curve]
Example – Frequency Distribution
EPA Mileage Ratings on 30 Cars (MPG)
36.3 42.1 44.9
30.1 37.5 32.9
40.5 40.0 40.2
36.2 35.6 35.9
38.5 38.8 38.6
36.3 38.4 40.5
41 39.0 37.0
37 36.7 37.1
37.1 34.8 33.9
39.9 38.1 39.8

EPA: Environmental Protection Agency


Example – Frequency Distribution
⚫ Range = Xm – X0 = 44.9 – 30.1 = 14.8
⚫ Class interval = h = 14.8/5 = 2.96 ≈ 3
Classes
Class number   Lower class limit   Lower limit   Upper limit
1              30.0                30.0          32.9
2              30.0 + 3 = 33       33.0          35.9
3              33.0 + 3 = 36       36.0          38.9
4              36.0 + 3 = 39       39.0          41.9
5              39.0 + 3 = 42       42.0          44.9
Example – Frequency Distribution
Class limits    Class boundaries   Frequency   Relative frequency   Percentage frequency
30.0 – 32.9     29.95 – 32.95      2           2/30 = 0.067         6.7
33.0 – 35.9     32.95 – 35.95      4           4/30 = 0.133         13.3
36.0 – 38.9     35.95 – 38.95      14          14/30 = 0.467        46.7
39.0 – 41.9     38.95 – 41.95      8           8/30 = 0.267         26.7
42.0 – 44.9     41.95 – 44.95      2           2/30 = 0.067         6.7
Total                              30
The end points of a class interval are called the
class boundaries. Left-end inclusion is appropriate: a value that falls exactly on a boundary is counted in the class for which it is the lower boundary.
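
The construction of this frequency distribution (range, class interval, class boundaries, tally) can be sketched in Python as follows. The choice of 5 classes and the starting limit 30.0 are taken from the slides; the variable names are assumptions made for illustration.

```python
# EPA mileage ratings on 30 cars (MPG), as listed in the example.
data = [36.3, 42.1, 44.9, 30.1, 37.5, 32.9, 40.5, 40.0, 40.2, 36.2,
        35.6, 35.9, 38.5, 38.8, 38.6, 36.3, 38.4, 40.5, 41.0, 39.0,
        37.0, 37.0, 36.7, 37.1, 37.1, 34.8, 33.9, 39.9, 38.1, 39.8]

n = len(data)
data_range = max(data) - min(data)      # 44.9 - 30.1
num_classes = 5                         # number of classes used in the slides
h = round(data_range / num_classes)     # class interval, 2.96 rounded to 3
first_lower_limit = 30.0                # starting limit chosen in the slides

# Class boundaries lie 0.05 below each lower limit (data have one decimal).
boundaries = [first_lower_limit - 0.05 + i * h for i in range(num_classes + 1)]

# Tally with left-end inclusion: boundary[i] <= x < boundary[i + 1].
freqs = [0] * num_classes
for x in data:
    for i in range(num_classes):
        if boundaries[i] <= x < boundaries[i + 1]:
            freqs[i] += 1
            break

relative = [f / n for f in freqs]
print(freqs)
print([round(r, 3) for r in relative])
```

Because the data are recorded to one decimal place, no observation can fall exactly on a boundary such as 32.95, so the inclusion rule never has to break a tie here.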
Example – Frequency Distribution
Class boundaries   Mid point (X)   Frequency   Cumulative frequency
29.95 – 32.95      31.45           2           2
32.95 – 35.95      34.45           4           6
35.95 – 38.95      37.45           14          20
38.95 – 41.95      40.45           8           28
41.95 – 44.95      43.45           2           30
Total                              30

22
Example – Frequency Distribution
[Figure: histogram, No. of cars against MPG, class boundaries 29.95 to 44.95]
[Figure: frequency distribution curve, number of cars against miles per gallon, plotted at the class midpoints]

Cumulative frequency curve
[Figure: cumulative frequency curve, No. of cars against MPG]

Class boundaries   Frequency   Cumulative frequency
26.95 – 29.95      0           0
29.95 – 32.95      2           2
32.95 – 35.95      4           6
35.95 – 38.95      14          20
38.95 – 41.95      8           28
41.95 – 44.95      2           30
The number of class intervals chosen should be a trade-off between (1)
choosing too few classes at a cost of losing too much information about
the actual data in a class and (2) choosing too many classes which will
result in the frequencies of each class being too small for a pattern to be
discernible. Although 5-10 class intervals are typical, the appropriate
number is a subjective choice.
A rule of thumb is to set the number of classes nc = √n, or an integer close to this, but it should be at
least 5 and not greater than 25.
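
As a rough illustration of this rule of thumb, the sketch below rounds √n and clamps the result to the range 5 to 25. The function name and the use of rounding (rather than, say, truncation) are assumptions; the slides only say "an integer close to this".

```python
import math

def suggested_class_count(n, lower=5, upper=25):
    """Rule-of-thumb number of classes: about sqrt(n), kept within [lower, upper]."""
    nc = round(math.sqrt(n))
    return max(lower, min(upper, nc))

# For the 30 mileage ratings, sqrt(30) is about 5.48, so 5 classes are suggested.
print(suggested_class_count(30))
```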

24
DIFFERENT TYPES OF PLOTS

⚫ Point plot: the horizontal axis (x-axis) covers the range
of the data values, and the points are plotted vertically, stacking
any repeated values.
⚫ Time series plot: the x-axis corresponds to the number of
the observation, the time of the observation, the day
and so on, and the y-axis corresponds to the value of
the observation.
⚫ Histogram: this is a bar graph in which the data are
grouped into classes. The x-axis corresponds to the
classes and the y-axis gives the frequency of the
observations.

25
[Figure: time-series plot]
[Figure: histogram]

Averages
⚫ Types
⚫ Arithmetic mean
⚫ Geometric mean
⚫ Harmonic mean
⚫ Median
⚫ Mode
⚫ Arithmetic, geometric and harmonic means
are mathematical in character and give the
magnitude of the observed values.

28
MEASURES OF LOCATION
⚫ Mean: used very often in analyzing data.
⚫ Although this is a common measure, if the data vary
greatly the mean may take a non-typical value and could
be misleading.
⚫ Median: the halfway point of the data; it tells us
something about the location of the distribution of the data.
⚫ Mode: if it exists, the data value that occurs most
frequently.
⚫ It is possible for a set of data to have 0, 1 or more modes.

29
LOCATION (cont’d)
⚫ Mean and median always exist.
⚫ Mode need not exist.
⚫ Median and mode are less sensitive to extreme
observations.
⚫ Mean is most widely used.
⚫ There are some data sets for which the median or mode may
be more appropriate than the mean.

30
Mode for discrete variable
No. of Passengers (X) No. of Flights (f)
28 1
33 1
34 2
35 3
36 5
37 7
38 10
39 13
40 8
total 50

Highest frequency = 13, hence mode = 39
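
For a discrete variable the mode is simply the value with the highest frequency. A minimal sketch using the flights table (the dictionary name is an assumption):

```python
# No. of passengers (X) mapped to No. of flights (f), from the table.
flights = {28: 1, 33: 1, 34: 2, 35: 3, 36: 5, 37: 7, 38: 10, 39: 13, 40: 8}

highest = max(flights.values())

# Collect every value that attains the highest frequency, since a data set
# may have more than one mode.
modes = [x for x, f in flights.items() if f == highest]

print(highest, modes)
```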
Mode for continuous variable
\[ \hat{X} = l + \frac{f_m - f_1}{(f_m - f_1) + (f_m - f_2)} \times h \]

Where
l = lower class boundary of the modal class
fm = frequency of the modal class
f1 = frequency of the class preceding the modal class
f2 = frequency of the class following the modal class
h = length of class interval of the modal class

32
Example: EPA Mileage rating
Class limits    Class boundaries   Frequency
30.0 – 32.9     29.95 – 32.95      2
33.0 – 35.9     32.95 – 35.95      4
36.0 – 38.9     35.95 – 38.95      14
39.0 – 41.9     38.95 – 41.95      8
42.0 – 44.9     41.95 – 44.95      2
Total                              30

Where l = 35.95, fm = 14, f1 = 4, f2 = 8 and h = 3:
\[ \hat{X} = 35.95 + \frac{14 - 4}{(14 - 4) + (14 - 8)} \times 3 = 37.825 \]
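
The grouped-data mode formula can be checked with a short Python sketch. The helper name `grouped_mode` is hypothetical, and treating a missing neighbouring class as frequency 0 is an assumption for the case where the modal class is the first or last one.

```python
def grouped_mode(boundaries, freqs):
    """Mode of grouped data: l + (fm - f1) / ((fm - f1) + (fm - f2)) * h."""
    m = max(range(len(freqs)), key=lambda i: freqs[i])   # index of the modal class
    f_m = freqs[m]
    f_1 = freqs[m - 1] if m > 0 else 0                   # class preceding the modal class
    f_2 = freqs[m + 1] if m < len(freqs) - 1 else 0      # class following the modal class
    l = boundaries[m]                                    # lower boundary of the modal class
    h = boundaries[m + 1] - boundaries[m]                # class interval
    return l + (f_m - f_1) / ((f_m - f_1) + (f_m - f_2)) * h

boundaries = [29.95, 32.95, 35.95, 38.95, 41.95, 44.95]
freqs = [2, 4, 14, 8, 2]
mode = grouped_mode(boundaries, freqs)
print(round(mode, 3))
```

The sketch does not guard against the degenerate case where the modal class and both neighbours have equal frequencies, which would make the denominator zero.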
Example – Mode usage
⚫ Suppose the manager of a men's clothing
store is asked about the average size of hats
sold. He will probably think not of the
arithmetic or geometric mean size, or indeed
the median size. Instead, he will in all
likelihood quote the particular size that is
sold most often. This average is of far more
use to him as a businessman than the
arithmetic mean, the geometric mean or the median.
⚫ The modal size of any item of clothing is the size
that the businessman must stock in the greatest
quantity and variety in comparison with other
sizes.
Arithmetic Mean
⚫ Discrete variable
\[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \]
Where n represents the number of observations in the sample.

⚫ Continuous variable
\[ \bar{X} = \frac{\sum_{i=1}^{k} f_i X_i}{\sum_{i=1}^{k} f_i} = \frac{\sum_{i=1}^{k} f_i X_i}{n} \]

36
Example-1
Mid point (X)   Frequency (f)   fX
31.45           2               62.9
34.45           4               137.8
37.45           14              524.3
40.45           8               323.6
43.45           2               86.9
Total           30              1135.5

\[ \bar{X} = \frac{1135.5}{30} = 37.85 \]
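
The same grouped mean can be computed directly from the midpoints and frequencies. A minimal sketch, with variable names chosen only for illustration:

```python
midpoints = [31.45, 34.45, 37.45, 40.45, 43.45]
freqs = [2, 4, 14, 8, 2]

# Sum of f * X over all classes, divided by the total frequency n.
sum_fx = sum(f * x for f, x in zip(freqs, midpoints))
n = sum(freqs)
mean = sum_fx / n

print(round(sum_fx, 1), n, round(mean, 2))
```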

37
Example-2
⚫ Suppose that in a particular high school
there are:
100 – Freshmen
80 – Sophomores
70 – Juniors
50 – Seniors
⚫ And suppose that on a given day 15% of
freshmen, 5% of sophomores, 10% of juniors and
2% of seniors are absent. What percentage of
students is absent for the school as a whole
on that particular day?
38
Example-2- Solution
15 5 10  2
ArithmeticMean 4 
8
Category No. of absent
of student
students s
Freshman 100 15

Sophomores 80 4 (27 x 100)/300= 9

Juniors 70 7

seniors 50 1

total 300 27

39
Weighted Average
Category of students   Percentage absent (Xi)   No. of students (Wi)   WiXi
Freshmen               15                       100                    1500
Sophomores             5                        80                     400
Juniors                10                       70                     700
Seniors                2                        50                     100
Total                  -                        300                    2700

\[ \bar{X}_w = \frac{\sum W_i X_i}{\sum W_i} = \frac{2700}{300} = 9 \]
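
The difference between the simple and the weighted average for the absence example can be sketched as follows (variable names are assumptions):

```python
# Percentage absent (Xi) and number of students (Wi) for each category.
percent_absent = [15, 5, 10, 2]     # freshmen, sophomores, juniors, seniors
students = [100, 80, 70, 50]

# Simple mean treats the four categories as equally large.
simple_mean = sum(percent_absent) / len(percent_absent)

# Weighted mean uses the number of students in each category as the weight.
weighted_mean = (sum(w * x for w, x in zip(students, percent_absent))
                 / sum(students))

print(simple_mean, weighted_mean)
```

The weighted mean agrees with the direct count of 27 absent students out of 300, whereas the simple mean does not.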
Median
⚫ The median is the middle value of the series
when the variable values are placed in order
of magnitude.

⚫ Example
Number of passengers traveling on a bus at six
different times during the day
4 9 14 18 23 47

Median = (14+18)/2 = 16 passengers

41
Example- Median
No. of students per class   Number of classes
23                          1
24                          0
25                          1
26                          3
27                          6
28                          9
29                          8
30                          10
31                          7
Total                       45

Observations in ascending order:
23, 25, 26, 26, 26, 27, 27, 27, 27, 27, 27,
28, 28, 28, 28, 28, 28, 28, 28, 28,
29, 29, 29, 29, 29, 29, 29, 29,
30, 30, 30, 30, 30, 30, 30, 30, 30, 30,
31, 31, 31, 31, 31, 31, 31

Median = 23rd value = 29
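
For a discrete frequency table like this one, the median can be found by expanding the table into the ordered list of observations. A small sketch, assuming the dictionary name `classes` for illustration:

```python
from statistics import median

# No. of students per class mapped to the number of classes with that size.
classes = {23: 1, 24: 0, 25: 1, 26: 3, 27: 6, 28: 9, 29: 8, 30: 10, 31: 7}

# Repeat each value as many times as its frequency, in ascending order.
observations = [x for x, f in sorted(classes.items()) for _ in range(f)]

n = len(observations)
middle_position = (n + 1) // 2      # position of the median for odd n (1-based)
med = median(observations)

print(n, middle_position, med)
```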
Median- Continuous Variable
\[ \tilde{X} = l + \frac{h}{f}\left(\frac{n}{2} - c\right) \]

Where
l = lower class boundary of the median class (i.e. that class for
which the cumulative frequency is just in excess of n/2)
h = class interval size of the median class
f = frequency of the median class
n = total number of observations
c = cumulative frequency of the class preceding the median class

43
Example – EPA Mileage Rating
Class boundaries   Frequency   Cumulative frequency
29.95 – 32.95      2           2
32.95 – 35.95      4           6
35.95 – 38.95      14          20
38.95 – 41.95      8           28
41.95 – 44.95      2           30
Total              30

\[ \tilde{X} = l + \frac{h}{f}\left(\frac{n}{2} - c\right) = 35.95 + \frac{3}{14}(15 - 6) = 37.88 \approx 37.9 \]
Example – EPA Mileage Rating

Arithmetic mean: \bar{X} = 37.85
Median: \tilde{X} = 37.88
Mode: \hat{X} = 37.825

45
Partitioning of Distributions
⚫ Median (divides into two parts)
⚫ Quartiles (divide into four parts)
⚫ Deciles (ten divisions)
⚫ Percentiles (specific percentage)

46
Quartiles (divides into four parts)
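First quartile
\[ Q_1 = l + \frac{h}{f}\left(\frac{n}{4} - c\right) \]

Second quartile (median)
\[ Q_2 = l + \frac{h}{f}\left(\frac{2n}{4} - c\right) = l + \frac{h}{f}\left(\frac{n}{2} - c\right) \]

Third quartile
\[ Q_3 = l + \frac{h}{f}\left(\frac{3n}{4} - c\right) \]

[Figure: distribution divided into four equal parts at Q1, Q2 and Q3]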
Deciles and Percentiles
The deciles and the percentiles divide the total
area into 10 and 100 equal parts respectively.

The formula for the 1st decile is
\[ D_1 = l + \frac{h}{f}\left(\frac{n}{10} - c\right) \]
The formulas for the subsequent deciles are
\[ D_2 = l + \frac{h}{f}\left(\frac{2n}{10} - c\right) \]
\[ D_3 = l + \frac{h}{f}\left(\frac{3n}{10} - c\right) \]
and so on. It is easily seen that the 5th decile is the same quantity
as the median.
Deciles and Percentiles

The formula for the 1st percentile is
\[ P_1 = l + \frac{h}{f}\left(\frac{n}{100} - c\right) \]
The formulas for the subsequent percentiles are
\[ P_2 = l + \frac{h}{f}\left(\frac{2n}{100} - c\right) \]
\[ P_3 = l + \frac{h}{f}\left(\frac{3n}{100} - c\right) \]
and so on. It is easily seen that the 50th percentile is the
same quantity as the median.
Example
Class interval   Frequency
20 – 29          6
30 – 39          18
40 – 49          11
50 – 59          11
60 – 69          3
70 – 79          1

Here n = 50 and h = 10.
\[ Q_1 = l + \frac{h}{f}\left(\frac{n}{4} - c\right) = 29.5 + \frac{10}{18}(12.5 - 6) = 33.1 \]
\[ P_{17} = l + \frac{h}{f}\left(\frac{17n}{100} - c\right) = 29.5 + \frac{10}{18}(8.5 - 6) = 30.9 \]
\[ D_6 = l + \frac{h}{f}\left(\frac{6n}{10} - c\right) = 39.5 + \frac{10}{11}(30 - 24) = 44.95 \]
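
Quartiles, deciles and percentiles all use the same formula with a different fraction of n, so one helper covers the three calculations in this example. The function name `grouped_quantile` and its argument `p` (the fraction of observations below the quantile) are assumptions made for this sketch.

```python
from itertools import accumulate

def grouped_quantile(lower_boundaries, h, freqs, p):
    """Quantile of grouped data: l + (h / f) * (p * n - c)."""
    n = sum(freqs)
    target = p * n
    cumulative = list(accumulate(freqs))
    # Quantile class: first class whose cumulative frequency reaches p * n.
    k = next(i for i, c in enumerate(cumulative) if c >= target)
    c = cumulative[k - 1] if k > 0 else 0   # cumulative frequency before that class
    return lower_boundaries[k] + (h / freqs[k]) * (target - c)

# Class intervals 20-29, ..., 70-79 have lower boundaries 19.5, ..., 69.5.
lower_boundaries = [19.5, 29.5, 39.5, 49.5, 59.5, 69.5]
freqs = [6, 18, 11, 11, 3, 1]
h = 10

q1 = grouped_quantile(lower_boundaries, h, freqs, 1 / 4)
p17 = grouped_quantile(lower_boundaries, h, freqs, 17 / 100)
d6 = grouped_quantile(lower_boundaries, h, freqs, 6 / 10)
print(round(q1, 2), round(p17, 2), round(d6, 2))
```

Calling the helper with p = 1/2 gives the median, which is why the 5th decile and the 50th percentile coincide with it.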
Significance
If oil company “A” reports that its yearly sales are at the 90th
percentile of all companies in the industry, the implication
is that 90% of all oil companies have yearly sales less than
company A’s, and only 10 % have yearly sales exceeding
company A’s.

[Figure: relative frequency curve of yearly sales; 0.90 of the area lies below company A's sales and 0.10 above]
Significance
Cumulative frequency curve
[Figure: cumulative frequency curve, No. of cars against MPG; Q1 and Q3 are read off the MPG axis at cumulative frequencies n/4 and 3n/4]
Geometric Mean
The geometric mean ‘G’ of a set of n positive values X1, X2,
…, Xn is defined as the positive nth root of their product.

\[ G = \sqrt[n]{X_1 X_2 \cdots X_n}, \quad X_i > 0 \]

When n is large the computation of the geometric mean becomes
laborious, as we have to extract the nth root of the product of all
the values.
The computation of the geometric mean is simplified by the use of logarithms.

\[ \log G = \frac{1}{n}\left(\log X_1 + \log X_2 + \cdots + \log X_n\right) = \frac{\sum \log X}{n} \]
Geometric Mean for group data
In case of a frequency distribution having k classes with midpoints
X1, X2, …, Xk and the corresponding frequencies f1, f2, …, fk
the geometric mean is given by

\[ G = \sqrt[n]{X_1^{f_1} X_2^{f_2} \cdots X_k^{f_k}}, \quad X_i > 0 \]

In terms of logarithms the formula becomes

\[ \log G = \frac{1}{n}\left(f_1 \log X_1 + f_2 \log X_2 + \cdots + f_k \log X_k\right) = \frac{\sum f \log X}{n} \]
Example – EPA mileage Rating
Mileage rating   No. of cars (f)   Class midpoint (X)   log X    f log X
30 – 32.9        2                 31.45                1.4976   2.9952
33 – 35.9        4                 34.45                1.5372   6.1488
36 – 38.9        14                37.45                1.5735   22.0290
39 – 41.9        8                 40.45                1.6069   12.8552
42 – 44.9        2                 43.45                1.6380   3.2760
Total            30                                              47.3042

G = antilog (47.3042/30) = antilog 1.5768 = 37.74
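
The logarithmic form of the grouped geometric mean can be verified with base-10 logarithms, as in the table. A minimal sketch; the variable names are assumptions.

```python
import math

midpoints = [31.45, 34.45, 37.45, 40.45, 43.45]
freqs = [2, 4, 14, 8, 2]

n = sum(freqs)

# Sum of f * log10(X) over all classes.
sum_f_log_x = sum(f * math.log10(x) for f, x in zip(freqs, midpoints))

log_g = sum_f_log_x / n
g = 10 ** log_g          # taking the antilog recovers G

print(round(sum_f_log_x, 4), round(log_g, 4), round(g, 2))
```

Small differences from the hand calculation in the last decimal place are expected, because the table rounds each logarithm to four places before multiplying.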
Harmonic Mean
The harmonic mean is defined as the reciprocal of the arithmetic mean
of the reciprocals of the values.
In case of raw data:
\[ H.M. = \frac{n}{\sum \left(\frac{1}{X}\right)} \]
In case of grouped data (data grouped into a frequency
distribution):
\[ H.M. = \frac{n}{\sum f\left(\frac{1}{X}\right)} \]
Where X represents the midpoints of the various
classes.
56
Example – Harmonic Mean
• Suppose a car travels 100 miles with 10
stops, each stop after an interval of 10 miles.
• Suppose that the speeds at which the car travels
these 10 intervals are 30, 35, 40, 40, 45, 40, 50, 55, 55
and 30 miles per hour respectively.

\[ \text{Arithmetic mean} = \frac{30 + 35 + \cdots + 30}{10} = \frac{420}{10} = 42 \text{ mph} \]

57
Example – Harmonic Mean (cont’d)
interval Distance Speed = (D/t) Time = (D/S)
1 10 miles 30 mph 10/30 = 0.3333 hrs
2 10 miles 35 mph 10/35 = 0.2857 hrs
3 10 miles 40 mph 10/40 = 0.2500 hrs
4 10 miles 40 mph 10/40 = 0.2500 hrs
5 10 miles 45 mph 10/45 = 0.2222 hrs
6 10 miles 40 mph 10/40 = 0.2500 hrs
7 10 miles 50 mph 10/50 = 0.2000 hrs
8 10 miles 55 mph 10/55 = 0.1818 hrs
9 10 miles 55 mph 10/55 = 0.1818 hrs
10 10 miles 30 mph 10/30 = 0.3333 hrs
Total 100 miles Total Time = 2.4881 hrs

Hence average speed = 100/2.4881 = 40.2 mph
Example – Harmonic Mean (cont’d)
X        1/X
30 mph   1/30 = 0.0333
35 mph   1/35 = 0.0286
40 mph   1/40 = 0.0250
40 mph   1/40 = 0.0250
45 mph   1/45 = 0.0222
40 mph   1/40 = 0.0250
50 mph   1/50 = 0.0200
55 mph   1/55 = 0.0182
55 mph   1/55 = 0.0182
30 mph   1/30 = 0.0333
Total    0.2488

\[ H.M. = \frac{n}{\sum \left(\frac{1}{X}\right)} = \frac{10}{0.2488} = 40.2 \text{ mph} \]

Hence it is clear that the harmonic mean gives the correct result.
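
The car example can be checked numerically: the true average speed (total distance over total time) matches the harmonic mean of the ten speeds, not their arithmetic mean. This sketch uses `statistics.harmonic_mean` from the Python standard library as a cross-check; the other names are assumptions.

```python
from statistics import harmonic_mean

# Speed (mph) over each of the ten 10-mile intervals, from the table.
speeds = [30, 35, 40, 40, 45, 40, 50, 55, 55, 30]
interval_miles = 10

arithmetic = sum(speeds) / len(speeds)

# True average speed: total distance divided by total time.
total_time = sum(interval_miles / s for s in speeds)
average_speed = interval_miles * len(speeds) / total_time

# Harmonic mean from the formula n / sum(1/X).
hm = len(speeds) / sum(1 / s for s in speeds)

print(arithmetic, round(average_speed, 1), round(hm, 1))
print(round(harmonic_mean(speeds), 1))   # library cross-check
```

Here the distance per interval is constant and the time varies, which is the situation in which the harmonic mean is the appropriate average of the speeds.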
59
Rules
⚫ When values are given as x per y where x
is constant and y is variable, the Harmonic
mean is the appropriate average to use.
⚫ When values are given as x per y where y
is constant and x is variable, the Arithmetic
mean is the appropriate average to use.
⚫ When relative changes in some variable
quantity are to be averaged, the geometric
mean is the appropriate average to use.

60
LOCATION
⚫ Symmetric: mean = median = mode
⚫ Positively skewed: tail to the right; mean > median
⚫ Negatively skewed: tail to the left; median > mean
⚫ For skewed data median is preferred to the mean.

61
