1 - 3 - 4 - Class1 - Descriptive Statistics - 4slines - 1trang
Some Important Definitions (2)
• Parameter vs. Statistic
• Parameter:
Visualization of Population and Sample (1)
The sample is a subset of the population.
Visualization of Population and Sample (2)
[Figure: SAMPLE a and SAMPLE b are drawn from the POPULATION by sampling; Statistic a and Statistic b, computed from the samples, are used for estimation of the population Parameter]
Visualization of Population and Sample (3)
Discussion
Statistical Processes
• Descriptive Statistics
• Inferential Statistics
• Hypothesis Testing
• Statistical models
Descriptive Statistics
• Collect Data
– Survey
• Present Data
– Tables
– Graphs
• Characterize Data
– Mean
– Standard Deviation
Inferential Statistics
• Estimation
– For example, obtain a confidence interval estimate
of the population mean using the sample data
• Hypothesis Testing
– For example, test the claim that the population
mean weight is 120 pounds
Estimation
• Point Estimate
– Single sample statistic that is used to estimate the
true value of a population parameter
• Interval Estimate
– Interval with a specified confidence or probability
of correctly estimating the true value of the
population parameter
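The two kinds of estimate can be sketched in a few lines of Python. The sample values below are hypothetical, and the interval uses a normal approximation, which is an assumption made for brevity (a t-based interval would be somewhat wider for a sample this small).

```python
from statistics import NormalDist, mean, stdev

# Hypothetical sample of weights in pounds (illustrative values only).
sample = [118, 125, 131, 109, 122, 127, 115, 120, 133, 124]

n = len(sample)
point_estimate = mean(sample)              # point estimate of the population mean
standard_error = stdev(sample) / n ** 0.5  # sample standard deviation / sqrt(n)

# 95% interval estimate under the normal approximation (assumption).
z = NormalDist().inv_cdf(0.975)
lower = point_estimate - z * standard_error
upper = point_estimate + z * standard_error

print(f"point estimate: {point_estimate:.1f}")
print(f"95% interval estimate: ({lower:.1f}, {upper:.1f})")
```

The point estimate is a single number, while the interval estimate reports a range together with a stated confidence level.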
Hypothesis Testing
• Make inferences about a population
parameter by analyzing differences between
the results we observe (our sample statistic)
and the results we expect to obtain if some
underlying hypothesis is actually true.
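As a rough illustration of this idea, the sketch below tests the claim that a population mean weight is 120 pounds. The data are hypothetical, and the test statistic is compared against a normal distribution, which is a simplifying assumption rather than the procedure prescribed by the course.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical sample of weights in pounds (illustrative values only).
sample = [118, 125, 131, 109, 122, 127, 115, 120, 133, 124]
claimed_mean = 120  # value expected if the underlying hypothesis is true

n = len(sample)
standard_error = stdev(sample) / n ** 0.5

# Difference between what we observe and what we expect, in standard errors.
z_stat = (mean(sample) - claimed_mean) / standard_error

# Two-sided p-value under the normal approximation (assumption).
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
reject_claim = p_value < 0.05

print(f"z = {z_stat:.2f}, p-value = {p_value:.3f}, reject claim: {reject_claim}")
```

A small p-value would indicate that the observed sample mean is unlikely if the claimed mean were true.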
Statistical models
• Regression Analysis
– Simple Linear Regression Analysis
– Multiple Regression Analysis
• Time Series Analysis
Regression Analysis
• Used for the purpose of prediction or
explanation.
• To develop a statistical model that can be used
to predict or explain the value of a dependent
(response) variable based on the values of at
least one independent (explanatory) variable.
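A minimal sketch of simple linear regression with one independent variable follows. It fits the line by ordinary least squares, which is assumed here as the estimation method, and the data points are hypothetical.

```python
from statistics import mean

# Hypothetical data: x is the independent (explanatory) variable,
# y is the dependent (response) variable.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

x_bar, y_bar = mean(x), mean(y)

# Least-squares slope and intercept.
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
intercept = y_bar - slope * x_bar

def predict(x_new):
    """Predict the response for a new value of the explanatory variable."""
    return intercept + slope * x_new

print(f"y = {intercept:.2f} + {slope:.2f} x")
print(f"prediction at x = 6: {predict(6):.2f}")
```

The fitted line can be used for prediction (as in `predict`) or for explanation, by reading the slope as the change in y per unit change in x.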
Time Series Analysis
• To develop a statistical model that can be used to
forecast a future occurrence based on past
observations.
– Smoothing a time series
– Modeling for forecasting
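Smoothing can be illustrated with a simple moving average, one common smoothing technique chosen here as an example. The function name `moving_average` and the series values are hypothetical.

```python
def moving_average(series, window):
    """Smooth a series by averaging each run of `window` consecutive values."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

# Hypothetical past observations, one per period.
sales = [10, 12, 11, 15, 14, 18, 17]
smoothed = moving_average(sales, window=3)
print(smoothed)
```

Each smoothed value averages three neighbouring observations, which damps short-term fluctuations and makes the underlying trend easier to see.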
References
Descriptive Statistics
Data Discovery (1)
Data Discovery (2)
Data Discovery (3)
Data Discovery/Descriptive Statistics: Main Tools
1. FREQUENCY Distribution/Table
2. Measure of CENTRALITY of a data set
3. Measure of SPREAD of a data set
Frequency Distribution/Table (1)
Frequency Distribution/Table (2)
• Summary table in which the data are arranged into
conveniently established, numerically ordered class groupings
or categories
• Example: Frequency Table of 92 students
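A frequency table of this kind can be sketched as follows. The weights, the starting point, and the class width are hypothetical choices, not the data of the 92 students, and `frequency_table` is an illustrative helper.

```python
from collections import Counter

def frequency_table(values, start, width):
    """Return (lower, upper, count) rows for classes [lower, upper)."""
    if min(values) < start:
        raise ValueError("start must not exceed the smallest value")
    # Index of the class each value falls into.
    counts = Counter(int((v - start) // width) for v in values)
    return [
        (start + k * width, start + (k + 1) * width, counts.get(k, 0))
        for k in range(max(counts) + 1)
    ]

# Hypothetical weights in pounds.
weights = [112, 118, 125, 131, 140, 142, 150, 155, 155, 163, 170, 190]
table = frequency_table(weights, start=110, width=20)
for lower, upper, count in table:
    print(f"{lower} but less than {upper}: {count}")
```

The classes are numerically ordered and non-overlapping, so every observation is counted exactly once.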
Histogram of weights of 92 students
Guidelines for forming the class interval
Data Presentation
1. Frequency Distribution
2. Cumulative Distribution
3. Histogram
4. Percentage Polygon
5. Cumulative Polygon (Ogive)
Cumulative Distribution
• Another useful method of data presentation, formed from
the frequency distribution, that facilitates analysis and
interpretation
1-Year Total         Number of Funds                 Percentage of Funds
Percentage Return    "Less Than" indicated value     "Less Than" indicated value
20.0                  0                                0.0%
25.0                  2                                3.4%
30.0                 15                               25.4%
35.0                 39                               66.1%
40.0                 43                               72.9%
45.0                 54                               91.5%
50.0                 59                              100.0%
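The cumulative columns can be reproduced from class frequencies. In the sketch below the class counts are obtained by differencing the cumulative counts in the table (for example 15 - 2 = 13 funds between 25.0 and 30.0); the variable names are illustrative.

```python
from itertools import accumulate

# Upper class boundaries and class frequencies implied by the table.
upper_bounds = [25.0, 30.0, 35.0, 40.0, 45.0, 50.0]
class_counts = [2, 13, 24, 4, 11, 5]

# Running totals give the number of funds "less than" each boundary.
cumulative_counts = list(accumulate(class_counts))
total = cumulative_counts[-1]
cumulative_pct = [round(100 * c / total, 1) for c in cumulative_counts]

for bound, count, pct in zip(upper_bounds, cumulative_counts, cumulative_pct):
    print(f"less than {bound}: {count} funds ({pct}%)")
```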
Histogram
• Vertical bar chart in which the rectangular bars
are constructed at the boundaries of each class
[Figure: histogram of 1-Year Total percentage returns; vertical axis: Frequency]
Percentage Polygon
• Formed by having the midpoint of each class
represent the data in that class and then
connecting the sequence of midpoints at their
respective class percentages
[Figure: percentage polygon of 1-Year Percentage Return; vertical axis: Percentage, 0.0% to 45.0%]
Cumulative Polygon
• Graphic representation of a cumulative
distribution table
[Figure: cumulative polygon of 1-Year Percentage Return; vertical axis: Percent, 0.0% to 120.0%]
Summary Table
• Similar in format to the frequency distribution
table for numerical data
Fee Schedule                Number of Funds    Percentage of Funds
1 Fees from fund assets      17                  8.8%
2 Deferred fees               5                  2.6%
3 Front-load fees            19                  9.8%
4 Multiple fees              46                 23.7%
5 No-load funds             107                 55.2%
Total                       194                100.0%
Bar Chart
• Graphical display to visually express categorical
data from a summary table
[Figure: horizontal bar chart of the number of funds in each fee schedule category; horizontal axis: 0 to 120]
Pie Chart
• Graphical display to visually express categorical
data from a summary table
[Figure: pie chart of the fee schedule categories: 1 Fees from fund assets, 2 Deferred fees, 3 Front-load fees, 4 Multiple fees, 5 No-load funds]
Pareto Diagram
• Special type of vertical bar chart in which the
categorized responses are plotted in the descending
rank order of their frequencies and combined with a
cumulative polygon on the same scale
[Figure: Pareto diagram of the fee schedule categories in descending order of frequency (5 No-load funds, 4 Multiple fees, 3 Front-load fees, 1 Fees from fund assets, 2 Deferred fees); left axis: 0% to 60%, right (cumulative) axis: 0% to 100%]
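The ordering and the cumulative line of a Pareto diagram can be computed directly from the fee schedule counts of the summary table. This is a sketch of the calculation only; it does not draw the chart, and the variable names are illustrative.

```python
# Counts taken from the fee schedule summary table.
fee_counts = {
    "Fees from fund assets": 17,
    "Deferred fees": 5,
    "Front-load fees": 19,
    "Multiple fees": 46,
    "No-load funds": 107,
}

total = sum(fee_counts.values())

# Rank categories in descending order of frequency.
ranked = sorted(fee_counts.items(), key=lambda item: item[1], reverse=True)

pareto_rows = []
running = 0
for name, count in ranked:
    running += count
    # (category, percentage of funds, cumulative percentage)
    pareto_rows.append(
        (name, round(100 * count / total, 1), round(100 * running / total, 1))
    )

for name, pct, cum_pct in pareto_rows:
    print(f"{name:<22} {pct:>5}% {cum_pct:>6}%")
```

The bars use the individual percentages and the cumulative polygon uses the running percentages, which end at 100%.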
Contingency Table
• Two-way table of cross-classification
                  Fee Schedule
Fund Objective    Fund Asset   Deferred Fees   Front Load   Multiple Fees   No Load   Total
Growth             4            0               7           16               32        59
Blend             13            5              12           30               75       135
Total             17            5              19           46              107       194
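The row and column totals of the contingency table follow from its cells. The sketch below recomputes them from the Growth and Blend rows; the dictionary layout is an illustrative choice.

```python
fee_schedules = ["Fund Asset", "Deferred Fees", "Front Load", "Multiple Fees", "No Load"]

# Cell counts from the contingency table: one row per fund objective.
cells = {
    "Growth": [4, 0, 7, 16, 32],
    "Blend": [13, 5, 12, 30, 75],
}

row_totals = {objective: sum(counts) for objective, counts in cells.items()}
column_totals = [sum(column) for column in zip(*cells.values())]
grand_total = sum(row_totals.values())

print(row_totals)
print(dict(zip(fee_schedules, column_totals)))
print(grand_total)
```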
Side-by-Side Chart
• Useful way to visually display bivariate categorical
data
[Figure: side-by-side bar chart of the number of funds in each fee schedule category, with separate bars for 1 Growth and 2 Blend funds; horizontal axis: 0 to 80]
Graph of the causes of death of British soldiers during the
Crimean War, 1854-1855, by nurse Florence Nightingale.
• Blue: deaths in hospital due to illness after injury
• Red: deaths due to injury
• Black: deaths due to other causes
Descriptive Summary Measures
• Major Features
– Central Tendency
– Variation
– Shape
• Exploratory Data Analysis
– Five-Number Summary
– Box-and-Whisker Plot
• Covariance and coefficient of correlation
Summary Measures
Describing Data Numerically
[Figure: diagram of summary measures; recoverable entries: Mode, Variance]
Central Tendency
• Most sets of data show a distinct tendency to
group or cluster about a certain central point.
• Thus, for any particular set of data, it usually
becomes possible to select some typical value
to describe the entire set.
• Such a descriptive typical value is a measure of
central tendency.
Variation
• Variation is the amount of dispersion, or spread.
Shape
• Shape is the manner in which the data are
distributed.
• Either the distribution of the data is symmetrical
or not.
Center and Spread
Measure of Central Tendency:
Mean (Arithmetic Mean)
• Most commonly used
Ref: Harmonic Mean
• For instance, if a vehicle travels a certain distance at a speed
x (e.g. 80 kilometres per hour) and then the same distance
again at a speed y (e.g. 20 kilometres per hour),
what is the average speed?
• The average speed is the harmonic mean of x and y (32
km/h).
• The harmonic mean H of the positive real numbers x1, x2, ...,
xn is defined to be
$H = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}}$
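The average-speed example can be checked numerically. The sketch computes the harmonic mean of 80 and 20 both from the reciprocal-sum definition and with the standard library function `statistics.harmonic_mean`.

```python
from statistics import harmonic_mean

speeds = [80, 20]  # km/h over two stretches of equal distance

# Definition: n divided by the sum of the reciprocals.
by_definition = len(speeds) / sum(1 / s for s in speeds)

# Same quantity from the standard library.
by_library = harmonic_mean(speeds)

print(by_definition, by_library)
```

Both values come out at 32 km/h, well below the arithmetic mean of 50 km/h, because more time is spent travelling at the slower speed.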
Ref: Geometric Mean
• GDP at t=1 increased 8% and GDP at t=2
increased 2%. What is the average growth
rate?
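• $\sqrt{1.08 \times 1.02} \approx 1.0496$, i.e. the average growth rate is about 4.96% per period (the geometric mean of the growth factors)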
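The GDP example amounts to taking the geometric mean of the two growth factors. A short check, using `statistics.geometric_mean` from the standard library:

```python
from statistics import geometric_mean

growth_factors = [1.08, 1.02]  # +8% at t=1, +2% at t=2

average_factor = geometric_mean(growth_factors)
average_growth_pct = (average_factor - 1) * 100

print(f"average growth factor: {average_factor:.4f}")
print(f"average growth rate: {average_growth_pct:.2f}%")
```

Growing at this average rate for two periods gives the same cumulative growth as 8% followed by 2%.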
Measure of Central Tendency:
Median (1)
• Value such that 50% of the observations are
smaller and 50% of the observations are larger
$\text{Median} = \frac{n+1}{2}\text{th ranked observation}$
• If n is odd, take the middle-ranked value.
If n is even, take the average of the two
middle-ranked values.
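The odd/even rule can be written out as a small function. The name `median_value` and the example data are illustrative; the standard library's `statistics.median` applies the same rule.

```python
def median_value(values):
    """Median by the rank rule: middle value, or mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    middle = n // 2
    if n % 2 == 1:
        # n odd: the (n + 1) / 2 ranked observation.
        return ordered[middle]
    # n even: average of the two middle-ranked observations.
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median_value([7, 1, 9, 3, 5]))  # odd number of observations
print(median_value([1, 3, 5, 7]))     # even number of observations
```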
Measure of Central Tendency:
Median (2)
• Mean: affected by extreme values (outliers)
• Median: not affected by outliers
• Robust measure of central tendency
Data set 1: 1 3 5 7 9 (Mean = 5, Median = 5)
Data set 2: 1 3 5 7 14 (Mean = 6, Median = 5)
Measure of Central Tendency:
Mode (3)
• Value that occurs most often
• Possibilities:
– Data set has no mode
– Data set has one mode (unimodal)
– Data set has many modes
• Not affected by extreme values
• Used for either numerical or categorical data
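The three possibilities can be sketched with a small helper. The function `modes` is illustrative, and treating a data set in which no value repeats as having no mode is a convention assumed here.

```python
from collections import Counter

def modes(values):
    """Return the list of modes; empty if no value occurs more than once."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # assumed convention: no repeated value means no mode
    return [value for value, count in counts.items() if count == highest]

print(modes([1, 2, 3, 4]))             # no mode
print(modes([1, 2, 2, 3]))             # one mode (unimodal)
print(modes([1, 1, 2, 2, 3]))          # several modes
print(modes(["red", "blue", "red"]))   # works for categorical data too
```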
Measure of Spread (1)
• Full description of a data set includes:
– Frequency distribution and central value
– The level of spread from the central value, typically
the mean.
Measure of Spread (2)
• Two populations may have the same mean but different levels of
spread.
• Uses of the measure of spread:
1. Evaluate the representativeness of the central value;
2. Choose appropriate ways to treat the populations. For example, two
provinces may have the same poverty rate, but if province A has
relatively equal poverty rates across its districts while province B
has higher poverty rates in some districts, then different policies
would be applied to the districts with high and low poverty rates;
3. Describe characteristics of the population, for example financial
situation, income, wages, living standards;
4. Provide information for quality checks of products, for example do
not accept a set of products with too wide a spread.
Measure of Spread:
Quartiles
• First Quartile, Q1 (Third Quartile, Q3)
– Value such that 25% (75%) of the observations are
smaller and 75% (25%) of the observations are
larger
$Q_1 = \frac{n+1}{4}\text{th ordered observation}$
$Q_3 = \frac{3(n+1)}{4}\text{th ordered observation}$
Refer to some rules of calculation at page 126 of the textbook.
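One way to apply the (n+1)/4 and 3(n+1)/4 positions is the "exclusive" method of `statistics.quantiles`, which interpolates linearly when a position falls between two ranks. The interpolation is an assumption of this sketch; the textbook's rounding rules may give slightly different values.

```python
from statistics import quantiles

data = [1, 3, 5, 7, 9]  # n = 5, already ordered

# Positions: Q1 at (5 + 1) / 4 = 1.5, Q3 at 3 * (5 + 1) / 4 = 4.5.
q1, q2, q3 = quantiles(data, n=4, method="exclusive")

print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}")
```

Position 1.5 lies halfway between the first and second ordered observations (1 and 3), so Q1 is 2.0; likewise Q3 is halfway between 7 and 9.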
Quartiles: Interpretation
[Figure: relative position of the observations from smallest to largest along the Value axis, with Q1, M (median), and Q3 marked]
Measures of Variation
• Range
• Interquartile Range
• Variance
• Standard Deviation
• Coefficient of Variation
Range
• Difference between the largest and the
smallest observations:
[Figure: two data sets plotted on axes from 7 to 12; for both, Range = 12 - 7 = 5]
Interquartile Range
• Difference between Q3 and Q1 (Q3 - Q1)
• Measures the variation of the middle 50% of
the data
• Not affected by extreme values
(Sample) Variance
• The sum of the squared differences around the
arithmetic mean divided by the sample size minus 1
$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$
where
$\bar{X}$ = arithmetic mean
$n$ = sample size
$X_i$ = ith observation of the random variable X
Note: Sample variance
vs. population variance
$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (X_i - \mu)^2$
where
$\mu$ = population mean
$N$ = population size
$X_i$ = ith value of the random variable X
Variance
1 3 5 7 9
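For the data set 1 3 5 7 9 the calculation can be traced step by step. The sketch treats the five values first as a sample (dividing by n - 1) and then, for comparison, as a whole population (dividing by N).

```python
from statistics import mean

data = [1, 3, 5, 7, 9]

x_bar = mean(data)                                       # 5
squared_deviations = sum((x - x_bar) ** 2 for x in data)  # 16 + 4 + 0 + 4 + 16 = 40

sample_variance = squared_deviations / (len(data) - 1)   # 40 / 4
population_variance = squared_deviations / len(data)     # 40 / 5
sample_std_dev = sample_variance ** 0.5

print(sample_variance, population_variance, round(sample_std_dev, 3))
```

The sample variance (10) is larger than the population variance (8) because it divides the same sum of squares by n - 1 instead of N.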
Standard Deviation
• Square root of the sample variance
$S = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2}$
where
$\bar{X}$ = arithmetic mean
$n$ = sample size
$X_i$ = ith observation of the random variable X
Coefficient of Variation
• Standard deviation divided by the arithmetic
mean, multiplied by 100%
$CV = \frac{S}{\bar{X}} \cdot 100\%$
where
$S$ = standard deviation in a set of numerical data
$\bar{X}$ = arithmetic mean in a set of numerical data
Comparing Coefficient
of Variation
• Stock A:
– Average price last year = $50
– Standard deviation = $5
$CV_A = \frac{S}{\bar{X}} \cdot 100\% = \frac{\$5}{\$50} \cdot 100\% = 10\%$
• Stock B:
– Average price last year = $100
– Standard deviation = $5
$CV_B = \frac{S}{\bar{X}} \cdot 100\% = \frac{\$5}{\$100} \cdot 100\% = 5\%$
• Both stocks have the same standard deviation, but stock B is less
variable relative to its price
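The stock comparison reduces to one division per stock. The helper name `coefficient_of_variation` is illustrative.

```python
def coefficient_of_variation(std_dev, mean_value):
    """Standard deviation as a percentage of the arithmetic mean."""
    if mean_value == 0:
        raise ValueError("the coefficient of variation is undefined for a zero mean")
    return 100 * std_dev / mean_value

cv_a = coefficient_of_variation(std_dev=5, mean_value=50)    # stock A
cv_b = coefficient_of_variation(std_dev=5, mean_value=100)   # stock B

print(f"CV_A = {cv_a}%, CV_B = {cv_b}%")
```

Because the coefficient of variation is a ratio, it allows data sets with different means, or even different units, to be compared on a common scale.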
Z Scores
Example:
• If the mean is 14.0 and the standard deviation is 3.0,
what is the Z score for the value 18.5?
$Z = \frac{X - \bar{X}}{S} = \frac{18.5 - 14.0}{3.0} = 1.5$
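The same Z score calculation, written as a small illustrative function:

```python
def z_score(value, mean_value, std_dev):
    """Number of standard deviations that a value lies from the mean."""
    if std_dev <= 0:
        raise ValueError("the standard deviation must be positive")
    return (value - mean_value) / std_dev

z = z_score(18.5, mean_value=14.0, std_dev=3.0)
print(z)
```

A positive Z score means the value lies above the mean; here 18.5 is one and a half standard deviations above 14.0.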
Measurement of Shape
• Compare the mean and the median.
Summary Statistics with Excel
[Figure: Excel summary statistics output for the weight variable; recoverable entries: Maximum 215, Sum 13354]
Exploratory Data Analysis
using Box-and-Whisker Plots
http://www.physics.csbsju.edu/stats/box2.html
Shape & Box-and-Whiskers Plot
[Figure: three box-and-whisker plots, each marked with Q1 and Q3, illustrating different distribution shapes]
Comparisons through Box Plots
The Sample Covariance
• The sample covariance measures the strength of
the linear relationship between two variables
(called bivariate data)
$\text{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$
• Only concerned with the strength of the relationship
– No causal effect is implied
Interpreting Covariance
Coefficient of Correlation
• Measures the relative strength of the linear
relationship between two variables
• Sample coefficient of correlation:
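$r = \frac{\text{cov}(X, Y)}{S_X S_Y}$
where
$\text{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$
$S_X = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$
$S_Y = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}}$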
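The sample covariance and the coefficient of correlation can be computed directly from their definitions. The function names and the data below are hypothetical.

```python
from math import sqrt
from statistics import mean

def sample_covariance(xs, ys):
    """Sum of cross-products of deviations, divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long series with at least 2 values")
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)

def sample_correlation(xs, ys):
    """r = cov(X, Y) / (S_X * S_Y)."""
    s_x = sqrt(sample_covariance(xs, xs))  # cov(X, X) is the sample variance of X
    s_y = sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)

# Hypothetical bivariate data.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]    # perfect positive linear relationship
y_down = [10, 8, 6, 4, 2]  # perfect negative linear relationship

print(sample_covariance(x, y_up))
print(sample_correlation(x, y_up), sample_correlation(x, y_down))
```

The covariance depends on the units of X and Y, whereas dividing by both standard deviations makes r unit free.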
Features of
Correlation Coefficient, r
• Unit free
• Ranges between –1 and 1
• The closer to –1, the stronger the negative linear
relationship
• The closer to 1, the stronger the positive linear
relationship
• The closer to 0, the weaker the linear relationship
• Careful! Correlation is not causation!
Discussion
Scatter Plots of Data with Various
Correlation Coefficients
[Figure: six scatter plots of Y against X, with r = -1, r = -.6, and r = 0 in the top row and r = +1, r = +.3, and r = 0 in the bottom row]
Using Excel to Find
the Correlation Coefficient
• Select
Data - Data Analysis
• Choose Correlation
from the selection
menu
• Click OK . . .
Using Excel to Find
the Correlation Coefficient
(continued)
• There is a relatively [...] relationship between test score #1 [...]
[Figure: scatter plot of Test #2 Score against Test #1 Score; both axes run from 70 to 100]
• Example:
– X= number of coins showing heads if we flip 6
coins
– The table of probabilities for each outcome of
flipping 6 coins looks like below:
Discrete Probability Distribution (1)
• DPD consists of all possible values of the
random variable with their associated
probabilities
Discrete Probability Distribution (1)
Discussion
Discrete Probability Distribution (2)
Expected Value: a central value of RV
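For the six-coin example, the probability table and the expected value can be generated as follows. The sketch assumes fair, independent coins, so that P(X = k) is the number of ways to choose k heads out of 6 divided by 2^6.

```python
from math import comb

n_coins = 6

# P(X = k) for k = 0..6 heads, assuming fair and independent coins.
distribution = {k: comb(n_coins, k) / 2 ** n_coins for k in range(n_coins + 1)}

# Expected value: each possible value weighted by its probability.
expected_value = sum(k * p for k, p in distribution.items())

for k, p in distribution.items():
    print(f"P(X = {k}) = {comb(n_coins, k)}/64 = {p:.4f}")
print(f"E(X) = {expected_value}")
```

The probabilities cover all possible values of X and sum to 1, and the expected value of 3 heads sits at the centre of the symmetric distribution.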