
1 - 3 - 4 - Class1 - Descriptive Statistics - 4slines - 1trang

The document provides an introduction to statistics including definitions of key terms like population, sample, parameter and statistic. It also outlines some common statistical processes like descriptive statistics, inferential statistics and hypothesis testing. Visual examples are provided to illustrate concepts like population and sample.

Uploaded by

kydang26092004
Copyright
© © All Rights Reserved

INTRODUCTION TO STATISTICS

What is STATISTICS? (4)


• According to “Introduction to the Practice of Statistics” by Moore, McCabe and Craig: “Statistics is the science of data”


What is STATISTICS? (5)
• According to Prof. Smith:
What is STATISTICS? (6)
Hawkes Learning System:

Statistics is a branch of mathematics. It provides information so that informed decisions can be made.
Some Important Definitions (1)
• Population vs. Sample
• Population:
– Totality of things under consideration
– The population consists of all persons or things being studied
• Sample:
– Portion of the population selected for analysis
Some Important Definitions (2)
• Parameter vs. Statistic
• Parameter:
– Summary measure (numerical) to describe a characteristic of the population
• Statistic:
– Summary measure to describe a characteristic of the sample
Visualization of Population and Sample (1)

The sample is a subset of the population.
Visualization of Population and Sample (2)

[Diagram: Samples a and b are drawn from the population by sampling. Statistic a and statistic b are computed from the samples and used to estimate the population parameter.]
Visualization of Population and Sample (3)
Discussion
Statistical Processes

• Descriptive Statistics
• Inferential Statistics
• Hypothesis Testing
• Statistical models

Descriptive Statistics
• Collect Data
– Survey
• Present Data
– Tables
– Graphs
• Characterize Data
– Mean
– Standard Deviation

Inferential Statistics
• Estimation
– For example, obtain a confidence interval estimate
of the population mean using the sample data
• Hypothesis Testing
– For example, test the claim that the population
mean weight is 120 pounds

Estimation
• Point Estimate
– Single sample statistic that is used to estimate the
true value of a population parameter
• Interval Estimate
– Interval with a specified confidence or probability
of correctly estimating the true value of the
population parameter
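The two kinds of estimate can be illustrated with a minimal Python sketch. The weights are hypothetical, the function name is illustrative, and the interval uses a normal (z) critical value as a simplifying assumption; for a small sample a t critical value would normally be used instead.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Return (point estimate, (lower, upper)) for the population mean."""
    n = len(sample)
    point_estimate = mean(sample)                   # point estimate: sample mean
    standard_error = stdev(sample) / sqrt(n)        # S / sqrt(n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * standard_error
    return point_estimate, (point_estimate - margin, point_estimate + margin)


# Hypothetical sample of weights in pounds (not taken from the slides).
weights = [112, 118, 121, 125, 119, 123, 116, 127, 120, 122]
point, (lower, upper) = mean_confidence_interval(weights)
print(f"point estimate: {point:.1f}, 95% interval: ({lower:.1f}, {upper:.1f})")
```

A higher confidence level gives a wider interval for the same sample.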

Hypothesis Testing
• Make inferences about a population
parameter by analyzing differences between
the results we observe (our sample statistic)
and the results we expect to obtain if some
underlying hypothesis is actually true.

Statistical models
• Regression Analysis
– Simple Linear Regression Analysis
– Multiple Regression Analysis
• Time Series Analysis

Regression Analysis
• Used for the purpose of prediction or
explanation.
• To develop a statistical model that can be used
to predict or explain the value of a dependent
(response) variable based on the values of at
least one independent (explanatory) variable.
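As a rough illustration of simple linear regression, the sketch below fits a least-squares line with one explanatory variable. The data and the names `fit_line` and `predict` are hypothetical and chosen only for illustration.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def predict(intercept, slope, x):
    """Predicted value of the dependent (response) variable at x."""
    return intercept + slope * x


# Hypothetical data lying exactly on the line y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
intercept, slope = fit_line(xs, ys)
print(intercept, slope, predict(intercept, slope, 6))
```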

Time Series Analysis
• To develop a statistical model that can be used to
forecast a future occurrence based on past
observations.
– Smoothing a time series
– Modeling for forecasting
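One common way to smooth a time series is a simple moving average. The sketch below assumes that technique (the slides do not name a specific smoother) and uses a hypothetical series; taking the last smoothed value as a forecast is only a naive illustration.

```python
def moving_average(series, window):
    """Average each run of `window` consecutive observations."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


# Hypothetical past observations.
series = [10, 12, 14, 13, 15, 17]
smoothed = moving_average(series, window=3)
naive_forecast = smoothed[-1]  # naive forecast: last smoothed value
print(smoothed, naive_forecast)
```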

References

• For preparing effective tables and charts, the UNECE guide “Making Data Meaningful Part 2: A guide to presenting statistics” is recommended.
http://www.unece.org/stats/documents/writing/
Descriptive Statistics

Data Discovery (1)
Data Discovery (2)
Data Discovery (3)
Data Discovery/Descriptive Statistics: Main
Tools

1. FREQUENCY Distribution/Table
2. Measure of CENTRALITY of a data set
3. Measure of SPREAD of a data set
Frequency Distribution/Table (1)
Frequency Distribution/Table (2)
• Summary table in which the data are arranged into
conveniently established, numerically ordered class groupings
or categories
• Example: Frequency Table of 92 students
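A frequency table of this kind can be built by counting observations in equal-width classes. The sketch below uses a small hypothetical set of weights (the 92 student weights are not reproduced in the text), and the class boundaries are an arbitrary choice.

```python
def frequency_table(values, lower, width, n_classes):
    """Count values in classes [lower + i*width, lower + (i+1)*width)."""
    counts = [0] * n_classes
    for v in values:
        index = int((v - lower) // width)
        if 0 <= index < n_classes:  # values outside all classes are ignored
            counts[index] += 1
    return [
        (lower + i * width, lower + (i + 1) * width, count)
        for i, count in enumerate(counts)
    ]


# Hypothetical weights in pounds.
weights = [95, 102, 110, 118, 125, 131, 138, 144,
           150, 155, 155, 162, 170, 185, 215]
table = frequency_table(weights, lower=90, width=30, n_classes=5)
for low, high, count in table:
    print(f"{low} but less than {high}: {count}")
```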
Histogram of weights of 92 students
Guidelines for forming the class interval
Data Presentation
1. Frequency Distribution
2. Cumulative Distribution
3. Histogram
4. Percentage Polygon
5. Cumulative Polygon (Ogive)

Cumulative Distribution
• Another useful method of data presentation that
facilitates analysis and interpretation formed from
the frequency distribution
1-Year Total          Number of Funds    Percentage of Funds
Percentage Return     "Less Than"        "Less Than"
                      Indicated Value    Indicated Value
20.0                   0                   0.0%
25.0                   2                   3.4%
30.0                  15                  25.4%
35.0                  39                  66.1%
40.0                  43                  72.9%
45.0                  54                  91.5%
50.0                  59                 100.0%
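The cumulative distribution follows from the frequency distribution by a running sum. In the sketch below the class frequencies are not given in the text; they are derived here as differences between consecutive "less than" counts in the table.

```python
from itertools import accumulate

# Upper class boundaries (percentage return) and class frequencies.
# The frequencies are derived from the table's cumulative counts.
upper_bounds = [25.0, 30.0, 35.0, 40.0, 45.0, 50.0]
frequencies = [2, 13, 24, 4, 11, 5]

cumulative_counts = list(accumulate(frequencies))
total = cumulative_counts[-1]
cumulative_percentages = [round(100 * c / total, 1) for c in cumulative_counts]

for bound, count, pct in zip(upper_bounds, cumulative_counts,
                             cumulative_percentages):
    print(f"less than {bound}: {count} funds ({pct}%)")
```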
Histogram
• Vertical bar chart in which the rectangular bars
are constructed at the boundaries of each class
[Figure: Histogram of 1-year total percentage returns. x-axis: class boundaries from 20.0% to 50.0%; y-axis: frequency.]
Percentage Polygon
• Formed by having the midpoint of each class
represent the data in that class and then
connecting the sequence of mid points at their
respective class percentage
[Figure: Percentage polygon of 1-year percentage return. x-axis: return, 20.0% to 50.0%; y-axis: percentage, 0.0% to 45.0%.]
Cumulative Polygon
• Graphic representation of a cumulative
distribution table
[Figure: Cumulative polygon (ogive) of 1-year percentage return. x-axis: return, 20.0% to 50.0%; y-axis: cumulative percent.]
Summary Table
• Similar in format to the frequency distribution
table for numerical data
Fee Schedule                Number of Funds    Percentage of Funds
1 Fees from fund assets      17                  8.8%
2 Deferred fees               5                  2.6%
3 Front-load fees            19                  9.8%
4 Multiple fees              46                 23.7%
5 No-load funds             107                 55.2%
Total                       194                100.0%
Bar Chart
• Graphical display to visually express categorical
data from a summary table

[Figure: Horizontal bar chart of the number of funds by fee schedule (1 Fees from fund assets, 2 Deferred fees, 3 Front-load fees, 4 Multiple fees, 5 No-load funds). x-axis: number of funds, 0 to 120.]
Pie Chart
• Graphical display to visually express categorical
data from a summary table
[Figure: Pie chart of funds by fee schedule: 1 Fees from fund assets, 2 Deferred fees, 3 Front-load fees, 4 Multiple fees, 5 No-load funds.]
Pareto Diagram
• Special type of vertical bar chart in which the
categorized responses are plotted in the descending
rank order of their frequencies and combined with a
cumulative polygon on the same scale

[Figure: Pareto diagram of fee schedules in descending order of frequency: 5 No-load funds, 4 Multiple fees, 3 Front-load fees, 1 Fees from fund assets, 2 Deferred fees. Left axis: percentage, 0% to 60%; right axis: cumulative percentage, 0% to 100%.]
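The ordering and cumulative line of a Pareto diagram can be computed directly from the summary-table counts. This is a minimal sketch of the data preparation only (no plotting); the variable names are illustrative.

```python
# Counts from the fee-schedule summary table.
funds = {
    "Fees from fund assets": 17,
    "Deferred fees": 5,
    "Front-load fees": 19,
    "Multiple fees": 46,
    "No-load funds": 107,
}

total = sum(funds.values())
# Descending rank order of frequency, as a Pareto diagram requires.
ordered = sorted(funds.items(), key=lambda item: item[1], reverse=True)

pareto = []
running = 0
for name, count in ordered:
    running += count
    pareto.append((name, 100 * count / total, 100 * running / total))

for name, pct, cum_pct in pareto:
    print(f"{name}: {pct:.1f}% (cumulative {cum_pct:.1f}%)")
```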

Contingency Table
• Two-way table of cross-classification

                   Fee Schedule
Fund Objective     Fund Asset   Deferred Fees   Front Load   Multiple Fees   No Load   Total
Growth              4            0               7           16               32        59
Blend              13            5              12           30               75       135
Total              17            5              19           46              107       194
Side-by-Side Chart
• Useful way to visually display bivariate categorical
data
[Figure: Side-by-side bar chart of the number of funds by fee schedule (1 Fees from fund assets through 5 No-load funds), with separate bars for fund objectives 1 Growth and 2 Blend. x-axis: number of funds, 0 to 80.]
Graph of the causes of death of British soldiers during the Crimean War, 1854-1855, by nurse Florence Nightingale.
• Blue: deaths in hospital due to illness after injury
• Red: deaths due to injury
• Black: deaths due to other causes
Descriptive Summary Measures
• Major Features
– Central Tendency
– Variation
– Shape
• Exploratory Data Analysis
– Five-Number Summary
– Box-and-Whisker Plot
• Covariance and coefficient of correlation

Summary Measures
Describing Data Numerically

• Central Tendency
– Arithmetic Mean
– Median
– Mode
– Harmonic Mean
– Geometric Mean
• Quartiles
• Variation
– Range
– Interquartile Range
– Variance
– Standard Deviation
– Coefficient of Variation
• Shape
– Skewness
Central Tendency
• Most sets of data show a distinct tendency to
group or cluster about a certain central point.
• Thus, for any particular set of data, it usually
becomes possible to select some typical value
to describe the entire set.
• Such a descriptive typical value is a measure of
central tendency.

Variation
• Variation is the amount of dispersion, or spread.

Shape
• Shape is the manner in which the data are
distributed.
• Either the distribution of the data is symmetrical
or not.

Center and Spread

Observe the left-hand-side figure:
- The centers of the two data sets lie at about the middle
- The upper data set has a wider spread
Measures of Central Tendency

• Mean (Arithmetic Mean)


• Median

Measure of Central Tendency:
Mean (Arithmetic Mean)
• Most commonly used
Ref: Harmonic Mean
• For instance, if a vehicle travels a certain distance at a speed
x (e.g. 80 kilometres per hour) and then the same distance
again at a speed y (e.g. 20 kilometres per hour),
what is the average speed?
• The average speed is the harmonic mean of x and y (32
km/h).

• The harmonic mean H of the positive real numbers x1, x2, ..., xn is defined to be
$$H = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}}$$
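The speed example can be checked numerically. The sketch below uses the standard definition of the harmonic mean (n divided by the sum of reciprocals) and cross-checks it against total distance over total time; the 100 km leg length is an arbitrary assumption, since the result does not depend on it.

```python
def harmonic_mean(values):
    """n divided by the sum of reciprocals of positive values."""
    return len(values) / sum(1 / x for x in values)


average_speed = harmonic_mean([80, 20])

# Cross-check: same distance driven at 80 km/h and then at 20 km/h.
leg_km = 100                            # arbitrary leg length (assumption)
total_time_h = leg_km / 80 + leg_km / 20
check_speed = 2 * leg_km / total_time_h

print(average_speed, check_speed)
```

The arithmetic mean of the two speeds (50 km/h) would overstate the average, because more time is spent at the slower speed.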

Ref: Geometric Mean
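• GDP increased 8% at t=1 and 2% at t=2. What is the average growth rate?
$$\sqrt{1.08 \times 1.02} \approx 1.0496$$
The average growth rate is therefore about 4.96% per period.
• The geometric mean of a data set [a1, a2, ..., an] is given by
$$\left(\prod_{i=1}^{n} a_i\right)^{1/n} = \sqrt[n]{a_1 a_2 \cdots a_n}$$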
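The GDP example can be reproduced with a short sketch that applies the standard geometric mean (the n-th root of the product) to the growth factors 1.08 and 1.02. The function name is illustrative.

```python
from math import prod


def geometric_mean(values):
    """n-th root of the product of positive values."""
    return prod(values) ** (1 / len(values))


growth_factors = [1.08, 1.02]          # +8% and +2%
average_factor = geometric_mean(growth_factors)
average_growth_pct = (average_factor - 1) * 100

print(f"average growth factor: {average_factor:.4f}")
print(f"average growth rate: {average_growth_pct:.2f}%")
```

Compounding at the average factor for two periods gives the same total growth as 8% followed by 2%.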

Measure of Central Tendency:
Median (1)
• Value such that 50% of the observations are smaller and 50% of the observations are larger
$$\text{Median} = \frac{n+1}{2}\text{-th ranked observation}$$
• If n is odd, take the middle-ranked value.
If n is even, take the average of the two
middle-ranked values.

Measure of Central Tendency:
Median (2)
• Mean: affected by extreme values (outliers)
• Median: not affected by outliers
• Robust measure of central tendency

Data set 1: 1 3 5 7 9 (Mean = 5, Median = 5)
Data set 2: 1 3 5 7 14 (Mean = 6, Median = 5)
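The robustness of the median can be checked with Python's standard `statistics` module on the two data sets from the slide. The four-value list at the end is an added illustration of the even-n rule.

```python
from statistics import mean, median

without_outlier = [1, 3, 5, 7, 9]
with_outlier = [1, 3, 5, 7, 14]

# The extreme value pulls the mean up but leaves the median unchanged.
print(mean(without_outlier), median(without_outlier))
print(mean(with_outlier), median(with_outlier))

# For an even number of observations, the median is the average
# of the two middle-ranked values.
even_median = median([1, 3, 5, 7])
print(even_median)
```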

Measure of Central Tendency:
Mode (3)
• Value that occurs most often
• Possibilities:
– Data set has no mode
– Data set has one mode (unimodal)
– Data set has many modes
• Not affected by extreme values
• Used for either numerical or categorical data
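The three possibilities (no mode, one mode, many modes) can be sketched as follows. Treating a data set in which no value repeats as having no mode is a convention assumed here; the function name `modes` is illustrative.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s); [] if no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # assumed convention: no repeated value means no mode
    return [value for value, count in counts.items() if count == highest]


print(modes([1, 2, 3]))                # no mode
print(modes([1, 2, 2, 3]))             # one mode (unimodal)
print(modes([1, 1, 2, 2, 3]))          # many modes
print(modes(["red", "blue", "red"]))   # categorical data
```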

Measure of Spread (1)
• Full description of a data set includes:
– Frequency distribution and central value
– The level of spread from the central value, typically
the mean.
Measure of Spread (2)
• Two populations may have same mean, but different spread
level.
• Use of the measure of spread:
1. Evaluate the representativeness of the central value;
2. Choosing appropriate ways to treat the populations. For example, two provinces may have the same overall poverty rate, but if province A has relatively equal poverty rates across its districts while province B has much higher poverty rates in some districts, then different policies should be applied to the districts with high and low poverty rates;
3. Characteristics of the population, for example financial situation,
income, wages, living standards;
4. Information for quality check of products, for example do not accept
a set of products with too wide a spread.
Measure of Spread:
Quartiles
• First Quartile, Q1 (Third Quartile, Q3)
– Value such that 25% (75%) of the observations are smaller and 75% (25%) of the observations are larger
$$Q_1 = \frac{n+1}{4}\text{-th ordered observation}$$
$$Q_3 = \frac{3(n+1)}{4}\text{-th ordered observation}$$
Refer to some rules of calculation at page 126 of the textbook.
Quartiles: Interpretation
[Figure: Relative position of observations from smallest to largest value, with Q1, the median M, and Q3 marked along the value axis.]
Measures of Variation
• Range
• Interquartile Range
• Variance
• Standard Deviation
• Coefficient of Variation

Range
• Difference between the largest and the smallest observations:
$$\text{Range} = X_{\text{Largest}} - X_{\text{Smallest}}$$
• Ignores how data are distributed
[Figure: Two data sets plotted on the same 7 to 12 scale with different distributions; both have Range = 12 - 7 = 5.]
Interquartile Range
• Difference between Q1 and Q3
• Measures the variation of the middle 50% of
the data
• Not affected by extreme values

Data in Ordered Array: 11 12 13 16 16 17 17 18 21
$$\text{Interquartile Range} = Q_3 - Q_1 = 17.5 - 12.5 = 5$$
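The quartiles in this example can be reproduced with `statistics.quantiles`. Its "exclusive" method places the quartiles at positions (n+1)/4 and 3(n+1)/4 with linear interpolation, which matches the slide for this data set; the textbook's rounding rules may give different values for other sample sizes.

```python
from statistics import quantiles

data = [11, 12, 13, 16, 16, 17, 17, 18, 21]

# Cut points that split the ordered data into four parts.
q1, q2, q3 = quantiles(data, n=4, method="exclusive")
interquartile_range = q3 - q1

print(q1, q2, q3, interquartile_range)
```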

(Sample) Variance
• The sum of the squared differences around the arithmetic mean divided by the sample size minus 1
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$$
where
$\bar{X}$ = arithmetic mean
$n$ = sample size
$X_i$ = ith observation of the random variable X
Note: Sample variance
vs. population variance

• The denominator of the sample variance is (n-1), whereas that of the population variance is N.
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(X_i - \mu)^2$$
where
$\mu$ = population mean
$N$ = population size
$X_i$ = ith value of the random variable X

• The same distinction applies to the standard deviation and the coefficient of variation
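The difference between the two denominators can be seen in a short sketch. The data set 1, 3, 5, 7, 9 is reused from the slides; the function names are illustrative.

```python
from math import sqrt


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


data = [1, 3, 5, 7, 9]
s_squared = sample_variance(data)
sigma_squared = population_variance(data)
s = sqrt(s_squared)  # sample standard deviation

print(s_squared, sigma_squared, s)
```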

Variance

• Shows variation of data set from its mean


• Basis of measure of variation:
$$(X_i - \text{mean})^2$$
Example data: 1 3 5 7 9
Standard Deviation
• Square root of the sample variance
$$S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2}$$
where
$\bar{X}$ = arithmetic mean
$n$ = sample size
$X_i$ = ith observation of the random variable X
Coefficient of Variation
• Standard deviation divided by the arithmetic mean, multiplied by 100%
$$CV = \left(\frac{S}{\bar{X}}\right) \times 100\%$$
where
$S$ = standard deviation in a set of numerical data
$\bar{X}$ = arithmetic mean in a set of numerical data
Comparing Coefficient
of Variation
• Stock A:
– Average price last year = $50
– Standard deviation = $5
$$CV_A = \left(\frac{S}{\bar{X}}\right) \times 100\% = \frac{\$5}{\$50} \times 100\% = 10\%$$
• Stock B:
– Average price last year = $100
– Standard deviation = $5
$$CV_B = \left(\frac{S}{\bar{X}}\right) \times 100\% = \frac{\$5}{\$100} \times 100\% = 5\%$$
• Both stocks have the same standard deviation, but stock B is less variable relative to its price
Z Scores

• A measure of distance from the mean (for example, a Z-score of 2.0 means that a value is 2.0 standard deviations from the mean)
• The difference between a value and the mean, divided by the standard deviation
• A Z score above 3.0 or below -3.0 is considered an outlier
$$Z = \frac{X - \bar{X}}{S}$$
Z Scores
(continued)

Example:
• If the mean is 14.0 and the standard deviation is 3.0, what is the Z score for the value 18.5?
$$Z = \frac{X - \bar{X}}{S} = \frac{18.5 - 14.0}{3.0} = 1.5$$
• The value 18.5 is 1.5 standard deviations above the mean
• (A negative Z-score would mean that a value is less than the mean)
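The worked example and the outlier rule of thumb can be written as two small functions. The names and the default threshold parameter are illustrative; the threshold of 3.0 comes from the slide.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations a value lies from the mean."""
    return (value - mean) / std_dev


def is_outlier(z, threshold=3.0):
    """Rule of thumb from the slide: |Z| above 3.0 marks an outlier."""
    return abs(z) > threshold


z = z_score(18.5, mean=14.0, std_dev=3.0)
print(z, is_outlier(z))

# A value below the mean gives a negative Z score.
z_low = z_score(9.5, mean=14.0, std_dev=3.0)
print(z_low)
```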
Shape of a Distribution
• Describes how data are distributed
• Measures of shape
– Symmetric or skewed

Left-Skewed: Mean < Median
Symmetric: Mean = Median
Right-Skewed: Median < Mean
Measurement of Shape
• Compare the mean and the median.
Right-skewness: mean > median
Symmetry: mean = median
Left-skewness: mean < median
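The mean-versus-median comparison can be sketched as a small classifier. This is only the rule of thumb from the slide: equal mean and median suggest symmetry but do not prove it. The function name, the tolerance, and the third data set are assumptions made for illustration.

```python
from statistics import mean, median


def describe_shape(values, tolerance=1e-9):
    """Classify shape by comparing the mean with the median."""
    difference = mean(values) - median(values)
    if abs(difference) <= tolerance:
        return "symmetric"
    return "right-skewed" if difference > 0 else "left-skewed"


print(describe_shape([1, 3, 5, 7, 9]))    # mean 5, median 5
print(describe_shape([1, 3, 5, 7, 14]))   # mean 6, median 5
print(describe_shape([-4, 3, 5, 7, 9]))   # mean 4, median 5
```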

Summary Statistics with Excel
• Select
– Data
– Data Analysis
– Descriptive Statistics
• Enter the data and options in the Dialog Box.
• Click the OK button.

Output for the variable weight:
Mean                  145.1522
Standard Error        2.475003
Median                145
Mode                  155
Standard Deviation    23.7394
Sample Variance       563.559
Kurtosis              -0.06603
Skewness              0.369982
Range                 120
Minimum               95
Maximum               215
Sum                   13354
Count                 92
Largest(1)            215
Smallest(1)           95
Exploratory Data Analysis
using Box-and-Whisker Plots

http://www.physics.csbsju.edu/stats/box2.html
Exploratory Data Analysis
using Box-and-Whisker Plots

Exploratory Data Analysis
using Box-and-Whisker Plots

Shape & Box-and-Whiskers Plot

[Figure: Box-and-whisker plots for left-skewed, symmetric, and right-skewed distributions, each with Q1 and Q3 marked.]
Comparisons through Box Plots
The Sample Covariance
• The sample covariance measures the strength of
the linear relationship between two variables
(called bivariate data)

• The sample covariance:
$$\text{cov}(X, Y) = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$$
• Only concerned with the strength of the relationship
– No causal effect is implied
Interpreting Covariance

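• Covariance between two random variables:
cov(X,Y) > 0: X and Y tend to move in the same direction
cov(X,Y) < 0: X and Y tend to move in opposite directions
cov(X,Y) = 0: X and Y have no linear relationship (independent variables have zero covariance, but zero covariance alone does not imply independence)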
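A small counterexample helps with the zero-covariance case: the hypothetical data below satisfy Y = X² exactly, so Y is completely determined by X, yet the sample covariance is zero because the relationship is not linear. The function name is illustrative.

```python
def sample_covariance(xs, ys):
    """Sum of cross-products of deviations, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical data: Y depends entirely on X, but not linearly.
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]
cov_nonlinear = sample_covariance(xs, ys)

# For contrast, a variable always has positive covariance with itself.
cov_same_direction = sample_covariance(xs, xs)

print(cov_nonlinear, cov_same_direction)
```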

Coefficient of Correlation
• Measures the relative strength of the linear
relationship between two variables
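• Sample coefficient of correlation:
$$r = \frac{\text{cov}(X, Y)}{S_X S_Y}$$
where
$$\text{cov}(X, Y) = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$$
$$S_X = \sqrt{\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}} \qquad S_Y = \sqrt{\frac{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}{n-1}}$$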
Features of
Correlation Coefficient, r

• Unit free
• Ranges between –1 and 1
• The closer to –1, the stronger the negative linear
relationship
• The closer to 1, the stronger the positive linear
relationship
• The closer to 0, the weaker the linear relationship
• Careful! Correlation is not causation!
Discussion
Scatter Plots of Data with Various
Correlation Coefficients
[Figure: Six scatter plots of Y against X with correlation coefficients r = -1, r = -.6, r = 0 (top row) and r = +1, r = +.3, r = 0 (bottom row).]
Using Excel to Find
the Correlation Coefficient
• Select
Data - Data Analysis
• Choose Correlation
from the selection
menu
• Click OK . . .

Using Excel to Find
the Correlation Coefficient
(continued)

• Input data range and select appropriate options
• Click OK to get output
Interpreting the Result
• r = .733
• There is a relatively strong positive linear relationship between test score #1 and test score #2
• Students who scored high on the first test tended to score high on the second test, and students who scored low on the first test tended to score low on the second test
[Figure: Scatter Plot of Test Scores. x-axis: Test #1 Score, 70 to 100; y-axis: Test #2 Score, 70 to 100.]
DISCRETE DISTRIBUTIONS
(Source: Hawkes Learning Systems)
Random Variable

• Random variable: a quantity which depends on chance
• Naming convention:
– Capital letters such as X refer to the random variable
– Small letters such as x refer to specific values of the random variable, subscripted x1, x2, x3, …, xn
• Quantitative random variables are divided into two classes: discrete and continuous
Discrete Random Variable (DRV) (1)
• Values: finitely many or countably infinitely many values
• Many DRVs take counting-number values from 0 to N
Discrete Random Variable (DRV) (2)

• Example:
– X= number of coins showing heads if we flip 6
coins
– The table of probabilities for each outcome of
flipping 6 coins looks like below:
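The slide's probability table is not reproduced in the text. Assuming six fair, independent coins (an assumption), the probabilities can be computed as the number of ways to get k heads divided by the 2^6 = 64 equally likely outcomes:

```python
from fractions import Fraction
from math import comb

n_coins = 6  # assumes fair, independent coins

# P(X = k) = C(6, k) / 2^6 for k = 0, 1, ..., 6 heads.
distribution = {
    k: Fraction(comb(n_coins, k), 2 ** n_coins)
    for k in range(n_coins + 1)
}

for k, p in distribution.items():
    print(f"P(X = {k}) = {p} = {float(p):.4f}")
```

The probabilities are symmetric around 3 heads and sum to 1, as any discrete probability distribution must.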
Discrete Probability Distribution (1)
• DPD consists of all possible values of the
random variable with their associated
probabilities
Discrete Probability Distribution (1)
Discussion
Discrete Probability Distribution (2)
Expected Value: a central value of RV

• Note: Expected value measures only one dimension of the random variable: its central value.
• Discussion: Is it enough in order to understand a data set?
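As a sketch, the expected value of a discrete random variable is the probability-weighted sum of its values, μ = Σ x·p(x). The example assumes the six-fair-coins distribution (number of heads); the function name is illustrative.

```python
from fractions import Fraction
from math import comb


def expected_value(distribution):
    """Sum of x * p(x) over all possible values x."""
    return sum(x * p for x, p in distribution.items())


# Number of heads in 6 fair coin flips (fairness is an assumption).
coin_distribution = {k: Fraction(comb(6, k), 64) for k in range(7)}
mu = expected_value(coin_distribution)

print(mu)
```

Two distributions can share the same expected value and still differ greatly in spread, which is why the central value alone is not a full description.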
Expected Value: Discussion
Expected Value: Investment Decision

Discussion: What investment option do you decide to choose?


Variance and Standard Deviation

• Are X, x, μ, p(x) familiar to you?


• Variance measures the variability of a random variable
• The larger the variance the more variability in the data
sets. Discussion: Why?
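For a discrete random variable, the variance is the probability-weighted average of squared deviations from the expected value, σ² = Σ (x − μ)²·p(x), and the standard deviation is its square root. The sketch below again assumes six fair coins; the function name is illustrative.

```python
from fractions import Fraction
from math import comb, sqrt


def variance(distribution):
    """Sum of (x - mu)^2 * p(x), with mu = sum of x * p(x)."""
    mu = sum(x * p for x, p in distribution.items())
    return sum((x - mu) ** 2 * p for x, p in distribution.items())


# Number of heads in 6 fair coin flips (fairness is an assumption).
coin_distribution = {k: Fraction(comb(6, k), 64) for k in range(7)}
sigma_squared = variance(coin_distribution)
sigma = sqrt(sigma_squared)

print(sigma_squared, round(sigma, 4))
```

Values far from μ contribute large squared deviations, so a distribution that puts more probability on them has a larger variance.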
How can Variance and SD help you? (1)
How can Variance and SD help you? (2)
• Check it yourself in Excel
How can Variance and SD help you? (3)
• Check it yourself in Excel
How can Variance and SD help you? (4)
• What investment option is less risky?
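The risk comparison can be sketched with two made-up investment options that have the same expected return but different spreads. The payoffs and probabilities below are hypothetical (the slide's figures are not in the text), and using the standard deviation as the measure of risk is the convention assumed here.

```python
from math import sqrt


def mean_and_sd(distribution):
    """Expected value and standard deviation of a discrete distribution."""
    mu = sum(x * p for x, p in distribution.items())
    var = sum((x - mu) ** 2 * p for x, p in distribution.items())
    return mu, sqrt(var)


# Hypothetical returns (%) mapped to their probabilities.
option_a = {8: 0.5, 12: 0.5}
option_b = {-10: 0.5, 30: 0.5}

mean_a, sd_a = mean_and_sd(option_a)
mean_b, sd_b = mean_and_sd(option_b)
less_risky = "A" if sd_a < sd_b else "B"

print(mean_a, sd_a)
print(mean_b, sd_b)
print("less risky option:", less_risky)
```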
