STAT Note1-4

RVU Introduction To Statistics For Economic
March 20211
Lecture -one
1. Introduction:
1.1. Definition and Classification of Statistics
 Definition of statistics
Statistics is defined as the science of collecting, organizing, presenting, analyzing and interpreting numerical data in order to make decisions on the basis of such analysis.
Data: the measurements or observations (values) of a variable (factor).
- A collection of data values forms a data set.
- Each value in the data set is called a data value or datum.
 Classification of Statistics
There are two broad branches of statistics:
i.) Descriptive statistics: statistical methods that deal with organizing or summarizing a given set of data into a meaningful form. For example, the figures in newspapers, magazines, reports and other publications come from data that have been summarized and presented in a form that is easy for the reader to understand.
• Here there is no generalization or conclusion about the population.
• Consists of collection, organization, and presentation of data.
E.g. frequency distributions, measures of central tendency (such as the mean and median) and measures of dispersion (such as the range, standard deviation and variance).
ii.) Inferential statistics: the process of drawing conclusions about a population based on the information obtained from a sample. Because of time, cost and other constraints, data are collected from only a small portion of the group (a sample).
• It makes estimates and tests claims about the characteristics of a population based on the sample.
• It is used to describe, infer, estimate or approximate the characteristics of the target population.
Examples: As a result of the recent reduction in oil production by oil-producing nations, we can expect the price of gasoline to double in the next year.
As a result of a recent survey of public opinion, most Americans are in favor of building additional nuclear power plants.
1.2. Stages in Statistical Investigation
A statistical investigation passes through the following five stages.
1. Data collection
2. Organization of data
Adjusted by Sibex A. LECTURE NOTES Page 1
3. Presentation of data
4. Analysis of data
5. Interpretation
Stage 1. Data collection
• Is the process of gathering information or data about the variable of interest for a specific purpose.
• Constitutes the first step in a statistical investigation.
• Utmost care must be exercised.
• The data may be available from existing published or unpublished sources, or may have to be collected afresh, i.e. data may be obtained from either primary or secondary sources.
Stage 2. Organization of data
• Is the process of---.
Editing: is the process of checking and correcting the collected data for omissions, inconsistencies, irrelevant answers and wrong computations.
Classification: is the task of grouping the collected and edited data into similar categories based on some criteria.
Tabulation: is putting the classified data in the form of a table.
• Arranging or classifying data in a suitable order makes the information easier to present.
Stage 3. Presentation of data
• Data are presented in the form of charts and diagrams.
• Large data sets are presented in tables in a summarized and condensed manner.
• The main purpose of data presentation is to facilitate statistical analysis.
Stage 4. Analysis of data
• This is the stage where we critically study the data to draw conclusions about the population parameter.
• The purpose of data analysis is to dig out information useful for decision making.
Analysis may involve anything from the calculation of averages and measures of dispersion to more sophisticated mathematical techniques such as regression and correlation analysis.
Stage 5. Interpretation
• This is the stage where valid conclusions are drawn and decisions are made from the results obtained through data analysis.
• It is a difficult task and necessitates a high degree of skill and experience. If data that have been analyzed are not properly interpreted, the whole purpose of the investigation may be defeated and fallacious conclusions may be drawn. Great care is therefore needed when making interpretations.
1.3. Definitions of Some Basic Terms
a) Population: is the totality of cases (items) under consideration in a given investigation or research.
E.g. all the students in WU, all the workers in a factory, etc.
A population can be finite or infinite.
• Finite population: a population that is limited in size (its members can be counted). E.g. the number of workers in farmland hotel.
• Infinite population: a population that is unrestricted in nature (not limited in size). E.g. the production of bacteria (the observations cannot be counted, even in theory).
b) Sample: is a subgroup or part of the population selected by some method (sampling technique) in order to estimate the characteristics of the population. E.g. 25 staff of WU out of 350 staff.
c) Elementary unit: is the specific person, business, product, and so on, on which some characteristic is measured or categorized (information is recorded). E.g. a particular person in the class whose weight is measured.
d) Sampling unit: is one of the finite number of distinct, non-overlapping and identifiable units obtained by dividing the population for the purpose of sample selection.
e) Variable: is a factor or characteristic that can take on different possible values or outcomes.
Examples: income, height, weight, sex, salary, etc.
A variable can be qualitative or quantitative.
• A qualitative variable: is a variable that is expressed in categories.
• Its values are generally described by words or letters, i.e. they cannot be expressed as meaningful numbers. For instance, hair color might be black, dark brown or light brown. Other examples are sex, marital status, religion, region, political affiliation, etc.
• A quantitative variable: is a variable that can be measured numerically (a measurable quantity).
• Its values are the results of counting or measuring attributes of a population.
• It may be either discrete or continuous.
• Observations obtained through such a process are called quantitative data.
E.g. observations regarding height, income, weight, age, etc.

Quantitative random variables are divided into two types:
i. Discrete random variable: usually obtained by counting.
• Here there is a jump between values.
E.g. number of children per family, number of students in the class, etc.
ii. Continuous random variable: usually obtained by measurement.
• There is no jump between values (it can take any value within a specific range).
E.g. age, weight, height, temperature, etc.
f) Parameter: is a numerical characteristic of the population (any population constant). It is the numerical result obtained by measuring the whole population. E.g. population total, population mean, population proportion, population ratio, etc.
g) Statistic: is a function of observable random variables that doesn't involve any unknown parameter. It is used to estimate a parameter.
h) Sampling: is the method of obtaining a sample from the population.
i) Survey (experiment): is the device for obtaining the desired data.
E.g. collecting observations on the weight of students in the Statistics department.
j) Statistical design: is the process that involves a decision problem and choosing an approach to solve the problem.
1.4. Applications, Uses and Limitations of Statistics
 Applications of Statistics:
Statistics can be applied in any field of study which seeks quantitative evidence. For instance,
engineering, economics, natural science, etc.
a) Engineering:
 To compare the breaking strength of two types of materials.
 To determine the probability of reliability of a product.
 To control the quality of products in a given production process.
 To compare the improvement of yield due to certain additives such as fertilizer, herbicides, etc
b) Economics: Statistics are widely used in economics study and research.
 To measure and forecast Gross National Product (GNP).
 Statistical analyses of population growth, inflation rate, poverty, unemployment figures, rural
or urban population shifts and so on influence much of the economic policy making.
 Financial statistics are necessary in the fields of money and banking including consumer
savings and credit availability.
c) Statistics and research: there is hardly any advanced research going on without the use of statistics in one form or another. Statistics are used extensively in medical, pharmaceutical and agricultural research.
Usefulness of statistics:-
• Statistics condenses and summarizes complex data. The original set of data (raw data) is normally voluminous and disorganized until it is summarized and expressed in a few numerical values.
• Statistics facilitates comparison of data. Measures obtained from different sets of data can be compared to draw conclusions about those sets. Statistical values such as averages, percentages, ratios, etc. are the tools used for comparing sets of data.
• Statistics helps in predicting future trends. It is extremely useful for analyzing past and present data and predicting future trends.
• Statistics influences the policies of government. The results of statistical studies in areas such as taxation, the unemployment rate, the performance of military equipment, etc. may convince a government to review its policies and plans with a view to meeting national needs and aspirations.
• Statistical methods are very helpful in formulating and testing hypotheses and in developing new theories.
Weaknesses of statistics:-
• It doesn't deal with single (individual) values. Statistics deals only with aggregate values, but in some situations a single individual is highly important to consider. Examples: the sun, the driver of a bus, a president, etc.
• It can't deal directly with qualitative characteristics. It deals only with data that can be quantified. For example, it does not deal with marital status itself (married, single, divorced, widowed) but with the number of married, the number of single and the number of divorced people.
• Its conclusions are not universally true. Statistical conclusions are true only under certain conditions or only on average. The conclusions drawn from the analysis of a sample may differ from the conclusions that would be drawn from the entire population. For this reason, statistics is not an exact science.
• Its interpretation requires a high degree of skill and understanding of the subject. It requires extensive training to read and interpret statistics in their proper context. Wrong conclusions may result if inexperienced people try to interpret statistical results.
• It can be misused. Statistical figures can be misleading unless they are carefully interpreted.
Example: a report by the head of a ministry about the Ethio-Somalia terrorist attack mission stated that the mission dismissed 25% of the terrorists on the first day, 50% on the second day and 75% on the third day. However, we may doubt how the results of the mission were measured and quantified. This leads to misuse of statistical figures.
1.5. Scales of Measurement
The various measurement scales result from the fact that measurement may be carried out under different sets of rules. Generally, there are four scales (levels) of measurement of data. They are (from lowest to highest level):
i.) Nominal Scale:-
• Is characterized by data that consist of names, labels, or categories only.
• The data cannot be arranged in an ordering scheme.
• Arithmetic operations (addition, subtraction, multiplication and division) are not performed on such data.
• Data measured on a nominal scale are qualitative.
E.g. Religion: Christianity, Islam, Hinduism, etc.; Sex: male, female; Eye color: brown, black, etc.
ii.) Ordinal Scale:-
• Data measured on an ordinal scale are similar to nominal scale data, with one big difference: ordinal scale data can be ordered.
For example: a list of the top five national parks in Africa. The parks can be ranked from one to five, but we cannot measure the differences between the ranks.
• Like nominal scale data, ordinal scale data cannot be used in calculations.
• The scale applies whenever observations are not only different from category to category but can also be ranked according to some criterion. The variables deal with relative differences rather than quantitative differences.
• Ordinal data are data which can have meaningful inequalities. The inequality signs < or > may assume any meaning like 'stronger, softer, weaker, better than', etc.
E.g.: Patients may be characterized as unimproved, improved and much improved.
E.g.: Individuals may be classified according to socio-economic status as low, medium and high. It is usually impossible to measure the difference between a member of one category and a member of the next adjacent category.
iii.) Interval Scale:
• Interval scale data can be measured, though the scale does not have a true starting point.
• It is possible not only to order measurements; the distance between any two measurements is also known. Quotients (ratios), however, are not meaningful.
• There is no true zero point, only an arbitrary zero point. Interval data are the type of information in which an increase from one level to the next always reflects the same increase in the characteristic. It is possible to add or subtract interval data, but they may not be multiplied or divided.
E.g.: A temperature of zero degrees does not indicate a lack of heat. Consider the two common temperature scales, Celsius (C) and Fahrenheit (F). The same difference exists between 10°C (50°F) and 20°C (68°F) as between 25°C (77°F) and 35°C (95°F), i.e. the measurement scale is composed of equal-sized intervals. But we cannot say that a temperature of 20°C is twice as hot as a temperature of 10°C, because the zero point is arbitrary.
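The temperature example can be checked numerically. The following is only an illustrative sketch: the helper name `c_to_f` is an assumption of this example (it is not part of the notes), and it uses the standard conversion F = 9C/5 + 32.

```python
def c_to_f(celsius):
    """Convert a Celsius temperature to Fahrenheit."""
    return celsius * 9 / 5 + 32

# Equal differences on one scale stay equal on the other scale.
diff_low = c_to_f(20) - c_to_f(10)     # 68 - 50
diff_high = c_to_f(35) - c_to_f(25)    # 95 - 77

# Ratios are not preserved, because each scale has an arbitrary zero.
ratio_celsius = 20 / 10
ratio_fahrenheit = c_to_f(20) / c_to_f(10)

print("differences:", diff_low, diff_high)
print("ratios:", ratio_celsius, ratio_fahrenheit)
```

The two differences agree, while "twice as hot" in Celsius is not "twice as hot" in Fahrenheit, which is why ratios carry no meaning on an interval scale.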
iv.) Ratio Scale:-
• It is like interval scale data, but it has a true zero point and ratios can be calculated.
• Characterized by the fact that equality of ratios as well as equality of intervals may be determined.
For example, four multiple choice statistics final exam scores are 80, 68, 20 and 92 (out of a possible 100 points). The exams are machine-graded.
The data can be put in order from lowest to highest: 20, 68, 80, 92. Differences between scores are meaningful, and so are ratios: a score of 80 is four times a score of 20.
CHAPTER - 2
Method of Data Collection & Presentation:

2.1. Method of Data Collection
• Collection of data implies a systematic method of gathering the required information from the units under investigation. The quality of the data greatly affects the final output of an investigation. Data can be collected by one or more of the following methods:
i) Direct Observation
In this approach, an investigator stays at the place of survey and notes down the first hand
information. Direct observations can be used to discover a variety of information including
consumer behavior, working methods & other aspects of social & economic behavior. Direct
observation is more experimental and usually applied in scientific studies. It is time consuming
and also costly. Also the method is highly subjective.
ii) Interview Method
It is a conversation between two parties, initiated by the interviewer in order to obtain the required information. The interviewer prepares in advance a series of questions selected for his/her work and then conducts the interview. Interviewing is a technique that is primarily used to gain an understanding of the underlying reasons and motivations for people's attitudes, preferences or behavior. Interviews can be undertaken on a personal one-to-one basis or in a group. They can be conducted at work, at home, in the street or in a shopping centre, or at some other agreed location.
The interview may be face to face or by telephone.
• A face to face interview is advantageous for questioning a person's motives and attitudes about some characteristic or behavior.
• A telephone interview is relatively less time consuming.
Limitations:
• Respondents are sometimes unwilling or reluctant to supply the information.
• Respondents differ in ability and motivation to supply the information clearly.
• It requires a highly experienced and skilled interviewer.
• The personal bias and prejudice of the interviewer may affect the result.
• A telephone interview excludes those who don't have a telephone.
iii) Questionnaire Method

Under this method, a list of questions related to the survey is prepared and sent to the various respondents by hand, post, website, email, etc. However, this method cannot be used if the respondent is illiterate.
A major point to take into account while preparing the questionnaire is that the number of questions should be small. Respondents are naturally not comfortable with lengthy questionnaires, which usually bore them. If a lengthy questionnaire is unavoidable, it should preferably be divided into two or more parts.
Source of Data
Statistical data may be classified under two categories depending upon the source.
1. Primary data: data collected by the investigator himself or herself for the purpose of a specific inquiry or study.
• Mostly generated by surveys conducted by individuals or research institutions.
• More reliable and accurate, since the investigator can extract the correct information by removing any doubts in the minds of the respondents regarding certain questions.
2. Secondary data: when an investigator uses data which have already been collected by others, such data are called secondary data.
• Such data are primary data for the agency that collected them, and become secondary for someone else who uses them for his or her own purposes.
Examples of sources are books, journals, reports, etc.
When our source is secondary data, check that:
• The type and objective of the situation are known.
• The purpose for which the data were collected is compatible with the present problem.
• The nature and classification of the data are appropriate to our problem.
• There are no biases or misreporting in the published data.
Note: Data which are primary for one investigator may be secondary for another.
2.2. Methods of Data Presentation
Having collected and edited the data, the next important step is to organize it, that is, to present it in a readily comprehensible, condensed form that helps in drawing inferences from it. It is also necessary that like items be separated from unlike ones.
The presentation of data is broadly classified into the following two categories:
• Tabular presentation
• Diagrammatic and graphic presentation.
The process of arranging data into classes or categories according to similarities is technically called classification. It eliminates inconsistency and also brings out the points of similarity and/or dissimilarity of the collected items/data.
Classification is necessary because it would not be possible to draw inferences and conclusions from a large set of unorganized [raw] data.
2.2.1. Frequency distribution
• Frequency: is the number of times a certain value or class of values occurs.
• Frequency distribution (FD): is the organization of raw data in table form using classes and frequencies.
There are three types of FD, and there are specific procedures for constructing each type. The three types are:
i) Categorical FD
ii) Ungrouped FD and
iii) Grouped FD
i. Categorical FD: used for data that can be placed in specific categories, such as nominal or ordinal level data.
Example 2.1: Twenty five patients were given a blood test to determine their blood type. The data are as shown below: A B B AB O A O O B AB B B B O A O O O AB AB A O O B A.
Solution: Since the data are categorical, taking the four blood types as classes we can construct a FD as shown below.
Step 1: Make a table as shown below.
CLASS    TALLY         FREQUENCY    PERCENT
A
B
AB
O
Step 2: Tally the data and place the result under the column Tally.
Step 3: Count the tallies and place the result under the column Frequency.
Step 4: Find the percentage of values in each class by the formula % = f/n * 100%, where f = frequency and n = total number of observations.
CLASS    TALLY         FREQUENCY    PERCENT
A        /////         5            5/25 * 100 = 20%
B        ///// //      7            7/25 * 100 = 28%
AB       ////          4            4/25 * 100 = 16%
O        ///// ////    9            9/25 * 100 = 36%
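The tallying steps of Example 2.1 can also be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes; the function name `categorical_fd` is hypothetical, and `collections.Counter` from the standard library does the tallying.

```python
from collections import Counter

# Blood types of the 25 patients from Example 2.1.
blood_types = ("A B B AB O A O O B AB B B B O A O O O AB AB "
               "A O O B A").split()

def categorical_fd(values):
    """Return {class: (frequency, percentage)} for categorical data."""
    counts = Counter(values)          # Steps 2 and 3: tally and count
    n = len(values)                   # total number of observations
    return {cls: (f, 100 * f / n) for cls, f in counts.items()}  # Step 4

table = categorical_fd(blood_types)
for cls in ("A", "B", "AB", "O"):
    f, pct = table[cls]
    print(f"{cls:<3} {f:>2} {pct:.0f}%")
```

The printed frequencies and percentages should agree with the completed table of the example.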
ii. Ungrouped Frequency Distribution (UFD): a UFD is a table of all the raw score values that could possibly occur in the data, along with the number of times each actually occurs.
A UFD is often constructed for a small set of data or for data on a discrete variable.
Constructing an ungrouped frequency distribution:
• First find the smallest and largest raw scores in the collected data.
• Arrange the data in order of magnitude and count the frequencies.
• To facilitate counting, one may include a column of tallies.
Example 2.2: The following data represent the number of days of sick leave taken by each of 50 workers of a company over the last 6 weeks.
2 0 0 5 8 3 4 1 0 0 7 1
7 1 5 4 0 4 0 1 8 9 7 0
1 7 2 5 5 4 3 3 0 0 2 5
1 3 0 2 4 5 0 5 7 5 1 1
0 2
i. Construct an ungrouped frequency distribution.
ii. How many workers had at least 1 day of sick leave?
iii. How many workers had between 2 and 6 days of sick leave?
Solution:
i. Since this data set contains only a relatively small number of distinct values, it is convenient to represent it in a frequency table which presents each distinct value along with its frequency of occurrence:
Class    Frequency    Cumulative frequency
0        12           12
1        8            20
2        5            25
3        4            29
4        5            34
5        8            42
7        5            47
8        2            49
9        1            50
ii. Since 12 of the 50 workers had no days of sick leave, the answer is 50 - 12 = 38.
iii. Taking "between" to mean strictly between 2 and 6, the answer is the sum of the frequencies for the values 3, 4 and 5, that is 4 + 5 + 8 = 17.
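As a cross-check of Example 2.2, the ungrouped frequency distribution and both answers can be reproduced with a short Python sketch. The variable names are illustrative assumptions; "between 2 and 6" is read as strictly between, as in the solution.

```python
from collections import Counter
from itertools import accumulate

# Days of sick leave for the 50 workers in Example 2.2.
sick_days = [2, 0, 0, 5, 8, 3, 4, 1, 0, 0, 7, 1,
             7, 1, 5, 4, 0, 4, 0, 1, 8, 9, 7, 0,
             1, 7, 2, 5, 5, 4, 3, 3, 0, 0, 2, 5,
             1, 3, 0, 2, 4, 5, 0, 5, 7, 5, 1, 1,
             0, 2]

counts = Counter(sick_days)
values = sorted(counts)                    # distinct values in order of magnitude
freqs = [counts[v] for v in values]        # frequency of each distinct value
cum_freqs = list(accumulate(freqs))        # running (cumulative) totals

# ii. workers with at least 1 day of sick leave
at_least_one = sum(f for v, f in counts.items() if v >= 1)
# iii. workers with strictly between 2 and 6 days (values 3, 4 and 5)
between_2_and_6 = sum(f for v, f in counts.items() if 2 < v < 6)

for v, f, cf in zip(values, freqs, cum_freqs):
    print(v, f, cf)
print(at_least_one, between_2_and_6)
```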
iii. Grouped Frequency Distribution (GFD): a frequency distribution having several values grouped into one class.
* Usually used when the range of the data is large.
A grouped frequency distribution may be of the inclusive or the exclusive type; in either case the classes must not overlap one another.
(a) Inclusive
(b) Exclusive
a) In the inclusive type of frequency distribution, the upper limit of one class does not coincide with the lower limit of the next class.
b) In the exclusive type of frequency distribution, the upper limit of one class coincides with the lower limit of the next class.
Definition of some basic terms
• Grouped frequency distribution: a FD in which several numbers are grouped into one class.
• Class limits (CL): separate one class from another. The limits could actually appear in the data, and there are gaps between the upper limit of one class and the lower limit of the next class.
• Unit of measure (U): the smallest possible difference between successive values. E.g. 1, 0.1, 0.01, 0.001, etc.
• Class boundaries: separate one class in a grouped frequency distribution from another. The boundaries have one more decimal place than the raw data. There is no gap between the upper boundary of one class and the lower boundary of the succeeding class. The lower class boundary is found by subtracting half of the unit of measure from the lower class limit, and the upper class boundary is found by adding half of the unit of measure to the upper class limit.
• Class width (W): the difference between the upper and lower boundaries of any class. The class width is also the difference between the lower limits (or the upper limits) of two consecutive classes.
• Class mark (midpoint): found by adding the lower and upper class limits (or boundaries) and dividing the sum by two.
• Cumulative frequency: the number of observations less than the upper class boundary, or greater than the lower class boundary, of a class.
• CF (less than type): the number of values less than the upper class boundary of a given class.
• CF (greater than type): the number of values greater than the lower class boundary of a given class.
• Relative frequency (Rf): the frequency divided by the total frequency. This gives the proportion (multiplied by 100, the percentage) of values falling in that class.
Rfi = fi/n = fi/∑fi
• Relative cumulative frequency (RCf): the running total of the relative frequencies, or the cumulative frequency divided by the total frequency. It gives the proportion of values which are less than the upper class boundary (or, for the greater than type, greater than the lower class boundary).
RCfi = Cfi/n = Cfi/∑fi
Steps Needed to Construct Grouped Frequency Distribution
1. Calculate the range (R):
R = Xmax - Xmin
2. Calculate the number of classes using Sturges' formula:
k = 1 + 3.322 log n (logarithm to base 10), where k = number of classes and n = number of observations (n = Σfi).
Always round k up. E.g. k = 4.5 ≈ 5.
3. Calculate the class width:
W = R/k, using the rounded-up k; W must also be rounded up to the next whole number.
4. Identify the starting point: the lower class limit of the first class is LCL1 = Xmin, that of the second class is LCL2 = Xmin + W, and so on.
When grouping data the following rules are important:
 The groups must not overlap, otherwise there is confusion concerning in which group a
measurement belongs.
 There must be continuity from one group to the next, which means that there must be no gaps.
Otherwise some measurements may not fit in a group.
 The groups must range from the lowest measurement to the highest measurement so that all of
the measurements have a group to which they can be assigned.
 The groups should normally be of an equal width, so that the counts in different groups can
easily be compared.
Example 1: Construct a grouped frequency distribution for the following raw data.
11, 29, 6, 33, 14, 31, 22, 27, 19, 20,
21, 18, 17, 22, 38, 23, 26, 34, 39, 27
1. R = Xmax - Xmin = 39 - 6 = 33
2. k = 1 + 3.322 log 20 = 5.32 ≈ 6
3. W = R/k = 33/6 = 5.5 ≈ 6
4. LCL1 = Xmin = 6
Class limit    Frequency    Class mark (Mi)    Class boundary    Lcf    Mcf    Rf      Pf
6-11           2            8.5                5.5-11.5          2      20     0.10    10%
12-17          2            14.5               11.5-17.5         4      18     0.10    10%
18-23          7            20.5               17.5-23.5         11     16     0.35    35%
24-29          4            26.5               23.5-29.5         15     9      0.20    20%
30-35          3            32.5               29.5-35.5         18     5      0.15    15%
36-41          2            38.5               35.5-41.5         20     2      0.10    10%
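The four construction steps can be sketched as a small Python function. This is an illustrative sketch under stated assumptions: the name `grouped_fd` and its `unit` parameter are hypothetical, the data are whole numbers (unit of measure 1), and the inclusive type of class is built.

```python
import math

data = [11, 29, 6, 33, 14, 31, 22, 27, 19, 20,
        21, 18, 17, 22, 38, 23, 26, 34, 39, 27]

def grouped_fd(values, unit=1):
    """Build an inclusive grouped frequency distribution.

    `unit` is the unit of measure of the data (1 for whole numbers).
    """
    n = len(values)
    data_range = max(values) - min(values)         # Step 1: R = Xmax - Xmin
    k = math.ceil(1 + 3.322 * math.log10(n))       # Step 2: Sturges' formula, rounded up
    width = math.ceil(data_range / k)              # Step 3: W = R/k, rounded up
    rows = []
    lower = min(values)                            # Step 4: LCL1 = Xmin
    for _ in range(k):
        upper = lower + width - unit               # upper class limit
        rows.append({
            "limits": (lower, upper),
            "boundaries": (lower - unit / 2, upper + unit / 2),
            "mark": (lower + upper) / 2,
            "freq": sum(lower <= v <= upper for v in values),
        })
        lower += width                             # next lower class limit
    return rows

rows = grouped_fd(data)
for row in rows:
    print(row["limits"], row["boundaries"], row["mark"], row["freq"])
```

For the 20 observations of the example this yields six classes of width 6 starting at 6, which can be compared line by line with the table.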
2.3. Cumulative Frequency Distribution: a frequency distribution that displays the sum of the frequencies of the consecutive classes above or below a given class.
There are two types of cumulative frequency:
a) Less than cumulative frequency (Lcf): used when interest focuses on the total number of observations below a specified value.
b) More than cumulative frequency (Mcf): used when interest focuses on the total number of observations above a specified value. E.g.
Class    Frequency    Lcf    Mcf
0        3            3      20
1        4            7      17
2        6            13     13
3        4            17     7
4        3            20     3
Total    20
• Relative frequency (Rf): the frequency divided by the total frequency. This gives the proportion (multiplied by 100, the percentage) of values falling in that class.
Rfi = fi/n = fi/∑fi
• Relative cumulative frequency (RCf): the running total of the relative frequencies, or the cumulative frequency divided by the total frequency. It gives the proportion of values which are less than the upper class boundary (or, for the more than type, greater than the lower class boundary).
RCfi = Cfi/n = Cfi/∑fi
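The two kinds of cumulative frequency, together with the relative frequencies, can be computed from the class frequencies alone. The sketch below uses the five-class table of this section; the variable names are illustrative assumptions.

```python
from itertools import accumulate

freqs = [3, 4, 6, 4, 3]        # class frequencies for classes 0, 1, 2, 3, 4
n = sum(freqs)                 # total frequency

# Less than type: running total from the first class downwards.
lcf = list(accumulate(freqs))
# More than type: running total from the last class upwards.
mcf = list(accumulate(reversed(freqs)))[::-1]

# Relative frequency and relative (less than) cumulative frequency.
rf = [f / n for f in freqs]
rcf = [c / n for c in lcf]

for row in zip(freqs, lcf, mcf, rf, rcf):
    print(row)
```

The last less than cumulative frequency and the first more than cumulative frequency both equal the total frequency, which is a quick check on any such table.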
2.2.2. Diagrammatic & Graphic Presentation of Data

► Presenting data diagrammatically is simple, and the result is easy to understand.
i) Bar chart (bar diagram): a series of equally spaced bars of equal width (base), where the height of each bar represents the frequency (or amount) associated with each class.
Usually applied to categorical random variables.
A bar chart can be either vertical or horizontal.
E.g. Construct a bar chart for the following scholarship data.
Class year    Frequency
1st           5
2nd           7
3rd           9
4th           4
Total         25
Fig 1 Vertical bar chart (frequency against class year)

There are various types of bar chart. These are:
a) Simple bar chart: as in Fig 1.
b) Multiple bar chart: two or more related sets of information shown as adjacent bars for each category.
c) Component (sub-divided) bar chart.
ii) Pie chart: a circle that is divided into sectors according to the percentage frequency of each category of the distribution, with the angle of each sector a share of 360° in proportion to the amount associated with that category.
E.g. Construct a pie chart for the scholarship data.
Class    Frequency    Rf      Pf     360° x Rf (in degrees)
1st      5            5/25    20%    72°
2nd      7            7/25    28%    100.8°
3rd      9            9/25    36%    129.6°
4th      4            4/25    16%    57.6°
Total    25
Fig 2 Pie chart
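The sector angles of the pie chart follow directly from the relative frequencies. A minimal Python sketch, with illustrative variable names, is:

```python
# Scholarship data: class year -> frequency.
scholarship = {"1st": 5, "2nd": 7, "3rd": 9, "4th": 4}
total = sum(scholarship.values())

# Angle of each sector = 360 degrees x relative frequency.
angles = {year: 360 * f / total for year, f in scholarship.items()}
# Percentage frequency of each class.
percents = {year: 100 * f / total for year, f in scholarship.items()}

for year in scholarship:
    print(year, scholarship[year], f"{percents[year]:.0f}%", f"{angles[year]:.1f}")
```

The angles must add up to 360°, which gives a simple check on the arithmetic in the table.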

Graphic presentation of data:

iii. Histogram: usually used to present quantitative data.
• Is a graph consisting of a series of rectangles whose bases span the class boundaries of the corresponding classes (so each base equals the class width) and whose heights are proportional to the class frequencies.
• It is constructed from a grouped frequency distribution.
• In a histogram we use class boundaries on the X-axis.
E.g. Construct a histogram for the following data.
Class limit    Class boundary    Frequency
6-10           5.5-10.5          1
11-15          10.5-15.5         2
16-20          15.5-20.5         3
21-25          20.5-25.5         5
26-30          25.5-30.5         4
31-35          30.5-35.5         3
36-40          35.5-40.5         2
Total                            20
Fig 3 Histogram
iv. Frequency polygon: a line graph that displays the data using a line that connects points plotted for the frequencies at the class marks, i.e. the frequency of each class gives the height of the point above its class mark.
* A frequency polygon can also be superimposed on a histogram.
Fig. 4. Frequency polygon superimposed on a histogram (frequency against class boundaries 5.5, 10.5, 15.5, 20.5, 25.5, 30.5, 35.5, 40.5)
Chapter - Three
Measure of Central Tendency

3.1. Introduction
Objectives
At the end of this chapter students will be able to:
• Identify measures of central tendency.
• Understand the properties of the arithmetic mean.
• Summarize an aggregate of statistical data by using a single measure.
• Define and calculate the mean, mode and median.
• Measure the position of data using quartiles, deciles and percentiles, with their interpretation.
3.2. The Summation Notation ($\sum$)
• Let a data set consist of a number of observations, represented by $x_1, x_2, \ldots, x_n$, where n (the last subscript) denotes the number of observations in the data and $x_i$ is the ith observation. Then the sum of all numbers $x_i$, where i goes from 1 up to n, is symbolically given by $\sum_{i=1}^{n} x_i$, that is
$\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n$
x - Whole set of numbers
$x_i$ - Specific score in a set of numbers
n - Total number of observations.

Example 3.1: For instance, a data set consisting of the six measurements 2, 3, 9, 10, 8 and -2 is represented by $x_1, x_2, \ldots, x_6$, where $x_1 = 2$, $x_2 = 3$, $x_3 = 9$, $x_4 = 10$, $x_5 = 8$ and $x_6 = -2$. Their sum becomes
$\sum_{i=1}^{6} x_i = x_1 + x_2 + \cdots + x_6 = 2 + 3 + 9 + 10 + 8 + (-2) = 30$
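Some Properties of the Summation Notation

1. $\sum_{i=1}^{n} c = n \cdot c$, where c is a constant number.
2. $\sum_{i=1}^{n} b x_i = b \sum_{i=1}^{n} x_i$, where b is a constant number.
3. $\sum_{i=1}^{n} (a + b x_i) = n \cdot a + b \sum_{i=1}^{n} x_i$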
4. $\sum_{i=1}^{n} (x_i \pm y_i) = \sum_{i=1}^{n} x_i \pm \sum_{i=1}^{n} y_i$
5.
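The standard rules for sums of constants, scaled terms, linear expressions and term-by-term sums can be verified numerically on the data of Example 3.1. The sketch below is only an illustration: the second series `y` and the constants `a`, `b` and `c` are hypothetical values chosen for the check, not data from the notes.

```python
x = [2, 3, 9, 10, 8, -2]    # measurements from Example 3.1
y = [1, 0, 4, -3, 5, 2]     # hypothetical second series, for illustration only
a, b, c = 4, 3, 7           # arbitrary constants
n = len(x)

total_x = sum(x)
total_y = sum(y)

# Summing a constant n times gives n * c.
sum_constant = sum(c for _ in range(n))
# A constant factor can be taken outside the sum.
sum_scaled = sum(b * xi for xi in x)
# The sum of (a + b * x_i) equals n * a + b * sum(x_i).
sum_linear = sum(a + b * xi for xi in x)
# A sum of term-by-term sums splits into two separate sums.
sum_pairs = sum(xi + yi for xi, yi in zip(x, y))

print(sum_constant == n * c)
print(sum_scaled == b * total_x)
print(sum_linear == n * a + b * total_x)
print(sum_pairs == total_x + total_y)
```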
3.3. Properties/characteristics of measures of central tendency
A good average should be:
 Rigidly defined (unique).
 Based on all observation under investigation.
 Easily understood.
 Simple to compute.
 Suitable for further mathematical treatment.
 Little affected by fluctuations of sampling.
 Not highly affected by extreme values.
3.4. Types of measures of central tendency

Measures of central tendency give us information about the location of the center of the distribution of data values. A single value that describes the characteristics of the entire mass of data is called a measure of central tendency. The types of measures of central tendency are:
• The Mean
  - Arithmetic Mean
  - Weighted Arithmetic Mean
  - Combined Mean
  - Geometric Mean
  - Harmonic Mean
• The Median
• The Mode or modal value
3.3.1. The Mean
a) Arithmetic mean: is defined as the sum of the measurements of the items divided by the total number of items. It is usually denoted by $\bar{x}$.
Arithmetic Mean for individual series
Suppose $x_1, x_2, \ldots, x_n$ are observed values in a sample of size n from a population of size N, n < N. Then the arithmetic mean of the sample, denoted by $\bar{x}$, is given by
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$
If we take an entire population, the mean is denoted by μ and is given by:
$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{\sum_{i=1}^{N} x_i}{N}$
where N stands for the total number of observations in the population.

Example 3.3: The number of flowers per plant is given below. Find the arithmetic mean.
i. 5 12 9 6
ii. 6 8 6 7 8
Solution:
i. The sample values are: 5 12 9 6
$\bar{x} = \frac{\sum x_i}{n} = \frac{5 + 12 + 9 + 6}{4} = \frac{32}{4} = 8$
The arithmetic mean of the sample values is 8.

ii. The sample values are: 6 8 6 7 8
$\bar{x} = \frac{\sum x_i}{n} = \frac{6 + 8 + 6 + 7 + 8}{5} = \frac{35}{5} = 7$
The arithmetic mean of the sample values is 7.

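Arithmetic mean for discrete data arranged in frequency distribution
When the numbers $x_1, x_2, \dots, x_k$ occur with frequencies $f_1, f_2, \dots, f_k$ respectively, then the
mean can be expressed in a more compact form as:
$\bar{x} = \frac{f_1 x_1 + f_2 x_2 + \dots + f_k x_k}{f_1 + f_2 + \dots + f_k} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}$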
Example 3.4
Calculate the arithmetic mean of the pulse rates (beats per minute) of eleven students:
60 60 71 68 71 72 71 76 72 80 80
$\bar{x} = \frac{\sum x_i}{n} = \frac{60+60+71+68+71+72+71+76+72+80+80}{11} = \frac{781}{11} = 71$
In this case there are two 60's, one 68, three 71's, two 72's, one 76, and two 80's. The number of
times each number occurs is called its frequency and the frequency is usually denoted by f. The
information in the sentence above can be written in a table, as follows.
Value, xi        60   68   71   72   76   80   Total
Frequency, fi     2    1    3    2    1    2      11
xi fi           120   68  213  144   76  160     781
The formula for the arithmetic mean for data of this type is
$\bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{\sum f_i x_i}{n}$
In this case we have:
$\bar{x} = \frac{781}{11} = 71$.
The mean pulse rate (beats per minute) of the eleven students is 71.
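The frequency-table calculation of Example 3.4 can be sketched in Python as follows. The sketch builds the frequency table from the raw pulse rates with `collections.Counter` from the standard library; the variable names are illustrative only.

```python
from collections import Counter

# Raw pulse rates from Example 3.4.
pulse_rates = [60, 60, 71, 68, 71, 72, 71, 76, 72, 80, 80]

# Frequency table: value -> number of times it occurs.
freq_table = Counter(pulse_rates)

n = sum(freq_table.values())                        # sum of f_i
sum_fx = sum(x * f for x, f in freq_table.items())  # sum of f_i * x_i
pulse_mean = sum_fx / n

# Print the rows of the table: x_i, f_i, x_i * f_i.
for x in sorted(freq_table):
    print(x, freq_table[x], x * freq_table[x])
print("n =", n, "sum f*x =", sum_fx, "mean =", pulse_mean)
```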
Arithmetic Mean for Grouped Continuous Frequency Distribution

If data are given in the form of a continuous frequency distribution, the sample mean can be
computed as
$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}$, where $x_i$ is the class mark of the ith class, i = 1, 2, ..., k;
$f_i$ is the frequency of the ith class and k is the number of classes.
Note that $\sum_{i=1}^{k} f_i = n$ = the total number of observations.
Example 3.5
The following frequency table gives the height (in inches) of 100 students in a college.
Class boundary 60-62 62-64 64-66 66-68 68-70 70-72 Total
Frequency (fi) 5 18 42 20 8 7 100
Calculate the mean
Solution:
The formula to be used for the mean is as follows:
$\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$
Let us calculate these values and put them in a table for the sake of convenience.
Class boundary   Frequency (fi)   Mid-point (xi)   fi xi
60 - 62                5               61            305
62 - 64               18               63           1134
64 - 66               42               65           2730
66 - 68               20               67           1340
68 - 70                8               69            552
70 - 72                7               71            497
Total                100                            6558
Substituting these values with $\sum f_i$ = n = 100, we get
$\bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{6558}{100} = 65.58$
The mean height of the students is 65.58 inches.
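The grouped-data calculation of Example 3.5 can be reproduced with the short sketch below. It assumes, as the example does, that every observation in a class is represented by the class mark (mid-point); the variable names are illustrative.

```python
# Class boundaries and frequencies from Example 3.5 (heights in inches).
class_boundaries = [(60, 62), (62, 64), (64, 66), (66, 68), (68, 70), (70, 72)]
frequencies = [5, 18, 42, 20, 8, 7]

# Class mark = mid-point of the lower and upper class boundary.
class_marks = [(lower + upper) / 2 for lower, upper in class_boundaries]

n = sum(frequencies)                                           # sum of f_i
sum_fx = sum(f * x for f, x in zip(frequencies, class_marks))  # sum of f_i * x_i
grouped_mean = sum_fx / n

print(class_marks)
print("n =", n, "sum f*x =", sum_fx, "mean =", grouped_mean)
```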

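Properties of the Arithmetic Mean
• The algebraic sum of the deviations of a set of numbers $x_1, x_2, \dots, x_n$ from their mean $\bar{x}$ is
always zero, i.e. $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$.
• The sum of squares of deviations from the mean is the least. That is, $\sum_{i=1}^{n} (x_i - a)^2$ is minimum when
$a = \bar{x}$.
• If the mean of $x_1, x_2, \dots, x_n$ is $\bar{x}$, then
a) The mean of $x_1 \pm k, x_2 \pm k, \dots, x_n \pm k$ will be $\bar{x} \pm k$.
b) The mean of $kx_1, kx_2, \dots, kx_n$ will be $k\bar{x}$.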
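These properties (deviations from the mean sum to zero, the sum of squared deviations is smallest about the mean, and adding or multiplying by a constant k shifts or scales the mean by k) can be checked numerically. The sketch below reuses sample i of Example 3.3; the constant k = 2 and the comparison points 7, 7.5, 8.5 and 9 are arbitrary values chosen for illustration.

```python
def mean(values):
    return sum(values) / len(values)


def sum_sq_dev(values, a):
    """Sum of squared deviations of the values about an arbitrary point a."""
    return sum((x - a) ** 2 for x in values)


data = [5, 12, 9, 6]  # sample i from Example 3.3, mean 8
x_bar = mean(data)
k = 2                 # arbitrary constant, chosen for illustration

# Property 1: deviations from the mean sum to zero.
sum_of_deviations = sum(x - x_bar for x in data)

# Property 2: the sum of squared deviations is smallest about the mean.
ssd_at_mean = sum_sq_dev(data, x_bar)
ssd_elsewhere = [sum_sq_dev(data, a) for a in (7, 7.5, 8.5, 9)]

# Property 3: adding k shifts the mean by k; multiplying by k scales it by k.
shifted_mean = mean([x + k for x in data])
scaled_mean = mean([k * x for x in data])

print(sum_of_deviations, ssd_at_mean, ssd_elsewhere, shifted_mean, scaled_mean)
```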
Merits of Arithmetic Mean
• Arithmetic mean has a rigidly defined mathematical formula so that its value is always definite
or unique. It can be calculated for any set of numerical data.
• It is calculated based on all observations.
• Arithmetic mean is simple to calculate and easy to understand.
• It doesn't need arrangement of data in increasing or decreasing order.
• Arithmetic mean of many samples from the same population does not fluctuate considerably.
• It affords a good standard of comparison.
Demerits of Arithmetic Mean
• It can't be calculated for data which are not quantifiable.
• It is highly affected by extreme (abnormal) values in the series.
• It can be a number which does not exist in the series.
• It can't be calculated for grouped continuous open-ended classes.
b) Weighted Arithmetic Mean
While calculating the simple arithmetic mean, all items were assumed to be of equal importance
(each value in the data set has equal weight). When the observations have different weights, we
use a weighted average. Weights are assigned to each item in proportion to its relative importance.
If $x_1, x_2, \dots, x_n$ represent values of the items and $w_1, w_2, \dots, w_n$ are the corresponding weights,
then the weighted mean, $\bar{x}_w$, is given by
$\bar{x}_w = \frac{w_1 x_1 + w_2 x_2 + \dots + w_n x_n}{w_1 + w_2 + \dots + w_n} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Example 3.6
A student's final marks in Mathematics, Physics, Chemistry and Biology are respectively A, B, D
and C. If the respective credits received for these courses are 4, 4, 3 and 2, determine the
approximate average mark the student has got for the courses.
Solution
We use a weighted arithmetic mean, the weight associated with each course being taken as the
number of credits received for the corresponding course. The grades are scored as A = 4, B = 3,
C = 2 and D = 1.
Course             Mathematics   Physics   Chemistry   Biology   Total
Grade point (xi)        4           3          1          2
Credits (wi)            4           4          3          2        13
wi xi                  16          12          3          4        35
$\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i} = \frac{35}{13} = 2.69$
The average mark of the student is approximately 2.69.
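Example 3.6 can be sketched in Python as below. The grade points 4, 3, 1 and 2 for the grades A, B, D and C are the values used in the solution table; the function name `weighted_mean` is illustrative.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Example 3.6: grade points for A, B, D, C and the credits of each course.
grade_points = [4, 3, 1, 2]
credits = [4, 4, 3, 2]

average_mark = weighted_mean(grade_points, credits)  # 35 / 13
print(round(average_mark, 2))
```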
c) Combined mean: When a set of observations is divided into k groups and $\bar{x}_1$ is the mean of
$n_1$ observations of group 1, $\bar{x}_2$ is the mean of $n_2$ observations of group 2, ..., $\bar{x}_k$ is the mean of $n_k$
observations of group k, then the combined mean, denoted by $\bar{x}_c$, of all observations taken
together is given by
$\bar{x}_c = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2 + \dots + n_k \bar{x}_k}{n_1 + n_2 + \dots + n_k} = \frac{\sum_{i=1}^{k} n_i \bar{x}_i}{\sum_{i=1}^{k} n_i}$
This is a special case of the weighted mean. In this case the sample sizes are the weights.
Example 3.7
In the previous year there were two sections taking the Statistics course. At the end of the semester,
the two sections got average marks of 70 and 78. There were 45 and 50 students in the two sections
respectively. Find the mean mark for all the students.
Solution:
$\bar{x}_c = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2}{n_1 + n_2} = \frac{45(70) + 50(78)}{45 + 50} = \frac{7050}{95} = 74.21$
The combined mean of all the students is 74.21.
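A minimal Python sketch of Example 3.7 follows. It also shows why the group sizes matter: the plain average of the two section means is 74, which differs from the combined mean. The function name `combined_mean` is illustrative.

```python
def combined_mean(group_means, group_sizes):
    """Mean of all observations pooled together, from group means and sizes."""
    total = sum(n * m for m, n in zip(group_means, group_sizes))
    return total / sum(group_sizes)


# Example 3.7: two sections with means 70 and 78 and sizes 45 and 50.
section_means = [70, 78]
section_sizes = [45, 50]

overall_mean = combined_mean(section_means, section_sizes)   # 7050 / 95
unweighted_average = sum(section_means) / len(section_means)  # ignores sizes

print(round(overall_mean, 2), unweighted_average)
```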

d) Geometric Mean
The geometric mean, like the arithmetic mean, is a calculated average. It is used when observed values
are measured as ratios, percentages, proportions, indices or growth rates.
Geometric mean for individual series: The geometric mean, G.M., of an individual series of
positive numbers $x_1, x_2, \dots, x_n$ is defined as the nth root of their product.
$G.M. = \sqrt[n]{x_1 x_2 \cdots x_n} = \text{antilog}\left(\frac{1}{n} \sum_{i=1}^{n} \log x_i\right)$
Example 3.8
Find the G.M. of (a) 3 and 12 (b) 2, 4 and 8
Solution: a) $G.M. = \sqrt{3 \times 12} = \sqrt{36} = 6$; b) $G.M. = \sqrt[3]{2 \times 4 \times 8} = \sqrt[3]{64} = 4$
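Geometric mean for discrete data arranged in FD: When the numbers $x_1, x_2, \dots, x_k$ occur
with frequencies $f_1, f_2, \dots, f_k$ respectively, then the geometric mean is obtained by
$G.M. = \sqrt[n]{x_1^{f_1} x_2^{f_2} \cdots x_k^{f_k}} = \text{antilog}\left(\frac{1}{n} \sum_{i=1}^{k} f_i \log x_i\right)$, where $n = \sum_{i=1}^{k} f_i$
Example 3.9
Compute the geometric mean of the following values: 3, 3, 4, 4, 4, 5, 6 and 6.
Solution
Values      3   4   5   6
Frequency   2   3   1   2
$G.M. = \sqrt[8]{3^2 \times 4^3 \times 5^1 \times 6^2} = \sqrt[8]{103680} = 4.236$
The geometric mean for the given data is 4.236.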
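The geometric mean can be computed through logarithms, which is what the antilog form of the formula describes. The sketch below uses base-10 logarithms and covers both raw data (Example 3.8) and a frequency table (Example 3.9); the function name and the optional `freqs` parameter are illustrative choices, not part of the notes.

```python
import math


def geometric_mean(values, freqs=None):
    """Antilog of the (frequency-weighted) mean of the base-10 logarithms."""
    if freqs is None:
        freqs = [1] * len(values)  # raw data: every value occurs once
    if any(x <= 0 for x in values):
        raise ValueError("the logarithm form needs strictly positive values")
    n = sum(freqs)
    mean_log = sum(f * math.log10(x) for x, f in zip(values, freqs)) / n
    return 10 ** mean_log  # antilog


gm_a = geometric_mean([3, 12])                        # Example 3.8 (a)
gm_b = geometric_mean([2, 4, 8])                      # Example 3.8 (b)
gm_freq = geometric_mean([3, 4, 5, 6], [2, 3, 1, 2])  # Example 3.9

print(round(gm_a, 3), round(gm_b, 3), round(gm_freq, 3))
```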

Geometric mean for continuous grouped FD: The above formula can also be used when
the frequency distribution is grouped continuous; in that case the class marks of the class intervals are
taken as $x_i$.
Properties of geometric mean
• It is less affected by extreme values.
• It takes each and every observation into consideration.
• If the value of any one observation is zero, the geometric mean becomes zero.
e) Harmonic Mean
It is a suitable measure of central tendency when the data pertain to speed, rate and time. The
harmonic mean of n values is defined as n divided by the sum of their reciprocals.
Harmonic mean for individual series: If $x_1, x_2, \dots, x_n$ are n observations, then the harmonic
mean can be represented by the following formula:
$H.M. = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \dots + \frac{1}{x_n}} = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$
Example 3.10: A car travels 25 miles at 25 mph, 25 miles at 50 mph, and 25 miles at 75 mph.
Find the harmonic mean of the three velocities.
Solution
$H.M. = \frac{3}{\frac{1}{25} + \frac{1}{50} + \frac{1}{75}} = \frac{450}{11} = 40.9$ mph.
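A short Python sketch of Example 3.10 is given below. Besides applying the formula, it checks that the harmonic mean of the three speeds equals the overall average speed (total distance divided by total time), which is the reason the harmonic mean suits this kind of data. The names used are illustrative.

```python
def harmonic_mean(values):
    """n divided by the sum of the reciprocals of the values."""
    if any(x <= 0 for x in values):
        raise ValueError("harmonic mean is used here for positive values only")
    return len(values) / sum(1 / x for x in values)


# Example 3.10: three legs of 25 miles each at 25, 50 and 75 mph.
speeds_mph = [25, 50, 75]
leg_miles = 25

hm = harmonic_mean(speeds_mph)

# Cross-check: average speed = total distance / total travel time.
total_distance = leg_miles * len(speeds_mph)
total_time = sum(leg_miles / v for v in speeds_mph)
average_speed = total_distance / total_time

print(round(hm, 1), round(average_speed, 1))
```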
Harmonic mean for discrete data arranged in FD: If the data are arranged in the form of a
frequency distribution, then
$H.M. = \frac{n}{\sum_{i=1}^{k} \frac{f_i}{x_i}}$, where $n = \sum_{i=1}^{k} f_i$
Harmonic mean for continuous grouped FD: When the frequency distribution is
grouped continuous, the class marks of the class intervals are taken as $x_i$ and the above
formula can be used as
$H.M. = \frac{n}{\sum_{i=1}^{k} \frac{f_i}{x_i}}$, where $n = \sum_{i=1}^{k} f_i$ and
$x_i$ is the class mark of the ith class.
Properties of harmonic mean
• It is unique for a given set of data.
• It takes each and every observation into consideration.
• It is difficult to calculate and understand.
• It is an appropriate measure of central tendency in situations where the data are ratios, speeds or rates.
Relations among different means
i. If all the observations are positive we have the relationship among the three means given as:
$\bar{x} \ge GM \ge HM$
ii. For two observations, $GM = \sqrt{\bar{x} \times HM}$
iii. $\bar{x}$ = GM = HM if all observations are positive and have equal value.
3.3.2. Median
The median is, as its name indicates, the middlemost value in the arrangement, which divides the
data into two equal parts. It is obtained by arranging the data in increasing or decreasing order
of magnitude and is denoted by $\tilde{x}$.
i. Median for individual series: We arrange the sample in ascending order of the variable of
interest. Then the median is the middle value (if the sample size n is odd) or the average of the
two middle values (if the sample size n is even).
For individual series the median is obtained by
a/ $\tilde{x}$ = the $\left(\frac{n+1}{2}\right)$th value if n is odd, and
b/ $\tilde{x}$ = $\frac{1}{2}\left[\left(\frac{n}{2}\right)\text{th value} + \left(\frac{n}{2}+1\right)\text{th value}\right]$ if n is even
Example 3.11
Find the median for the following data.
a/ -5 15 10 5 0 2 1 4 6 and 8
b/ 5 2 2 3 1 8 4
Solution:
a) The data in ascending order is given by:
-5 0 1 2 4 5 6 8 10 15
n = 10, so n is even. The two middle values are the 5th and 6th observations. So the median is
$\tilde{x} = \frac{4 + 5}{2} = 4.5$.
b) The data in ascending order is given by:
1 2 2 3 4 5 8
n = 7, so n is odd. The middle value is the 4th observation. So the median is 3.
Note: The median is easy to calculate for small samples and is not affected by an "outlier".
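The two cases (n odd and n even) can be sketched in Python as below, using the data of Example 3.11. The function name `median` is illustrative; Python's standard library also provides `statistics.median`, which gives the same results for these inputs.

```python
def median(values):
    """Middle value of the sorted data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # n odd: the ((n + 1) / 2)th value, counting from 1.
        return ordered[mid]
    # n even: average of the (n / 2)th and (n / 2 + 1)th values.
    return (ordered[mid - 1] + ordered[mid]) / 2


median_a = median([-5, 15, 10, 5, 0, 2, 1, 4, 6, 8])  # n = 10, even
median_b = median([5, 2, 2, 3, 1, 8, 4])              # n = 7, odd
print(median_a, median_b)
```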
ii. Median for grouped continuous data: For continuous data, the median is obtained by the
following formula.
$\tilde{x} = L + \frac{w}{f_{med}}\left(\frac{n}{2} - CF\right)$
Where: L = the lower class boundary of the median class; w = the class width of the median
class; $f_{med}$ = the frequency of the median class; and
CF = the cumulative frequency corresponding to the class preceding the median class, that is, the sum of the
frequencies of all classes lower than the median class. The median class is the class which
contains the (n/2)th observation, whether n is odd or even, since the items lose their
individual identity once they are grouped into continuous classes.
Example 3.12: The water percentage in the body of a species of fish is given below. Calculate the
median.
C.I      15-24   25-34   35-44   45-54   55-64   Total
Freq.      7      17      16       6       4      50
Solution: Construct the less than cumulative frequency distribution, then:
C.I           15-24   25-34   35-44   45-54   55-64   Total
Freq.           7      17      16       6       4      50
Cuml. Freq.     7      24      40      46      50
Since n = 50, n/2 = 25, and the smallest CF greater than or equal to 25 is 40; thus, the median
class is the third class (35-44). For this class, L = 34.5, w = 10, $f_{med}$ = 16 and
CF = 24. Then applying the formula, we get:
$\tilde{x} = 34.5 + \frac{10}{16}(25 - 24) = 35.125 \approx 35.1$.
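The steps of Example 3.12 (accumulate frequencies, locate the class containing the (n/2)th observation, then interpolate) can be sketched as follows. The lower class boundaries 14.5, 24.5, ... are obtained by subtracting 0.5 from the lower class limits, consistent with L = 34.5 in the solution; the function name and argument layout are illustrative assumptions.

```python
def grouped_median(lower_boundaries, width, freqs):
    """Interpolated median: L + (n/2 - CF) * w / f for the median class."""
    n = sum(freqs)
    if n <= 0:
        raise ValueError("frequencies must sum to a positive number")
    half = n / 2
    cum_before = 0  # CF: cumulative frequency of the classes before this one
    for lower, f in zip(lower_boundaries, freqs):
        if cum_before + f >= half:
            # First class whose cumulative frequency reaches n/2.
            return lower + (half - cum_before) * width / f
        cum_before += f
    raise ValueError("median class not found")


# Example 3.12: classes 15-24, 25-34, 35-44, 45-54, 55-64.
lower_boundaries = [14.5, 24.5, 34.5, 44.5, 54.5]
freqs = [7, 17, 16, 6, 4]

water_median = grouped_median(lower_boundaries, 10, freqs)
print(water_median)
```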
Merits of median
• It is less affected by extreme values.
• Median can be calculated even in case of open-ended intervals.
• It can be computed for ratio, interval, and ordinal level of data.
Demerits of median
• Its value is not determined by each & every observation.
• It is not a good representative of the data if the number of items (data) is small.
• The arrangement of items in order of magnitude is sometimes very tedious process if the
number of items is very large.
3.3.3. The Mode or Modal value
The mode or the modal value is the value with the highest frequency and is denoted by $\hat{x}$. A data
set may not have a mode or may have more than one mode. A distribution is called a bimodal
distribution if it has two data values that appear with the greatest frequency. If a distribution has
more than two modes, then the distribution is multimodal. If a distribution has no modes, then
the distribution is non-modal.
i. Mode of individual series: The mode or the modal value of individual series (raw data) is
simply obtained by locating the observation with the maximum frequency.
Example 3.13
Consider the following data:
a. 30 45 69 70 32 18 32. The mode ($\hat{x}$) = 32.
b. 10 20 30 10 40 30. The modes ($\hat{x}$) = 10 and 30.
c. 10 40 30 20 50 60. No mode.
Note that in some samples there may be more than one mode or there may not be a mode. The
mode is not a suitable measure of central tendency in these cases. We use the mode as a measure
of central tendency if we require a measure that takes on one of the sample values. The mode can
also be used for variables that are measured on a categorical (nominal) scale, e.g. the most popular
computer type.
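Finding the mode of raw data amounts to counting how often each value occurs. The sketch below returns every value that attains the highest frequency, and returns an empty list when all values are different, which is the non-modal case described in these notes. The function name `modes` and that return convention are assumptions made for illustration.

```python
from collections import Counter


def modes(values):
    """All values with the highest frequency; [] if every value occurs once."""
    if not values:
        raise ValueError("the mode of an empty data set is undefined")
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # all values different: no mode
    return sorted(x for x, c in counts.items() if c == highest)


modes_a = modes([30, 45, 69, 70, 32, 18, 32])  # one mode
modes_b = modes([10, 20, 30, 10, 40, 30])      # bimodal
modes_c = modes([10, 40, 30, 20, 50, 60])      # no mode
print(modes_a, modes_b, modes_c)
```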
ii. Mode for Grouped Continuous Frequency Distribution
For grouped data, only the modal class, i.e. the class with the highest frequency, can be
determined directly. After locating this
class, the mode is interpolated using:
$\hat{x} = L + \frac{\Delta_1}{\Delta_1 + \Delta_2} w$, where L = the lower class boundary of the modal class;
$\Delta_1 = f_{mode} - f_1$; $\Delta_2 = f_{mode} - f_2$; w = the common class width; $f_1$ = frequency of the class immediately
preceding the modal class; $f_2$ = frequency of the class immediately succeeding the modal
class; and $f_{mode}$ = frequency of the modal class.
Example 3.14
Calculate the mode for the frequency distribution (water percentage) of the data in Example
3.12.
Solution: By inspection, the mode lies in the second class (25-34), where L = 24.5, $f_{mode}$ = 17, $f_1$ = 7,
$f_2$ = 16 and w = 10. Using the formula, the mode is:
$\hat{x} = 24.5 + \frac{(17-7)}{(17-7)+(17-16)} \times 10 = 24.5 + \frac{100}{11} = 33.59$.
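The interpolation of Example 3.14 can be written as a small function. This is a sketch only: the function and parameter names are illustrative, and it does not handle the case where the modal class has the same frequency as both of its neighbours (the denominator would then be zero).

```python
def grouped_mode(lower_boundary, width, f_mode, f_before, f_after):
    """Interpolated mode of the modal class: L + d1 / (d1 + d2) * w."""
    d1 = f_mode - f_before  # excess over the preceding class
    d2 = f_mode - f_after   # excess over the succeeding class
    return lower_boundary + d1 / (d1 + d2) * width


# Example 3.14: modal class 25-34 of the water-percentage data.
water_mode = grouped_mode(lower_boundary=24.5, width=10,
                          f_mode=17, f_before=7, f_after=16)
print(round(water_mode, 2))
```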
Merits of mode
• Mode is not affected by extreme values.
• We can change the size of the observations without changing the mode.
• It can be computed for all levels of data, i.e., ratio, interval, ordinal or nominal.
Demerits of mode
• It does not take every value into consideration.
• Mode may not exist in the series, and if it exists it may not be unique.
3.4. The Relationship of the Mean, Median and Mode
Comparing the Mean, Median, and the Mode
• If the data is skewed, avoid the mean.
• If there is a large gap around the middle, avoid the median.
• A measure is a resistant measure if its value is not affected by an outlier or an extreme data
value.
• The mean is not a resistant measure of central tendency because it is not resistant to the
influence of extreme data values or outliers.
• The median is resistant to the influence of extreme data values or outliers, and its value does
not respond strongly to changes in a few extreme data values, regardless of how large the
change may be.
• The mode has an advantage over both the mean and the median when the data is categorical,
since it is not possible to calculate the mean or median for this type of data. Also, the mode
usually indicates the location within a large distribution where the data values are concentrated.
However, the mode cannot always be calculated, because if a distribution has all different data
values, then the distribution is non-modal.
• In the case of a symmetrical distribution, the mean, median and mode coincide. That is,
mean = median = mode. However, for a moderately asymmetrical (non-symmetrical) distribution,
the mean and mode lie on the two ends and the median lies between them, and they have the following
important empirical relationship:
(Mean – Mode) = 3(Mean – Median).
Example 3.15
In a moderately asymmetrical distribution, the mean and the mode are 30 and 42 respectively.
What is the median of the distribution?
Solution:
Rearranging (Mean – Mode) = 3(Mean – Median) gives 3 Median = 2 Mean + Mode, so
Median = (2 Mean + Mode)/3 = (2*30 + 42)/3 = 34
Hence the median of the distribution is 34.
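The rearrangement used in Example 3.15 can be checked with a couple of lines of Python. This is a sketch under the stated empirical relationship, which holds only approximately and only for moderately asymmetrical distributions; the function name is illustrative.

```python
def median_from_mean_and_mode(mean, mode):
    """Solve (mean - mode) = 3 * (mean - median) for the median."""
    return (2 * mean + mode) / 3


mean_value, mode_value = 30, 42  # Example 3.15
median_estimate = median_from_mean_and_mode(mean_value, mode_value)

# Substitute back into the empirical relationship as a check.
lhs = mean_value - mode_value
rhs = 3 * (mean_value - median_estimate)
print(median_estimate, lhs, rhs)
```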
Which of the Three Measures is the Best?
At this stage, one may ask which of these three measures of central tendency is the best.
There is no simple answer to this question, because the three measures are based upon
different concepts. The arithmetic mean is the sum of the values divided by the total number of
observations in the series. The median is the value of the middle observation, and the mode is
the value around which the observations tend to concentrate. As such, the use of a particular
measure will largely depend on the purpose of the
study and the nature of the data. For example, when we are interested in knowing the consumers'
preferences for different brands of television sets or kinds of advertising, the choice should go in
favor of the mode. The use of the mean and median would not be proper. However, the median can
sometimes be used in the case of qualitative data when such data can be arranged in ascending
or descending order. Let us take another example. Suppose we invite applications for a certain
vacancy in our University. A large number of candidates apply for that post. We are now
interested to know which age or age group has the largest concentration of applicants. Here,
obviously, the mode will be the most appropriate choice. The arithmetic mean may not be
appropriate as it may be influenced by some extreme values.
CHAPTER - FOUR
MEASURE OF DISPERSION (VARIATION)
Defn: Dispersion or variation is a value obtained from the differences among the numbers in a data
set; it measures how far the values are spread out from one another or from their average.
4.1. Objectives of Measuring variation or Dispersion
To judge the reliability of measure of central tendency,
To compare two or more groups of numbers in terms of their variability, and
To further statistical analysis.
4.2. Absolute or Relative Measures
• Absolute Measures of Dispersion: The measures of dispersion which are expressed in terms
of the original unit of a series are termed absolute measures. Such measures are not suitable
for comparing the variability of two distributions which are expressed in different units of
measurement or have different average sizes.
• Relative Measures of Dispersion: Relative measures of dispersion are a ratio or percentage of
a measure of absolute dispersion to an appropriate measure of central tendency and are thus pure
numbers independent of the units of measurement. For comparing the variability of two
distributions (even if they are measured in the same unit), we compute the relative measure of
dispersion instead of the absolute measure of dispersion.
4.3. Types of Measure of Dispersion
There are various measures of dispersion, of which the most commonly used are:
1. Range (R) and Relative Range (RR)
2. Variance ($s^2$), Standard Deviation (s) and Coefficient of Variation (CV).
3. The Standard Score
I) Range (R) and Relative Range (RR):
a) Range (R): is the largest score minus the smallest score. It is a quick and dirty measure of
variability, although when a test is given back to students they very often wish to know the range
of scores, i.e. $R = x_{max} - x_{min}$.
• Range for raw data: $R = x_{max} - x_{min}$
• Range for grouped data: If data are given in the form of a continuous frequency distribution,
the range is computed as: $R = UCL_k - LCL_1$, where $UCL_k$ is the upper class limit of the last class and $LCL_1$ is the lower class limit of the first class.
Merits and Demerits of range
Merits:
• It is rigidly defined.
• It is easy to calculate and simple to understand.
Demerits:
• It is not based on all observations.
• It is highly affected by extreme observations.
• It is affected by fluctuation in sampling.
• It is not liable to further algebraic treatment.
• It cannot be computed in the case of open-ended distributions.
• It is very sensitive to the size of the sample.
b) Relative Range (RR): is also sometimes called the coefficient of range and is given by:
$RR = \frac{x_{max} - x_{min}}{x_{max} + x_{min}} = \frac{R}{x_{max} + x_{min}}$
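II) Variance ($s^2$), Standard Deviation (s) and Coefficient of Variation (CV)
Variance: is the "average squared deviation from the mean".
• Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$, where i = 1, 2, 3, ..., N; μ is the population mean, N is the
population size, and $x_i$ is an individual value.
For the case of a frequency distribution it is expressed as: $\sigma^2 = \frac{\sum_{i=1}^{k} f_i (x_i - \mu)^2}{N}$, where i = 1, 2, ..., k and $f_i$
is the frequency of $x_i$.
• Sample variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$, i = 1, 2, 3, ..., n
• For the case of a frequency distribution it is expressed as:
$s^2 = \frac{1}{n-1} \sum_{i=1}^{k} f_i (x_i - \bar{x})^2$, i = 1, 2, 3, ..., k, where $n = \sum f_i$
Short-cut formula:
$s^2 = \frac{1}{n-1}\left(\sum x_i^2 - n\bar{x}^2\right)$ for raw data, $s^2 = \frac{1}{n-1}\left(\sum f_i x_i^2 - n\bar{x}^2\right)$ for a frequency distribution.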
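Standard Deviation
• There is a problem with the variance: the deviations were squared, which means the units were also squared.
• To get the units back to those of the original data values, the square root must be taken.
• The standard deviation is therefore the positive square root of the variance: $s = \sqrt{s^2}$ for a sample and $\sigma = \sqrt{\sigma^2}$ for a population.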
Examples: Find the variances and standard deviations of the following sample data.
1. 5, 17, 12, 10
2. The data given in the form of the frequency distribution below.
Class    Frequency
40-44        7
45-49       10
50-54       22
55-59       15
60-64       12
65-69        6
70-74        3
Solutions:
1. $\bar{x} = 11$
xi                    5   10   12   17   Total
$(x_i - \bar{x})^2$   36    1    1   36     74
$s^2 = \frac{1}{n-1} \sum (x_i - \bar{x})^2 = \frac{74}{3} = 24.67$ ⇒ $s = \sqrt{s^2} = \sqrt{24.67} = 4.97$
2. $\bar{x} = 55$ and $n = \sum f_i = 75$
xi (C.M)                  42    47    52    57    62    67    72   Total
$f_i (x_i - \bar{x})^2$  1183   640   198    60   588   864   867    4400
$s^2 = \frac{1}{n-1} \sum f_i (x_i - \bar{x})^2 = \frac{4400}{74} = 59.46$ ⇒ $s = \sqrt{59.46} = 7.71$
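Both variance examples can be reproduced with the sketch below, which uses the divisor n - 1 as in the worked solutions (74/3 and 4400/74). The raw-data result is also cross-checked against `statistics.variance` from the Python standard library, which uses the same n - 1 divisor. Variable names are illustrative.

```python
import math
import statistics

# Raw data: 5, 17, 12, 10.
raw = [5, 17, 12, 10]
n = len(raw)
x_bar = sum(raw) / n
s2_raw = sum((x - x_bar) ** 2 for x in raw) / (n - 1)             # definition
s2_shortcut = (sum(x * x for x in raw) - n * x_bar ** 2) / (n - 1)  # short-cut
s_raw = math.sqrt(s2_raw)

# Grouped data: class marks of 40-44, 45-49, ..., 70-74 and their frequencies.
class_marks = [42, 47, 52, 57, 62, 67, 72]
freqs = [7, 10, 22, 15, 12, 6, 3]
n_g = sum(freqs)
mean_g = sum(f * x for f, x in zip(freqs, class_marks)) / n_g
s2_g = sum(f * (x - mean_g) ** 2 for f, x in zip(freqs, class_marks)) / (n_g - 1)
s_g = math.sqrt(s2_g)

print(round(s2_raw, 2), round(s_raw, 2))
print(n_g, mean_g, round(s2_g, 2), round(s_g, 2))
```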
Coefficient of Variation (CV)
• Is defined as the ratio of the standard deviation to the mean, usually expressed as a percentage.
• $CV = \frac{s}{\bar{x}} \times 100\%$
Example: An analysis of the monthly wages paid (in birr) to workers in two firms A and B belonging
to the same industry gives the following results:
Value          Firm A   Firm B
Mean wage       52.5     47.5
Median wage     50.5     45.5
Variance        100      121
In which firm, A or B, is there greater variability in individual wages?
Solution: Calculate the coefficient of variation for both firms.
• $CV_A = \frac{s_A}{\bar{x}_A} \times 100\% = \frac{10}{52.5} \times 100\% = 19.05\%$, and $CV_B = \frac{s_B}{\bar{x}_B} \times 100\% = \frac{11}{47.5} \times 100\% = 23.16\%$
• Since $CV_A < CV_B$, there is greater variability in individual wages in firm B.
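The comparison of the two firms can be sketched in Python as follows. The standard deviations are obtained as the square roots of the given variances (100 and 121); the function name `coefficient_of_variation` is illustrative.

```python
import math


def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return std_dev / mean * 100


# Firm A: mean wage 52.5, variance 100. Firm B: mean wage 47.5, variance 121.
cv_a = coefficient_of_variation(math.sqrt(100), 52.5)
cv_b = coefficient_of_variation(math.sqrt(121), 47.5)

more_variable_firm = "A" if cv_a > cv_b else "B"
print(round(cv_a, 2), round(cv_b, 2), more_variable_firm)
```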
III) Standard Score (Z-Scores)
A standard score for a sample value in a data set is obtained by subtracting the mean of the data set from the
value and dividing the result by the standard deviation of the data set. Basically, the standard
score (z-score) tells us how many standard deviations a specific value is above or below the
mean value of the data set. That is, the z-score is the number of standard deviations the data
value falls above (positive z-score) or below (negative z-score) the mean for the data set.
Z-score computed from the population: $Z = \frac{x - \mu}{\sigma}$
Z-score computed from the sample: $Z = \frac{x - \bar{x}}{s}$
Example 4.11: What is the Z-score for the value of 14 in the following sample data set?
3 8 6 14 4 12 7 10
Solution:
$\bar{x}$ = 8, s = 3.8173; thus, $Z = \frac{x - \bar{x}}{s} = \frac{14 - 8}{3.8173} = 1.57$
• The data value of 14 is located 1.57 standard deviations above the mean 8, because the z-score
is positive.
Example 4.12: Suppose that a student scored 66 in Statistics and 80 in Biology. A summary of the
scores in the two courses is given below.
Course       Average score   Standard deviation of the score
Statistics        51                    12
Biology           72                    16
In which course did the student score better as compared to his classmates?
Solution:
Z-score of the student in Statistics: $Z = \frac{66 - 51}{12} = 1.25$
Z-score of the student in Biology: $Z = \frac{80 - 72}{16} = 0.5$
From these two standard scores, we can conclude that the student has scored better in the
Statistics course relative to his classmates than in the Biology course.
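Examples 4.11 and 4.12 can be reproduced with the sketch below. It uses `statistics.mean` and `statistics.stdev` from the Python standard library; `stdev` is the sample standard deviation with divisor n - 1, matching s = 3.8173 in Example 4.11. The function name `z_score` is illustrative.

```python
import statistics


def z_score(value, mean, std_dev):
    """Number of standard deviations the value lies above or below the mean."""
    return (value - mean) / std_dev


# Example 4.11: z-score of the value 14 within the sample.
sample = [3, 8, 6, 14, 4, 12, 7, 10]
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)  # sample standard deviation (divisor n - 1)
z_14 = z_score(14, x_bar, s)

# Example 4.12: compare the student's relative standing in the two courses.
z_statistics = z_score(66, 51, 12)
z_biology = z_score(80, 72, 16)
better_course = "Statistics" if z_statistics > z_biology else "Biology"

print(x_bar, round(s, 4), round(z_14, 2))
print(z_statistics, z_biology, better_course)
```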

Adjusted by Sibex A. LECTURE NOTES Page 5
