
DESCRIPTIVE STATISTICS

A PROJECT
Submitted in partial fulfillment for the degree of

BACHELOR OF SCIENCE
IN
STATISTICS
SUBMITTED BY
VIDUSHI RASTOGI
ROLL NO : 2210404010657
UNDER THE GUIDANCE OF :
Ms. Saumya Tiwari
DEPARTMENT OF STATISTICS
SHRI JAI NARAYAN DEGREE COLLEGE
ACKNOWLEDGEMENT

I would like to express my deepest gratitude to all those who have supported me throughout the
project on the topic “DESCRIPTIVE STATISTICS”.

First and foremost, I would like to thank my supervisor, Ms. Saumya Tiwari, for her
invaluable guidance, expertise, and encouragement. Her insightful advice and constant support
helped me stay focused and motivated.

I would also like to extend my gratitude to all the faculty members of the DEPARTMENT
OF STATISTICS, SHRI JAI NARAYAN DEGREE COLLEGE.

Lastly, I would like to express my gratitude towards my family for their support and guidance.

VIDUSHI RASTOGI

ROLL NO. : 2210404010657


CERTIFICATE

This is to certify that the term paper entitled “DESCRIPTIVE STATISTICS”, which is
submitted by a B.Sc. Semester 5 student, has been carried out as per the requirements given by the
DEPARTMENT OF STATISTICS, SHRI JAI NARAYAN DEGREE COLLEGE.

This work is a review of literature and has been done by the student herself. To the best of
my knowledge, she has fulfilled the conditions for the submission of this term paper.

Department of Statistics
TABLE OF CONTENTS

❖ Introduction
❖ Frequency distribution

1. Types of frequency distribution


2. Arrangement of data
3. Graphical representation of data

❖ Measures of central tendency

1. Arithmetic Mean
2. Harmonic Mean
3. Geometric Mean
4. Median
5. Mode

❖ Partition values

1. Quartiles
2. Deciles
3. Percentiles

❖ Dispersion

1. Absolute measure of dispersion


2. Relative measure of dispersion

❖ Moments

1. Moment about arbitrary point


2. Moment about origin
3. Moment about mean
4. Absolute moment
5. Sheppard’s correction of moments
6. Pearson’s β and γ coefficient

❖ Skewness
❖ Kurtosis
Introduction

Descriptive statistics are brief descriptive coefficients that summarize a given data set,
which can be either a representation of the entire population or a sample of a
population. Descriptive statistics are broken down into measures of central tendency
and measures of variability (spread). Measures of central tendency include the
mean, median and mode. Measures of variability describe the dispersion of the data set
(range, variance, standard deviation), while related summaries such as quartiles, skewness
and kurtosis describe its position and shape. Measures of frequency distribution describe
the occurrence of data within the data set (count). People use descriptive statistics to
repurpose hard-to-understand quantitative insights across a large data set into bite-sized
descriptions. In descriptive statistics, univariate analysis examines only one variable:
it is used to identify the characteristics of a single trait and is not used to analyse any
relationships or causation. Bivariate analysis, on the other hand, attempts to link two
variables by searching for correlation.

FREQUENCY

The frequency of a value is the number of times it occurred in a data set. A frequency distribution
is the pattern of frequencies of a variable. It's the number of times each possible value of a
variable occurs in a data set.

Types of frequency distribution


There are five main types of frequency distribution :

1. Ungrouped frequency distribution : The number of observations of each value of a variable.

2. Grouped frequency distribution : The number of observations falling in each class interval of a
variable; class intervals are ordered groupings of a variable's values.

3. Relative frequency distribution : The proportion of observations of each value or class
interval of the variable.

4. Cumulative frequency distribution : The sum of the frequencies less than or equal to each
value or class interval of a variable.

5. Open end frequency distribution : An open end frequency distribution is one which has at
least one of its ends open; either the lower limit of the first class or the upper limit of the last
class or both are not specified.
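As a small illustration, the ungrouped, relative and cumulative distributions described above can be built in a few lines of Python. This is a minimal sketch; the data values (number of children per family) are hypothetical.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical sample: number of children in 12 households
data = [2, 1, 3, 2, 2, 1, 4, 3, 2, 1, 2, 3]

values = sorted(set(data))                       # distinct values of the variable
freq = Counter(data)                             # ungrouped frequency distribution
rel = {v: freq[v] / len(data) for v in values}   # relative frequency distribution
cum = dict(zip(values, accumulate(freq[v] for v in values)))  # cumulative ("less than or equal")

print(dict(freq))  # {2: 5, 1: 3, 3: 3, 4: 1}
print(cum)         # {1: 3, 2: 8, 3: 11, 4: 12}
```

The cumulative frequency of the largest value always equals the total number of observations.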

ARRANGEMENT OF DATA
There are different ways of arranging raw data. After collection, the data is arranged in
mainly two possible ways :

1. Simple array : The simple array is one of the simplest ways to present data. It is an
arrangement of the given raw data in ascending or descending order; in ascending order the
scores are arranged in increasing order of their magnitude. A simple array has several
advantages as well as disadvantages over raw data. Using a simple array, we can easily
point out the lowest and highest values in the data, and the entire data can easily be
divided into different sections. Repetition of values can easily be checked, and the
distance between succeeding values can be observed at first look. But sometimes a data
array is not very helpful, because it lists every observation; it is cumbersome for
displaying large quantities of data.

2. Grouped frequency distribution : The data arranged in the form of groups can follow
either a discrete distribution or a continuous distribution. These distributions are described
below :

Discrete frequency distribution : Here the different observations are not written out as in a
simple array; instead we count the number of times each observation appears, which is known as
its frequency. The literal meaning of frequency is the number of occurrences of a particular
event or score in a set of samples. A frequency distribution is a table that organizes data into
classes, i.e., into groups of values describing one characteristic of the data.

For example : frequency distribution of number of persons and their wages per month

Continuous frequency distribution : To prepare a grouped frequency distribution, we first
find the range of the given data, i.e., the difference between the highest and lowest scores.
In spite of the great importance of classification in statistical analysis, no hard and fast
rules can be laid down for it. The following points may be kept in mind for classification :

1. The classes should be clearly defined and should not lead to any ambiguity.
2. The classes should be exhaustive.
3. The classes should be mutually exclusive and non-overlapping.
4. The classes should be of equal width.
5. Indeterminate classes should be avoided as far as possible.
6. The number of classes should neither be too large nor too small. The following formula,
due to Sturges, may be used to determine an approximate number of classes k :

k = 1 + 3.322 log10 N , where N is the total frequency.

The magnitude of the class interval = (largest observation − smallest observation) / (number of classes)
Class limits

Class limits should be chosen in such a way that the mid-value of the class interval and the
actual average of the observations in that class interval are as near to each other as possible.
If this is not the case, the distribution gives a distorted picture of the characteristics of the
data.

Exclusive classes : In this method of class formation, the classes are so formed that the upper
limit of one class also becomes the lower limit of the next class. The exclusive method of
classification ensures continuity between two successive classes.
Representation of data classification in exclusive classes

Class interval    Frequency
0 – 10                 2
10 – 20               10
20 – 30               11
30 – 40                6
40 – 50                1
Total                 30

Inclusive classes : In this method, the classification includes scores which are equal to the
upper limit of the class. The inclusive method is preferred when measurements are given in
whole numbers.
Representation of data classification in inclusive classes

Class interval    Frequency
0 – 9                  2
10 – 19               10
20 – 29               11
30 – 39                6
40 – 49                1
Total                 30

True or actual classes : In the inclusive method the upper class limit of a class is not equal to
the lower class limit of the next class, so there is no continuity between the classes. However,
many statistical measures require continuous classes. Therefore we subtract 0.5 from the lower
limit and add 0.5 to the upper limit of each class.

Representation of classification in actual classes


Class interval    Frequency
−0.5 – 9.5             2
9.5 – 19.5            10
19.5 – 29.5           11
29.5 – 39.5            6
39.5 – 49.5            1
Total                 30
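The 0.5 adjustment that converts inclusive limits to true class boundaries is mechanical, so it can be sketched directly (the classes are the ones used in the inclusive-class example above).

```python
# Convert inclusive class limits to true (actual) class boundaries:
# subtract 0.5 from each lower limit and add 0.5 to each upper limit
inclusive = [(0, 9), (10, 19), (20, 29), (30, 39), (40, 49)]

true_classes = [(lo - 0.5, hi + 0.5) for lo, hi in inclusive]
print(true_classes)  # [(-0.5, 9.5), (9.5, 19.5), (19.5, 29.5), (29.5, 39.5), (39.5, 49.5)]
```

After the adjustment, the upper boundary of each class coincides with the lower boundary of the next, restoring continuity.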

GRAPHICAL REPRESENTATION OF DATA


HISTOGRAM
It is one of the most popular methods for presenting a continuous frequency distribution in the
form of a graph. In this type of distribution the upper limit of a class is the lower limit of the
following class. The histogram consists of a series of rectangles, each with width equal to the
class interval of the variable on the horizontal axis and height equal to the corresponding
frequency on the vertical axis.

For example

Class interval Frequency


10 - 20 12
20 - 30 10
30 - 40 35
40 - 50 55
50 - 60 45
60 - 70 25
70 - 80 18

HISTOGRAM OF THE ABOVE DATA :


CUMULATIVE FREQUENCY CURVE OR OGIVE
The graph of a cumulative frequency distribution is known as a cumulative frequency curve or
ogive. Since there are two types of cumulative frequency distribution, viz. “less than” and
“more than” cumulative frequencies, we can have two types of ogives.

i) ‘Less than’ Ogive : In the ‘less than’ ogive, the less than cumulative frequencies are
plotted against the upper class boundaries of the respective classes. It is an increasing
curve, sloping upwards from left to right.
ii) ‘More than’ Ogive : In the ‘more than’ ogive, the more than cumulative frequencies are
plotted against the lower class boundaries of the respective classes. It is a decreasing
curve, sloping downwards from left to right.

For example

Class interval Frequency Less than C.F. More than C.F.


10 – 20 12 12 200
20 – 30 10 22 188
30 – 40 35 57 178
40 – 50 55 112 143
50 – 60 45 157 88
60 – 70 25 182 43
70 – 80 18 200 18

OGIVE OF THE ABOVE DATA
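The two cumulative-frequency columns in the table above can be computed programmatically; this sketch reproduces them from the class frequencies.

```python
from itertools import accumulate

# Class intervals and frequencies from the ogive example above
classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70), (70, 80)]
freq = [12, 10, 35, 55, 45, 25, 18]
N = sum(freq)  # 200

less_than = list(accumulate(freq))                 # plotted against upper boundaries
more_than = [N - c for c in [0] + less_than[:-1]]  # plotted against lower boundaries

print(less_than)  # [12, 22, 57, 112, 157, 182, 200]
print(more_than)  # [200, 188, 178, 143, 88, 43, 18]
```

Note that the first "more than" value is always N and the last "less than" value is always N, so the two ogives cross near the median.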

MEASURES OF CENTRAL TENDENCY


According to Professor Bowley, averages are “statistical constants which enable us to
comprehend in a single effort the significance of the whole.” They give us an idea about the
concentration of the values in the central part of the distribution. As the sample size grows,
the data tend to cluster around a central point; this property of data is called central
tendency, and the values that locate this central point are called measures of central tendency.
Central tendencies in statistics are the numerical values used to represent the mid-value or
central value of a large collection of numerical data. A central or average value of any
statistical data or series is the value of the variable that is representative of the entire
data or its associated frequency distribution. Such a value is of great significance because it
depicts the nature and characteristics of the entire data, which are otherwise very difficult to
observe. These measures can also be termed the central point of the data, around which the data
is scattered (or concentrated). There are five main measures of central tendency :
• ARITHMETIC MEAN
• GEOMETRIC MEAN
• HARMONIC MEAN
• MEDIAN
• MODE

Condition for a measure to be an ideal measure of central tendency

According to Professor YULE , the following are the characteristics to be satisfied by an ideal
measure of central tendency:

1. It should be rigidly defined.


2. It should be readily comprehensible and easy to calculate.
3. It should be based on all observations.
4. It should be suitable for further mathematical treatment, by this we mean that if we are
given the averages and sizes of a number of series we should be able to calculate the
average of the composite series obtained on combining the given series.
5. It should be affected as little as possible by fluctuation of sampling.
6. It should not be affected much by extreme values.

ARITHMETIC MEAN
Arithmetic mean of a set of observations is their sum divided by the number of
observations .Arithmetic mean is often referred to as the mean or arithmetic average. It is
calculated by adding all the numbers in a given data set and then dividing it by the total number
of items within that set. The arithmetic mean (AM) for evenly distributed numbers is equal to the
middlemost number. Further, the AM is calculated using numerous methods, which is based on
the amount of the data, and the distribution of the data.

The general formula for calculating the arithmetic mean is

Mean = (sum of observations) / (number of observations)

This formula is used to calculate the mean of an individual series, i.e., ungrouped data.

For grouped data (discrete or continuous or discontinuous), let us assume there are n
observations x1, x2, …, xn having frequencies f1, f2, …, fn respectively. The arithmetic mean
can then be calculated using the formula

Mean (x̄) = (∑ fi xi) / N ; where i = 1, 2, …, n and N = ∑fi

This formula is known as the direct method for calculating the arithmetic mean, and is the
simplest method of calculating the mean.

In the case of (continuous or discontinuous) grouped frequency distribution data, the
computation of the arithmetic mean can be reduced to a great extent by taking the deviation of
each value from an arbitrary point A, i.e., di = xi − A, where di is the deviation from A. Then

Mean (x̄) = A + (∑ fi di) / N ; where i = 1, 2, …, n

This formula is known as the shortcut method for calculating the arithmetic mean; it gives the
same result as the direct method while simplifying the arithmetic.

The computation is reduced to a still greater extent by taking di = (xi − A) / h, where A is an
arbitrary point and h is the width of the class. Then

Mean (x̄) = A + h × (∑ fi di) / N ; where i = 1, 2, …, n

This formula is known as the step deviation method for calculating the arithmetic mean; it also
gives the same result, with the least computational effort.

PROPERTIES OF ARITHMETIC MEAN


1. The algebraic sum of the deviations of a set of values from their arithmetic mean is zero.
If xi are observations having frequencies fi, where i = 1, 2, …, n, then
∑ fi (xi − x̄) = 0 ; x̄ being the mean of the distribution.

2. The sum of the squared deviations of a set of values is minimum when the deviations are
taken about the mean.

3. The mean of a constant is the constant itself.

4. The mean is affected by a change of origin.

5. The mean is affected by a change of scale.

6. If x̄i are the means of k component series of sizes ni (i = 1, 2, …, k) respectively, then
the mean x̄ of the composite series obtained on combining the component series is given by the
formula :

x̄ = (∑ ni x̄i) / (∑ ni) ; i = 1, 2, …, k
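Property 6 is easy to check numerically; the sketch below combines two hypothetical class sections of known sizes and means.

```python
# Combined mean of k component series (property 6).
# Two hypothetical sections: 40 students averaging 55, 60 students averaging 65.
sizes = [40, 60]      # n_i
means = [55.0, 65.0]  # x̄_i

combined = sum(n * m for n, m in zip(sizes, means)) / sum(sizes)
print(combined)  # 61.0
```

Note the combined mean is a weighted average, so it lies closer to the mean of the larger section.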

ADVANTAGES OF ARITHMETIC MEAN

1.It is rigidly defined.

2.It is easy to understand and easy to calculate .

3. It is amenable to algebraic treatment : the mean of a composite series can be expressed in
terms of the means and the sizes of the component series.

4. The arithmetic mean is least affected by fluctuations of sampling .

DISADVANTAGES OF ARITHMETIC MEAN

1. It can neither be determined by inspection nor located graphically.

2. Arithmetic mean cannot be used if we are dealing with qualitative characteristics.

3. Arithmetic mean cannot be calculated if a single observation is missing or lost .

4. Arithmetic mean is affected very much by extreme values and cannot be calculated for open
ended classes.

5. In extremely asymmetrical (skewed) distributions, the arithmetic mean is usually not a
suitable measure of location.

GEOMETRIC MEAN
Geometric mean of a set of n observations is the nth root of their product. In statistics,
the geometric mean is the average value or mean which signifies the central tendency of a set
of numbers by finding the product of their values. Basically, we multiply the numbers together
and take the nth root of the product, where n is the total number of data values.

Thus the geometric mean G of n observations xi, i = 1, 2, …, n is given by

G = (x1 x2 … xn)^(1/n)

The computation is facilitated by the use of logarithms :

G = Antilog( (1/n) ∑ log xi )

This formula is used for calculating the geometric mean of an individual series.

In the case of a frequency distribution xi | fi (i = 1, 2, …, n), the geometric mean G is :

G = (x1^f1 · x2^f2 · … · xn^fn)^(1/N) ; where N = ∑fi

The computation is facilitated by the use of logarithms :

G = Antilog( (1/N) ∑ fi log xi )

This formula is used for (discrete or continuous or discontinuous) grouped data. In the case of
a (continuous or discontinuous) frequency distribution, x is taken to be the value corresponding
to the mid-point of the class interval.
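The logarithmic route to the geometric mean is sketched below for both the individual and the weighted (frequency) form; the data values are hypothetical.

```python
import math

# Geometric mean of an individual series via logarithms
x = [2, 4, 8]
G = math.exp(sum(math.log(v) for v in x) / len(x))
print(round(G, 6))  # 4.0  (cube root of 2*4*8 = 64)

# Weighted form for a frequency distribution x_i | f_i
xi = [1, 2, 4]
fi = [1, 2, 1]
N = sum(fi)
G_w = math.exp(sum(f * math.log(v) for f, v in zip(fi, xi)) / N)
print(round(G_w, 6))  # 2.0  (fourth root of 1 * 2^2 * 4 = 16)
```

Using natural logs and `math.exp` is equivalent to the antilog-of-mean-log formula above; base-10 logs with `10**(...)` would give the same result.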

PROPERTIES OF GEOMETRIC MEAN


1. The G.M. of a given data set is never greater than the arithmetic mean of that data set
(they are equal only when all the observations are equal).

2. If each object in the data set is substituted by the G.M., the product of the objects remains
unchanged.

3. The geometric mean of the ratios of corresponding observations in two series is equal to the
ratio of their geometric means.

4. The geometric mean of the products of corresponding observations in two series is equal to
the product of their geometric means.

ADVANTAGES OF GEOMETRIC MEAN

1.It is rigidly defined .

2.It is based on all observations .

3.It is suitable for further mathematical treatment .

4.It is not affected by fluctuation of sampling

5.It gives comparatively more weight to small items.


DISADVANTAGES OF GEOMETRIC MEAN

1. The geometric mean is not easy to understand for people who are not mathematically inclined,
since it involves logarithmic operations.
2. The geometric mean is difficult to calculate, because it involves finding the root of the
product of the values.
3. The geometric mean cannot be calculated if any value in a series is zero or if the number of
negative values is odd.
4. The geometric mean may not equal any actual value of the series.
5. It is mainly useful for series in geometric progression, such as rates of growth.
6. It is time-consuming to compute.

HARMONIC MEAN
The Harmonic Mean (H) of a number of observations, none of which is zero, is the reciprocal
of the arithmetic mean of the reciprocals of the given values. The harmonic mean gives less
weight to large values and more weight to small values, so as to balance the values
appropriately. In general, the harmonic mean is used when greater weight must be given to the
smaller items. It is applied in the case of times and average rates, and is an appropriate
measure for ratios and rates. By contrast, the arithmetic mean places a high weight on large
data points, and the geometric mean a lower weight on the smaller data points.

The harmonic mean (H) of n observations xi, i = 1, 2, …, n is given by :

H = n / ∑(1/xi)

In the case of a frequency distribution xi | fi (i = 1, 2, …, n) :

H = N / ∑(fi/xi) ; where N = ∑fi
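The harmonic mean's role for rates is easiest to see with the classic average-speed example; the sketch below also shows the weighted (frequency) form. Values are hypothetical.

```python
# Travelling equal distances at 30 km/h and 60 km/h gives an
# average speed of H = 40 km/h, not the arithmetic mean 45.
speeds = [30, 60]
H = len(speeds) / sum(1 / v for v in speeds)
print(round(H, 6))  # 40.0

# Weighted form for a frequency distribution x_i | f_i
xi = [2, 4]
fi = [1, 1]
H_w = sum(fi) / sum(f / v for f, v in zip(fi, xi))
print(round(H_w, 6))  # 2.666667
```

The arithmetic mean of speeds would be correct only if the two speeds were held for equal *times*, not equal distances; this is exactly the "rates" situation the text describes.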

PROPERTIES OF HARMONIC MEAN

1. If all the observations take a common value, say c, then their harmonic mean is also c.

2. The harmonic mean can also be evaluated for a series containing negative values.

3. If any value of a given series is 0, its harmonic mean cannot be determined, as the
reciprocal of 0 does not exist.

4. If in a given series the values are neither all equal nor any of them zero, the harmonic
mean is less than both the geometric mean and the arithmetic mean.
ADVANTAGES OF HARMONIC MEAN

1.Harmonic mean is rigidly defined


2. It is based on all the observations.

3. It is suitable for further mathematical treatment.

4. It is not affected by fluctuation of sampling.


5. It gives greater importance to small items and is useful only when small items have to be
given greater weightage .

DISADVANTAGES OF HARMONIC MEAN

1.The harmonic mean is greatly affected by the values of the extreme items

2. It cannot be calculated if any of the items is zero.

3.The calculation of the harmonic mean is cumbersome, as it involves the calculation using the
reciprocals of the number.

RELATIONSHIP BETWEEN ARITHMETIC MEAN (AM), GEOMETRIC MEAN (GM) AND HARMONIC MEAN (HM)

For two positive observations, the three means are related by

AM × HM = GM²

In general, AM ≥ GM ≥ HM, with equality only when all the observations are equal.
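Both relations can be checked numerically for a pair of positive numbers; the values 4 and 16 below are hypothetical.

```python
import math

# For exactly two observations, AM * HM = GM^2; in general AM >= GM >= HM.
a, b = 4, 16
AM = (a + b) / 2          # 10.0
GM = math.sqrt(a * b)     # 8.0
HM = 2 / (1 / a + 1 / b)  # 6.4

print(math.isclose(AM * HM, GM ** 2))  # True
print(AM >= GM >= HM)                  # True
```

The identity AM × HM = GM² holds only for two observations; the inequality chain holds for any number of positive observations.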

MEDIAN
Median of a distribution is the value of the variable which divides it into two equal parts. It
is the value which exceeds, and is exceeded by, the same number of observations; that is, the
number of observations above it equals the number of observations below it. The median is thus a
positional average.
In the case of ungrouped data (an individual series), if the number of observations is odd, the
median is the middle value after the values have been arranged in ascending or descending order
of magnitude. In the case of an even number of observations, there are two middle terms, and the
median is obtained by taking the arithmetic mean of the two middle terms.

In the case of a discrete frequency distribution, the median is obtained by considering the
cumulative frequency. The steps for calculating the median are :

1. Find N/2, where N = ∑fi.

2. See the (less than) cumulative frequency (c.f.) just greater than N/2.

3. The corresponding value of x is the median.

In the case of a continuous frequency distribution, the class corresponding to the c.f. just
greater than N/2 is called the median class, and the value of the median is obtained by the
following formula :

Median = l + (h/f) (N/2 − c)

where l is the lower limit of the median class,

f is the frequency of the median class,

h is the magnitude (width) of the median class,

c is the c.f. of the class preceding the median class,

and N = ∑f.
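The interpolation formula above can be sketched directly; the classes and frequencies are the ones from the earlier exclusive-class example.

```python
from itertools import accumulate

# Median of a continuous frequency distribution:
# Median = l + (h/f) * (N/2 - c)
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
freq = [2, 10, 11, 6, 1]
N = sum(freq)  # 30

cf = list(accumulate(freq))                          # less-than cumulative frequencies
i = next(j for j, c in enumerate(cf) if c >= N / 2)  # median class index
l, h = classes[i][0], classes[i][1] - classes[i][0]
f = freq[i]
c = cf[i - 1] if i > 0 else 0

median = l + (h / f) * (N / 2 - c)
print(round(median, 4))  # 22.7273
```

Here N/2 = 15, the first cumulative frequency exceeding it is 23 (class 20–30), so the median is 20 + (10/11)(15 − 12).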

PROPERTIES OF THE MEDIAN

1.Median is the only average to be used while dealing with qualitative data which cannot be
measured quantitatively but still can be arranged in ascending or descending order of magnitude .

e.g. to find the average intelligence or average honesty among a group of people.

2. It is used for determining the typical value in problems concerning wages and the
distribution of wealth.

ADVANTAGES OF MEDIAN

1. It is rigidly defined

2. The median is not affected by very large or very small values, also known as outliers.
3. The median is easy to calculate and understand, can be located just by inspection.
4. The median can be used for ratio, interval, and ordinal scales.
5. The median can be used to compute frequency distribution with open-ended classes.

DISADVANTAGES OF MEDIAN
1. In the case of an even number of observations the median cannot be determined exactly; we
merely estimate it by taking the mean of the two middle terms.

2. It is not based on all the observations.

3. It is not amenable to algebraic treatment.

4. As compared with the mean, it is more affected by fluctuations of sampling.

MODE
Mode is the value which occurs most frequently in a set of observations and around which the
other items of the set cluster densely . In other words, mode is the value of the variable which is
predominant in the series.

Thus , in case of discrete frequency distribution , mode is the value of x corresponding to the
maximum frequency.

But in any one or more of the following cases :

1. if the maximum frequency is repeated,

2. if the maximum frequency occurs at the very beginning or at the end of the distribution, or

3. if there are irregularities in the distribution,

the value of the mode is determined by the method of grouping.

In the case of a continuous frequency distribution, the mode is given by the formula :

Mode = l + h (f1 − f0) / ((f1 − f0) + (f1 − f2))

i.e., Mode = l + h (f1 − f0) / (2f1 − f0 − f2)

where l is the lower limit of the modal class,

h is the magnitude (width) of the modal class,

f1 is the frequency of the modal class,

f0 is the frequency of the class preceding the modal class,

f2 is the frequency of the class succeeding the modal class.

The mode is the observation that occurs most frequently in a data set. Here are some terms
related to the mode of a data set:

Unimodal: A data set with only one value that occurs most often

Bimodal: A data set with two values that occur with the greatest frequency

Multimodal: A data set with more than two values that occur with the same greatest frequency
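A sketch of the grouped-data mode formula, using the same classes as the earlier examples. It assumes the modal class is an interior class (so that both neighbouring frequencies exist); boundary cases would need extra handling.

```python
# Mode of a continuous frequency distribution:
# Mode = l + h*(f1 - f0) / (2*f1 - f0 - f2)
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
freq = [2, 10, 11, 6, 1]

i = freq.index(max(freq))  # modal class = class of maximum frequency
l, h = classes[i][0], classes[i][1] - classes[i][0]
f1, f0, f2 = freq[i], freq[i - 1], freq[i + 1]

mode = l + h * (f1 - f0) / (2 * f1 - f0 - f2)
print(round(mode, 4))  # 21.6667
```

With modal class 20–30, f1 = 11, f0 = 10, f2 = 6, the mode is 20 + 10(1)/6.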

ADVANTAGES OF MODE
1. Mode is readily comprehensible and easy to calculate . Like median ,mode can be located in
some cases merely by inspection .

2. Mode is not affected by extreme values.

3. It is easy to locate even when class intervals are of unequal magnitude, provided the modal
class and the classes preceding and succeeding it are of the same magnitude. Open-ended classes
do not pose any problem.

DISADVANTAGES OF MODE

1. Mode is ill-defined: it is not always possible to find a clearly defined mode, and we may
come across two modes in the same distribution.

2. It is not based upon all the observations.

3. It is not capable of further mathematical treatment.

4. As compared to the mean, the mode is more affected by fluctuations of sampling.

RELATIONSHIP BETWEEN ARITHMETIC MEAN, MEDIAN AND MODE

The mode can be estimated from the mean and the median. For a symmetrical distribution the mean,
median and mode coincide. If the distribution is moderately asymmetrical, the mean, median and
mode obey the following empirical relationship :

Mode = 3 Median − 2 Mean

SELECTION OF AN AVERAGE
There is no single average which is suitable for all purposes. Each average has its own merits
and demerits, and thus its own particular field of importance and utility. We cannot use the
averages indiscriminately; a judicious selection of the average, depending on the nature of the
data and the purpose of the enquiry, is essential for sound statistical analysis. Since the
arithmetic mean satisfies all the properties of an ideal average as laid down by Professor
Yule, is familiar to the layman, and has wide application in statistical theory at large, it may
be regarded as the best of all averages.

PARTITION VALUES
These are the values which divide the series into a number of equal parts. They help in
understanding the distribution and spread of data by indicating where certain percentages of
the data fall. The most commonly used partition values are quartiles, deciles , and percentiles.

QUARTILES

There are several ways to divide a set of observations when required. To divide the
observations into two equally sized parts, the median can be used. A quartile is a set of
values that divides a dataset into four equal parts. The first quartile, second quartile and
third quartile are the three quartiles. The first quartile is also called the lower quartile
and is denoted by Q1. The second quartile is another term for the median and is denoted by Q2.
The third quartile is often referred to as the upper quartile and is denoted by Q3.

QUARTILES (individual series)    n odd (n is the no. of observations)    n even

Lower Quartile (Q1)    (n + 1)/4 th observation    n/4 th observation

Upper Quartile (Q3)    3(n + 1)/4 th observation    3n/4 th observation

QUARTILES FOR GROUPED DATA


Lower Quartile (Q1) = l + ((N/4 − fc) / fQ) × w

Upper Quartile (Q3) = l + ((3N/4 − fc) / fQ) × w

Where l is the lower class boundary of the quartile class


fc is the cumulative frequency of the class before the quartile class.

fQ is the frequency of the quartile class.

w is the width of the class.

N = ∑ 𝑓𝑖

Interquartile range = Q3 − Q1
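The grouped-data quartile formulas can be implemented with one small helper; the classes are again the ones from the earlier examples.

```python
from itertools import accumulate

# Quartiles for grouped data: Qk = l + (kN/4 - fc)/fQ * w
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
freq = [2, 10, 11, 6, 1]
N = sum(freq)
cf = list(accumulate(freq))  # less-than cumulative frequencies

def quartile(k):
    target = k * N / 4
    i = next(j for j, c in enumerate(cf) if c >= target)  # quartile class
    l, w = classes[i][0], classes[i][1] - classes[i][0]
    fc = cf[i - 1] if i > 0 else 0
    return l + (target - fc) * w / freq[i]

q1, q3 = quartile(1), quartile(3)
print(round(q1, 2), round(q3, 2))  # 15.5 29.55
print(round(q3 - q1, 2))           # interquartile range: 14.05
```

Setting k = 2 in the same helper reproduces the median, since Q2 is the median.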

DECILES
Decile is a type of quantile that divides the dataset into 10 equal subsections with the help of 9
data points. Each section of the sorted data represents 1/10 of the original sample or population.
Decile helps to order large amounts of data in the increasing or decreasing order. This ordering is
done by using a scale from 1 to 10 where each successive value represents an increase by 10
percentage points. To split the given data and order it according to some specified metric,
statisticians use the decile rank also known as decile class rank. Once the given data is divided
into deciles then each subsequent data set is assigned a decile rank. Each rank is based on an
increase by 10 percentage points and is used to order the deciles in the increasing order. The 5th
decile of a distribution will give the value of the median.

FORMULA FOR CALCULATING DECILES

For ungrouped data,

Dx = x(n + 1)/10 th observation ; where x = 1, 2, …, 9 and n is the total number of observations.

For grouped data,

Dx = l + ((xN/10 − C) / f) × w

Where l is the lower limit of the decile class.

C is the cumulative frequency of the class before the decile class.

f is the frequency of the decile class.

x = 1,2,….,9

w is the width of the class

N = ∑𝑓𝑖

PERCENTILES
Percentiles divide the series into 100 equal parts. Percentiles are a type of quantile,
obtained by subdividing the distribution into 100 groups. The 25th percentile is also known as
the first quartile, the 50th percentile as the median or second quartile, and the 75th
percentile as the third quartile.

Formula for ungrouped data,

Px = x(n + 1)/100 th observation

For grouped data,

Px = l + ((xN/100 − C) / f) × w

Where l is the lower limit of the percentile class.

C is the cumulative frequency of the class before the percentile class.

f is the frequency of the percentile class.

w is the width of class.

N = ∑ 𝑓𝑖

x = 1,2,……,99
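Deciles and percentiles for grouped data follow the same pattern as quartiles, so one generic helper covers both. The classes below are from the earlier examples; only the divisor (10 or 100) changes.

```python
from itertools import accumulate

# Partition value for grouped data: l + (xN/m - C)/f * w,
# with m = 10 for deciles and m = 100 for percentiles.
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
freq = [2, 10, 11, 6, 1]
N = sum(freq)
cf = list(accumulate(freq))

def partition_value(x, m):
    target = x * N / m
    i = next(j for j, c in enumerate(cf) if c >= target)  # partition class
    l, w = classes[i][0], classes[i][1] - classes[i][0]
    C = cf[i - 1] if i > 0 else 0
    return l + (target - C) * w / freq[i]

print(round(partition_value(5, 10), 4))    # D5, the median: 22.7273
print(round(partition_value(50, 100), 4))  # P50 equals D5
```

As the text notes, the 5th decile, the 50th percentile and the median are all the same value.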

GRAPHICAL LOCATION OF THE PARTITION VALUES


The partition values quartiles, deciles and percentiles can be conveniently located with the
help of the cumulative frequency curve or ogive. The procedure is illustrated below :

First, form the less than cumulative frequency table. Take the class intervals along the x-axis
and plot the corresponding cumulative frequencies along the y-axis against the upper limit of
each class interval. The curve obtained on joining the points so obtained by free-hand drawing
is called the less than cumulative frequency curve, or ‘less than’ ogive. Similarly, by plotting
the more than cumulative frequencies against the lower limits of the corresponding classes and
joining the points so obtained by a smooth free-hand curve, we obtain the ‘more than’ ogive. The
partition values are then read off these curves.
DISPERSION
Averages (or measures of central tendency) give us an idea of the concentration of the
observations about the central part of the distribution. If we know the average alone, we cannot
form a complete idea about the distribution; a measure of central tendency must therefore be
supported and supplemented by some other measure, one such measure being dispersion. The literal
meaning of dispersion is scatteredness. We study dispersion to get an idea of the homogeneity or
heterogeneity of a distribution.

According to L. R. Connor, “Dispersion is a measure of the extent to which the individual items vary.”

According to Simpson and Kafka, “The measure of the scatteredness of the mass of figures in a
series about an average is called the measure of variation or dispersion.”

According to Spiegel, “The degree to which numerical values tend to spread about an average
value is called the variation or dispersion of the data.”

Characteristics for an ideal measure of dispersion

1.It should be rigidly defined.

2. It should be easy to calculate and easy to understand

3. It should be based on all the observations.


4. It should be amenable to further mathematical treatment.

5. It should be affected as little as possible by fluctuation of sampling .

MEASURE OF DISPERSION
Various measures of dispersion can be classified into two broad categories :

1. The measures which express the spread of observations in terms of distances between the
values of selected observations. These are also termed distance measures, e.g. the range and
the quartile deviation.

2. The measures which express the spread of observations in terms of the average of the
deviations of the observations from some central value, e.g. the mean deviation and the
standard deviation.
MEASURE OF ABSOLUTE DISPERSION
A measure of dispersion is said to be absolute when it states the actual amount by which the
value of an item on average deviates from a measure of central tendency.

RANGE
It is the simplest measure of dispersion. The range is the difference between the two extreme
observations of the distribution. If A and B are the greatest and the smallest observations
respectively in a distribution, then its range is given by:

Range = Xmax − Xmin = A − B
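This formula can be sketched in a few lines of Python; the data values here are illustrative, not from the text:

```python
# Range: difference between the greatest and smallest observations.
# The data list is an illustrative example.
data = [12, 7, 23, 15, 9, 31, 18]

A = max(data)        # greatest observation, Xmax
B = min(data)        # smallest observation, Xmin
data_range = A - B   # Range = Xmax - Xmin = A - B
print(data_range)    # -> 24
```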

ADVANTAGES OF RANGE

1. It is easy to understand.

2. It is simple to compute.

3. It is rigidly defined.

DISADVANTAGES OF RANGE

1. It is based on only the two extreme values.

2. It is much affected by fluctuations of sampling, as the two extreme values on which it is based
are themselves subject to fluctuations.

3. It is not capable of further mathematical treatment.

4. It is not based on all the observations.


QUARTILE DEVIATION
Quartile deviation or semi-interquartile range (Q) is given by:

Q = (Q3 − Q1) / 2

Where Q1 and Q3 are first and third quartiles of the distribution respectively.

Quartile deviation is definitely a better measure than range, as it makes use of 50% of the data.
But since it ignores the other 50% of the data, it cannot be regarded as a reliable measure.
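A small Python sketch of the quartile deviation, using the standard library's `statistics.quantiles`; its default "exclusive" quartile method is an assumption of this example, and other quartile conventions give slightly different values:

```python
from statistics import quantiles

# Illustrative data; quantiles(n=4) splits it at the three quartiles.
data = [2, 4, 6, 8, 10, 12, 14]
q1, q2, q3 = quantiles(data, n=4)  # default method='exclusive'
Q = (q3 - q1) / 2                  # semi-interquartile range
print(q1, q3, Q)                   # -> 4.0 12.0 4.0
```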

ADVANTAGES OF QUARTILE DEVIATION.


1. It is easy to understand.

2. It is simple to compute.

3. It is not based on only the extreme values.

4. It is a better measure of dispersion than range, as it takes into account the middle 50% of the
data rather than the two extreme observations.

DISADVANTAGES OF QUARTILE DEVIATION

1. It is not based on all the observations; it uses only the middle 50% of the data, leaving out the
other 50%.

2. It is not capable of further mathematical treatment.

3. It is affected by fluctuations of sampling.


MEAN DEVIATION
It is the average of the absolute deviations taken from a central value, which is generally the mean or
the median. If xi | fi , i = 1, 2, …, n is the frequency distribution, then the mean deviation from the
average A (usually mean, median or mode) is given by:

MEAN DEVIATION FROM AVERAGE A

For ungrouped data:

M.D. = (1/n) Σ |xi − A|

For grouped data:

M.D. = (1/N) Σ fi |xi − A| ; where N = Σ fi

Here |xi − A| represents the modulus, or absolute value, of the deviation (xi − A), in which the
negative sign is ignored.

Mean deviation is least when taken from the median. (The algebraic sum of the deviations, taken without the absolute value, is zero when taken from the mean.)
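The minimum-about-the-median property can be checked numerically; this Python sketch uses an illustrative data set:

```python
from statistics import mean, median

data = [3, 5, 7, 9, 21]

def mean_deviation(xs, about):
    # M.D. = (1/n) * sum of |xi - A|
    return sum(abs(x - about) for x in xs) / len(xs)

md_about_median = mean_deviation(data, median(data))  # median = 7
md_about_mean = mean_deviation(data, mean(data))      # mean = 9
print(md_about_median, md_about_mean)                 # -> 4.4 4.8
assert md_about_median <= md_about_mean               # least about the median
```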

ADVANTAGES OF MEAN DEVIATION

1. It is easy to understand and simple to calculate.

2. It is based on all the observations.


3. It shows the dispersion, or scatter, of the various items of a series from its central value.

4. It is not very much affected by the values of the extreme items.

5. It facilitates comparison between different items of a series.

DISADVANTAGES OF MEAN DEVIATION

1. It is not well defined.

2. The step of ignoring the signs of the deviations creates artificiality and renders it unsuitable for
further mathematical treatment.
VARIANCE
It is the average of the squared deviations taken from the mean. The square of the standard
deviation is called the VARIANCE. Equivalently, the variance of a random variable is the expected
value of the squared deviation from the mean. It is given by:

Let x1, x2, …, xn be the n observations. Then:

For ungrouped data,

σ² = (1/n) Σ (xi − x̄)²

Or we can also find the variance of ungrouped data by:

σ² = (1/n) [Σ xi² − n x̄²]

For grouped data:

σ² = (1/N) Σ fi (xi − x̄)²

Variance can also be calculated by the formula:

σ² = (1/N) [Σ fi xi² − N x̄²]
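The definitional form and the computational short-cut give the same value, as this Python sketch with illustrative data shows:

```python
data = [2, 4, 6, 8]
n = len(data)
xbar = sum(data) / n   # arithmetic mean = 5.0

# Definition: mean of squared deviations from the mean
var1 = sum((x - xbar) ** 2 for x in data) / n
# Short-cut: (1/n) * (sum of squares - n * xbar^2)
var2 = (sum(x * x for x in data) - n * xbar ** 2) / n
print(var1, var2)  # -> 5.0 5.0
```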
❖ Variance is affected by a change of scale but not by a change of origin.
❖ COMBINED VARIANCE
Given any two series consisting of n1 and n2 observations with means x̄1 and x̄2 and
variances σ1² and σ2² respectively, the pooled or combined variance of the two
series is given by:

σ² = [ n1 {σ1² + (x̄1 − x̄)²} + n2 {σ2² + (x̄2 − x̄)²} ] / (n1 + n2)

where σ² is the combined variance and x̄ is the combined mean.

Let d1 = x̄1 − x̄ and d2 = x̄2 − x̄. Then the variance is given by:

σ² = [ n1 {σ1² + d1²} + n2 {σ2² + d2²} ] / (n1 + n2)
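The pooled-variance formula can be verified against the variance of the merged data; the series values in this Python sketch are illustrative:

```python
s1 = [1, 3, 5]
s2 = [2, 4, 6, 8]

def mean_var(xs):
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / len(xs)

n1, n2 = len(s1), len(s2)
m1, v1 = mean_var(s1)
m2, v2 = mean_var(s2)

m = (n1 * m1 + n2 * m2) / (n1 + n2)   # combined mean
d1, d2 = m1 - m, m2 - m
pooled = (n1 * (v1 + d1 ** 2) + n2 * (v2 + d2 ** 2)) / (n1 + n2)

_, direct = mean_var(s1 + s2)         # variance of the merged series
print(pooled, direct)
assert abs(pooled - direct) < 1e-9    # the two agree
```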

❖ A drawback of variance is that it is expressed in squared units.


STANDARD DEVIATION
Standard deviation, usually denoted by the Greek letter small sigma (σ), is the positive square root of the
arithmetic mean of the squares of the deviations of the given values from their arithmetic mean. For the
frequency distribution xi | fi ; i = 1, 2, …, n,

σ = √[ (1/N) Σ fi (xi − x̄)² ] ; where x̄ is the arithmetic mean of the distribution

The standard deviation explains the average amount of variation on either side of the mean.

The step of squaring the deviations (𝑥𝑖 − 𝑥) overcomes the drawback of ignoring the signs in
mean deviation . Standard deviation is also suitable for further mathematical treatment .
Moreover , of all the measures ,standard deviation is affected least by fluctuations of sampling .

Thus ,we see that standard deviation satisfies almost all the properties laid down for an ideal
measure of dispersion except for the general nature of extracting the square root which is not
readily comprehensible for a non mathematical person . Thus we may regard standard deviation
as the best and the most powerful measure of dispersion .

❖ The standard deviation explains the average amount of variation on either side of the
mean; in this way the drawbacks of variance are overcome.
❖ In a symmetrical or moderately skewed distribution the following relationship exists
between mean deviation, quartile deviation and standard deviation:
6 Q.D. = 5 M.D. = 4 S.D.

ADVANTAGES OF STANDARD DEVIATION


1. It is based on all the values of the variable.

2. Suitable for further mathematical treatment .

3. It is less affected by fluctuations of sampling .

4. It is expressed in the same unit as the unit of measurement.

DISADVANTAGES OF STANDARD DEVIATION

1. It is relatively difficult to calculate and understand.


2. It is affected very much by extreme values.

3. It cannot be used to compare the dispersion of two or more series of different units.

ROOT MEAN SQUARE DEVIATION

Root mean square deviation, denoted by s, is given by:

s = √[ (1/N) Σ fi (xi − A)² ]

where A is any arbitrary number. s² is called the mean square deviation:

s² = (1/N) Σ fi (xi − A)²

RELATIONSHIP BETWEEN σ AND s

The variance is the minimum value of the mean square deviation; equivalently, the standard deviation is the
minimum value of the root mean square deviation. The minimum is attained when A = x̄:

min (s²) = (1/N) Σ fi (xi − x̄)² = σ²
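This minimum property is easy to check numerically; in this Python sketch the data and the trial points A are illustrative:

```python
data = [1, 2, 3, 4, 10]
n = len(data)
xbar = sum(data) / n   # 4.0

def mean_square_deviation(A):
    # s^2 = (1/n) * sum of (xi - A)^2, about an arbitrary point A
    return sum((x - A) ** 2 for x in data) / n

variance = mean_square_deviation(xbar)   # s^2 at A = mean is the variance
# s^2 about any other point is never smaller than the variance
assert all(mean_square_deviation(A) >= variance for A in [0, 2, 3.5, 7])
print(variance)   # -> 10.0
```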

SHEPPARD'S CORRECTION
When the observations are grouped in classes, all the observations in a class are taken to be equal to the
mid point of the class; this introduces some error, known as grouping error. Sheppard
suggested a correction, known as Sheppard's correction, of h²/12, where h is the width
of the class:

Variance corrected by Sheppard's correction = variance from grouped data − h²/12

= σ² − h²/12
MEASURE OF RELATIVE DISPERSION
In situations where the series to be compared are expressed in different units of measurement, or their means
differ significantly in size, an absolute measure of dispersion does not solve our problem. A
relative measure of dispersion is obtained by dividing the absolute measure by the
quantity with respect to which the absolute deviation has been computed:

Relative dispersion = absolute dispersion / average used

It is a pure number, usually expressed in percentage form, and is used to compare the dispersion
of two or more series.

❖ Coefficient of range = (A − B) / (A + B)
Where A and B are the greatest and the smallest items in the series.
❖ Coefficient of quartile deviation = (Q3 − Q1) / (Q3 + Q1)
Where Q1 and Q3 are the first and the third quartiles.
❖ Coefficient of mean deviation = mean deviation / average from which it is calculated
❖ Coefficient of standard deviation = standard deviation / mean = σ / x̄
❖ Coefficient of variation: 100 times the coefficient of dispersion based upon the standard
deviation is called the coefficient of variation, i.e.

C.V. = 100 × σ / x̄
According to Professor Karl Pearson, who suggested this measure, the coefficient of
variation is the percentage variation in the mean, the standard deviation being considered as the
total variation in the mean.
For comparing the variability of two series, we calculate the coefficient of variation for
each series. The series having greater coefficient of variation is said to be more variable
than the other and the series having lesser coefficient of variation is said to be more
consistent (or homogeneous) than the other.
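A short Python comparison of two illustrative series; the one with the smaller coefficient of variation is the more consistent:

```python
from statistics import mean, pstdev

a = [50, 52, 48, 50, 50]   # clustered tightly about its mean
b = [10, 14, 6, 12, 8]     # more spread out relative to its mean

def coefficient_of_variation(xs):
    # C.V. = 100 * sigma / mean (population standard deviation)
    return 100 * pstdev(xs) / mean(xs)

print(coefficient_of_variation(a), coefficient_of_variation(b))
assert coefficient_of_variation(a) < coefficient_of_variation(b)
```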
MOMENTS
Moments are popularly used to describe the characteristics of a distribution. They represent a
convenient and unifying method for summarizing many of the most commonly used statistical
measures, such as measures of central tendency, variation, skewness and kurtosis. Moments can be raw
moments, central moments or moments about any arbitrary point. In statistics, moments are the
arithmetic means of the first, second, third and so on, i.e. rth, power of the deviations taken from
either the mean or an arbitrary point of a distribution. Generally, in any frequency distribution, four
moments are obtained, known as the first, second, third and fourth moments. These four moments
describe the information about the mean, variance, skewness and kurtosis of a frequency distribution.
Calculation of moments thus gives features of a distribution which are of statistical importance.

Moments can be classified into raw and central moments. Raw moments are measured about any
arbitrary point A (say). If A is taken to be zero, the raw moments are called moments about
origin. When A is taken to be the arithmetic mean, we get central moments. The first raw moment
about origin is the mean, whereas the first central moment is zero. The second raw and central
moments are the mean square deviation and the variance, respectively. The third and fourth moments are
useful in measuring skewness and kurtosis.

METHODS OF CALCULATION OF MOMENTS

The three methods of calculating moments are:

❖ Moment about an arbitrary point
❖ Moment about mean
❖ Moment about origin
MOMENT ABOUT ARBITRARY POINT
When the actual mean is a fraction, moments are first calculated about an arbitrary point and then
converted to moments about the actual mean.

When deviations are taken about an arbitrary point A, the moments are given by:

For ungrouped data:

Zero order moment : μ0′ = (1/n) Σ (xi − A)⁰ = 1

First order moment : μ1′ = (1/n) Σ (xi − A)¹

Second order moment : μ2′ = (1/n) Σ (xi − A)²

Third order moment : μ3′ = (1/n) Σ (xi − A)³

Fourth order moment : μ4′ = (1/n) Σ (xi − A)⁴

In general, the rth moment about an arbitrary point is:

μr′ = (1/n) Σ (xi − A)^r ; where r = 1, 2, …

Similarly, for grouped data:

If x1, x2, …, xn are the n values of X having frequencies f1, f2, …, fn respectively, then the moments
about an arbitrary point A are:

Zero order moment : μ0′ = (1/N) Σ fi (xi − A)⁰ = 1

First order moment : μ1′ = (1/N) Σ fi (xi − A)¹

Second order moment : μ2′ = (1/N) Σ fi (xi − A)²

Third order moment : μ3′ = (1/N) Σ fi (xi − A)³

Fourth order moment : μ4′ = (1/N) Σ fi (xi − A)⁴

In general, the rth moment about an arbitrary point is:

μr′ = (1/N) Σ fi (xi − A)^r ; where r = 1, 2, …

We know that if di = xi − A, then in general:

For ungrouped data, the rth moment about an arbitrary point is:

μr′ = (1/n) Σ di^r ; where r = 1, 2, …

For grouped data, the rth moment about an arbitrary point is:

μr′ = (1/N) Σ fi di^r ; where r = 1, 2, …
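A Python sketch of the rth raw moment about an arbitrary point for a small frequency series; the values, frequencies and the point A are illustrative:

```python
xs = [2, 4, 6, 8]    # values of X
fs = [1, 3, 4, 2]    # corresponding frequencies
A = 5                # arbitrary point
N = sum(fs)          # N = 10

def raw_moment(r):
    # mu'_r = (1/N) * sum of fi * (xi - A)^r
    return sum(f * (x - A) ** r for x, f in zip(xs, fs)) / N

print([raw_moment(r) for r in range(5)])  # zero order moment is 1
```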
MOMENTS ABOUT ORIGIN
In the case when we take the arbitrary point A = 0, we get the moments about origin:

For ungrouped data, the rth order moment about origin is defined as:

mr = (1/n) Σ (xi − 0)^r = (1/n) Σ xi^r ; where r = 1, 2, …

First moment : m1 = (1/n) Σ xi = x̄

Second moment : m2 = (1/n) Σ xi²

Third moment : m3 = (1/n) Σ xi³

Fourth moment : m4 = (1/n) Σ xi⁴

Similarly, for grouped data:

If x1, x2, …, xn are the n values of X having frequencies f1, f2, …, fn respectively, then the moments
about origin are:

First order moment : m1 = (1/N) Σ fi xi

Second order moment : m2 = (1/N) Σ fi xi²

Third order moment : m3 = (1/N) Σ fi xi³

Fourth order moment : m4 = (1/N) Σ fi xi⁴

In general, the rth moment about origin for grouped data is:

mr = (1/N) Σ fi xi^r ; where r = 1, 2, …

MOMENT ABOUT MEAN

When we take the deviations from the actual mean and calculate the moments, these are known as
moments about mean, or central moments. They are given by:

For ungrouped data:

Zero order moment : μ0 = (1/n) Σ (xi − x̄)⁰ = 1

First order moment : μ1 = (1/n) Σ (xi − x̄)¹ = 0

The first order moment about mean is zero because the algebraic sum of the deviations taken about the
mean is zero.

Second order moment : μ2 = (1/n) Σ (xi − x̄)² = σ²

Thus the second order moment about mean is the variance.

Third order moment : μ3 = (1/n) Σ (xi − x̄)³

Fourth order moment : μ4 = (1/n) Σ (xi − x̄)⁴

In general, the rth moment about mean is:

μr = (1/n) Σ (xi − x̄)^r ; where r = 1, 2, …

Similarly, for grouped data:

If x1, x2, …, xn are the n values of X having frequencies f1, f2, …, fn respectively, then in general the
rth moment about mean (central moment) is:

μr = (1/N) Σ fi (xi − x̄)^r ; where r = 1, 2, …
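A Python sketch of central moments for an illustrative frequency series, checking that the first central moment vanishes and the second reproduces the (population) variance:

```python
from statistics import pvariance

xs = [2, 4, 6, 8]
fs = [1, 3, 4, 2]
N = sum(fs)
xbar = sum(f * x for x, f in zip(xs, fs)) / N   # 5.4

def central_moment(r):
    # mu_r = (1/N) * sum of fi * (xi - xbar)^r
    return sum(f * (x - xbar) ** r for x, f in zip(xs, fs)) / N

# Expand the frequency series to compare with statistics.pvariance
expanded = [x for x, f in zip(xs, fs) for _ in range(f)]
assert abs(central_moment(1)) < 1e-9                        # mu_1 = 0
assert abs(central_moment(2) - pvariance(expanded)) < 1e-9  # mu_2 = variance
print(central_moment(2))
```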

RELATION BETWEEN MOMENT ABOUT MEAN AND MOMENT ABOUT ARBITRARY POINT
The relationship between them is given as:

μ2 = μ2′ − μ1′²

μ3 = μ3′ − 3 μ2′ μ1′ + 2 μ1′³

μ4 = μ4′ − 4 μ3′ μ1′ + 6 μ2′ μ1′² − 3 μ1′⁴
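These identities can be verified numerically; in this Python sketch the data and the arbitrary point A are illustrative:

```python
xs = [1, 2, 3, 4, 5]
n = len(xs)
A = 2   # arbitrary point
xbar = sum(xs) / n

def mu_prime(r):
    return sum((x - A) ** r for x in xs) / n     # raw moment about A

def mu(r):
    return sum((x - xbar) ** r for x in xs) / n  # central moment

m1, m2, m3, m4 = (mu_prime(r) for r in range(1, 5))
assert abs(mu(2) - (m2 - m1 ** 2)) < 1e-9
assert abs(mu(3) - (m3 - 3 * m2 * m1 + 2 * m1 ** 3)) < 1e-9
assert abs(mu(4) - (m4 - 4 * m3 * m1 + 6 * m2 * m1 ** 2 - 3 * m1 ** 4)) < 1e-9
print(mu(2), mu(3), mu(4))  # -> 2.0 0.0 6.8
```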

EFFECT OF CHANGE OF ORIGIN AND SCALE ON MOMENTS

Let u = (x − A) / h, so that x = A + hu, x̄ = A + h ū and x − x̄ = h(u − ū).

Thus, the rth moment of x about any point x = A is given by:

μr′ = (1/N) Σ fi (xi − A)^r = (1/N) Σ fi (h ui)^r = h^r · (1/N) Σ fi ui^r

And the rth moment of x about mean is:

μr(x) = (1/N) Σ fi (xi − x̄)^r = (1/N) Σ fi {h(ui − ū)}^r = h^r · (1/N) Σ fi (ui − ū)^r = h^r μr(u)

Thus the rth moment of the variable x about mean is h^r times the rth moment of the variable u
about mean.
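A Python check of this scaling property, with illustrative data and an illustrative choice of A and h:

```python
xs = [10, 20, 30, 40]
A, h = 25, 10
us = [(x - A) / h for x in xs]   # u = (x - A) / h

def central(r, data):
    m = sum(data) / len(data)
    return sum((v - m) ** r for v in data) / len(data)

# mu_r(x) = h^r * mu_r(u) for each order r
for r in (2, 3, 4):
    assert abs(central(r, xs) - h ** r * central(r, us)) < 1e-9
print(central(2, xs), central(2, us))  # -> 125.0 1.25
```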

ABSOLUTE MOMENT
For the frequency distribution xi | fi ; i = 1, 2, …, n, the rth absolute moment of the variable x about
the origin is given by:

(1/N) Σ fi |xi|^r ; N = Σ fi

where |xi|^r represents the absolute or modulus value of xi^r.

The rth absolute moment of the variable about the mean x̄ is given by:

(1/N) Σ fi |xi − x̄|^r

SHEPPARD'S CORRECTION OF MOMENTS

In the case of a grouped frequency distribution, while calculating moments we assume that the
frequencies are concentrated at the mid points of the class intervals. The fundamental
assumption that we make in forming class intervals is that the frequencies are uniformly
distributed about the mid points of the class intervals. All the moment calculations for
grouped frequency distributions rely on this assumption. The aggregate of the observations, or
their powers, in a class is approximated by multiplying the class mid point, or its power, by the
corresponding class frequency. If the distribution is symmetrical or only slightly asymmetrical, and the
class intervals are not greater than one-twentieth of the range, this assumption is very nearly true.
But since this assumption is not true in general, some error, called the grouping error, creeps into
the calculation of the moments. W.F. Sheppard proved that if:

❖ If frequency distribution is continuous and


❖ The frequency tapers off to zero in both the directions , the effect due to grouping at the
mid point of the interval can be corrected by the following formula known as Sheppard’s
corrections :
μ2 (corrected) = μ2 (uncorrected) − h²/12

μ3 (corrected) = μ3

μ4 (corrected) = μ4 − (1/2) h² μ2 + (7/240) h⁴

Where h is the width of the class.


PEARSON'S β AND γ COEFFICIENTS
Karl Pearson defined the following four coefficients, based upon the first four central moments:

1. β1 is defined as:

β1 = μ3² / μ2³

It is used as measure of skewness. For a symmetrical distribution, 𝛽1 shall be zero.

β1 as a measure of skewness does not tell us about the direction of skewness, i.e. positive or
negative, because μ3, being the sum of cubes of the deviations from the mean, may be positive or
negative, but μ3² is always positive; also μ2, being the variance, is always positive. Hence β1 is
always positive. This drawback is removed if we calculate Karl Pearson's coefficient of
skewness γ1, which is the square root of β1, i.e.

𝛾1 = +√𝛽1

Then the sign of skewness would depend upon the value of 𝜇3 whether it is positive or negative.
It is advisable to use 𝛾1 as measure of skewness.

2. β2 measures kurtosis and is defined by:

β2 = μ4 / μ2²

Similarly, the coefficient γ2 is defined as:

γ2 = β2 − 3

It may be pointed out that these coefficients are pure numbers, independent of the units of
measurement.
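These coefficients can be computed directly from the central moments; in this Python sketch the data set is an illustrative sample with a longer right tail:

```python
data = [1, 2, 3, 4, 10]   # longer right tail -> positive mu_3
n = len(data)
xbar = sum(data) / n

def mu(r):
    return sum((x - xbar) ** r for x in data) / n

beta1 = mu(3) ** 2 / mu(2) ** 3
gamma1 = beta1 ** 0.5 * (1 if mu(3) >= 0 else -1)  # sign taken from mu_3
beta2 = mu(4) / mu(2) ** 2
gamma2 = beta2 - 3
print(beta1, gamma1, beta2, gamma2)
assert beta1 >= 0          # beta_1 is always non-negative
assert gamma1 > 0          # positive skew for this sample
```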

SKEWNESS AND KURTOSIS

Measures of central tendency and dispersion give the location and scale of the distribution. In addition
to these, we also need to have an idea about the shape of the distribution. A measure of
skewness gives the direction and the magnitude of the lack of symmetry, whereas kurtosis
gives an idea of the flatness or peakedness of the curve.

CONCEPT OF SKEWNESS

Skewness means lack of symmetry. In mathematics, a figure is called symmetric if there exists a
point in it through which if a perpendicular is drawn on the X-axis, it divides the figure into two
congruent parts i.e. identical in all respect or one part can be superimposed on the other i.e.
mirror images of each other.
❖ SYMMETRICAL CURVE

In Statistics, a distribution is called symmetric if mean, median and mode coincide. Otherwise,
the distribution becomes asymmetric.
❖ NEGATIVELY SKEWED CURVE

If the left tail is longer, we get a negatively skewed distribution for which mean < median <
mode.

❖ POSITIVELY SKEWED CURVE

If the right tail is longer ,we get positively skewed distribution for which mean > median >
mode.
DIFFERENCE BETWEEN VARIANCE AND SKEWNESS
The following two points of difference between variance and skewness should be carefully
noted.
1. Variance tells us about the amount of variability while skewness gives the direction of
variability.

2. In business and economic series, measures of variation have greater practical application than
measures of skewness. However, in medical and life science field measures of skewness have
greater practical applications than the variance.

MEASURES OF SKEWNESS
Measures of skewness help us to know to what degree and in which direction (positive or
negative) the frequency distribution departs from symmetry. Although positive or
negative skewness can be detected graphically, depending on whether the right tail or the left tail
is longer, we do not get an idea of the magnitude. Besides, borderline cases between symmetry
and asymmetry may be difficult to detect graphically. Hence some statistical measures are
required to find the magnitude of the lack of symmetry.

There are two types of measure of skewness :

1. Absolute measure of skewness


2. Relative measure of skewness

ABSOLUTE MEASURE OF SKEWNESS


The various measures of skewness are:
1. Sk = M - Md
2. Sk = M – M0

Where M is the mean , Md is the median and M0 is the mode of the distribution .

3. Sk = (Q3 – Md) – (Md – Q1)

These are the absolute measures of skewness. As in dispersion, for comparing two series we do
not calculate these absolute measures; instead we calculate the relative measures, called coefficients
of skewness, which are pure numbers independent of the units of measurement.
RELATIVE MEASURE OF SKEWNESS

KARL PEARSON'S MEASURE OF SKEWNESS

This method is most frequently used for measuring skewness. The formula for measuring
coefficient of skewness is given by :
Sk = (Mean − Mode) / σ

The value of this coefficient is zero for a symmetrical distribution. If the mean is greater than the
mode, the coefficient of skewness is positive; otherwise it is negative. The value of Karl
Pearson's coefficient of skewness usually lies between −1 and +1 for a moderately skewed distribution. If the
mode is not well defined, we use the formula:

Sk = 3(Mean − Median) / σ

by using the relationship

Mode = 3 Median − 2 Mean

Here, −3 ≤ Sk ≤ 3.
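A Python sketch of the median-based form of the coefficient, on an illustrative positively skewed sample:

```python
from statistics import mean, median, pstdev

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 13]   # long right tail

# Sk = 3 * (mean - median) / sigma
Sk = 3 * (mean(data) - median(data)) / pstdev(data)
print(Sk)                 # positive: mean (4) exceeds median (3)
assert 0 < Sk <= 3        # within the stated bounds
```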

BOWLEY’ S COEFFICIENT OF SKEWNESS

This method is based on quartiles. The formula for calculating the coefficient of skewness is given
by:

Sk = [(Q3 − Q2) − (Q2 − Q1)] / [(Q3 − Q2) + (Q2 − Q1)]

= (Q3 + Q1 − 2Q2) / (Q3 − Q1)

The value of Sk is zero for a symmetrical distribution. If the value is greater than zero, the
distribution is positively skewed, and if the value is less than zero, it is negatively skewed. Sk
takes values between +1 and −1.

KELLY'S COEFFICIENT OF SKEWNESS

Based on percentiles : Sk = (P90 + P10 − 2 P50) / (P90 − P10)

Based on deciles : Sk = (D9 + D1 − 2 D5) / (D9 − D1)

COEFFICIENTS OF SKEWNESS BASED ON MOMENTS

Karl Pearson defined the following β and γ coefficients of skewness, based upon the second and
third central moments:

β1 = μ3² / μ2³

and γ1 = +√β1

NOTE :

❖ If the values of mean, median and mode are the same in a distribution, then
skewness does not exist in that distribution. The larger the difference in these values,
the larger the skewness;
❖ If the sums of the frequencies are equal on both sides of the mode, then skewness does
not exist;
❖ If the first quartile and the third quartile are at the same distance from the median, then
skewness does not exist. Similarly, if the deciles (first and ninth) and the percentiles (first
and ninety-ninth) are at equal distances from the median, then there is no
asymmetry;
❖ If the sums of the positive and negative deviations obtained from the mean, median or
mode are equal, then there is no asymmetry;
❖ If the graph of the data forms a normal curve which, when folded at the middle, has
one part overlapping the other fully, then there is no asymmetry.
KURTOSIS
Even if we have knowledge of the measures of central tendency, dispersion and skewness, we still
cannot get a complete idea of a distribution. In addition to these measures, we need
another measure to get a complete idea of the shape of the distribution, which can be
studied with the help of kurtosis. Prof. Karl Pearson called it the "convexity of a curve".
Kurtosis gives a measure of the flatness or peakedness of a distribution.

The degree of kurtosis of a distribution is measured relative to that of a normal curve. The curves
with greater peakedness than the normal curve are called “Leptokurtic”. The curves which are
more flat than the normal curve are called “Platykurtic”. The normal curve is called
“Mesokurtic.”

❖ KURTOSIS CURVE

It is measured by the coefficient β2, or its derivative γ2, given by:

β2 = μ4 / μ2² and γ2 = β2 − 3

If β2 = 3 or γ2 = 0, the curve is said to be mesokurtic;

If β2 < 3 or γ2 < 0, the curve is said to be platykurtic;

If β2 > 3 or γ2 > 0, the curve is said to be leptokurtic.
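This classification can be sketched in Python; the uniform-like sample below is illustrative and comes out platykurtic (flatter than the normal curve):

```python
def beta2(data):
    n = len(data)
    m = sum(data) / n
    mu2 = sum((x - m) ** 2 for x in data) / n
    mu4 = sum((x - m) ** 4 for x in data) / n
    return mu4 / mu2 ** 2   # beta_2 = mu_4 / mu_2^2

flat = [1, 2, 3, 4, 5, 6]   # evenly spread, uniform-like sample
b2 = beta2(flat)
gamma2 = b2 - 3
print(b2, gamma2)           # beta_2 < 3, gamma_2 < 0 -> platykurtic
assert b2 < 3 and gamma2 < 0
```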
