DESCRIPTIVE STATISTICS PROJECT
A PROJECT
Submitted in partial fulfillment for the degree of
BACHELOR OF SCIENCE
IN
STATISTICS
SUBMITTED BY
VIDUSHI RASTOGI
ROLL NO : 2210404010657
UNDER THE GUIDANCE OF :
Ms. Saumya Tiwari
DEPARTMENT OF STATISTICS
SHRI JAI NARAYAN DEGREE COLLEGE
ACKNOWLEDGEMENT
I would like to express my deepest gratitude to all those who have supported me throughout the
project on the topic “DESCRIPTIVE STATISTICS”.
First and foremost, I would like to thank my supervisor, Ms. Saumya Tiwari, for her
invaluable guidance, expertise, and encouragement. Her insightful advice and constant support
helped me stay focused and motivated.
I would also like to extend my gratitude towards all the faculty members of the DEPARTMENT
OF STATISTICS , SHRI JAI NARAYAN DEGREE COLLEGE .
Lastly, I would like to express my gratitude towards my family for their support and guidance.
VIDUSHI RASTOGI
This is to certify that the term paper entitled “DESCRIPTIVE STATISTICS”, which is
submitted by a BSc Semester 5 student, has been carried out as per the requirements given by the
DEPARTMENT OF STATISTICS, SHRI JAI NARAYAN DEGREE COLLEGE.
This work is a review of literature and has been done by the student herself. To the best of
my knowledge, she has fulfilled the conditions for the submission of this term paper.
Department of Statistics
TABLE OF CONTENTS
❖ Introduction
❖ Frequency distribution
❖ Measures of central tendency
1. Arithmetic Mean
2. Harmonic Mean
3. Geometric Mean
4. Median
5. Mode
❖ Partition values
1. Quartiles
2. Deciles
3. Percentiles
❖ Dispersion
❖ Moments
❖ Skewness
❖ Kurtosis
Introduction
Descriptive statistics are brief descriptive coefficients that summarize a given data set,
which can be either a representation of the entire population or a sample of a
population. Descriptive statistics are broken down into measures of central tendency
and measures of variability (spread). Measures of central tendency include the
mean, median and mode; measures of shape include skewness and kurtosis. Measures of variability
describe the dispersion of the data set (range, variance, standard deviation). Measures of
frequency distribution describe the occurrence of data within the data set (count).
People use descriptive statistics to condense hard-to-understand quantitative insights
from a large data set into bite-sized descriptions. In descriptive statistics, univariate
analysis examines only one variable. It is used to identify characteristics of a
single trait and is not used to analyze any relationships or causation. Bivariate analysis, on
the other hand, attempts to link two variables by searching for correlation.
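As a minimal sketch of these ideas, the standard measures described above can be computed with Python's standard library; the data set below is an illustrative assumption:

```python
# Computing measures of central tendency and dispersion for a small
# hypothetical sample, using Python's standard `statistics` module.
import statistics

data = [12, 15, 11, 15, 18, 20, 15, 13]  # illustrative sample

mean = statistics.mean(data)             # central tendency
median = statistics.median(data)
mode = statistics.mode(data)
variance = statistics.pvariance(data)    # dispersion (population variance)
stdev = statistics.pstdev(data)          # population standard deviation

print(mean, median, mode, variance, stdev)
```

Each of these measures is developed in detail in the sections that follow.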
FREQUENCY
The frequency of a value is the number of times it occurred in a data set. A frequency distribution
is the pattern of frequencies of a variable. It's the number of times each possible value of a
variable occurs in a data set.
4. Cumulative frequency distribution : The sum of the frequencies less than or equal to each
value or class interval of a variable.
5. Open End Frequency Distribution : Open end frequency distribution is one which has at
least one of its ends open. Either the lower limit of the first class or upper limit of the last class or
both are not specified.
ARRANGEMENT OF DATA
There are different ways of arranging the raw data. The data after collection is arranged in
mainly two possible ways :
1. Simple array : The simple array is one of the simplest ways to present data. It is an
arrangement of the given raw data in ascending or descending order; in ascending order the
scores are arranged in increasing order of magnitude. A simple array has several
advantages as well as disadvantages over raw data. Using a simple array, we can easily
point out the lowest and highest values in the data, and the entire data can be easily
divided into different sections. Repetition of values can be easily checked, and the
distance between succeeding values can be observed at first glance. But
sometimes a data array is not very helpful, because it lists every observation;
it is cumbersome for displaying large quantities of data.
2. Grouped frequency distribution : The data arranged in the form of groups can follow
either a discrete distribution or a continuous distribution. These distributions are described
below:
Discrete frequency distribution : Here the different observations are not written out as in a simple
array. Instead, we count the number of times each observation appears, which is known as its frequency. The
literal meaning of frequency is the number of occurrences of a particular event or score in a set of
samples. A frequency distribution is a table that organizes data into classes, i.e., into groups of
values describing one characteristic of the data.
For example : frequency distribution of number of persons and their wages per month
1. The classes should be clearly defined and should not lead to any ambiguity.
2. The classes should be exhaustive .
3. The classes should be mutually exclusive and non overlapping .
4. The classes should be of equal width.
5. Indeterminate classes should be avoided as far as possible .
6. The number of classes should neither be too large nor too small. The following formula,
due to Sturges, may be used to determine an approximate number of classes k :
k = 1 + 3.322 log10 N , where N is the total frequency
The magnitude of the class interval = (largest observation − smallest observation) / k ,
where k is the number of class intervals.
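Sturges' rule and the class-width formula above can be sketched as follows; the data set is an illustrative assumption:

```python
# Sturges' rule, k = 1 + 3.322 log10(N), and the class-interval width
# (largest - smallest) / k, for an assumed raw data set.
import math

data = [23, 45, 12, 67, 34, 89, 21, 56, 43, 78, 90, 15, 38, 52, 61, 29]

N = len(data)                             # total frequency
k = round(1 + 3.322 * math.log10(N))      # approximate number of classes
width = (max(data) - min(data)) / k       # magnitude of the class interval

print(k, width)
```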
Class limit
Class limits should be chosen in such a way that the mid-value of the class interval and the actual
average of the observations in that class interval are as near to each other as possible. If this is not the case,
the distribution gives a distorted picture of the characteristics of the data.
Exclusive classes : In this method of class formation, the classes are formed so that the upper limit
of one class becomes the lower limit of the next class. The exclusive method of classification
ensures continuity between two successive classes.
Representation of data classification in exclusive classes
Inclusive classes : In this method, the classification includes scores which are equal to the upper
limit of the class. The inclusive method is preferred when measurements are given in whole
numbers.
Representation of data classification in inclusive classes
True or actual classes : In the inclusive method, the upper class limit is not equal to the lower class limit of
the next class; therefore there is no continuity between the classes. However, many
statistical measures require continuous classes. We therefore subtract 0.5 from the lower limit and
add 0.5 to the upper limit of each class.
For example
i) ‘Less than’ Ogive: In a ‘less than’ ogive, the less-than cumulative frequencies are
plotted against the upper class boundaries of the respective classes. It is an increasing
curve, sloping upwards from left to right.
ii) ‘More than’ Ogive: In a ‘more than’ ogive, the more-than cumulative frequencies are
plotted against the lower class boundaries of the respective classes. It is a decreasing
curve, sloping downwards from left to right.
For example
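The cumulative frequencies that the two ogives plot can be sketched with Python's standard library; the class intervals and frequencies below are illustrative assumptions:

```python
# 'Less than' and 'more than' cumulative frequencies for an assumed
# exclusive-class frequency distribution.
import itertools

classes = [(0, 10), (10, 20), (20, 30), (30, 40)]   # exclusive classes
freqs = [5, 8, 12, 5]

# 'Less than' c.f.: plotted against upper class boundaries.
less_than = list(itertools.accumulate(freqs))

# 'More than' c.f.: plotted against lower class boundaries.
total = sum(freqs)
more_than = [total - c for c in [0] + less_than[:-1]]

print(less_than)   # increasing, as in the 'less than' ogive
print(more_than)   # decreasing, as in the 'more than' ogive
```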
According to Professor Yule, the following are the characteristics to be satisfied by an ideal
measure of central tendency:
ARITHMETIC MEAN
Arithmetic mean of a set of observations is their sum divided by the number of
observations .Arithmetic mean is often referred to as the mean or arithmetic average. It is
calculated by adding all the numbers in a given data set and then dividing it by the total number
of items within that set. For an evenly distributed set of numbers, the arithmetic mean is equal to the
middlemost number. Further, the AM can be calculated by several methods, depending on the
amount and the distribution of the data.
The formula x̄ = (∑ xi) / n is used to calculate the mean of an individual series, i.e. ungrouped data.
This is known as the direct method for calculating the arithmetic mean and is the
simplest method of calculating the mean.
The calculation of the arithmetic mean can be simplified to a great extent by taking the deviation of each
value from an arbitrary point A, i.e. di = xi − A, where di is the deviation about the arbitrary point A.

Now, Mean (x̄) = A + (∑ fi di) / N ; where i = 1, 2, …, n

This formula is known as the shortcut method for calculating the arithmetic mean; it gives exactly
the same value as the direct method with less computation.
The calculation is simplified to a still greater extent by taking di = (xi − A) / h, where A is an
arbitrary point and h is the width of the class.

Now, Mean (x̄) = A + h · (∑ fi di) / N ; where i = 1, 2, …, n

This formula is known as the step deviation method for calculating the arithmetic mean. It is
the most convenient method when the class width h is constant.
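The three methods above can be compared in a short sketch; the class mid-values and frequencies are illustrative assumptions, and all three methods yield the same mean:

```python
# Direct, shortcut, and step-deviation methods for the arithmetic mean
# of an assumed frequency distribution (x = class mid-values).
x = [5, 15, 25, 35, 45]      # class mid-values
f = [4, 6, 10, 6, 4]         # frequencies
N = sum(f)

# Direct method: mean = sum(f * x) / N
direct = sum(fi * xi for fi, xi in zip(f, x)) / N

# Shortcut method: deviations di = xi - A about an arbitrary point A
A = 25
shortcut = A + sum(fi * (xi - A) for fi, xi in zip(f, x)) / N

# Step-deviation method: di = (xi - A) / h, h = class width
h = 10
step = A + h * sum(fi * (xi - A) / h for fi, xi in zip(f, x)) / N

print(direct, shortcut, step)   # all three agree
```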
2. The sum of the squares of deviations of a set of values is minimum when the deviations are
taken about the mean.
3. The mean of a constant is the constant itself.
4. It is amenable to algebraic treatment: the mean of a composite series can be obtained in terms
of the means and the sizes of the component series.
5. Arithmetic mean is affected very much by extreme values and cannot be calculated for
open-ended classes.
GEOMETRIC MEAN
Geometric mean of a set of n observations is the nth root of their product. In statistics,
the Geometric Mean is the average value or mean which signifies the central tendency of the set
of numbers by finding the product of their values. Basically, we multiply the numbers altogether
and take the nth root of the multiplied numbers, where n is the total number of data values.
2. If each object in the data set is substituted by the G.M., then the product of the objects remains unchanged.
3. The geometric mean of the ratios of corresponding observations in two series is equal to the ratio of their
geometric means.
4. The geometric mean of the products of corresponding items in two series is equal to the product of their
geometric means.
1. The geometric mean is not easy to understand for people who are not mathematically inclined,
since it involves logarithmic operations.
2. The geometric mean is difficult to calculate because it involves finding the root of the products
of certain values
3. The geometric mean cannot be calculated if any value in a series is zero or if the number of
negative values is odd.
4. The geometric mean may not be the actual value of the series.
5. Only used in geometric progression series.
6. Time consuming.
HARMONIC MEAN
The Harmonic Mean (H) of a number of observations , none of which is zero , is the reciprocal
of the arithmetic mean of the reciprocals of the given values . Harmonic mean gives less
weightage to the large values and large weightage to the small values to balance the values
correctly. In general, the harmonic mean is used when there is a necessity to give greater weight
to the smaller items. It is applied in the case of times and average rates. It is the most appropriate
measure for ratios and rates because it equalizes the weights of each data point. For instance, the
arithmetic mean places a high weight on large data points, while the geometric mean gives a
lower weight to the smaller data points.
1. If all the observations are a constant, say c, then the harmonic mean of the observations will
also be c.
2. The harmonic mean can also be evaluated for a series containing negative values.
3. If any value of a given series is 0, then its harmonic mean cannot be determined, as the
reciprocal of 0 does not exist.
4. If in a given series the values are neither all equal nor any value zero, then the harmonic mean
will be less than both the geometric mean and the arithmetic mean.
1. The harmonic mean is greatly affected by the values of the extreme items.
3. The calculation of the harmonic mean is cumbersome, as it involves calculations using the
reciprocals of the numbers.
For two observations, AM × HM = GM², i.e. GM = √(AM × HM).
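This relation between the three means can be checked in a short sketch; note that it holds exactly for two positive observations (the values a and b are illustrative assumptions):

```python
# Verifying AM * HM = GM^2 for two positive observations, and the
# ordering HM <= GM <= AM.
import math

a, b = 4.0, 9.0
AM = (a + b) / 2                 # arithmetic mean
GM = math.sqrt(a * b)            # geometric mean
HM = 2 / (1 / a + 1 / b)         # harmonic mean

assert math.isclose(AM * HM, GM ** 2)
print(AM, GM, HM)                # 6.5  6.0  ~5.538
```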
MEDIAN
Median of a distribution is the value of the variable which divides it into two equal parts. It is the
value which exceeds and is exceeded by the same number of observations; that is, the number of
observations above it equals the number of observations below it. The
median is thus a positional average.
In the case of ungrouped data (an individual series), if the number of observations is odd, the median
is the middle value after the values have been arranged in ascending or descending order of
magnitude. In the case of an even number of observations, there are two middle terms and the median is
obtained by taking the arithmetic mean of the two middle terms.
2. See the (less than) cumulative frequency (c.f.) just greater than N/2.
In the case of a continuous frequency distribution, the class corresponding to the c.f. just greater
than N/2 is called the median class, and the value of the median is obtained by the following
formula:

Median = l + (h / f) (N/2 − c)

where l is the lower limit of the median class, h its width, f its frequency, c the cumulative
frequency of the class preceding the median class, and N = ∑f.
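The grouped-median formula above can be sketched as follows; the class intervals and frequencies are illustrative assumptions:

```python
# Median of an assumed grouped frequency distribution:
# Median = l + (h / f) * (N/2 - c), where l, h, f belong to the median
# class and c is the cumulative frequency of the preceding class.
classes = [(0, 10), (10, 20), (20, 30), (30, 40)]
freqs = [5, 8, 12, 5]
N = sum(freqs)                      # 30, so N/2 = 15

# Locate the median class: the first class whose c.f. reaches N/2.
cf = 0
for (l, u), fi in zip(classes, freqs):
    if cf + fi >= N / 2:
        break                       # (l, u) is now the median class
    cf += fi

h = u - l
median = l + (h / fi) * (N / 2 - cf)
print(median)
```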
1. Median is the only average to be used while dealing with qualitative data which cannot be
measured quantitatively but can still be arranged in ascending or descending order of magnitude,
e.g. to find the average intelligence or average honesty among a group of people.
2. It is used for determining the typical value in problems concerning wages and the distribution of
wealth.
ADVANTAGES OF MEDIAN
1. It is rigidly defined
2. The median is not affected by very large or very small values, also known as outliers.
3. The median is easy to calculate and understand, can be located just by inspection.
4. The median can be used for ratio, interval, and ordinal scales.
5. The median can be used to compute frequency distribution with open-ended classes.
DISADVANTAGES OF MEDIAN
1. In the case of an even number of observations, the median cannot be determined exactly; we merely
estimate it by taking the mean of the two middle terms.
2. It is not based on all the observations .
MODE
Mode is the value which occurs most frequently in a set of observations and around which the
other items of the set cluster densely . In other words, mode is the value of the variable which is
predominant in the series.
Thus , in case of discrete frequency distribution , mode is the value of x corresponding to the
maximum frequency.
2. if the maximum frequency occurs at the very beginning or at the end of the distribution, and
3. if there are irregularities in the distribution,
In the case of a continuous frequency distribution, mode is given by the formula:

Mode = l + h (f1 − f0) / ((f1 − f0) + (f1 − f2)) = l + h (f1 − f0) / (2f1 − f0 − f2)

where l is the lower limit of the modal class, h is the magnitude (width) of the modal class,
f1 is the frequency of the modal class, and f0 and f2 are the frequencies of the classes
preceding and succeeding the modal class respectively.
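The grouped-mode formula can be sketched as follows, using the same kind of assumed distribution as for the median:

```python
# Mode of an assumed grouped frequency distribution:
# Mode = l + h * (f1 - f0) / (2*f1 - f0 - f2).
classes = [(0, 10), (10, 20), (20, 30), (30, 40)]
freqs = [5, 8, 12, 5]

i = freqs.index(max(freqs))          # modal class: highest frequency
l, u = classes[i]
h = u - l                            # magnitude of the modal class
f1 = freqs[i]                        # frequency of the modal class
f0 = freqs[i - 1]                    # frequency of the preceding class
f2 = freqs[i + 1]                    # frequency of the succeeding class

mode = l + h * (f1 - f0) / (2 * f1 - f0 - f2)
print(mode)
```

Note this simple sketch assumes the modal class is neither the first nor the last class, so that f0 and f2 exist.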
The mode is the observation that occurs most frequently in a data set. Here are some terms
related to the mode of a data set:
Unimodal: A data set with only one value that occurs most often
Bimodal: A data set with two values that occur with the greatest frequency
Multimodal: A data set with more than two values that occur with the same greatest frequency
ADVANTAGES OF MODE
1. Mode is readily comprehensible and easy to calculate . Like median ,mode can be located in
some cases merely by inspection .
3. It is easy to locate even when class intervals are of unequal magnitude, provided the modal
class and the classes preceding and succeeding it are of the same magnitude; open-ended classes
do not pose any problem.
DISADVANTAGES OF MODE
1. Mode is ill-defined. It is not always possible to find a clearly defined mode; we may come
across two modes in the same distribution.
Mode can be estimated from the mean and median. For a symmetrical distribution, the mean, median
and mode coincide. If the distribution is moderately asymmetrical, the mean, median and mode obey the
following empirical relationship:
Mode = 3 Median − 2 Mean
SELECTION OF AN AVERAGE
There is no single average which is suitable for all purposes. Each average has its
own merits and demerits, and thus its own particular field of importance and utility. We cannot
use the averages indiscriminately; a judicious selection of the average, depending on the nature of the
data and the purpose of the inquiry, is essential for sound statistical analysis. Since the arithmetic
mean satisfies all the properties of an ideal average as laid down by Professor Yule, is familiar to
the layman, and has wide application in statistical theory at large, it may be regarded as the best
of all averages.
PARTITION VALUES
These are the values which divide the series into a number of equal parts. They help in
understanding the distribution and spread of data by indicating where certain percentages of
the data fall. The most commonly used partition values are quartiles, deciles , and percentiles.
QUARTILES
There are several ways to divide an observation when required. To divide the observation into
two equally sized parts, the median can be used. A quartile is a set of values that divides a
dataset into four equal parts. The first quartile, second quartile, and third quartile are the three
basic quartile categories. The lower quartile is another name for the first quartile and is
denoted by the letter Q1. The median is another term for the second quartile and is denoted by
the letter Q2. The third quartile is often referred to as the upper quartile and is denoted by the
letter Q3
Upper Quartile (Q3) = l + ((3N/4 − fc) / fQ) · w , where N = ∑ fi

Here l is the lower boundary of the quartile class, w its width, fQ its frequency, and fc the
cumulative frequency of the class preceding it.
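The upper-quartile formula can be sketched in the same way as the grouped median; the distribution below is an illustrative assumption:

```python
# Upper quartile Q3 of an assumed grouped frequency distribution:
# Q3 = l + ((3N/4 - fc) / fQ) * w.
classes = [(0, 10), (10, 20), (20, 30), (30, 40)]
freqs = [5, 8, 12, 5]
N = sum(freqs)                       # 30, so 3N/4 = 22.5

# Locate the quartile class: the first class whose c.f. reaches 3N/4.
fc = 0
for (l, u), fQ in zip(classes, freqs):
    if fc + fQ >= 3 * N / 4:
        break                        # (l, u) is now the quartile class
    fc += fQ

w = u - l
Q3 = l + (3 * N / 4 - fc) * w / fQ
print(Q3)
```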
DECILES
Decile is a type of quantile that divides the dataset into 10 equal subsections with the help of 9
data points. Each section of the sorted data represents 1/10 of the original sample or population.
Decile helps to order large amounts of data in the increasing or decreasing order. This ordering is
done by using a scale from 1 to 10 where each successive value represents an increase by 10
percentage points. To split the given data and order it according to some specified metric,
statisticians use the decile rank also known as decile class rank. Once the given data is divided
into deciles then each subsequent data set is assigned a decile rank. Each rank is based on an
increase by 10 percentage points and is used to order the deciles in the increasing order. The 5th
decile of a distribution will give the value of the median.
Dx = l + (h / f) (xN/10 − c) ; x = 1, 2, …, 9 , N = ∑ fi

where l, h, f and c are defined for the decile class just as in the median formula.
PERCENTILES
Percentiles divides the series into 100 equal parts. Percentiles are a type of quantiles , obtained
adopting a subdivision into 100 groups. The 25th percentile is also known as the first quartile ,
the 50th percentile as the median or second quartile , and the 75th percentile as the third quartile.
Px = l + (h / f) (xN/100 − c) ; x = 1, 2, …, 99 , N = ∑ fi
First, form the less-than cumulative frequency table. Take the class intervals along the x-axis
and plot the corresponding cumulative frequencies along the y-axis against the upper limits of
the class intervals. The curve obtained on joining the points so obtained by free-hand
drawing is called the less-than cumulative frequency curve, or less-than ogive. Similarly, by plotting
the more-than cumulative frequencies against the lower limits of the corresponding classes and
joining the points so obtained by a smooth free-hand curve, we obtain the more-than ogive; the
partition values can then be read off from these curves.
DISPERSION
Averages (or measures of central tendency) give us an idea of the concentration of the
observations about the central part of the distribution. If we know the average alone, we cannot
form a complete idea about the distribution; thus a measure of central tendency must be
supported and supplemented by some other measure, one such measure being dispersion.
The literal meaning of dispersion is scatteredness. We study dispersion to get an idea of the
homogeneity or heterogeneity of a distribution.
According to L. R. Connor, dispersion is a measure of the extent to which individual items vary.
According to Simpson and Kafka , The measure of scatteredness of the mass of figures in a
series about an average is called the measure of variation or dispersion.
According to Spiegel, The degree to which the numerical value tend to spread about an average
value is called the variation or dispersion of the data.
MEASURE OF DISPERSION
Various measures of dispersion can be classified into two broad categories:
1. The measures which express the spread of observations in terms of the distance between the values
of selected observations. These are also termed distance measures, e.g. range and quartile
deviation.
2. The measures which express the spread of observations in terms of the average of the deviations
of the observations from some central value, e.g. mean deviation and standard deviation.
MEASURE OF ABSOLUTE DISPERSION
A measure of dispersion is said to be absolute when it states the actual amount by which the
value of an item deviates, on average, from a measure of central tendency.
RANGE
It is the simplest measure of dispersion. The range is the difference between the two extreme
observations of the distribution. If A and B are the greatest and the smallest observations
respectively in a distribution, then its range is given by:
Range = A − B
ADVANTAGES OF RANGE
2. It is simple to compute.
DISADVANTAGES OF RANGE
1. It is not based on all the observations; it depends only on the two extreme values.

QUARTILE DEVIATION
Quartile deviation (or semi-interquartile range) is given by Q.D. = (Q3 − Q1) / 2 ,
where Q1 and Q3 are the first and third quartiles of the distribution respectively.
Quartile deviation is definitely a better measure than range, as it makes use of 50% of the data.
But since it ignores the other 50% of the data, it cannot be regarded as a reliable measure.
ADVANTAGES OF QUARTILE DEVIATION
2. It is simple to compute.
4. It is a better measure of dispersion than range, as it is based on 50% of the data rather than
the two extreme observations.
DISADVANTAGES OF QUARTILE DEVIATION
2. It is not based on all the observations, but only on 50% of the data, leaving out the other
50%.
3. It is not capable of further mathematical treatment.
MEAN DEVIATION
Mean deviation about a point A is given by (1/N) ∑ fi |xi − A| , where |xi − A| represents the
modulus or absolute value of the deviation (xi − A), the negative sign being ignored.
Mean deviation is least when taken about the median, and the algebraic sum of the deviations
themselves is zero when taken about the mean.
1. It is not well defined and not suitable for further mathematical treatment.
2. The step of ignoring the signs of the deviations creates artificiality and renders it useless
for further mathematical treatment.
VARIANCE AND STANDARD DEVIATION
Variance is the average of the squares of the deviations taken from the mean, and the positive
square root of the variance is called the STANDARD DEVIATION. The variance of a random
variable is also the expected value of the squared deviation from its mean. The standard deviation
of a frequency distribution is given by:

σ = √( (1/N) ∑ fi (xi − x̄)² ) ; where x̄ is the arithmetic mean of the distribution
The standard deviation explains the average amount of variation on either side of the mean.
The step of squaring the deviations (xi − x̄) overcomes the drawback of ignoring the signs in
mean deviation. Standard deviation is also suitable for further mathematical treatment.
Moreover , of all the measures ,standard deviation is affected least by fluctuations of sampling .
Thus ,we see that standard deviation satisfies almost all the properties laid down for an ideal
measure of dispersion except for the general nature of extracting the square root which is not
readily comprehensible for a non mathematical person . Thus we may regard standard deviation
as the best and the most powerful measure of dispersion .
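The standard-deviation formula for a frequency distribution can be sketched as follows; the mid-values and frequencies are illustrative assumptions:

```python
# Variance and standard deviation of an assumed frequency distribution:
# sigma = sqrt( (1/N) * sum(fi * (xi - mean)^2) ).
import math

x = [5, 15, 25, 35, 45]      # class mid-values
f = [4, 6, 10, 6, 4]         # frequencies
N = sum(f)

mean = sum(fi * xi for fi, xi in zip(f, x)) / N
variance = sum(fi * (xi - mean) ** 2 for fi, xi in zip(f, x)) / N
sigma = math.sqrt(variance)  # standard deviation

print(mean, variance, sigma)
```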
❖ The standard deviation explains the average amount of variation on either side of the
mean; in this way the drawbacks of the variance are overcome.
❖ In a symmetrical or moderately skewed distribution, the following relationship exists
between the mean deviation, quartile deviation and standard deviation:
6 Q.D. = 5 M.D. = 4 S.D.
3. It cannot be used to compare the dispersion of two or more series in different units.
The root mean square deviation about an arbitrary point A is:

s = √( (1/N) ∑ fi (xi − A)² )
SHEPPARD'S CORRECTION
When the observations are grouped in classes, all the observations are taken to be equal to the
mid-point of the class; this introduces some error, known as grouping error. Sheppard
suggested a correction, known as Sheppard's correction, given by h²/12, where h is the width
of the class:

Sheppard's correction for variance = variance from grouped data − h²/12 = σ² − h²/12
MEASURE OF RELATIVE DISPERSION
In situations where the series to be compared are in different units of measurement, or their
means differ significantly in size, an absolute measure of dispersion does not solve our problem.
A relative measure of dispersion is obtained by dividing the absolute measure by the
quantity with respect to which the absolute dispersion has been computed:

Relative dispersion = absolute dispersion / average used

It is a pure number, usually expressed in percentage form, and is used to compare the dispersion
of two or more series.
❖ Coefficient of range = (A − B) / (A + B) , where A and B are the greatest and the smallest
items in the series.
❖ Coefficient of quartile deviation = (Q3 − Q1) / (Q3 + Q1) , where Q1 and Q3 are the first
and the third quartiles.
❖ Coefficient of mean deviation = mean deviation / (average from which it is calculated)
❖ Coefficient of standard deviation = standard deviation / mean = σ / x̄
❖ Coefficient of variation : 100 times the coefficient of dispersion based upon the standard
deviation is called the coefficient of variation, i.e.
C.V. = 100 · (σ / x̄)
According to Professor Karl Pearson, who suggested this measure, the coefficient of
variation is the percentage variation in the mean, the standard deviation being considered as the
total variation in the mean.
For comparing the variability of two series, we calculate the coefficient of variation for
each series. The series having greater coefficient of variation is said to be more variable
than the other and the series having lesser coefficient of variation is said to be more
consistent (or homogeneous) than the other.
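Comparing two series by coefficient of variation can be sketched as follows; the two data sets are illustrative assumptions:

```python
# Coefficient of variation C.V. = 100 * sigma / mean, used to compare
# the consistency of two assumed series.
import statistics

def cv(data):
    """Coefficient of variation, in percent."""
    return 100 * statistics.pstdev(data) / statistics.mean(data)

series_a = [50, 52, 48, 51, 49]      # tightly clustered
series_b = [30, 70, 50, 90, 10]      # widely spread, same mean

# The series with the smaller C.V. is the more consistent one.
print(cv(series_a), cv(series_b))
assert cv(series_a) < cv(series_b)
```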
MOMENTS
Moments are popularly used to describe the characteristics of a distribution. They represent a
convenient and unifying method for summarizing many of the most commonly used statistical
measures, such as measures of central tendency, variation, skewness and kurtosis. In statistics, the
rth moment is the arithmetic mean of the rth powers of the deviations taken from either the mean
or an arbitrary point of a distribution. Moments can be raw moments, central moments, or
moments about any arbitrary point. Generally, in any frequency distribution, four moments are
obtained, known as the first, second, third and fourth moments. These four moments describe the
mean, variance, skewness and kurtosis of a frequency distribution; their calculation thus yields
features of a distribution which are of statistical importance.
Moments can be classified in raw and central moment. Raw moments are measured about any
arbitrary point A (say). If A is taken to be zero then raw moments are called moments about
origin. When A is taken to be Arithmetic mean we get central moments. The first raw moment
about origin is mean whereas the first central moment is zero. The second raw and central
moments are mean square deviation and variance, respectively. The third and fourth moments are
useful in measuring skewness and kurtosis.
When deviations are taken about an arbitrary point, the moments are given as follows. If
x1 , x2 , …, xn are the n values of X having frequencies f1, f2, …, fn respectively, then the rth
moment about an arbitrary point A is:

μr′ = (1/N) ∑ fi (xi − A)^r ; r = 0, 1, 2, 3, … , where N = ∑ fi
For ungrouped data, the rth order moment about the origin is defined as:

mr = (1/n) ∑ (xi − 0)^r = (1/n) ∑ xi^r ; where r = 1, 2, …

If x1 , x2 , …, xn are the n values of X having frequencies f1, f2, …, fn respectively, then the rth
moment about the origin is:

mr = (1/N) ∑ fi xi^r ; where N = ∑ fi
The first order moment about the mean is zero, because the algebraic sum of deviations taken
about the mean is zero.
If x1 , x2 , …, xn are the n values of X having frequencies f1, f2, …, fn respectively, then the rth
moment about the mean (the central moment) is:

μr = (1/N) ∑ fi (xi − x̄)^r

In particular, μ2 = μ2′ − μ1′²
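The raw and central moments, and the relation μ2 = μ2′ − μ1′² above, can be sketched as follows; the frequency distribution is an illustrative assumption:

```python
# Raw moments about the origin and central moments about the mean for
# an assumed frequency distribution, verifying mu2 = mu2' - (mu1')^2.
x = [2, 4, 6, 8]
f = [1, 3, 3, 1]
N = sum(f)

def moment(r, about=0.0):
    """rth moment of the distribution about a given point."""
    return sum(fi * (xi - about) ** r for fi, xi in zip(f, x)) / N

mu1p = moment(1)                 # first raw moment about origin = mean
mu2p = moment(2)                 # second raw moment about origin
mu2 = moment(2, about=mu1p)      # second central moment = variance

assert abs(mu2 - (mu2p - mu1p ** 2)) < 1e-12
print(mu1p, mu2)
```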
ABSOLUTE MOMENT
For the frequency distribution xi | fi ; i = 1, 2, …, n, the rth absolute moment of the variable x
about the origin is given by:

(1/N) ∑ fi |xi|^r ; where N = ∑ fi

The rth absolute moment of the variable about the mean x̄ is given by:

(1/N) ∑ fi |xi − x̄|^r
In the case of a grouped frequency distribution, while calculating moments we assume that the
frequencies are concentrated at the mid-points of the class intervals. The fundamental
assumption made in forming class intervals is that the frequencies are uniformly
distributed about the mid-points of the class intervals; all moment calculations for
grouped frequency distributions rely on this assumption. The aggregate of the observations (or of
their powers) in a class is approximated by multiplying the class mid-point (or its power) by the
corresponding class frequency. If the distribution is symmetrical or only slightly asymmetrical, and the
class intervals are not greater than one-twentieth of the range, this assumption is very nearly true.
But since the assumption is not true in general, some error, called grouping error, creeps into
the calculation of the moments; W.F. Sheppard proposed corrections for these grouping errors.
1. β1 is defined as:

β1 = μ3² / μ2³
β1 as a measure of skewness does not tell the direction of skewness, i.e. positive or
negative, because μ3, being the sum of cubes of the deviations from the mean, may be positive or
negative, but μ3² is always positive; also μ2, being the variance, is always positive. Hence β1 is
always positive. This drawback is removed if we calculate Karl Pearson's coefficient of
skewness γ1, which is the square root of β1, i.e.

γ1 = +√β1

Then the sign of the skewness depends upon the value of μ3, whether it is positive or negative.
It is advisable to use γ1 as the measure of skewness.
It may be pointed out that these co - efficients are pure numbers independent of units of
measurement.
CONCEPT OF SKEWNESS
Skewness means lack of symmetry. In mathematics, a figure is called symmetric if there exists a
point in it through which if a perpendicular is drawn on the X-axis, it divides the figure into two
congruent parts i.e. identical in all respect or one part can be superimposed on the other i.e.
mirror images of each other.
❖ SYMMETRY CURVE
In Statistics, a distribution is called symmetric if mean, median and mode coincide. Otherwise,
the distribution becomes asymmetric.
❖ NEGATIVELY SKEWED CURVE
If the left tail is longer, we get a negatively skewed distribution, for which mean < median <
mode.
❖ POSITIVELY SKEWED CURVE
If the right tail is longer, we get a positively skewed distribution, for which mean > median >
mode.
DIFFERENCE BETWEEN VARIANCE AND SKEWNESS
The following two points of difference between variance and skewness should be carefully
noted.
1. Variance tells us about the amount of variability while skewness gives the direction of
variability.
2. In business and economic series, measures of variation have greater practical application than
measures of skewness. However, in medical and life science field measures of skewness have
greater practical applications than the variance.
MEASURES OF SKEWNESS
Measures of skewness help us to know to what degree and in which direction (positive or
negative) the frequency distribution has a departure from symmetry. Although positive or
negative skewness can be detected graphically depending on whether the right tail or the left tail
is longer but, we don’t get idea of the magnitude. Besides, borderline cases between symmetry
and asymmetry may be difficult to detect graphically. Hence some statistical measures are
required to find the magnitude of lack of symmetry.
The absolute measures of skewness are M − Md and M − M0, where M is the mean, Md is the
median and M0 is the mode of the distribution.
As in dispersion, for comparing two series we do not calculate these absolute measures; instead
we calculate relative measures, called coefficients of skewness, which are pure numbers
independent of the units of measurement.
RELATIVE MEASURE OF SKEWNESS
This method is most frequently used for measuring skewness. The formula for measuring
coefficient of skewness is given by :
Sk = (Mean − Mode) / σ
The value of this coefficient is zero in a symmetrical distribution. If the mean is greater than the
mode, the coefficient of skewness is positive; otherwise it is negative. The value of Karl
Pearson's coefficient of skewness usually lies between −1 and +1 for a moderately skewed
distribution. If the mode is not well defined, we use the formula:
Sk = 3 (Mean − Median) / σ
Here, −3 ≤ 𝑆𝑘 ≤ 3 .
This method is based on quartiles. The coefficient of skewness is given by:

Sk = ((Q3 − Q2) − (Q2 − Q1)) / ((Q3 − Q2) + (Q2 − Q1)) = (Q3 + Q1 − 2Q2) / (Q3 − Q1)
The value of Sk would be zero if it is a symmetrical distribution. If the value is greater than zero,
it is positively skewed and if the value is less than zero it is negatively skewed distribution. It
will take value between +1 and -1.
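The two relative measures above can be sketched together; the data set is an illustrative assumption with a deliberately stretched right tail:

```python
# Karl Pearson's coefficient (using the median form) and the
# quartile-based coefficient of skewness for an assumed sample.
import statistics

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]        # right tail pulled out

mean = statistics.mean(data)
median = statistics.median(data)
sigma = statistics.pstdev(data)
q1, q2, q3 = statistics.quantiles(data, n=4)  # Q1, Q2 (median), Q3

pearson_sk = 3 * (mean - median) / sigma
quartile_sk = (q3 + q1 - 2 * q2) / (q3 - q1)

print(pearson_sk, quartile_sk)               # both positive here
```

Note that `statistics.quantiles` uses the exclusive method by default, so its quartiles may differ slightly from hand calculations that use other conventions; the sign of the coefficient agrees either way.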
Karl Pearson defined the following β and γ coefficients of skewness, based upon the second and
third central moments:

β1 = μ3² / μ2³ , and γ1 = +√β1
NOTE :
❖ If the value of mean, median and mode are same in any distribution, then the
skewness does not exist in that distribution. Larger the difference in these values,
larger the skewness;
❖ If the sums of the frequencies on both sides of the mode are equal, then skewness does
not exist;
❖ If the first and third quartiles are equidistant from the median, then skewness does not
exist; similarly, if the deciles (first and ninth) and percentiles (first and
ninety-ninth) are equidistant from the median, then there is no
asymmetry;
❖ If the sums of the positive and negative deviations obtained from the mean, median or
mode are equal, then there is no asymmetry;
❖ If the graph of the data forms a normal curve, and when folded at the middle
one part overlaps fully with the other, then there is no asymmetry.
KURTOSIS
If we have knowledge of the measures of central tendency, dispersion and skewness, even
then we cannot get a complete idea of a distribution. In addition to these measures, we need
another measure to get a complete idea about the shape of the distribution, which can be
studied with the help of kurtosis. Prof. Karl Pearson called it the “convexity of a curve”.
Kurtosis gives a measure of the flatness or peakedness of a distribution.
The degree of kurtosis of a distribution is measured relative to that of a normal curve. The curves
with greater peakedness than the normal curve are called “Leptokurtic”. The curves which are
more flat than the normal curve are called “Platykurtic”. The normal curve is called
“Mesokurtic.”
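This classification can be made quantitative with the standard moment coefficient of kurtosis, β2 = μ4 / μ2² (an assumption here, since the text describes kurtosis only qualitatively): β2 > 3 indicates a leptokurtic curve, β2 = 3 a mesokurtic (normal) curve, and β2 < 3 a platykurtic curve. A minimal sketch with an assumed, roughly uniform data set:

```python
# Moment coefficient of kurtosis beta2 = mu4 / mu2^2, where mu2 and mu4
# are the second and fourth central moments; beta2 = 3 for a normal curve.
def beta2(data):
    n = len(data)
    mean = sum(data) / n
    mu2 = sum((x - mean) ** 2 for x in data) / n   # second central moment
    mu4 = sum((x - mean) ** 4 for x in data) / n   # fourth central moment
    return mu4 / mu2 ** 2

flat = [1, 2, 3, 4, 5, 6]            # roughly uniform: platykurtic
print(beta2(flat))                    # well below 3
```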
❖ KURTOSIS CURVE