
Probability And Statistics - MATH 2205

Statistics Report

Group I, Section E
11/17/23
Submitted To: Ashek Ahmed


GROUP I

 Graph Representation and Measures of Central Tendency
o Md. Raisul Islam – 011 222 249

 Measures of Dispersion
o Sadiya Sultana Mariya – 011 222 176

 Stem-Leaf Plot, Box-Whisker Plot and Moments
o Md. Rubayet Khan – 011 222 284

 Correlation and Regression Analysis
o Ishmam Ahmed – 011 223 0141

NB: Two of our team members were absent; we were unable to reach one of them, and the other did not respond. Hence, we could not finish the report entirely and are submitting it late as a result. We hope you will be kind enough to consider this. Thank you.
STATISTICS

Measures of Central Tendency


Measures of central tendency are fundamental in statistical analysis, serving as
crucial tools for summarizing and understanding datasets. Among these, the
Arithmetic Mean, Geometric Mean, and Harmonic Mean each provide unique
insights into the central tendency of a set of values.

Ungrouped Data: The following are the ages of 15 employees of a company:
20, 25, 31, 39, 21, 35, 40, 69, 35, 42, 61, 51, 59, 54, 55

Grouped Data: The following table gives the ages of the 15 employees of the company:
Age 20-29 30-39 40-49 50-59 60-69
Frequency 3 4 2 4 2

Mean:

1. Arithmetic Mean: The Arithmetic Mean, often referred to simply as the "mean," is a
familiar and widely used measure of central tendency. Calculated by summing all values in a
dataset and dividing by the total number of observations, the arithmetic mean represents the
balance point of the dataset. Its strength lies in its simplicity and ease of interpretation,
making it suitable for a broad range of applications.

Ungrouped:
AM = ∑x / n = (20+25+31+39+21+35+40+69+35+42+61+51+59+54+55) / 15 = 42.467

Grouped:
Class Interval | Class Boundary | Class Mark (x) | Frequency (f) | fx
20-29 | 20.5-30.5 | 25.5 | 3 | 76.5
30-39 | 30.5-40.5 | 35.5 | 4 | 142
40-49 | 40.5-50.5 | 45.5 | 2 | 91
50-59 | 50.5-60.5 | 55.5 | 4 | 222
60-69 | 60.5-70.5 | 65.5 | 2 | 131
∑f = 15 | ∑fx = 662.5

A.M = ∑fx / ∑f = 662.5 / 15 = 44.167

Key Takeaways for Arithmetic Mean: 1. Reflects the central point of the dataset. 2. Sensitive to extreme values. 3. Appropriate for symmetric distributions.
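As a quick illustration, the two calculations above can be reproduced in a few lines of Python (a minimal sketch; the function names are our own):

```python
def arithmetic_mean(values):
    # Ungrouped data: sum of the observations divided by their count
    return sum(values) / len(values)

def grouped_arithmetic_mean(class_marks, freqs):
    # Grouped data: sum of f*x over the class marks, divided by sum of f
    return sum(f * x for x, f in zip(class_marks, freqs)) / sum(freqs)

ages = [20, 25, 31, 39, 21, 35, 40, 69, 35, 42, 61, 51, 59, 54, 55]
marks = [25.5, 35.5, 45.5, 55.5, 65.5]
freqs = [3, 4, 2, 4, 2]

print(round(arithmetic_mean(ages), 3))                  # 42.467
print(round(grouped_arithmetic_mean(marks, freqs), 3))  # 44.167
```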

2. Geometric Mean: The Geometric Mean offers a different perspective by


considering the product of all values raised to the power of the reciprocal of
the total number of observations. It is particularly useful for datasets
involving rates of growth, such as financial returns or environmental factors.
The geometric mean is less influenced by extreme values, making it suitable
for skewed distributions.

Ungrouped:

GM = anti-log(∑log x / n) = anti-log(23.987 / 15) = 39.731

x logx
20 1.301
25 1.397
31 1.491
39 1.591
21 1.322
35 1.544
40 1.602
69 1.838
35 1.544
42 1.623
61 1.785
51 1.707
59 1.77
55 1.74
54 1.732

∑logx=23.987

Grouped:

Class Interval | Class Boundary | Class Mark (x) | Frequency (f) | log x | f log x
20-29 | 20.5-30.5 | 25.5 | 3 | 1.406 | 4.218
30-39 | 30.5-40.5 | 35.5 | 4 | 1.55 | 6.2
40-49 | 40.5-50.5 | 45.5 | 2 | 1.658 | 3.316
50-59 | 50.5-60.5 | 55.5 | 4 | 1.744 | 6.976
60-69 | 60.5-70.5 | 65.5 | 2 | 1.816 | 3.632
∑f = 15 | ∑f log x = 24.342

G.M = anti-log(∑f log x / ∑f) = anti-log(24.342 / 15) = 41.957

Key Takeaways for Geometric Mean: 1. Suitable for logarithmic growth or rates. 2. Less sensitive to extreme values. 3. Appropriate for positively skewed distributions.
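The same figures can be checked in Python (a sketch; note that the report's values of 39.731 and 41.957 come from logs rounded to three decimal places, so the unrounded results differ slightly):

```python
import math

def geometric_mean(values):
    # anti-log of the mean of the logs; base 10 matches the hand calculation
    return 10 ** (sum(math.log10(v) for v in values) / len(values))

def grouped_geometric_mean(class_marks, freqs):
    # grouped version: weight each class-mark log by its frequency
    n = sum(freqs)
    return 10 ** (sum(f * math.log10(x) for x, f in zip(class_marks, freqs)) / n)

ages = [20, 25, 31, 39, 21, 35, 40, 69, 35, 42, 61, 51, 59, 54, 55]
marks = [25.5, 35.5, 45.5, 55.5, 65.5]
freqs = [3, 4, 2, 4, 2]

print(round(geometric_mean(ages), 2))                  # ~39.76
print(round(grouped_geometric_mean(marks, freqs), 2))  # ~41.98
```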

3. Harmonic Mean: The Harmonic Mean is calculated by dividing the total


number of observations by the sum of the reciprocals of the values. This
mean is particularly useful in scenarios involving rates or ratios, such as
speed or efficiency. The harmonic mean is more resistant to extreme values
than the arithmetic mean, providing a balance between extreme values and
the dataset's central tendency.
Ungrouped:

x | 1/x
20 | 0.05
25 | 0.04
31 | 0.032
39 | 0.026
21 | 0.048
35 | 0.029
40 | 0.025
69 | 0.014
35 | 0.029
42 | 0.024
61 | 0.016
51 | 0.019
59 | 0.017
55 | 0.018
54 | 0.019

∑(1/x) = 0.406

HM = n / ∑(1/x) = 15 / 0.406 = 36.95

Grouped:
HM = ∑f / ∑(f/x)

Class Interval | Class Boundary | Class Mark (x) | Frequency (f) | f/x
20-29 | 20.5-30.5 | 25.5 | 3 | 0.118
30-39 | 30.5-40.5 | 35.5 | 4 | 0.113
40-49 | 40.5-50.5 | 45.5 | 2 | 0.044
50-59 | 50.5-60.5 | 55.5 | 4 | 0.072
60-69 | 60.5-70.5 | 65.5 | 2 | 0.031
∑f = 15 | ∑(f/x) = 0.378

HM = 15 / 0.378 = 39.683 (Ans).

Key Takeaways for Harmonic Mean: 1. Suitable for rates or ratios. 2. Less sensitive to extreme values. 3. Appropriate for datasets with a reciprocal relationship.
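A Python sketch of both harmonic means follows; with unrounded reciprocals the results are about 36.98 (ungrouped) and 39.80 (grouped), slightly different from the hand calculation, which rounds each reciprocal to three decimals:

```python
def harmonic_mean(values):
    # n divided by the sum of reciprocals
    return len(values) / sum(1 / v for v in values)

def grouped_harmonic_mean(class_marks, freqs):
    # grouped version: sum(f) divided by sum(f/x) over the class marks
    return sum(freqs) / sum(f / x for x, f in zip(class_marks, freqs))

ages = [20, 25, 31, 39, 21, 35, 40, 69, 35, 42, 61, 51, 59, 54, 55]
marks = [25.5, 35.5, 45.5, 55.5, 65.5]
freqs = [3, 4, 2, 4, 2]

print(round(harmonic_mean(ages), 2))                  # ~36.98
print(round(grouped_harmonic_mean(marks, freqs), 2))  # ~39.8
```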

In summary, the choice between arithmetic, geometric, or harmonic mean


depends on the nature of the dataset and the specific characteristics under
consideration. Understanding the strengths and applications of each mean
empowers researchers and analysts to choose the most relevant measure for
their data, ensuring a more accurate representation of central tendency.

Median: “When the observations are arranged in ascending or descending order, the value that divides the distribution into two equal parts is called the median”.

Ungrouped:
If n is odd, Median = ((n+1)/2)th data
If n is even, Median = [(n/2)th data + ((n/2)+1)th data] / 2

Here, No. of data = 15
In ascending order: 20, 21, 25, 31, 35, 35, 39, 40, 42, 51, 54, 55, 59, 61, 69
So, Median = ((15+1)/2)th data = 8th data = 40

Grouped:
Median = l + (h/f)(n/2 − C)
Here,
Median class = the class whose cumulative frequency first reaches n/2
l = Lower boundary of the median class
n = ∑f
h = Class size
f = Frequency of the median class
C = Cumulative frequency of the class preceding the median class

Class Interval | Class Boundary | Class Mark (x) | Frequency (f) | Cumulative Frequency
20-29 | 20.5-30.5 | 25.5 | 3 | 3
30-39 | 30.5-40.5 | 35.5 | 4 | 7
40-49 | 40.5-50.5 | 45.5 | 2 | 9
50-59 | 50.5-60.5 | 55.5 | 4 | 13
60-69 | 60.5-70.5 | 65.5 | 2 | 15
∑f = 15
So, Median = 40.5 + (10/2)(7.5 − 7) = 43 (ans).
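The interpolation above can be sketched in Python (a minimal illustration; the function name is ours):

```python
def grouped_median(class_boundaries, freqs):
    # class_boundaries: list of (lower, upper) pairs; freqs: matching frequencies
    n = sum(freqs)
    cum = 0  # cumulative frequency of the classes before the current one
    for (low, high), f in zip(class_boundaries, freqs):
        if cum + f >= n / 2:       # first class whose cumulative frequency reaches n/2
            h = high - low         # class size
            return low + h / f * (n / 2 - cum)
        cum += f

bounds = [(20.5, 30.5), (30.5, 40.5), (40.5, 50.5), (50.5, 60.5), (60.5, 70.5)]
freqs = [3, 4, 2, 4, 2]
print(grouped_median(bounds, freqs))  # 43.0
```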
Mode:

Ungrouped Data: "A value that occurs most frequently in a data is called mode".

Here, 35 appeared twice.


So, the mode is 35.

Grouped Data:
"A value which has the largest frequency in a set of data is called mode".
Mode = l + (D1 / (D1 + D2)) × h

Here, Modal class = the class with the largest frequency
l = Lower boundary of the modal class
D1 = (Frequency of the modal class) − (Frequency of the class preceding the modal class)
D2 = (Frequency of the modal class) − (Frequency of the class following the modal class)

For this dataset, two classes (30-39 and 50-59) share the largest frequency (4), so the grouped distribution is bimodal.
Quartiles, Deciles & Percentiles for Ungrouped Data:

Arrange the data in ascending order, then


1. Quartiles: Qi = (i(n+1)/4)th value of the observation, where i = 1, 2, 3.
2. Deciles: Di = (i(n+1)/10)th value of the observation, where i = 1, 2, 3, …, 9.
3. Percentiles: Pi = (i(n+1)/100)th value of the observation, where i = 1, 2, …, 99.

In ascending order: 20, 21, 25, 31, 35, 35, 39, 40, 42, 51, 54, 55, 59, 61, 69
Here, Q2 = (2 × 16/4)th = 8th data = 40

D7 = (7 × 16/10)th = 11.2th data = 11th data + 0.2 × (12th data − 11th data) = 54 + 0.2(55 − 54) = 54.2

P91 = (91 × 16/100)th = 14.56th data = 14th data + 0.56 × (15th data − 14th data) = 61 + 0.56(69 − 61) = 61 + 4.48 = 65.48 (ans).
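The positional rule and linear interpolation used above can be written as one small helper (a sketch; the names are ours):

```python
def positional_quantile(data, i, m):
    # i-th m-tile via the i(n+1)/m positional rule with linear interpolation
    s = sorted(data)
    pos = i * (len(s) + 1) / m
    k = int(pos)                 # whole part: the k-th ordered value (1-based)
    frac = pos - k               # fractional part used for interpolation
    if k >= len(s):              # position beyond the last value
        return float(s[-1])
    return s[k - 1] + frac * (s[k] - s[k - 1])

ages = [20, 25, 31, 39, 21, 35, 40, 69, 35, 42, 61, 51, 59, 54, 55]
print(round(positional_quantile(ages, 2, 4), 2))     # Q2  = 40.0
print(round(positional_quantile(ages, 7, 10), 2))    # D7  = 54.2
print(round(positional_quantile(ages, 91, 100), 2))  # P91 = 65.48
```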
Quartiles, Deciles & Percentiles for Grouped Data:
Class Interval | Class Boundary | Class Mark (x) | Frequency (f) | Cumulative Frequency
20-29 | 20.5-30.5 | 25.5 | 3 | 3
30-39 | 30.5-40.5 | 35.5 | 4 | 7
40-49 | 40.5-50.5 | 45.5 | 2 | 9
50-59 | 50.5-60.5 | 55.5 | 4 | 13
60-69 | 60.5-70.5 | 65.5 | 2 | 15
∑f = 15

Q3 = 50.5 + (10/4)(3 × 15/4 − 9) = 50.5 + 5.625 = 56.125

D4 = 30.5 + (10/4)(4 × 15/10 − 3) = 30.5 + 7.5 = 38

P62 = 50.5 + (10/4)(62 × 15/100 − 9) = 50.5 + 0.75 = 51.25 (ans).

Histogram:

[Histogram of the grouped age data: frequency (0 to 4.5) plotted against the class marks 25.5, 35.5, 45.5, 55.5, 65.5]
Pie Chart:

[Pie chart of the frequencies of the age classes 20-29, 30-39, 40-49, 50-59, 60-69]

Frequency Graph:

[Frequency polygon: frequency (0 to 4.5) plotted against the age classes 20-29, 30-39, 40-49, 50-59, 60-69]

Done by:
MD. Raisul Islam
ID: 011 222 249
Measures of Dispersion
Measures of dispersion play a crucial role in statistical analysis by providing insights into the
spread or variability of a dataset. They complement measures of central tendency, offering a
more comprehensive understanding of the distribution of values. Dispersion indicates how
much individual data points deviate from the central value, shedding light on the data's
reliability and consistency. Several statistical measures capture different aspects of dispersion,
helping analysts and researchers interpret data patterns more effectively.

Range: The range is a basic indicator of dispersion that provides a clear, quick view of a dataset's spread. It is computed as the difference between the maximum and minimum values in the dataset. Essentially, the range offers a brief assessment of the overall variability seen in the data. Although it is an easy-to-understand and straightforward measure, it depends only on the two extreme values and is therefore susceptible to outliers. Notwithstanding these drawbacks, the range provides a fundamental starting point for comprehending the degree of variability in a dataset.

Quartile Deviation: The statistic that quantifies this spread is known as the quartile deviation. Dispersion here means the state of being distributed or spread: the degree to which numerical data are expected to deviate from an average value. In simple terms, dispersion helps in understanding the data's distribution.

In mathematics, the quartile deviation is equal to half of the difference between the upper and the lower quartile. Here QD stands for quartile deviation, Q3 represents the upper quartile, and Q1 the lower quartile.

Quartile Deviation is also known as the Semi Interquartile range.

Quartile Deviation Formula


Suppose Q1 is the lower quartile, Q2 is the median, and Q3 is the upper quartile for the given
data set, then its quartile deviation can be calculated using the following formula.

QD = (Q3 – Q1)/2
Quartile Deviation for Ungrouped Data
For an ungrouped data, quartiles can be obtained using the following formulas,

Q1 = [(n+1)/4]th item

Q2 = [(n+1)/2]th item

Q3 = [3(n+1)/4]th item

Where n represents the total number of observations in the given data set.

Also, Q2 is the median of the given data set, Q1 is the median of the lower half of the data set
and Q3 is the median of the upper half of the data set

Quartile Deviation for Grouped Data

For grouped data, the quartiles are found by interpolating within the quartile class:

Qi = l + (h/f)(in/4 − C), for i = 1, 2, 3

Here, l is the lower boundary of the class containing the i-th quartile, h is the class size, f is the frequency of that class, C is the cumulative frequency of the preceding class, and n = ∑f. As before,

QD = (Q3 − Q1)/2

Mean Deviation:
Mean Deviation, a crucial measure of dispersion in statistical analysis, gauges the
average magnitude of deviations of individual data points from a central tendency
measure. It offers insights into the spread and variability of a dataset, providing a
more nuanced understanding than some other measures. Mean deviation
considers the absolute differences between each data point and a chosen central
value, reflecting the overall "average" distance of data points from that central
measure. There are three primary types of mean deviation, each calculated
concerning different measures of central tendency: mean, median, and mode.
 Mean Deviation from the Mean:
o Formula: MD = ∑|xi − x̄| / n
o Here, xi represents individual data points, x̄ is the mean of the dataset, and n is the number of data points.
 Mean Deviation from the Median or the Mode:
o Formula: the same expression, with x̄ replaced by the median or the mode of the dataset.

Standard deviation:
• Standard Deviation is a measure of dispersion. It describes how far the data values are spread out from the mean value.
• Standard Deviation indicates a “typical” deviation from the mean. It is a popular measure of variability because it is expressed in the original units of measure of the data set. Like the variance, it is small when the data points are close to the mean and large when they are highly spread out from the mean. Standard Deviation, the most widely used measure of dispersion, is based on all values; therefore, a change in even one value affects it. It is independent of origin but not of scale. It is also useful in certain advanced statistical problems.

Formulae:

For ungrouped data: SD = √( ∑(x − x̄)² / n )

For grouped data, we can also use this formula: SD = √( ∑f(x − x̄)² / ∑f )

Problems and Solution


Ungrouped Data: The following are the ages of 15 employees of a company:
20, 25, 31, 39, 21, 35, 40, 69, 35, 42, 61, 51, 59, 54, 55

Grouped Data: The following table gives the ages of the 15 employees of the company:
Age 20-29 30-39 40-49 50-59 60-69
Frequency 3 4 2 4 2
Range:

For ungrouped:
R= Highest – Lowest = 69-20= 49
For Grouped:
R = Highest class boundary − Lowest class boundary = 70.5 − 20.5 = 50 (ans)

Quartile Deviation:

For ungrouped:
In ascending order: 20, 21, 25, 31, 35, 35, 39, 40, 42, 51, 54, 55, 59, 61, 69
Q1= 31
Q3=55

QD= (55-31)/2 =12 (ans).

For Grouped:

Class Interval | Class Boundary | Class Mark (x) | Frequency (f) | Cumulative Frequency
20-29 | 20.5-30.5 | 25.5 | 3 | 3
30-39 | 30.5-40.5 | 35.5 | 4 | 7
40-49 | 40.5-50.5 | 45.5 | 2 | 9
50-59 | 50.5-60.5 | 55.5 | 4 | 13
60-69 | 60.5-70.5 | 65.5 | 2 | 15
∑f = 15

Q1 = 30.5 + (10/4)(15/4 − 3) = 32.375

Q3 = 50.5 + (10/4)(3 × 15/4 − 9) = 56.125

QD = (56.125 − 32.375) / 2 = 11.875 (ans).
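The grouped interpolation used for Q1 and Q3 can be sketched in Python (the same l + (h/f)(in/4 − C) rule; function name is ours):

```python
def grouped_quantile(class_boundaries, freqs, i, m):
    # i-th m-tile for grouped data: l + (h/f) * (i*n/m - C)
    n = sum(freqs)
    target = i * n / m
    cum = 0  # cumulative frequency of the preceding classes
    for (low, high), f in zip(class_boundaries, freqs):
        if cum + f >= target:
            return low + (high - low) / f * (target - cum)
        cum += f

bounds = [(20.5, 30.5), (30.5, 40.5), (40.5, 50.5), (50.5, 60.5), (60.5, 70.5)]
freqs = [3, 4, 2, 4, 2]
q1 = grouped_quantile(bounds, freqs, 1, 4)
q3 = grouped_quantile(bounds, freqs, 3, 4)
print(q1, q3, (q3 - q1) / 2)  # 32.375 56.125 11.875
```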


Mean Deviation:

For Ungrouped (deviations from the ungrouped mean x̄ = 42.467):
∑|x − x̄| = 22.467 + 17.467 + 11.467 + 3.467 + 21.467 + 7.467 + 2.467 + 26.533 + 7.467 + 0.467 + 18.533 + 8.533 + 16.533 + 11.533 + 12.533 = 188.4
MD = 188.4 / 15 = 12.56

For Grouped (deviations from the grouped mean x̄ = 44.167):

Class Interval | Class Boundary | Class Mark (x) | Frequency (f) | |x − x̄| | f|x − x̄|
20-29 | 20.5-30.5 | 25.5 | 3 | 18.667 | 56.0
30-39 | 30.5-40.5 | 35.5 | 4 | 8.667 | 34.667
40-49 | 40.5-50.5 | 45.5 | 2 | 1.333 | 2.667
50-59 | 50.5-60.5 | 55.5 | 4 | 11.333 | 45.333
60-69 | 60.5-70.5 | 65.5 | 2 | 21.333 | 42.667
∑f = 15 | ∑f|x − x̄| = 181.333

MD = 181.333 / 15 = 12.089

Standard Deviation:

For Ungrouped (x̄ = 42.467):
∑(x − x̄)² = 3215.733
SD = √(3215.733 / 15) = 14.642

For Grouped (deviations from the grouped mean x̄ = 44.167):

Class Interval | Class Boundary | Class Mark (x) | Frequency (f) | (x − x̄) | f(x − x̄)²
20-29 | 20.5-30.5 | 25.5 | 3 | −18.667 | 1045.333
30-39 | 30.5-40.5 | 35.5 | 4 | −8.667 | 300.444
40-49 | 40.5-50.5 | 45.5 | 2 | 1.333 | 3.556
50-59 | 50.5-60.5 | 55.5 | 4 | 11.333 | 513.778
60-69 | 60.5-70.5 | 65.5 | 2 | 21.333 | 910.222
∑f = 15 | ∑f(x − x̄)² = 2773.333

SD = √(2773.333 / 15) = 13.597 (ans)
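Both standard deviations can be rechecked with a short sketch (population formulas; note the grouped version measures deviations from the grouped mean, 44.167, not the ungrouped mean):

```python
def std_dev(values):
    # population standard deviation of ungrouped data
    m = sum(values) / len(values)
    return (sum((v - m) ** 2 for v in values) / len(values)) ** 0.5

def grouped_std_dev(class_marks, freqs):
    # population standard deviation of grouped data, using the grouped mean
    n = sum(freqs)
    m = sum(f * x for x, f in zip(class_marks, freqs)) / n
    return (sum(f * (x - m) ** 2 for x, f in zip(class_marks, freqs)) / n) ** 0.5

ages = [20, 25, 31, 39, 21, 35, 40, 69, 35, 42, 61, 51, 59, 54, 55]
marks = [25.5, 35.5, 45.5, 55.5, 65.5]
freqs = [3, 4, 2, 4, 2]

print(round(std_dev(ages), 3))                  # 14.642
print(round(grouped_std_dev(marks, freqs), 3))  # 13.597
```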

KEY Points
Range: *Provides a quick overview of data spread. *Sensitive to extreme values. *Limited in
capturing nuanced distribution patterns.
Coefficient of Range: *Normalizes range for comparison across datasets. *Particularly
useful when dealing with datasets of different scales.

Quartile Deviation: *Robust measure, less sensitive to extreme values. *Focuses on the
central 50% of the data. *Requires a cumulative frequency distribution for calculation.
Coefficient of Quartile Deviation: *Normalizes quartile deviation for comparison. *Useful
for assessing relative variability in datasets with different central tendencies.

Mean Deviation: *Measures the average deviation of data points. *Can be calculated from
the mean, median, or mode. *Provides a straightforward understanding of dispersion.
Coefficient of Mean Deviation: *Normalizes mean deviation for comparison. *Useful for
assessing relative variability across datasets.

Standard Deviation: *Widely used and sensitive measure of dispersion. *Accounts for
squared differences from the mean. *Offers a more nuanced perspective on variability.
Coefficient of Standard Deviation: *Normalizes standard deviation for comparison.
*Facilitates cross-dataset comparisons, especially when means differ.

WORK DONE BY:


Name: Sadiya Sultana Mariya
ID: 011 222 176
Probability And Statistics- MATH 2205

Ungrouped Data: The following are the ages of 15 employees of a company:
20, 25, 31, 39, 21, 35, 40, 69, 35, 42, 61, 51, 59, 54, 55

Stem-and-leaf plot:

2 | 0 1 5
3 | 1 5 5 9
4 | 0 2
5 | 1 4 5 9
6 | 1 9

Key: 2 | 0 = 20

Box and Whisker:
Using Qi = (i(n+1)/4)th value (n = 15): Q1 = 4th value = 31, Q2 = 8th value = 40, Q3 = 12th value = 55

Ungrouped Data: The following are the ages of the members of two teams who are participating in a Pokémon card game competition:
Team A: 11, 12, 22, 23, 24, 31, 38, 49
Team B: 13, 19, 11, 21, 26, 28, 28, 37, 33, 32, 46

Back-to-back stem-and-leaf plot (Team A leaves on the left, Team B leaves on the right):

    A |   | B
  2 1 | 1 | 1 3 9
4 3 2 | 2 | 1 6 8 8
  8 1 | 3 | 2 3 7
    9 | 4 | 6

Key: 2 | 1 | 3 = 12, 13

Box and Whisker:
Using Qi = (i(n+1)/4)th value:
Team A (n = 8): QA1 = 2.25th value = 14.5, QA2 = 4.5th value = 23.5, QA3 = 6.75th value = 36.25
Team B (n = 11): QB1 = 3rd value = 19, QB2 = 6th value = 28, QB3 = 9th value = 33

Stem and Leaf Plot Report


A stem and leaf plot is a graphical method that displays each data value split into a stem (its leading digit or digits) and a leaf (its final digit). Unlike most summary displays, it retains the raw data values while still showing the shape of their distribution. It is useful for descriptive data analysis and comparison of different data sets.

The main components of a stem and leaf plot are:

 A stem that represents the first digit or digits of each data value.
 A leaf that represents the last digit of each data value.
 A vertical line that separates the stem and the leaf.
 A key that explains what the stem and the leaf mean.

A stem and leaf plot can be used to show the distribution and skewness of the data, as well as to
identify outliers. Some features of a stem and leaf plot are:

 If the rows of leaves are roughly symmetric about the longest row, then the data are approximately symmetric, as in a normal distribution.
 If the leaves are concentrated on the low stems with a long tail toward the high stems, the data are positively skewed or skewed right. If the leaves are concentrated on the high stems with a long tail toward the low stems, the data are negatively skewed or skewed left.
 If the leaves cluster on only a few stems, the data have low variability or dispersion. If they are spread across many stems, the data have high variability or dispersion.
 If the stems cover a wide range of values, then the data have a large range. If the stems cover a narrow range of values, then the data have a small range.
 If there are leaves that are far away from the rest of the data, then the data have some extreme values that may affect the analysis. These are called outliers and they may indicate errors in the data collection or measurement, or may represent some special cases that need further investigation.

To draw a stem and leaf plot, the following steps are required:

 Arrange the data in ascending order and find the minimum, maximum, median, first quartile
and third quartile values.
 Choose a suitable stem unit that covers the range of the data. For example, if the data values are in the tens, the stem unit can be 10. If the data values are in the hundreds, the stem unit can be 100.
 Divide each data value into a stem and a leaf according to the chosen stem unit. For example, if the stem unit is 10, then the data value 42 can be split into a stem of 4 and a leaf of 2. If the stem unit is 100, then the data value 420 can be split into a stem of 4 and a leaf of 2 (the tens digit, truncating the units digit), read with the key 4 | 2 = 420.
 List the stems in a vertical column and draw a vertical line to the right of the stems.

 Attach the leaves to the right of the corresponding stems in ascending order. If there are
repeated values, repeat the leaves accordingly.
 Write a key that shows what the stem and the leaf represent. For example, if the stem unit is
10, then the key can be 4 | 2 means 42.
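The steps above can be sketched in Python for the employee-age data used earlier in this report (stem unit 10; a minimal illustration, names are ours):

```python
from collections import defaultdict

def stem_and_leaf(data, stem_unit=10):
    # Split each value into stem = value // stem_unit and leaf = value % stem_unit,
    # then list the leaves in ascending order beside their stem.
    stems = defaultdict(list)
    for v in sorted(data):
        stems[v // stem_unit].append(v % stem_unit)
    return "\n".join(
        f"{stem} | " + " ".join(str(leaf) for leaf in stems[stem])
        for stem in sorted(stems)
    )

ages = [20, 25, 31, 39, 21, 35, 40, 69, 35, 42, 61, 51, 59, 54, 55]
print(stem_and_leaf(ages))
# 2 | 0 1 5
# 3 | 1 5 5 9
# 4 | 0 2
# 5 | 1 4 5 9
# 6 | 1 9
```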

An example of a stem and leaf plot is shown below. It shows the test scores of 25 students in a class.

![Stem and leaf plot example]

From the plot, we can see that:

 The minimum score is 42 and the maximum score is 98.


 The median score is 72 and the first and third quartiles are 62 and 82 respectively.
 The data are slightly skewed to the left, as the minimum score lies farther below the median than the maximum score lies above it.
 The data have moderate variability, as the stems are not too short or too long.
 The data have a moderate range, as the stems cover a moderate range of values.
 There are no outliers, as there are no scores that are far away from the rest of the data.

Box and Whisker Plot Report


A box and whisker plot is a graphical method that displays the variation of data from a five-number
summary, such as minimum, first quartile, median, third quartile and maximum. It is also known as a
box plot or a box and whisker diagram. It is useful for descriptive data analysis and comparison of
different data sets.

The main components of a box and whisker plot are:

 A box that represents the middle 50% of the data, also called the interquartile range (IQR).
The length of the box indicates the spread of the data.
 A line inside the box that shows the median of the data, also called the second quartile. The
median is the middle value of the data, such that half of the data are above it and half are
below it.
 Two whiskers that extend from the ends of the box (the first and third quartiles) to the minimum and maximum values of the data. The whiskers show the range of the data, excluding outliers.
 Outliers are data points that are far away from the rest of the data, usually more than 1.5 times the IQR beyond the quartiles. They are shown as dots or circles outside the whiskers.

A box and whisker plot can be used to show the distribution and skewness of the data, as well as to
identify outliers. Some features of a box and whisker plot are:

 If the median is in the middle of the box, and the whiskers are about the same length on
both sides of the box, then the data are symmetric and have a normal distribution.
 If the median is closer to one end of the box, and the whisker on that side is shorter than the
other, then the data are skewed. If the median is closer to the lower end of the box, the data
are positively skewed or skewed right. If the median is closer to the upper end of the box, the
data are negatively skewed or skewed left.
 If the box is narrow, then the data have low variability or dispersion. If the box is wide, then
the data have high variability or dispersion.
 If the whiskers are long, then the data have a large range. If the whiskers are short, then the
data have a small range.
 If there are outliers, then the data have some extreme values that may affect the analysis.
Outliers may indicate errors in the data collection or measurement, or may represent some
special cases that need further investigation.

To draw a box and whisker plot, the following steps are required:

 Arrange the data in ascending order and find the minimum, maximum, median, first quartile
and third quartile values.
 Draw a horizontal or vertical scale that covers the range of the data.
 Draw a box with the ends at the first and third quartiles, and a line inside the box at the
median.
 Draw whiskers from the ends of the box to the minimum and maximum values, unless there
are outliers.

 Identify outliers as data points that are more than 1.5 times the IQR away from the box, and
mark them with dots or circles.
 Label the plot with a title and the names of the variables.
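The drawing steps above reduce to computing the five-number summary and the 1.5 × IQR fences. A sketch for the employee-age data used earlier (quartiles by the i(n+1)/4 positional rule; names are ours):

```python
def five_number_summary(data):
    # Returns (min, Q1, median, Q3, max, outliers), with outliers defined as
    # values more than 1.5 * IQR beyond the quartiles.
    s = sorted(data)
    n = len(s)

    def positional(pos):
        k = int(pos)
        frac = pos - k
        if k >= n:
            return float(s[-1])
        return s[k - 1] + frac * (s[k] - s[k - 1])

    q1 = positional((n + 1) / 4)
    q2 = positional((n + 1) / 2)
    q3 = positional(3 * (n + 1) / 4)
    iqr = q3 - q1
    outliers = [v for v in s if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
    return s[0], q1, q2, q3, s[-1], outliers

ages = [20, 25, 31, 39, 21, 35, 40, 69, 35, 42, 61, 51, 59, 54, 55]
print(five_number_summary(ages))  # (20, 31.0, 40.0, 55.0, 69, [])
```

Since the outlier list is empty here, the whiskers would simply run from 31 down to 20 and from 55 up to 69.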

An example of a box and whisker plot is shown below. It compares the heights of male and female
students in a class.

![Box and whisker plot example]

From the plot, we can see that:

 The median height of male students is higher than that of female students, indicating that
male students are generally taller than female students.
 The IQR of male students is smaller than that of female students, indicating that male
students have less variation in their heights than female students.
 The range of male students is larger than that of female students, indicating that male
students have more extreme values in their heights than female students.
 There are no outliers in either group, indicating that there are no unusually tall or short
students in the class.
 The data for both groups are slightly skewed to the right, as the upper whiskers are longer than the lower whiskers, indicating that the heights above the median are more spread out than those below it.

Moments of Statistics Report

In statistics, moments are numerical measures that describe the shape and characteristics of a
probability distribution. They are based on the powers of the deviations of the random variable from
a fixed point, usually the mean. Moments can be used to calculate various parameters of the
distribution, such as the mean, variance, skewness, and kurtosis.

The kth moment of a random variable X about a point c is defined as the expected value of (X − c)^k, that is:

M_k(c) = E[(X − c)^k]

The point c can be any value, but often it is chosen to be either zero or the mean of X. When c is zero, the moments are called raw moments or moments about the origin. When c is the mean of X, the moments are called central moments or moments about the mean.

The first raw moment of X is equal to the mean of X, that is:

M_1(0) = E(X) = μ

The first central moment of X is always zero, that is:

M_1(μ) = E[X − μ] = 0

The second central moment of X is equal to the variance of X, that is:

M_2(μ) = E[(X − μ)²] = σ²

The third central moment of X is related to the skewness of X, which measures the asymmetry of the distribution. The skewness of X is defined as:

γ1 = M_3(μ) / σ³

The fourth central moment of X is related to the kurtosis of X, which measures the peakedness or flatness of the distribution. The kurtosis of X is defined as β2 = M_4(μ) / σ⁴; the excess kurtosis, which is zero for a normal distribution, is:

γ2 = M_4(μ) / σ⁴ − 3
Higher moments can also be calculated, but they are less commonly used in practice.

To illustrate the concept of moments, let us consider an example of a binomial distribution. Suppose that X is a binomial random variable with parameters n and p, that is, X counts the number of successes in n independent trials, each with probability p of success. The probability mass function of X is:

P(X = x) = C(n, x) p^x (1 − p)^(n−x), for x = 0, 1, 2, …, n

The mean and variance of X are:

μ = E(X) = np

σ² = Var(X) = np(1 − p)

The raw moments of X can be calculated by using the formula:

M_k(0) = E(X^k) = Σ_{x=0}^{n} x^k P(X = x)

For example, the first raw moment is:

M_1(0) = E(X) = Σ x P(X = x) = np

The second raw moment is:

M_2(0) = E(X²) = np(1 + (n − 1)p)

The third raw moment is:

M_3(0) = E(X³) = np(1 + 3(n − 1)p + (n − 1)(n − 2)p²)

The central moments of X can be calculated by using the formula:

M_k(μ) = E[(X − μ)^k] = Σ_{x=0}^{n} (x − μ)^k P(X = x)

For example, the first central moment is:

M_1(μ) = E[X − μ] = 0

The second central moment is:

M_2(μ) = E[(X − μ)²] = np(1 − p)

The third central moment is:

M_3(μ) = E[(X − μ)³] = np(1 − p)(1 − 2p)

The skewness and excess kurtosis of X can then be calculated by using the formulas:

γ1 = M_3(μ) / σ³ = (1 − 2p) / √(np(1 − p))

γ2 = M_4(μ) / σ⁴ − 3 = (1 − 6p(1 − p)) / (np(1 − p))
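The closed-form variance, skewness, and excess-kurtosis expressions above can be verified numerically by direct enumeration over the pmf (a sketch with arbitrarily chosen n = 10, p = 0.3):

```python
from math import comb, sqrt

def binomial_central_moment(n, p, k):
    # E[(X - np)^k] computed by summing over the binomial pmf directly
    mu = n * p
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) * (x - mu)**k
               for x in range(n + 1))

n, p = 10, 0.3
var = binomial_central_moment(n, p, 2)
skew = binomial_central_moment(n, p, 3) / var**1.5
excess_kurt = binomial_central_moment(n, p, 4) / var**2 - 3

# Compare against the closed forms for the binomial distribution
assert abs(var - n*p*(1 - p)) < 1e-9
assert abs(skew - (1 - 2*p) / sqrt(n*p*(1 - p))) < 1e-9
assert abs(excess_kurt - (1 - 6*p*(1 - p)) / (n*p*(1 - p))) < 1e-9
print("binomial moment identities verified")
```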

Consider the following frequency distribution of CGPA of 100 students. Find the first four raw moments about A = 1.75 and convert them to the central moments. Also, find the coefficients of skewness and kurtosis, and make comments about the distribution.

m2 = m2′ − (m1′)²
m3 = m3′ − 3m2′m1′ + 2(m1′)³
m4 = m4′ − 4m3′m1′ + 6m2′(m1′)² − 3(m1′)⁴

Ans:

To find the first four raw moments about A = 1.75, we use the formula:

m_k′ = ∑(x_i − A)^k f_i / n

where x_i are the midpoints of the class intervals, f_i are the frequencies, n is the total frequency, and k is the order of the moment.

The midpoints of the class intervals are:
x = 1.25, 1.75, 2.25, 2.75, 3.25, 3.75

The frequencies are:
f = 7, 18, 35, 27, 10, 3

The total frequency is:
n = 7 + 18 + 35 + 27 + 10 + 3 = 100

The deviations from A = 1.75 are:
d = x − A = −0.5, 0, 0.5, 1, 1.5, 2

The first raw moment about A = 1.75 is:

m1′ = ∑d f / 100 = (−0.5 × 7 + 0 × 18 + 0.5 × 35 + 1 × 27 + 1.5 × 10 + 2 × 3) / 100
= (−3.5 + 0 + 17.5 + 27 + 15 + 6) / 100 = 62 / 100 = 0.62

The second raw moment about A = 1.75 is:

m2′ = ∑d² f / 100 = (0.25 × 7 + 0 × 18 + 0.25 × 35 + 1 × 27 + 2.25 × 10 + 4 × 3) / 100
= (1.75 + 0 + 8.75 + 27 + 22.5 + 12) / 100 = 72 / 100 = 0.72

The third raw moment about A = 1.75 is:

m3′ = ∑d³ f / 100 = (−0.125 × 7 + 0 × 18 + 0.125 × 35 + 1 × 27 + 3.375 × 10 + 8 × 3) / 100
= (−0.875 + 0 + 4.375 + 27 + 33.75 + 24) / 100 = 88.25 / 100 = 0.8825

The fourth raw moment about A = 1.75 is:

m4′ = ∑d⁴ f / 100 = (0.0625 × 7 + 0 × 18 + 0.0625 × 35 + 1 × 27 + 5.0625 × 10 + 16 × 3) / 100
= (0.4375 + 0 + 2.1875 + 27 + 50.625 + 48) / 100 = 128.25 / 100 = 1.2825

To convert the raw moments to central moments, we use the formulas stated in the problem:

m2 = m2′ − (m1′)²
m3 = m3′ − 3m2′m1′ + 2(m1′)³
m4 = m4′ − 4m3′m1′ + 6m2′(m1′)² − 3(m1′)⁴

The first central moment is always zero; the mean itself is:

μ = A + m1′ = 1.75 + 0.62 = 2.37

The second central moment is:

m2 = 0.72 − 0.62² = 0.72 − 0.3844 = 0.3356

The third central moment is:

m3 = 0.8825 − 3 × 0.72 × 0.62 + 2 × 0.62³
= 0.8825 − 1.3392 + 0.476656 = 0.019956

The fourth central moment is:

m4 = 1.2825 − 4 × 0.8825 × 0.62 + 6 × 0.72 × 0.62² − 3 × 0.62⁴
= 1.2825 − 2.1886 + 1.660608 − 0.44329 = 0.311218

The coefficient of skewness is:

γ1 = m3 / m2^(3/2) = 0.019956 / 0.3356^(3/2) = 0.019956 / 0.194416 = 0.1026

The coefficient of kurtosis is:

β2 = m4 / m2² = 0.311218 / 0.3356² = 0.311218 / 0.112627 = 2.763

The distribution is very slightly positively skewed, as the coefficient of skewness (0.1026) is positive but close to zero: the data are nearly symmetric, with a slightly longer right tail. The distribution is also platykurtic, as the coefficient of kurtosis (2.763) is less than 3. This means that the data have a somewhat flatter peak and lighter tails than a normal distribution.
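Recomputing the moments directly from the midpoints and frequencies gives a quick check on the arithmetic (a minimal sketch; variable names are ours):

```python
mids  = [1.25, 1.75, 2.25, 2.75, 3.25, 3.75]
freqs = [7, 18, 35, 27, 10, 3]
n = sum(freqs)
A = 1.75

def raw_moment(k):
    # k-th raw moment about A: sum of f*(x - A)^k divided by n
    return sum(f * (x - A)**k for x, f in zip(mids, freqs)) / n

m1p, m2p, m3p, m4p = (raw_moment(k) for k in (1, 2, 3, 4))

# Convert to central moments with the standard identities
m2 = m2p - m1p**2
m3 = m3p - 3*m2p*m1p + 2*m1p**3
m4 = m4p - 4*m3p*m1p + 6*m2p*m1p**2 - 3*m1p**4

skewness = m3 / m2**1.5
kurtosis = m4 / m2**2
print(round(m1p, 4), round(m2p, 4), round(m3p, 4), round(m4p, 4))  # 0.62 0.72 0.8825 1.2825
print(round(skewness, 3), round(kurtosis, 3))                      # 0.103 2.763
```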

Ishmam Ahmed, 0112230141

Oftentimes, we need to examine the correlation between two variables: how a change in one variable is related to a change in the other. Are they directly proportional, or are they inversely related? The correlation coefficient is a concise way of quantifying such a relationship.

Throughout the report, this data sheet will illustrate the workings of the statistical
methods and how they produce different results. The following table shows different
datasets for temperature of water and reduction in pulse rate.

Temperature of water | 68 | 65 | 70 | 62 | 60 | 55 | 58 | 65 | 69 | 63
Reduction in pulse rate | 2 | 5 | 1 | 10 | 9 | 13 | 10 | 3 | 4 | 6

According to Investopedia, the correlation coefficient is defined as the statistical measure of the strength of a linear relationship between two variables. Its values can range from -1 to 1.

A coefficient of 1 shows a perfect positive correlation, with values in one series rising as those in the other rise. A coefficient of -1 describes a perfect negative, or inverse, correlation, with values in one series rising as those in the other decline. A correlation coefficient of 0 means there is no linear relationship.

The idea was introduced by Francis Galton and was further researched and
developed by Karl Pearson in the 1880s; hence it is named the Pearson
Correlation Coefficient. However, the underlying formula was first derived by
Auguste Bravais in 1844.

The formula is as follows:

r = Σ(xi − x̄)(yi − ȳ) / √( Σ(xi − x̄)² · Σ(yi − ȳ)² )


It can also be coded in Python using the following code:

def mean(data):
    return sum(data) / len(data)

def correlation_coefficient(x, y):
    mean_x = mean(x)
    mean_y = mean(y)

    # Calculate the numerator and denominators for the correlation coefficient formula
    numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    denominator_x = sum((xi - mean_x)**2 for xi in x)
    denominator_y = sum((yi - mean_y)**2 for yi in y)

    corr_coefficient = numerator / (denominator_x**0.5 * denominator_y**0.5)

    return corr_coefficient

The calculation for the above dataset is as follows:
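Since the worked calculation appeared as a figure in the original, here is a self-contained sketch that reproduces the result for the temperature/pulse-rate dataset:

```python
def mean(data):
    return sum(data) / len(data)

def correlation_coefficient(x, y):
    mean_x, mean_y = mean(x), mean(y)
    numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    denominator_x = sum((xi - mean_x)**2 for xi in x)
    denominator_y = sum((yi - mean_y)**2 for yi in y)
    return numerator / (denominator_x**0.5 * denominator_y**0.5)

temperature = [68, 65, 70, 62, 60, 55, 58, 65, 69, 63]
pulse_reduction = [2, 5, 1, 10, 9, 13, 10, 3, 4, 6]

r = correlation_coefficient(temperature, pulse_reduction)
print(r)  # about -0.94: a strong negative (inverse) correlation
```

The strongly negative value matches intuition: as the water temperature rises, the reduction in pulse rate falls.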



The Pearson Correlation Coefficient is used in the following sectors in daily life:

1. Finance and Economics
2. Pharmaceutical production
3. Quality Control and Manufacturing
4. Environmental Science

What is a Scatter Diagram?


According to BYJUS, a scatter diagram plots paired values of two variables along the
X and Y axes. If the variables are correlated, the points fall along a line or curve. A
scatter diagram, or scatter plot, gives an idea of the nature of the relationship.

Once the graph has been plotted, a line of best fit is drawn through the middle of the
points, and the deviations of the points from this line are measured. From these
deviations, conclusions about the strength of the correlation can be drawn.

A perfect correlation looks like:

A high degree of correlation graph looks like this:

A lower degree of correlation graph looks like this:



Lastly, a scatter diagram graph with almost no correlation would look like:

The scatter plot for the above dataset is as follows:

The use of Scatter Plot diagrams is seen in the following sectors:


1. Finance and Investment
2. Education
3. Market and Sales
4. Environmental Science

Similar to Pearson's Correlation Coefficient, Spearman's Rank Correlation
Coefficient measures the relationship between two variables; specifically, it
assesses how well the relationship can be described by a monotonic function.

It was introduced by the psychologist Charles Spearman in 1904. He was well
known for his work on rank correlation, and the coefficient is named after him.

The formula for Spearman's Rank Correlation Coefficient is written as follows:

ρ = 1 − (6 Σdi²) / (n(n² − 1))

where di is the difference between the ranks of the i-th pair and n is the number of pairs.

The following code in Python can be used to find Spearman’s Rank Correlation
Coefficient:

def calculate_rank(data):
    # Rank the values from 1 (smallest) to n (largest). Ties receive
    # consecutive ranks in order of appearance, rather than the averaged
    # ranks used in a strict Spearman computation.
    sorted_data = sorted(enumerate(data), key=lambda x: x[1])
    ranks = [0] * len(data)

    for i, (index, value) in enumerate(sorted_data):
        ranks[index] = i + 1

    return ranks

def spearman_rank_correlation(x, y):
    n = len(x)

    # Calculate ranks for x and y
    ranks_x = calculate_rank(x)
    ranks_y = calculate_rank(y)

    # Calculate the differences between the ranks
    d = [rx - ry for rx, ry in zip(ranks_x, ranks_y)]

    # Calculate the sum of squared differences
    sum_d_squared = sum(di**2 for di in d)

    # Calculate Spearman's rank correlation coefficient
    # (this shortcut formula is exact only when there are no tied values)
    rho = 1 - (6 * sum_d_squared) / (n * (n**2 - 1))

    return rho

Doing the dataset math:
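The dataset math appeared as a figure in the original; the self-contained sketch below reproduces it. Note that ties here are ranked in order of appearance, so the value can differ slightly from a tie-averaged computation:

```python
def calculate_rank(data):
    # Ties receive consecutive ranks in order of appearance (no averaging)
    sorted_data = sorted(enumerate(data), key=lambda x: x[1])
    ranks = [0] * len(data)
    for i, (index, _) in enumerate(sorted_data):
        ranks[index] = i + 1
    return ranks

def spearman_rank_correlation(x, y):
    n = len(x)
    d = [rx - ry for rx, ry in zip(calculate_rank(x), calculate_rank(y))]
    return 1 - (6 * sum(di**2 for di in d)) / (n * (n**2 - 1))

temperature = [68, 65, 70, 62, 60, 55, 58, 65, 69, 63]
pulse_reduction = [2, 5, 1, 10, 9, 13, 10, 3, 4, 6]

rho = spearman_rank_correlation(temperature, pulse_reduction)
print(rho)  # about -0.95: a strong inverse monotonic relationship
```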



Here are some practical applications of Spearman's Rank Correlation Coefficient:
1. Non-parametric analysis
2. Ordinal Data
3. Finance and Investment
4. Psychology and Education

Regression Analysis

Lastly, we discuss another powerful statistical method: regression analysis.
According to Alchemer, it is the method of identifying which variables have an
impact on a topic of interest.

There are two types of variables: dependent and independent. The dependent
variable is the one being studied or predicted, while the independent variable is
used to study its impact on the dependent variable.

There are multiple types of regression analysis. However, we will mostly be focusing
our report on simple linear regression analysis. So what is simple linear regression
analysis?

Simple linear regression analysis is a statistical method used to model the relationship
between a single dependent variable and a single independent variable. The
relationship is assumed to be linear, meaning that changes in the independent variable
are associated with a constant change in the dependent variable.

The formula is written as follows:

ŷ = β0 + β1·x

where β1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and β0 = ȳ − β1·x̄.

The following formula can be coded in Python using this particular code:

# Sample data
X = [1, 2, 3, 4, 5]  # Independent variable
y = [2, 4, 5, 4, 5]  # Dependent variable

mean_X = sum(X) / len(X)
mean_y = sum(y) / len(y)

# Calculate the slope (beta1) and intercept (beta0)
numerator = sum((xi - mean_X) * (yi - mean_y) for xi, yi in zip(X, y))
denominator = sum((xi - mean_X)**2 for xi in X)

beta1 = numerator / denominator
beta0 = mean_y - beta1 * mean_X

# Make predictions for the original data
predictions = [beta0 + beta1 * xi for xi in X]

The calculations for the above dataset are as follows:
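The worked calculations appeared as a figure in the original; as a check, this sketch computes the coefficients for the sample data used in the code above:

```python
X = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

mean_X = sum(X) / len(X)  # 3.0
mean_y = sum(y) / len(y)  # 4.0

# Slope numerator: sum of (xi - x_bar)(yi - y_bar); denominator: sum of (xi - x_bar)^2
numerator = sum((xi - mean_X) * (yi - mean_y) for xi, yi in zip(X, y))  # 6.0
denominator = sum((xi - mean_X)**2 for xi in X)                         # 10.0

beta1 = numerator / denominator   # slope = 0.6
beta0 = mean_y - beta1 * mean_X   # intercept, approximately 2.2

print(beta0, beta1)  # fitted line: y-hat = 2.2 + 0.6x
```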



From this formula, we can also find the coefficient of determination, denoted by r².
It measures how much of the variation in the dataset can be explained by the linear
relationship established above.

For example, if the value comes out to be 0.81, this would mean that 81% of the
variation in the dependent variable is explained by the independent variable.
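The 0.81 above is only an illustrative figure. For the sample data used in the regression code earlier, r² can be computed as a sketch like this:

```python
X = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

mean_X = sum(X) / len(X)
mean_y = sum(y) / len(y)

# Pearson's r for the sample, then square it to get the
# coefficient of determination
num = sum((xi - mean_X) * (yi - mean_y) for xi, yi in zip(X, y))
den = (sum((xi - mean_X)**2 for xi in X) ** 0.5) * \
      (sum((yi - mean_y)**2 for yi in y) ** 0.5)
r = num / den

r_squared = r**2
print(r_squared)  # about 0.6: 60% of the variation in y is explained by X
```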

Here are some practical usages of the statistical method regression analysis:

1. Risk Assessment
2. Performance Evaluation
3. Quality assurance
4. Predictive Modeling

Writing this report allowed me to apply these statistical methods and establish
relationships between pairs of variables, which helped me better understand how
certain variables are correlated with one another.

The methods also helped me understand the topics in greater depth and gave me
insight that will be useful in the long run.
