
RM Module 3

The document discusses descriptive statistics and different types of data visualization. It covers types of data, measures of central tendency, and common data visualization methods like bar graphs, pie charts, line graphs, histograms, and frequency polygons.

Uploaded by

sumitsuman732

3/5/2024

RESEARCH METHODOLOGY

Dr. Swarnambuj Suman
Assistant Professor
Mechanical Engineering Department
NIT Patna

Descriptive Statistics

Statistics may be defined as the science of collecting and analyzing data.

An important aspect of dealing with data is organizing and summarizing the data in ways that facilitate its interpretation and subsequent analysis. This aspect of statistics is called descriptive statistics.

Descriptive Statistics

Descriptive statistics mainly consists of the following aspects:
• Types of Data
• Data Visualization
• Measure of Central Tendency
• Measure of Dispersion

Types of Data

1. Based on sources
• Primary data: Data that are collected afresh and for the first time, and thus happen to be original in character.

• Secondary data: Data that have already been collected by someone else and have already been passed through the statistical process.


Types of Data

2. Based on Organization
• Raw data: Raw data refers to the original, unprocessed information collected or generated during a research study, experiment, or data collection process.

• Organized data: When the raw data is processed so that meaningful information can be drawn from it, it is said to be organized data.

3. Based on Variables
• Univariate data: Univariate data involves only one variable. Example: height of all students in a class.

• Bivariate data: Bivariate data involves two variables. It is mainly used to study the relation between the two variables.

• Multivariate data: Multivariate data involves more than two variables and is used to study the interaction between different pairs of variables.

Types of Data

4. Based on Values
• Continuous data: Continuous data is information that occurs in a continuous manner. Continuous data may take any value in a given range.

• Discrete data: Discrete data is information that has noticeable gaps between values. Discrete data is made up of distinct values.

Data Visualization

Data visualization is the representation of data through the use of common graphics, such as charts, plots, infographics, and even animations. These visual displays of information communicate complex data relationships in a way that is easy to understand.

Some common types of visualization methods are:
• Bar Graphs
• Pie Charts
• Line Graphs
• Histograms
• Frequency Polygons

Bar Graphs

• A bar graph is a pictorial representation of data using rectangular bars of equal width.
• The bars are placed at equal spacing along one of the axes, i.e. either x or y.
• Every bar depicts only one characteristic of the data.
• The distance between the bars should be equal.
• The bars can be either vertical or horizontal.
• The bars of a bar diagram can be visually compared by their relative heights, which reflect the data values.


Line Graphs

• A line graph connects a series of successive data points and shows how one variable behaves with respect to another by representing each value as a dot and connecting the dots to form a line.
• It is useful for representing trends and making comparisons.
• Line graphs can also be used to show changes over time.

Pie Charts

• A pie chart is a circular chart in which the circle is divided into multiple sectors.
• It is used when a component part must be compared with the other components and with the total.
• Pie charts are also known as angular circle diagrams.
• To construct one, we use the fact that the total of all given values corresponds to the total number of degrees in the circle, i.e. 360 degrees.
• Each sector represents a particular component as a part of the whole.
• It is used to show relative sizes.
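The pie chart construction rule (the total of all values corresponds to 360 degrees) can be sketched in Python as follows. The function name and the expense figures are hypothetical and serve only as an illustration.

```python
def sector_angles(values):
    """Return the central angle in degrees of each pie-chart sector."""
    total = sum(values.values())
    if total <= 0:
        raise ValueError("the total of all values must be positive")
    # Each component gets the same share of 360 degrees as it has of the total.
    return {name: 360.0 * value / total for name, value in values.items()}


# Hypothetical monthly expenses (illustrative data, not from the slides)
expenses = {"rent": 50, "food": 30, "travel": 20}
angles = sector_angles(expenses)
for name, angle in angles.items():
    print(f"{name}: {angle} degrees")
```

The angles always add up to 360 degrees, so each sector shows its component as a part of the whole.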


Histograms

• A histogram is like a bar graph but has no gaps between the bars.
• It is used for continuous class intervals.
• The height of every bar is equal to the frequency of the corresponding class interval; a taller bar means more data falls in that particular range.

Frequency Polygons

• A frequency polygon is similar to a line graph, but it is used for a continuous frequency distribution.
• Each interval of the continuous distribution is represented by its midpoint.
• The midpoints of the intervals and the corresponding frequencies are plotted in the XY plane.
• All points are joined by straight line segments.
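As a rough sketch of how the numbers behind a histogram and a frequency polygon are obtained, the Python snippet below counts observations per class interval and pairs each class midpoint with its frequency. The marks, the class width and the function name are assumptions made for illustration; classes are taken as [lower, upper).

```python
def frequency_table(data, start, width, n_classes):
    """Return rows of (lower, upper, midpoint, frequency) for equal-width classes."""
    counts = [0] * n_classes
    for x in data:
        k = int((x - start) // width)  # index of the class containing x
        if 0 <= k < n_classes:
            counts[k] += 1
    rows = []
    for k, frequency in enumerate(counts):
        lower = start + k * width
        rows.append((lower, lower + width, lower + width / 2, frequency))
    return rows


# Hypothetical marks of 10 students (illustrative data, not from the slides)
marks = [12, 15, 21, 24, 27, 33, 35, 36, 38, 44]
table = frequency_table(marks, start=10, width=10, n_classes=4)

# Histogram: one bar per class, bar height = frequency
for lower, upper, midpoint, frequency in table:
    print(f"{lower}-{upper}: {'#' * frequency}")

# Frequency polygon: the points (midpoint, frequency) to be joined
polygon_points = [(midpoint, frequency) for _, _, midpoint, frequency in table]
print(polygon_points)
```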


Measure of Central Tendency

Measures of central tendency (or statistical averages) tell us the point about which items have a tendency to cluster. Such a measure is considered the most representative figure for the entire mass of data.

A measure of central tendency is also known as a statistical average.

Since a measure of central tendency (i.e. an average) indicates the location or general position of the distribution on the X-axis, it is also known as a measure of location or position.

[Figure: central tendency is divided into mean, median and mode; the mean is further divided into arithmetic mean, geometric mean and harmonic mean.]

Mean

The mean is the most common measure of central tendency. It may be of three types:
• Arithmetic Mean
• Geometric Mean
• Harmonic Mean

Arithmetic Mean
The arithmetic mean, or simply the mean, may be defined as "a value obtained by dividing the sum of all the observations by the number of observations."
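The definition of the arithmetic mean translates directly into code. The sketch below is illustrative only; the marks are hypothetical values, since the data table of the original slides is not available.

```python
def arithmetic_mean(observations):
    """Sum of all observations divided by the number of observations."""
    if not observations:
        raise ValueError("at least one observation is required")
    return sum(observations) / len(observations)


# Hypothetical marks of 9 students (illustrative data, not from the slides)
marks = [45, 32, 37, 46, 39, 36, 41, 48, 36]
print(arithmetic_mean(marks))  # 360 / 9 = 40.0
```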


Mean
Numerical Example:
Calculate the arithmetic mean of the marks obtained by 9 students, given below:

Mean
Numerical Example:
The weights, recorded to the nearest gram, of 60 apples picked at random from a consignment are given below:


Mean
Solution:

Mean
Using the formula of the short-cut method of arithmetic mean for grouped data:


Mean
Geometric Mean:
"The nth root of the product of n positive values is called the geometric mean."

The following are the formulae of geometric mean:

$G = \sqrt[n]{x_1 x_2 \cdots x_n} = \operatorname{antilog}\left(\dfrac{\sum \log x}{n}\right)$ (ungrouped data)

$G = \operatorname{antilog}\left(\dfrac{\sum f \log x}{\sum f}\right)$ (grouped data)

Mean
Numerical example of geometric mean for both grouped and ungrouped data:
Calculate the geometric mean of the marks obtained by 9 students, given below:

Using the formula of geometric mean for ungrouped data:
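A minimal Python sketch of the geometric mean, computed through logarithms so that long products do not overflow. The optional `freqs` argument is an assumption added here to cover grouped data, where each value (for example a class midpoint) is weighted by its frequency.

```python
import math


def geometric_mean(values, freqs=None):
    """Geometric mean of positive values, optionally weighted by frequencies."""
    if freqs is None:
        freqs = [1] * len(values)  # ungrouped data: every value counts once
    if not values or any(v <= 0 for v in values):
        raise ValueError("the geometric mean needs positive values")
    total = sum(freqs)
    log_sum = sum(f * math.log(v) for v, f in zip(values, freqs))
    return math.exp(log_sum / total)  # antilog of the mean logarithm


print(geometric_mean([2, 8]))          # square root of 2 * 8
print(geometric_mean([4, 9], [2, 2]))  # grouped form, equal frequencies
```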


Mean
Given the following frequency distribution of weights of 60 apples, calculate the geometric mean for grouped data.

Mean
Harmonic Mean:
"The reciprocal of the arithmetic mean of the reciprocals of the values is called the harmonic mean."
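The harmonic mean definition can be sketched in Python as shown below. The `freqs` argument for grouped data and the speed figures are assumptions introduced for illustration.

```python
def harmonic_mean(values, freqs=None):
    """Reciprocal of the (frequency-weighted) mean of the reciprocals."""
    if freqs is None:
        freqs = [1] * len(values)  # ungrouped data: every value counts once
    if not values or any(v <= 0 for v in values):
        raise ValueError("the harmonic mean needs positive values")
    total = sum(freqs)
    reciprocal_sum = sum(f / v for v, f in zip(values, freqs))
    return total / reciprocal_sum


# Hypothetical example: equal distances covered at 40 km/h and 60 km/h
print(harmonic_mean([40, 60]))
```

For these two speeds the result is 48 km/h, lower than the arithmetic mean of 50 km/h, which is why the harmonic mean is the usual choice for averaging rates.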


Mean
Numerical example of harmonic mean for both grouped and ungrouped data:
Calculate the harmonic mean of the marks obtained by 9 students, given below:

Mean
Given the following frequency distribution of weights of 60 apples, calculate the harmonic mean for grouped data.


Median
When the observations are arranged in ascending or descending order, the value that divides the distribution into two equal parts is called the median.

Numerical example of median for both grouped and ungrouped data:
Calculate the median of the marks obtained by 9 students, given below:

Median
Arrange the data in ascending order:

Calculate the median of the marks obtained by 10 students, given below:
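For ungrouped data the procedure is to sort the observations and take the middle value, or the average of the two middle values when the count is even. The Python sketch below follows that rule; the marks are hypothetical, since the original data tables are not available.

```python
def median(observations):
    """Middle value of the sorted data (mean of the two middle values if n is even)."""
    if not observations:
        raise ValueError("at least one observation is required")
    ordered = sorted(observations)  # arrange the data in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical marks (illustrative data, not from the slides)
marks_9 = [45, 32, 37, 46, 39, 36, 41, 48, 36]
marks_10 = marks_9 + [50]
print(median(marks_9))   # odd count: the 5th ordered value
print(median(marks_10))  # even count: mean of the 5th and 6th ordered values
```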


Median
The number of values above the median balances (equals) the number of values below the median, i.e. 50% of the data falls above the median and 50% below it.

Median
Numerical examples:
The following distribution relates to the number of assistants in 50 retail establishments:


Median
Numerical example: Find the median for the distribution of examination marks given below:

Mode
The mode is defined as the value that has the highest frequency in a given set of values. It is the value that appears the greatest number of times.

Example: In the data set 2, 4, 5, 5, 6, 7, the mode is 5, since it appears twice while every other value appears once.

Mode For Ungrouped Data

The value occurring most frequently in a set of observations is its mode. In other words, the mode of the data is the observation having the highest frequency.

Example: The following table represents the number of wickets taken by a bowler in 10 matches. Find the mode of the given set of data.

It can be seen that 2 wickets were taken by the bowler most frequently across the matches. Hence, the mode of the given data is 2.

Mode
Mode For Grouped Data
• In the case of a grouped frequency distribution, the mode cannot be determined just by looking at the frequencies. In such cases we first identify the modal class, i.e. the class with the highest frequency; the mode lies inside the modal class. The mode is given by the formula:

$\text{Mode} = l + \dfrac{f_1 - f_0}{2f_1 - f_0 - f_2} \times h$

where
• l = lower limit of the modal class
• h = size of the class interval
• f1 = frequency of the modal class
• f0 = frequency of the class preceding the modal class
• f2 = frequency of the class succeeding the modal class

Mode
Example 1: Find the mode of the given data set: 3, 3, 6, 9, 15, 15, 15, 27, 27, 37, 48.

Solution: In the list (3, 3, 6, 9, 15, 15, 15, 27, 27, 37, 48), 15 is the mode, since it appears more times than any other number in the set.

Example 2: Find the mode of the data set 4, 4, 4, 9, 15, 15, 15, 27, 37, 48.

Solution: Given: 4, 4, 4, 9, 15, 15, 15, 27, 37, 48 is the data set.

A data set can have more than one mode if two or more values share the highest frequency. Here 4 and 15 each occur three times, more often than any other value.

Hence, both 4 and 15 are modes of the set.

Example 3: Find the mode of 3, 6, 9, 16, 27, 37, 48.

Solution: If no value in a data set appears more than once, then the set has no mode.

Hence, for the set 3, 6, 9, 16, 27, 37, 48, there is no mode.
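The three ungrouped examples above can be checked with a short Python sketch. The function name `modes` is a choice made here; it returns every value that shares the highest frequency, and an empty list when no value repeats, following the convention used in Example 3.

```python
from collections import Counter


def modes(data):
    """Return the list of modes; empty if no value occurs more than once."""
    if not data:
        return []
    counts = Counter(data)
    highest = max(counts.values())
    if highest == 1:
        return []  # every value appears once: no mode
    return sorted(value for value, count in counts.items() if count == highest)


print(modes([3, 3, 6, 9, 15, 15, 15, 27, 27, 37, 48]))  # Example 1
print(modes([4, 4, 4, 9, 15, 15, 15, 27, 37, 48]))      # Example 2
print(modes([3, 6, 9, 16, 27, 37, 48]))                 # Example 3
```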

Mode
Example 4: In a class of 30 students, the marks obtained in mathematics out of 50 are tabulated below. Calculate the mode of the data.

Solution:
The maximum class frequency is 12 and the class interval corresponding to this frequency is 20 - 30. Thus, the modal class is 20 - 30.
Lower limit of the modal class (l) = 20
Size of the class interval (h) = 10
Frequency of the modal class (f1) = 12
Frequency of the class preceding the modal class (f0) = 5
Frequency of the class succeeding the modal class (f2) = 8
Substituting these values in the formula, we get:

$\text{Mode} = 20 + \dfrac{12 - 5}{2(12) - 5 - 8} \times 10 = 20 + \dfrac{70}{11} \approx 26.36$

Mode
Bimodal, Trimodal & Multimodal (More than one mode)

When there are two modes in a data set, the set is called bimodal.

For example, the modes of set A = {2,2,2,3,4,4,5,5,5} are 2 and 5, because both 2 and 5 are repeated three times in the given set.

When there are three modes in a data set, the set is called trimodal.

For example, the modes of set A = {2,2,2,3,4,4,5,5,5,7,8,8,8} are 2, 5 and 8.

When there are four or more modes in a data set, the set is called multimodal.

If no value in the given set of observations is repeated more than once, the set is said to have no mode.
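The grouped-data calculation of Example 4 can be reproduced with the sketch below. It assumes the standard modal-class formula, Mode = l + (f1 - f0) / (2 f1 - f0 - f2) × h, with the symbols as defined for grouped data; the function name is hypothetical.

```python
def grouped_mode(l, h, f1, f0, f2):
    """Mode of grouped data from the modal class and its two neighbours."""
    denominator = 2 * f1 - f0 - f2
    if denominator == 0:
        raise ValueError("the formula is undefined when 2*f1 = f0 + f2")
    return l + (f1 - f0) / denominator * h


# Values from Example 4: modal class 20 - 30
mode = grouped_mode(l=20, h=10, f1=12, f0=5, f2=8)
print(round(mode, 2))
```

With these values the mode is 20 + 70/11, about 26.36, which lies inside the modal class 20 - 30 as expected.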


Measure of Dispersion

Dispersion or variability describes how items are distributed (scattered) from each other and from the center of a distribution.

Example: Height of students

In statistics, dispersion helps to understand the distribution of the data.

There are 4 methods to measure the dispersion of the data:
• Range
• Interquartile Range
• Variance
• Standard Deviation

Measure of Dispersion

Range
The range is the easiest measure of dispersion. It is calculated by subtracting the lowest value from the highest value.

Range = Highest Value - Lowest Value

Let's understand this by an example:
Let there be 5 students in the class having heights of 150 cm, 160 cm, 175 cm, 190 cm and 200 cm. Calculate the range of heights.

Range = 200 cm - 150 cm
Hence, Range = 50 cm
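The same range calculation for the five heights, as a small Python sketch (the function name is a choice made here):

```python
def data_range(values):
    """Range = highest value - lowest value."""
    if not values:
        raise ValueError("at least one value is required")
    return max(values) - min(values)


heights_cm = [150, 160, 175, 190, 200]  # heights from the example
print(data_range(heights_cm))  # 200 - 150 = 50
```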


Measure of Dispersion

Interquartile Range
Quartiles: Quartiles divide the set into 4 equal parts. There are three quartiles Q1, Q2 and Q3, where Q2 is the median of the distribution.

Five number summary:
Every dataset can be described using these 5 numbers:
• Lowest value
• Q1: 25th percentile
• Q2: Median
• Q3: 75th percentile
• Highest value

Interquartile Range
The interquartile range is defined as the range between the 75th percentile (Q3) and the 25th percentile (Q1).

IQR = Q3 - Q1

Measure of Dispersion

Problem Statement:
Let there be 8 numbers between 10 and 90 which are equally distributed. Define the five-number summary and find the interquartile range.
Lowest value: 10
Q1 (25th percentile): 25
Q2 (50th percentile): 50
Q3 (75th percentile): 75
Highest value: 90
Interquartile Range (IQR) = Q3 - Q1 = 75 - 25 = 50
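A five-number summary and the interquartile range can be computed with Python's standard `statistics` module, as in the sketch below. Several conventions exist for locating quartiles; the `"inclusive"` method used here is one assumption among them, and the data set is hypothetical, so the quartiles differ from the values quoted in the problem statement.

```python
import statistics


def five_number_summary(data):
    """Return (lowest, Q1, Q2, Q3, highest) using inclusive quartiles."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, q2, q3, max(data)


# Hypothetical, evenly spaced data (not the data set of the slides)
values = [10, 20, 30, 40, 50, 60, 70, 80, 90]
low, q1, q2, q3, high = five_number_summary(values)
iqr = q3 - q1
print(low, q1, q2, q3, high)
print("IQR =", iqr)
```

Whatever convention produces the quartiles, the last step is always IQR = Q3 - Q1.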


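Measure of Dispersion

Variance
Variance is defined as the average of the squared differences from the mean. It measures how far the data points in a dataset lie from the mean.

Population variance:

$\sigma^2 = \dfrac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$

Sample variance:

$s^2 = \dfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

Here N is the population size, μ the population mean, n the sample size and $\bar{x}$ the sample mean.

Measure of Dispersion

Standard Deviation
Standard deviation is the square root of the variance.

Population standard deviation:

$\sigma = \sqrt{\dfrac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$

Sample standard deviation:

$s = \sqrt{\dfrac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$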
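The sketch below computes the population and sample versions of the variance and the standard deviation in Python. It assumes the usual divisors, N for a population and n - 1 for a sample; the data values are hypothetical.

```python
import math


def population_variance(data):
    """Average squared difference from the mean (divisor N)."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data) / len(data)


def sample_variance(data):
    """Squared differences from the sample mean, divided by n - 1."""
    if len(data) < 2:
        raise ValueError("a sample variance needs at least two values")
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data) / (len(data) - 1)


# Hypothetical data (illustrative, not from the slides); the mean is 5
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_variance(data), math.sqrt(population_variance(data)))
print(sample_variance(data), math.sqrt(sample_variance(data)))
```

The sample variance is always slightly larger than the population variance of the same numbers, because the sum of squares is divided by n - 1 instead of N.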


Shape of Data

The shape of data is measured by:
• Skewness
• Kurtosis

Skew is used to describe the balance of the distribution, whereas kurtosis refers to the height of the distribution.

Shape of Data

Skewness
Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.

For univariate data Y1, Y2, ..., YN, the formula for skewness is:

$g_1 = \dfrac{\sum_{i=1}^{N} (Y_i - \bar{Y})^3 / N}{s^3}$

where $\bar{Y}$ is the mean, s is the standard deviation, and N is the number of data points. The above formula for skewness is referred to as the Fisher-Pearson coefficient of skewness.
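A minimal Python sketch of the Fisher-Pearson coefficient of skewness follows. It assumes that the standard deviation s is computed with N in the denominator, and both data sets are hypothetical.

```python
import math


def fisher_pearson_skewness(y):
    """Third central moment divided by the cube of the standard deviation."""
    n = len(y)
    mean = sum(y) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in y) / n)  # divisor N (assumed)
    if s == 0:
        raise ValueError("skewness is undefined when all values are equal")
    third_moment = sum((v - mean) ** 3 for v in y) / n
    return third_moment / s ** 3


symmetric = [1, 2, 3, 4, 5]      # hypothetical symmetric data
right_tailed = [1, 1, 1, 1, 10]  # hypothetical data with a long right tail
print(fisher_pearson_skewness(symmetric))
print(fisher_pearson_skewness(right_tailed))
```

A symmetric data set gives a skewness of zero, while a long right tail gives a positive value.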

Shape of Data

It should be noted that there are alternative definitions of skewness in the literature. For example, the Galton skewness (also known as Bowley's skewness) is defined as

$\text{Galton skewness} = \dfrac{Q_1 + Q_3 - 2Q_2}{Q_3 - Q_1}$

where Q1 is the lower quartile, Q3 is the upper quartile, and Q2 is the median.

The Pearson skewness coefficient is defined as

$S_k = \dfrac{3(\bar{Y} - \tilde{Y})}{s}$

where $\tilde{Y}$ is the sample median.

Shape of Data

• Skewness in data may be of three types:
1. Positive skewness
2. Negative skewness
3. Zero skewness

[Figure: types of skewness; surviving labels: "x - Median", "x - Mean".]


Shape of Data

Skewness describes the asymmetry of the dataset about the mean, i.e. the degree to which the distribution deviates from symmetry.

Positively skewed: mode < median < mean
Negatively skewed: mode > median > mean
Not skewed: mode = median = mean

Kurtosis

• Kurtosis refers to the peakedness or flatness of the distribution compared with the normal distribution.
• Kurtosis can be classified into the following categories:
• Leptokurtic (positive kurtosis): a distribution that is taller or more peaked than the normal distribution.
• Platykurtic (negative kurtosis): a distribution that is flatter than the normal distribution.


Probability Distribution

What Is a Probability Distribution?

A probability distribution is a statistical function that describes all the possible values and likelihoods that a random variable can take within a given range.

This range will be bounded between the minimum and maximum possible values, but precisely where the possible value is likely to be plotted on the probability distribution depends on a number of factors.

These factors include the distribution's mean (average), standard deviation, skewness, and kurtosis.

Probability Distribution

Types of Probability Distributions
There are two types of distributions based on the type of data generated by the experiments.

1. Continuous distributions. When the variable being measured is expressed on a continuous scale, its probability distribution is called a continuous distribution. For example, the probability distribution of metal layer thickness is continuous.
2. Discrete distributions. When the parameter being measured can only take on certain values, such as the integers 0, 1, 2, . . . , the probability distribution is called a discrete distribution. For example, the distribution of the number of nonconformities or defects in printed circuit boards would be a discrete distribution.

Probability Distribution

• The appearance of a discrete distribution is that of a series of vertical "spikes," with the height of each spike proportional to the probability. We write the probability that the random variable x takes on the specific value xi as

$P\{x = x_i\} = p(x_i)$

• The appearance of a continuous distribution is that of a smooth curve, with the area under the curve equal to probability, so that the probability that x lies in the interval from a to b is written as

$P\{a \le x \le b\} = \int_a^b f(x)\,dx$


Probability Distribution

Mean
The mean of a probability distribution is a measure of the central tendency in the distribution, or its location. The mean is defined as

$\mu = \int_{-\infty}^{\infty} x f(x)\,dx$ for continuous x, and $\mu = \sum_{i} x_i\,p(x_i)$ for discrete x.

For the case of a discrete random variable with exactly N equally likely values [that is, p(xi) = 1/N], the equation for the mean of a discrete distribution reduces to

$\mu = \dfrac{1}{N} \sum_{i=1}^{N} x_i$

Probability Distribution

Variance
The scatter, spread, or variability in a distribution is expressed by the variance. The definition of the variance is

$\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx$ for continuous x, and $\sigma^2 = \sum_{i} (x_i - \mu)^2\,p(x_i)$ for discrete x.

When the random variable is discrete with N equally likely values, the equation becomes

$\sigma^2 = \dfrac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$

Standard Deviation
The standard deviation, $\sigma = \sqrt{\sigma^2}$, is a measure of spread or scatter in the population expressed in the original units.
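The discrete forms of the mean and variance of a probability distribution can be illustrated with a fair six-sided die, a hypothetical case of N = 6 equally likely values. The sketch uses exact fractions so the results are not affected by rounding.

```python
from fractions import Fraction


def mean_and_variance(pmf):
    """Mean and variance of a discrete distribution given as {value: probability}."""
    mu = sum(x * p for x, p in pmf.items())
    variance = sum((x - mu) ** 2 * p for x, p in pmf.items())
    return mu, variance


# Fair die: each of the N = 6 values has probability 1/N (illustrative example)
die = {x: Fraction(1, 6) for x in range(1, 7)}
mu, variance = mean_and_variance(die)
print(mu, variance)  # 7/2 and 35/12
print(float(variance) ** 0.5)  # standard deviation, in the original units
```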


Probability Distribution

Important Discrete Distributions
Some common discrete distributions:
• The Hypergeometric distribution
• The Binomial distribution
• The Poisson distribution
• The Pascal or negative binomial distribution

Probability Distribution

The Hypergeometric Distribution
Suppose that there is a finite population consisting of N items. Some number of these items, say D (D ≤ N), fall into a class of interest. A random sample of n items is selected from the population without replacement, and the number of items in the sample that fall into the class of interest, say x, is observed. Then x is a hypergeometric random variable with the probability distribution defined as follows.

$p(x) = \dfrac{\binom{D}{x}\binom{N-D}{n-x}}{\binom{N}{n}}, \quad x = 0, 1, 2, \ldots, \min(n, D)$


Probability Distribution

• In the above definition, the quantity

$\binom{a}{b} = \dfrac{a!}{b!\,(a-b)!}$

is the number of combinations of a items taken b at a time.

• The hypergeometric distribution is the appropriate probability model for selecting a random sample of n items without replacement from a lot of N items of which D are nonconforming or defective.

• By a random sample, we mean a sample that has been selected in such a way that all possible samples have an equal chance of being chosen.

• In these applications, x usually represents the number of nonconforming items found in the sample.

Probability Distribution

For example,
Suppose that a lot contains 100 items, 5 of which do not conform to requirements. If 10 items are selected at random without replacement, then the probability of finding one or fewer nonconforming items in the sample is

$P\{x \le 1\} = P\{x = 0\} + P\{x = 1\} = \dfrac{\binom{5}{0}\binom{95}{10}}{\binom{100}{10}} + \dfrac{\binom{5}{1}\binom{95}{9}}{\binom{100}{10}} \approx 0.923$
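The lot-sampling example (N = 100 items, D = 5 nonconforming, sample of n = 10) can be evaluated with the sketch below, which assumes the standard hypergeometric probability p(x) = C(D, x) C(N - D, n - x) / C(N, n). It uses `math.comb` from the Python standard library (Python 3.8 or later); the function name is hypothetical.

```python
from math import comb


def hypergeom_pmf(x, N, D, n):
    """Probability of x items of interest in a sample of n drawn without replacement."""
    return comb(D, x) * comb(N - D, n - x) / comb(N, n)


# Lot of 100 items, 5 nonconforming, sample of 10: P(x <= 1)
p_one_or_fewer = hypergeom_pmf(0, 100, 5, 10) + hypergeom_pmf(1, 100, 5, 10)
print(round(p_one_or_fewer, 5))
```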

Probability Distribution

The binomial distribution is used frequently in quality engineering. It is the appropriate probability model for sampling from an infinitely large population, where p represents the fraction of defective or nonconforming items in the population. In these applications, x usually represents the number of nonconforming items found in a random sample of n items.

Probability Distribution

The Binomial Distribution
Consider a process that consists of a sequence of n independent trials. By independent trials, we mean that the outcome of each trial does not depend in any way on the outcome of previous trials.

When the outcome of each trial is either a "success" or a "failure," the trials are called Bernoulli trials.

If the probability of "success" on any trial, say p, is constant, then the number of "successes" x in n Bernoulli trials has the binomial distribution with parameters n and p, defined as follows.

$p(x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, 1, \ldots, n$
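A minimal sketch of the binomial probability, assuming the standard form p(x) = C(n, x) p^x (1 - p)^(n - x). The sample size and fraction nonconforming in the usage lines are hypothetical.

```python
from math import comb


def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent Bernoulli trials."""
    if not 0 <= p <= 1:
        raise ValueError("p must lie between 0 and 1")
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


# Hypothetical quality example: sample of n = 10 items, fraction nonconforming p = 0.1
p_at_most_one = binomial_pmf(0, 10, 0.1) + binomial_pmf(1, 10, 0.1)
print(p_at_most_one)
```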


Probability Distribution

The Poisson Distribution
A useful discrete distribution in statistical quality control is the Poisson distribution, defined as follows.

$p(x) = \dfrac{e^{-\lambda} \lambda^x}{x!}, \quad x = 0, 1, \ldots$

where the parameter λ > 0. Note that the mean and variance of the Poisson distribution are both equal to the parameter λ.

Probability Distribution

• A typical application of the Poisson distribution in quality control is as a model of the number of defects or nonconformities that occur in a unit of product.
• Any random phenomenon that occurs on a per unit (or per unit area, per unit volume, per unit time, etc.) basis is often well approximated by the Poisson distribution.

Example:
Suppose that the number of wire-bonding defects per unit that occur in a semiconductor device is Poisson distributed with parameter λ = 4. Then the probability that a randomly selected semiconductor device will contain two or fewer wire-bonding defects is

$P\{x \le 2\} = \sum_{x=0}^{2} \dfrac{e^{-4}\,4^x}{x!} = 13e^{-4} \approx 0.2381$
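The wire-bonding example (λ = 4, two or fewer defects) can be evaluated with the sketch below, which assumes the standard Poisson probability p(x) = e^(-λ) λ^x / x!. The function name is hypothetical.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """Probability of exactly x occurrences when the mean count is lam."""
    if lam <= 0:
        raise ValueError("the parameter lam must be positive")
    return exp(-lam) * lam ** x / factorial(x)


# P(x <= 2) for lam = 4: probability of two or fewer wire-bonding defects
p_two_or_fewer = sum(poisson_pmf(x, 4) for x in range(3))
print(round(p_two_or_fewer, 4))
```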


Probability Distribution

Important Continuous Distributions
These include:
• The Normal Distribution
• The Lognormal Distribution
• The Exponential Distribution
• The Gamma Distribution
• The Weibull Distribution

Probability Distribution

The Normal Distribution
The normal distribution is probably the most important distribution in both the theory and application of statistics.

If x is a normal random variable, then the probability distribution of x is defined as follows.

$f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}, \quad -\infty < x < \infty$


Probability Distribution

• The normal distribution is used so much that we frequently employ a special notation, $x \sim N(\mu, \sigma^2)$, to imply that x is normally distributed with mean μ and variance σ².
• The visual appearance of the normal distribution is a symmetric, unimodal, bell-shaped curve, as shown in the figure.

Probability Distribution

• The cumulative normal distribution is defined as the probability that the normal random variable x is less than or equal to some value a, or

$P\{x \le a\} = F(a) = \int_{-\infty}^{a} \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx$

• This integral cannot be evaluated in closed form.
• However, by using the change of variable

$z = \dfrac{x - \mu}{\sigma}$

the evaluation can be made independent of the mean μ and variance σ².


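Probability Distribution

$P\{x \le a\} = P\left\{z \le \dfrac{a - \mu}{\sigma}\right\} = \Phi\left(\dfrac{a - \mu}{\sigma}\right)$

where Φ(·) is the cumulative distribution function of the standard normal distribution (mean = 0, standard deviation = 1).

The transformation

$z = \dfrac{x - \mu}{\sigma}$

is usually called standardization, because it converts an N(μ, σ²) random variable into an N(0, 1) random variable.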
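Standardization can be sketched in Python using the error function from the standard library, since Φ(z) = ½ [1 + erf(z / √2)]. The function names and the numbers in the usage lines are hypothetical and serve only as an illustration.

```python
import math


def standard_normal_cdf(z):
    """Cumulative distribution function of N(0, 1), written with erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_cdf(a, mu, sigma):
    """P(x <= a) for x ~ N(mu, sigma^2), obtained by standardizing a."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (a - mu) / sigma  # the standardization transformation
    return standard_normal_cdf(z)


# Hypothetical example: x ~ N(40, 2^2); probability that x is at most 35
print(normal_cdf(35, mu=40, sigma=2))
```

Because only z enters the calculation, a single table (or function) for the standard normal distribution serves every choice of μ and σ.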
