College of Engineering and Industrial Technology
Department of Agricultural and Biosystems Engineering
Engineering Data Analysis

Module 1: Introduction to Statistics and Data Analysis
Key Terms:
In statistics, we generally want to study a population. You can think of a population as a collection
of persons, things, or objects under study. To study the population, we select a sample. The idea of sampling
is to select a portion (or subset) of the larger population and study that portion (the sample) to gain
information about the population. Data are the result of sampling from a population.
Because it takes a lot of time and money to examine an entire population, sampling is a very practical
technique. If you wished to compute the overall grade point average at your school, it would make sense to
select a sample of students who attend the school. The data collected from the sample would be the students'
grade point averages.
In presidential elections, opinion poll samples of 1,000–2,000 people are taken. The opinion poll is
supposed to represent the views of the people in the entire country. Manufacturers of canned carbonated
drinks take samples to determine whether a 16-ounce can contains 16 ounces of carbonated drink. From the sample
data, we can calculate a statistic.
A statistic is a number that represents a property of the sample. For example, if we consider one math
class to be a sample of the population of all math classes, then the average number of points earned by
students in that one math class at the end of the term is an example of a statistic. The statistic is an estimate
of a population parameter.
A parameter is a numerical characteristic of the whole population that can be estimated by a statistic.
Since we considered all math classes to be the population, then the average number of points earned per
student over all the math classes is an example of a parameter. One of the main concerns in the field of
statistics is how accurately a statistic estimates a parameter. The accuracy really depends on how well the
sample represents the population. The sample must contain the characteristics of the population in order to
be a representative sample. We are interested in both the sample statistic and the population parameter in
inferential statistics. In a later chapter, we will use the sample statistic to test the validity of the established
population parameter.
A variable, usually notated by capital letters such as X and Y, is a characteristic or measurement that
can be determined for each member of a population.
Variables may be numerical or categorical.
• Numerical variables take on values with equal units such as weight in pounds and time in hours.
• Categorical variables place the person or thing into a category.
“If we let X equal the number of points earned by one math student at the end of a term, then X is a numerical variable.
If we let Y be a person's party affiliation, then some examples of Y include Republican, Democrat, and Independent. Y is
a categorical variable”.
We could do some math with values of X (calculate the average number of points earned, for
example), but it makes no sense to do math with values of Y (calculating an average party affiliation makes
no sense).
Data are the actual values of the variable. They may be numbers or they may be words. Datum is a
single value. Two words that come up often in statistics are mean and proportion.
“If you were to take three exams in your math classes and obtain scores of 86, 75, and 92, you would calculate
your mean score by adding the three exam scores and dividing by three (your mean score would be 84.3 to one decimal
place). If, in your math class, there are 40 students and 22 are men and 18 are women, then the proportion of men
students is 22/40 and the proportion of women students is 18/40”.
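As a quick illustration, here is a minimal Python sketch of the two calculations just described, using the scores and class counts from the quoted example:

```python
# A minimal sketch of the mean and proportion calculations from the example.
exam_scores = [86, 75, 92]
mean_score = sum(exam_scores) / len(exam_scores)
print(round(mean_score, 1))        # 84.3

men, women, total = 22, 18, 40
print(men / total, women / total)  # 0.55 0.45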
Data is collected every second of every day from a vast array of sources. From the security cameras
that deploy facial recognition technology when people enter a building to the mobile devices that track
shopping, media, and communication habits, images and numbers are continually being collected by
government agencies, consumer groups, and other organizations from all around the world. This data
contains information that can help businesses operate more efficiently and reach the right customers.
However, to be of any value, data must be correctly interpreted. Misinterpreted data can lead to flawed
insights that could disrupt an organization’s growth and stability strategies. To ensure that these copious
amounts of information are leveraged effectively, businesses and other groups are hiring data scientists to
help collect, store, and analyze pertinent information. While these professionals come from a variety of
backgrounds, the growing field of data science provides a number of rewarding opportunities specifically for
engineers. The fields overlap in significant ways, which often makes professionals with an engineering
background a good fit for a role working with data.
Data analysis involves gathering and studying data to form insights that can be used to make
decisions. The information derived can be useful in several different ways, such as for building a business
strategy or ensuring the safety and efficiency of an engineering project. Data collection and analysis is
becoming increasingly important across almost every industry. Fields that collect this information include
marketing, sports, entertainment, medicine, communications, government, criminal justice, electronics, and
aerospace. Data can help companies make decisions on issues as diverse as how to engage their target
audiences, what purchases to make, and how to organize their staff members. Ultimately, data science is not
just about collecting and analyzing information. It is about being able to predict the future and verify the
results of past decisions.
Engineering is one industry that has been particularly influenced by the growing need for data
collection and analysis. As big data has begun to play a larger role in industries around the world, engineers
have been called on to play an influential role in the way this information is gathered, stored, and leveraged.
Professionals with an engineering background generally prove to be particularly adept at developing
techniques for analyzing data groups to extract valuable information.
To succeed in a career as a data scientist, an engineer should possess the following qualifications:
Analytics expertise: Experience extrapolating information from large quantities of numbers will help you
succeed in this role. Depending on where you work, knowledge of specific analytic tools will also
likely be required.
Computer knowledge: Gone are the days of crunching numbers on a hand-held calculator — much less with
pen and paper. The vast majority of your day will be spent working on a computer, so knowledge of
coding, unstructured data, and cloud tools will increase your marketability.
Communication skills: It is important to be able to present your findings in a clear and concise way to ensure
that your employer understands the information and can act accordingly.
Strong drive: In data science, you should regularly be looking for ways to improve how information is
collected and processed. Being an intellectually curious self-starter will take you far in this role.
Exercise No. 1
Determine what the key terms refer to in the following study. We want to know the average (mean) amount
of money first year college students spend at ABC College on school supplies that do not include books.
We randomly surveyed 100 first year students at the college. Three of those students spent $150, $200,
and $225, respectively.
Answer:
• The population is all first-year students attending ABC College this term.
• The sample is the 100 first year students who were randomly surveyed (a sample such as all students enrolled in one section of a beginning statistics course may not represent the entire population).
• The parameter is the average (mean) amount of money spent (excluding books) by first year college
students at ABC College this term.
• The statistic is the average (mean) amount of money spent (excluding books) by first year college
students in the sample.
• The variable could be the amount of money spent (excluding books) by one first year student. Let X =
the amount of money spent (excluding books) by one first year student attending ABC College.
• The data are the dollar amounts spent by the first-year students. Examples of the data are $150, $200,
and $225.
Exercise No. 2
Determine what the key terms refer to in the following study.
A study was conducted at a local college to analyze the average cumulative GPAs of students who graduated
last year. Fill in the letter of the phrase that best describes each of the items below.
1. Population_____ 2. Statistic _____ 3. Parameter _____ 4. Sample _____ 5. Variable _____
6. Data _____
a) all students who attended the college last year
b) the cumulative GPA of one student who graduated from the college last year
c) 3.65, 2.80, 1.50, 3.90
d) a group of students who graduated from the college last year, randomly selected
e) the average cumulative GPA of students who graduated from the college last year
f) all students who graduated from the college last year
g) the average cumulative GPA of students in the study who graduated from the college last year
Answer:
1. f; 2. g; 3. e; 4. d; 5. b; 6. c
Scales of Measurement
When gathering data by any method, measurements are usually obtained (e.g., height in inches, weight in pounds, age in years, I.Q. scores, temperature in degrees Celsius, incidence rates, mortality rates, etc.). Measurements are classified into four scales. In selecting the statistical tool to be used for drawing inferences from a random sample, the type of measurement scale must be carefully considered.
1. Nominal Scale
A measurement that classifies elements into two or more categories or classes. The numbers indicate that
the elements are different, but the difference is not according to order or magnitude.
Example: Distribution of Patients in XYZ Hospital According to Religion and Gender
2. Ordinal Scale
A measurement scale that ranks individuals in terms of the degree to which they possess a characteristic.
Example: patients rated according to their anxiety level, with the legend:
0 = not anxious
1 = low anxiety level
2 = moderate anxiety level
3 = high anxiety level
3. Interval Scale
A measurement scale that, in addition to ordering scores from highest to lowest, establishes a uniform
unit in the scale so that any distance between two consecutive scores is of equal magnitude.
Example:
The aptitude scores from 80 to 90 are of equal difference as that of the aptitude scores from 90 to 100.
There is also no absolute zero in the scale. For example, a temperature reading of 0 degrees Celsius does not mean that there is no temperature in that place.
4. Ratio Scale
A measurement scale that, in addition to being an interval scale, also has an absolute zero in the scale.
Examples:
Height, weight, area, volume, speed, rate of doing work, amount of money deposited in a bank.
There is no formula for selecting the best method to be used when gathering data. It depends on the
researcher’s design of the study, the type of data, the time allotment to complete the study, and the
researcher’s financial capacity.
There are several ways of collecting data, among which are the following: clerical tools and mechanical devices.
A. Clerical Tools
1. Questionnaire - Defined by Good as a list of planned, written questions related to a particular topic, with space provided for indicating the response to each question, intended for submission to a number of persons for reply; commonly used in a normative survey and in the measurement of attitudes and opinions.
Construction of Questionnaires:
1. Doing library search
2. Talking to knowledgeable people
3. Mastering the guidelines
4. Writing the questionnaire
5. Editing the questionnaire
6. Rewriting the questionnaire
7. Pretesting the questionnaire (dry run)
2. Interview - One of the major techniques in gathering data or information. It is a purposeful face-to-face relationship between two persons.
Types of Interviews
a. Direct Method- The researcher personally interviews the respondents.
b. Indirect Method- The researchers may use a telephone to interview the respondents.
Classes of Interview
a. Standardized- The interviewer is not allowed to change the specific wordings of the questions in the
interview schedule. He must conduct all interviews in the same manner, and he cannot adapt
questions for specific situations or pursue statements.
b. Non-standardized- The interviewer has complete freedom to develop each interview in the most
appropriate manner for each situation. He is not held to any specific questions. This is the same as
the so-called informal interview.
c. Semi-standardized- The interviewer is required to ask a number of specific major questions, and
beyond these he is free to probe as he chooses. There are prepared principal questions to be asked
and once these are asked and answered, the interviewer is free to ask any questions he sees fit for the situation.
d. Focused- Also called depth interview. Similar to non-standardized interview, the researcher asks a
series of questions based on his previous understanding and insight of the situation. The interview
is focused on specific topics that are to be investigated in depth.
e. Nondirective- The interviewee or subject is allowed and even encouraged to express his feelings
without the fear of disapproval. The subject can express his feelings and views on certain topics even
without waiting to be questioned or even without pressure from the interviewer.
3. Empirical Observation Method - A means of gathering information for research; it may be defined as perceiving data through the senses: sight, hearing, taste, touch, and smell. The sense of sight is the most important and the most used among the senses.
Types of Observation
a. Participant and non-participant observation
1. Participant- Observer takes active part in the activities of the group being observed.
Exercise No. 3
Identify each quantitative variable as discrete or continuous. Write D if discrete or C if continuous.
1. The boiling point of water is 100 degrees Celsius.
2. Length of hair of female students.
3. Number of foreigners migrating to the Philippines every year.
4. Her home telephone number is 2581376.
5. The number of children with missing/decayed teeth in barangay A is 2000.
6. John's height is 168 cm.
7. The following data are the densities of sample substances taken from Tabing-Ilog River (g/cc): 23.6, 19.8, 15.0, 7.8 and 2.4.
8. Weights in pounds of the Math quiz contestants.
9. The average speed of motorboats cruising in Manila Bay every day is 50 m/s.
10. Scores of 16 students in a Statistics quiz.
1. Selection bias
This bias occurs when the sample selected does not reflect the population of interest. For
instance, you are interested in the attitude of female students regarding campus safety but when
sampling you also include males. In this case, your population of interest was female students
however your sample included subjects not in that population (i.e. males).
2. Gathering volunteers: Collecting data from subjects who volunteer to provide data.
Example:
Using an advertisement in a magazine or on a website inviting people to complete a form or
participate in a study.
b. Probability Methods
1. Simple random sample: making selections from a population where each subject in the
population has an equal chance of being selected.
2. Stratified random sample: having first identified the population of interest, you divide this population into strata or groups based on some characteristic (e.g. sex, geographic region), then perform a simple random sample within each stratum.
3. Cluster sample: where a random cluster of subjects is taken from the population of interest.
For instance, if we were to estimate the average salary for faculty members at Penn State -
University Park Campus, we could take a simple random sample of departments and find
the salary of each faculty member within the sampled department. This would be our cluster
sample.
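The three probability methods can be sketched in a few lines of Python. This is an illustrative sketch only; the population of 100 numbered subjects and the two strata are hypothetical stand-ins, not data from this module:

```python
import random

population = list(range(1, 101))   # 100 hypothetical subject IDs

# 1. Simple random sample: every subject has an equal chance of selection.
srs = random.sample(population, 10)

# 2. Stratified random sample: divide the population into strata, then take
#    a simple random sample within each stratum (strata here are invented).
strata = {"men": population[:60], "women": population[60:]}
stratified = {name: random.sample(group, 5) for name, group in strata.items()}

# 3. Cluster sample: randomly select whole clusters (e.g. departments) and
#    keep every subject inside the selected clusters.
clusters = [population[i:i + 10] for i in range(0, 100, 10)]
cluster_sample = [subject for c in random.sample(clusters, 2) for subject in c]

print(srs, stratified, cluster_sample, sep="\n")
```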
There are advantages and disadvantages to both types of methods. Non-probability methods
are often easier and cheaper to facilitate. When non-probability methods are used it is often the case
that the sample is not representative of the population. If it is not representative, you can make
generalizations only about the sample, not the population. The primary benefit of using probability
sampling methods is the ability to make inference. We can assume that by using random sampling
we attain a representative sample of the population. The results can be “extended” or “generalized”
to the population from which the sample came.
Then measure your chosen response variable at several (at least two) settings of the factor under study. If
changing the factor causes the phenomenon to change, then you conclude that there is indeed a cause-and-
effect relationship at work.
How many factors are involved when you do an experiment? Some say two - perhaps this is a comparative
experiment? Perhaps there is a treatment group and a control group? If you have a treatment group and a
control group then, in this case, you probably only have one factor with two levels.
Engineering Experiments
If we had infinite time and resource budgets there probably wouldn't be a big fuss made over designing
experiments. In production and quality control we want to control the error and learn as much as we can
about the process or the underlying theory with the resources at hand. From an engineering perspective we're
trying to use experimentation for the following purposes:
a. reduce time to design/develop new products & processes
b. improve performance of existing processes
c. improve reliability and performance of products
d. achieve product & process robustness
e. perform evaluation of materials, design alternatives, setting component & system tolerances, etc.
We always want to fine-tune or improve the process. In today's global world this drive for
competitiveness affects all of us both as consumers and producers.
Robustness is a concept that enters into statistics at several points. At the analysis stage, robustness refers
to a technique that isn't overly influenced by bad data. Even if there is an outlier or bad data you still want to
get the right answer. Regardless of who or what is involved in the process - it is still going to work.
Every experiment design has inputs. Back to the cake baking example: we have our ingredients such
as flour, sugar, milk, eggs, etc. Regardless of the quality of these ingredients we still want our cake to come
out successfully. In every experiment there are inputs and in addition, there are factors (such as time of baking,
temperature, geometry of the cake pan, etc.), some of which you can control and others that you can't control.
The experimenter must think about factors that affect the outcome. We also talk about the output and the
yield or the response to your experiment. For the cake, the output might be measured as texture, flavor, height, or size.
Randomization
This is an essential component of any experiment that is going to have validity. If you are doing a comparative
experiment where you have two treatments, a treatment and a control, for instance, you need to include in
your experimental process the assignment of those treatments by some random process. An experiment
includes experimental units. You need to have a deliberate process to eliminate potential biases from the
conclusions, and random assignment is a critical step.
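A minimal sketch of random assignment in Python; the ten subject labels are hypothetical:

```python
import random

# Ten hypothetical experimental units to be split between two groups.
subjects = [f"S{i:02d}" for i in range(1, 11)]

random.shuffle(subjects)                 # random order removes assignment bias
treatment, control = subjects[:5], subjects[5:]
print("treatment:", treatment)
print("control:  ", control)
```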
Replication
Replication is in some sense the heart of all of statistics. To make this point, remember what the standard error of the mean is: it is the square root of the estimated variance of the sample mean, i.e., \(s/\sqrt{n}\). The width
of the confidence interval is determined by this statistic. Our estimates of the mean become less variable as
the sample size increases. Replication is the basic issue behind every method we will use in order to get a
handle on how precise our estimates are at the end. We always want to estimate or control the uncertainty in
our results. We achieve this estimate through replication. Another way we can achieve short confidence
intervals is by reducing the error variance itself. However, when that isn't possible, we can reduce the error in
our estimate of the mean by increasing n.
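The point that replication shortens confidence intervals can be seen numerically. A small sketch, assuming a sample standard deviation of 10 (an arbitrary illustrative value):

```python
import math

s = 10.0                       # assumed sample standard deviation
for n in (10, 40, 160):
    se = s / math.sqrt(n)      # standard error of the mean
    print(n, round(se, 2))     # quadrupling n halves the standard error
```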
Another way to reduce the length of the confidence interval is to reduce the error variance itself, which brings us to blocking.
Blocking
Blocking is a technique to include other factors in our experiment which contribute to undesirable variation.
Much of the focus in this class will be to creatively use various blocking techniques to control sources of
variation that will reduce error variance. For example, in human studies, the gender of the subjects is often
an important factor. Age is another factor affecting the response. Age and gender are often considered
nuisance factors which contribute to variability and make it difficult to assess systematic effects of a treatment.
By using these as blocking factors, you can avoid biases that might occur due to differences between the
allocation of subjects to the treatments, and as a way of accounting for some noise in the experiment. We
want the unknown error variance at the end of the experiment to be as small as possible. Our goal is usually
to find out something about a treatment factor (or a factor of primary interest), but in addition to this, we
want to include any blocking factors that will explain variation.
There are other ways that we can categorize factors: Experimental vs. Classification Factors
1. Experimental Factors
These are factors that you can specify (and set the levels of) and then assign at random as the treatment to the experimental units. Examples would be temperature, level of an additive, fertilizer amount per acre, etc.
2. Classification Factors
These can't be changed or assigned, these come as labels on the experimental units. The age and sex of the
participants are classification factors which can't be changed or randomly assigned. But you can select
individuals from these groups randomly.
a. Quantitative Factors- You can assign any specified level of a quantitative factor. Examples: percent
or pH level of a chemical.
b. Qualitative Factors- These factors have categories which are different types. Examples might be
species of a plant or animal, a brand in the marketing field, gender, - these are not ordered or
continuous but are arranged perhaps in sets.
Once we determine that a variable is Qualitative (or Categorical), we need tools to summarize
the data. We can summarize the data by using frequencies and by graphing the data.
Let's start with an example. In a class of 30 students, a survey question asked the students to indicate their eye color. The responses are shown in the table.
Hazel Brown Brown Brown
Brown Brown Brown Brown
Brown Brown Brown Brown
Blue Brown Brown Brown
Brown Brown Brown Brown
From this list, we can clearly see that the eye color brown is the most common. Which is
more frequent, Hazel or Green? It may only take a few seconds to answer the question but what if
there were 100 students? Or 1000? The best way to summarize categorical data is to use frequencies
and percentages (or proportions).
Proportion- A proportion is a fraction or part of the total that possesses a certain characteristic.
The best way to summarize categorical data is to use frequencies and percentages like in the
table.
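A short Python sketch of tallying frequencies and proportions, using the 20 eye-color responses listed above:

```python
from collections import Counter

# The 20 responses shown in the table above, read row by row.
eye_colors = ["Hazel"] + ["Brown"] * 11 + ["Blue"] + ["Brown"] * 7

counts = Counter(eye_colors)
n = len(eye_colors)
for color, freq in counts.most_common():
    print(f"{color:5s}  frequency = {freq:2d}  proportion = {freq / n:.2f}")
```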
The table is much easier to read than the actual data. It is clear to see that more students
have Hazel than Green eyes in the class. And, as the saying goes, "a picture is worth a thousand words," it is also helpful to visualize the data in a graph.
There are other measures, such as a trimmed mean, that we do not discuss here.
Mean
The mean is the average of the data.
Median
The median is the middle value of the ordered data. The most important step in finding the median is to first order the data from smallest to largest.
1. Order the data from smallest to largest.
2. Find the location of the middle value: for n observations, the location is (n + 1) / 2.
3. The value at the location found in Step 2 is the median. (If the location falls between two values, the median is the average of those two values.)
Mode
The mode is the value that occurs most often in the data. It is important to note that there may be
more than one mode in the dataset.
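Python's standard statistics module computes all three measures. A minimal sketch, using the ten aptitude scores of Example 1-5 (reconstructed from the worked outlier example below, which gives a mean of 82.1 and a median of 81):

```python
import statistics

# The ten aptitude scores, reconstructed from the outlier example below.
scores = [69, 76, 76, 78, 80, 82, 86, 88, 91, 95]

print(statistics.mean(scores))    # 82.1
print(statistics.median(scores))  # 81.0 (average of the middle values 80 and 82)
print(statistics.mode(scores))    # 76  (the only repeated value)
```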
Effects of Outliers
One shortcoming of the mean is that means are easily affected by extreme values. Measures
that are not that affected by extreme values are called resistant. Measures that are affected by extreme
values are called sensitive.
Using the data from Example 1-5, how would the mean and median change, if the entry 91
is mistakenly recorded as 9?
Answer:
The data set would be
9, 69, 76, 76, 78, 80, 82, 86, 88, 95
Mean
The mean would be \(\bar{x} = \frac{1}{10}(9 + 78 + 69 + 95 + 82 + 76 + 76 + 86 + 88 + 80) = 73.9\), which is very different from the original mean of 82.1.
Median
Let us see the effect of the mistake on the median value.
The data set (with 91 coded as 9) in increasing order is:
9, 69, 76, 76, 78, 80, 82, 86, 88, 95
where the median = 79
The medians of the two sets are not that different. Therefore, the median is not that affected by the
extreme value 9.
The mean is a sensitive measure (or sensitive statistic) and the median is a resistant measure
(or resistant statistic).
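This sensitivity is easy to verify directly; a small sketch comparing the correct and mistyped data sets:

```python
import statistics

original = [69, 76, 76, 78, 80, 82, 86, 88, 91, 95]
mistyped = [69, 76, 76, 78, 80, 82, 86, 88, 9, 95]   # 91 recorded as 9

print(statistics.mean(original), statistics.mean(mistyped))      # 82.1 73.9
print(statistics.median(original), statistics.median(mistyped))  # 81.0 79.0
```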
After reading this lesson you should know that there are quite a few options when one wants to describe central tendency. In future lessons, we talk mainly about the mean. However, we need to be aware of one of its shortcomings, which is that it is easily affected by extreme values.
Unless data points are known mistakes, one should not remove them from the data set! One should
keep the extreme points and use more resistant measures. For example, use the sample median to
estimate the population median.
What happens to the mean and median if we add or multiply each observation in a data set
by a constant?
Consider for example if an instructor curves an exam by adding five points to each student’s
score. What effect does this have on the mean and the median? The result of adding a constant to
each value has the intended effect of altering the mean and median by the constant.
For example, if in the above example where we have 10 aptitude scores, if 5 was added to
each score the mean of this new data set would be 87.1 (the original mean of 82.1 plus 5) and the
new median would be 86 (the original median of 81 plus 5).
Similarly, if each observed data value was multiplied by a constant, the new mean and median
would change by a factor of this constant. Returning to the 10 aptitude scores, if all of the original
scores were doubled, then the new mean and new median would be double the original mean
and median. As we will learn shortly, the effect is not the same on the variance!
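A quick numeric check of both transformation rules, reusing the same ten aptitude scores:

```python
import statistics

scores = [69, 76, 76, 78, 80, 82, 86, 88, 91, 95]   # the ten aptitude scores

curved = [x + 5 for x in scores]    # add a constant to each value
doubled = [x * 2 for x in scores]   # multiply each value by a constant

print(statistics.mean(curved), statistics.median(curved))    # 87.1 86.0
print(statistics.mean(doubled), statistics.median(doubled))  # 164.2 162.0
```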
Looking Ahead!
Why would you want to know this? One reason, especially for those moving onward to more applied statistics (e.g. regression, ANOVA), is transforming data. For many applied statistical methods, a required assumption is that the data are normal, or very near bell-shaped. When the data are not normal, statisticians will transform the data using numerous techniques, e.g. a logarithmic transformation. We just need to remember that the original data were transformed!
Shape
The shape of the data helps us to determine the most appropriate measure of central
tendency. The three most important descriptions of shape are Symmetric, Left-skewed, and Right-
skewed. Skewness is a measure of the degree of asymmetry of the distribution.
Note! When one has very skewed data, it is better to use the median as the measure of central tendency, since the median is not much affected by extreme values.
Measures of Position
While measures of central tendency are important, they do not tell the whole story. For
example, suppose the mean score on a statistics exam is 80%. From this information, can we
determine a range in which most people scored? The answer is no. There are two other types of
measures, measures of position and variability, that help paint a more concise picture of what is
going on in the data. In this section, we will consider the measures of position and discuss measures
of variability in the next one.
Measures of position give a range where a certain percentage of the data fall. The measures
we consider here are percentiles and quartiles.
Percentiles
The pth percentile of the data set is a measurement such that after the data are ordered from smallest to largest, at most p% of the data are at or below this value and at most (100 − p)% are at or above it.
A common application of percentiles is their use in determining passing or failure cutoffs
for standardized exams such as the GRE. If you have a 95th percentile score then you are at or above
95% of all test takers.
The median is the value where fifty percent of the data values fall at or below it. Therefore, the median is the 50th percentile.
We can find any percentile we wish. There are two other important percentiles: the 25th percentile, typically denoted Q1, and the 75th percentile, typically denoted Q3. Q1 is commonly called the lower quartile and Q3 is commonly called the upper quartile.
Finding Quartiles
The method we will demonstrate for calculating Q1 and Q3 may differ from the method
described in our textbook. The results shown here will always be the same as Minitab's results. The
method here is also different from the method presented in many undergraduate statistics courses.
This method is what we require students to use.
Note! If the value found in part 1 is not a whole number, interpolate the value.
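A sketch of one common interpolation convention, assuming the percentile position is computed as (n + 1) * p / 100 (the convention Minitab uses); the helper function percentile below is ours, not a library routine:

```python
def percentile(data, p):
    """Interpolated percentile using the position (n + 1) * p / 100."""
    xs = sorted(data)
    pos = (len(xs) + 1) * p / 100
    whole = int(pos)          # whole-number part of the position
    frac = pos - whole        # fractional part used for interpolation
    if whole < 1:
        return xs[0]
    if whole >= len(xs):
        return xs[-1]
    return xs[whole - 1] + frac * (xs[whole] - xs[whole - 1])

data = [69, 76, 76, 78, 80, 82, 86, 88, 91, 95]
print(percentile(data, 25), percentile(data, 50), percentile(data, 75))
# 76.0 81.0 88.75
```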
Measures of Variability
The dot plot for the pieces of candy from vending machine A and vending machine B is
displayed in figure 1.4.
They have the same center, but what about their spreads?
Range
The range is the difference between the maximum and minimum values of a data set. The maximum is the largest value in the dataset and the minimum is the smallest value. The range is easy to calculate, but it is very much affected by extreme values.
Like the range, the IQR (interquartile range, IQR = Q3 − Q1) is a measure of variability, but you must find the quartiles in order to compute its value.
Note! The IQR is not affected by extreme values. It is thus a resistant measure of variability.
One way to describe spread or variability is to compute the standard deviation. In the
following section, we are going to talk about how to compute the sample variance and the sample
standard deviation for a data set. The standard deviation is the square root of the variance.
When we calculate the sample standard deviation, we estimate the population mean with the sample mean and divide by (n − 1) rather than n, which gives the estimator a special property: it is "unbiased". Therefore, s² is an unbiased estimator for the population variance.
The sample variance (and therefore sample standard deviation) are the common default
calculations used by software. When asked to calculate the variance or standard deviation of a set of data, assume, unless otherwise instructed, that it is sample data, and therefore calculate the sample variance and sample standard deviation.
For example, for a sample of 18 observations whose squared deviations about the mean sum to \(46908/9 = 5212\), the sample variance is
\[ s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1} = \frac{46908/9}{18 - 1} = \frac{5212}{17} \approx 306.588 \]
Try it!
Calculate the sample variances for the data sets from vending machines A and B yourself and check that the variance for data set B is smaller than that for data set A. Work out your answer first, then compare with the sketch below.
a. 1, 2, 3, 3, 4, 5
b. 2, 3, 3, 3, 3, 4
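Since the original comparison graphic is not reproduced here, a small sketch you can use to check your answers:

```python
import statistics

a = [1, 2, 3, 3, 4, 5]   # vending machine A
b = [2, 3, 3, 3, 3, 4]   # vending machine B

print(statistics.variance(a))  # 2.0  (sum of squared deviations 10, n - 1 = 5)
print(statistics.variance(b))  # 0.4  (sum of squared deviations 2,  n - 1 = 5)
```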
Standard Deviation
The standard deviation is a very useful measure. One reason is that it has the same unit of measurement as the data itself (e.g. if a sample of student heights were in inches, then so, too, would be the standard deviation; the variance would be in squared units, for example inches²). Also, the empirical rule, which will be explained later, makes the standard deviation an important yardstick for finding out approximately what percentage of the measurements fall within certain intervals.
Standard Deviation
Approximately the average distance the values of a data set are from the mean, or the square root of the variance.
Now that we have discussed how to find summary statistics for quantitative variables, the next step is to graph the data. The graphs we will discuss include the dotplot, the stem-and-leaf diagram, the histogram, and the boxplot.
Dotplot
Stem-and-Leaf Diagrams
To produce the diagram, the data need to be grouped based on the “stem”, which depends on the number of digits of the quantitative variable. The “leaves” represent the last digit. One advantage of this diagram is that the original data can be recovered (except for the order in which the data were taken) from the diagram.
The first column, called depths, is used to display cumulative frequencies. Starting from
the top, the depths indicate the number of observations that lie in a given row or before. For
example, the 11 in the third row indicates that there are 11 observations in the first three rows. The
row that contains the middle observation is denoted by having a bracketed number of observations
in that row; (7) for our example. We thus know that the middle value lies in the fourth row. The
depths following that row indicate the number of observations that lie in a given row or after. For
example, the 4 in the seventh row indicates that there are four observations in the last three rows.
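A minimal stem-and-leaf sketch in Python, assuming two-digit integer data; for simplicity it omits the depths column described above:

```python
from collections import defaultdict

def stem_and_leaf(data):
    """Print a simple stem-and-leaf diagram (stems = tens, leaves = units)."""
    groups = defaultdict(list)
    for x in sorted(data):
        stem, leaf = divmod(x, 10)
        groups[stem].append(str(leaf))
    for stem in sorted(groups):
        print(f"{stem:3d} | {''.join(groups[stem])}")

stem_and_leaf([69, 76, 76, 78, 80, 82, 86, 88, 91, 95])
#   6 | 9
#   7 | 668
#   8 | 0268
#   9 | 15
```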
Histograms
If there are many data points and we would like to see the distribution of the data, we can
represent the data by a frequency histogram or a relative frequency histogram. A histogram looks
similar to a bar chart but it is for quantitative data.
To create a histogram, the data need to be grouped into class intervals. Then create a tally to
show the frequency (or relative frequency) of the data into each interval. The relative frequency is
the frequency in a particular class divided by the total number of observations. The bars are as wide
as the class interval and as tall as the frequency (or relative frequency).
For histograms, we usually want to have from 5 to 20 intervals. Since the data range is from
132 to 148, it is convenient to have a class of width 2 since that will give us 9 intervals.
• 131.5-133.5
• 133.5-135.5
• 135.5-137.5
• 137.5-139.5
• 139.5-141.5
• 141.5-143.5
• 143.5-145.5
• 145.5-147.5
• 147.5-149.5
The reason that we choose the end points as .5 is to avoid confusion about whether an end point belongs to the interval to its left or the interval to its right. An alternative is to specify an endpoint convention. For example, Minitab includes the left end point and excludes the right end point.
Having the intervals, one can construct the frequency table and then draw the frequency histogram, or use relative frequencies to construct the relative frequency histogram. The following histogram is produced by Minitab when we specify the midpoints for the definition of intervals according to the intervals chosen above.
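Since the original data set and Minitab output are not reproduced here, the following sketch shows how the same class intervals could be passed to matplotlib; the data values are hypothetical numbers in the quoted 132-148 range:

```python
import matplotlib.pyplot as plt

# Hypothetical values in the quoted 132-148 range; the module's original
# data set is not reproduced here.
data = [132, 133, 135, 136, 137, 138, 138, 139, 139, 140,
        140, 141, 142, 142, 143, 144, 145, 146, 147, 148]

edges = [131.5 + 2 * i for i in range(10)]       # 131.5, 133.5, ..., 149.5
plt.hist(data, bins=edges, edgecolor="black")    # frequency histogram
plt.xlabel("Measurement")
plt.ylabel("Frequency")
plt.show()
```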
If we do not specify the midpoints for the definition of intervals, Minitab will default to another set of class intervals, resulting in a different histogram. According to the include-left, exclude-right endpoint convention, the observation 133 is included in the class 133-135.
Note that different choices of class intervals will result in different histograms. Relative frequency histograms are constructed in much the same way as frequency histograms, except that the vertical axis represents the relative frequency instead of the frequency. For the purpose of visually comparing the distributions of two data sets, it is better to use relative frequency histograms, since the same vertical scale (from 0 to 1) is used for all relative frequencies.
Boxplot
To create this plot we need the five number summary. Therefore, we need:
• minimum value,
• Q1 (lower quartile),
• Q2 (median),
• Q3 (upper quartile), and
• maximum value.
Using the five number summary, one can construct a skeletal boxplot.
1. Mark the five number summary above the horizontal axis with vertical lines.
2. Connect Q1, Q2, and Q3 to form a box, then connect the box to the min and max with lines to form the whiskers.
Most statistical software does NOT create graphs of a skeletal boxplot, but instead opts for the boxplot described below. Boxplots from statistical software are more detailed than skeletal boxplots because they also show outliers. However, if there are no outliers, what is produced by the software is essentially the skeletal boxplot.
The following terminology will prepare us to understand and draw this more detailed type of boxplot.
Potential outliers are observations that lie outside the lower and upper limits.
Lower limit = Q1 - 1.5 * IQR
Upper limit = Q3 + 1.5 * IQR
Adjacent values are the most extreme values that are not potential outliers.
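A small sketch computing the five number summary, the limits, and any potential outliers; statistics.quantiles with its default "exclusive" method uses the same (n + 1)-based positions described earlier, and the ten aptitude scores are reused for illustration:

```python
import statistics

data = [69, 76, 76, 78, 80, 82, 86, 88, 91, 95]

# quantiles() with the default "exclusive" method uses (n + 1)-based
# positions, matching the interpolation convention described earlier.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_limit or x > upper_limit]

print(min(data), q1, q2, q3, max(data))    # five number summary
print(lower_limit, upper_limit, outliers)  # fences and potential outliers
```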
Right-Skewed Data - A right-skewed distribution along with its corresponding box plot:
Left-Skewed Data
A left-skewed distribution along with its corresponding box plot:
References:
Illowsky, B., and Dean, S. (2018). Introductory Statistics.
Calderon, J.F., and Gonzales, E.C. (2016). Methods of Research and Thesis Writing.
De Belen, R., and Feliciano, P. (2015). Basic Statistics for Research (1st ed.).
Pareño, E., and Jimenez, R. (2006). Basic Statistics: A Worktext.
https://online.stat.psu.edu/stat503/book/export/html/632