Module 1: Introduction to Statistics and Data Analysis

The document provides an introduction to statistics and data analysis, emphasizing the importance of understanding data collection, analysis, and interpretation in various fields. It defines key statistical concepts such as population, sample, parameter, and statistic, and discusses methods of data collection including qualitative and quantitative data. Additionally, it highlights the significance of data science in engineering and the qualifications needed for a career in data analysis.

College of Engineering and Industrial Technology

Department of Agricultural and Biosystems Engineering


Engineering Data Analysis

Chapter 1: Introduction to Statistics and Data Analysis:


Obtaining Data
You are probably asking yourself the question, "When and where will I use statistics?" If you read any
newspaper, watch television, or use the Internet, you will see statistical information. There are statistics about
crime, sports, education, politics, and real estate. Typically, when you read a newspaper article or watch a
television news program, you are given sample information.
With this information, you may make a decision about the correctness of a statement, claim, or "fact."
Statistical methods can help you make the "best educated guess." Since you will undoubtedly be given
statistical information at some point in your life, you need to know some techniques for analyzing the
information thoughtfully. Think about buying a house or managing a budget.
The science of statistics deals with the collection, analysis, interpretation, and presentation of data.
We see and use data in our everyday lives.


1.1 Definitions of Statistics, Probability, and Key Terms


Organizing and summarizing data is called descriptive statistics. Two ways to summarize data are by
graphing and by using numbers (for example, finding an average). After you have studied probability and
probability distributions, you will use formal methods for drawing conclusions from "good" data. The formal
methods are called inferential statistics.
Statistical inference uses probability to determine how confident we can be that our conclusions are
correct. Effective interpretation of data (inference) is based on good procedures for producing data and
thoughtful examination of the data. You will encounter what will seem to be too many mathematical formulas
for interpreting data. The goal of statistics is not to perform numerous calculations using the formulas, but
to gain an understanding of your data. The calculations can be done using a calculator or a computer. The
understanding must come from you. If you can thoroughly grasp the basics of statistics, you can be more
confident in the decisions you make in life.

Key Terms:

In statistics, we generally want to study a population. You can think of a population as a collection
of persons, things, or objects under study. To study the population, we select a sample. The idea of sampling
is to select a portion (or subset) of the larger population and study that portion (the sample) to gain
information about the population. Data are the result of sampling from a population.
Because it takes a lot of time and money to examine an entire population, sampling is a very practical
technique. If you wished to compute the overall grade point average at your school, it would make sense to
select a sample of students who attend the school. The data collected from the sample would be the students'
grade point averages.
In presidential elections, opinion poll samples of 1,000–2,000 people are taken. The opinion poll is
supposed to represent the views of the people in the entire country. Manufacturers of canned carbonated
drinks take samples to determine whether a 16-ounce can contains 16 ounces of carbonated drink. From the
sample data, we can calculate a statistic.
A statistic is a number that represents a property of the sample. For example, if we consider one math
class to be a sample of the population of all math classes, then the average number of points earned by
students in that one math class at the end of the term is an example of a statistic. The statistic is an estimate
of a population parameter.


A parameter is a numerical characteristic of the whole population that can be estimated by a statistic.
Since we considered all math classes to be the population, then the average number of points earned per
student over all the math classes is an example of a parameter. One of the main concerns in the field of
statistics is how accurately a statistic estimates a parameter. The accuracy really depends on how well the
sample represents the population. The sample must contain the characteristics of the population in order to
be a representative sample. We are interested in both the sample statistic and the population parameter in
inferential statistics. In a later chapter, we will use the sample statistic to test the validity of the established
population parameter.
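To make the distinction concrete, here is a minimal Python sketch (the population values and sample size are made up for illustration): it treats the mean of a synthetic population as the parameter and the mean of a simple random sample as the statistic that estimates it.

```python
import random

random.seed(0)  # make the illustration reproducible

# Hypothetical population: end-of-term point totals for 500 math students.
population = [random.gauss(70, 10) for _ in range(500)]

# The parameter is a characteristic of the WHOLE population.
parameter = sum(population) / len(population)

# A simple random sample of 30 students; its mean is a statistic.
sample = random.sample(population, 30)
statistic = sum(sample) / len(sample)

# With a representative sample, the statistic is close to the parameter.
print("parameter:", round(parameter, 1), "statistic:", round(statistic, 1))
```

With a well-chosen sample, the two printed values land near each other; how close they tend to be is exactly the accuracy question discussed below.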
A variable, usually notated by capital letters such as X and Y, is a characteristic or measurement that
can be determined for each member of a population.
Variables may be numerical or categorical.
• Numerical variables take on values with equal units such as weight in pounds and time in hours.
• Categorical variables place the person or thing into a category.

“If we let X equal the number of points earned by one math student at the end of a term, then X is a numerical variable.
If we let Y be a person's party affiliation, then some examples of Y include Republican, Democrat, and Independent. Y is
a categorical variable”.

We could do some math with values of X (calculate the average number of points earned, for
example), but it makes no sense to do math with values of Y (calculating an average party affiliation makes
no sense).
Data are the actual values of the variable. They may be numbers or they may be words. Datum is a
single value. Two words that come up often in statistics are mean and proportion.

“If you were to take three exams in your math classes and obtain scores of 86, 75, and 92, you would calculate
your mean score by adding the three exam scores and dividing by three (your mean score would be 84.3 to one decimal
place). If, in your math class, there are 40 students and 22 are men and 18 are women, then the proportion of men
students is 22/40 and the proportion of women students is 18/40”.
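The mean and proportion calculations in the passage above can be checked in a few lines of Python:

```python
# Mean of the three exam scores: 86, 75, 92.
scores = [86, 75, 92]
mean_score = sum(scores) / len(scores)
print(round(mean_score, 1))  # 84.3

# Proportions in a class of 40 students: 22 men, 18 women.
men, women, total = 22, 18, 40
print(men / total, women / total)  # 0.55 0.45
```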

Data is collected every second of every day from a vast array of sources. From the security cameras
that deploy facial recognition technology when people enter a building to the mobile devices that track
shopping, media, and communication habits, images and numbers are continually being collected by


government agencies, consumer groups, and other organizations from all around the world. This data
contains information that can help businesses operate more efficiently and reach the right customers.
However, to be of any value, data must be correctly interpreted. Misinterpreted data can lead to flawed
insights that could disrupt an organization’s growth and stability strategies. To ensure that these copious
amounts of information are leveraged effectively, businesses and other groups are hiring data scientists to
help collect, store, and analyze pertinent information. While these professionals come from a variety of
backgrounds, the growing field of data science provides a number of rewarding opportunities specifically for
engineers. The fields overlap in significant ways, which often makes professionals with an engineering
background a good fit for a role working with data.

Why Data Analytics is Gaining HYPE in the 21st Century (https://towardsdatascience.com)

What Is Data Analysis?

Data analysis involves gathering and studying data to form insights that can be used to make
decisions. The information derived can be useful in several different ways, such as for building a business
strategy or ensuring the safety and efficiency of an engineering project. Data collection and analysis is
becoming increasingly important across most every industry. Fields that collect this information include
marketing, sports, entertainment, medicine, communications, government, criminal justice, electronics, and
aerospace. Data can help companies make decisions on issues as diverse as how to engage their target


audiences, what purchases to make, and how to organize their staff members. Ultimately, data science is not
just about collecting and analyzing information. It is about being able to predict the future and verify the
results of past decisions.

Data Science and Engineering

Engineering is one industry that has been particularly influenced by the growing need for data
collection and analysis. As big data has begun to play a larger role in industries around the world, engineers
have been called on to play an influential role in the way this information is gathered, stored, and leveraged.
Professionals with an engineering background generally prove to be particularly adept at developing
techniques for analyzing data groups to extract valuable information.
To succeed in a career as a data scientist, an engineer should possess the following qualifications:
Analytics expertise: Experience extrapolating information from large quantities of numbers will help you
succeed in this role. Depending on where you work, knowledge of specific analytic tools will also
likely be required.
Computer knowledge: Gone are the days of crunching numbers on a hand-held calculator — much less with
pen and paper. The vast majority of your day will be spent working on a computer, so knowledge of
coding, unstructured data, and cloud tools will increase your marketability.
Communication skills: It is important to be able to present your findings in a clear and concise way to ensure
that your employer understands the information and can act accordingly.
Strong drive: In data science, you should regularly be looking for ways to improve how information is
collected and processed. Being an intellectually curious self-starter will take you far in this role.

Exercise No. 1
Determine what the key terms refer to in the following study. We want to know the average (mean) amount
of money first year college students spend at ABC College on school supplies that do not include books.
We randomly surveyed 100 first year students at the college. Three of those students spent $150, $200,
and $225, respectively.

Answer:
• The population is all first-year students attending ABC College this term.


• The sample could be all students enrolled in one section of a beginning statistics course at ABC
College (although this sample may not represent the entire population).
• The parameter is the average (mean) amount of money spent (excluding books) by first year college
students at ABC College this term.
• The statistic is the average (mean) amount of money spent (excluding books) by first year college
students in the sample.
• The variable could be the amount of money spent (excluding books) by one first year student. Let X =
the amount of money spent (excluding books) by one first year student attending ABC College.
• The data are the dollar amounts spent by the first-year students. Examples of the data are $150, $200,
and $225.

Exercise No. 2
Determine what the key terms refer to in the following study.
A study was conducted at a local college to analyze the average cumulative GPAs of students who graduated
last year. Fill in the letter of the phrase that best describes each of the items below.
1. Population_____ 2. Statistic _____ 3. Parameter _____ 4. Sample _____ 5. Variable _____
6. Data _____
a) all students who attended the college last year
b) the cumulative GPA of one student who graduated from the college last year
c) 3.65, 2.80, 1.50, 3.90
d) a group of students who graduated from the college last year, randomly selected
e) the average cumulative GPA of students who graduated from the college last year
f) all students who graduated from the college last year
g) the average cumulative GPA of students in the study who graduated from the college last year

Answer:
1. f; 2. g; 3. e; 4. d; 5. b; 6. c


1.2 Methods of Collecting Data


Two Types of Data
Qualitative data pertain to observations that can be categorized according to some characteristic or quality, such as
sex, civil status, occupation, religion, and other personal data. Quantitative data pertain to observations
that can be measured, such as weight, height, pulse rate, blood pressure, and heart rate. Quantitative data can be
further subdivided into discrete and continuous data.
• Discrete data are those expressed as whole numbers or integers, such as a specific numerical test score
(90 out of 100 points).
• Continuous data are those that fall into the category of being "measured to the nearest" unit, such as height,
weight, and age.

Scales of Measurement
When gathering data by any method, measurements are usually obtained (e.g., height in inches,
weight in pounds, age in years, I.Q. scores, temperature in degrees Celsius, incidence rates, mortality rates,
etc.) Measurements are classified into four scales. In selecting the statistical tool to be used for drawing
inferences on a random sample, the type of measurement scale must be carefully chosen.

1. Nominal Scale
A measurement that classifies elements into two or more categories or classes. The numbers indicate that
the elements are different, but the difference is not according to order or magnitude.
Example: Distribution of Patients in XYZ Hospital According to Religion and Gender

2. Ordinal Scale
A measurement scale that ranks individuals in terms of the degree to which they possess a characteristic.


Example: Anxiety Level of Mentally-Retarded Patients in Hospital XYZ

Legend:
0= not anxious
1= low anxiety level
2= moderate anxiety level
3= high anxiety level

3. Interval Scale
A measurement scale that, in addition to ordering scores from highest to lowest, establishes a uniform
unit in the scale so that any distance between two consecutive scores is of equal magnitude.
Example:
The difference between aptitude scores of 80 and 90 is of the same magnitude as the difference between
aptitude scores of 90 and 100. There is also no absolute zero in the scale. For example, a temperature
reading of 0 degrees Celsius does not mean that there is no temperature in that place.

4. Ratio Scale
A measurement scale that, in addition to being an interval scale, also has an absolute zero in the scale.
Examples:
Height, weight, area, volume, speed, rate of doing work, amount of money deposited in a bank.
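One way to see the practical difference between interval and ratio scales is with temperature. A small Python sketch (the readings are arbitrary example values):

```python
# Celsius is an interval scale: zero is arbitrary, so differences are
# meaningful but ratios are not.
c1, c2 = 10.0, 20.0
print(c2 - c1)  # 10.0 -> a meaningful 10-degree difference

# Kelvin is a ratio scale: it has an absolute zero, so ratios make sense.
k1, k2 = c1 + 273.15, c2 + 273.15
print(round(k2 / k1, 3))  # 1.035 -> 20 C is NOT "twice as hot" as 10 C
```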

Classification of Data According to Source:


1. Primary Data – are those gathered from primary sources such as individual persons, organized groups,
documents in their original form, and living organisms such as animals, fowls, etc.
2. Secondary Data – are those gathered from secondary sources such as books (including dictionaries and
encyclopedias), articles published in professional journals, unpublished master's theses, and all other
second-hand sources.


Ways of Collecting Data

There is no formula for selecting the best method to be used when gathering data. It depends on the
researcher’s design of the study, the type of data, the time allotment to complete the study, and the
researcher’s financial capacity.
There are several ways of collecting data, among which are the following: clerical tools and mechanical
devices.
A. Clerical Tools
1. Questionnaire - Defined by Good as a list of planned, well-written questions related
to a particular topic, with space provided for indicating the response to each question, intended for
submission to a number of persons for reply; commonly used in a normative survey and in the
measurement of attitudes and opinions

Construction of Questionnaires:
1. Doing library search
2. Talking to knowledgeable people
3. Mastering the guidelines
4. Writing the questionnaire
5. Editing the questionnaire
6. Rewriting the questionnaire
7. Pretesting the questionnaire (dry run)

Types of Questions Asked in Survey Questionnaires


The types of questions asked in questionnaires for survey purposes are:
A. According to Form
1. The free response type
2. The guided response type
B. According to the kind of data asked for
a. Descriptive
b. Quantified
c. Intensity of feelings, emotion and attitude
d. Degree of judgment


e. Understanding

2. Interview - One of the major techniques in gathering data or information. It is a purposeful face-to-
face relationship between two persons.
Types of Interviews
a. Direct Method- The researcher personally interviews the respondents.
b. Indirect Method- The researchers may use a telephone to interview the respondents.

Classes of Interview
a. Standardized - The interviewer is not allowed to change the specific wording of the questions in the
interview schedule. He must conduct all interviews in the same manner, and he cannot adapt
questions to specific situations or probe beyond the prepared statements.
b. Non-standardized - The interviewer has complete freedom to develop each interview in the most
appropriate manner for each situation. He is not held to any specific questions. This is the same as
the so-called informal interview.
c. Semi-standardized - The interviewer is required to ask a number of specific major questions, and
beyond these he is free to probe as he chooses. There are prepared principal questions to be asked,
and once these are asked and answered, the interviewer is free to ask any questions he sees fit for
the situation.
d. Focused- Also called depth interview. Similar to non-standardized interview, the researcher asks a
series of questions based on his previous understanding and insight of the situation. The interview
is focused on specific topics that are to be investigated in depth.
e. Nondirective- The interviewee or subject is allowed and even encouraged to express his feelings
without the fear of disapproval. The subject can express his feelings and views on certain topics even
without waiting to be questioned or even without pressure from the interviewer.

3. Empirical Observation Method - A means of gathering information for research; it may be defined as
perceiving data through the senses: sight, hearing, taste, touch, and smell. The sense of sight is the most
important and the most used among the senses.
Types of Observation
a. Participant and non-participant observation
1. Participant- Observer takes active part in the activities of the group being observed.


2. Non-participant observation - The observer is a mere bystander observing the group he is studying.
b. Structured and unstructured observation
1. Structured – concentrates on a particular aspect/s of the variable being observed, be it a thing,
behavior, condition, or situation. Usually used in non-participant or controlled observation.
2. Unstructured – observer does not hold any list of the items to be observed. Usually used in
participant or uncontrolled observation.
c. Controlled and uncontrolled observation
1. Controlled – usually utilized in experimental studies in which the experimental as well as the
non-experimental variables are controlled by the researcher.
2. Uncontrolled – usually utilized in natural settings, where no control is placed upon any
variable.
4. Registration, Test, Experiment and Library
B. Mechanical devices
Microscopes, Thermometers, Cameras, etc.

Exercise No. 3
Identify each quantitative variable as discrete or continuous. Write D if discrete or C if continuous
1. The boiling point of water is 100 deg. Cel.
2. Length of hair of female students.
3. Number of foreigners migrating to the Philippines every year
4. Her home telephone number is 2581376.
5. The number of children with missing/decayed teeth in barangay A is 2000.
6. John’s height is 168 cm.
7. The following data are the densities of sample substances taken from Tabing-Ilog River (g/cc):
23.6, 19.8, 15.0, 7.8 and 2.4.
8. Weights in pounds of the Math quiz contestants.
9. The average speed of motorboats cruising in Manila Bay every day is 50m/s.
10. Scores of 16 students in a Statistics Quiz.


Selection Bias
This bias occurs when the sample selected does not reflect the population of interest. For
instance, suppose you are interested in the attitudes of female students regarding campus safety, but when
sampling you also include males. In this case, your population of interest was female students;
however, your sample included subjects not in that population (i.e., males).

A. Strategies for Collecting Data


How can we get data? How do we select observations or measurements for a study?
There are two types of methods for collecting data, non-probability methods and probability
methods.
a. Non-probability Methods
These might include:
1. Convenience sampling (haphazard): Collecting data from subjects who are conveniently
obtained.
Example:
Surveying students as they pass by in the university's student union building.

2. Gathering volunteers: Collecting data from subjects who volunteer to provide data.
Example:
Using an advertisement in a magazine or on a website inviting people to complete a form or
participate in a study.
b. Probability Methods
1. Simple random sample: making selections from a population where each subject in the
population has an equal chance of being selected.
2. Stratified random sample: where you first identify the population of interest, then
divide this population into strata or groups based on some characteristic (e.g., sex,
geographic region), then perform a simple random sample within each stratum.
3. Cluster sample: where a random cluster of subjects is taken from the population of interest.
For instance, if we were to estimate the average salary for faculty members at Penn State -
University Park Campus, we could take a simple random sample of departments and find
the salary of each faculty member within the sampled department. This would be our cluster
sample.
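The three probability methods can be sketched in a few lines of Python, using a made-up passenger roster (the names and class sizes are hypothetical):

```python
import random

random.seed(1)  # reproducible illustration

# Hypothetical roster: 12 passengers labeled with the class they fly.
passengers = [(f"P{i:02d}", cls) for i, cls in
              enumerate(["first"] * 2 + ["business"] * 4 + ["economy"] * 6)]

# 1. Simple random sample: each passenger has an equal chance of selection.
srs = random.sample(passengers, 4)

# 2. Stratified random sample: divide into strata by class, then sample
#    within each stratum.
strata = {}
for name, cls in passengers:
    strata.setdefault(cls, []).append(name)
stratified = {cls: random.sample(names, 1) for cls, names in strata.items()}

# 3. Cluster sample: randomly pick one whole class and survey everyone in it.
cluster_class = random.choice(list(strata))
cluster = strata[cluster_class]

print(len(srs), sorted(strata), cluster_class, len(cluster))
```

Note how stratification guarantees every class appears in the sample, while the cluster sample keeps entire intact groups, which is often cheaper to collect.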

There are advantages and disadvantages to both types of methods. Non-probability methods
are often easier and cheaper to facilitate. When non-probability methods are used it is often the case
that the sample is not representative of the population. If it is not representative, you can make
generalizations only about the sample, not the population. The primary benefit of using probability
sampling methods is the ability to make inference. We can assume that by using random sampling


we attain a representative sample of the population. The results can be "extended" or "generalized"
to the population from which the sample came.

Example 1-1: Survey Methods

Airline Company Survey of Passengers


Let's say that you are the owner of a large airline company and you live in Los Angeles. You want to survey
your L.A. passengers on what they like and dislike about traveling on your airline. For each of the methods,
determine if a non-probability method or a probability method is used. Then determine the type of sampling.
a. Since you live in L.A. you go to the airport and just interview passengers as they approach your
ticket counter.
Answer: Non-probability method; convenience sampling.
b. You have your ticket counter personnel distribute a questionnaire to each passenger, requesting
that they complete the survey and return it at the end of the flight.
Answer: Non-probability methods; Volunteer sampling
c. You randomly select a set of passengers flying on your airline and question those that you have
selected.
Answer: Probability method; Simple random sampling
d. You group your passengers by the class they fly (first, business, economy), and then take a random
sample from each of these groups.
Answer: Probability method: Stratified sampling
e. You group your passengers by the class they fly (first, business, economy) and randomly select
such classes from various flights and survey each passenger in that class and flight selected.
Answer: Probability method; Cluster sampling

1.3 Planning and Conducting Surveys:


Introduction to Design of Experiments
Design of Experiments
Do you remember learning about this back in high school or junior high even? What were those steps
again? Decide what phenomenon you wish to investigate. Specify how you can manipulate the factor and
hold all other conditions fixed, to ensure that these extraneous conditions aren't influencing the response
you plan to measure.


Then measure your chosen response variable at several (at least two) settings of the factor under study. If
changing the factor causes the phenomenon to change, then you conclude that there is indeed a cause-and-
effect relationship at work.
How many factors are involved when you do an experiment? Some say two - perhaps this is a comparative
experiment? Perhaps there is a treatment group and a control group? If you have a treatment group and a
control group then, in this case, you probably only have one factor with two levels.

Engineering Experiments
If we had infinite time and resource budgets there probably wouldn't be a big fuss made over designing
experiments. In production and quality control we want to control the error and learn as much as we can
about the process or the underlying theory with the resources at hand. From an engineering perspective we're
trying to use experimentation for the following purposes:
a. reduce time to design/develop new products & processes
b. improve performance of existing processes
c. improve reliability and performance of products
d. achieve product & process robustness
e. perform evaluation of materials, design alternatives, setting component & system tolerances, etc.
We always want to fine-tune or improve the process. In today's global world this drive for
competitiveness affects all of us both as consumers and producers.
Robustness is a concept that enters into statistics at several points. At the analysis stage, robustness refers
to a technique that isn't overly influenced by bad data. Even if there is an outlier or bad data you still want to
get the right answer. Regardless of who or what is involved in the process - it is still going to work.


Every experiment design has inputs. Back to the cake baking example: we have our ingredients such
as flour, sugar, milk, eggs, etc. Regardless of the quality of these ingredients we still want our cake to come
out successfully. In every experiment there are inputs and in addition, there are factors (such as time of baking,
temperature, geometry of the cake pan, etc.), some of which you can control and others that you can't control.
The experimenter must think about factors that affect the outcome. We also talk about the output and the
yield or the response to your experiment. For the cake, the output might be measured as texture, flavor,
height, or size.

The Basic Principles of DOE

Randomization
This is an essential component of any experiment that is going to have validity. If you are doing a comparative
experiment where you have two treatments, a treatment and a control, for instance, you need to include in
your experimental process the assignment of those treatments by some random process. An experiment
includes experimental units. You need to have a deliberate process to eliminate potential biases from the
conclusions, and random assignment is a critical step.

Replication
Replication is in some sense the heart of all of statistics. To make this point... Remember what the standard
error of the mean is? It is the square root of the estimate of the variance of the sample mean, i.e., s/√n, where
s is the sample standard deviation and n is the sample size. The width
of the confidence interval is determined by this statistic. Our estimates of the mean become less variable as
the sample size increases. Replication is the basic issue behind every method we will use in order to get a
handle on how precise our estimates are at the end. We always want to estimate or control the uncertainty in
our results. We achieve this estimate through replication. Another way we can achieve short confidence
intervals is by reducing the error variance itself. However, when that isn't possible, we can reduce the error in
our estimate of the mean by increasing n.
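The effect of replication can be shown numerically: the standard error s/√n shrinks as the sample size n grows. A small Python sketch (the distribution parameters are arbitrary):

```python
import math
import random

random.seed(42)  # reproducible illustration

def standard_error(xs):
    """Estimate the standard error of the mean: s / sqrt(n)."""
    n = len(xs)
    mean = sum(xs) / n
    s2 = sum((x - mean) ** 2 for x in xs) / (n - 1)  # sample variance
    return math.sqrt(s2 / n)

# Larger samples from the same process give a smaller standard error,
# hence shorter confidence intervals.
for n in (10, 100, 1000):
    sample = [random.gauss(50, 5) for _ in range(n)]
    print(n, round(standard_error(sample), 3))
```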

Another way to reduce the size or length of the confidence interval is to reduce the error variance itself -
which brings us to blocking.

Blocking


Blocking is a technique to include other factors in our experiment which contribute to undesirable variation.
Much of the focus in this class will be to creatively use various blocking techniques to control sources of
variation that will reduce error variance. For example, in human studies the gender of the subjects is
often an important factor, and age is another factor affecting the response. Age and gender are often
considered nuisance factors which contribute to variability and make it difficult to assess systematic
effects of a treatment. By using them as blocking factors, you can avoid biases that might arise from
how subjects are allocated to the treatments, and you account for some of the noise in the experiment.
We want the unknown error variance at the end of the experiment to be as small as possible. Our goal
is usually to learn something about a treatment factor (the factor of primary interest), but in addition
we want to include any blocking factors that will explain variation.

Steps for Planning, Conducting and Analyzing an Experiment


The practical steps needed for planning and conducting an experiment include: recognizing the goal of the
experiment, choice of factors, choice of response, choice of the design, analysis and then drawing conclusions.
This pretty much covers the steps involved in the scientific method.
a. Recognition and statement of the problem
b. Choice of factors, levels, and ranges
c. Selection of the response variable(s)
d. Choice of design
e. Conducting the experiment
f. Statistical analysis
g. Drawing conclusions, and making recommendations
Factors
We usually talk about "treatment" factors, which are the factors of primary interest to you. In addition to
treatment factors, there are nuisance factors which are not your primary focus, but you have to deal with
them. Sometimes these are called blocking factors, mainly because we will try to block on these factors to
prevent them from influencing the results.

There are other ways that we can categorize factors: Experimental vs. Classification Factors

1. Experimental Factors

These are factors that you can specify (and set the levels of) and then assign at random as the treatment
to the experimental units. Examples would be temperature, level of an additive, fertilizer amount per acre, etc.

2. Classification Factors
These can't be changed or assigned; they come as labels on the experimental units. The age and sex of the
participants are classification factors which can't be changed or randomly assigned, but you can select
individuals from these groups randomly.
a. Quantitative Factors- You can assign any specified level of a quantitative factor. Examples: percent
or pH level of a chemical.
b. Qualitative Factors- These factors have categories of different types. Examples might be the
species of a plant or animal, a brand in the marketing field, or gender; these are not ordered or
continuous but are arranged perhaps in sets.

Summarizing One Qualitative Variable

Once we determine that a variable is Qualitative (or Categorical), we need tools to summarize
the data. We can summarize the data by using frequencies and by graphing the data.
Let’s start with an example. In a class of 30 students, a survey question asked the students
to indicate their eye color. The responses are shown in the table.
Hazel Brown Brown Brown
Brown Brown Brown Brown
Brown Brown Brown Brown
Blue Brown Brown Brown
Brown Brown Brown Brown

From this list, we can clearly see that the eye color brown is the most common. Which is
more frequent, Hazel or Green? It may take only a few seconds to answer the question, but what if
there were 100 students? Or 1000? The best way to summarize categorical data is to use frequencies
and percentages (or proportions).

Proportion- A proportion is a fraction or part of the total that possesses a certain characteristic.

A frequency table like the one below makes the summary much easier to read.

Eye Color   Frequency   Percentage
Brown       24          80%
Blue        3           10%
Hazel       2           6.67%
Green       1           3.33%

The table is much easier to read than the actual data. It is clear that more students
have Hazel than Green eyes in the class. And, as the saying goes, "A picture is worth 1000 words," it is
also helpful to visualize the data in a graph.
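A minimal sketch of building such a frequency table with Python's standard library (the color counts are reconstructed from the frequency table above):

```python
from collections import Counter

# The 30 eye-color responses, reconstructed from the frequency table above.
colors = ["Brown"] * 24 + ["Blue"] * 3 + ["Hazel"] * 2 + ["Green"] * 1

counts = Counter(colors)
total = sum(counts.values())
for color, freq in counts.most_common():  # most frequent category first
    print(f"{color:6s} {freq:3d} {100 * freq / total:6.2f}%")
```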

Measures of Central Tendency

A. Mean, Median, and Mode


A measure of central tendency is an important aspect of quantitative data. It is an estimate
of a “typical” value. Three of the many ways to measure central tendency are
the mean, median and mode.

There are other measures, such as a trimmed mean, that we do not discuss here.

Mean
The mean is the arithmetic average of the data: add all of the observations and divide by the
number of observations, x̄ = (1/n) Σ xᵢ.

The sample mean x̄ is a statistic and the population mean μ is a parameter.

Median
The median is the middle value of the ordered data. The most important step in finding the median
is to first order the data from smallest to largest.

Steps to finding the median for a set of data:


1. Arrange the data in increasing order, i.e. smallest to largest.
2. Find the location of the median in the ordered data by (n + 1)/2, where n is the sample size.

3. The value that represents the location found in Step 2 is the median.

Note on Odd or Even Sample Sizes


If the sample size is an odd number, then the location point will produce a median that is an
observed value. If the sample size is an even number, then the location will require one to take the
mean of two numbers to calculate the median. The result may or may not be an observed value as
the example below illustrates.

Mode
The mode is the value that occurs most often in the data. It is important to note that there may be
more than one mode in the dataset.
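As a quick sketch, Python's standard `statistics` module computes all three measures directly (the data here are made up for illustration):

```python
import statistics

data = [3, 7, 7, 2, 9]  # a small hypothetical sample

print(statistics.mean(data))    # 5.6
print(statistics.median(data))  # 7: the middle value of the ordered data 2, 3, 7, 7, 9
print(statistics.mode(data))    # 7: the most frequent value
```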

Effects of Outliers

One shortcoming of the mean is that means are easily affected by extreme values. Measures
that are not that affected by extreme values are called resistant. Measures that are affected by extreme
values are called sensitive.

Example 1-6: Test Scores Cont'd...

Using the data from Example 1-5, how would the mean and median change, if the entry 91
is mistakenly recorded as 9?

Answer:
The data set would be
9, 69, 76, 76, 78, 80, 82, 86, 88, 95

Mean
The mean would be x̄ = (1/10)(9 + 78 + 69 + 95 + 82 + 76 + 76 + 86 + 88 + 80) = 73.9
The mean would be 73.9, which is very different from 82.1.

Median
Let us see the effect of the mistake on the median value.
The data set (with 91 coded as 9) in increasing order is:
9, 69, 76, 76, 78, 80, 82, 86, 88, 95
where the median = 79
The medians of the two sets are not that different. Therefore, the median is not that affected by the
extreme value 9.
The mean is a sensitive measure (or sensitive statistic) and the median is a resistant measure
(or resistant statistic).
After reading this lesson, you should know that there are quite a few options when one wants
to describe central tendency. In future lessons we will talk mainly about the mean. However, we
need to be aware of one of its shortcomings, which is that it is easily affected by extreme values.
Unless data points are known mistakes, one should not remove them from the data set! One should
keep the extreme points and use more resistant measures. For example, use the sample median to
estimate the population median.
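The sensitivity of the mean is easy to verify numerically. A sketch using the ten test scores from this example (the correctly recorded data, and the version with 91 mistyped as 9):

```python
import statistics

correct = [69, 76, 76, 78, 80, 82, 86, 88, 91, 95]
mistyped = [9, 69, 76, 76, 78, 80, 82, 86, 88, 95]  # 91 recorded as 9

print(statistics.mean(correct), statistics.median(correct))    # 82.1 81.0
print(statistics.mean(mistyped), statistics.median(mistyped))  # 73.9 79.0
```

The single bad value drags the mean down by more than eight points, while the median moves by only two.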

Adding and Multiplying Constants

What happens to the mean and median if we add or multiply each observation in a data set
by a constant?
Consider for example if an instructor curves an exam by adding five points to each student’s
score. What effect does this have on the mean and the median? The result of adding a constant to
each value has the intended effect of altering the mean and median by the constant.

For example, if in the above example where we have 10 aptitude scores, if 5 was added to
each score the mean of this new data set would be 87.1 (the original mean of 82.1 plus 5) and the
new median would be 86 (the original median of 81 plus 5).

Similarly, if each observed data value were multiplied by a constant, the new mean and median
would change by a factor of this constant. Returning to the 10 aptitude scores, if all of the original
scores were doubled, then the new mean and new median would be double the original mean
and median. As we will learn shortly, the effect is not the same on the variance!
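These effects can be checked directly. A sketch using the same ten aptitude scores (mean 82.1, median 81):

```python
import statistics

scores = [69, 76, 76, 78, 80, 82, 86, 88, 91, 95]  # mean 82.1, median 81

shifted = [x + 5 for x in scores]
print(statistics.mean(shifted), statistics.median(shifted))  # 87.1 86.0

doubled = [2 * x for x in scores]
print(statistics.mean(doubled), statistics.median(doubled))  # 164.2 162.0

# Adding a constant leaves the variance unchanged; multiplying by c scales it by c squared.
print(statistics.variance(shifted) == statistics.variance(scores))
print(abs(statistics.variance(doubled) - 4 * statistics.variance(scores)) < 1e-9)
```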

Looking Ahead!
Why would you want to know this? One reason, especially for those moving onward to more
applied statistics (e.g. regression, ANOVA), is transforming data. For many applied statistical
methods, a required assumption is that the data are normal, or very nearly bell-shaped. When the data
are not normal, statisticians will transform the data using numerous techniques, e.g. a logarithmic
transformation. We just need to remember that the original data were transformed!

Shape
The shape of the data helps us to determine the most appropriate measure of central
tendency. The three most important descriptions of shape are Symmetric, Left-skewed, and Right-
skewed. Skewness is a measure of the degree of asymmetry of the distribution.

Note! When one has very skewed data, it is better to use the median as the measure of central tendency,
since the median is not much affected by extreme values.

Measures of Position

While measures of central tendency are important, they do not tell the whole story. For
example, suppose the mean score on a statistics exam is 80%. From this information, can we
determine a range in which most people scored? The answer is no. There are two other types of
measures, measures of position and variability, that help paint a more complete picture of what is
going on in the data. In this section we will consider measures of position and discuss measures
of variability in the next one.

Measures of position give a range where a certain percentage of the data fall. The measures
we consider here are percentiles and quartiles.

Percentiles

The pth percentile of the data set is a measurement such that after the data are ordered from
smallest to largest, at most, p% of the data are at or below this value and at most, (100 - p) % at or
above it.
A common application of percentiles is their use in determining passing or failure cutoffs
for standardized exams such as the GRE. If you have a 95th percentile score then you are at or above
95% of all test takers.

The median is the value where fifty percent of the data values fall at or below it. Therefore,
the median is the 50th percentile.

We can find any percentile we wish. There are two other important percentiles: the 25th
percentile, typically denoted Q1, and the 75th percentile, typically denoted Q3. Q1 is commonly
called the lower quartile and Q3 is commonly called the upper quartile.

Finding Quartiles
The method we will demonstrate for calculating Q1 and Q3 may differ from the method
described in our textbook. The results shown here will always be the same as Minitab's results. The
method here is also different from the method presented in many undergraduate statistics courses.
This method is what we require students to use.

There are two steps to follow:


1. Find the location of the desired quartile
If there are n observations, arranged in increasing order, then the first quartile is at
position (n+1)/4, second quartile (i.e. the median) is at position 2(n+1)/4, and the third quartile
is at position 3(n+1)/4.
2. Find the value in that position for the ordered data.

Note! If the value found in part 1 is not a whole number, interpolate the value.
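The two steps above can be sketched in code (the function name is a hypothetical helper, not a library routine; the data are made up):

```python
def quartile(data, q):
    """q = 1, 2, or 3. Position rule: q * (n + 1) / 4, interpolating between neighbors."""
    xs = sorted(data)
    pos = q * (len(xs) + 1) / 4  # 1-based location in the ordered data
    lo = int(pos)                # whole part of the location
    frac = pos - lo              # fractional part; zero when the location is whole
    if frac == 0:
        return xs[lo - 1]
    return xs[lo - 1] + frac * (xs[lo] - xs[lo - 1])

data = [1, 3, 5, 7, 9, 11, 13, 15]  # made-up sample, n = 8
print(quartile(data, 1), quartile(data, 2), quartile(data, 3))  # 3.5 8.0 12.5
```

Note that other textbooks and software packages use slightly different quartile conventions, which is why results can differ between tools.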

The 5 - Number Summary

The Five-Number Summary:


A helpful summary of the data is called the five number summary. The five number
summary consists of five values:
1. The minimum
2. The lower quartile, Q1
3. The median (also known as Q2)
4. The upper quartile, Q3
5. The maximum

Measures of Variability

To introduce the idea of variability, consider this example. Two vending machines, A and B,
drop candies when a quarter is inserted. The number of pieces of candy one gets
is random. The following data are recorded for six trials at each vending machine:

The dot plot for the pieces of candy from vending machine A and vending machine B is
displayed in figure 1.4.

They have the same center, but what about their spreads?

Measures of Variability

There are many ways to describe variability or spread including:


• Range
• Interquartile range (IQR)
• Variance and Standard Deviation

Range
The range is the difference in the maximum and minimum values of a data set. The
maximum is the largest value in the dataset and the minimum is the smallest value. The range is easy
to calculate but it is very much affected by extreme values.

𝑅𝑎𝑛𝑔𝑒 = 𝑚𝑎𝑥𝑖𝑚𝑢𝑚 − 𝑚𝑖𝑛𝑖𝑚𝑢𝑚

Like the range, the IQR is a measure of variability, but you must find the quartiles in order
to compute its value.

Interquartile Range (IQR)


The interquartile range is the difference between upper and lower quartiles and denoted
as IQR.
IQR=Q3−Q1=upper quartile−lower quartile=75th percentile−25th percentile

Note! The IQR is not affected by extreme values. It is thus a resistant measure of variability.

Variance and Standard Deviation

One way to describe spread or variability is to compute the standard deviation. In the
following section, we are going to talk about how to compute the sample variance and the sample
standard deviation for a data set. The standard deviation is the square root of the variance.

Variance- the average squared distance from the mean

The population variance σ² is often estimated by using the sample variance, s² = Σ(xᵢ − x̄)² / (n − 1).

Why do we divide by n−1 instead of by n?

When we calculate the sample standard deviation, we estimate the population mean with the sample
mean. Dividing by (n − 1) rather than n gives the estimator a special property: it is "unbiased".
Therefore, s² is an unbiased estimator for the population variance.
The sample variance (and therefore the sample standard deviation) is the common default
calculation used by software. When asked to calculate the variance or standard deviation of a set of
data, assume, unless otherwise instructed, that it is sample data, and therefore calculate the sample
variance and sample standard deviation.

Calculate the variance for these final exam scores.


24, 58, 61, 67, 71, 73, 76, 79, 82, 83, 85, 87, 88, 88, 92, 93, 94, 97

Answer

First, find the mean:


x̄ = (24 + 58 + 61 + 67 + 71 + 73 + 76 + 79 + 82 + 83 + 85 + 87 + 88 + 88 + 92 + 93 + 94 + 97)/18 = 1398/18 = 233/3 ≈ 77.67

Finally,
s² = Σᵢ (xᵢ − x̄)² / (18 − 1) = (46908/9) / 17 = 5212/17 ≈ 306.588

Try it!
Calculate the sample variances for the data sets from vending machines A and B yourself,
and check that the variance for B is smaller than the variance for A.

a. 1, 2, 3, 3, 4, 5
b. 2, 3, 3, 3, 3, 4
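A quick sketch to check your answers with Python's standard `statistics` module:

```python
import statistics

machine_a = [1, 2, 3, 3, 4, 5]
machine_b = [2, 3, 3, 3, 3, 4]

var_a = statistics.variance(machine_a)  # sample variance: 10 / 5
var_b = statistics.variance(machine_b)  # sample variance: 2 / 5
print(var_a, var_b)
print(var_b < var_a)  # True: B is less variable, even though both have mean 3
```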

Standard Deviation
The standard deviation is a very useful measure. One reason is that it has the same unit of
measurement as the data itself (e.g. if a sample of student heights were in inches, then so, too, would
be the standard deviation; the variance would be in squared units, for example inches²). Also, the
empirical rule, which will be explained later, makes the standard deviation an important yardstick
for finding out approximately what percentage of the measurements fall within certain intervals.

Standard Deviation
Approximately the average distance of the values of a data set from the mean; equivalently, the square
root of the variance.

Now that we have discussed how to find summary statistics for quantitative variables, the next step
is to graph the data. The graphs we will discuss include the dotplot, the stem-and-leaf diagram, the
histogram, and the boxplot.

Dotplot

Stem-and-Leaf Diagrams

To produce the diagram, the data need to be grouped based on the "stem", which depends
on the number of digits of the quantitative variable. The "leaves" represent the last digit. One
advantage of this diagram is that the original data can be recovered (except for the order in which the
data were taken) from the diagram.

The first column, called depths, is used to display cumulative frequencies. Starting from
the top, the depths indicate the number of observations that lie in a given row or before. For
example, the 11 in the third row indicates that there are 11 observations in the first three rows. The
row that contains the middle observation is denoted by having a bracketed number of observations
in that row; (7) for our example. We thus know that the middle value lies in the fourth row. The
depths following that row indicate the number of observations that lie in a given row or after. For
example, the 4 in the seventh row indicates that there are four observations in the last three rows.

Histograms

If there are many data points and we would like to see the distribution of the data, we can
represent the data by a frequency histogram or a relative frequency histogram. A histogram looks
similar to a bar chart but it is for quantitative data.

To create a histogram, the data need to be grouped into class intervals. Then create a tally to
show the frequency (or relative frequency) of the data into each interval. The relative frequency is
the frequency in a particular class divided by the total number of observations. The bars are as wide
as the class interval and as tall as the frequency (or relative frequency).
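The grouping step can be sketched in a few lines of Python. The measurements here are made up for illustration; the class intervals follow the width-2, .5-endpoint scheme used in the example below, with the left endpoint included and the right excluded:

```python
# Hypothetical measurements in the 132-148 range used in the histogram example.
data = [132, 133, 135, 136, 136, 138, 140, 141, 141, 143, 145, 148]

edges = [131.5 + 2 * k for k in range(10)]  # 131.5, 133.5, ..., 149.5
for lo, hi in zip(edges, edges[1:]):
    freq = sum(1 for x in data if lo <= x < hi)  # include left, exclude right
    print(f"{lo}-{hi}: freq {freq}, relative freq {freq / len(data):.3f}")
```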

Answer
For histograms, we usually want to have from 5 to 20 intervals. Since the data range is from
132 to 148, it is convenient to have a class of width 2 since that will give us 9 intervals.

• 131.5-133.5
• 133.5-135.5
• 135.5-137.5
• 137.5-139.5
• 139.5-141.5
• 141.5-143.5
• 143.5-145.5
• 145.5-147.5
• 147.5-149.5

The reason that we choose the end points as .5 is to avoid confusion whether the end point
belongs to the interval to its left or the interval to its right. An alternative is to specify the endpoint
convention. For example, Minitab includes the left end point and excludes the right end point.
Having the intervals, one can construct the frequency table and then draw the frequency
histogram or the relative frequency histogram. The following histogram is produced by Minitab
when we specify the midpoints that define the intervals chosen above.

If we do not specify the midpoint for the definition of intervals, Minitab will default to
choose another set of class intervals resulting in the following histogram. According to the include
left and exclude right endpoint convention, the observation 133 is included in the class 133-135.

Note that different choices of class intervals will result in different histograms. Relative frequency
histograms are constructed in much the same way as frequency histograms except that the vertical
axis represents the relative frequency instead of the frequency. For the purpose of visually comparing
the distributions of two data sets, it is better to use relative frequency histograms, since the same
vertical scale, from 0 to 1, is used for all relative frequencies.

Boxplot
To create this plot we need the five number summary. Therefore, we need:
• minimum value,
• Q1 (lower quartile),

• Q2 (median),
• Q3 (upper quartile), and
• maximum value.
Using the five number summary, one can construct a skeletal boxplot.
1. Mark the five number summary above the horizontal axis with vertical lines.
2. Connect Q1, Q2, Q3 to form a box, then connect the box to the min and max with lines to
form the whiskers.

Most statistical software does NOT draw the skeletal boxplot but instead opts for the
boxplot described below. Boxplots from statistical software are more detailed than skeletal boxplots
because they also show outliers. However, if there are no outliers, what is produced by the software
is essentially the skeletal boxplot.

The following terminology will prepare us to understand and draw this more detailed type of
boxplot.
Potential outliers are observations that lie outside the lower and upper limits.
Lower limit = Q1 - 1.5 * IQR
Upper limit = Q3 +1.5 * IQR
Adjacent values are the most extreme values that are not potential outliers.
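A sketch that puts the five-number summary and the 1.5 × IQR limits together (the quartile helper uses the (n + 1)/4 positioning rule from earlier, one of several common conventions; the data are made up):

```python
import statistics

def quartiles(data):
    """Q1 and Q3 via the q * (n + 1) / 4 position rule with interpolation."""
    xs = sorted(data)
    def at(pos):
        lo, frac = int(pos), pos - int(pos)
        return xs[lo - 1] if frac == 0 else xs[lo - 1] + frac * (xs[lo] - xs[lo - 1])
    n = len(xs)
    return at((n + 1) / 4), at(3 * (n + 1) / 4)

data = [52, 60, 61, 64, 66, 68, 70, 71, 73, 98]  # hypothetical exam scores
q1, q3 = quartiles(data)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

five_number = (min(data), q1, statistics.median(data), q3, max(data))
outliers = [x for x in data if x < lower or x > upper]
print(five_number)  # (52, 60.75, 67.0, 71.5, 98)
print(outliers)     # [98]: only 98 lies outside the limits
```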

Boxplots and Distribution Shapes

Symmetric Data- A symmetric distribution with its corresponding box plot:

Right-Skewed Data- A right-skewed distribution along with its corresponding box plot:

Left-Skewed Data
A left-skewed distribution along with its corresponding box plot:

References:
Illowsky, B., and Dean, S. (2018). Introductory Statistics.
Calderon, J.F., and Gonzales, E.C. (2016). Methods of Research and Thesis Writing.
De Belen, R., and Feliciano, P. (2015). Basic Statistics for Research, 1st Edition.
Pareño, E., and Jimenez, R. (2006). Basic Statistics: A Worktext.
https://online.stat.psu.edu/stat503/book/export/html/632
