Sample Problems On Data Analysis: What Is Your Favorite Class?

Sample problems on Data Analysis

What is your favorite class?


Class data was collected on which subject is your favorite class. Here are the results.
(Please pardon the binary gender identification.)

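                      Male              Female            Total
                      Count   Percent   Count   Percent   Count
Art                   2       20%       4       20%       6
Physical education    1       10%       8       40%       9
Math                  4       40%       2       10%       6
Science               3       30%       6       30%       9
Total                 10                20                30

Percentages are relative to each gender's total, e.g., males who chose art: 2/10 = 20%.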

(a) How many variables does this table have? Are the variables categorical or
quantitative?

Two categorical variables.

(b) Identify the variables.

• The respondent’s gender (OR whether the subject is male or female)


• The respondent’s favorite class.

(c) Find each of the following:


The percentage of all students who chose physical education (marginal relative
frequency).
(9/30)*100% = 30%

The percentage of all students who are male students and who chose art (joint
relative frequency).
(2/30)*100% = 6.7%

The percentage of all male students who chose math (conditional relative
frequency).
(4/10)*100% = 40%
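
The three calculations in part (c) can be checked with a short Python sketch. The counts are copied from the table above; the variable names are illustrative only and not part of the worksheet.

```python
# Counts copied from the two-way table (favorite class by gender).
counts = {
    "Art": {"Male": 2, "Female": 4},
    "Physical education": {"Male": 1, "Female": 8},
    "Math": {"Male": 4, "Female": 2},
    "Science": {"Male": 3, "Female": 6},
}

grand_total = sum(sum(row.values()) for row in counts.values())
male_total = sum(row["Male"] for row in counts.values())

# Marginal: one class total divided by the grand total.
marginal_pe = 100 * sum(counts["Physical education"].values()) / grand_total

# Joint: one cell divided by the grand total.
joint_male_art = 100 * counts["Art"]["Male"] / grand_total

# Conditional: one cell divided by its gender (column) total.
conditional_math_given_male = 100 * counts["Math"]["Male"] / male_total

print(f"Marginal (physical education): {marginal_pe:.1f}%")
print(f"Joint (male and art): {joint_male_art:.1f}%")
print(f"Conditional (math, given male): {conditional_math_given_male:.1f}%")
```

The only difference among the three is the denominator: the grand total for marginal and joint frequencies, the gender total for the conditional one.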

(d) In the table above, calculate the relative frequencies for each gender. Show your
work for at least one calculation.
See the table above.

(e) Sketch side-by-side bar graphs, a segmented bar graph, and a mosaic plot for this
data. For help, you can go to https://www.statsmedic.com/applets (1 Categorical
Variable, Multiple Groups) and enter the class data.

Note: If you stacked all of the bars for the male side-by-side graph, it would look like
the male segmented bar graph.

The bar widths of the mosaic plot are proportional to the number of respondents who are
male and female.

(f) Is there an association between favorite class and gender? Explain.


Yes. There is an association between favorite class and gender.
• Females are four times as likely as males to choose physical education as their
favorite subject (40% vs. 10%).
• Males are four times as likely as females to choose math as their favorite subject
(40% vs. 10%).
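
The "four times as likely" comparisons come from the conditional percentages within each gender. A minimal sketch, assuming the counts from the table above (the function name `conditional_percentages` is hypothetical):

```python
# Counts copied from the two-way table (favorite class by gender).
counts = {
    "Art": {"Male": 2, "Female": 4},
    "Physical education": {"Male": 1, "Female": 8},
    "Math": {"Male": 4, "Female": 2},
    "Science": {"Male": 3, "Female": 6},
}


def conditional_percentages(gender):
    """Percent of respondents of one gender who chose each class."""
    total = sum(row[gender] for row in counts.values())
    return {subject: 100 * row[gender] / total for subject, row in counts.items()}


male_pct = conditional_percentages("Male")
female_pct = conditional_percentages("Female")

# Ratios of conditional percentages between the two groups.
pe_ratio = female_pct["Physical education"] / male_pct["Physical education"]
math_ratio = male_pct["Math"] / female_pct["Math"]

print(f"Physical education, female vs. male: {pe_ratio:.1f} times as likely")
print(f"Math, male vs. female: {math_ratio:.1f} times as likely")
```

Comparing percentages rather than raw counts matters here because there are twice as many female respondents as male respondents.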

(g) If there was no association between favorite class and gender, what would the
graphs look like?
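The corresponding bars would be the same height for males and females, i.e., each
class would have the same conditional relative frequency in both groups.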

What is your pulse rate?
Suppose that we collected data on the pulse rates of 19 high school students when they
were resting and after 5 minutes of running.

(a) Is this variable categorical or quantitative?


Quantitative.

The data are displayed below in three ways: (1) dot plots, (2) back-to-back
stemplots, and (3) histograms.

(b) Discuss the advantages and disadvantages of each graph.


• Dot plots. The dot plot is the most granular display and shows each value in the
distribution. However, dot plots take time to draw.
• Stemplots and histograms. Both show the shape of the distributions, and it is easy
to make comparisons across distributions with them. However, the shape of a
histogram varies depending on the width of the bins, and a histogram conceals
the distribution of data within each bin.
• Stemplots do not work well for large sets of data, particularly if each stem
must hold a large number of leaves. Stemplots also take time to draw.

(c) Write a few sentences comparing the distributions of resting and after-exercise
pulse rates.
Center: The center for after-exercise is higher than the center for resting.

Unusual features: For resting pulse rates, 120 is a potential outlier; for after-
exercise pulse rates, 146 is a potential outlier. We’ll learn a rule for identifying
outliers soon.

Shape: The distributions of resting pulse rates and after-exercise pulse rates are
both skewed to the right.

Spread: The variability of after-exercise is a bit greater (range of 60) than that of
resting (range of 52).

Tips: Include context by referring to “resting” or “after-exercise.” Use comparative
words like “higher” and “greater.”

How many colleges are you applying to?


Ten CHS seniors were discussing the number of colleges to which they were
applying. Here is the data:

3, 3, 4, 4, 4, 5, 5, 6, 9, 12

(a) What percent of the students are applying to 7 or more colleges?

Two of the ten students (those applying to 9 and 12 colleges) are applying to 7 or more:
P(# of colleges ≥ 7) = 2/10 = 0.20, or 20%

(b) Calculate the mean number of colleges. Show your work.


Mean = (3 + 3 + … + 12)/10 = 5.5 colleges
Note: Be sure to include units.

(c) What is the median number of colleges? Describe how you found it.
Median = 4.5 colleges
Put the data in order. Find the average of 5th and 6th observations.
If there is an even number of values, find the average of the two middle values.
If there is an odd number of values, find the middle value.
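
As a check on parts (a) through (c), here is a short Python sketch using the standard `statistics` module. The list name `colleges` is just a label chosen for this example.

```python
from statistics import mean, median

# Number of colleges each of the ten seniors is applying to.
colleges = [3, 3, 4, 4, 4, 5, 5, 6, 9, 12]

# (a) Percent of students applying to 7 or more colleges.
pct_7_or_more = 100 * sum(1 for x in colleges if x >= 7) / len(colleges)

# (b) Mean: the sum of the values divided by how many there are.
mean_colleges = mean(colleges)

# (c) Median: with an even number of values, the average of the two middle ones.
median_colleges = median(colleges)

print(f"7 or more: {pct_7_or_more:.0f}%")
print(f"Mean: {mean_colleges} colleges")
print(f"Median: {median_colleges} colleges")
```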

(d) What is Q1 and Q3? Describe how you found them.


Q1 = 4 colleges Q3 = 6 colleges
Q1 is the median of the lower half of the data; Q3 is the median of the upper half of
the data.
Note: If the median is one of the values in the data set (odd number of values), do
not include the median as part of the lower half or upper half of the data.

(e) Using your calculator, calculate the following values (five number summary).
Minimum: 3 Q1: 4 Median: 4.5 Q3: 6 Maximum: 12
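
The "median of each half" rule from part (d) can be sketched in Python as follows. The function name `five_number_summary` is hypothetical, and other software may use a different quartile convention and report slightly different values for Q1 and Q3.

```python
from statistics import median


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]
    # With an odd number of values, the median itself is left out of both halves.
    upper = values[half + 1:] if n % 2 else values[half:]
    return (values[0], median(lower), median(values), median(upper), values[-1])


colleges = [3, 3, 4, 4, 4, 5, 5, 6, 9, 12]
summary = five_number_summary(colleges)
print(summary)
```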

(f) The interquartile range (or IQR) is defined as Q3 – Q1. Find the IQR. Where do you
see the IQR in the boxplot?
IQR = Q3 – Q1 = 6 – 4 = 2 colleges
The IQR is the length of the box.

5
(g) An observation is an outlier if it is less than Q1 - (1.5*IQR) or greater than Q3 +
(1.5*IQR). Are there any outliers? Show your work.
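Q1 - (1.5*IQR) = 4 – (1.5*2) = 1. The minimum, 3, is not less than 1, so there are no
low outliers.
Q3 + (1.5*IQR) = 6 + (1.5*2) = 9. Since 12 > 9, 12 is a high outlier. (The value 9 is
not greater than 9, so it is not an outlier.)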
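
The 1.5*IQR rule can be applied in a few lines of Python. This is a sketch that takes Q1 = 4 and Q3 = 6 from part (d) as given; the variable names are illustrative.

```python
colleges = [3, 3, 4, 4, 4, 5, 5, 6, 9, 12]

# Quartiles taken from part (d).
q1, q3 = 4, 6
iqr = q3 - q1

# Fences: values strictly beyond these are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in colleges if x < lower_fence or x > upper_fence]

print(f"Fences: {lower_fence} and {upper_fence}")
print(f"Outliers: {outliers}")
```

The comparison is strict, so a value that sits exactly on a fence (here, 9) is not flagged.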

(h) Use the five number summary to make a boxplot. Add labels, e.g., min, Q1.

If an outlier(s) is present, the outlier(s) is represented by a *.

(i) Compare the mean and median for the set of data.
Median (4.5) < mean (5.5)

(j) Would you use the mean or the median to summarize the center of this
data? Explain.
I would use the median because the data is skewed to the right with a high
outlier at 12. The median (and IQR) are resistant to the influence of outliers.
The mean (and standard deviation) are not resistant to outliers.
If the data is symmetric, use the mean and standard deviation.

(k) Calculate the standard deviation. Show your work.


Population variance = σ² = Σ(x_i - µ)² / n = 74.5 / 10 = 7.45 colleges²

Population standard deviation = σ = √7.45 ≈ 2.729 colleges

On your calculator, use the population standard deviation, σx, rather than the sample
standard deviation, Sx, since we are treating this group as a population.
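
To show the work step by step, here is a minimal Python sketch of the population variance and standard deviation, checked against `statistics.pstdev`. It assumes, as the worksheet does, that the ten seniors are treated as a population, so the divisor is n rather than n - 1.

```python
from math import sqrt
from statistics import pstdev

colleges = [3, 3, 4, 4, 4, 5, 5, 6, 9, 12]
n = len(colleges)

# Step 1: the mean.
mu = sum(colleges) / n

# Step 2: squared deviations from the mean, averaged over n (population form).
squared_deviations = [(x - mu) ** 2 for x in colleges]
variance = sum(squared_deviations) / n

# Step 3: the standard deviation is the square root of the variance.
std_dev = sqrt(variance)

print(f"Sum of squared deviations: {sum(squared_deviations)}")
print(f"Population variance: {variance:.2f} colleges squared")
print(f"Population standard deviation: {std_dev:.3f} colleges")
```

`statistics.stdev` would instead divide by n - 1 and give a larger value, which corresponds to Sx on the calculator.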

(l) How do you interpret the standard deviation in context?

For this group of 10 CHS seniors, the number of colleges to which they apply
typically varies by about 2.73 colleges from the mean of 5.5 colleges.

(m) Describe the distribution in context. Be sure to address center, outliers,
shape and spread.
Center: Since the distribution is skewed, I will use the median and the IQR
to describe the center and spread of the distribution, respectively. The
center of the distribution of colleges is at the median of 4.5 colleges.
Unusual features: There are no low outliers. There is a high outlier at 12
colleges.
Shape: The distribution is skewed right.

Spread: The IQR for the distribution is 2 colleges.
(n) Another CHS senior, Anika, joins the group. She applied to 4 colleges.
What effect will adding this observation have on the group’s mean and
standard deviation? Include in your explanation the direction of the
changes.
Since Anika’s number of colleges (4) is less than the mean (5.5), the mean of
the group will decrease. New mean = 59/11 ≈ 5.36 colleges.
Since Anika’s number of colleges is within one standard deviation of the mean
(|4 - 5.5| = 1.5 < 2.73), the standard deviation will decrease. New standard
deviation ≈ 2.64 colleges.
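
The effect of adding Anika's value can be verified directly. This sketch uses `statistics.pstdev`, on the assumption that the group is still treated as a population; the list names are illustrative.

```python
from statistics import mean, pstdev

original = [3, 3, 4, 4, 4, 5, 5, 6, 9, 12]
with_anika = original + [4]  # Anika applied to 4 colleges

old_mean, old_sd = mean(original), pstdev(original)
new_mean, new_sd = mean(with_anika), pstdev(with_anika)

print(f"Mean: {old_mean:.2f} -> {new_mean:.2f} colleges")
print(f"Standard deviation: {old_sd:.2f} -> {new_sd:.2f} colleges")
```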

(o) Instead, suppose that each senior in the original group of 10 applied to 3
more colleges. What would happen to the group mean and standard
deviation?
By properties of means, the mean of the group would increase by 3.
By properties of standard deviations, the standard deviation of the group
would stay the same.
Adding a constant a to the original values shifts the center of the distribution
by that amount, but doesn’t change the spread, i.e.,
μ_(X+a) = μ_X + a and σ_(X+a) = σ_X

(p) Instead, suppose each senior in the original group applied to double the
number of colleges. What would happen to the group mean and standard
deviation?
By properties of means, the mean of the group would double.
By properties of standard deviations, the standard deviation of the group
would double.
Multiplying all of the original values by a constant b multiplies the mean by
that constant and multiplies the spread by that constant, i.e.,
μ_(bX) = b*μ_X and σ_(bX) = |b|*σ_X
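
Both transformation rules, adding a constant and multiplying by a constant, can be checked numerically on the original data. A small sketch with illustrative list names, using the population standard deviation:

```python
from statistics import mean, pstdev

original = [3, 3, 4, 4, 4, 5, 5, 6, 9, 12]

# Part (o): every senior applies to 3 more colleges.
shifted = [x + 3 for x in original]

# Part (p): every senior applies to double the number of colleges.
doubled = [2 * x for x in original]

for label, data in [("original", original), ("plus 3", shifted), ("doubled", doubled)]:
    print(f"{label:>8}: mean = {mean(data):.2f}, sd = {pstdev(data):.3f}")
```

Adding 3 moves every value and the mean by the same amount, so the deviations from the mean are unchanged; doubling every value doubles each deviation, and therefore the standard deviation.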

Match the following histograms to their corresponding boxplots.

Histogram    Corresponding boxplot
A            2
B            3
C            4
D            1
