Lecture Notes 3

Chapter 3 discusses the display and description of quantitative data, focusing on graphical and tabular representations such as histograms and stem-and-leaf plots. It includes exercises on analyzing patient ages and soil pH levels, as well as understanding the shape of histograms and measures of central tendency like mean and median. The chapter emphasizes the importance of visualizing data to identify patterns and summarize key statistics.

Uploaded by

seid yimer

Chapter 3

Displaying and Describing Quantitative Data

After collecting a sample, statistical data is often first analyzed in a descriptive
manner. In particular, quantitative data is described in both graphical and tabular
form.

3.1 Displaying Quantitative Variables

Exercise 3.1 (Displaying Quantitative Variables)


1. Histogram: CEO Ages. Distribution table and histogram of age when first
becoming Chief Executive Officer (CEO) of established companies are given
below.

32, 37, 39, 40, 41, 41, 41, 42, 42, 43,
44, 45, 45, 45, 46, 47, 47, 49, 50, 51

bin      frequency   relative frequency
30–35    1           1/20 = 0.05
35–40    2           0.10
40–45    8           0.40
45–50    7           0.35
50–55    2           0.10

Import chapter3.CEO.ages text file into R: Environment panel, Import Dataset.

data <- chapter3.CEO.ages; attach(data); head(data)


data.freq <- as.data.frame(table(factor(cut(age, right=FALSE, breaks=c(30,35,40,45,50,55)))))
transform(data.freq, relative = prop.table(Freq))


[Figure: four panels, each titled "Histogram of age". Panels (a) and (b): Frequency versus age (years). Panels (c) and (d): Density versus age. Horizontal axes run from 30 to 55.]

Figure 3.1: Histogram for CEO Ages

par(mfrow=c(2,2)) # set for 4 graphs per panel


hist(age,right=FALSE,breaks=c(30,35,40,45,50,55),xlab="age (years)",col="green")
hist(age,right=FALSE,breaks=c(30,32.5,35,37.5,40,42.5,45,47.5,50,52.5,55),xlab="age (years)",col="green")
h <- hist(age,right=FALSE,breaks=c(30,35,40,45,50,55),plot=FALSE) # percentage histogram
h$density <- h$counts/sum(h$counts)*100
plot(h,freq=FALSE, col="green")
h <- hist(age,right=FALSE,breaks=c(30,32.5,35,37.5,40,42.5,45,47.5,50,52.5,55),plot=FALSE)
h$density <- h$counts/sum(h$counts)*100
plot(h,freq=FALSE, col="green")
par(mfrow=c(1,1)) # reset to 1 graph per panel

(a) Histogram Figure 3.1(a),


Number of bins is 3 / 4 / 5 / 10.
First bin is [30, 35) / [30, 32.5) / [40, 44).
Lower limit of first bin is 30 / 32.5 / 34 / 35.
Upper limit of first bin is almost 30 / 32.5 / 34 / 35.
Width of first bin is 35 − 30 = (circle one) 2.5 / 4 / 5 / 6 years.
Number of CEOs in [30,35) age bin is 1 / 2 / 3 / 4.
(b) Histogram Figure 3.1(b),
Number of bins is 3 / 4 / 5 / 10.
First bin is [30, 35) / [30, 32.5) / [40, 44).
Lower limit of first bin is 30 / 32.5 / 34 / 35.
Upper limit of first bin is almost 30 / 32.5 / 34 / 35.
Width of first bin is 32.5 − 30 = (circle one) 2.5 / 4 / 5 / 6 years.
Number of CEOs in [30,32.5) age bin is 1 / 2 / 3 / 4.

Increasing number of bins changes / does not change the shape of the
histogram.
(c) Histogram Figure 3.1(c),
Percentage of CEOs in [30,35) age bin: 5% / 10% / 35% / 40%.
Shape of density histogram in Figure 3.1(c) is same as / different from
frequency histogram in Figure 3.1(a).
(d) Histogram Figure 3.1(d),
Percentage of CEOs in [30,32.5) age bin: 5% / 10% / 35% / 40%.
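
The bin counts and percentages in this exercise can be checked without the data file. The following is a minimal sketch, assuming the 20 CEO ages are typed in by hand; the names h5 and h10 are illustrative and do not come from the notes.

```r
# CEO ages from the exercise, entered by hand instead of imported
age <- c(32, 37, 39, 40, 41, 41, 41, 42, 42, 43,
         44, 45, 45, 45, 46, 47, 47, 49, 50, 51)

# Left-closed bins [30,35), [35,40), ... of width 5, and bins of width 2.5
h5  <- hist(age, right = FALSE, breaks = seq(30, 55, by = 5),   plot = FALSE)
h10 <- hist(age, right = FALSE, breaks = seq(30, 55, by = 2.5), plot = FALSE)

# Frequency, relative frequency and percentage for the five wide bins
print(data.frame(lower     = head(h5$breaks, -1),
                 upper     = tail(h5$breaks, -1),
                 frequency = h5$counts,
                 relative  = h5$counts / sum(h5$counts),
                 percent   = 100 * h5$counts / sum(h5$counts)))

# Counts for the ten narrow bins
print(h10$counts)
```

Using plot = FALSE returns the counts without drawing, so the same object can later be passed to plot() as in the code above.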

2. Histogram: pH levels. Consider distribution table and histogram of 28 pH levels
of soil data below.

4.3 5 5.9 6.5 7.6 7.7 7.7 8.2 8.3 9.5
10.4 10.4 10.5 10.8 11.5 12 12 12.3 12.6 12.6
13 13.1 13.2 13.5 13.6 14.1 14.1 15.1

bin      frequency   relative frequency
4–6      3           3/28 ≈ 0.107
6–8      4           0.143
8–10     3           0.107
10–12    5           0.179
12–14
14–16

[Figure: "Histogram of pH". Vertical axis: Density, 0 to 35; horizontal axis: pH, 4 to 16.]

Figure 3.2: Histogram for pH Level Data

Import chapter3.pH.soil text file into R: Environment panel, Import Dataset.



data <- chapter3.pH.soil; attach(data); head(data)


data.freq <- as.data.frame(table(factor(cut(pH, right=FALSE, breaks=c(4,6,8,10,12,14,16)))))
transform(data.freq, relative = prop.table(Freq))
h <- hist(pH,right=FALSE,breaks=c(4,6,8,10,12,14,16),plot=FALSE) # percentage histogram
h$density <- h$counts/sum(h$counts)*100
plot(h,freq=FALSE, col="green")

(a) Fill in blanks in distribution table. Hint: 28 readings total.


(b) Number of bins is (circle one) 3 / 4 / 5 / 6.
(c) Width of each bin is (circle one) 2 / 3 / 4 / 5 pH.
(d) Most frequent pH bin is
[8, 10) / [10, 12) / [12, 14) / [14, 16).

3. Stem-and-leaf: CEO ages. Stem-and-leaf plot for CEO ages is given below.

32, 37, 39, 40, 41, 41, 41, 42, 42, 43,
44, 45, 45, 45, 46, 47, 47, 49, 50, 51

Import chapter3.CEO.ages text file into R: Environment panel, Import Dataset.

3 || 2 7 9∗
4 || 0 1 1 1∗∗ 2 2 3 4 5 5 5 6 7 7 9     stem: 10s
5 || 0 1                                 leaf: 1s

stem(age, scale=0.5)

(a) Starred number, 9∗ , represents age (circle one) 39 / 93 / 9.


Double–starred number, 1∗∗ , represents age (circle one) 41 / 14 / 1.
(b) Numbers left of double line (in first column) are called stems / leaves;
numbers to right are called (circle one) stems / leaves.
(c) Starred number 9∗ is a leaf of stem (circle one) 3 / 4 / 5.
(d) True / False Note to right of stem-and-leaf plot specifies numbers used
as stems are “tens” (or “10s”) and numbers used as leaves are “ones”
(or “1s”). So, for instance, stem “3” represents 3 × 10 = 30 and leaf “2”
represents 2 × 1 = 2.
(e) Stem–and–leaf plot is ordered where, in first stem, for example, 32 is
followed by (circle one) 37 / 39 / 40.
(f) Stem–and–leaf plot is useful in identifying “center” of data, or, where
“most” data values are located. In this case, this is 30s / 40s / 50s.

4. Split stem-and-leaf plots: CEO ages. Sometimes, to spread data out, stems
are split as, for example, in following table.

3 || 2
3 || 7 9
4 || 0 1 1 1 2 2 3 4
4 || 5 5 5 6 7 7 9
5 || 0 1                 stem: 10s
5 ||                     leaf: 1s

stem(age)

(a) True / False Low stem 3 contains one half of leaves, 0, 1, 2, 3 or 4; high
stem 3 contains other half of leaves, 5, 6, 7, 8 and 9.
(b) True / False Stem-and-leaf plots can have stems split not only twice,
but also three or more times. Splitting each stem three times might, say,
consist of a low stem which contains leaves 0, 1 and 2; a middle stem with
leaves 3, 4, 5 and 6 and a high stem with leaves 7, 8 and 9.
(c) A stem and leaf plot with 10s as stems can be split at most
(circle one) 5 / 7 / 10 / 100 times.
(d) True / False Although there is no one “best” way of constructing a
stem–and–leaf plot, most stem–and–leaf plots consist of 5 to 20 stems.
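
The stems and leaves in the two plots above can be built by hand with integer division and remainder, which makes the "stem: 10s, leaf: 1s" convention concrete. This is a sketch only: the names rows and rows.split are illustrative, and R's own stem() may choose a different layout.

```r
# CEO ages from the exercise, sorted so leaves appear in order
age <- sort(c(32, 37, 39, 40, 41, 41, 41, 42, 42, 43,
              44, 45, 45, 45, 46, 47, 47, 49, 50, 51))

stems  <- age %/% 10   # tens digit: 3, 4 or 5
leaves <- age %% 10    # ones digit: 0 to 9

# One row per stem
rows <- sapply(split(leaves, stems), paste, collapse = " ")
cat(paste(names(rows), "||", rows), sep = "\n")

# Split stems: "L" holds leaves 0-4, "H" holds leaves 5-9
key <- paste(stems, ifelse(leaves < 5, "L", "H"))
key <- factor(key, levels = unique(key))   # keep the sorted order of the data
rows.split <- sapply(split(leaves, key), paste, collapse = " ")
cat(paste(names(rows.split), "||", rows.split), sep = "\n")
```

Unlike the table above, this sketch prints only stems that contain at least one leaf, so the empty high stem 5 is omitted.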

3.2 Shape
The shape of a histogram is discussed, including symmetry, skewness, modes and outliers.

Exercise 3.2 (Shape)


1. Shapes of Histograms for Continuous Quantitative Data.
Describe the shape, symmetry, skewness, mode and outliers of the histograms
in Figure 3.3.
(a) Histogram Figure 3.3(a),
shape is symmetric / skewed left / skewed right / none
number of modes is one / two / more than two
has no / one / two / more than two outlier(s)
(b) Histogram Figure 3.3(b),
shape is symmetric / skewed left / skewed right / none
number of modes is one / two / more than two
has no / one / two / more than two outlier(s)

[Figure: four histograms, panels (a), (b), (c) and (d). Vertical axes: relative frequency, 0.00 to 0.40.]

Figure 3.3: Shapes of Histograms

(c) Histogram Figure 3.3(c),


shape is symmetric / skewed left / skewed right / none
number of modes is one / two / more than two
has no / one / two / more than two outlier(s)
(d) Histogram Figure 3.3(d),
shape is symmetric / skewed left / skewed right / none
number of modes is one / two / more than two
has no / one / two / more than two outlier(s)

3.3 Center
Two measures of central tendency are average (or, equivalently, mean) and median.
These measures are either statistics for samples or parameters for populations.

measure          statistic (sample, n members)                                          parameter (population, N members)
average (mean)   $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}$    $\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{\sum x_i}{N}$
median           M: middle, $\frac{n+1}{2}$, of ordered sample                          middle, $\frac{N+1}{2}$, of ordered population

Exercise 3.3 (Center)

1. Mean and Median: Bidding on Land. Consider small population of bidders for
N = 9 parcels of land:

0, 0, 0, 0, 1, 1, 2, 2, 3.

(a) Population average is
$\mu = \frac{0+0+0+0+1+1+2+2+3}{9} = \frac{9}{9} =$ (choose one) 0 / 1 / 1.5 / 2.
(b) Population median is middle of 9 ordered bid numbers: $\frac{N+1}{2} = \frac{9+1}{2} =$ 5th
observation. Since first bid number is $x_1 = 0$, second is $x_2 = 0$, third is
$x_3 = 0$, fourth is $x_4 = 0$, fifth bid number is 0 / 1 / 1.5 / 2.
(c) If sample n = 5 bid numbers {0, 1, 2, 2, 3},
sample average $\bar{x} = \frac{0+1+2+2+3}{5} = \frac{8}{5} =$ (choose one) 0 / 1 / 1.6 / 2,
sample median M is middle, 3rd, of 5: (choose one) 0 / 1 / 1.6 / 2.
(d) If sample n = 5 bid numbers {0, 0, 1, 2, 3},
sample average $\bar{x} = \frac{0+0+1+2+3}{5} = \frac{6}{5} =$ 0 / 1 / 1.2 / 1.4,
sample median M is $\frac{n+1}{2} = \frac{5+1}{2} =$ 3rd observation: 0 / 1 / 1.2 / 1.4.

2. Mean and Median: Goals Scored.
Consider small population of number of goals scored in N = 9 soccer games:

0, 1, 1, 2, 2, 2, 3, 3, 4.

goals <- c(0,1,1,2,2,2,3,3,4)

(a) Population average is $\mu = \frac{0+1+1+2+2+2+3+3+4}{9} = \frac{18}{9} =$ 0 / 1 / 1.5 / 2,
Population median is $\frac{N+1}{2} = \frac{9+1}{2} =$ 5th observation: 0 / 1 / 1.5 / 2,
mean(goals); median(goals)

(b) If sample n = 5 of {0, 1, 2, 3, 4} goals scored,
sample average $\bar{x} = \frac{0+1+2+3+4}{5} = \frac{10}{5} =$ 1 / 1.6 / 1.8 / 2,
sample median M is $\frac{n+1}{2} = \frac{5+1}{2} =$ 3rd observation: 1 / 1.6 / 1.8 / 2,
goals.s1 <- c(0,1,2,3,4); mean(goals.s1); median(goals.s1)

(c) If sample n = 6 of {0, 0, 1, 2, 3, 4} goals scored,
sample average $\bar{x} = \frac{0+0+1+2+3+4}{6} = \frac{10}{6} \approx$ 0 / 1.5 / 1.7 / 2,
sample median M is $\frac{n+1}{2} = \frac{6+1}{2} =$ 3.5th observation; in other words,
average of 3rd and 4th observations $\frac{1+2}{2} =$ 0 / 1.5 / 1.7 / 2,
goals.s2 <- c(0,0,1,2,3,4); mean(goals.s2); median(goals.s2)
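
The position rule (n+1)/2 used throughout this exercise can be written out directly. The helper below is a hypothetical illustration (center is not a built-in R function); it averages the two middle values whenever the position is not a whole number, which agrees with R's median().

```r
# Hypothetical helper: mean and median computed from the definitions
center <- function(x) {
  x <- sort(x)
  n <- length(x)
  pos <- (n + 1) / 2                            # position of the median
  med <- mean(x[c(floor(pos), ceiling(pos))])   # averages two middle values when n is even
  c(mean = sum(x) / n, position = pos, median = med)
}

print(center(c(0, 1, 2, 3, 4)))      # odd n: median is a single observation
print(center(c(0, 0, 1, 2, 3, 4)))   # even n: position 3.5, so average 3rd and 4th
```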

3.4 Spread of the Distribution


In addition to standard deviation and variance, two versions of the lower quartile,
upper quartile and interquartile range are discussed.

measure              statistic (sample, n members)                       parameter (population, N members)
standard deviation   $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}$     $\sigma$
variance             $s^2$                                               $\sigma^2$
range                R = maximum − minimum                               maximum − minimum

Exercise 3.4 (Spread of the Distribution)

1. Range, Standard Deviation and Variance: Goals Scored.
Consider small population of number of goals scored in N = 9 soccer games:

0, 1, 1, 2, 2, 2, 3, 3, 4.

(a) Measuring dispersion (spread) in entire set of goals scored.
range = Max − Min = 4 − 0 = (circle one) 1 / 2 / 4 goals scored
standard deviation (SD) $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}} \approx$ 1.05 / 1.15 / 1.22 goals
variance $\sigma^2 \approx 1.15^2 \approx$ 1.11 / 1.26 / 1.5 goals²

goals <- c(0,1,1,2,2,2,3,3,4); range(goals); sd(goals); var(goals)

(b) If sample n = 5 games with {0, 1, 2, 3, 4} goals scored,
R = Max − Min = 4 − 0 = (circle one) 1 / 2 / 4 goals scored
standard deviation (SD) $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}} \approx$ 1.05 / 1.41 / 1.58 goals scored
variance $s^2 \approx 1.58^2 \approx$ 1 / 1.5 / 2.5 goals scored²
goals.s1 <- c(0,1,2,3,4); range(goals.s1); sd(goals.s1); var(goals.s1)

(c) If sample n = 6 games with {0, 0, 1, 2, 3, 4} goals scored,
R = Max − Min = 4 − 0 = (circle one) 1 / 2 / 4 goals scored
standard deviation (SD) $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}} \approx$ 1.05 / 1.41 / 1.63 goals scored
variance $s^2 \approx 1.63^2 \approx$ 1.56 / 1.87 / 2.67 goals scored²

goals.s2 <- c(0,0,1,2,3,4); range(goals.s2); sd(goals.s2); var(goals.s2)
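
R's sd() and var() always divide by n − 1, so they return the sample statistics s and s² even when the data are treated as a whole population. A minimal sketch of the population versions, computed from the formula with divisor N (the names mu, sigma and sigma2 are illustrative):

```r
goals <- c(0, 1, 1, 2, 2, 2, 3, 3, 4)
N  <- length(goals)
mu <- mean(goals)

# Population variance and SD: divide the sum of squared deviations by N
sigma2 <- sum((goals - mu)^2) / N
sigma  <- sqrt(sigma2)

# Sample variance and SD: R's built-ins divide by N - 1
s2 <- var(goals)
s  <- sd(goals)

print(round(c(sigma = sigma, sigma2 = sigma2, s = s, s2 = s2), 2))
```

The two versions are related by sigma2 = s2 × (N − 1)/N, so the gap shrinks as N grows.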

2. Tukey quartiles and interquartile range: temperatures.


Consider small sample of n = 10 temperatures, set A:

0, 1, 1, 2, 2, 2, 3, 3, 5, 7.

temperature.A <- c(0,1,1,2,2,2,3,3,5,7)


quart.tukey <- fivenum(temperature.A) # Tukey five number summary
c(Q1.tukey=quart.tukey[2],Q3.tukey=quart.tukey[4],IQR.tukey=quart.tukey[4] - quart.tukey[2])

(a) Since
$0, 1, 1, 2, \underbrace{2, 2}_{M}, 3, 3, 5, 7,$
median temperature located at $\frac{10+1}{2} =$ 5.5th position and so M = 1 / 2 / 3
median(temperature.A)

(b) Since lower half of ordered data set
$0, 1, \underbrace{1}_{Q_1}, 2, 2,$
first (lower) quartile located at $\frac{5+1}{2} =$ 3rd position, Q1 = 1 / 2 / 3
(c) Since upper half of ordered data set
$2, 3, \underbrace{3}_{Q_3}, 5, 7,$
third (upper) quartile located at $\frac{5+1}{2} =$ 3rd position, Q3 = 3 / 5 / 7
(d) Interquartile range is IQR = Q3 − Q1 = 1 / 2 / 3

3. Tukey quartiles and interquartile range: more temperatures.


Another sample of n = 9 temperatures, set B

0, 0, 0, 0, 1, 1, 1, 2, 3
is compared to the first set of temperatures.
temperature.B <- c(0,0,0,0,1,1,1,2,3)
quart.tukey <- fivenum(temperature.B) # Tukey five-number summary
c(Q1.tukey=quart.tukey[2],Q3.tukey=quart.tukey[4],IQR.tukey=quart.tukey[4] - quart.tukey[2])

(a) Since
$0, 0, 0, 0, \underbrace{1}_{M}, 1, 1, 2, 3$
median temperature located at $\frac{9+1}{2} =$ 5th position and so M = 1 / 2 / 3
median(temperature.B)

(b) Since lower half of ordered data set
$0, 0, \underbrace{0}_{Q_1}, 0, 1$
first (lower) quartile located at $\frac{5+1}{2} =$ 3rd position, Q1 = 0 / 2 / 3

(c) Since upper half of ordered data set
$1, 1, \underbrace{1}_{Q_3}, 2, 3$
third (upper) quartile located at $\frac{5+1}{2} =$ 3rd position, Q3 = 2 / 5 / 7
(d) Interquartile range is IQR = Q3 − Q1 = 1 / 2 / 3

4. TI calculator quartiles and interquartile range: more temperatures.


Reconsider temperatures, set B

0, 0, 0, 0, 1, 1, 1, 2, 3

quart.TI <- function(x) {


x <- sort(x)
n <- length(x)
m <- (n+1)/2
if (floor(m) != m) {
l <- m-1/2; u <- m+1/2
} else {
l <- m-1; u <- m+1
}
c(Q1.TI=median(x[1:l]), Q3.TI=median(x[u:n]), IQR.TI=median(x[u:n]) - median(x[1:l]))
}
quart.TI(temperature.B)

(a) Since
$0, 0, 0, 0, \underbrace{1}_{M}, 1, 1, 2, 3$
median temperature located at $\frac{9+1}{2} =$ 5th position and so M = 1 / 2 / 3
temperature.B <- c(0,0,0,0,1,1,1,2,3)
median(temperature.B)

(b) Since lower half of ordered data set, excluding median:
$0, \underbrace{0, 0}_{Q_1}, 0$
first (lower) quartile located at $\frac{4+1}{2} =$ 2.5th position, Q1 = 0 / 2 / 3
(c) Since upper half of ordered data set, excluding median:
$1, \underbrace{1, 2}_{Q_3}, 3$
third (upper) quartile located at $\frac{4+1}{2} =$ 2.5th position, Q3 = 1.5 / 5 / 7
(d) Interquartile range is IQR = Q3 − Q1 = 1.5 / 2 / 3
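
The two quartile conventions differ only in whether the median is kept in both halves when n is odd. The sketch below puts them side by side; half.quartiles and its include.median argument are hypothetical names introduced for illustration, not part of R or of the notes.

```r
# Hypothetical helper: quartiles as medians of the lower and upper halves.
# include.median = TRUE  keeps the overall median in both halves when n is odd (Tukey);
# include.median = FALSE drops it from both halves (TI calculator convention).
# For even n the two conventions give the same halves.
half.quartiles <- function(x, include.median) {
  x <- sort(x)
  n <- length(x)
  half <- if (n %% 2 == 0) n / 2 else if (include.median) (n + 1) / 2 else (n - 1) / 2
  q1 <- median(x[1:half])
  q3 <- median(x[(n - half + 1):n])
  c(Q1 = q1, Q3 = q3, IQR = q3 - q1)
}

temperature.A <- c(0, 1, 1, 2, 2, 2, 3, 3, 5, 7)   # n = 10, even
temperature.B <- c(0, 0, 0, 0, 1, 1, 1, 2, 3)      # n = 9, odd

print(half.quartiles(temperature.A, include.median = TRUE))
print(half.quartiles(temperature.B, include.median = TRUE))    # Tukey
print(half.quartiles(temperature.B, include.median = FALSE))   # TI
```

With include.median = TRUE the quartiles agree with the hinges returned by fivenum() for these two data sets.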

3.5 Shape, Center and Spread–A Summary


This material is covered in the previous sections.

3.6 Standardizing Variables


We look at the measure of position, the z-score:
$z = \frac{x - \bar{x}}{s}$ (sample), $z = \frac{x - \mu}{\sigma}$ (population)

Exercise 3.6 (Standardizing Variables)


1. z-scores: IQ scores. IQ scores differ for different ages. Mean, SD for 16 year
olds are µ = 100 and σ = 16; mean, SD for 20 year olds are µ = 120 and σ = 20.

16 year olds (mean 100, SD 16)
X, nonstandard:    52    68    84   100   116   132   148
Z, standard:       -3    -2    -1     0     1     2     3

marked values: 84 and 132

20 year olds (mean 120, SD 20)
X, nonstandard:    60    80   100   120   140   160   180
Z, standard:       -3    -2    -1     0     1     2     3

Figure 3.4: Comparing IQ Scores with Z-Scores

(a) A 16 year old with IQ 132 has z-score $z = \frac{132 - 100}{16} =$ 0 / 1 / 2.
This IQ is two SDs above average.
(b) A 16 year old with IQ 84 has z-score $z = \frac{84 - 100}{16} =$ −2 / −1 / 0.
This IQ is one SD below average.
(c) A 20 year old with IQ 132 has z-score $z = \frac{132 - 120}{20} =$ 0 / 0.6 / 2.
This IQ is 0.6 of a SD above average.

(d) A 20 year old with IQ 84 has z-score $z = \frac{84 - 120}{20} =$ −2 / −1.8 / −1.
This IQ is 1.8 SDs below average.
(e) True / False. z-scores allow comparison of position of data points in
different data sets, data sets with different averages and SDs.
(f) If $z = \frac{x - \mu}{\sigma}$, then $x = z\sigma + \mu$, so a 16 year old with IQ three SDs above
average has IQ x = 3(16) + 100 = (choose one) 116 / 132 / 148.
(g) A 20 year old with IQ two SDs below average has IQ
x = −2(20) + 120 = (choose one) 60 / 80 / 100.
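
Standardizing and its inverse are one-line formulas, so the IQ comparisons above can be sketched as a pair of small functions. The names z.score and from.z are illustrative, not built-in R functions.

```r
# Hypothetical helpers for the z-score and its inverse
z.score <- function(x, center, spread) (x - center) / spread   # z = (x - mu) / sigma
from.z  <- function(z, center, spread) z * spread + center     # x = z * sigma + mu

# Same raw IQ of 132, different position in each age group
print(z.score(132, center = 100, spread = 16))   # 16 year olds
print(z.score(132, center = 120, spread = 20))   # 20 year olds

# Back from a z-score to a raw IQ
print(from.z(3,  center = 100, spread = 16))
print(from.z(-2, center = 120, spread = 20))
```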

2. Using z-scores to find outliers: Temperatures.


Consider small sample of n = 10 temperatures, set A:

0, 1, 1, 2, 2, 2, 3, 3, 5, 7.

(a) average temperature, x̄ = (choose one) 0 / 1.6 / 2.6 degrees.


SD in temperature, s ≈ (choose one) 1.15 / 1.23 / 2.07 degrees.
temperature.A <- c(0,1,1,2,2,2,3,3,5,7); mean(temperature.A); sd(temperature.A)

(b) Temperature 0° has z-score $z \approx \frac{0 - 2.6}{2.07} \approx$ −1.98 / −1.27 / −0.56.
This temperature is roughly 1.3 SDs below average.
(c) Temperature 7° has z-score $z \approx \frac{7 - 2.6}{2.07} \approx$ 1.68 / 1.97 / 2.13.
This temperature is roughly 2.1 SDs above average.
(d) Data values with z-scores less than z = −2 or greater than z = 2 are considered outliers.
So, 7° is / is not an outlier because it is more than 2 SDs above average.
(e) Temperature 1.5 SDs above average is
x ≈ 1.5(2.07) + 2.6 = (choose one) 3.335 / 3.745 / 5.705.
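
The |z| > 2 rule from part (d) can be applied to the whole data set at once. A minimal sketch, using the sample mean and sample SD as in part (a):

```r
temperature.A <- c(0, 1, 1, 2, 2, 2, 3, 3, 5, 7)

# z-score of every temperature, using the sample mean and sample SD
z <- (temperature.A - mean(temperature.A)) / sd(temperature.A)
print(round(z, 2))

# Temperatures flagged by the |z| > 2 rule
print(temperature.A[abs(z) > 2])
```

The cutoff of 2 is the convention stated in this exercise; other texts use different cutoffs.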

3.7 Five-Number Summary and Boxplots


We look at five-number summary

{min, P25 = Q1 , M = P50 = Q2 , P75 = Q3 , max},

and related boxplots.

Exercise 3.7 (Five-Number Summary and Boxplots)

1. Five Number Summary and Boxplot: Temperatures.


Consider small sample of n = 10 temperatures, set A:

0, 1, 1, 2, 2, 2, 3, 3, 5, 7.

(a) Five–number summary for temperatures, is (choose one)


(i) {0, 1, 1.5, 3, 4}
(ii) {0, 0, 1.5, 3, 6}
(iii) {0, 1, 2, 3, 7}
temperature.A <- c(0,1,1,2,2,2,3,3,5,7); fivenum(temperature.A)

(b) Consider boxplot for temperatures.

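[Figure: annotated horizontal boxplot of temperatures. Labels: min and smallest value above lower fence (0); lower quartile (1); median (2); upper quartile (3); largest value below upper fence (5); max (7), an outlier; IQR spans the box; whiskers run from the box to 0 and to 5; lower fence (1 − 1.5(2) = −2); upper fence (3 + 1.5(2) = 6).]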

Figure 3.5: Boxplot for temperatures

Boxplot indicates data symmetric / skewed right / skewed left.


boxplot(temperature.A, xlab="temperature",col="green",horizontal=TRUE)
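
The fences, whiskers and outliers labelled in Figure 3.5 can be computed directly from the five-number summary. A sketch of that arithmetic follows, with the result compared against boxplot.stats() from base R; the names fences, whiskers and outliers are illustrative.

```r
temperature.A <- c(0, 1, 1, 2, 2, 2, 3, 3, 5, 7)

five <- fivenum(temperature.A)   # min, Q1, median, Q3, max
iqr  <- five[4] - five[2]

# Fences sit 1.5 IQRs beyond the quartiles
fences <- c(lower = five[2] - 1.5 * iqr, upper = five[4] + 1.5 * iqr)

# Whiskers end at the most extreme values still inside the fences
inside   <- temperature.A >= fences["lower"] & temperature.A <= fences["upper"]
whiskers <- range(temperature.A[inside])
outliers <- temperature.A[!inside]

print(fences); print(whiskers); print(outliers)

# Same quantities as used by boxplot()
bs <- boxplot.stats(temperature.A)
print(bs$stats)   # lower whisker, Q1, median, Q3, upper whisker
print(bs$out)     # points plotted individually
```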

2. Five Number Summary and Boxplot: More Temperatures.


Another sample of n = 9 temperatures, set B

0, 0, 0, 0, 1, 1, 2, 2, 3

is compared to the first set of temperatures.


(a) Five–number summary for this set of temperatures, is (choose one)
(i) {0, 1, 1.5, 3, 4}
(ii) {0, 0, 1, 2, 3}
(iii) {0, 1, 2, 3, 7}
temperature.B <- c(0,0,0,0,1,1,2,2,3); fivenum(temperature.B)

(b) Consider side-by-side boxplot for temperatures.
Set A has warmer / same / colder median temperature than set B.
Set A has smaller / same / larger IQR in temperature than set B.
response <- c(temperature.A,temperature.B)
temperature <- rep(c("A","B"), times=c(10,9)) # group label: 10 readings in set A, 9 in set B
data <- cbind.data.frame(temperature,response); attach(data); head(data)
boxplot(response~temperature, xlab="temperature",col="green",horizontal=TRUE)

[Figure: side-by-side horizontal boxplots for sets A and B. Horizontal axis: temperature, 0 to 7.]

Figure 3.6: Side-by-side boxplot for two sets of temperatures

3.8 Comparing Groups


This material is covered in the previous sections.

3.9 Identifying Outliers


This material is covered in the previous sections.

3.10 Time Series

Exercise 3.9 (Time Series)


month 1 2 3 4 5 6 7 8 9 10 11 12 13
gold 80 70 90 75 95 100 105 107 109 112 111 115 120
technology 70 60 58 55 58 60 70 60 58 55 58 60 70

Two stock prices (in dollars) are measured monthly over a 13 month period.
Import chapter3.gold.tech.timeseries text file into R: Environment panel, Import Dataset.

data <- chapter3.gold.tech.timeseries; attach(data); head(data)


data.freq <- as.data.frame(table(factor(cut(gold, right=FALSE, breaks=c(70,80,90,100,110,120))))); data.freq
hist(gold,right=FALSE,breaks=c(70,80,90,100,110,120),xlab="Price ($)",col="red")
data.freq <- as.data.frame(table(factor(cut(technology, right=FALSE, breaks=c(55,60,65,70))))); data.freq
hist(technology,right=FALSE,breaks=c(55,60,65,70),xlab="Price ($)",col="blue")

gold.ts <- as.ts(data$gold)


technology.ts <- as.ts(data$technology)
ts.plot(gold.ts, technology.ts, gpars=list(xlab="Month", ylab="Price ($)", lty=c(1:2), col=c("red","blue")))

[Figure: "Histogram of gold" (Frequency versus Price ($), 70 to 120); "Histogram of technology" (Frequency versus Price ($), 55 to 70); line graph of both stocks (Price ($) versus Month).]

Figure 3.7: Line graph of two stocks

1. Histograms.
Histogram of gold prices
least frequent: 70 − 80 / 80 − 90 / 90 − 100
unimodal / bimodal
symmetric / asymmetric
Histogram of technology prices
least frequent: 55 − 65 / 65 − 70 / 70 − 75
unimodal / bimodal
symmetric / asymmetric

2. Time Series.
Overall trend in gold stock is
increasing / stationary / decreasing;
there is / is not a seasonal variation.
Overall trend in technology stock is
increasing / stationary / decreasing;
there is / is not a seasonal variation.

3. Histograms are more informative for stationary (no strong trend or variability)
rather than non-stationary time series.
True / False
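
Trend and repeating pattern can also be checked numerically from the table of prices. The sketch below is one rough approach, not the method of the notes: it assumes a least-squares slope as a summary of overall trend, and it compares the first and second halves of the technology series to look for a repeating cycle.

```r
month      <- 1:13
gold       <- c(80, 70, 90, 75, 95, 100, 105, 107, 109, 112, 111, 115, 120)
technology <- c(70, 60, 58, 55, 58, 60, 70, 60, 58, 55, 58, 60, 70)

# Slope of a fitted straight line: dollars gained or lost per month on average
slope.gold <- coef(lm(gold ~ month))[["month"]]
slope.tech <- coef(lm(technology ~ month))[["month"]]
print(c(gold = slope.gold, technology = slope.tech))

# Does the technology price in months 7 to 13 repeat months 1 to 7?
repeats <- all(technology[1:7] == technology[7:13])
print(repeats)
```

A slope near zero together with a repeating pattern suggests a series with no overall trend but a regular cycle.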

3.11 Transforming Skewed Data


Transformations of data x, in particular the natural log transformation, ln(x), can make
skewed histograms more symmetric.

Exercise 3.10 (Transforming Skewed Data) Histogram of hours watching TV as
well as various transformations of hours watching TV are given below.

[Figure: four frequency histograms. "Histogram of hours" (Hours Watching TV, 0 to 20); "Histogram of hours.ln" (ln(Hours)); "Histogram of hours.sqrt" (sqrt(Hours)); "Histogram of hours.inverse" (1/Hours).]

Figure 3.8: Hours and Transformations of Hours Watching TV

Import chapter3.tv.hours text file into R: Environment panel, Import Dataset.


data <- chapter3.tv.hours; attach(data); head(data)
par(mfrow=c(2,2))
hist(hours,right=FALSE,breaks=0:20,xlab="Hours Watching TV",col="gold")
hours.ln <- logb(hours, base=exp(1))
hours.sqrt <- sqrt(hours)
hours.inverse <- 1/hours
hist(hours.ln,right=FALSE,xlab="ln(Hours)",col="gold")
hist(hours.sqrt,right=FALSE,xlab="sqrt(Hours)",col="gold")
hist(hours.inverse,right=FALSE,xlab="1/Hours",col="gold")
par(mfrow=c(1,1))

The transformation of hours watching TV which made the original histogram most
symmetrical is ln(hours) / sqrt(hours) / 1/hours.
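
Symmetry can also be compared numerically rather than by eye. The sketch below uses simulated right-skewed data, not the course's TV-hours file, and a moment-based skewness coefficient; hours.sim and skewness are hypothetical names, and the results for the real data may differ.

```r
set.seed(1)
# Simulated right-skewed positive values standing in for hours (an assumption)
hours.sim <- rlnorm(500, meanlog = 1, sdlog = 0.6)

# Moment-based skewness: near 0 for symmetric data, positive for right skew
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skews <- sapply(list(raw     = hours.sim,
                     ln      = log(hours.sim),
                     sqrt    = sqrt(hours.sim),
                     inverse = 1 / hours.sim),
                skewness)
print(round(skews, 2))
```

The transformation whose skewness is closest to zero gives the most symmetric histogram for that data set.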
