CH 04 Descriptive Data Mining

Uploaded by

唐嘉玥

CH 04 - Descriptive Data Mining

Multiple Choice

1. The goal of __________ is to use the variable values to identify relationships between observations.
a. unsupervised learning
b. data mining
c. McQuitty’s method
d. Ward's method
ANSWER: a
RATIONALE: In an unsupervised learning application, there is no outcome variable to predict; rather, the
goal is to use the variable values to identify relationships between observations.
POINTS: 1
DIFFICULTY: Moderate
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Remember
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/28/2021 5:10 PM

2. In preparing categorical variables for analysis, it is usually best to


a. convert the categories to numeric representations.
b. convert the categories to binary, dummy variables.
c. combine as many categories as possible.
d. let them remain categorical.
ANSWER: b
RATIONALE: Typically, it is best to encode categorical variables with 0-1 dummy variables. Using 0-1
dummy variables to encode categorical variables with many different categories results in a
large number of variables. In some cases, the number of categories may be reduced by
combining categories.
POINTS: 1
DIFFICULTY: Moderate
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Understand
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/19/2021 11:16 AM

3. Observation refers to the


a. estimated continuous outcome variable.
b. set of recorded values of variables associated with a single entity.

Copyright Cengage Learning. Powered by Cognero. Page 1


c. goal of predicting a categorical outcome based on a set of variables.


d. mean of all variable values associated with one particular entity.
ANSWER: b
RATIONALE: An observation is defined as the set of recorded values of variables associated with a single
entity. It is often displayed as a row of values in a spreadsheet or database in which the
columns correspond to the variables.
POINTS: 1
DIFFICULTY: Moderate
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Remember
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/19/2021 11:16 AM

4. __________ approaches are designed to describe patterns and relationships in large data sets with many observations of
many variables.
a. Data mining b. Unsupervised learning
c. Dimension reduction d. Data sampling
ANSWER: b
RATIONALE: Unsupervised learning approaches are designed to describe patterns and relationships in large
data sets with many observations of many variables.
POINTS: 1
DIFFICULTY: Easy
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Remember
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/19/2021 11:16 AM

5. Suppose we had a data set from a call center where customers were asked to choose between the following three
options: hear account information, billing questions, and customer service. Using the given order of the three options, and
using 0-1 dummy variables to encode the categorical variables, which of the following combinations would yield an entry
“customer service”?
a. 000 b. 100
c. 010 d. 001
ANSWER: d
RATIONALE: An entry of “customer service” would be captured using a dummy variable of 0 for “hear
account information”, a dummy variable of 0 for “billing questions”, and a dummy variable of
1 for “customer service.” Therefore, the correct combination is 001.
POINTS: 1
DIFFICULTY: Moderate
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Understand
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/28/2021 5:11 PM
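
Illustration (not part of the original test bank): a minimal Python sketch of the 0-1 dummy encoding described in question 5. The function name `dummy_encode` and the list `OPTIONS` are hypothetical; the category order is assumed to be the order given in the question.

```python
# Category order assumed from the question text.
OPTIONS = ["hear account information", "billing questions", "customer service"]

def dummy_encode(choice, options=OPTIONS):
    """Return a 0-1 dummy-variable string with one position per category."""
    if choice not in options:
        raise ValueError(f"unknown category: {choice!r}")
    # A position is 1 only for the category that was chosen.
    return "".join("1" if option == choice else "0" for option in options)

print(dummy_encode("customer service"))  # 001
```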

6. The data preparation technique used in market segmentation to divide consumers into different homogeneous groups is
called
a. data visualization. b. cluster analysis.
c. market analysis. d. supervised learning.
ANSWER: b
RATIONALE: Clustering can be employed during the data preparation step to identify variables or
observations that can be aggregated or removed from consideration. Cluster analysis is
commonly used in marketing to divide consumers into different homogeneous groups, a
process known as market segmentation.
POINTS: 1
DIFFICULTY: Easy
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Remember
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/19/2021 11:16 AM

7. k-means clustering is the process of


a. agglomerating observations into a series of nested groups based on a measure of
similarity.
b. organizing observations into distinct groups based on a measure of similarity.
c. reducing the number of variables to consider in data-mining.
d. estimating the value of a continuous outcome variable.
ANSWER: b
RATIONALE: k-means clustering is the process of organizing observations into one of k groups based on a
measure of similarity.
POINTS: 1
DIFFICULTY: Easy
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False

NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Remember
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/19/2021 11:16 AM

8. Euclidean distance can be used to calculate the dissimilarity between two observations. Let u = (25, $350) correspond
to a 25-year-old customer that spent $350 at Store A in the previous fiscal year. Let v = (53, $420) correspond to a
53-year-old customer that spent $420 at Store A in the previous fiscal year. Calculate the dissimilarity between these two
observations using Euclidean distance.
a. 66.21 b. 72.28
c. 75.39 d. 88.57
ANSWER: c
RATIONALE: The Euclidean distance between these two observations is calculated using the
formula d(u, v) = sqrt((25 - 53)^2 + (350 - 420)^2) = sqrt(784 + 4,900) = sqrt(5,684) ≈ 75.39.

POINTS: 1
DIFFICULTY: Moderate
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom’s: Apply
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/28/2021 5:12 PM
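
Illustration (not part of the original test bank): a short Python check of the Euclidean distance in question 8, using u = (25, 350) and v = (53, 420), the values that reproduce the keyed answer. The function name `euclidean_distance` is an assumption made for this sketch.

```python
import math

def euclidean_distance(u, v):
    """Square root of the summed squared differences across variables."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

u = (25, 350)  # age, amount spent
v = (53, 420)
print(round(euclidean_distance(u, v), 2))  # 75.39
```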

9. Which of the following is true of Euclidean distance?


a. It is used to measure dissimilarity between categorical variable observations.
b. It is not affected by the scale on which variables are measured.
c. It increases with the increase in similarity between variable values.
d. It is commonly used as a method of measuring dissimilarity between quantitative
observations.
ANSWER: d
RATIONALE: When observations include numerical variables, Euclidean distance is the most common
method to measure dissimilarity between observations.
POINTS: 1
DIFFICULTY: Moderate
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Understand
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/19/2021 11:16 AM

10. Jaccard’s coefficient is different from the matching coefficient in that the former
a. measures overlap while the latter measures dissimilarity.
b. does not count matching zero entries while the latter does.
c. deals with categorical variables while the latter deals with continuous
variables.
d. is affected by the scale used to measure variables while the latter is not.
ANSWER: b
RATIONALE: Jaccard’s coefficient refers to a measure of similarity between observations consisting solely
of binary categorical variables that consider only matches of nonzero entries.
POINTS: 1
DIFFICULTY: Moderate
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Understand
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/19/2021 11:16 AM

11. Single linkage is a measure of calculating dissimilarity between clusters by


a. considering only the two most dissimilar observations in the two clusters.
b. computing the average dissimilarity between every pair of observations between the two
clusters.
c. considering only the two most similar observations in the two clusters.
d. considering the distance between the cluster centroids.
ANSWER: c
RATIONALE: Single linkage is a measure of calculating dissimilarity between clusters by considering only
the two most similar observations in the two clusters.
POINTS: 1
DIFFICULTY: Moderate
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Remember
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/19/2021 11:16 AM

12. __________ is a measure of calculating dissimilarity between clusters by considering only the two most dissimilar
observations in the two clusters.
a. Single linkage b. Complete linkage

c. Average linkage
d. Average group linkage
ANSWER: b
RATIONALE: Complete linkage is a measure of calculating dissimilarity between clusters by considering
only the two most dissimilar observations in the two clusters.
POINTS: 1
DIFFICULTY: Easy
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Remember
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/19/2021 11:16 AM

13. If the Euclidean distance were to be represented in a right triangle, which of the following would be considered the
distance between two observations of a cluster?
a. The short leg
b. The long leg
c. The hypotenuse
d. Euclidean distance is not related to right
triangles.
ANSWER: c
RATIONALE: The distance between two observations in a cluster would be represented by the hypotenuse of
a right triangle.
POINTS: 1
DIFFICULTY: Moderate
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Remember
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/28/2021 5:14 PM

14. When clustering only by dummy variables that represent categorical variables, the simplest measure of similarity
between two observations is called the
a. matching coefficient.
b. Jaccard's coefficient.
c. Euclidean distance.
d. antecedent.
ANSWER: a
RATIONALE: When clustering observations solely on the basis of categorical variables encoded as 0-1 (or
dummy variables), a better measure of similarity between two observations can be achieved
by counting the number of variables with matching values. The simplest overlap measure is
called the matching coefficient. To avoid misstating similarity due to the absence of a feature,
a similarity measure called Jaccard’s coefficient does not count matching zero entries.
POINTS: 1
DIFFICULTY: Easy
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Remember
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/28/2021 5:14 PM
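
Illustration (not part of the original test bank): a minimal Python sketch contrasting the matching coefficient with Jaccard's coefficient for two observations made up of 0-1 dummy variables. The function names and the sample vectors are hypothetical, and returning 0.0 when every entry is a matching zero is a convention chosen for this sketch.

```python
def matching_coefficient(x, y):
    """Share of variables on which the two observations have the same value."""
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)

def jaccard_coefficient(x, y):
    """Like the matching coefficient, but matching zero entries are not counted."""
    both_one = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    both_zero = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    considered = len(x) - both_zero
    return both_one / considered if considered else 0.0

x = [1, 0, 0, 1, 0]
y = [1, 0, 0, 0, 0]
print(matching_coefficient(x, y))  # 0.8
print(jaccard_coefficient(x, y))   # 0.5
```

The three shared zeros raise the matching coefficient to 0.8, while Jaccard's coefficient ignores them and reports 0.5.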

15. A method for modifying variables that reduces bias prior to cluster analysis is
a. standardization.
b. weighting.
c. removing outliers.
d. randomizing.
ANSWER: a
RATIONALE: The conversion to z-scores makes it easier to identify outlier measurements, which can distort
the Euclidean distance between observations. Standardizing (or normalizing) observations
removes bias due to the difference in measurement units, and variable weighting allows the
analyst to introduce appropriate bias based on the business context.
POINTS: 1
DIFFICULTY: Moderate
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Remember
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/28/2021 5:15 PM
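
Illustration (not part of the original test bank): a minimal Python sketch of standardizing variables to z-scores before clustering. The function name `z_scores` and the sample data are hypothetical, and using the sample standard deviation is an assumption; some texts divide by the population standard deviation instead.

```python
import statistics

def z_scores(values):
    """Convert raw values to z-scores: (value - mean) / standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumption)
    return [(v - mean) / sd for v in values]

# Hypothetical customers: age in years and amount spent in dollars.
ages = [25, 53, 39]
spending = [350, 420, 385]

# After standardization both variables are unit-free and on a comparable scale.
print(z_scores(ages))
print(z_scores(spending))
```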

16. Euclidean distance can be used to measure the distance between __________ in cluster analysis.
a. objects
b. clusters
c. observations
d. ward
ANSWER: c
RATIONALE: Euclidean distance is a geometric measure of dissimilarity between observations based on
the Pythagorean theorem.
POINTS: 1
DIFFICULTY: Moderate
REFERENCES: CLUSTER ANALYSIS


QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Remember
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/19/2021 11:16 AM

17. Average linkage is a measure of calculating dissimilarity between two clusters by


a. finding the distance between the two most dissimilar observations in the two clusters.
b. computing the average distance between every pair of observations between two clusters.
c. finding the distance between the two closest observations in the two clusters.
d. computing the distance between the cluster centroids.
ANSWER: b
RATIONALE: Average linkage measures dissimilarity between clusters by computing the average distance
between every pair of observations between two clusters.
POINTS: 1
DIFFICULTY: Easy
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Remember
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/28/2021 5:16 PM

18. __________ is a method of calculating dissimilarity between clusters by calculating the distance between the
centroids of the two clusters.
a. Single linkage b. Complete linkage
c. Average linkage d. Centroid linkage
ANSWER: d
RATIONALE: Centroid linkage uses the averaging concept of cluster centroids to define between-cluster
similarity. The similarity between two clusters is defined as the similarity of the centroids of
the two clusters.
POINTS: 1
DIFFICULTY: Easy
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Remember
DATE CREATED: 1/23/2021 10:16 AM


DATE MODIFIED: 3/28/2021 5:16 PM
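
Illustration (not part of the original test bank): a minimal Python sketch comparing single, complete, average, and centroid linkage on two small clusters. All function names and the points in clusters `A` and `B` are hypothetical, and Euclidean distance is assumed as the dissimilarity between observations.

```python
import math
from itertools import product

def distance(p, q):
    """Euclidean distance between two observations."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(a, b):
    """Distance between the two closest observations, one from each cluster."""
    return min(distance(p, q) for p, q in product(a, b))

def complete_linkage(a, b):
    """Distance between the two most distant observations, one from each cluster."""
    return max(distance(p, q) for p, q in product(a, b))

def average_linkage(a, b):
    """Average distance over every pair of observations between the clusters."""
    return sum(distance(p, q) for p, q in product(a, b)) / (len(a) * len(b))

def centroid(cluster):
    """Variable-by-variable mean of the observations in a cluster."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

def centroid_linkage(a, b):
    """Distance between the two cluster centroids."""
    return distance(centroid(a), centroid(b))

A = [(0, 0), (0, 2)]
B = [(3, 0), (3, 4)]
print(single_linkage(A, B))    # 3.0
print(complete_linkage(A, B))  # 5.0
print(round(average_linkage(A, B), 4))
print(round(centroid_linkage(A, B), 4))
```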

19. __________ can be used to partition observations in a manner to obtain clusters with the least amount of information
loss due to the aggregation.
a. Single linkage b. Ward’s method
c. Average group linkage d. Dendrogram
ANSWER: b
RATIONALE: Ward’s method is a procedure that can be used to partition observations in a manner to obtain
clusters with the least amount of information loss due to the aggregation.
POINTS: 1
DIFFICULTY: Easy
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Remember
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/19/2021 11:16 AM

20. Suppose the dissimilarity between clusters A and C has the value 24 and the dissimilarity between clusters B and C has
the value 12. Use McQuitty’s method to determine the dissimilarity between cluster AB and cluster C.
a. 12 b. 18
c. 24 d. 36
ANSWER: b
RATIONALE: Using McQuitty’s method, the dissimilarity between cluster AB and cluster C is calculated as the
average of the dissimilarity between A and C and the dissimilarity between B and C. The
calculated value is (24 + 12) / 2 = 18.
POINTS: 1
DIFFICULTY: Moderate
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom’s: Apply
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/19/2021 11:16 AM
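
Illustration (not part of the original test bank): the McQuitty calculation in question 20 as a short Python sketch. It reads the two given values as the dissimilarities of A and of B from a third cluster C, which is the reading the rationale's arithmetic implies; the function name `mcquitty_dissimilarity` is hypothetical.

```python
def mcquitty_dissimilarity(d_ac, d_bc):
    """Dissimilarity between merged cluster AB and cluster C under McQuitty's method."""
    return (d_ac + d_bc) / 2

print(mcquitty_dissimilarity(24, 12))  # 18.0
```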

21. A tree diagram used to illustrate the sequence of nested clusters produced by hierarchical clustering is known as a
a. dendrogram. b. scatter chart.
c. decile-wise lift chart. d. cumulative lift tree.
ANSWER: a
RATIONALE: A tree diagram used to illustrate the sequence of nested clusters produced by hierarchical
clustering is known as a dendrogram.
POINTS: 1
DIFFICULTY: Easy
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Remember
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/19/2021 11:16 AM

22. The endpoint of a k-means clustering algorithm occurs when


a. Euclidean distance between clusters is minimized.
b. Euclidean distance between observations in a cluster is maximized.
c. no further changes are observed in cluster structure and number.
d. all of the observations are encompassed within a single large cluster with mean
k.
ANSWER: c
RATIONALE: The k-means algorithm repeats the process (calculate cluster centroid, assign observation to
cluster with nearest centroid) until there is no change in the clusters or a specified maximum
number of iterations is reached.
POINTS: 1
DIFFICULTY: Challenging
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Understand
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/19/2021 11:16 AM
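
Illustration (not part of the original test bank): a minimal Python sketch of the k-means loop described in the rationale for question 22, stopping when the cluster assignments no longer change or a maximum number of iterations is reached. The function names, the sample points, and the starting centroids are hypothetical; keeping the old centroid for an empty cluster is a simplification chosen for this sketch.

```python
import math

def distance(p, q):
    """Euclidean distance between two observations."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_means(points, initial_centroids, max_iterations=100):
    """Return (centroids, assignments) from a basic k-means loop."""
    centroids = list(initial_centroids)
    assignments = None
    for _ in range(max_iterations):
        # Assign each observation to the cluster with the nearest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda j: distance(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no change in the clusters, so the algorithm stops
        assignments = new_assignments
        # Recalculate each centroid as the mean of its assigned observations.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments

points = [(1, 1), (1, 2), (8, 8), (9, 8)]
centroids, assignments = k_means(points, [(0, 0), (10, 10)])
print(centroids)    # [(1.0, 1.5), (8.5, 8.0)]
print(assignments)  # [0, 0, 1, 1]
```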

23. A cluster’s __________ can be measured by the difference between the distance value at which a cluster is originally
formed and the distance value at which it is merged with another cluster in a dendrogram.
a. dimension b. affordability
c. durability d. span
ANSWER: c
RATIONALE: A cluster’s durability (or strength) can be measured by the difference between the distance
value at which a cluster is originally formed and the distance value at which it is merged with
another cluster in a dendrogram.
POINTS: 1
DIFFICULTY: Moderate
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False

NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Remember
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/19/2021 11:16 AM

24. Complete linkage can be used to measure the distance between _________ in cluster analysis.
a. objects
b. clusters
c. observations
d. wards
ANSWER: b
RATIONALE: Complete linkage is a measure of calculating dissimilarity between clusters by considering
only the two most dissimilar observations in the two clusters.
POINTS: 1
DIFFICULTY: Moderate
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Remember
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/19/2021 11:16 AM

25. Complete linkage can be used to measure the distance between clusters that are the __________ in cluster analysis.
a. most similar
b. most different
c. farthest apart
d. closest
ANSWER: b
RATIONALE: Complete linkage is a measure of calculating dissimilarity between clusters by considering
only the two most dissimilar observations in the two clusters.
POINTS: 1
DIFFICULTY: Moderate
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Remember
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/19/2021 11:16 AM

26. Single linkage can be used to measure the distance between clusters that are the __________ in cluster analysis.
a. most similar
b. most different
c. farthest apart
d. closest
ANSWER: a
RATIONALE: Single linkage is a measure of calculating dissimilarity between clusters by considering only
the two most similar observations in the two clusters.
POINTS: 1
DIFFICULTY: Moderate
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Remember
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/19/2021 11:16 AM

27. __________ is a measure that computes the dissimilarity between a cluster AB and a cluster C by averaging the
distance between A and C and the distance between B and C.
a. Ward's method
b. Jaccard's coefficient
c. McQuitty's method
d. None of these are
correct.
ANSWER: c
RATIONALE: McQuitty’s method is a measure that computes the dissimilarity between a cluster AB and a
cluster C by averaging the distance between A and C and the distance between B and C.
POINTS: 1
DIFFICULTY: Moderate
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Remember
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/28/2021 5:18 PM

28. Hierarchical clustering using __________ results in a sequence of aggregated clusters that minimizes the loss of
information between the individual observation level and the cluster level.
a. McQuitty’s method
b. centroid linkage
c. median linkage
d. Ward’s method
ANSWER: d
RATIONALE: Hierarchical clustering using Ward’s method results in a sequence of aggregated clusters that
minimizes the loss of information between the individual observation level and the cluster
level.
POINTS: 1
DIFFICULTY: Moderate
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Remember
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/28/2021 5:25 PM

29. In k-means clustering, k represents the


a. number of variables.
b. number of clusters.
c. number of observations in a
cluster.
d. mean of the cluster.
ANSWER: b
RATIONALE: In k-means clustering, k represents the number of clusters.
POINTS: 1
DIFFICULTY: Moderate
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Remember
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/19/2021 11:16 AM

30. The strength of a cluster can be measured by comparing the average distance in a cluster to the distance between
cluster centroids. One rule of thumb is that the ratio for between-cluster distance to within-cluster distance should exceed
what value for useful clusters?
a. 0.5
b. 1
c. 1.5
d. 2
ANSWER: b
RATIONALE: The ratio of between-cluster distance to within-cluster distance should exceed 1.0 for useful
clusters.
POINTS: 1
DIFFICULTY: Easy
REFERENCES: CLUSTER ANALYSIS


QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Remember
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/19/2021 11:16 AM

31. In which of the following scenarios would it be appropriate to use hierarchical clustering?
a. When the number of observations in the dataset is relatively high
b. When it is not necessary to know the nesting of clusters
c. When the number of clusters is known beforehand
d. When binary or ordinal data needs to be clustered
ANSWER: d
RATIONALE: If one has a small data set and wants to easily examine solutions with increasing numbers of
clusters, one may want to use hierarchical clustering. Hierarchical clusters are also convenient
if one wants to observe how clusters are nested. k-means clustering partitions the
observations, which is appropriate if trying to summarize the data with k “average”
observations that describe the data with the minimum amount of error. Because Euclidean
distance is the standard metric for k-means clustering, it is generally not as appropriate for
binary or ordinal data for which an “average” is not meaningful.
POINTS: 1
DIFFICULTY: Challenging
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Understand
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/28/2021 5:27 PM

32. An analysis of items frequently co-occurring in transactions is known as


a. market segmentation. b. market basket analysis.
c. regression analysis. d. cluster analysis.
ANSWER: b
RATIONALE: An analysis of items frequently co-occurring in transactions is known as market basket
analysis.
POINTS: 1
DIFFICULTY: Easy
REFERENCES: ASSOCIATION RULES
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics


KEYWORDS: Bloom's: Remember
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/19/2021 11:16 AM

33. A __________ refers to the number of times a collection of items occurs together in a transaction data set.
a. consequent b. validation count
c. support count d. antecedent
ANSWER: c
RATIONALE: The number of times that a collection of items occurs together in a transaction data set is
known as the support count.
POINTS: 1
DIFFICULTY: Easy
REFERENCES: ASSOCIATION RULES
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Remember
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/28/2021 5:27 PM
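
Illustration (not part of the original test bank): a minimal Python sketch of a support count, the number of transactions that contain every item in an item set. The function name `support_count` and the sample transactions are hypothetical.

```python
def support_count(transactions, item_set):
    """Count the transactions that contain every item in item_set."""
    items = set(item_set)
    return sum(1 for transaction in transactions if items <= set(transaction))

# Hypothetical transaction data.
transactions = [
    {"bread", "milk"},
    {"bread", "jelly", "peanut butter"},
    {"bread", "milk", "jelly"},
    {"milk"},
]
print(support_count(transactions, {"bread", "milk"}))  # 2
```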

34. To identify patterns across transactions, we can use


a. association rules.
b. complete linkage.
c. centroid linkage.
d. k-means.
ANSWER: a
RATIONALE: Association rules are if-then statements describing the relationship between item sets.
POINTS: 1
DIFFICULTY: Moderate
REFERENCES: ASSOCIATION RULES
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Remember
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/19/2021 11:16 AM

35. The lift ratio of an association rule with a confidence value of 0.45 and in which the consequent occurs in 6 out of 10
cases is
a. 1.40. b. 0.54.
c. 1.00. d. 0.75.
ANSWER: d
RATIONALE: The lift ratio is given by confidence / (support of consequent / total number of
transactions) = 0.45 / (6 / 10) = 0.75.
POINTS: 1
DIFFICULTY: Moderate
REFERENCES: ASSOCIATION RULES
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom’s: Apply
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/19/2021 11:16 AM
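
Illustration (not part of the original test bank): the lift ratio from question 35 as a short Python sketch. The function and parameter names are hypothetical; the formula is the one stated in the rationale.

```python
def lift_ratio(confidence, consequent_support_count, total_transactions):
    """Lift = confidence / (support of consequent / total number of transactions)."""
    expected_confidence = consequent_support_count / total_transactions
    return confidence / expected_confidence

print(round(lift_ratio(0.45, 6, 10), 2))  # 0.75
```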

36. Which statement is true of an association rule?


a. It is ultimately judged on how actionable it is and how well it explains the relationship between item
sets.
b. It is a data reduction technique that reduces large information into smaller homogeneous groups.
c. It uses analytic models to describe the relationship between metrics that drive business performance.
d. It seeks to classify a categorical outcome into two or more categories.
ANSWER: a
RATIONALE: An association rule is ultimately judged on how actionable it is and how well it explains the
relationship between item sets.
POINTS: 1
DIFFICULTY: Moderate
REFERENCES: ASSOCIATION RULES
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Understand
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/19/2021 11:16 AM

37. The __________ the lift ratio, the __________ the association rule.
a. higher; stronger
b. higher; weaker
c. lower; stronger
d. lower; weaker
ANSWER: a
RATIONALE: Lift is defined as the ratio of confidence to expected confidence. Expected confidence is the
number of transactions that include the consequent divided by the total number of
transactions. The higher the lift ratio, the stronger the association rule. A lift ratio greater
than 1.0 suggests that there is some usefulness to the rule and that it is better at identifying
cases when the consequent occurs than no rule at all.
POINTS: 1
DIFFICULTY: Moderate
REFERENCES: ASSOCIATION RULES
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Remember
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/28/2021 5:28 PM

38. Suppose that the confidence of an association rule is 0.75 and the total number of transactions is 250. How many of
those transactions support the consequent if the lift ratio is 1.875?
a. 100
b. 125
c. 150
d. 175
ANSWER: a
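RATIONALE: Lift ratio = confidence / (support of consequent / total number of transactions), so the
support of the consequent = confidence × total number of transactions / lift ratio
= 0.75 × 250 / 1.875 = 100.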

POINTS: 1
DIFFICULTY: Challenging
REFERENCES: ASSOCIATION RULES
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom’s: Apply
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/19/2021 11:16 AM
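
Illustration (not part of the original test bank): question 38 rearranges the lift formula to solve for the support count of the consequent. A short Python sketch, with a hypothetical function name:

```python
def consequent_support_count(confidence, total_transactions, lift):
    """Rearranged lift formula: support = confidence * total transactions / lift."""
    return confidence * total_transactions / lift

print(consequent_support_count(0.75, 250, 1.875))  # 100.0
```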

39. The strength of the association rule is known as __________ and is calculated as the ratio of the confidence of an
association rule to the benchmark confidence.
a. lift
b. antecedent
c. support count
d. consequent
ANSWER: a
RATIONALE: The strength of the association rule is known as lift and is calculated as the ratio of the
confidence of an association rule to the benchmark confidence.
POINTS: 1
DIFFICULTY: Moderate
REFERENCES: ASSOCIATION RULES
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Remember
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/19/2021 11:16 AM

40. The process of extracting useful information from text data is known as __________.
a. text mining
b. tokenization
c. stemming
d. corpus
ANSWER: a
RATIONALE: The process of extracting useful information from text data is known as text mining.
POINTS: 1
DIFFICULTY: Easy
REFERENCES: TEXT MINING
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Remember
DATE CREATED: 3/14/2021 12:40 PM
DATE MODIFIED: 3/19/2021 11:16 AM

41. A collection of text documents to be analyzed is called a ___________.


a. book
b. corpus
c. library
d. consequent
ANSWER: b
RATIONALE: A collection of text documents to be analyzed is called a
corpus.
POINTS: 1
DIFFICULTY: Easy
REFERENCES: TEXT MINING
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Remember
DATE CREATED: 3/14/2021 12:40 PM
DATE MODIFIED: 3/19/2021 11:16 AM

42. The process of dividing text into separate terms is referred to as __________.
a. data cleaning
b. stemming
c. tokenization
d. stacking
ANSWER: c
RATIONALE: The process of dividing text into separate terms is referred to as
tokenization.
POINTS: 1
DIFFICULTY: Easy
REFERENCES: TEXT MINING
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Remember
DATE CREATED: 3/14/2021 12:40 PM
DATE MODIFIED: 3/19/2021 11:16 AM

43. The process of converting a word to its stem, or root word, is referred to as __________.
a. data cleaning
b. stemming
c. tokenization
d. stacking
ANSWER: b
RATIONALE: The process of converting a word to its stem, or root word, is referred to as
stemming.
POINTS: 1
DIFFICULTY: Easy
REFERENCES: TEXT MINING
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Remember
DATE CREATED: 3/14/2021 12:41 PM
DATE MODIFIED: 3/19/2021 11:16 AM

44. In the text mining process, the text is first preprocessed by deriving a smaller set of _________ from the larger set of
words contained in a collection of documents.
a. tokens
b. stems
c. terms
d. stack
ANSWER: a
RATIONALE: In the text mining process, the text is first preprocessed by deriving a smaller set of tokens
from the larger set of words contained in a collection of documents.
POINTS: 1
DIFFICULTY: Moderate
REFERENCES: TEXT MINING
QUESTION TYPE: Multiple Choice
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom's: Remember
DATE CREATED: 3/14/2021 12:41 PM
DATE MODIFIED: 3/19/2021 11:16 AM

Subjective Short Answer

As part of the quarterly reviews, the manager of a retail store analyzes the quality of customer service based on the
periodic customer satisfaction ratings (on a scale of 1 to 10 with 1 = Poor and 10 = Excellent). To understand the level of
service quality, which includes the waiting times of the customers in the checkout section, he collected the data shown
below on 100 customers who visited the store.

Customer Number   Wait Time (min)   Purchase Amount ($)   Customer Age   Customer Satisfaction Rating
1 2.3 436 42 7
2 2.8 408 33 6
3 3.2 432 38 5
4 3.4 431 40 5
5 3.4 456 29 6
6 4.2 537 46 4
7 3.2 456 42 5
8 1.4 430 40 8
9 6.4 663 24 3
10 7.8 839 37 4
11 6.5 659 52 5
12 9.8 836 43 2
13 5 543 56 4
14 1.8 419 35 8
15 6.1 700 39 6
16 3.4 432 44 7
17 7.8 845 33 5
18 2.8 467 42 6
19 1.2 425 46 8
20 9.5 848 50 4
21 8.2 808 55 3
22 7.6 674 35 3
23 5.4 547 52 4
24 6.7 691 38 5
25 9.6 847 53 4
26 11.4 826 48 2
27 2.1 426 52 7
28 5.6 535 32 7
29 3.7 521 43 8
30 4.9 513 44 6
31 6.4 645 53 5
32 9.3 846 52 4
33 10.6 730 51 3
34 6.5 786 53 3
35 5.4 523 46 5
36 7.6 654 36 6
37 3.2 443 48 7
38 2.4 409 54 8
39 1 400 39 6
40 0.2 418 51 7
41 2.4 498 30 6
42 5.7 532 32 5
43 6.4 663 44 7
44 6 681 39 8
45 3.7 543 54 5
46 8.7 800 51 5
47 6.9 673 45 5
48 9.8 856 43 4
49 10 756 44 4
50 9.5 854 43 6
51 6.3 672 50 6
52 7.4 698 47 7
53 2.3 434 43 7
54 4.6 544 40 4
55 4.9 523 53 6
56 5.7 546 55 6
57 7.4 676 42 8
58 6.8 662 36 6
59 9.6 1000 40 5
60 6.4 678 46 5
61 7.2 655 32 4
62 5.6 535 36 5
63 9.7 833 35 3
64 2.3 498 30 7
65 4.3 508 41 6
66 5.7 542 49 6
67 2.4 435 39 8
68 6.7 665 41 5
69 2.4 387 54 9
70 9.8 845 34 7
71 4.5 532 40 6
72 6.7 687 30 5
73 7.2 643 33 4
74 3.5 424 49 7
75 8.9 836 47 5
76 9.7 876 31 4
77 3.5 456 47 7
78 4.7 523 49 6
79 8.5 818 35 5
80 9.7 845 54 4
81 2.7 401 55 7
82 5.7 554 43 6
83 7.6 648 51 7
84 4.4 540 31 6
85 7.8 839 45 5
86 9.4 845 48 4
87 4.9 534 36 5
88 7.1 693 44 4
89 5.4 512 39 3
90 6.7 665 49 5
91 8.6 825 36 5
92 4.5 548 30 7
93 6.1 704 31 5
94 5.3 509 31 6
95 6.7 672 35 5
96 8.1 824 36 4
97 6.3 632 30 4
98 7.4 689 35 2
99 8.8 839 50 4
100 9.6 847 35 2

45. Using the data given, apply k-means clustering with k = 5 using Wait Time (min), Purchase Amount ($), Customer
Age, and Customer Satisfaction Rating as variables. Be sure to Normalize input data and specify 50 iterations and
10 random starts in Step 2 of the XLMiner k-Means Clustering procedure. Analyze the resultant clusters.
What is the smallest cluster? What is the least dense cluster (as measured by the average distance in the cluster)?
What reasons do you see for low customer satisfaction ratings?
ANSWER: We specify # Iterations = 50 and # Starts = 10. We use the default fixed seed of 12345.
We see that cluster sizes range from 6 to 36 customers.
The smallest cluster is Cluster-4, with 6 customers. The least dense cluster is the 36-customer
cluster, Cluster-5, which includes customers with waiting time ranging from 6.1 to
11.4, purchase amount ranging from 654 to 1000, age between 31 and 55, and customer
satisfaction rating ranging from 2 to 7.

From the output below, it appears that longer waiting times and high purchase amounts are
the reasons for low customer satisfaction ratings. The high purchase amounts can be attributed
to high prices of the products in the store.

POINTS: 1
DIFFICULTY: Challenging
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Subjective Short Answer
HAS VARIABLES: False
PREFACE NAME: quarterly reviews
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom’s: Apply
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/28/2021 5:31 PM
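
For readers without XLMiner, the procedure in question 45 can be sketched in plain Python. This is a minimal illustration, not a reproduction of XLMiner: it assumes z-score normalization with the population standard deviation, uses a single random start instead of 10, runs on only the first eight table rows with k = 2, and the seed only makes the sketch repeatable. The names normalize and k_means are hypothetical.

```python
import math
import random
from statistics import mean, pstdev

# First eight rows of the customer table: (wait, purchase, age, rating).
rows = [
    (2.3, 436, 42, 7), (2.8, 408, 33, 6), (3.2, 432, 38, 5), (3.4, 431, 40, 5),
    (3.4, 456, 29, 6), (4.2, 537, 46, 4), (3.2, 456, 42, 5), (1.4, 430, 40, 8),
]


def normalize(data):
    """Convert each column to z-scores (assumed normalization scheme)."""
    stats = [(mean(col), pstdev(col)) for col in zip(*data)]
    return [tuple((v - m) / s for v, (m, s) in zip(row, stats)) for row in data]


def k_means(points, k, iterations=50, seed=12345):
    """Basic k-means: assign each point to its nearest center, then recompute centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # one random start only
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the previous center if a cluster becomes empty
                centers[j] = tuple(mean(col) for col in zip(*members))
    return labels, centers


scaled = normalize(rows)
labels, centers = k_means(scaled, k=2)
print(labels)
```

Because the result depends on the starting centers, a fuller implementation would repeat the run from several starts and keep the solution with the smallest total within-cluster distance.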

46. Using the data given, apply hierarchical clustering with five clusters using Wait Time (min), Purchase Amount ($),
Customer Age, and Customer Satisfaction Rating as variables. Be sure to Normalize input data in Step 2 of the
XLMiner Hierarchical Clustering procedure. Use Ward’s method as the clustering method.
a. Use a PivotTable on the data in the HC_Clusters1 worksheet to compute the cluster centers for the five clusters
in the hierarchical clustering.
b. Identify the cluster with the largest average waiting time. Using all the variables, how would you characterize
this cluster?
c. Identify the smallest cluster.
d. By examining the dendrogram on the HC_Dendrogram worksheet (as well as the sequence of clustering stages
in HC_Output1), what number of clusters seems to be the most natural fit based on the distance?
ANSWER: a. Below is the PivotTable obtained on the data in the “HC_Clusters1” worksheet.

b. Cluster 5 has the largest average waiting time (approx. 9.35 min).
This cluster is a collection of 11 customers characterized by the largest average purchase
amount of about $823, the oldest average customer age, and the lowest average customer
satisfaction rating of 3.36.

c. We see that the size of the clusters does not vary much. However, Cluster 5 is the smallest
cluster with a collection of 11 customers.

d. From the figure below, four clusters appear to be a natural fit for this data. When there are
more than four clusters, mergers result in a small marginal increase in distance, but when there
are fewer than four clusters, mergers lead to a large marginal increase in distance.

POINTS: 1
DIFFICULTY: Moderate
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Subjective Short Answer
HAS VARIABLES: False
PREFACE NAME: quarterly reviews
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom’s: Apply
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/30/2021 10:25 AM
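
The PivotTable step in part (a) of question 46 amounts to averaging each variable within each cluster. A minimal Python sketch of that idea follows. The four records are rows 1, 2, 12, and 20 of the customer table, but the cluster ids attached to them are hypothetical and are not XLMiner output. The function name cluster_centers is also an assumption.

```python
from collections import Counter, defaultdict
from statistics import mean

# (cluster id, wait, purchase, age, rating); cluster ids are made up for illustration.
labeled = [
    (1, 2.3, 436, 42, 7), (1, 2.8, 408, 33, 6),
    (2, 9.8, 836, 43, 2), (2, 9.5, 848, 50, 4),
]


def cluster_centers(records):
    """Average every variable within each cluster, like a PivotTable of means."""
    groups = defaultdict(list)
    for cluster_id, *values in records:
        groups[cluster_id].append(values)
    return {cid: tuple(mean(col) for col in zip(*members))
            for cid, members in groups.items()}


centers = cluster_centers(labeled)
sizes = Counter(cluster_id for cluster_id, *_ in labeled)
print(centers)
print(sizes)
```

The cluster with the largest average waiting time and the smallest cluster can then be read directly from centers and sizes.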

47. a. Using the data given, apply hierarchical clustering with five clusters using Wait Time (min) and Customer
Satisfaction Rating as variables. Be sure to Normalize input data in Step 2 of the XLMiner Hierarchical
Clustering procedure and specify single linkage as the clustering method. Analyze the resulting clusters
by computing the cluster size. It may be helpful to use a PivotTable on the data in the HC_Clusters worksheet
generated by XLMiner to compute descriptive measures of the Wait Time and Customer Satisfaction Rating
variables in each cluster. You can also visualize the clusters by creating a scatter plot with Wait Time (min)
as the x-variable and Customer Satisfaction Rating as the y-variable.

b. Repeat part a using average linkage as the clustering method. Compare the clusters to the previous method.
ANSWER: a. Single linkage results in clusters with extreme sizes. There are three single-customer
clusters (customer-40, customer-70, and customer-98). There is one 90-customer cluster with
waiting time ranging from 1 min to 11.4 min.

b. Average linkage results in two clusters that each have two customers. Some of the single
linkage clusters are closely related to the average linkage clusters. For example, Cluster 1
from single linkage is the merger of Clusters 1, 2, and 3 from average linkage, and Cluster
4 from single linkage is similar to Cluster 5 from average linkage.

POINTS: 1
DIFFICULTY: Moderate
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Subjective Short Answer
HAS VARIABLES: False
PREFACE NAME: quarterly reviews
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom’s: Apply
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/28/2021 5:34 PM
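
The two methods compared in question 47 differ only in how they measure the distance between two clusters. The sketch below uses standard textbook definitions, not XLMiner's code, and the two toy clusters are hypothetical points rather than rows from the customer data.

```python
import math
from itertools import product


def single_linkage(a, b):
    """Distance between the closest pair of observations, one from each cluster."""
    return min(math.dist(p, q) for p, q in product(a, b))


def average_linkage(a, b):
    """Average distance over all pairs of observations, one from each cluster."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))


# Hypothetical normalized (wait time, satisfaction rating) points.
cluster_a = [(0.0, 0.0), (0.0, 1.0)]
cluster_b = [(3.0, 0.0), (4.0, 0.0)]

single = single_linkage(cluster_a, cluster_b)
average = average_linkage(cluster_a, cluster_b)
print(single, average)
```

Single linkage needs only one close pair to merge two clusters, which is one common explanation for the very large cluster and the isolated single-customer clusters reported in part (a).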

48. Using the data given, apply k-means clustering using Wait Time (min) as the variable with k = 3. Be sure to Normalize
input data and specify 50 iterations and 10 random starts in Step 2 of the XLMiner k-Means Clustering procedure. Then
create one distinct data set for each of the three resulting clusters for waiting time.
a. For the observations composing the cluster which has the low waiting time, apply hierarchical clustering with Ward’s
method to form two clusters using Purchase Amount, Customer Age, and Customer Satisfaction Rating as variables. Be
sure to Normalize input data in Step 2 of the XLMiner Hierarchical Clustering procedure. Using a PivotTable on the data
in HC_Clusters, report the characteristics of each cluster.
b. For the observations composing the cluster which has the medium waiting time, apply hierarchical clustering with
Ward’s method to form three clusters using Purchase Amount, Customer Age, and Customer Satisfaction Rating as
variables. Be sure to Normalize input data in Step 2 of the XLMiner Hierarchical Clustering procedure. Using a
PivotTable on the data in HC_Clusters, report the characteristics of each cluster.
c. For the observations composing the cluster which has the high waiting time, apply hierarchical clustering with Ward’s
method to form two clusters using Purchase Amount, Customer Age, and Customer Satisfaction Rating as variables. Be
sure to Normalize input data in Step 2 of the XLMiner Hierarchical Clustering procedure. Using a PivotTable on the data
in HC_Clusters, report the characteristics of each cluster.
ANSWER:

Below is the Pivot table on the data in KM_Cluster1.

a. The interval with the low waiting time is separated into two clusters with respect to
Purchase amount, Age, and Customer satisfaction rating.

b. The interval with the medium waiting time is separated into clusters of 22 and 19
customers with roughly similar Customer Age and Customer Satisfaction Rating. The other
cluster differs primarily in terms of Customer Age and Customer Satisfaction Rating.

c. The interval with the high waiting time is separated into two clusters of 14 and 9 customers
which have similar purchase amount and Customer Satisfaction Rating.

POINTS: 1
DIFFICULTY: Challenging
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Subjective Short Answer
HAS VARIABLES: False
PREFACE NAME: quarterly reviews
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom’s: Apply
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/19/2021 11:16 AM

To examine the local housing market in a particular region, a sample of 120 homes sold during a year is collected. The
data are given below.
LandValue ($) BuildingValue ($) Acres Age Price ($)
18,100 92,500 0.5 53.9 114,885
23,600 152,700 0.22 19.7 180,895
25,900 134,300 0.3 15.9 162,038
22,100 129,600 0.23 41 154,496
23,900 168,700 0.32 39.9 196,973
22,400 118,300 0.25 41.8 145,075
24,100 123,300 0.26 70.9 151,480
26,300 133,800 0.26 37.8 164,762
24,900 139,400 0.24 33 166,528
13,600 87,200 0.17 34.7 105,762
36,100 210,400 0.6 52.9 250,170
19,500 101,300 0.16 67.8 125,082
38,800 224,700 0.44 21.7 265,066
23,500 139,000 0.22 10.8 166,697
26,300 164,200 0.35 3.9 194,881
21,900 122,400 0.17 15.7 146,818
23,400 149,600 0.22 15.7 176,048
15,000 102,200 0.12 97.8 119,584
15,000 102,200 0.12 97.7 121,759
9,200 22,000 0.17 120.9 34,947
9,200 22,000 0.17 120.9 35,214
5,600 48,000 0.12 103.9 57,142
9,000 58,800 0.24 88 72,192
21,000 109,600 0.21 36.7 133,848
23,500 165,900 0.15 5.7 194,079
36,000 262,500 0.22 2.9 300,407
23,700 114,900 0.22 37.7 141,700
22,000 102,700 0.2 48.9 128,866
19,900 95,800 0.23 78.9 119,189
22,100 116,300 0.18 30.8 141,018
24,600 165,500 0.29 43 193,661
21,500 113,400 0.17 44.9 137,308
15,000 81,100 0.16 62.9 99,817
15,700 129,200 0.23 46.7 148,909
14,200 81,600 0.15 57.9 100,701
10,700 49,700 0.15 99.8 65,082
16,600 72,700 0.18 91.8 92,614
25,500 110,700 0.21 48 137,889
15,100 74,300 0.23 71.8 91,180
7,400 55,500 0.15 96.8 64,119
28,500 129,400 0.25 49.9 160,139
25,100 83,900 0.2 45.8 113,043
50,100 164,600 0.23 44 217,684
83,300 276,000 0.61 47.9 360,936
124,500 552,300 1.05 5.7 679,795
47,000 214,400 0.22 92.9 264,115
64,600 185,000 0.58 91 254,075
33,900 138,800 0.22 97.9 173,987
41,100 156,300 0.18 76 200,251
29,100 96,400 0.28 57.8 130,214
56,400 256,400 0.4 56.8 316,874
45,400 219,200 0.21 79.8 267,672
23,800 92,100 0.15 91.9 119,769
52,800 172,800 0.27 74.8 229,499
25,100 99,200 0.19 36.7 128,456
27,200 152,600 0.18 16.7 181,102
28,100 102,900 0.18 75.8 132,977
28,800 98,800 0.19 53.9 131,411
33,400 103,900 0.45 84.9 139,697
20,700 95,600 0.14 89.8 120,046
25,600 101,900 0.2 57.8 131,026
25,800 110,700 0.18 51.9 141,202
29,300 147,700 0.2 90.9 181,575
26,000 116,000 0.18 44 144,513
25,900 73,500 0.16 81.8 100,953
32,800 125,000 0.35 68.7 160,546
31,100 166,800 0.2 57.7 199,970
25,800 105,300 0.17 58.8 134,647
27,200 94,800 0.17 42.9 124,311
25,000 105,900 0.16 82 133,543
29,200 117,500 0.2 53.8 151,392
30,000 93,300 0.26 55.7 124,476
20,400 112,000 0.13 80.9 136,599
23,600 83,400 0.16 57.7 110,399
16,200 85,800 0.1 67 105,027
29,300 123,900 0.22 44.8 157,819
27,000 97,800 0.18 46.8 129,675
25,600 86,300 0.16 61.7 115,952
46,200 220,500 0.57 50.8 268,552
22,900 160,000 0.15 20.7 187,870
27,100 105,200 0.21 51.8 135,549
30,700 107,100 0.3 70 142,738
29,100 102,400 0.23 58 135,284
34,700 150,400 0.28 68.9 189,790
20,000 80,400 0.24 66.9 105,302
35,700 159,400 0.28 1.7 196,936
35,100 161,500 0.25 8.8 201,349
33,700 162,500 0.21 8.8 198,580
33,700 162,500 0.21 8.8 200,228
36,400 176,100 0.29 8.9 215,634
33,200 122,300 0.2 4.9 157,208
39,200 169,200 0.36 5.9 212,662
33,100 180,100 0.2 5.8 217,543
16,000 98,400 0.19 49.9 118,491
24,900 63,800 0.45 83.9 91,539
22,000 121,300 0.27 34.9 147,802
20,000 107,600 0.23 36.7 131,948
33,900 230,800 0.27 10 268,444
22,100 153,800 0.3 46.8 180,464
22,800 111,100 0.23 52 137,326
24,700 117,800 0.32 48.7 145,115
38,700 118,700 0.81 47.8 159,644
25,800 108,000 0.26 53.3 135,049
31,700 140,500 0.34 40.6 174,475
82,200 171,700 1.23 56.4 257,467
19,500 147,600 0.53 28.2 169,311
24,400 132,000 0.25 14.2 157,570
22,500 119,800 0.18 15.5 143,676
25,900 117,100 0.29 17.7 146,960
22,700 95,000 0.25 55.6 121,175
21,200 56,700 0.23 96.6 81,869
34,000 163,800 0.26 15.2 199,361
18,900 118,000 0.17 45.5 139,981
33,900 151,600 0.26 25.3 186,637
23,800 133,500 0.21 13.6 161,123
23,900 119,000 0.21 14.3 146,054
18,500 110,500 0.19 32.2 130,575
36,300 122,500 0.61 56.2 162,270
47,300 298,800 0.36 31.4 348,138
36,600 238,700 0.28 25.5 278,839

49. Using the data given, apply k-means clustering with k = 10 using LandValue ($), BuildingValue ($), Acres, Age, and
Price ($) as variables. Be sure to Normalize input data and specify 50 iterations and 10 random starts in Step 2 of the
XLMiner k-Means Clustering procedure. What is the smallest cluster? What is the least dense cluster (as measured by the
average distance in the cluster)?
ANSWER: We specify # Iterations = 50 and # Starts = 10. We use the default fixed seed of 12345.
We see that the size of the clusters varies widely. There are two single-home clusters,
Cluster-3 and Cluster-6. The least dense cluster is the seven-home cluster, Cluster-2, which includes
homes with age ranging from 21.7 to 91 and price ranging from $159,644 to $360,936.

POINTS: 1
DIFFICULTY: Challenging
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Subjective Short Answer
HAS VARIABLES: False
PREFACE NAME: Housing Market
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom’s: Apply
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/28/2021 5:35 PM
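
The instruction to Normalize input data matters for the housing data because the dollar-valued variables are several orders of magnitude larger than Acres and Age. The sketch below illustrates this with the first five homes in the table. It assumes z-score normalization with the population standard deviation, which may differ from XLMiner's exact scheme, and the function name z_scores is hypothetical.

```python
import math
from statistics import mean, pstdev

# First five homes: (land value, building value, acres, age, price).
homes = [
    (18100, 92500, 0.50, 53.9, 114885),
    (23600, 152700, 0.22, 19.7, 180895),
    (25900, 134300, 0.30, 15.9, 162038),
    (22100, 129600, 0.23, 41.0, 154496),
    (23900, 168700, 0.32, 39.9, 196973),
]


def z_scores(data):
    """Rescale each column to mean 0 and standard deviation 1."""
    stats = [(mean(col), pstdev(col)) for col in zip(*data)]
    return [tuple((v - m) / s for v, (m, s) in zip(row, stats)) for row in data]


scaled = z_scores(homes)
raw_gap = math.dist(homes[0], homes[1])        # driven almost entirely by dollar columns
scaled_gap = math.dist(scaled[0], scaled[1])   # every variable contributes on the same scale
print(raw_gap, scaled_gap)
```

Without rescaling, differences in Acres and Age would have almost no influence on which homes are grouped together.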

50. Using the data given, apply hierarchical clustering with 10 clusters using LandValue ($), BuildingValue ($), Acres,
Age, and Price ($) as variables. Be sure to Normalize input data in Step 2 of the XLMiner Hierarchical Clustering
procedure. Use Ward’s method as the clustering method.
a. Use a PivotTable on the data in the HC_Clusters1 worksheet to compute the cluster centers for the clusters in the
hierarchical clustering.
b. Identify the cluster with the largest average price. Using all the variables, how would you characterize this cluster?
c. Identify the smallest cluster.
ANSWER:
a. Below is the PivotTable obtained on the data in the “HC_Clusters1” worksheet.

b. Cluster 7 has the largest average price (about $296,295). This cluster is a collection of
six homes characterized by a cluster center indicating a relatively moderate land value of
$41,500, the second largest average building value of $251,983, a relatively low average
acres value of 0.33, and a relatively low average age of about 25 years.
c. Clusters 9 and 10 are the smallest clusters each with a single home.
POINTS: 1
DIFFICULTY: Moderate
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Subjective Short Answer
HAS VARIABLES: False
PREFACE NAME: Housing Market
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom’s: Apply
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/28/2021 5:36 PM

51. a. Using the data given, apply hierarchical clustering with 10 clusters using LandValue ($), BuildingValue ($),
Acres, Age, and Price ($) as variables. Be sure to Normalize input data in Step 2 of the XLMiner Hierarchical
Clustering procedure and specify complete linkage as the clustering method. Analyze the resulting clusters by computing
the cluster size. It may be helpful to use a PivotTable on the data in the HC_Clusters worksheet generated by
XLMiner. You can also visualize the clusters by creating a scatter plot with Acres as the x-variable and Price ($) as the
y-variable.
b. Repeat part a using average group linkage as the clustering method. Compare the clusters to the previous method.
ANSWER: a. Complete linkage results in clusters with extreme sizes. There are two single-home
clusters (home-45 and home-105). There is one 43-home cluster, Cluster 3, which has
the average price centered at $124,927.

b. Average group linkage results in three single-home clusters. Only one of the complete
linkage clusters is identical to a cluster from average group linkage: Cluster 10 is the same
under complete linkage and average group linkage.

POINTS: 1
DIFFICULTY: Moderate
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Subjective Short Answer
HAS VARIABLES: False
PREFACE NAME: Housing Market
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom’s: Apply
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/28/2021 5:37 PM

52. Using the data given, apply k-means clustering using Price ($) as the variable with k = 3. Be sure to Normalize input
data and specify 50 iterations and 10 random starts in Step 2 of the XLMiner k-Means Clustering procedure. Then create
one distinct data set for each of the three resulting clusters of price.
a. For the observations composing the cluster with low home price, apply hierarchical clustering with Ward’s method to
form three clusters using Acres and Age as variables. Be sure to Normalize input data in Step 2 of the XLMiner
Hierarchical Clustering procedure. Using a PivotTable on the data in HC_Clusters1, report the characteristics of each
cluster.
b. For the observations composing the cluster with medium home price, apply hierarchical clustering with Ward’s method
to form three clusters using Acres and Age as variables. Be sure to Normalize input data in Step 2 of the XLMiner
Hierarchical Clustering procedure. Using a PivotTable on the data in HC_Clusters1, report the characteristics of each
cluster.
c. Comment on the cluster with the high home price.
ANSWER:
Below is the PivotTable on the data in KM_Cluster1.

a. The interval with the low home price is separated into three clusters with respect to Acres
and Age. The characteristics of each cluster are as below.

b. The interval with the medium home price is separated into three clusters with respect to
Acres and Age. The characteristics of each cluster are as below.

c. The third cluster that has the high home price is a single-home cluster with values for Acres
and Age as 1.05 and 5.7, respectively, and price $679,795.
POINTS: 1
DIFFICULTY: Challenging
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Subjective Short Answer
HAS VARIABLES: False
PREFACE NAME: Housing Market
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom’s: Apply
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/28/2021 5:38 PM

53. A retailer is interested in analyzing the shopping trends of men concerning the items shirts, pants, jeans, t-shirts, shoes,
and belts. A sample of 50 male customers is selected and the data are given below.

t-shirt
Formal Shirt Formal Pants Belt
Formal Pants t-shirt
Formal Pants Formal Shoes Belt
Formal Shirt Formal Pants Jeans t-shirt Formal Shoes Belt
Formal Pants Formal Shirt t-shirt Formal Shoes
Formal Shoes Formal Shirt Formal Pants
Formal Shoes Jeans t-shirt
Formal Pants t-shirt Jeans Formal Shirt Belt Formal Shoes
Belt Jeans
Formal Shirt t-shirt Formal Shoes Belt
Belt Formal Pants Formal Shirt Formal Shoes
Formal Shoes Belt
Formal Shoes Formal Pants t-shirt Formal Shirt
t-shirt Jeans
Formal Shirt Formal Pants Formal Shoes Jeans
Jeans Belt
Formal Shoes Formal Pants Formal Shirt t-shirt Jeans Belt
Formal Shirt Formal Pants t-shirt Formal Shoes Belt
Formal Shoes Formal Pants Formal Shirt Belt
Belt t-shirt Jeans Formal Shoes
t-shirt Formal Pants Formal Shirt Formal Shoes
Formal Pants Formal Shirt Formal Shoes
Formal Pants
Formal Shirt Formal Shoes Jeans Formal Pants
Belt Formal Pants Formal Shoes
Formal Shirt t-shirt
Formal Pants Formal Shoes
Formal Shoes Formal Pants Jeans Formal Shirt
Jeans Formal Pants Formal Shoes
Formal Pants Formal Shoes Belt Formal Shirt
Formal Shoes Formal Pants Formal Shirt
Formal Pants t-shirt
Formal Shoes Formal Pants Belt t-shirt Formal Shirt Jeans
Belt Formal Shoes Formal Shirt Formal Pants
Jeans t-shirt Formal Pants
Formal Shirt Formal Pants Jeans t-shirt Formal Shoes
Formal Pants Formal Shirt Formal Shoes
t-shirt Jeans Formal Shirt
Formal Shoes Formal Pants Belt
Belt Formal Shoes Formal Pants Formal Shirt
Formal Pants Formal Shirt Formal Shoes t-shirt Belt
Formal Shoes t-shirt
Jeans Formal Shoes Belt Formal Shirt
Formal Pants Formal Shoes t-shirt Formal Shirt
Belt Formal Pants Jeans
Formal Shirt Jeans t-shirt Belt
Jeans Formal Pants Belt t-shirt Formal Shoes Formal Shirt
Formal Shirt Jeans Formal Shoes
Formal Shirt Jeans Formal Pants Formal Shoes Belt

a. Using a minimum support of 20 transactions and a minimum confidence of 50 percent, use XLMiner to generate a
list of association rules. How many rules satisfy this criterion?
b. Using the list of rules from part (a), consider the rule with the largest lift ratio. Interpret what this rule is saying about
the relationship between the antecedent item set and consequent item set.
c. Interpret the support count of the item set composed of all the items involved in the rule with the largest lift ratio.
d. Interpret the confidence of the rule with the largest lift ratio.
e. Interpret the lift ratio of the rule with the largest lift ratio.
ANSWER: a. Fourteen rules have a support count of at least 20 and a confidence of at least 50%.
b. Antecedent: Formal Pants, Formal Shoes; Consequent: Formal Shirt. If a customer
purchases formal pants and formal shoes, then he also purchases a formal shirt.
c. The support count of the item set involved in this rule is 23, meaning that formal pants, a
formal shirt, and formal shoes have been purchased 23 times together.
d. The confidence of this rule is 79.31%, which means that of the 29 times formal pants and
formal shoes were purchased, 23 times formal shirts were also purchased.
e. The lift ratio of this rule is 1.37, which means that a customer who purchased formal pants and
formal shoes is 37% more likely to have also purchased a formal shirt than a randomly
selected customer.

POINTS: 1
DIFFICULTY: Easy
REFERENCES: ASSOCIATION RULES
QUESTION TYPE: Subjective Short Answer
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom’s: Apply
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/30/2021 10:36 AM
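
The figures reported in parts (c) through (e) of question 53 can be cross-checked with a short Python sketch. The counts 23 and 29 and the lift of 1.37 come from the answer above. The consequent count of 29 is not stated in the source; it is only the value implied by the reported lift and the sample size of 50. The function names are hypothetical.

```python
def confidence(joint_count: int, antecedent_count: int) -> float:
    """Share of antecedent transactions that also contain the consequent."""
    return joint_count / antecedent_count


def lift(conf: float, consequent_count: int, total: int) -> float:
    """Confidence divided by the benchmark (expected) confidence."""
    return conf / (consequent_count / total)


conf = confidence(joint_count=23, antecedent_count=29)  # counts reported in the answer
implied_consequent = conf / 1.37 * 50  # back-solved from the reported lift; an inference
print(round(conf * 100, 2))
print(round(implied_consequent))
print(round(lift(conf, 29, 50), 2))
```

If the back-solved count is right, formal shirts appear in about 29 of the 50 transactions, which is consistent with the reported lift ratio.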

54. __________ clustering method defines the similarity between two clusters as the similarity of the pair of observations
(one from each cluster) that are the most different.
ANSWER: Complete linkage
RATIONALE: Complete linkage measures dissimilarity between clusters by considering only the two most
dissimilar observations between the two clusters.

POINTS: 1
DIFFICULTY: Moderate
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Subjective Short Answer
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom’s: Apply
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/19/2021 11:16 AM

55. __________ uses the averaging concept of cluster centroids to define between-cluster similarity.
ANSWER: Centroid linkage
RATIONALE: Centroid linkage uses the averaging concept of cluster centroids to define between-cluster
similarity.

POINTS: 1
DIFFICULTY: Moderate
REFERENCES: CLUSTER ANALYSIS
QUESTION TYPE: Subjective Short Answer
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom’s: Apply
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/19/2021 11:16 AM
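
The definitions in questions 54 and 55 can be made concrete with a small Python sketch. It follows the standard textbook definitions of complete linkage and centroid linkage. The two toy clusters are hypothetical and do not come from any data set in this chapter.

```python
import math
from itertools import product
from statistics import mean


def complete_linkage(a, b):
    """Distance between the most dissimilar pair of observations, one from each cluster."""
    return max(math.dist(p, q) for p, q in product(a, b))


def centroid(points):
    """Average of each coordinate across the observations in a cluster."""
    return tuple(mean(col) for col in zip(*points))


def centroid_linkage(a, b):
    """Distance between the two cluster centroids."""
    return math.dist(centroid(a), centroid(b))


# Hypothetical two-variable clusters.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]

complete = complete_linkage(cluster_a, cluster_b)
central = centroid_linkage(cluster_a, cluster_b)
print(complete, central)
```

Complete linkage is driven by the single farthest pair, while centroid linkage summarizes each cluster by its average before measuring distance.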

56. Platinum Gym has 10,000 gym members, of which 1,500 memberships included Unlimited Fitness Training and use
of the tanning salon, and of those, 750 also included Unlimited Hydromassage. If the Fitness Training is considered A, the
use of the tanning salon is considered B, and the Hydromassage is considered C, then the association rule for these sales
becomes "If A and B are purchased, then C is also purchased." Calculate the confidence level.
ANSWER: 0.5
RATIONALE: Total memberships = 10,000
Packages of A and B = 1,500
Packages of A, B, and C = 750

The association rule "If A and B are purchased, then C is also purchased" has a support of 750
out of 1,500 sales. The confidence level = (memberships with A, B, and C) / (memberships with
A and B) = 750/1,500 = 50%.
POINTS: 1
DIFFICULTY: Challenging
REFERENCES: ASSOCIATION RULES
QUESTION TYPE: Subjective Short Answer
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom’s: Apply
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/30/2021 10:37 AM

57. Platinum Gym has 10,000 gym members, of which 1,500 memberships included Unlimited Fitness Training and use
of the tanning salon, and of those, 750 also included Unlimited Hydromassage. If the Fitness Training is considered A, the
use of the tanning salon is considered B, and the Hydromassage is considered C, then the association rule for these sales
becomes "If A and B are purchased, then C is also purchased." Given that total transactions for C are 3,000, calculate the
benchmark confidence level.

ANSWER: 0.3
RATIONALE: The total number of transactions for C is given as 3000. Benchmark confidence is the number
of transactions that includes the consequent divided by the total number of transactions.
Therefore, the benchmark confidence level = 3,000/10,000 = 30%.
POINTS: 1
DIFFICULTY: Challenging
REFERENCES: ASSOCIATION RULES
QUESTION TYPE: Subjective Short Answer
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom’s: Apply
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/28/2021 5:43 PM

58. Platinum Gym has 10,000 gym members, of which 1,500 memberships included Unlimited Fitness Training and use
of the tanning salon, and of those, 750 also included Unlimited Hydromassage. If the Fitness Training is considered A, the
use of the tanning salon is considered B, and the Hydromassage is considered C, then the association rule for these sales
becomes "If A and B are purchased, then C is also purchased." Given that total transactions for C are 3,000, calculate the lift
for this rule.
ANSWER: 1.67
RATIONALE: Total memberships = 10,000
Packages of A and B = 1,500
Packages of A, B, and C = 750
The association rule "If A and B are purchased, then C is also purchased" has a support of
750 out of 1,500 sales. The confidence level = (memberships with A, B, and C) / (memberships
with A and B) = 750/1,500 = 50%.
The total number of transactions for C is given as 3000. Benchmark confidence is the number
of transactions that include the consequent divided by the total number of transactions.
Therefore, the benchmark confidence level = 3,000/10,000 = 30%. The lift is calculated as
confidence / benchmark confidence = 50% / 30% = 1.67.
POINTS: 1
DIFFICULTY: Challenging
REFERENCES: ASSOCIATION RULES
QUESTION TYPE: Subjective Short Answer
HAS VARIABLES: False
NATIONAL STANDARDS: United States - BUSPROG: Analytic
United States - DISC: - Descriptive Statistics
KEYWORDS: Bloom’s: Apply
DATE CREATED: 1/23/2021 10:16 AM
DATE MODIFIED: 3/28/2021 5:44 PM
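
The three Platinum Gym calculations in questions 56 through 58 can be reproduced in a few lines of Python. All counts come from the question text. The variable names are chosen only for illustration.

```python
total_members = 10_000  # all memberships
count_a_b = 1_500       # memberships with Fitness Training (A) and tanning (B)
count_a_b_c = 750       # memberships with A, B, and Hydromassage (C)
count_c = 3_000         # memberships with Hydromassage (C)

confidence = count_a_b_c / count_a_b           # question 56
benchmark_confidence = count_c / total_members # question 57
lift = confidence / benchmark_confidence       # question 58

print(confidence, benchmark_confidence, round(lift, 2))
```

A lift above 1.0 indicates that members who bought A and B were more likely to buy C than a randomly selected member.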
