Week 6: Test Bank Questions Data Mining and Data Warehousing - IT 446
3) If the equation 𝜎(𝑋) = |{𝑡𝑖 | 𝑋 ⊆ 𝑡𝑖, 𝑡𝑖 ∈ 𝑇}| represents the value known as the support count in
associative analysis, then
A. It must be true that ti represents a member of the itemset X
B. Then the higher the value, 𝜎(𝑋) is determined to be, the less likely that the itemset is meaningful
in the final analysis results
C. 𝜎(𝑋) cannot be the support value, because the set T is clearly always going to be equal to the
null set
D. X is the itemset for which we are trying to determine the number of transactions, 𝑡𝑖 , in the data
set of transactions, T, that contain X
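Illustration for question 3: a minimal sketch of the support count, counting the transactions in T that contain every item of X. The function name `support_count` and the five market-basket transactions are hypothetical, chosen only for this example.

```python
def support_count(itemset, transactions):
    """Return sigma(X): the number of transactions that contain every item of X."""
    x = set(itemset)
    return sum(1 for t in transactions if x <= set(t))


# Hypothetical transaction set T, invented for illustration.
T = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

print(support_count({"milk", "diapers"}, T))  # 3
```

Note that the empty itemset is contained in every transaction, so its support count equals the size of T.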
4) The theorem that states “If an itemset is frequent, then all of its subsets must also be frequent” is often
referred to as the
A. Theorem that can never be proved
B. Apriori Principle
C. The key to understanding market basket analysis
D. All of the above
5) In associative analysis, “frequent”, or, put differently, the minimum support value that an itemset must
have to be considered in the final results
A. Is specific to the data being analyzed
B. Is a constant for a given organization’s data, but can be different across organizations
C. Is often found to be non-determinable
D. Is often just a random value chosen by the analyst and so cannot be tested as to its validity
6) If there exist two confidence rules, 𝑋 → 𝑌 and 𝑋̃ → 𝑌̃, for which both of these are true,
𝑋̃ ⊆ 𝑋 and 𝑌̃ ⊆ 𝑌, which of the following is true:
A. 𝑋 ⊆ 𝑋̃ and 𝑌 ⊆ 𝑌̃
B. No such confidence rules can co-exist
C. The confidence of 𝑋 → 𝑌 can be higher than, equal to, or lower than the confidence of 𝑋̃ → 𝑌̃
D. The confidence of 𝑋 → 𝑌 can only be equal to the confidence of 𝑋̃ → 𝑌̃
9) Which of these possible attributes of a maximal itemset actually is the defining one:
A. None of the immediate subsets of the given itemset are frequent, but it is frequent
B. All of the immediate supersets of the given itemset are frequent, but it is frequent
C. None of the immediate supersets of the given itemset are frequent, but it is frequent
D. One or more of the immediate supersets of the given itemset are frequent, but it is not frequent
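Illustration for question 9: a small sketch that filters a complete list of frequent itemsets down to the maximal ones. It tests for any frequent proper superset. By the Apriori principle this is equivalent to testing only the immediate supersets, provided the input list really contains all frequent itemsets (an assumption of this sketch). The function name and the sample itemsets are hypothetical.

```python
def maximal_itemsets(frequent):
    """Keep the frequent itemsets that have no frequent proper superset."""
    frequent = [frozenset(s) for s in frequent]
    return [s for s in frequent if not any(s < other for other in frequent)]


# Hypothetical complete list of frequent itemsets.
frequent = [{"a"}, {"b"}, {"c"}, {"a", "b"}, {"a", "c"}]
print(maximal_itemsets(frequent))
```

Here {a, b} and {a, c} are maximal, while {a}, {b} and {c} are not, because each has a frequent superset.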
10) Which of these algorithms is often useful in finding maximal itemsets in a collection:
A. Shallow first
B. Binary sort
C. Backtrack
D. Depth-first
11) Because the FP-Growth algorithm abandons the generate-and-test approach of the Apriori algorithm in
favor of a significantly more direct paradigm of storing the data in a compact data structure and directly selecting
the frequent itemsets from the structure, one can safely say the FP-Growth algorithm is a radical
departure from Apriori.
Answer true
Answer false
13) Correlation analysis can be used to supplement support-confidence frameworks to discover interesting
patterns, especially when low support thresholds are being used.
14) The end user of an analysis is the only one who can ultimately judge if a given rule is interesting in
terms of the results it produces.
Answer true
15) A good interestingness measure will not be impacted by transactions that do not contain itemsets of
interest, because measures that are so impacted generate unstable results
16) If the function Lift(X, Y) returns a value less than 1, then the presence of event X in a given set of
events most likely means event Y is absent, and the reverse is also true.
Answer true
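Illustration for question 16: a minimal sketch of lift computed from raw counts, using the standard definition Lift(X, Y) = P(X ∪ Y) / (P(X) P(Y)). The function name and the counts below are hypothetical.

```python
def lift(n_xy, n_x, n_y, n):
    """Lift(X, Y) from counts: n_xy transactions contain both X and Y,
    n_x contain X, n_y contain Y, and n is the total number of transactions."""
    return (n_xy / n) / ((n_x / n) * (n_y / n))


# Hypothetical counts: 10,000 transactions, 6,000 with X, 7,500 with Y, 4,000 with both.
value = lift(4000, 6000, 7500, 10000)
print(round(value, 3))  # 0.889
```

A value below 1, as here, indicates that X and Y occur together less often than they would if they were independent.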
17) If the χ2 value is greater than 1 and the observed value of a slot (X, Y) is less than the expected value,
then there is a negative correlation between members X and Y of the slot.
18) To say that the measures for interestingness, lift and χ2 are not null-invariant, is to assert that their
values are not independent of the number of null transactions in the data set being analyzed.
Answer true
19) Han, Kamber and Pei recommend the use of the Kulc null-invariant measure in conjunction with the
imbalance ratio in determining interestingness.
Answer true
Answer false
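Illustration for question 19: a sketch of the Kulczynski measure, Kulc(A, B) = (P(A|B) + P(B|A)) / 2, and the imbalance ratio, IR(A, B) = |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A ∪ B)). Neither formula uses the total number of transactions, which is why both are unaffected by null transactions. The function names and support counts are hypothetical.

```python
def kulczynski(sup_a, sup_b, sup_ab):
    """Average of the two conditional probabilities P(B|A) and P(A|B)."""
    return 0.5 * (sup_ab / sup_a + sup_ab / sup_b)


def imbalance_ratio(sup_a, sup_b, sup_ab):
    """How lopsided the supports of A and B are; 0 means perfectly balanced."""
    return abs(sup_a - sup_b) / (sup_a + sup_b - sup_ab)


# Hypothetical support counts: A in 1,100 transactions, B in 11,000, both in 1,000.
print(round(kulczynski(1100, 11000, 1000), 3))       # 0.5
print(round(imbalance_ratio(1100, 11000, 1000), 3))  # 0.892
```

In this example Kulc is neutral (0.5) while the imbalance ratio is high, which is the kind of case where reporting the two numbers together is more informative than either alone.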
21) __________ is a methodology useful for discovering interesting relationships within large sets of data.
A. Big Data
B. Association analysis
C. Data Mining
D. Algorithm
23) Sets of frequent items hidden in large data sets are called _________.
A. association rules.
B. big data.
C. binary representation.
D. analysis.
24) Two key issues that must be addressed when applying association analysis are:
A. Overpopulation
B. Computational Expense
C. Discovering spurious patterns
D. Data density
25) The strength of an association rule can be measured in terms of its____ and _____.
A. support
B. finances
C. confidence
D. size
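Illustration for question 25: a minimal sketch that computes the support and confidence of a rule X → Y from a list of transactions, where support is the fraction of transactions containing X ∪ Y and confidence is σ(X ∪ Y) / σ(X). The function name `rule_metrics` and the transactions are hypothetical.

```python
def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    x, y = set(antecedent), set(consequent)
    n = len(transactions)
    n_x = sum(1 for t in transactions if x <= set(t))
    n_xy = sum(1 for t in transactions if (x | y) <= set(t))
    support = n_xy / n
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence


# Hypothetical transactions, invented for illustration.
T = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
support, confidence = rule_metrics({"milk", "diapers"}, {"beer"}, T)
print(support, round(confidence, 3))  # 0.4 0.667
```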
26) The ______ approach for mining association rules is to compute the support and confidence for every
rule.
A. Hard-nosed
B. Brute force
C. Comprehensive
D. Association Rule
27) The objective of the ________________ strategy is to find all the itemsets that satisfy the minsup
threshold.
A. Association Rule Discovery
B. Correct Answer
C. Frequent Itemset Generation
D. Incorrect Answer
28) The objective of the ________________ strategy is to extract all the high-confidence rules from the
frequent itemsets found in the Frequent Itemset Generation.
A. Association Rule
B. Rule Generation
C. Association Analysis
D. Association Generation
29) Trimming the exponential search space based on the support measure is known as:
A. Exponential Pruning
B. Apriori Algorithm
C. Trimming
D. Support-Based Pruning
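Illustration for question 29: a compact sketch of level-wise frequent itemset generation in the style of Apriori, showing where support-based pruning discards candidates before they are counted. This is a simplified teaching version under stated assumptions (a brute-force join, a support count threshold rather than a fraction), not an optimized implementation. The function name and data are hypothetical.

```python
from itertools import combinations


def apriori(transactions, minsup_count):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    items = {frozenset([i]) for t in transactions for i in t}
    level = {c: n for c, n in count(items).items() if n >= minsup_count}
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Support-based pruning: drop a candidate if any (k-1)-subset is infrequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = {c: n for c, n in count(candidates).items() if n >= minsup_count}
        frequent.update(level)
        k += 1
    return frequent


# Hypothetical transactions, invented for illustration.
T = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
for itemset, n in sorted(apriori(T, 3).items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

With a minimum support count of 3, the candidate {bread, diapers, beer} is never counted, because its subset {bread, beer} is already known to be infrequent.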
30) __________ are visual structures that use branches and leaf nodes to search an item or itemset.
A. Functions
B. Hash Trees
C. Root Node
D. Algorithms
31) Market basket analysis studies customers’ buying habits by searching for itemsets that are frequently
purchased together (or in sequence).
TRUE
33) Simpson’s paradox is a phenomenon where hidden variables may cause the observed relationship to
multiply.
FALSE
34) Cross-support patterns are likely to be spurious because their correlations tend to be weak.
TRUE
37) The Apriori Principle says that if an itemset is frequent then all of its subsets must also be frequent.
TRUE
39) Highly correlated and strongly associated patterns are called hyperclique patterns.
TRUE
40) Frequent pattern mining has reached far beyond basics due to substantial research, numerous extensions
of the problem scope, and broad application studies.
TRUE
Week 7
1) According to Han, Kamber and Pei, which of the following is not a high-level pattern mining research
field:
A. Mining Methods
B. Kinds of Patterns and Rules
C. Extended Data Types
D. Extensions and Applications
2) According to Han, Kamber and Pei, which of the following is not an extended data type that researchers
in the pattern mining field are studying
A. Spatial (such as colocation)
B. Rational and Integer number patterns
C. Network patterns
D. Temporal patterns
3) The pattern of buying by customers of specific items such as (Tablet ⟶ Phone⟶ Headset) is known as
a
A. Temporal Pattern
B. Sequential Pattern
C. Structural Pattern
D. Indeterminate Pattern
E. None of the above
7) The Pattern-Fusion data mining strategy uses which of these tree traversal approaches
A. Depth-first
B. Bounded-breadth
C. Breadth-first
D. None of the above
9) The output structures of one approach to image analysis and recognition is referred to as
A. Visual bytes
B. Sound bits
C. Visual words
D. None of the above
11) Optimization in the world of statistics means to discover the minimum or maximum values of a given
function.
Answer true
12) The K-means clustering algorithm seeks to determine a set of clusters for which the sum of the squared
error is minimized.
Answer true
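Illustration for question 12: a minimal sketch of the sum of the squared error (SSE) for a given clustering, taking each cluster's centroid to be the mean of its points. The function name `sse` and the sample points are hypothetical.

```python
def sse(clusters):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    total = 0.0
    for points in clusters:
        dim = len(points[0])
        centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]
        total += sum(
            sum((p[d] - centroid[d]) ** 2 for d in range(dim)) for p in points
        )
    return total


# Hypothetical 2-D points grouped into two clusters.
clusters = [[(0, 0), (2, 0)], [(10, 10), (10, 12)]]
print(sse(clusters))  # 4.0
```

K-means searches for the assignment of points to k clusters that makes this quantity as small as it can find.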
13) When analytical methods for determining optimal values of a given function fail, we often turn to
numerical methods for approximating those values
Answer true
15) Two types of constraints must be dealt with in some forms of optimization: Equality and Inequality
16) The following equation holds true for negatively correlated association rules, 𝑥 ⟶ 𝑦
17) The graph isomorphism problem comes into play when pruning candidate subgraphs during subgraph
mining operations
Answer: true
23) If a rule involves associations between the presence or absence of items, it is called a:
A) Boolean Association Rule
B) Multidimensional Association Rule
C) Quantitative Association Rule
D) Approximation Rule
26) ___________________ recommend(s) information items that are likely to be of interest to the user
based on similar users’ patterns.
A) Recommender Systems
B) Tendency Reports
C) E-Commerce Data
D) Analysis Tasks
27) A ______- sized threshold can be defined to specify the maximum allowed time difference between the
latest and earliest occurrences of events in any element of a sequential pattern.
A) Giant
B) Window
C) Pattern
D) Limited
29) A(n) __________ pattern has a frequency support that is below a user specified minimum support
threshold.
A) Infrequent/Rare
B) Frequent/Regular
C) Large/Significant
D) Small/Minimal
30) The lower-level core patterns of a colossal pattern are called __________.
A) Basic patterns
B) Central descendants
C) Basic descendants
D) Core descendants
31) Mutual information is widely used in information theory to measure the mutual independency of two
random variables.
TRUE
34) The approach of iteratively expanding a subgraph by adding an extra vertex is known as edge growing.
FALSE
35) The approach of iteratively expanding a subgraph by adding an extra vertex is known as vertex growing.
TRUE
36) Determining whether two graphs are topologically equivalent is known as graph isomorphism.
TRUE
37) Objective measures alone may be sufficient to eliminate uninteresting infrequent patterns.
FALSE
38) Interesting, infrequent patterns are also known as indirect association patterns.
TRUE
39) Negative itemsets and negative association rules are collectively known as positive patterns.
FALSE
40) Negatively correlated patterns are useful for identifying competing items, or items that can be substituted
for one another.
TRUE
Week 9
1) Choose all of the following that are a prediction problem that comes under the task of classification:
A. Please tell me which of my current mortgage applicants I can safely approve for a mortgage and
which are unsafe at this time to approve for any kind of mortgage.
B. Please tell me the likelihood of each of my current mortgage clients to default on their mortgage in
the next 12 months.
C. Please tell me the maximum monthly payment Client X can safely make, that is, without becoming a
risk for foreclosure.
D. Please tell me which of my current mortgage clients are safe candidates for a low fixed interest rate
mortgage, which for an initial low-rate, variable rate mortgage, and which for a high, fixed interest
rate mortgage.
2) Which of these terms describes the first major task in the data classification process:
A. Classify
B. Learning
C. Choose training data
D. Analyze the training data for possible “noise” in the database
E. None of the above
3) If the classes of the training data are an unknown, we can apply which of these computational algorithms
to attempt to find useable classes
A. Binary Sort
B. Bayesian Analysis
C. Pruning Analysis
D. Clustering
4) Which of these are typically used to represent the sought-after mapping function in the first step of the
classification process:
A. Decision Trees
B. Mathematical formula
C. Classification rules
D. All of the above
E. None of the above
9) Which of these attribute lists is most true for the rule set extracted from a decision tree:
A. Such rules are mutually exclusive, exhaustive and unordered
B. Such rules are non-exclusive, exhaustive, ordered
C. Logical OR exists between such rules, they are unordered.
D. None of the above
10) Ensemble methods for improving accuracy most resemble which of these situations:
A. A person gathers the opinions of 10 doctors for how best to cure that person’s illness, the person
chooses the cure approach most often recommended by the 10 doctors
B. A person gathers the opinions of 100 randomly chosen doctors for how best to cure that person’s
illness, the person chooses the cure approach recommended the most by doctors with incomes
greater than the median for the group
C. A person gathers the opinions of the three top-rated doctors in the world in the field of medicine
that his illness falls within on how best to cure the person’s illness and the opinions of 5 other
randomly chosen doctors within the same field, and chooses the opinion of the top doctor who is
matched by the most doctors in the random chosen group.
D. None of the above
11) According to Tan, Steinbach and Kumar, classification, the act of assigning objects to one of a set of
predefined categories, is a problem area for many data mining applications
Answer true
12) Tan, Steinbach and Kumar assert that the input data for a classification task is a collection of records.
Han, Kamber and Pei prefer the term tuple instead of record.
13) Tan, Steinbach and Kumar assert that classification techniques work much better for data sets that are
classifiable by binary or nominal categories than those that are classified by ordinal categories.
14) Hunt’s algorithm depends on an attribute test condition to determine how to split a set of records that can
be labeled by more than one category or class.
15) To say that a given node in a classification tree is pure is to say that all the records in the node belong to
one and only one of the predetermined classes
Answer true
16) Entropy, in the case of determining the best split, is a measure of the skewedness of the class
distribution, or put differently, the degree of impurity in the child nodes.
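Illustration for question 16: a minimal sketch of entropy as an impurity measure, Entropy = -Σ p_i log2 p_i over the class proportions in a node, plus the weighted average over the child nodes of a split. The function names and class counts are hypothetical.

```python
from math import log2


def entropy(class_counts):
    """Entropy of a node given the number of records in each class."""
    n = sum(class_counts)
    return -sum((c / n) * log2(c / n) for c in class_counts if c > 0)


def weighted_child_entropy(children):
    """Average entropy of the child nodes, weighted by their record counts."""
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * entropy(c) for c in children)


print(entropy([5, 5]))  # 1.0, the most impure two-class node
print(round(weighted_child_entropy([[4, 0], [1, 5]]), 3))
```

A split is better the further the weighted child entropy falls below the entropy of the parent node.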
17) To inductively grow a decision tree, one needs to be able to create nodes, find the best split of records in
a node into two or more nodes, classify the leaf nodes, and determine when to stop trying to grow the
tree.
Answer true
18) If a training set lacks sufficient example records of a given class, then misclassification or overfitting
occurs.
19) Einstein’s assertion that we should make everything as simple as possible, but no simpler, is another way
of stating the test for selecting the best model from several possible models for data mining: the model should be both
simple in its structure and competent, i.e., no simpler than is possible.
Answer true
20) Prepruning requires a sufficiently restrictive stopping condition that prevents the decision tree induction
algorithm from growing a tree that is highly prone to overfitting the data.
21) __________ is a form of data analysis that extracts models describing important data classes.
A) Classification
B) Model
C) Attribute set
D) Class model
24) Each classification technique employs a ______________ to identify a model that best fits the
relationship between the attribute set and class level of the input data.
A) learning algorithm
B) binary split
C) test set
D) internal node
25) A(n) _____________ shows counts of test records correctly and incorrectly predicted by a classification
model.
A) decision tree
B) attribute class
C) confusion matrix
D) learning model
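Illustration for question 25: a minimal sketch that tallies (actual, predicted) pairs for a set of test records and derives accuracy from the tallies. The function name and the labels are hypothetical.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count test records for each (actual class, predicted class) pair."""
    return Counter(zip(actual, predicted))


# Hypothetical test records.
actual = ["yes", "yes", "no", "no", "yes"]
predicted = ["yes", "no", "no", "yes", "yes"]

cm = confusion_matrix(actual, predicted)
correct = sum(n for (a, p), n in cm.items() if a == p)
print(cm[("yes", "yes")], cm[("yes", "no")], cm[("no", "yes")], cm[("no", "no")])  # 2 1 1 1
print(correct / len(actual))  # 0.6
```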
27) The basis of existing decision tree algorithms ID3, C4.5, and CART is called ________________.
A) ID4
B) Gini index
C) Hunt’s Algorithm
D) Information gain
29) A statistical technique that creates several smaller samples (subsets) is called:
A) scaling
B) division
C) bootstrapping
D) subsetting
30) ___________________ assumes that the effect of an attribute value on a given class is independent of
the values of the other attributes.
A) RainForest
B) Tree Pruning
C) Oversampling
D) Naïve Bayesian classification
31) There cannot be more than one decision tree that fits the same data.
FALSE
32) A classification model can also be used to predict the class label of unknown records.
TRUE
33) “Overfitting” is when a tree becomes too large and its test error rate begins to increase, even though its
training error rate continues to decrease.
TRUE
34) The “Scalability” of a database is defined as classifying data sets with millions of examples and
hundreds of attributes inefficiently, over a very long period of time.
FALSE
38) A “null” hypothesis is the hypothesis that two models, such as M1 and M2, are the same.
TRUE
39) Significance tests and ROC curves are useless for model selection.
FALSE
40) No method has been found superior for all data sets.
TRUE
Week 10
1) Bayesian belief networks (BBN) differ from naïve Bayesian classifiers (BC) in that:
A. BCs assume the values of the attributes in a tuple are conditionally independent, while BBNs do not
make that assumption
B. BBNs are always computationally more efficient than are BCs
C. BCs are always more accurate, even when the data tuples have attribute values that are not
necessarily conditionally independent
D. No significant differences exist
11) Rule-ordering classification approaches can only be done rule-by-rule. It is not possible to order classes.
Answer false
12) When generating rules using the sequential covering algorithm, a rule is considered to be acceptable if it
covers all positive examples in the training set. The number of negative examples it covers does not
come into play.
Answer false
13) The Learn-One-Rule function handles the computational complexity issue, where search costs can grow
exponentially, by growing the rule in a greedy fashion, stopping only when a preset stopping condition
is met
14) Which of these is a Rule-Growing Strategy (select all that meet this criterion)?
A. General to specific
B. Start-in-middle
C. Start at end
D. Start at beginning
E. Specific to general
15) Given the true probability distribution governing P(X|Y), the Bayesian classification method allows for
the determination of the decision boundary that is ideal | most problematic. (circle the correct term)
16) What is the single equation that the parameters W and b, defining the decision boundary for a support
vector machine being trained, must meet? Assume yi can take on one of two values, 1 and -1
yi (w · xi + b) ≥ 1, i = 1, 2, ..., N.
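Illustration for question 16: a small sketch that checks whether a candidate weight vector w and bias b satisfy the linear SVM constraint, namely that the class label times (w · x + b) is at least 1 for every training point. The function name, the weights and the points are hypothetical.

```python
def satisfies_margin(w, b, X, y):
    """True if y_i * (w . x_i + b) >= 1 holds for every training example."""
    return all(
        yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 1
        for xi, yi in zip(X, y)
    )


# Hypothetical 2-D training points with labels +1 and -1.
X = [(3.0, 0.0), (1.0, 0.0)]
y = [1, -1]
print(satisfies_margin((1.0, 0.0), -2.0, X, y))  # True
print(satisfies_margin((0.5, 0.0), -1.0, X, y))  # False
```

Scaling w and b down by half keeps the same decision boundary but breaks the constraint, which is why the constraint fixes the scale of the parameters during training.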
17) For support vector machines, the decision boundary the W and b parameters define must be maximal.
18) The base classifiers used in an ensemble classification approach must meet two necessary conditions for
the ensemble classifier to perform better than a single classifier.
19) When an ensemble classifier is built by manipulating the training set, there are how many classifiers?
20) A false positive result when classifying a particular record / tuple can be very significant when the goal
is to detect records that fall into a rare class in the target population.
21) This belief network, used for classification via graphic models, allows the representation of
dependencies among subsets of attributes.
A. Atkinson Index
B. Bayesian belief network
C. Theil Index
D. Simpson Index
22) ______________ learns by iteratively processing a data set of training tuples, comparing the network’s
prediction for each tuple with the actual known target value.
A) split
B) set
C) Backpropagation
D) classifier
23) This theory can be used to approximately define classes that are not distinguishable based on the
available attributes.
A. Backpropagation
B. Genetic algorithms
C. Fuzzy set
D. Rough set theory
24) A form of supervised learning that is also suitable for situations where data is abundant, yet the class
labels are scarce or expensive.
A. Passive learning
B. Active teaching
C. Active learning
D. Passive teaching
25) This type of classification builds a classifier using both labeled and unlabeled data.
A. Multi-class classification
B. Semi-supervised classification
C. Support vectors
D. Instance based
26) The right side of the rule is called the ___________________, which contains the predicted class.
A. rule consequent
B. precondition
C. rule antecedent
D. data set
28) _____________ aims to extract the knowledge from one or more source tasks and apply the knowledge
to a target task.
A. Instance-based approach
B. TrAdaBoost
C. Transfer learning
D. Negative transfer
29) In order to reduce the generalization error of rules generated by the Learn-One-Rule function, one can use
______________.
A. Sequential covering
B. Alternative techniques
C. Rule growing
D. Rule pruning
30) ________ chooses the majority class as its default class and learns the rules for detecting the minority
class.
A. FOIL’s information gain
B. RIPPER
C. Rote classifier
D. Support vector machine
31) A support vector machine is an algorithm for the classification of both linear and nonlinear data.
TRUE
32) Decision tree and rule-based classifiers are examples of lazy learners.
FALSE
34) Training examples that are relatively similar to the attributes of the test example are called nearest
neighbors.
TRUE
35) A naïve Bayesian classifier estimates the class-conditional probability by assuming that the attributes are
conditionally dependent.
FALSE
37) Which of the following are two key elements of a Bayesian network?
A. a DAG
B. a decision tree
C. an error rate
D. a probability table
40) The kernel trick is a method that solves issues with irrelevant attributes.
FALSE
Week 11
1) According to Han, Kamber and Pei, which of the following is the best tool for dealing with data that
needs to be labeled, but there are no predetermined labels:
A. Pattern Matchers
B. Classifiers
C. Clusterers
D. Recognizers
2) According to Han, Kamber and Pei, which of the following is an application where data segmentation is
useful
A. Market basket analysis
B. Outlier detection
C. Trend prediction
D. Temporal pattern discovery
3) Which of these is not considered a basic clustering technique by Han, Kamber and Pei
A. Partitioning methods
B. Sequential methods
C. Hierarchical methods
D. Density-based methods
E. Grid-based methods
6) K-means is a
A. Method for creating clusters based on similarities amongst members of a cluster
B. Method for clustering that requires an initial value, k, that predetermines the number of clusters to
create from a given data set
C. Method that avoids the fact that optimizing the within-cluster variation is NP-hard for k clusters in 2-
D Euclidean space
D. All of the above
7) Hierarchical clustering methods
A. Are always top-down in their approach to forming hierarchies
B. Are impervious to the choice of split / merge points
C. Start either with individual objects as clusters or with the entire data set as a single cluster
D. All of the above
E. None of the above
8) Density-based methods
A. Resolve the issue of finding clusters of arbitrary shapes
B. Include DBSCAN, which starts with a parameter, ε > 0, specifying the radius of an object’s
neighborhood, and a second parameter, MinPts, which specifies the minimum number of objects in a
neighborhood of an object in order for that object to be considered a core object
C. Include the concept of density-connectedness to form clusters out of small dense regions
D. All of the above
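Illustration for question 8: a minimal sketch of the DBSCAN core-object test, using the two parameters named in option B. It assumes Euclidean distance and counts an object as part of its own ε-neighborhood. The function name and the points are hypothetical, and this covers only the core-object check, not the full clustering algorithm.

```python
from math import dist


def core_objects(points, eps, min_pts):
    """Return the points whose eps-neighborhood holds at least min_pts objects."""
    cores = []
    for p in points:
        # The neighborhood includes p itself, since dist(p, p) == 0 <= eps.
        neighbors = [q for q in points if dist(p, q) <= eps]
        if len(neighbors) >= min_pts:
            cores.append(p)
    return cores


# Hypothetical data: three nearby points and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10)]
print(core_objects(points, eps=1.5, min_pts=3))  # [(0, 0), (0, 1), (1, 0)]
```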
10) According to Han, Kamber and Pei, which of these is not a measure used to evaluate clustering methods
A. Clustering quality
B. Computational costs
C. Clustering tendency
D. Determination of number of clusters in a data set
11) If the sum of the squared error of a set of clusters is near 0, then we can be reasonably certain that the
clusters are valid.
Answer true
12) Even if the difference between two clusters is statistically significant (repeatable), that does not
necessarily mean that the magnitude of that difference is significant for any given application. That is, a
difference of 0.1% for application A might be significant, while for application B, such a low difference
is not important.
Answer true
13) Many clustering approaches assume the following:
A) If two objects were created at essentially the same time, then they must belong to the
same cluster, and thus have the same class label
B) If two objects have the same class label, then they must belong to the same cluster, but
not the reverse.
C) If two objects are in the same cluster, then they must have the same class label, but not
the reverse.
D) If two objects are in the same cluster, then they must have the same class label, and the
reverse is also true
E) None of the above
14) If we maximize the recall competency of a clustering algorithm we are most likely lowering the
precision competency.
15) Maximizing the recall competency increases the likelihood of false positives
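Illustration for questions 14 and 15: a minimal sketch of precision and recall computed from true positive, false positive and false negative counts. The two scenarios below use hypothetical counts to show a case where higher recall comes with more false positives and lower precision.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical counts for a strict and a lenient assignment rule.
print(precision_recall(tp=40, fp=10, fn=60))  # (0.8, 0.4)
print(precision_recall(tp=90, fp=90, fn=10))  # (0.5, 0.9)
```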
16) By minimizing separation of clusters, we are maximizing the cohesion of the individual clusters.
17) In graph-based clusters, the measure of the cohesion of a given cluster is found by summing the weights
associated with links between members of the cluster
18) In graph-based clusters, the measure of the separation between two given clusters is found by summing
the weights associated with links between the points in one cluster and points in the second cluster.
Answer: true
19) The higher the total sum of the squares (SSB) for a group of clusters, the lower the separation of the
clusters.
Answer: false
20) If neither one of two given clusters are clearly cohesive, but they are in close proximity to each other,
merging the two clusters is often a good option.
21) A(n) _______ is a collection of data objects that are similar to one another within the same cluster and
are dissimilar to the objects in other groups.
A. application
B. cluster
C. data group
D. field
23) This is a prototype-based, partitional clustering technique that attempts to find a user-specified number
of clusters, which are represented by their centroids.
A) K-means
B) Agglomerative Hierarchical Clustering
C) DBSCAN
24) This is a density-based clustering algorithm that produces a partitional clustering in which the number of
clusters is automatically determined by the algorithm.
A) K-means
B) Agglomerative Hierarchical Clustering
C) DBSCAN
25) A clustering technique that starts with each point as a singleton cluster and then repeatedly merges the
two closest clusters until a single, all-encompassing cluster remains.
A) K-means
B) Agglomerative Hierarchical Clustering
C) DBSCAN
26) _____________ assesses the feasibility of clustering analysis on a data set and the quality of the results
generated by a clustering method.
A) Grid-based method
B) Analytics
C) Clustering evaluations
D) Hierarchical method
28) This technique is based on picking an initial solution and then repeating two steps: compute the change
and update the solution.
A) Gradient descent
B) K-means
C) Centroids
D) Hierarchical methods
29) Which of the following are two basic approaches for generating a hierarchical clustering?
A) Agglomerative
B) Divisive
C) Inclusive
D) Conglomerate
30) _________ aims to overcome some of the disadvantages of a cluster hierarchy by using probabilistic
models to measure distances between clusters.
A) Multiphase hierarchical clustering
B) Divisive hierarchical clustering
C) Density-based clustering
D) Probabilistic hierarchical clustering
33) Grid based clustering approach quantizes the object space into a finite number of cells.
TRUE
34) The process of grouping a set of physical or abstract objects into classes of similar objects is called
clustering.
TRUE
35) In a fuzzy clustering, every object belongs to every cluster with a membership weight that ranges
between 0 and 5.
FALSE
36) Clustering is a relatively simple field.
FALSE
37) CLIQUE is a simple grid-based method for finding density-based clusters in subspaces.
TRUE
2) The example of holding the number of data points constant, while increasing the number of dimensions,
and thus the volume occupied by those points
A. Illustrates the flexibility of K-means to handle high dimensionality, low number of data points
B. Illustrates why DBSCAN performs poorly on high dimensional data sets, as the Euclidean definition
of density as being the number of data points per unit of volume becomes meaningless as the volume
increases with the adding of dimensions
C. Illustrates how the higher the number of dimensions forces the number of clusters higher, until at
some point every cluster will consist of one data point
D. All of the above
E. None of the above
3) If the classes of the training data are an unknown, we can apply which of these computational algorithms
to attempt to find useable classes
A. Binary Sort
B. Bayesian Analysis
C. Pruning Analysis
D. Clustering
4) The example of scale as an issue in clustering where the attributes used to do the clustering are height
and weight of a group of people, and height is measured in meters and weight is measured in kilograms,
thus favoring weight as the main attribute determining cluster membership is true because:
A. The numeric value of meters will always be a much smaller number than the numeric value of
kilograms associated with any given individual (e.g. 2 meters, 100 kg), thus favoring weight in
determining similarity or dissimilarity between two people.
B. The two attributes are simply too different to be useable by any numeric-based clustering algorithm
C. Height is not a serious differential attribute when comparing people, while weight is very much more
a determinant of the similarity of two or more people
D. None of the above
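Illustration for question 4: a small sketch of how raw units let one attribute dominate a Euclidean distance, and how z-score standardization puts the attributes on a comparable scale. Standardization is one common remedy assumed here for illustration; the question itself does not prescribe it. The function name and the measurements are hypothetical.

```python
from math import dist
from statistics import mean, pstdev


def z_scores(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


# Hypothetical people: heights in meters, weights in kilograms.
heights = [1.60, 1.75, 1.90]
weights = [55.0, 75.0, 95.0]

raw = list(zip(heights, weights))
scaled = list(zip(z_scores(heights), z_scores(weights)))

# Raw distance is almost entirely the 40 kg weight gap; the 0.30 m height gap barely registers.
print(dist(raw[0], raw[2]))
# After scaling, height and weight contribute equally to the distance.
print(dist(scaled[0], scaled[2]))
```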
5) Processing order of the data
A. Never impacts the accuracy of any known clustering algorithm
B. While important to how accurately certain clustering algorithms perform, is in some cases an
insufficient reason for not using a given algorithm, depending on other worthy characteristics of the
algorithm and the amount of inaccuracy introduced
C. While important to how accurately certain clustering algorithms perform, the inaccuracies are
determinant, and thus can be accounted for through normalization techniques
D. None of the above
6) K-means clustering algorithms are indeterminate as to results from run to run on the same data because
A. The clusters that are presented at the end are chosen from a group of clusters produced during
processing using a randomly initialized selecting function
B. The initialization step for K-means algorithms includes a random selection of the beginning
population of centroid objects
C. As clusters are generated and regenerated, the algorithm calls a randomization function to choose a
different centroid for each of the current population of clusters
D. None of the above
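Illustration for question 6: a bare-bones K-means sketch in which the only source of randomness is the choice of initial centroids, made here by sampling k data points with a seeded generator. The function name, the fixed iteration count and the data are hypothetical simplifications. Production implementations add convergence checks and smarter initialization.

```python
import random
from math import dist


def kmeans(points, k, seed, iterations=20):
    """Return k centroids found from a randomly chosen set of starting points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids


# Hypothetical 2-D data set.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
print(kmeans(points, k=2, seed=1))
print(kmeans(points, k=2, seed=2))
```

Fixing the seed makes a run repeatable. Leaving it unfixed means different runs can start from different centroids and may settle on different final clusters.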
8) Since graphing the likelihood of the data for different values of the parameters used in a clustering
algorithm is not generally feasible, we
A. Choose the parameter values randomly initially, and then keep iterating with the algorithm with
different values until we get values that are accurate for the data.
B. Finesse the issue by normalizing the values based on certain known characteristics of the data,
such as the number of data objects to be clustered and number of attributes being used to
determine cluster membership
C. Rely on the fact that for a Gaussian distribution the mean and standard deviation of the sample
data points are the maximum likelihood estimates for the corresponding parameters for the
underlying distribution
D. None of the above
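Illustration for question 8: a minimal sketch of the maximum likelihood estimates for a Gaussian, which are the sample mean and the standard deviation computed with a divisor of n (the population form), not n - 1. The function name and the sample are hypothetical.

```python
from statistics import fmean, pstdev


def gaussian_mle(sample):
    """Return (mean, standard deviation) maximum likelihood estimates for a Gaussian."""
    return fmean(sample), pstdev(sample)


# Hypothetical 1-D sample.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
mu, sigma = gaussian_mle(sample)
print(mu, sigma)
```

No search over parameter values is needed in this case, since both estimates come directly from closed-form expressions on the sample.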
10) Assume that you want to analyze legal issues involved in the lives of people with certain medical
conditions, which of these clustering algorithms should be used to discover the appropriate documents
for helpful information:
A. An algorithm for determining clusters of documents based on the most prominent thematic
element in each document
B. An algorithm that determines clusters based on the two most prominent thematic elements in
each document
C. An algorithm that assigns a probability of cluster membership for each document across all of
the clusters using all the identified thematic elements of each document
D. None of the above
11) If a particular document belongs to the cluster LEGAL THEME with a degree of participation above 0.5
and to the cluster MEDICAL THEME with a degree above 0.7, then we can say that the likelihood of
that document being of interest to those studying the connections between medicine and the law is
strong.
Answer true
12) If a particular document belongs to the cluster LEGAL THEME with a degree of participation of 1 and
to the cluster MEDICAL THEME with a degree of 0, then we can say that the likelihood of that
document being of interest to those studying the connections between medicine and the law is low, but
not necessarily non-existent.
Answer true
13) If the objects to be analyzed have ten or fewer attributes, according to Han, Kamber and Pei, the data are
of low dimensionality
14) When analyzing high dimensional data, the attributes of the objects in a cluster that specifically
determined those objects to be placed in that cluster are needed for the final analysis to be accurately
informed, unlike with low dimensional data, where only the cluster objects need to be given over for
analysis
Answer true
15) High dimensional data typically contains larger clusters than does low-dimensional data
Answer false
16) Computationally, high-dimensional data sets often are very costly because they contain an exponential
number of subspaces
17) Top-down approaches to clustering high dimensional data only work effectively when the subspace of a
given cluster can be relied upon to be determinable by the local neighborhood (locality assumption)
18) Bi-clustering involves clustering both by the objects and the attributes, simultaneously
19) Which of these clustering algorithms allows an object to participate in multiple clusters, or even to not
participate in any cluster
A) Fuzzy Clustering
B) Probabilistic Model-Based Clustering
C) Graph Clustering
D) Bi-Clustering
20) A constraint that an analyst is willing to change in order to obtain a more realistic clustering of the data
is called a soft constraint
22) These two methods of cluster analysis assign an object to one or more clusters.
A. Partition matrix
B. Fuzzy clustering
C. K-means clustering
D. Probabilistic model-based clustering
23) A ________________ assumes that a set of observed objects is a mixture of instances from multiple
probabilistic clusters.
A. Matrix model
B. Decision tree
C. Partition matrix
D. Expectation-maximization algorithm
29) By strictly respecting the constraints in the cluster assignment process, __________________ can be
enforced.
A. Soft constraints for clustering
B. SCAN
C. Hard constraints for clustering
D. Cannot-link constraints
31) Traditional clustering methods require each object to belong to many clusters.
FALSE
38) Constraints can be used to express application-specific requirements or background knowledge for
cluster analysis.
TRUE
40) Compare the SCAN algorithm with DBSCAN. What are their similarities and
differences?
Week 13
1) Which of these is not typically thought to be an application area where outlier or anomaly detection is
paramount
A. Fraud detection
B. Intrusion detection
C. Ecosystem disturbances
D. Detection of books of interest to a library client
4) Which of these are not normally found to be useful in outlier / anomaly detection
A. Fuzzy-based techniques
B. Model-based techniques
C. Proximity-based techniques
D. Density-based techniques
5) Distance measures are often preferred to objective function measures when assessing the extent to which a
given object belongs to a cluster because
A. Distance measures are always more accurate than measures of objective function improvement
B. Determining objective function improvement when the given object is eliminated from the cluster is
quite often computationally intensive, thus costly
C. Very few clustering techniques exist that use objective functions, and even when they do, distance
measures still can be applied, and we understand distance measures more deeply than we do objective
functions
D. Is not recommended by experts in the field, as our theoretical knowledge of objective functions is
very sparse – more research is needed
6) Which of these does not make the use of clustering problematic in detecting outliers:
A. Outliers impact the clustering process indeterminately, thus it is problematic as to whether the
clusters are real clusters, and thus which members are legitimate outliers
B. Clustering techniques that do not automatically determine the number of clusters can give different
results as to which objects are true outliers based on the number of predetermined clusters with
which the algorithm is seeded
C. Assessing accurately the extent to which a given object belongs to a particular cluster is not a well-
developed, well understood process at this time
D. All of the above
E. None of the above
9) Which of these situations is referred to as a masking problem when attempting to detect outliers
A. Anomalies in the data set distort the model being applied to detect outliers, making outlier objects
appear as normal objects
B. There is no such situation, no such problem as masking in outlier detection
C. Using one at a time anomaly detection fails to detect objects where the presence of two or more
anomalies hides the presence of all the anomalies for that object
D. None of the above
10) When detecting outliers in a univariate normal distribution, which of these situations could negatively
impact the accuracy of the detection:
A. Mean and standard deviation values are known, the number of observations is large
B. Mean and standard deviation are not known, and so must be estimated and the number of
observations is small
C. Mean and standard deviation are not known, and so must be estimated and the number of
observations is very large
D. Mean and standard deviation values are known, the number of observations is small
E. None of the above
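The effect of estimating the parameters from a small sample can be illustrated with a z-score test. This is a minimal sketch under stated assumptions: the threshold of 3 standard deviations, the sample values, and the helper names z_scores and flag_outliers are all illustrative choices, not part of the question.

```python
from statistics import mean, stdev

def z_scores(values, mu=None, sigma=None):
    """Return |z| for each value; estimate mu and sigma when not supplied."""
    if mu is None:
        mu = mean(values)
    if sigma is None:
        sigma = stdev(values)  # sample standard deviation (n - 1 divisor)
    return [abs(v - mu) / sigma for v in values]

def flag_outliers(values, threshold=3.0, mu=None, sigma=None):
    """Return the values whose |z| exceeds the threshold."""
    scores = z_scores(values, mu, sigma)
    return [v for v, z in zip(values, scores) if z > threshold]

# Hypothetical sample of five observations; 25.0 is the suspicious one.
sample = [9.0, 10.0, 11.0, 10.0, 25.0]

# Parameters known in advance (assumed here to be mean 10, std 1).
known = flag_outliers(sample, mu=10.0, sigma=1.0)

# Parameters estimated from the same small sample, outlier included.
estimated = flag_outliers(sample)
```

With known parameters the value 25.0 is flagged. With estimated parameters it is not, because the extreme value inflates both the estimated mean and the estimated standard deviation, and a sample of five points cannot produce a z-score above 3 at all.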
11) Outlier detection is the inverse of cluster analysis in that the former process seeks to find objects that do
not fit the majority patterns, while cluster analysis seeks to find the majority patterns in a data set and
organize the data using those patterns.
Answer true
12) If someone steals your credit card, the fact that they use it to buy breakfast at a different place in your
town than you normally go is a clear signal of an outlier transaction
Answer false
13) Noise in the data never interferes with any known outlier detection methods, so preprocessing of data
sets to remove noise data points is not necessary
Answer false
14) An object is a contextual or conditional outlier if it deviates significantly with respect to a specific
context of the object
15) Collective outliers are a group of objects that, as individual objects, might not be outliers, but as a group,
they are
Answer true
16) As the dimensionality of the data being analyzed increases, the distance between objects may become a
function of the noise in the data, thus the distance and similarity between two points in a high-dimension
data space may not accurately reflect the true relationship between those two points.
17) The HilOut algorithm detects outliers in the full space of high dimensioned data sets, not resorting to
examining sub-spaces with lower dimensionality
18) Using algorithms that determine outliers in high dimensioned data sets in their sub-spaces is uniquely
useful because critical information exists there that supports interpreting why and to what extent
an outlier is deemed an outlier
19) Using classification-based approaches for outlier detection requires the existence of a training set with
data that is labeled in a way that explicitly separates normal data from outlier data.
20) If an object falls outside the decision boundary of the normal class when applying a one-class model, the
object is an outlier, even if the training set did not include an example of that kind of outlier
Answer: true
21) A(n) _______________ is a data object that deviates significantly from the rest of the objects, as if it
were generated by a different mechanism.
A. deviant
B. exception
C. outlier
D. preliminary
23) These outliers are the simplest form and the easiest to detect.
A. Global
B. Contextual
C. Collective
D. Local
24) Outliers are most commonly also called ________________.
A. Deviants
B. Exceptions
C. Preliminaries
D. Anomalies
25) ________________________ assume that the normal data objects follow a statistical model.
A. Theoretical outlier detection methods
B. Proximity based outlier detection methods
C. Statistical outlier detection methods
D. Clustering-based outlier detection methods
26) A subset of data objects forms a __________________ if the objects as a whole deviate significantly
from the entire data set.
A. Cluster
B. Collective outlier
C. Contextual outlier
D. Global outlier
27) The assessment of the degree to which something is an outlier is called ____________.
A. Outlier score
B. Evaluation score
C. Masking score
D. Outlier degree
28) When the presence of several anomalies masks the presence of all, it is called:
A. Swamping
B. Efficiency
C. Anomaly detection
D. Masking
29) Statistical distributions with values far from the mean are common in practice and are known as
___________________.
A. Heavy-tailed distributions
B. Discordant observation
C. Gaussian distribution
D. Binomial distribution
30) __________________________ assume that the normal data objects belong to large and dense clusters,
whereas outliers belong to small or sparse clusters, or do not belong to any clusters.
A. Classification-based outlier detection method
B. Clustering-based outlier detection method
C. Contextual-based outlier detection method
D. Proximity-based outlier detection method
34) Because clusters may form a hierarchy, outliers may belong to different granularity levels.
TRUE
35) Cluster techniques like K-means automatically determine the number of clusters.
FALSE
36) When outliers are detected, there is no question of whether the results are valid.
FALSE
37) Proximity-based outlier detection methods generally take more time than distance-based methods.
TRUE
38) A mixture model approach for anomaly detection assumes that data comes from just a single probability
distribution.
FALSE
39) Name three main approaches for outlier detection methods of high dimensional data.
_________________________
_________________________
_________________________
40) Using an equal-depth histogram, design a way to assign an outlier score to an object.
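One possible design, offered only as a sketch and not as the expected answer: because every bin of an equal-depth histogram holds about the same number of objects, a wide bin marks a sparse region, so the bin's value spread per object can serve as the score. The function name equal_depth_scores, the choice of scaling by the widest bin, and the sample data are all assumptions made for illustration.

```python
def equal_depth_scores(values, k):
    """Score each value by the sparseness of its equal-depth bin.

    Returns (value, score) pairs in sorted order, with scores in [0, 1].
    """
    ordered = sorted(values)
    n = len(ordered)
    chunks = []
    for i in range(k):
        # Each bin receives (nearly) the same number of sorted values.
        chunk = ordered[i * n // k:(i + 1) * n // k]
        if chunk:
            chunks.append(chunk)
    # Spread per object: large when the bin covers a wide, sparse range.
    spreads = [(c[-1] - c[0]) / len(c) for c in chunks]
    widest = max(spreads) or 1.0  # avoid dividing by zero if all spreads are 0
    return [(v, s / widest) for c, s in zip(chunks, spreads) for v in c]

# Hypothetical one-dimensional data with one extreme value.
scores = equal_depth_scores([1, 2, 3, 4, 5, 6, 7, 8, 50], k=3)
```

The score is assigned per bin, so objects that share a bin with an extreme value inherit its high score (here 7 and 8 score the same as 50). A fuller answer would need to address that limitation, for example by also considering the distance to the neighboring objects inside the bin.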
Week 14
1) If you were to attempt to mine a set of digitized Hollywood-produced films for major thematic
references, you would be working with what kind of data from a data mining perspective
A. Mixed format
B. Problematic
C. Complex
D. Visual
3) If we find that the same four five-second sub-sequences occur, in order, every 45 seconds in a radio signal,
this is an example of a:
A. Random movement over time
B. Seasonal variation movement over time
C. Trend movement over time
D. Cyclic movement over time
E. None of the above
4) When pattern matching in sequence data where the sub-sequences in a derived pattern need to
be very close together, we apply
A. A constraint during processing that skips any two subsequences that are too far apart.
B. An algorithm that finds all possible patterns of interest, then re-sifts these patterns, throwing out those
where the sub-sequences are too far apart
C. A gap constraint in our algorithm that does not allow a pattern to be derived if the sub-sequences are
too close together
D. B & C
E. None of the above
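A maximum-gap constraint of the kind discussed in this question can be sketched as follows. This is an illustrative example only: the function name matches_with_max_gap, the parameter max_gap, and the convention that the gap counts the items skipped between consecutive matches are assumptions, not definitions from the course text.

```python
def matches_with_max_gap(sequence, pattern, max_gap):
    """Return True if pattern occurs in sequence, in order, with at most
    max_gap items skipped between consecutive matched elements."""

    def search(start, pat_idx, bounded):
        if pat_idx == len(pattern):
            return True
        # The first pattern element may match anywhere; later elements
        # must appear within max_gap skipped items of the previous match.
        end = len(sequence)
        if bounded:
            end = min(end, start + max_gap + 1)
        for i in range(start, end):
            if sequence[i] == pattern[pat_idx]:
                # Backtrack over alternative positions if this one fails.
                if search(i + 1, pat_idx + 1, True):
                    return True
        return False

    return search(0, 0, False)

# Hypothetical symbol sequence: two items separate "B" and "C".
tight = matches_with_max_gap("AXBXXC", "ABC", max_gap=1)
loose = matches_with_max_gap("AXBXXC", "ABC", max_gap=2)
```

Applying the constraint during the search, as here, prunes candidates early, in contrast to generating every pattern first and filtering afterward.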
6) If we find in a network of nodes representing the occurrence of specific symbolic references in order
through a given text there exists a pattern of “Fool” and “Violence” references always co-occurring in
the text, then
A. We can dismiss this phenomenon as random, accidental, thus meaningless
B. We must then ask the author if this pattern was intentional, because if the author did not intend the
pattern, then it is as if the phenomenon were an artifact of random, accidental insertions into the text
C. We can conclude that the author of the text, consciously or unconsciously, views violence and fools
to be closely aligned with each other
D. None of the above
7) If we want to discover what set of games a given Facebook user is likely to choose to play
A. We need to talk to the user
B. We need to perform a link prediction analysis looking at the games the user is playing or has recently
played in relation to the games that the user's Facebook friends are playing or have recently
played
C. We are embarking on an impossible task
D. We need to perform a cluster analysis of the games played by the user and the user’s friends, looking
for outliers
8) If we want to have a system to warn drivers in a city of impending massive traffic congestion, we will
need to do what kind of mining:
A. Multimedia data mining
B. Text mining
C. Cyber-Physical System Data mining
D. None of the above
9) According to Han, Kamber and Pei, text mining does not involve the input from which of these research
disciplines
A. Statistics
B. Computational linguistics
C. Information Retrieval
D. Library Science
11) Web structure mining involves the use of network and graph mining theory and methods to analyze the
nodes and connections on the Web
12) Mining data streams is best done at the current time using single-scan or a-very-few scan algorithms, as
they are often large in volume, change dynamically, are possibly infinite and are multidimensional in
their features.
Answer true
13) Regression is a statistical approach to data mining that involves predicting the value of a response variable
based on the values of one or more predictor variables, all of which must be numeric
Answer: true
14) According to Han, Kamber and Pei, data reduction is a theory concerning the basis of data mining that is
the only useful theory concerning data mining
Answer: false
15) According to Han, Kamber and Pei, the various theories for a basis of data mining are not
mutually exclusive
16) According to Han, Kamber and Pei, the ideal theoretical framework for data mining should model
typical data mining tasks, have a probabilistic nature, handle multiple forms of data, but can safely
ignore the iterative and interactive essence of data mining.
Answer: false
17) Data visualization approaches can be combined with data mining in several ways. One such way is to
provide an interactive visual display to allow users to manipulate in real time a given data mining
process
Answer: true
18) One method for improving audio data mining is to transform the patterns in the signals into sound,
perhaps with a musical quality, eliminating a large part of the boredom issue involved in watching
graphical display versions of audio data
Answer: true
19) According to Han, Kamber and Pei, the high dimensionality of retail data on sales, customers, products,
time, and region makes such applications as determining effective sales campaigns an NP-hard problem,
that is, impossible.
Answer: false
20) According to Han, Kamber and Pei, since scientific data can now be amassed at high speed and large
volumes, scientists have turned to a new paradigm of research. Instead of the researcher-generated
hypothesis and test approach, they now collect, store and data mine for the hypotheses, which are then
confirmed with data and/or experimentation
Answer: true
21) __________________ integrates data mining and data visualization to discover implicit and useful
knowledge from large data sets.
A. Audio data mining
B. Theoretical data mining
C. Invisible data mining
D. Visual data mining
22) __________ uses audio signals to indicate data patterns or features of data mining results.
A. Audio data mining
B. Theoretical data mining
C. Invisible data mining
D. Visual data mining
23) The constant presence of data mining in many aspects of our daily lives is called:
A. Invisible data mining
B. Audio data mining
C. Ubiquitous data mining
D. Visual data mining
25) Data mining tools for particular industries, such as finance, retail, and telecommunications, are called:
A. Statistical methods
B. Theoretical data mining
C. Audio data mining
D. Domain-specific applications
28) Data mining that relates to both space and time is called:
A. Moving-object data mining
B. Spatial Data mining
C. Spatiotemporal data mining
D. Multimedia data mining
29) The discovery of relationships among multiple moving objects, such as moving clusters, leaders, and
followers, is called:
A. Moving-object data mining
B. Spatial Data mining
C. Spatiotemporal data mining
D. Multimedia data mining
31) An abundance of personal or confidential information available in electronic forms poses a threat to data
privacy and security.
TRUE
32) Research on the theoretical frameworks of data mining has matured significantly.
FALSE
33) The microeconomic view is a theory of data mining that considers finding patterns but is not concerned with
the utility of patterns.
FALSE
34) According to the data mining theory of data compression, the basis of data mining is to compress given
data by encoding in terms of bits, association rules, decision trees, and clusters.
TRUE
35) Web usage mining is the process of extracting useful information from server logs.
TRUE
37) Web structure mining utilizes graph and network theories to analyze the connection structures of the
web.
TRUE
38) Few classification methods perform model construction based on feature vectors.
FALSE
40) Data mining trends include further efforts toward the exploration of new application areas.
TRUE