DWM QB Cyse
Class: TE SEM: V
Subject: Data Warehousing and Mining
Module-wise Questions
Introduction to Data Warehouse
4. For a supermarket chain, consider the following dimensions: Product, Store, Time and
Promotion. The schema contains a central fact table for sales.
i) Design a star schema for the above application.
ii) Calculate the maximum number of base fact table records for the warehouse, given the
following values:
Time Period – 5 years
Store – 300 stores reporting daily sales
Product – 40,000 products in each store (about 4000 sell in each store daily)
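A rough sketch of the arithmetic behind part (ii) follows. It assumes 365-day years and that a sold item appears in at most one promotion per store per day; neither assumption is stated in the question, and the variable names are illustrative only.

```python
# Illustrative estimate of base fact table rows for the supermarket question.
DAYS_PER_YEAR = 365            # assumption: leap days ignored

years = 5
stores = 300
products_stocked = 40_000      # products carried by each store
products_sold_daily = 4_000    # products that actually sell per store per day
promotions_per_sale = 1        # assumption: one promotion per sold item per day

days = years * DAYS_PER_YEAR

# Rows when only products with actual sales are recorded (no zero-sales rows).
rows_sold_only = days * stores * products_sold_daily * promotions_per_sale

# Rows if every stocked product were recorded every day, including zero sales.
rows_all_products = days * stores * products_stocked * promotions_per_sale

print(f"days = {days}")
print(f"rows (sold products only) = {rows_sold_only:,}")
print(f"rows (all stocked products) = {rows_all_products:,}")
```

Which of the two figures counts as the "maximum" depends on whether zero-sales rows are stored, so an answer should state that choice explicitly.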
A manufacturing company has a huge sales network. To control sales, the network is divided into
regions. Each region has multiple zones. Each zone has different cities. Each salesperson is
allocated different cities. The objective is to track sales figures at different granularity levels of
region and to count the number of products sold. Design a star schema by considering levels for
region, salesperson and time. Convert the star schema to a snowflake schema.
5. What is meant by an Information Package Diagram? For recording the information requirements
of “Hotel Occupancy”, with dimensions such as time, hotel, etc., give the information package
diagram. Also draw the star schema and snowflake schema.
6. A bank wants to develop a data warehouse for effective decision-making about their loan
schemes. The bank provides loans to customers for various purposes like house building loan,
car loan, educational loan, personal loan, etc. The whole country is categorized into a number
of regions, namely, north, south, east and west. Each region consists of a set of states. Loan is
disbursed to customers at interest rates that change from time to time. Also, at any given point
of time, the different types of loans have different rates. The data warehouse should record an
entry for each disbursement of loan to customer.
(a) Design an information package diagram for the application.
(b) Design a star schema for the data warehouse, clearly identifying the fact table(s),
dimension table(s), their attributes and measures.
7. Consider the following database for a chain of bookstores.
BOOKS (Booknum, Primary_author, Topic, Total_Stock, price)
BOOKSTORE (Storenum, City, State, Zip, Inventory_value)
STOCK (Storenum, Booknum, Qty)
With respect to the above business scenario, answer the following questions. Clearly state any
reasonable assumptions you make.
(a) Design an information package diagram.
(b) Design a star schema for the data warehouse, clearly identifying the fact table(s),
dimension table(s), their attributes and measures.
8. State and explain the various schemas used in data warehousing with examples for each of
them.
Shah and Anchor Kutchhi Engineering College, Chembur
Department of Cyber Security
ETL Process
1. How can the ETL cycle be used in a typical data warehouse? Explain with a suitable example.
2. Describe the steps of ETL process.
3. Write a short note on: a) ETL Process
OLAP
1. What is Data Mining? What are techniques and applications of data mining?
2. Explain data mining as a step in KDD. Give the architecture of typical DM system.
3. Describe the various functionalities of data mining as a step in the process of knowledge
discovery.
4. Discuss Architecture of a typical data mining system.
Data Preprocessing
1. What is classification? What are the issues in classification? Apply a statistical-based algorithm to
obtain the actual probabilities of each event to classify the new tuple as tall. Use the following data:
Person ID   Name        Gender   Height   Class
1           Kristina    Female   1.6 m    Short
2           Jim         Male     2 m      Tall
3           Maggi       Female   1.9 m    Medium
4           Marya       Female   2.1 m    Tall
5           Stephanie   Female   1.7 m    Short
6           Bob         Male     1.85 m   Medium
7           Catherine   Female   1.6 m    Short
8           Dave        Male     1.7 m    Short
9           Wilson      Male     2.2 m    Tall
2. A simple example from the stock market, involving only discrete ranges, has Profit as the
categorical attribute, with values {Up, Down}. The training data is:
AGE   COMPETITION   TYPE       PROFIT
Old   Yes           Software   Down
Old   No            Software   Down
Old   No            Hardware   Down
Mid   Yes           Software   Down
Mid   Yes           Hardware   Down
Mid   No            Hardware   Up
Mid   No            Software   Up
New   Yes           Software   Up
New   No            Hardware   Up
New   No            Software   Up
Apply the decision tree algorithm and show the generated rules.
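As a hedged starting point for the question above, the sketch below computes the information gain of each attribute to pick the root split. It assumes an ID3-style tree using entropy, which the question does not specify; the function and variable names are illustrative.

```python
from collections import Counter
from math import log2

# Training data from the question: (Age, Competition, Type, Profit).
ROWS = [
    ("Old", "Yes", "Software", "Down"),
    ("Old", "No", "Software", "Down"),
    ("Old", "No", "Hardware", "Down"),
    ("Mid", "Yes", "Software", "Down"),
    ("Mid", "Yes", "Hardware", "Down"),
    ("Mid", "No", "Hardware", "Up"),
    ("Mid", "No", "Software", "Up"),
    ("New", "Yes", "Software", "Up"),
    ("New", "No", "Hardware", "Up"),
    ("New", "No", "Software", "Up"),
]
ATTRIBUTES = {"Age": 0, "Competition": 1, "Type": 2}


def entropy(rows):
    """Shannon entropy of the class label (last column) in bits."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def information_gain(rows, col):
    """Entropy reduction obtained by splitting rows on column col."""
    total = len(rows)
    remainder = 0.0
    for value in {row[col] for row in rows}:
        subset = [row for row in rows if row[col] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(rows) - remainder


gains = {name: information_gain(ROWS, col) for name, col in ATTRIBUTES.items()}
root = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"Gain({name}) = {gain:.4f}")
print("Root attribute:", root)
```

Under these assumptions Age has the highest gain, so the Old and New branches become pure leaves and only the Mid branch needs a further split.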
3. Define linear, non-linear and multiple regressions. Plan a regression model for Disease
development with respect to change in weather parameters.
4. Write a short note on: Decision Tree Based Classification Approach
5. Why naïve Bayesian classification is called “naive”? Briefly outline the major ideas of naïve
Bayesian classification.
6. Apply the Naive Bayes classifier algorithm for buys_computer classification and classify the
tuple X = (age = “young”, income = “medium”, student = “yes”, credit_rating = “fair”).
Clustering
1. What is K-means clustering? Apply the K-means algorithm to the following data for two
clusters. Data set {10, 4, 2, 12, 3, 20, 30, 11, 25, 31}
2. What is clustering? Explain K-means clustering algorithm. Suppose the data for clustering is
{2, 4, 10, 12, 3, 20, 30, 11, 25} consider K=2, cluster the given data using above algorithm.
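The following is a minimal sketch of one-dimensional K-means for the data in question 2. It assumes the first two data points (2 and 4) are the initial centroids and that ties go to the lower-numbered cluster; the question fixes neither choice, and a different start can take a different path.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Plain K-means on scalar data; returns (clusters, centroids)."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no change, so the algorithm has converged
            break
        centroids = new_centroids
    return clusters, centroids


DATA = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, centroids = kmeans_1d(DATA, centroids=[2, 4])  # assumed initial centroids

for cluster, centre in zip(clusters, centroids):
    print(sorted(cluster), "mean =", centre)
```

With these starting centroids the run settles on {2, 3, 4, 10, 11, 12} with mean 7 and {20, 25, 30} with mean 25.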
3. What is clustering technique? Discuss the agglomerative algorithm using following data and
plot a Dendrogram using single link approach. The following figure contains sample data items
indicating the distance between the elements:
Item   E   A   C   B   D
E      0   1   2   2   3
A      1   0   2   5   3
C      2   2   0   1   6
B      2   5   1   0   3
D      3   3   6   3   0
a. K-means Clustering
5. Apply agglomerative hierarchical clustering and draw single link and average link dendrogram for
the following distance matrix.
     A    B    C    D    E
A    0    2    6    10   9
B    2    0    3    9    8
C    6    3    0    7    5
D    10   9    7    0    4
E    9    8    5    4    0
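The merge order for the matrix above can be checked with a short sketch like the one below. It is a naive agglomerative loop written for this 5x5 matrix only, not a library routine, and the helper names are illustrative.

```python
from itertools import combinations

LABELS = ["A", "B", "C", "D", "E"]
DIST = [
    [0, 2, 6, 10, 9],
    [2, 0, 3, 9, 8],
    [6, 3, 0, 7, 5],
    [10, 9, 7, 0, 4],
    [9, 8, 5, 4, 0],
]


def linkage_distance(c1, c2, method):
    """Distance between two clusters given as tuples of item indices."""
    pair_dists = [DIST[i][j] for i in c1 for j in c2]
    if method == "single":
        return min(pair_dists)
    if method == "average":
        return sum(pair_dists) / len(pair_dists)
    raise ValueError(f"unknown linkage method: {method}")


def agglomerate(method):
    """Return the merges as (cluster label, merge height), in merge order."""
    clusters = [(i,) for i in range(len(LABELS))]
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: linkage_distance(pair[0], pair[1], method),
        )
        height = linkage_distance(a, b, method)
        merged = tuple(sorted(a + b))
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
        merges.append(("".join(LABELS[i] for i in merged), height))
    return merges


single_merges = agglomerate("single")
average_merges = agglomerate("average")
print("single :", single_merges)
print("average:", average_merges)
```

The merge heights give the vertical positions of the joins in each dendrogram; the two linkages agree on the first merge (A with B) and differ in the order and heights of later ones.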
6. Illustrate how the supermarket can use clustering methods to improve sales.
7. Find clusters using the k-means clustering algorithm, given several objects (4 types of medicines)
where each object has two attributes or features as shown in the table below. The goal is to group
these objects into k=2 groups of medicine based on the two features (pH and weight index).
Object       Attribute 1 (X): Weight Index   Attribute 2 (Y): pH
Medicine A   1                               1
Medicine B   2                               1
Medicine C   4                               3
Medicine D   5                               4
9. Consider the data given below and create the adjacency matrix. Apply the single link algorithm to
cluster the given data set and draw the dendrogram.
Object   Attribute 1 (X)   Attribute 2 (Y)
A        2                 2
B        3                 2
C        1                 1
D        3                 1
E        1.5               0.5
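One way to build the matrix asked for here is sketched below. It assumes that "adjacency matrix" means the matrix of pairwise Euclidean distances, which the question does not state; the names are illustrative.

```python
from itertools import combinations
from math import hypot

# Object coordinates from the question: (X, Y).
POINTS = {
    "A": (2, 2),
    "B": (3, 2),
    "C": (1, 1),
    "D": (3, 1),
    "E": (1.5, 0.5),
}

# Pairwise Euclidean distances (assumed distance measure).
matrix = {
    p: {q: hypot(POINTS[p][0] - POINTS[q][0], POINTS[p][1] - POINTS[q][1]) for q in POINTS}
    for p in POINTS
}

# The closest pair is the first merge in single-link clustering.
closest = min(combinations(POINTS, 2), key=lambda pair: matrix[pair[0]][pair[1]])

print("     " + "  ".join(f"{q:>5}" for q in POINTS))
for p in POINTS:
    print(f"{p:>3}  " + "  ".join(f"{matrix[p][q]:5.2f}" for q in POINTS))
print("first merge:", closest)
```

The remaining merges follow by repeatedly taking the smallest single-link distance between the current clusters.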
Apply the Apriori Algorithm with minimum support of 30% and minimum confidence of 75% and
find the large item set L.
2. What is association rule mining? Give the Apriori Algorithm. Apply association rule mining to
find all frequent itemsets from the following table:
TID Items
100 1, 2, 5
200 2, 4
300 2, 3
400 1, 2, 4
500 1, 3
600 2, 3
700 1, 3, 2, 5
800 1, 3
900 1, 2, 3
TID Items
01 A, B, C, D
02 A, B, C, D, E, G
03 A, C, G, H, K
04 B, C, D, E, K
05 D, E, F, H, L
06 A, B, C, D, L
07 B, I, E, K, L
08 A, B, D, E, K
09 A, E, F, H, L
10 B, C, D, F
Apply the Apriori Algorithm with minimum support of 30% and minimum confidence of
70% and find all the association rules in the data set.
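A compact sketch of a level-wise Apriori pass over the ten transactions (TID 01 to 10) is given below, using the stated thresholds of 30% support and 70% confidence. It is a teaching sketch rather than an optimised implementation, and the function names are illustrative.

```python
from itertools import combinations

# Transactions TID 01..10 from the table above, one letter per item.
TRANSACTIONS = [
    set("ABCD"), set("ABCDEG"), set("ACGHK"), set("BCDEK"), set("DEFHL"),
    set("ABCDL"), set("BIEKL"), set("ABDEK"), set("AEFHL"), set("BCDF"),
]
MIN_SUPPORT_PCT = 30      # minimum support, in percent
MIN_CONFIDENCE_PCT = 70   # minimum confidence, in percent


def support_count(itemset):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in TRANSACTIONS if itemset <= t)


def is_frequent(itemset):
    # Integer comparison avoids floating-point rounding at the threshold.
    return support_count(itemset) * 100 >= MIN_SUPPORT_PCT * len(TRANSACTIONS)


def apriori():
    """Return {frequent itemset: support count}, built level by level."""
    items = sorted({i for t in TRANSACTIONS for i in t})
    level = [frozenset([i]) for i in items if is_frequent(frozenset([i]))]
    frequent = {}
    while level:
        for itemset in level:
            frequent[itemset] = support_count(itemset)
        size = len(level[0]) + 1
        previous = set(level)
        # Join step: union pairs of k-itemsets that form a (k+1)-itemset.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step: every k-subset must be frequent, then check support.
        level = [
            c for c in candidates
            if all(frozenset(s) in previous for s in combinations(c, size - 1))
            and is_frequent(c)
        ]
    return frequent


def strong_rules(frequent):
    """Return (antecedent, consequent, confidence) for rules meeting the threshold."""
    rules = []
    for itemset, count in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(sorted(itemset), r)):
                if count * 100 >= MIN_CONFIDENCE_PCT * frequent[antecedent]:
                    rules.append((antecedent, itemset - antecedent, count / frequent[antecedent]))
    return rules


frequent = apriori()
rules = strong_rules(frequent)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), count)
print(len(rules), "strong rules")
```

Thresholds are kept as integer percentages so that itemsets sitting exactly on the 30% boundary (3 of 10 transactions) are not lost to rounding.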
6. A Database has five transactions. Let min-support=60% and min-confidence = 80%. Find all
frequent item sets by using Apriori Algorithm. T_ID is the transaction ID.
Find all frequent item sets using the Apriori Algorithm. List strong association rules.
8. A Database has five transactions. Let min-support=30% and min-confidence = 70%. Find all
frequent patterns using Apriori Algorithm. List strong association rules.