0% found this document useful (0 votes)
77 views4 pages

Assignment No: 1: Q1) A Dimension Table Is Wide, The Fact Table Is Deep. Explain?

Download as docx, pdf, or txt
Download as docx, pdf, or txt
Download as docx, pdf, or txt
You are on page 1/ 4

Assignment No: 1

Q1] A dimension table is wide, the fact table is deep. Explain?


 Dimension table has got all the detail information of their respective table for eg.
Customer dimension table will contain all the related info about customers whereas fact
table contains the main data ,which contain the surrogate keys of every dimensions along
with other measures.
 A dimension table contains a higher granular information so have less no of records and
it needs to have all the necessary details related to the grain of the table . On the other
side A fact table has the lowest level grain of a subject area . Lower grain cause more
number of rows in the fact table.

Q2] What is the relationship between data warehousing and data replication? Which
form of replication (synchronous or asynchronous) is better suited for data warehousing?
Why? Explain with appropriate example. 

 Data warehouse are carefully designed databases that hold integrated data for secondary
usage. If you only need data from one system , but cant impact the performance of that
system, then it is suggested to take a copy i.e a replicated data store unless the
complexities and width of data is very large and the various ways to require access to it
are very high , a data warehouse would be overkill.
 A replicated data store is a database that holds schemas from other system but doesn’t
truly integrate the data. This means it is typically in a format similar to what the source
systems had. The value in a replicated data store is that it provides a single source for
reasources to go to inorder to access data from any system without negatively impacting
the performance of that system .
 Data replication is simply a method for creating copies of data in a distributed
environment.
 Replication technology can be used to capture changes to source data.

 Synchronous replication: Synchronous replication is used for creating replicas in


real time . In synchronous replication data is written to primary storage and the
replica is done simultaneously. Primary copy and the replica should always
remain synchronized.
 Asynchronous replication: It is used for creating time delayed replicas. In
asynchronous replication data is written to the primary storage first and then copy
data to the replica.
Q3] Suppose that a data warehouse consists of the three dimensions time, doctor, and
patient, and the two measures count and charge, where charge is the fee that a doctor
charges a patient for a visit.
a. Draw a schema diagram for the above data warehouse using one of the
schemas. [star, snowflake, fact constellation]

b. Starting with the base cuboid [day, doctor, patient], what specific OLAP
operations should be performed in order to list the total fee collected by each
doctor in 2004?

First, we should use roll-up operation to get the year 2004(rolling-up from day
then month to year). After getting that, we need to use slice operation to select
(2004). Second, we should use roll-up operation again to get all patients. Then,
we need to use slice operation to select (all). Finally, we get list the total fee
collected by each doctor in 2004.
So,
1. roll up from day to month to year
2. slice for year = “2004”
3. roll up on patient from individual patient to all
4. slice for patient = “all”
4. get the list of total fee collected by each doctor in 2004

c. To obtain the same list, write an SQL query assuming the data are stored in
a relational database with the schema fee (day, month, year, doctor, hospital,
patient, count, charge)

Select doctor, Sum(charge) From fee Where year = 2004 Group by doctor

Q4] Describe different steps of ETL (Extraction, Transformation Loading) cycle in Data
Warehousing for a pharmaceutical company.
ETL is a process in Data Warehousing and it stands for Extract, Transform and Load. It is a
process in which an ETL tool extracts the data from various data source systems, transforms it in
the staging area and then finally, loads it into the Data Warehouse system.

1. Extraction:
The first step of the ETL process is extraction. In this step, data from various source
systems is extracted which can be in various formats like relational databases, No SQL,
XML and flat files into the staging area. It is important to extract the data from various
source systems and store it into the staging area first and not directly into the data
warehouse because the extracted data is in various formats and can be corrupted also.
Hence loading it directly into the data warehouse may damage it and rollback will be
much more difficult. Therefore, this is one of the most important steps of ETL process.

2. Transformation:
The second step of the ETL process is transformation. In this step, a set of rules or
functions are applied on the extracted data to convert it into a single standard format. It
may involve following processes/tasks:

o Filtering – loading only certain attributes into the data warehouse.


o Cleaning – filling up the NULL values with some default values, mapping U.S.A,
United States and America into USA, etc.
o Joining – joining multiple attributes into one.
o Splitting – splitting a single attribute into multipe attributes.
o Sorting – sorting tuples on the basis of some attribute (generally key-attribbute).
3. Loading:
The third and final step of the ETL process is loading. In this step, the transformed data is
finally loaded into the data warehouse. Sometimes the data is updated by loading into the
data warehouse very frequently and sometimes it is done after longer but regular
intervals. The rate and period of loading solely depends on the requirements and varies
from system to system.

ETL process can also use the pipelining concept i.e. as soon as some data is extracted, it can
transformed and during that period some new data can be extracted. And while the transformed
data is being loaded into the data warehouse, the already extracted data can be transformed. The
block diagram of the pipelining of ETL process is shown below:

Q5] Why is entity-relationship modeling technique not suitable for the data warehouse?
How is dimensional modeling different? 
 ER modelling aims to optimize performance for transaction processing. It is also hard to
query ER models because of the complexity, many tables should be joined to obtain a
result Therefore ER models are not suitable for high performance retrieval of data.
 The entity relationship model is a method used to represent the logical flow of entities
graphically that in turn create a database. It has both logical and physical model. And it is
good for reporting and point queries.
 Dimensional model is a method in which the data is stored in two types of tables namely
facts table and dimension table. It has only physical model. It is good for ad hoc query
queries.

You might also like