Assignment No: 1: Q1) A Dimension Table Is Wide, The Fact Table Is Deep. Explain?
Assignment No: 1: Q1) A Dimension Table Is Wide, The Fact Table Is Deep. Explain?
Assignment No: 1: Q1) A Dimension Table Is Wide, The Fact Table Is Deep. Explain?
Q2] What is the relationship between data warehousing and data replication? Which
form of replication (synchronous or asynchronous) is better suited for data warehousing?
Why? Explain with appropriate example.
Data warehouse are carefully designed databases that hold integrated data for secondary
usage. If you only need data from one system , but cant impact the performance of that
system, then it is suggested to take a copy i.e a replicated data store unless the
complexities and width of data is very large and the various ways to require access to it
are very high , a data warehouse would be overkill.
A replicated data store is a database that holds schemas from other system but doesn’t
truly integrate the data. This means it is typically in a format similar to what the source
systems had. The value in a replicated data store is that it provides a single source for
reasources to go to inorder to access data from any system without negatively impacting
the performance of that system .
Data replication is simply a method for creating copies of data in a distributed
environment.
Replication technology can be used to capture changes to source data.
b. Starting with the base cuboid [day, doctor, patient], what specific OLAP
operations should be performed in order to list the total fee collected by each
doctor in 2004?
First, we should use roll-up operation to get the year 2004(rolling-up from day
then month to year). After getting that, we need to use slice operation to select
(2004). Second, we should use roll-up operation again to get all patients. Then,
we need to use slice operation to select (all). Finally, we get list the total fee
collected by each doctor in 2004.
So,
1. roll up from day to month to year
2. slice for year = “2004”
3. roll up on patient from individual patient to all
4. slice for patient = “all”
4. get the list of total fee collected by each doctor in 2004
c. To obtain the same list, write an SQL query assuming the data are stored in
a relational database with the schema fee (day, month, year, doctor, hospital,
patient, count, charge)
Select doctor, Sum(charge) From fee Where year = 2004 Group by doctor
Q4] Describe different steps of ETL (Extraction, Transformation Loading) cycle in Data
Warehousing for a pharmaceutical company.
ETL is a process in Data Warehousing and it stands for Extract, Transform and Load. It is a
process in which an ETL tool extracts the data from various data source systems, transforms it in
the staging area and then finally, loads it into the Data Warehouse system.
1. Extraction:
The first step of the ETL process is extraction. In this step, data from various source
systems is extracted which can be in various formats like relational databases, No SQL,
XML and flat files into the staging area. It is important to extract the data from various
source systems and store it into the staging area first and not directly into the data
warehouse because the extracted data is in various formats and can be corrupted also.
Hence loading it directly into the data warehouse may damage it and rollback will be
much more difficult. Therefore, this is one of the most important steps of ETL process.
2. Transformation:
The second step of the ETL process is transformation. In this step, a set of rules or
functions are applied on the extracted data to convert it into a single standard format. It
may involve following processes/tasks:
ETL process can also use the pipelining concept i.e. as soon as some data is extracted, it can
transformed and during that period some new data can be extracted. And while the transformed
data is being loaded into the data warehouse, the already extracted data can be transformed. The
block diagram of the pipelining of ETL process is shown below:
Q5] Why is entity-relationship modeling technique not suitable for the data warehouse?
How is dimensional modeling different?
ER modelling aims to optimize performance for transaction processing. It is also hard to
query ER models because of the complexity, many tables should be joined to obtain a
result Therefore ER models are not suitable for high performance retrieval of data.
The entity relationship model is a method used to represent the logical flow of entities
graphically that in turn create a database. It has both logical and physical model. And it is
good for reporting and point queries.
Dimensional model is a method in which the data is stored in two types of tables namely
facts table and dimension table. It has only physical model. It is good for ad hoc query
queries.