What's A Data Warehouse
Answer1:
A data warehouse is a repository of integrated information, available for queries and
analysis. Data and information are extracted from heterogeneous sources as they are
generated. This makes it much easier and more efficient to run queries over data that
originally came from different sources. Another definition of a data warehouse is: "A
data warehouse is a logical collection of information gathered from many different
operational databases used to create business intelligence that supports business analysis
activities and decision-making tasks; primarily, a record of an enterprise's past
transactional and operational information, stored in a database designed to favour
efficient data analysis and reporting (especially OLAP)". Generally, data warehousing is
not meant for current "live" data, although 'virtual' or 'point-to-point' data warehouses can
access operational data. A 'real' data warehouse is generally preferred to a virtual DW
because stored data has been validated and is set up to provide reliable results for the
common types of queries used in a business.
Answer2:
Data Warehouse is a repository of integrated information, available for queries and
analysis. Data and information are extracted from heterogeneous sources as they are
generated. This makes it much easier and more efficient to run queries over data that
originally came from different sources.
Typical relational databases are designed for on-line transactional processing (OLTP) and
do not meet the requirements for effective on-line analytical processing (OLAP). As a
result, data warehouses are designed differently than traditional relational databases.
What is a dimension table?
A dimension table is a collection of hierarchies and categories along which the user can
drill down and drill up. It contains only textual attributes.
What is a lookup table?
A lookup table is one that is used when updating a warehouse. When the lookup is
placed on the target table (fact table / warehouse) based upon the primary key of the
target, it updates the table by allowing through only new or updated records, based on
the lookup condition.
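As a rough illustration, the lookup-driven update described above can be sketched in Python. The table and column names (cust_id, city) are invented for this example and not tied to any particular ETL tool:

```python
# Sketch of a lookup-based load: only new or changed rows reach the target.
def load_with_lookup(target, incoming, key="cust_id"):
    """Upsert incoming records into target, keyed on the target's primary key."""
    lookup = {row[key]: row for row in target}   # lookup built on the target's PK
    for row in incoming:
        existing = lookup.get(row[key])
        if existing is None:            # new record -> insert
            target.append(row)
            lookup[row[key]] = row
        elif existing != row:           # changed record -> update in place
            existing.update(row)
    return target

target = [{"cust_id": 1, "city": "Pune"}]
incoming = [{"cust_id": 1, "city": "Delhi"}, {"cust_id": 2, "city": "Goa"}]
load_with_lookup(target, incoming)
```

After the load, the existing customer's city is updated and the new customer is inserted; unchanged rows are left untouched.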
Why should you put your data warehouse on a different system than your OLTP
system?
Answer1:
An OLTP system is basically "data oriented" (ER model) and not "subject oriented"
(dimensional model). That is why we design a separate system that will have a subject-
oriented OLAP system.
Moreover, if a complex query is fired on an OLTP system, it will cause heavy overhead
on the OLTP server, which will directly affect the day-to-day business.
Answer2:
The loading of a warehouse will likely consume a lot of machine resources. Additionally,
users may create queries or reports that are very resource intensive because of the
potentially large amount of data available. Such loads and resource needs will conflict
with the resource needs of the OLTP systems and will negatively impact those
production systems.
An aggregate table contains a summary of existing warehouse data, grouped to certain
levels of dimensions. Retrieving the required data from the detail table, which may have
millions of records, takes more time and also affects server performance. To avoid this,
we can aggregate the table to the required level and use it. These tables reduce the load
on the database server, improve query performance, and return results very quickly.
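A minimal sketch of the idea, using invented sales data: detail rows are rolled up once into a month-level aggregate, so reports can read the small aggregate table instead of scanning every transaction.

```python
# Illustrative: roll detailed sales rows up to a month-level aggregate table.
from collections import defaultdict

sales = [  # (date, product, amount) -- transaction-level fact rows
    ("2024-01-03", "widget", 10.0),
    ("2024-01-15", "widget", 5.0),
    ("2024-02-02", "widget", 7.5),
]

month_agg = defaultdict(float)
for date, product, amount in sales:
    month = date[:7]                          # truncate YYYY-MM-DD to YYYY-MM
    month_agg[(month, product)] += amount

# Queries for "sales by month" now hit 2 aggregate rows, not 3 detail rows.
```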
Data modeling is probably the most labor intensive and time consuming part of the
development process. Why bother especially if you are pressed for time? A common
response by practitioners who write on the subject is that you should no more build a
database without a model than you should build a house without blueprints.
The goal of the data model is to make sure that all data objects required by the
database are completely and accurately represented. Because the data model uses easily
understood notations and natural language, it can be reviewed and verified as correct by
the end-users.
The data model is also detailed enough for the database developers to use as a
"blueprint" for building the physical database. The information contained in the data
model will be used to define the relational tables, primary and foreign keys, stored
procedures, and triggers. A poorly designed database will require more time in the long
term. Without careful planning you may create a database that omits data required to
create critical reports, produces results that are incorrect or inconsistent, and is unable to
accommodate changes in the user's requirements.
What is ETL?
ETL stands for Extract, Transform and Load: the process of extracting data from the
source systems, transforming it to fit the warehouse model, and loading it into the
data warehouse.
What are the differences between OLTP and OLAP?
1. Users and System Orientation
OLTP: customer-oriented, used for transaction and query processing by clerks, clients
and IT professionals.
OLAP: market-oriented, used for data analysis by knowledge workers such as managers,
executives and analysts.
2. Data Contents
OLTP: manages current data, typically in too much detail to be easily used for decision
making.
OLAP: manages large amounts of historical data, provides facilities for summarization
and aggregation, stores information at different levels of granularity to support the
decision-making process.
3. Database Design
OLTP: adopts an entity-relationship (ER) model and an application-oriented database
design.
OLAP: adopts a star, snowflake or fact constellation model and a subject-oriented
database design.
4. View
OLTP: focuses mainly on the current data within an enterprise or department, without
referring to historical data or data in different organizations.
OLAP: spans multiple versions of a database schema due to the evolutionary process of
an organization; integrates information from many organizational locations and data
stores.
What is SCD1, SCD2, SCD3?
SCD1: the new value overwrites the existing record, so no history is kept.
Ex: when a customer address is modified, we update the existing record with the new
address.
SCD2: a new record is added so that full history is maintained. The current version can
be identified by:
A) Effective Date
B) Versions
C) Flags
or a combination of these.
SCD3: by adding new columns to the target table we maintain both historical
information and current information.
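The three strategies can be sketched as follows. This is a simplified illustration with made-up column names; in practice the logic lives in the ETL tool or database:

```python
# A hedged sketch of the three SCD strategies for a customer address change.
def scd1(dim_row, new_addr):
    """Type 1: overwrite in place -- no history is kept."""
    dim_row["address"] = new_addr
    return dim_row

def scd2(dim_rows, cust_id, new_addr, change_date):
    """Type 2: close the current version and insert a new one (full history)."""
    for row in dim_rows:
        if row["cust_id"] == cust_id and row["current"]:
            row["current"] = False          # flag the old version as expired
            row["end_date"] = change_date   # effective date of the change
    dim_rows.append({"cust_id": cust_id, "address": new_addr,
                     "start_date": change_date, "end_date": None,
                     "current": True})
    return dim_rows

def scd3(dim_row, new_addr):
    """Type 3: keep one previous value in an extra column."""
    dim_row["prev_address"] = dim_row["address"]
    dim_row["address"] = new_addr
    return dim_row
```

Note how Type 2 uses exactly the tracking devices listed above: an effective date, and a current-record flag (a version number would work the same way).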
Why are OLTP database designs not generally a good idea for a Data Warehouse?
Since OLTP tables are normalised, query response will be slow for the end user, and
OLTP does not contain years of data, so it cannot be analysed.
What are the various Reporting tools in the Market?
1. MS-Excel
2. Business Objects (Crystal Reports)
3. Cognos (Impromptu, Power Play)
4. Microstrategy
5. MS reporting services
6. Informatica Power Analyzer
7. Actuate
8. Hyperion (BRIO)
9. Oracle Express OLAP
10. Proclarity
What is a Fact table?
A Fact table contains the measurements, metrics or facts of a business process. If your
business process is "Sales", then a measurement of this business process such as
"monthly sales number" is captured in the Fact table. The Fact table also contains the
foreign keys for the dimension tables.
What are conformed dimensions?
Answer1:
Conformed dimensions mean the exact same thing with every possible fact table to which
they are joined. Ex: the Date dimension is connected to all facts, like Sales facts,
Inventory facts, etc.
Answer2:
Conformed dimensions are dimensions which are common across cubes (cubes are the
schemas containing fact and dimension tables).
Consider Cube-1 containing F1, D1, D2, D3 and Cube-2 containing F2, D1, D2, D4 as
the facts and dimensions; here D1 and D2 are the conformed dimensions.
Conventional Load:
Before loading the data, all the table constraints are checked against the data.
Direct Load:
The constraints are disabled and the data is loaded directly; the data is checked against
the constraints afterwards, which makes the load faster.
Conformed dimensions are the dimensions which can be used across multiple data marts,
in combination with multiple fact tables accordingly.
Data marts are designed to help managers make strategic decisions about their business.
Data marts are a subset of the corporate-wide data that is of value to a specific group of
users.
1. Independent data marts – sourced from data captured from OLTP systems, external
providers, or data generated locally within a particular department or geographic area.
2. Dependent data marts – sourced directly from enterprise data warehouses.
Level of granularity means the level of detail that you put into the fact table in a data
warehouse. For example, based on the design you can decide to store sales data for each
transaction. Level of granularity then means how much detail you are willing to keep for
each transactional fact: product sales recorded per individual transaction, or aggregated
up to the minute, hour or day.
How are the Dimension tables designed?
Most dimension tables are designed using normalization principles up to 2NF. In some
instances they are further normalized to 3NF. You must also determine how to maintain
changes to each dimension (see Slowly Changing Dimensions).
Non-Additive: Non-additive facts are facts that cannot be summed up for any of the
dimensions present in the fact table.
What type of indexing mechanism do we need to use for a typical data warehouse?
On the fact table it is best to use bitmap indexes. Dimension tables can use bitmap and/or
the other types of clustered/non-clustered, unique/non-unique indexes.
To my knowledge, SQL Server does not support bitmap indexes; only Oracle supports
bitmaps.
In a snowflake schema, each dimension has a primary dimension table, to which one or
more additional dimension tables can join. The primary dimension table is the only table
that can join to the fact table.
Data warehousing captures business activity data. Real-time data warehousing captures
business activity data as it occurs. As soon as the business activity is complete and there
is data about it, the completed activity data flows into the data warehouse and becomes
available instantly. In other words, real-time data warehousing is a framework for
deriving information from data as the data becomes available.
SCD stands for Slowly changing dimensions. Slowly changing dimensions are of three
types
What are Semi-additive and factless facts and in which scenario will you use such
kinds of fact tables?
Snapshot facts are semi-additive; we go for semi-additive facts while we maintain
aggregated facts.
A fact table without numeric fact columns is called a factless fact table. It is used, for
example, while maintaining the promotion events of a transaction (ex: product samples),
because such a table doesn't contain any measures.
Star schema - all dimensions are linked directly to the fact table.
Snowflake schema - dimensions may be interlinked or may have one-to-many
relationships with other tables.
What is a Star Schema?
A star schema is a way of organising the tables such that results can be retrieved from
the database easily and quickly in the warehouse environment. Usually a star schema
consists of one or more dimension tables around a fact table, which looks like a star,
hence its name.
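A minimal star schema can be demonstrated with SQLite. The table and column names here are invented for illustration; the point is that each dimension joins directly to the fact table:

```python
# A tiny star schema: one fact table, each dimension joined straight to it.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE fact_sales  (product_id INTEGER, date_id INTEGER, amount REAL);
    INSERT INTO dim_product VALUES (1, 'widget');
    INSERT INTO dim_date    VALUES (10, 2024);
    INSERT INTO fact_sales  VALUES (1, 10, 99.5), (1, 10, 0.5);
""")
# The "star" shape: simple one-hop joins from the fact to every dimension.
row = con.execute("""
    SELECT p.name, d.year, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_date d    ON f.date_id    = d.date_id
    GROUP BY p.name, d.year
""").fetchone()
```

Because no dimension is split further, every query needs at most one join per dimension, which is what makes star-schema queries fast.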
The basic purpose of the scheduling tool in a DW application is to streamline the flow
of data from source to target at a specific time or based on some condition.
What is ER Diagram?
The Entity-Relationship (ER) model was originally proposed by Peter Chen in 1976
[Chen76] as a way to unify the network and relational database views.
Simply stated, the ER model is a conceptual data model that views the real world as
entities and relationships. A basic component of the model is the Entity-Relationship
diagram, which is used to visually represent data objects.
Since Chen wrote his paper the model has been extended, and today it is commonly used
for database design. For the database designer, the utility of the ER model is:
it maps well to the relational model - the constructs used in the ER model can easily be
transformed into relational tables - and it is simple and easy to understand with a
minimum of training. Therefore, the model can be used by the database designer to
communicate the design to the end user.
In addition, the model can be used as a design plan by the database developer to
implement a data model in a specific database management software.
Which columns go to the fact table and which columns go the dimension table?
The primary key columns of the tables (entities) go to the dimension tables as foreign
keys.
The primary key columns of the dimension tables go to the fact tables as foreign keys.
What are modeling tools available in the Market?
1. Oracle Designer
2. ERWin (Entity Relationship for windows)
3. Informatica (Cubes/Dimensions)
4. Embarcadero
5. Power Designer (Sybase)
Time dimensions are usually loaded by a program that loops through all possible dates
that may appear in the data. It is not unusual for 100 years to be represented in a time
dimension, with one row per day.
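Such a loader loop might look like the sketch below. The column choices are illustrative; real time dimensions usually carry many more attributes (fiscal periods, holidays, etc.):

```python
# Sketch: generate one time-dimension row per day over a date range.
from datetime import date, timedelta

def build_time_dimension(start, end):
    rows, day = [], start
    while day <= end:
        rows.append({
            "date_key": day.strftime("%Y%m%d"),   # surrogate-style key, e.g. 20240101
            "year": day.year,
            "quarter": (day.month - 1) // 3 + 1,  # derive quarter from the month
            "month": day.month,
            "day_of_week": day.isoweekday(),      # 1 = Monday ... 7 = Sunday
        })
        day += timedelta(days=1)
    return rows

dim = build_time_dimension(date(2024, 1, 1), date(2024, 12, 31))
```

Looping a century of dates this way yields roughly 36,500 rows, which is why even a 100-year time dimension stays small.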
Explain the advantages of RAID 1, 1/0, and 5. What type of RAID setup would you
put your TX logs on?
Transaction logs write sequentially and don't need to be read at all. The ideal is to have
each on RAID 1/0 because it has much better write performance than RAID 5.
RAID 1 is also better for TX logs and costs less than 1/0 to implement. It has a tad less
reliability and performance is a little worse generally speaking.
RAID 5 is best for data generally because of cost and the fact it provides great read
capability.
What are the various ETL tools in the Market?
1. Informatica
2. Data Stage
3. MS-SQL DTS(Integrated Services 2005)
4. Abinitio
5. SQL Loader
6. Sunopsis
7. Oracle Warehouse Builder
8. Data Junction
What is VLDB?
Answer 1:
VLDB stands for Very Large DataBase.
Answer 2:
VLDB doesn't refer only to the size of the database or the vast amount of information
stored. It refers to the window of opportunity to take a backup of the database. The
window of opportunity is the available time interval; if the DBA is unable to take a
backup in the specified time, then the database is considered a VLDB.
A data mart is a focused subset of a data warehouse that deals with a single area of data
(like a particular department) and is organized for quick analysis.
The basic difference is that E-R modeling has both logical and physical models, while the
dimensional model has only a physical model.
Basically the fact table consists of the index keys of the dimension/lookup tables and the
measures. So whenever we have the keys in a table, that itself implies that the table is in
normal form.
Data mining is used for estimating the future. For example, if we take a company or
business organization, by using the concept of data mining we can predict the future of
the business in terms of revenue, employees, customers, orders, etc.
Traditional approaches use simple algorithms for estimating the future, but they do not
give accurate results when compared to data mining.
What is data validation strategies for data mart validation after loading process ?
Data validation is to make sure that the loaded data is accurate and meets the business
requirements.
Degenerate dimensions: if a table contains values which are neither dimensions nor
measures, they are called degenerate dimensions. Ex: invoice id, empno.
Every company has a methodology of its own, but to name a few, the SDLC and AIM
methodologies are used as standards. Other methodologies include AMM, World Class
Methodology and many more.
What is the main difference between Inmon and Kimball philosophies of data
warehousing?
Kimball views data warehousing as a constituency of data marts. Data marts are focused
on delivering business objectives for departments in the organization, and the data
warehouse is a conformed dimension of the data marts. Hence a unified view of the
enterprise can be obtained from dimensional modeling at the local, departmental level.
Inmon believes in creating a data warehouse on a subject-by-subject-area basis. Hence
the development of the data warehouse can start with data from the online store, and
other subject areas can be added to the data warehouse as the need arises. Point-of-sale
(POS) data can be added later if management decides it is necessary.
i.e.,
Kimball: first the data marts, combined into a data warehouse.
Inmon: first the data warehouse, later the data marts.
Hierarchies
Hierarchies are logical structures that use ordered levels as a means of organizing data. A
hierarchy can be used to define data aggregation. For example, in a time dimension, a
hierarchy might aggregate data from the month level to the quarter level to the year level.
A hierarchy can also be used to define a navigational drill path and to establish a family
structure.
Within a hierarchy, each level is logically connected to the levels above and below it.
Data values at lower levels aggregate into the data values at higher levels. A dimension
can be composed of more than one hierarchy. For example, in the product dimension,
there might be two hierarchies--one for product categories and one for product suppliers.
Dimension hierarchies also group levels from general to granular. Query tools use
hierarchies to enable you to drill down into your data to view different levels of
granularity. This is one of the key benefits of a data warehouse.
When designing hierarchies, you must consider the relationships in business structures;
for example, consider a divisional multilevel sales organization.
Hierarchies impose a family structure on dimension values. For a particular level value, a
value at the next higher level is its parent, and values at the next lower level are its
children. These familial relationships enable analysts to access data quickly.
Levels
A level represents a position in a hierarchy. For example, a time dimension might have a
hierarchy that represents data at the month, quarter, and year levels. Levels range from
general to specific, with the root level as the highest or most general level. The levels in a
dimension are organized into one or more hierarchies.
Level Relationships
Level relationships specify top-to-bottom ordering of levels from most general (the root)
to most specific information. They define the parent-child relationship between the levels
in a hierarchy.
Hierarchies are also essential components in enabling more complex query rewrites. For
example, the database can aggregate existing sales revenue on a quarterly basis up to a
yearly aggregation when the dimensional dependencies between quarter and year are
known.
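The month-to-quarter-to-year aggregation described above can be sketched as a simple rollup along hierarchy levels. The data and the level encoding (year, quarter, month tuples) are invented for this illustration:

```python
# Illustrative rollup along a time hierarchy: month -> quarter -> year.
# Each key is a path from the root level down: (year, quarter, month).
monthly = {("2024", "Q1", "Jan"): 10, ("2024", "Q1", "Feb"): 20,
           ("2024", "Q2", "Apr"): 5}

def rollup(data, levels):
    """Aggregate child-level values up to the first `levels` key parts."""
    out = {}
    for key, value in data.items():
        parent = key[:levels]                  # truncate the path to the parent level
        out[parent] = out.get(parent, 0) + value
    return out

quarterly = rollup(monthly, 2)   # month values aggregate into their quarter
yearly = rollup(monthly, 1)      # and quarters (via months) into their year
```

This mirrors the parent-child rule stated above: data values at lower levels aggregate into the data values at higher levels.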
RDBMS Schema
* Used for OLTP systems
* Traditional and old schema
* Normalized
* Difficult to understand and navigate
* Cannot easily solve extraction and complex analytical problems
* Poorly modelled
DWH Schema
* Used for OLAP systems
* New generation schema
* Denormalized
* Easy to understand and navigate
* Extraction and complex problems can be easily solved
* Very good model
It may happen that in a table some columns are important and we need to track changes
to them, i.e. capture their historical data, whereas for other columns, even if the data
changes, we don't care.
For such tables we implement hybrid SCDs, wherein some columns are Type 1 and some
are Type 2.
what is junk dimension? what is the difference between junk dimension and
degenerated dimension?
Junk dimension: grouping random flags and text attributes in a dimension and moving
them to a separate sub-dimension.
Degenerate dimension: keeping the control information on the fact table. Ex: consider a
dimension table with fields like order number and order line number that has a 1:1
relationship with the fact table. In this case the dimension is removed and the order
information is stored directly in the fact table, in order to eliminate unnecessary joins
while retrieving order information.
What are the possible data marts in Retail sales?
What is the definition of normalized and denormalized view and what are the
differences between them?
What is metadata?
Metadata is data about data. The business analyst or data modeler usually captures
information about data - the source (where and how the data originated), the nature of
the data (char, varchar, nullable, existence, valid values, etc.) and the behavior of the
data (how it is modified/derived and its life cycle) - in a data dictionary, a.k.a. metadata.
Metadata is also present at the data mart level: subsets, facts and dimensions, ODS, etc.
For a DW user, metadata provides vital information for analysis/DSS.
Star schema
A single fact table with N dimension tables.
Snowflake schema
Any dimension with extended (normalized) dimension tables is known as a snowflake
schema.
Difference between Snow flake and Star Schema. What are situations where Snow
flake Schema is better than Star Schema to use and when the opposite is true?
A star schema contains the dimension tables mapped around one or more fact tables.
It is a denormalised model.
There is no need to use complicated joins.
Queries return results quickly.
Snowflake schema
It is the normalised form of the star schema.
It contains deeper joins, because the tables are split into many pieces; this makes it easy
to apply modifications directly in the tables.
We have to use complicated joins, since we have more tables.
There will be some delay in processing the query.
What is VLDB?
The perception of what constitutes a VLDB continues to grow. A one terabyte database
would normally be considered to be a VLDB.
What are the data types present in BO? What happens if we implement a view in the
designer and report?
The data types will be char (though the values themselves can be numeric or char).
View - stores the SQL statement in the database and lets you use it as a table. Every time
you access the view, the SQL statement executes.
Materialized view - stores the results of the SQL in table form in the database. SQL
statement only executes once and after that everytime you run the query, the stored result
set is used. Pros include quick query results.
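The behavioural difference can be seen with SQLite, which has plain views but no materialized views; here an ordinary table built once from the query result stands in for a materialized view:

```python
# View vs "materialized view": the view re-runs its SQL on every access,
# the pre-computed table keeps returning the stored result set.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (amount REAL)")
con.execute("INSERT INTO sales VALUES (10), (20)")
con.execute("CREATE VIEW v_total AS SELECT SUM(amount) AS total FROM sales")
# Stand-in for a materialized view: run the query once and store its result.
con.execute("CREATE TABLE mv_total AS SELECT SUM(amount) AS total FROM sales")

con.execute("INSERT INTO sales VALUES (70)")   # the base table changes

view_total = con.execute("SELECT total FROM v_total").fetchone()[0]  # re-executes SQL
mv_total = con.execute("SELECT total FROM mv_total").fetchone()[0]   # stored result
```

The view reflects the new row immediately; the stored result is stale until refreshed, which is exactly the trade-off of a real materialized view: fast reads in exchange for a refresh step.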
Data warehouses typically use a surrogate key (also known as an artificial or identity
key) for the dimension tables' primary keys. It can be generated with an Informatica
sequence generator, an Oracle sequence, or SQL Server identity values.
It is useful because the natural primary key (i.e. Customer Number in the Customer
table) can change, and this makes updates more difficult.
Some tables have columns such as AIRPORT_NAME or CITY_NAME which are stated
as the primary keys (according to the business users), but not only can these change,
indexing on a numerical value is probably better, and you could consider creating a
surrogate key called, say, AIRPORT_ID. This would be internal to the system, and as far
as the client is concerned you may display only the AIRPORT_NAME.
Consider an employee 'E1' who moves from Business Unit 'BU1' to 'BU2' on the 2nd of
June. If you used the natural business key 'E1' for your employee within your data
warehouse, everything would be allocated to Business Unit 'BU2', even what actually
belongs to 'BU1'.
If you use surrogate keys, you could create, on the 2nd of June, a new record for the
employee 'E1' in your Employee dimension with a new surrogate key.
This way, in your fact table, your old data (before the 2nd of June) carries the SID of
the employee 'E1' + 'BU1', and all new data (after the 2nd of June) takes the SID of the
employee 'E1' + 'BU2'.
You could consider a Slowly Changing Dimension as an enlargement of your natural
key: the natural key of the employee was the employee code 'E1', but for you it becomes
employee code + business unit - 'E1' + 'BU1' or 'E1' + 'BU2'. The difference from the
natural-key enlargement process is that you might not have every part of the new key
within your fact table, so you might not be able to join on the new enlarged key - so
you need another id.
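The E1/BU1/BU2 example above can be sketched as follows. This is a toy illustration: 'SID' is the surrogate id, and an in-memory counter stands in for a database sequence or identity column:

```python
# Each version of the employee dimension row gets its own surrogate key (SID),
# so old fact rows keep pointing at the business unit that was current when
# they were loaded.
import itertools

sid_seq = itertools.count(1)        # stands in for a database sequence
dim_employee = []

def add_version(emp_code, business_unit):
    """Insert a new version of the employee row and return its surrogate key."""
    sid = next(sid_seq)
    dim_employee.append({"sid": sid, "emp_code": emp_code, "bu": business_unit})
    return sid

sid_before = add_version("E1", "BU1")   # version valid before the 2nd of June
sid_after = add_version("E1", "BU2")    # new version created on the 2nd of June

fact_rows = [{"emp_sid": sid_before, "amount": 100.0},   # old fact data -> BU1
             {"emp_sid": sid_after, "amount": 40.0}]     # new fact data -> BU2
```

Joining the fact rows back through their SID recovers the correct business unit for each period, which the natural key 'E1' alone could not do.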
What is aggregate table and aggregate fact table ... any examples of both?
An aggregate table contains summarised data; materialized views are aggregated tables.
For example, in sales we have only date-level transactions. If we want to create a report
like sales by product per year, we aggregate the date values into week_agg, month_agg,
quarter_agg and year_agg tables. To retrieve data from these tables we use the
@Aggregate_Aware function.
What is the difference between a data warehouse and Business Intelligence?
Data warehousing deals with all aspects of managing the development, implementation
and operation of a data warehouse or data mart including meta data management, data
acquisition, data cleansing, data transformation, storage management, data distribution,
data archiving, operational reporting, analytical reporting, security management,
backup/recovery planning, etc. Business intelligence, on the other hand, is a set of
software tools that enable an organization to analyze measurable aspects of their business
such as sales performance, profitability, operational efficiency, effectiveness of
marketing campaigns, market penetration among certain customer groups, cost trends,
anomalies and exceptions, etc. Typically, the term “business intelligence” is used to
encompass OLAP, data visualization, data mining and query/reporting tools.Think of the
data warehouse as the back office and business intelligence as the entire business
including the back office. The business needs the back office on which to function, but
the back office without a business to support, makes no sense.
What is a factless fact table? Where have you used it in your project?
A factless fact table means only keys are available in the fact table; there are no
measures available.
In a relational data model, for normalization purposes, some lookup tables are not
merged into a single table. In dimensional data modeling (star schema), these tables are
merged into a single table called a DIMENSION table, for performance and for slicing
data. Because of this merging of tables into one large dimension table, complex
intermediate joins are avoided; dimension tables are directly joined to fact tables.
Although redundancy of data occurs in the DIMENSION table, its size is only about
15% of the FACT table's. So denormalization is promoted in Universe Designing.
ODS: a collection of tables created in the data warehouse that maintains only current
data, whereas OLTP maintains data only for transactions; OLTP systems are designed for
recording the daily operations and transactions of a business.
Type 1
Most recent value only.
Type 2 (full history)
i) Version number
ii) Flag
iii) Date
Type 3
Current value and one previous value.
What is snapshot?
You can disconnect the report from the catalog to which it is attached by saving the
report with a snapshot of the data. However, you must reconnect to the catalog if you
want to refresh the data.
What is a fact? What are additive and non-additive facts?
A fact may be a measure, metric or dollar value. Measures and metrics are non-additive
facts; a dollar value is an additive fact. If we want to find the amount for a particular
place for a particular period of time, we can add the dollar amounts and come up with
the total amount.
For a non-additive fact, e.g. a measure of heights for 'citizens by geographical location',
when we roll up 'city' data to 'state' level we should not add the heights of the citizens;
rather, we may want to use the measure to derive a 'count'.
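As a small illustration with invented numbers: the additive dollar fact sums cleanly to state level, while the height measure is used to derive a count and an average instead of being summed:

```python
# Additive vs non-additive facts during a city -> state rollup.
city_rows = [
    {"city": "A", "state": "S1", "sales": 100.0, "height_cm": 170},
    {"city": "B", "state": "S1", "sales": 50.0,  "height_cm": 180},
]

s1 = [r for r in city_rows if r["state"] == "S1"]

state_sales = sum(r["sales"] for r in s1)            # additive: summing is valid
citizen_count = len(s1)                              # derived count instead of a sum
avg_height = sum(r["height_cm"] for r in s1) / citizen_count  # meaningful rollup
```

Summing the heights (350 cm) would be a number with no business meaning, which is exactly why height is a non-additive fact.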