The document discusses data modeling techniques used for data warehouses. It describes entity-relationship (ER) modeling and dimensional modeling as the two main techniques. ER modeling uses entities, relationships, and attributes as basic concepts, while dimensional modeling uses facts, dimensions, and measures. Both techniques can be used to abstractly model data but have different semantic representations. The document also discusses key concepts of ER modeling in more detail, including entity definitions, relationship types, attributes, and subclasses. It provides examples to illustrate entity-relationship diagrams and how they represent different modeling concepts.
The document discusses data modeling techniques used for data warehouses. It describes entity-relationship (ER) modeling and dimensional modeling as the two main techniques. ER modeling uses entities, relationships, and attributes as basic concepts, while dimensional modeling uses facts, dimensions, and measures. Both techniques can be used to abstractly model data but have different semantic representations. The document also discusses key concepts of ER modeling in more detail, including entity definitions, relationship types, attributes, and subclasses. It provides examples to illustrate entity-relationship diagrams and how they represent different modeling concepts.
The document discusses data modeling techniques used for data warehouses. It describes entity-relationship (ER) modeling and dimensional modeling as the two main techniques. ER modeling uses entities, relationships, and attributes as basic concepts, while dimensional modeling uses facts, dimensions, and measures. Both techniques can be used to abstractly model data but have different semantic representations. The document also discusses key concepts of ER modeling in more detail, including entity definitions, relationship types, attributes, and subclasses. It provides examples to illustrate entity-relationship diagrams and how they represent different modeling concepts.
The document discusses data modeling techniques used for data warehouses. It describes entity-relationship (ER) modeling and dimensional modeling as the two main techniques. ER modeling uses entities, relationships, and attributes as basic concepts, while dimensional modeling uses facts, dimensions, and measures. Both techniques can be used to abstractly model data but have different semantic representations. The document also discusses key concepts of ER modeling in more detail, including entity definitions, relationship types, attributes, and subclasses. It provides examples to illustrate entity-relationship diagrams and how they represent different modeling concepts.
Unit 4 Data Modeling Structure: 4.1 Introduction Objectives 4.2 The Concept of Data Modeling 4.3 Data Modeling Techniques 4.4 Entity-Relationship (E-R) Modeling 4.4.1 Basic Concepts 4.4.2 Other Concepts Self Assessment Question(s) (SAQs) 4.5 Dimensional Modeling 4.5.1 Basic Concept 4.5.2 Star and Snowflake Models Self Assessment Question(s) (SAQs) 4.6 Database Design Methodology for Data Warehouses Self Assessment Question(s) (SAQs) 4.7 Common Mistakes in Data Modeling 4.7.1 Failure to address Long term growth 4.7.2 No Mechanism to clean-out old Data 4.7.3 Overuse of Synthetic keys 4.7.4 Inadequate Planning for Indexes Self Assessment Question(s) (SAQs) 4.8 Organizing the Data in a Data Warehouse 4.9 Summary 4.10 Terminal Questions (TQs) 4.11 Multiple Choice Questions (MCQs) 4.12 Answers to SAQs, TQs, and MCQs 4.12.1 Answers to Self Assessment Questions (SAQs) 4.12.2 Answers to Terminal Questions (TQs) 4.12.3 Answers to Multiple Choice Questions (MCQs) Business Intelligence and Tools Unit 4 Sikkim Manipal University Page No. 81 4.1 Introduction Data warehousing has become the most accepted approach for providing an integrated, consistent source of data for use in business decision- making. But the question that comes into the picture is how to model the data in a data warehouse so that the data can best support the decision- making processes of an organization. In this Unit, we consider two basic data modeling techniques; ER modeling and Dimensional modeling. The ER Modeling has been the choice for the operational environment, but there is an increased interest in the use of dimensional modeling in these days for a business intelligence environment. Objectives The objectives of the Unit are: The purpose of data modeling in building a data warehouse The techniques of data modeling Common mistakes in data modeling 4.2 The Concept of Data Modeling In simple terms, data; is a record of all business activities, resources, and results of an organization and data model is a well-organized abstraction of that data. So, it is quite natural that the data model has become the best method to understand and manage the business of the organization. Without a data model, it would be very difficult to organize the structure and contents of the data in the data warehouse. These data models are critical in understanding the data in an analytical environment of a data warehouse and so act as a blueprint for the information requirements of an organization. The purpose of modeling is to provide an accurate record of some aspect of the real world in a particular context. The objective of the modeling is to provide: Business Intelligence and Tools Unit 4 Sikkim Manipal University Page No. 82 A record of accurate and meaningful business data definitions. Identification of valid, consistent business data structures that contain sufficient information to run and manage the business. An indication of the similarities and differences between data from different sources and the relationship between them. 4.3 Data Modeling Techniques There are two data modeling techniques that are relevant in a data warehousing environment. They are Entity Relationship modeling (ER modeling) and dimensional modeling. ER modeling produces a data model of the specific area of interest, using two basic concepts: Entities and the Relationships between them. A detailed ER model may also contain attributes, which can be properties of either the entities or the relationships. The ER model is an abstraction tool as it can be used to simplify, understand and analyze the ambiguous data relationships in the real business world. Dimensional modeling uses three basic concepts: Facts, Dimensions and Measures. Dimensional modeling is powerful in representing the requirements of the business user in the context of database tables and also in the area of data warehousing. Both ER and dimensional modeling can be used to create an abstract model of a specific subject. However, each of them has its own limited set of modeling concepts and associated notation conventions. Consequently, the techniques seem different, and they are indeed different in terms of semantic representation. There is much debate as to which method is better and the conditions under which a specific technique is to be selected. There can be no definite answer, understanding of the circumstances and the business requirements finally lead to selection of an appropriate technique. Business Intelligence and Tools Unit 4 Sikkim Manipal University Page No. 83 4.4 Entity- Relationship (E-R) Modeling 4.4.1 Basic Concepts An ER model is represented by an ER diagram, which uses three basic graphic symbols to conceptualize the data: entity, relationship, and attribute. Entity An entity is defined to be a person, place, thing, or event of interest to the business or the organization. It represents a class of objects, which are things in the real business world that can be observed and classified by their properties and characteristics. In general, an entity has its own business definition and a clear boundary definition that is required to describe what is included and what is not. In a practical modeling project, the team members share a definition template for integration and a consistent entity definition in the model. In case of a high-level business modeling, an entity can be very generic, but it must be quite specific in the detailed logical modeling. There are four entities; PRODUCT, PRODUCT MODEL, PRODUCT COMPONENT, and COMPONENT in the ER diagram (Refer Figure 4.1) and are represented as rectangles. Fig. 4.1: A Simpl e ER Model Business Intelligence and Tools Unit 4 Sikkim Manipal University Page No. 84 The four diagonal lines on the corners of the PRODUCT COMPONENT entity represent that the entity is an associative entity and the entity is to resolve the many-to-many relationship between two entities. PRODUCT MODEL and COMPONENT are independent of each other but have a business relationship between them. A PRODUCT MODEL consists of many components and a component is related to many product models. With this business rule, you cannot tell which components make up a product model. To do this, you can define a resolving entity. For example, the PRODUCT COMPONENT entity can provide the information about which components are related to which product model. In ER modeling, naming the entities is important for easy understanding and clear communication. It is expressed grammatically in the form of a noun rather than a verb and the criteria for selecting an entity name depend on how well the name represents the characteristics and scope of the entity. Also, defining a unique identifier of an entity is the most critical task. These unique identifiers are called candidate keys. Among them, you can select the key that is most commonly used to identify the entity, called primary key. Relationship Relationships represent the structural interaction and association among the entities in a model and they are represented with lines drawn between the two specific entities. Generally, a relationship is named grammatically by a verb (such as owns, belongs, and has) and the relationship between the entities can be defined in terms of the cardinality. Cardinality represents the maximum number of instances of one entity that are related to a single instance in another table and vice versa. Thus the possible cardinalities include one-to-one (1:1), one-to-many (1:M), and many-to-many (M:M). In a detailed normalized ER model, any M:M relationship is not shown because it is resolved to an associative entity. Business Intelligence and Tools Unit 4 Sikkim Manipal University Page No. 85 Attributes Attributes describe the characteristics of properties of the entities. The Product ID, Description, and Picture are attributes of the PRODUCT entity in Figure 4.1. The name of an attribute has to be unique in an entity and should be self-explanatory to ensure clarity. For example, rather naming date1 and date2, you may use the names; order date and delivery date. When an instance has no value for an attribute, the minimum cardinality of the attribute is zero, which means either nullable or optional. In Figure 4.1, you can see the characters P, m, o, and F that stand for primary key, mandatory, optional, and foreign key. The Picture attribute of the PRODUCT entity is optional, which means it is nullable. A foreign key of an entity is defined to be the primary key of another entity. In figure 4.1, the Product ID attribute of the PRODUCT MODEL entity is a foreign key as it is the primary key of the PRODUCT entity. These foreign keys are useful in determining the relationships such as the referential integrity between the entities. 4.4.2 Other Concepts Supertype and Subtype An entity can have subtypes and supertypes and the relationship between a supertype entity and its subtype entity is an IS A relationship. An IS A relationship is used where one entity is a generalization of several more specialized entities. The supertype and subtype relationship is represented by a triangle on the relationship. Figure 4.2 shows an example of supertype and subtype entities wherein SALES OUTLET is the supertype of RETAIL STORE and CORPORATE SALES OFFICE and RETAIL STORE, CORPORATE SALES OFFICE are subtypes of SALES OUTLET. Here, each subtype entity inherits attributes from its supertype entity. Business Intelligence and Tools Unit 4 Sikkim Manipal University Page No. 86 Also, each subtype entity can have its own distinctive attributes. In the example provided above, Region ID and Outlet ID are inherited attributes and the sub entities have their own attributes (such as number of cash registers and floor space of the RETAIL STORE subentity). The practical benefit of supertyping and subtyping is that they make a data model more directly expressive. J ust by looking at the ER diagram, you can see that sales outlets are composed of retail stores and corporate sales offices. Fig. 4.2: Supertype and Subtype Other important concepts in the area of ER modeling are domain and normalization. A domain consists of all the possible acceptable values and categories that are allowed for an attribute. It is the set of all real possible occurrences. The format or data type, such as integer, date, and character, provides a clear definition of domain. The practical benefit of domain is that it is imperative for building the data dictionary or repository, and for implementing the database consequently. Business Intelligence and Tools Unit 4 Sikkim Manipal University Page No. 87 Normalization is a process of assigning the attributes to entities which in a way reduces data redundancy, avoids data anomalies, provides a solid architecture for updating data, and reinforces the long-term integrity of the data model (the third normal form is usually adequate). Self Assessment Question(s) (SAQs) For Section 4.4 1. Discuss the basic concepts involved in ER modeling? 4.5 Dimensional Modeling Dimensional modeling is a relatively new concept compared to ER modeling. This method is simpler, more expressive, and easier to understand. This technique is mainly aimed at conceptualizing and visualizing data models as a set of measures that are described by common aspects of the business. It is useful for summarizing and rearranging the data and presenting views of the data to support data analysis. Also, the technique focuses on numeric data, such as values, counts, and weights. The basic concepts of this technique are discussed below: 4.5.1 Basic Concepts Fact A fact is a collection of related data items, consisting of measures and context data. A fact represents a business item, a business transaction, or an event that can be used to analyze the business or a business process. In a data warehouse, facts are implemented in the core tables in which the entire numeric data is stored. Dimension A dimension is a collection of members or units of the same type of views. Usually, it is represented by an axis. In a dimensional model, every data point in the fact table is associated with one and only one member from Business Intelligence and Tools Unit 4 Sikkim Manipal University Page No. 88 each of the multiple dimensions. Thus dimensions determine the contextual background for the facts. Many analytical processes are used to quantify the impact of dimensions on the facts. Dimensions are the parameters over which you can perform Online Analytical Processing (OLAP). For example, in a database for analyzing all sales of products, the common dimensions could be; time, location/region, customers, salesperson, scenarios such as actual, budgeted, or estimated numbers. Generally, dimensions are mapped to non-numeric, informative entities such as branch or employee. A dimension contains several dimension members. A dimension member is a distinct name or identifier used to determine a data item's position. For instance, all weeks, months, quarters, and years make up a time dimension, and all cities, regions, and countries make up a geography dimension. The members of a dimension can be arranged into one or more hierarchies. Each hierarchy can also have multiple hierarchy levels. Every member of a dimension does not locate on one hierarchy structure. Measure A measure is a numeric attribute of a fact, representing the performance or behavior of the business relative to the dimensions. For example, measures are the sales in revenue, the sales volume, the quantity supplied, the supply cost, etc. A measure is determined by combinations of the dimension members and is located on facts. Here, the actual numbers are called as variables. The most popular way of visualizing a dimensional model is to use a three- dimensional model using a cube. Though a dimensional model consists of more than three dimensions (to be referred to as a hypercube), a three- dimensional cube is more commonly used for practical convenience. Business Intelligence and Tools Unit 4 Sikkim Manipal University Page No. 89 4.5.2 Star and Snowflake Models There are two basic models that can be used in dimensional modeling: Star model Snowflake model Star Model After gathering the business requirements, the dimensional modeling begins with identifying facts and dimensions and so the initial dimensional model is usually star-like in appearance, with one fact in the center and one level of several dimensions around it. This star model is the basic structure for a dimensional model and has one large central table (a fact table) and a set of smaller tables (the dimension tables) arranged in a radial pattern around the central table (Refer Fig. 4.3). Fig. 4.3: Star model In the example provided in Figure 4.3, the sales is a fact table in the center and the dimension tables (time, customer, seller, manufacturing location, and product) are arranged around the fact table. Whereas, the traditional ER model has an even and balanced style of entities and complex relationships among entities, the dimensional model is very asymmetric. Even though the Business Intelligence and Tools Unit 4 Sikkim Manipal University Page No. 90 fact table in the dimensional model is joined with all the other dimension tables, there is only a single joining line connecting the fact table to the dimension tables. Snowflake Model The snowflake model is derived from the star model and is the result of decomposing one or more of the dimensions, which sometimes have hierarchies themselves. You can define the many-to-one relationships among members within a dimension table as a separate dimension table, forming a hierarchy. For example, the seller dimension in Figure 4.3 can be decomposed into sub-dimensions; outlet, region, and outlet type in Figure 4.4. Fig. 4.4: Snowflake Model Business Intelligence and Tools Unit 4 Sikkim Manipal University Page No. 91 The decomposed snowflake structure visualizes the hierarchical structure of dimensions and so it is easy for data modelers to understand the data model and for database designers to analyze the dimensions. Developers also prefer the snowflake model as it saves the data storage. However, this structure seems more complex and so might make the business users feel uncomfortable working with it compared to the simple star model. Self Assessment Question(s) (SAQs) For Section 4.5 1. Explain the significance of the dimensional data modeling technique? 4.6 Database Design Methodology for Data Warehouses The steps involved in the database design methodology for data warehouses are as follows: Step 1 Choosing the Process The process (function) refers to the subject matter of a particular data mart. Build the data mart and it should be the one that is most likely to be delivered on time, within budget, and to answer the most commercially important business questions. Step 2 Choosing the Grain Decide on what a record of the fact table is to represent. Identify the dimensions of the fact table. The grain decision for the fact table also determines the grain of each dimension table. Also include time as a core dimension, which is always present in star schemas. Step 3 Identifying and conforming the Dimensions Dimensions set the context for asking questions about the facts in the fact table. If any dimension occurs in two data marts, they must be exactly the same, or one must be a mathematical subset of the other. Business Intelligence and Tools Unit 4 Sikkim Manipal University Page No. 92 Step 4 Choosing the Fact The grain of the fact table determines which facts can be used in the data mart. Facts should be numeric and additive. Unusable facts may include: non-numeric facts non-additive facts Fact at different granularity from other facts in table. Step 5 Storing the Pre-Calculations in the Fact Table Once the facts have been selected, each of the facts is to be re-examined to determine whether there are opportunities to use pre-calculations. Step 6 Rounding out the Dimension Tables Text descriptions are added to the dimension tables and they should be as intuitive and understandable to the users as possible. Usefulness of a data mart is determined by the scope and nature of the attributes of the dimension tables. Step 7: Choosing the duration of the database Take appropriate duration measures (how far back in time the fact table goes) Step 8: Tracking slowly changing dimensions Slowly changing dimension problem means that the proper description of the old dimension data must be used with old fact data. Often, a generalized key must be assigned to important dimensions in order to distinguish multiple snapshots of dimensions over a period of time. Three basic types of slowly changing dimensions: Type 1, where a changed dimension attribute is overwritten. Type 2, where changed dimensions attribute causes creation of a new dimension record to be created. Business Intelligence and Tools Unit 4 Sikkim Manipal University Page No. 93 Type 3, where a changed dimension attribute causes an alternate attribute to be created so that both the old and new values of the attribute are simultaneously accessible in the same dimension record. Step 9: Deciding the Query priorities and the Query Modes Most critical physical design issues affecting the end-users perception includes: physical sort order of the fact table on disk the presence of pre-stored summaries or aggregations Additional physical design issues include administration, backup, indexing performance, and security. Self Assessment Question(s) (SAQs) For Section 4.6 1. List out the steps involved in the database design methodology for data warehouses and also provide various activities involved in each of these steps. 4.7 Common Mistakes in Data Modeling Following are some of the mistakes that development teams may make in an application development project during development and implementation of a logical and physical data model: 4.7.1 Failure to Address Long-term Growth To construct a data model, you may use the commands (CREATE TABLE, CREATE INDEX, etc.) to make your logical entities become physical tables. But operations that work on small tables may not perform when the table grows to production size because the tablespaces are not big enough. To avoid this, you have to focus on the physical storage of large tables during the design of the data model. Business Intelligence and Tools Unit 4 Sikkim Manipal University Page No. 94 In this scenario, you need to Estimate the size of the table and its long-term growth pattern appropriately Estimate the physical size of the tables adequately Estimate the physical size of the indexes appropriately Use this information to estimate the need for disk space both now and several years in the future. During development, validate the predictions by comparing predicted bytes per row with the actual row sizes of the tables and indexes in the development or test environments, and adjust the prediction model accordingly. 4.7.2 No mechanism to clean-out old Data When there is no plan for cleaning out old data, you may respond to emergency shortages of space either by scrambling to delete larger numbers of rows or by adding space on an ad hoc basis. But none of these is a good solution. In Oracle, the solution to deal with this problem is range partitioning (introduced in version 8.0). A range-partitioned table is physically separated into several different partitions, and Oracle adds rows to a particular partition automatically based on the value of a specific column in the row (called the partition key). The benefits of partitioning large tables include: All rows in a given range can be eliminated with a single small, efficient DDL transaction by simply dropping or truncating a partition. For queries that have a condition limiting the value of the partition key, such as partition key BETWEEN :lower_bound AND :upper_bound, the query optimizer can skip partitions that will have no matching rows, reducing I/O required to process the query, an operation known as partition pruning. Business Intelligence and Tools Unit 4 Sikkim Manipal University Page No. 95 Maintenance operations can proceed on individual partitions while the rest of the table is still available to support users. However, you need to make the following choices in advance to reap the benefits of partitioning: Which column to use as partition key. Which intervals to use as partition ranges Whether to partition the indexes, if so, how When to add new partitions and drop old ones 4.7.3 Overuse of Synthetic Keys Database designers frequently create synthetic key (or surrogate key) because it serves the purpose of a primary key but has no real meaning to the business. While there are valid reasons to use a synthetic key, it is going overboard to extend the practice to every table in a data model, for these reasons: There are tables for which a synthetic key simply serves no purpose. The primary key index is a waste of space, and the server must waste time updating the index during inserts and deletes. One example would be a table of log entries; it probably does not need a key at all. Another example would be the detail table in a master/detail relationship. Detail records are typically fetched by the master key; there is little benefit in giving the detail rows their own synthetic key. Simultaneous transactions that insert new rows must serialize on the sequence generator as they each take turns getting new key values. If there are dozens or hundreds of such transactions running in parallel, as in an OLTP system, then such serializing will limit performance. (Sequence caching can mitigate this serializing effect in Oracle Parallel Server or Oracle Real Application Cluster, but it does not help at all on a single-instance database.) Business Intelligence and Tools Unit 4 Sikkim Manipal University Page No. 96 So use of synthetic key is useful only in the following situations: The natural key is something that might change over time, such as a car's license number. A synthetic key can remain fixed while the other attributes change. The natural key has many columns, and there are many places in your application where a program must fetch a single row by key value. A synthetic key would allow the application to identify a row by only one value. The natural key has many columns, and the table is the target of foreign references. It is generally impractical to repeat all of the natural key columns in the child tables; a synthetic key allows you to create foreign references with just one column. 4.7.4 Inadequate Planning for Indexes The most difficult task in creating a logical data model is to decide which columns to include in indexes. But there are some rules of thumb that can serve as a starting point. These generally require a decent understanding of the way the tables will be accessed by the application programs, report programs, and (if possible) ad hoc analysis. Foreign Key Indexes A foreign key index is useful only if at least one of the following is true: When a query plan in your application starts by fetching rows in the parent table, and then fetches rows in the child table by way of the foreign key, then use of an index on the foreign key is useful. When your application deletes rows in the parent table, or updates key columns in the parent table, Oracle can use the foreign key index on the child table to enforce the foreign key constraint. If neither of these situations applies, then indexing the foreign key is not worth. Business Intelligence and Tools Unit 4 Sikkim Manipal University Page No. 97 Low-cardinality Columns If you need an index for low-cardinality columns, you might use a bitmap index, since bitmap indexes have many benefits when dealing with low- cardinality columns. However, because a single block in a bitmap index can contain thousands of rows, concurrency can be a problem during the inserts or deletes. The only way to know for sure is to try a realistic benchmark with and without the index. Order of Indexed Columns The order of the columns in the index can also affect its usefulness, though this is less of an issue with the new skip-scanning feature of Oracle 9i. Generally speaking, columns that are high-cardinality (i.e., have many distinct values in the table) and are used in the WHERE clauses should generally be first in the index. The idea is to allow the query processor to find the rows you want by searching as few blocks as possible. To understand this concept, consider a telephone book as an analogy. The entries are sorted by last name, first name, street name, and house number. When you are searching for a person by last name, the alphabetical organization allows you to proceed to the correct line very quickly. But it does not help at all if you are searching by street name, the third field in the sort order. Instead, you would need a book where street name is the primary sort field. Reviewing Index Definitions During development, you need to figure out which indexes you require. Here, you may have to determine the indexes you need by checking the query plans for important queries that your application uses (Make sure the indexes are adequate to support those queries) Business Intelligence and Tools Unit 4 Sikkim Manipal University Page No. 98 determine the indexes not required (use index monitoring feature in Oracle 9i) The only purposes for indexes are to enforce constraints (primary key, unique, and foreign key) and to improve the query performance. Any index that does not do one of these two things is of no use. Self Assessment Question(s) (SAQs) For Section 4.7 1. Mention the frequent mistakes that the development teams may make while developing and implementing a logical and physical data model in an application development project and also suggest how one can avoid making these mistakes? 4.8 Organizing the Data in a Data Warehouse There are two leading approaches to organize the data in a data warehouse; the Dimensional approach (advocated by Ralph Kimball) and the Normalized approach (advocated by Bill Inmon). In the "dimensional" approach, the transaction data is partitioned into either measured "facts" (which are generally numeric data that captures specific values) or "dimensions" (which contain the reference information that gives each transaction its context). For example, a sales transaction can be broken up into facts (like number of products ordered, the price paid, etc.) and dimensions (like date, customer ID, product, geographical location, salesperson, etc.). The advantage of this approach is that it is easy for the business staff to understand and use the data warehouse. Also, the data warehouse tends to operate very quickly because the data is pre-joined into the dimensional form. However, the disadvantage of this approach is that it is quite difficult to incorporate the changes if the company changes the way in which it does business. Business Intelligence and Tools Unit 4 Sikkim Manipal University Page No. 99 In the "normalized" approach, the data in the data warehouse is stored in third normal form. The tables in the database are then grouped together by subject areas so as to reflect the general definition of the data (Customer, Product, Finance, etc.). The advantage of this approach is that it is easy to modify the information in the database. But the disadvantage of this approach is that because of the number of tables involved, it can be rather slow to produce information and reports. Furthermore, since the segregation of facts and dimensions is not explicit in this type of data model, it is difficult for users to join the required data elements into meaningful information without a precise understanding of the data structure. Subject areas are just a method of organizing information and can be defined along any lines. The traditional approach has subjects defined as the subjects or nouns within a problem space. E.g. in a Financial Services business area, you might have Customers, products, Contracts, etc. An alternative approach is to organize around the business transactions. e.g.: Customer enrollment, Sales, Trades. 4.9 Summary A data model is a well-organized abstraction of the data in a data warehouse. The purpose of modeling is to provide an accurate record of some aspect of the real world in a particular context. Data modeling techniques that are relevant in a data warehousing environment are Entity Relationship modeling (ER modeling) and dimensional modeling. ER modeling uses two basic concepts; entities and relationships between them. Also, there are attributes, which can be properties of either the entities or the relationships. An entity is defined to be a person, place, thing, or event of interest to the business or the organization. It represents a class of objects, which are things in the real business world that can be observed Business Intelligence and Tools Unit 4 Sikkim Manipal University Page No. 100 and classified by their properties and characteristics. Relationships represent the structural interaction and association among the entities in a model and it is represented with lines drawn between the two specific entities. The dimensional modeling uses three basic concepts; measures, facts, and dimensions. This is powerful in representing the requirements of the business user in the context of database tables and also in the area of data warehousing. A fact is a collection of related data items, consisting of measures and context data. A dimension is a collection of members or units of the same type of views. A measure is a numeric attribute of a fact, representing the performance or behavior of the business relative to the dimensions. There are two basic models that can be used in dimensional modeling; Star model and Snowflake model. This star model is the basic structure for a dimensional model and has one large central table (a fact table) and a set of smaller tables (the dimension tables) arranged in a radial pattern around the central table. The snowflake model is derived from the star model and is the result of decomposing one or more of the dimensions, which sometimes have hierarchies themselves. The steps involved in the database design methodology for data warehouses are as follows; choosing the process, choosing the grain, identifying and conforming the dimensions, choosing the facts, storing pre-calculations in the fact table, rounding out the dimension tables, choosing the duration of the database, tracking slowly changing dimensions, and deciding the query priorities and the query modes. 4.10 Terminal Questions (TQs) 1. Explain the significance of data modeling in the context of data warehousing? Business Intelligence and Tools Unit 4 Sikkim Manipal University Page No. 101 2. Discuss the similarities and differences between the two data modeling techniques; ER modeling and Dimensional modeling? 3. What are the approaches involved in organizing the data in a warehouse? Discuss the merits and demerits of each of these approaches? 4.11 Multiple Choice Questions (MCQs) 1. Which of the following data model consists of Attributes? a. ER Modeling b. Dimensional modeling c. Both (a) and (b) d. None of the above 2. Which of the following statements is False? a. ER model is a type of data model that uses the concepts; entities and relationships b. ER model is more preferred data model in the context of data warehousing compared to any of the other data models c. Dimensional modeling uses the concepts; measures, facts, and dimensions d. Attributes in an ER model provide the properties of the entities or the relationships 3. In ER modeling, a Domain is _____________. a. the existing set of entities in a data warehouse b. the existing set of relationships between the entities c. the set of all possible values and categories allowed for an attribute d. the set of all possible values and categories allowed for a derived attribute 4. The relationships between the entities in an ER model are defined in terms of ___________. a. Dimensions b. Degree of relationship Business Intelligence and Tools Unit 4 Sikkim Manipal University Page No. 102 c. Cardinality d. Measures 5. The IS A relationship in an ER model exists between ______________. a. Supertype and Subtype entities b. Any of the two constraints c. Attribute and derived attributes d. Entities and their attributes 6. Which of the following is a numeric attribute of a Fact? a. Data point b. Derived fact c. Dimension d. Measure 7. The central table in a snowflake model is a ________ a. Dimension table b. Fact c. Measure d. Can be any 8. Which of the following approaches are used to organize the data in a data warehouse? a. Entity-Relationship approach and Dimensional approach b. Entity-Relationship approach and Normalized approach c. Facts approach and Dimension approach d. Dimensional approach and Normalized approach 9. In which of the following modeling techniques, the supertype and subtype entities are commonly used? a. ER Modeling b. Dimensional modeling c. Both (A) and (B) d. None of the above Business Intelligence and Tools Unit 4 Sikkim Manipal University Page No. 103 10. The normalized approach of organizing the data has been proposed by __________. a. Ralph Kimball b. E.R. Muller c. Andy Groove d. Bill Inmon 4.12 Answers to SAQs, TQs, and MCQs 4.12.1 Answers to Self Assessment Questions (SAQs) Section 4.4 1. ER modeling uses two basic concepts; entities and relationships between them. Also, there are attributes, which can be properties of either the entities or the relationships. An entity is defined to be a person, place, thing, or event of interest to the business or the organization. It represents a class of objects, which are things in the real business world that can be observed and classified by their properties and characteristics. Relationships represent the structural interaction and association among the entities in a model and is represented with lines drawn between the two specific entities. Section 4.5 1. Dimensional modeling is simpler and a more expressive technique compared to ER modeling. This technique is mainly aimed at conceptualizing and visualizing data models as a set of measures that are described by common aspects of the business. Also, the technique is useful for summarizing and rearranging the data and presenting views of the data to support data analysis. Apart from these, the technique focuses on numeric data, such as values, counts, and weights. Business Intelligence and Tools Unit 4 Sikkim Manipal University Page No. 104 Section 4.6 1. The steps involved in the database design methodology for data warehouses are as follows: Choosing the Process, Choosing the Grain, Identifying and conforming the Dimensions, Choosing the Facts, Storing Pre-Calculations in the Fact Table, Rounding out the Dimension Tables, Choosing the duration of the database, Tracking slowly changing dimensions, Deciding the Query priorities and the Query Modes. These steps are detailed in Section 4.6. Section 4.7 1. Some of the mistakes that development teams may make during the development and implementation of a logical and physical data model include failure to address the long-term growth, use of no specific mechanism to clean-out the old data, overuse of the synthetic keys, and inadequate planning for indexes. The team has to initiate appropriate actions to overcome these mistakes. 4.12.2 Answers to Terminal Questions (TQs) 1. A data model is a well-organized abstraction of the data and so the data model has become an appropriate method to understand and manage the business of the organization. Otherwise, it is very difficult to organize the structure and contents of the data in the data warehouse. The data models are critical in understanding the data in an analytical environment of a data warehouse and so act as a blueprint for the information requirements of an organization. The purpose of modeling is to provide an accurate record of some aspect of the real world in a particular context. The objective of the modeling is to provide a record of accurate and meaningful business data definitions, to be useful in the identification of valid, consistent business data structures that contain sufficient Business Intelligence and Tools Unit 4 Sikkim Manipal University Page No. 105 information to run and manage the business, to act as an indication of the similarities and differences between data from different sources and the relationships between them. 2. The two techniques for data modeling in a data warehouse environment sometimes look very different from each other, but they have many similarities. Dimensional modeling can use the same notation, such as entity, relationship, attribute, and primary key. And, in general, you can say that a fact is just an entity in which the primary key is a combination of foreign keys, and the foreign keys refer the dimensions. Therefore, we could say that dimensional modeling is a special form of ER modeling. An ER model provides the structure and content definition of the informational needs of the corporation, which is the base for designing the data warehouse. A conclusion that can be drawn from the above discussion is that the two data modeling techniques have their own strengths and weaknesses, and either of them can be used in the appropriate situation. 3. The leading approaches to organize the data in a data warehouse are Ralph Kimballs Dimensional approach and the Bill Inmons Normalized approach. In the "dimensional" approach, the transaction data is partitioned into either measured "facts" (which are generally numeric data that captures specific values) or "dimensions" (which contain the reference information that gives each transaction its context). This enables the business staff to understand and use the data warehouse. Also, the data warehouse tends to operate very quickly because the data is pre-joined into the dimensional form. But there will be some difficulty to incorporate the changes if the company changes the way in which it does business. In the "normalized" approach, the data in the data warehouse is stored in third normal form. The tables in Business Intelligence and Tools Unit 4 Sikkim Manipal University Page No. 106 the database are then grouped together by subject areas so as to reflect the general definition of the data. Through use of this approach, it is easy to modify the information in the database. 4.12.3 Answers to Multiple Choice Questions (MCQs) 1. Ans: a 2. Ans: b 3. Ans: c 4. Ans: c 5. Ans: a 6. Ans: d 7. Ans: b 8. Ans: d 9. Ans: a 10. Ans. d