Data Mining Basics

UNIT-I

Introduction
• Data mining is a technique for uncovering the interesting data patterns hidden in large data sets.
Write short notes on Data mining (or)
Define data mining (or) What is data mining? (5 Marks)
Data mining: It refers to extracting or mining knowledge from large amounts of data. Data mining is the
technique used to extract hidden information from data warehouses.
• Data mining is the process of analyzing data from different perspectives and summarizing it into
useful information.
Example: The mining of gold from rocks or sand is referred to as gold mining rather than rock or sand
mining.
Data mining can also be defined as the process of discovering meaningful new correlations, patterns and
trends by sifting through large amounts of data stored in repositories, using pattern recognition techniques as
well as statistical and mathematical techniques.
Data mining should have been more appropriately named “Knowledge mining from data”.
Applications:
 Business management,
 Production control,
 Market analysis,
 Engineering design,
 Science Exploration.
The Evolution of database technology:
• Data collection and database creation: primitive file processing (1960s)
• Database management systems (1970s to early 1980s):
  - Hierarchical and network database systems
  - Relational database systems
  - Data modeling tools: ER model
  - Indexing and data organization techniques: B+-tree, hashing; query language: SQL, etc.
  - User interfaces, forms and reports
  - Query processing and query optimization
  - Transaction management: recovery, concurrency control, online transaction processing
• Advanced database management systems (mid-1980s onwards): object-oriented and object-relational
  models, spatial and multimedia databases
• Data warehousing and data mining (late 1980s onwards): data warehouse and OLAP technology,
  data mining and knowledge discovery
• Web-based database systems (1990s onwards): XML-based database systems, web mining
• New generation of integrated information systems (2000s)
Different names for data mining:
 Knowledge mining from databases
 Knowledge extraction
 Data/pattern analysis

Knowledge discovery in databases (KDD)

(Figure: the KDD process. Data in databases and flat files undergo cleaning and integration into a data
warehouse; selection and transformation produce task-relevant data; data mining uncovers patterns; and
evaluation and presentation turn the patterns into knowledge.)

Data cleaning: To remove noise and inconsistent data


Data integration: where multiple data sources may be combined
Data selection: Where data relevant to the analysis task are retrieved from the database
Data transformation: Where data are transformed or consolidated into forms appropriate for mining by
performing summary or aggregation operations.
Data mining: An essential process where intelligent methods are applied in order to extract data patterns
Pattern evaluation: To identify the truly interesting patterns representing knowledge based on some
interestingness measures.
Knowledge presentation: Where visualization and knowledge representation techniques are used to present
the mined knowledge to the user.
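Read as a pipeline, these steps chain together. Below is a minimal Python sketch of that chain; the records,
attribute names and the trivial pattern-counting step are hypothetical illustrations, not part of any
particular system.

raw_records = [
    {"age": 25, "income": 30000},
    {"age": None, "income": 45000},   # incomplete record
    {"age": 32, "income": 45000},
]

def clean(records):
    # Data cleaning: drop records with missing (noisy/incomplete) values.
    return [r for r in records if all(v is not None for v in r.values())]

def select_and_transform(records):
    # Data selection and transformation: keep relevant attributes, bucket income.
    return [{"age": r["age"], "income_band": r["income"] // 10000 * 10000}
            for r in records]

def mine(records):
    # Data mining: count co-occurring attribute values (a trivial "pattern").
    counts = {}
    for r in records:
        key = (r["age"] // 10 * 10, r["income_band"])
        counts[key] = counts.get(key, 0) + 1
    return counts

def evaluate(patterns, min_count=1):
    # Pattern evaluation: keep patterns meeting an interestingness threshold.
    return {k: v for k, v in patterns.items() if v >= min_count}

print(evaluate(mine(select_and_transform(clean(raw_records)))))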

Explain the Architecture of typical Data Mining System. (or) (5 Marks)


What are the major components of a Data Mining System?
Data Mining is the process of discovering interesting knowledge from large amounts of data stored
either in databases, Data Warehouse or other information repositories. On the basis of this view, the
architecture of a typical Data Mining System may have the following major components.

Database, Data Warehouse or other information repository:
This is one or a set of databases, data warehouses, spreadsheets or other kinds of information
repositories. Data Cleaning and Data Integration techniques may be performed on the data.

Database or Data Warehouse Server:


The Database or Data Warehouse Server is responsible for fetching the relevant data, based on the
user’s Data Mining request.
Knowledge Base:
This is the domain knowledge that is used to guide the search, or evaluate the interestingness of
resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute
values into different levels of abstraction. Some examples are additional interestingness constraints or
thresholds and metadata.

(Figure: architecture of a typical data mining system. A graphical user interface sits on top of a pattern
evaluation module; below it, the data mining engine works with the knowledge base; a database or data
warehouse server fetches data from the underlying database and data warehouse, which are populated
through data cleaning, data integration and filtering.)
Data Mining Engine:


This is essential to the Data Mining system and consists of a set of functional modules for tasks such
as Characterization, Association, Classification, Cluster Analysis, and Evolution and Deviation Analysis.
Pattern evaluation module:
This component typically employs interestingness measures and interacts with the data mining
modules so as to focus the search towards interesting patterns. It may use interestingness thresholds to filter
out discovered patterns.
Graphical user interface:
This module communicates between users and the data mining system, allowing the user to interact
with the system by specifying a data mining query or task, providing information to help focus the search
and performing exploratory data mining based on the intermediate data mining result.
This component allows the user to browse database and data warehouse schemas or data structures,
evaluate mined patterns, and visualize the patterns in different forms.
Explain the different kinds of data in data mining. (or) (5 Marks)
Data mining: on what kinds of data?

Data mining should be applicable to any kind of information repository. This includes relational
databases, data warehouses, transactional databases, advanced database systems, flat files, and the World
Wide Web.
Relational databases: A database system, also called a database management system (DBMS), consists of a
collection of interrelated data, known as a database, and a set of software programs to manage and access the
data.
Relational database is a collection of tables each of which is assigned a unique name. Each table
consists of a set of attributes and usually stores a large set of tuples.
Each tuple in a relational table represents an object identified by a unique key and described by a set
of attribute values.
A semantic data model, such as an entity-relationship (ER) data model, which models the database
as a set of entities and their relationships, is often constructed for relational databases. Relational data can be
accessed by database queries written in a relational query language such as SQL.
Data warehouses:
A data warehouse is a repository of information collected from multiple sources, stored under a
unified schema and which usually resides at a single site. Data warehouses are constructed via a process of
data cleaning, data transformation, data integration, data loading and periodic data refreshing.

client
Data warehouse
Data sources
clean
transform Query and tools
integrate
load

client

Typical architecture of a data warehouse

A data warehouse is usually modeled by a multidimensional database structure, where each dimension
corresponds to an attribute or a set of attributes in the schema, and each cell stores the value of some
aggregate measure such as count or sales_amount.
It provides a multidimensional view of data and allows the precomputation and fast accessing of
summarized data.
Data mart:
A data mart is the access layer of the data warehouse environment that is used to get data out to the
users. The data mart is a subset of the data warehouse, usually oriented to a specific business line or team.
A data warehouse collects information about subjects that span an entire organization, and thus its scope
is enterprise-wide. A data mart is a departmental subset of a data warehouse. It focuses on selected subjects,
and its scope is department-wide.
On-line Analytical Processing (OLAP):
OLAP operations make use of background knowledge regarding the domain of the data being studied
in order to allow the presentation of data at different levels of abstraction.
Transactional Databases:
Transactional databases consist of a file where each record represents a transaction. A transaction
includes a unique transaction identity number and a list of the items making up the transaction.
Trans_id | List of item_ids
T100     | I1, I3, I4
T200     | I2, I8
Advanced Database Systems and Advanced Database Applications:

New database applications include handling spatial data, engineering design data, hypertext and
multimedia data, time-related data and the World Wide Web.
These applications require efficient data structures and scalable methods for handling complex object
structures, variable-length records, semi-structured or unstructured data, text and multimedia data, and
database schemas with complex structures and dynamic changes.
While such databases or information repositories require sophisticated facilities to efficiently store, retrieve
and update large amounts of complex data, they also provide fertile ground and raise many challenging
research and implementation issues for data mining.
Object-Oriented Databases:
Object-oriented databases are based on the object-oriented programming paradigm, where, in general terms,
each entity is considered as an object. Data and code relating to an object are encapsulated into a single unit.
Each object has associated with it the following:
1. A set of variables that describe the object.
2. A set of messages that the object can use to communicate with other objects or with the rest of the
database system.
3. A set of methods, where each method holds the code to implement a message.
Object Relational databases:
Object-relational databases are constructed based on an object-relational data model. This model
extends the relational model by providing rich data types for handling complex objects and object
orientation.
Spatial databases:
Spatial databases contain spatial-related information; such databases include geographic (map) databases,
VLSI chip design databases, and medical and satellite image databases.
Temporal Databases and Time Series Databases:
Temporal databases and time-series databases both store time-related data. A temporal database
usually stores relational data that include time-related attributes. These attributes may involve several
timestamps, each having different semantics.
A time-series database stores sequences of values that change with time, such as data collected
regarding the stock exchange.
Sequence Database:
A sequence database stores sequences of ordered events, with or without a concrete notion of time;
e.g., customer shopping sequences, web click streams.

Text Databases and Multimedia Database:


Text databases are databases that contain word descriptions for objects. These word descriptions are
usually not simple keywords but rather long sentences or paragraphs, such as product specifications, error or
bug reports, warning messages, summary reports, notes or other descriptions.
Multimedia databases store image, audio and video data. They are used in applications such as
picture content-based retrieval, voice-mail systems, video-on-demand systems, the WWW and speech-based
user interfaces that recognize spoken commands.

Heterogeneous Databases And Legacy Databases:


A legacy database is a group of heterogeneous databases that combines different kinds of data
systems, such as relational or object-oriented databases, hierarchical databases, network databases,
spreadsheets, multimedia databases or file systems. The heterogeneous databases in a legacy database may
be connected by intra- or inter-computer networks.

The World Wide Web:


The WWW and its associated distributed information services, such as America Online, Yahoo!,
AltaVista and Prodigy, provide rich online information services, where data objects are linked together to
facilitate interactive access. Users seeking information of interest traverse from one object via links to
another.

Write short notes on data mining functionalities. (Or) Explain about what kinds of patterns can be
mined. (10 Marks)
Data mining tasks can be classified into two categories: descriptive and predictive.
Descriptive: Descriptive mining tasks characterize the general properties of the data in the database.
Predictive: Predictive mining tasks perform inference on the current data in order to make predictions.
Concept/Class Description: Characterization and Discrimination:
Data can be associated with classes or concepts. For example, in an AllElectronics store, classes of
items include computers and printers, and concepts of customers include bigSpenders and budgetSpenders.
These descriptions can be derived via:
1. Data characterization by summarizing the data of the class under study called target class.
2. Data discrimination by comparison of the target class with one or a set of comparative classes called
contrasting classes.
Data characterization
It is summarization of the general characterization or features of a target class of data. The data
corresponding to the user specified class are typically collected by a database query.
The data cube-based OLAP roll-up operation can be used to perform user-controlled data
summarization along a specified dimension.
An attribute oriented induction technique can be used to perform data generalization and
characterization without step by step user interaction. The output of data characterization can be presented
in various forms: Pie charts, bar charts, curves, multidimensional data cubes.
Data Discrimination
It is a comparison of the general features of target class data objects with the general features of
objects from one or a set of contrasting classes. The target and contrasting classes can be specified by the
user, and the corresponding data objects retrieved through database queries.
The output of data discrimination is a set of comparative measures that help distinguish between the target
and contrasting classes.
Association analysis:
Association analysis is the discovery of association rules showing attributes value conditions that
occur frequently together in a given set of data. Association analysis is widely used for market basket or
transaction data analysis.
Example: Association rules are of the form X ⇒ Y, that is, A1 ∧ A2 ∧ ... ∧ Am → B1 ∧ B2 ∧ ... ∧ Bn,
where Ai (for i ∈ {1, 2, ..., m}) and Bj (for j ∈ {1, 2, ..., n}) are attribute-value pairs. That is, "database
tuples that satisfy the conditions in X are also likely to satisfy the conditions in Y."
Example: age(X, "20...29") ∧ income(X, "20K...29K") ⇒ buys(X, "CD player")
[support = 2%, confidence = 60%]
Adopting the terminology used in multidimensional databases, where each attribute is referred to as a
dimension, the above rule can be referred to as a multidimensional association rule.

Example: contains(T, "computer") ⇒ contains(T, "software")
[support = 1%, confidence = 50%]
Association rules that contain a single predicate are referred to as single-dimensional association
rules. Dropping the predicate notation, the above rule can be written simply as
"computer ⇒ software [1%, 50%]".
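To make support and confidence concrete, here is a minimal Python sketch that computes both for a rule
X ⇒ Y; the handful of transactions is hypothetical.

transactions = [
    {"computer", "software", "printer"},
    {"computer", "software"},
    {"computer"},
    {"printer"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    # P(Y | X) = support(X union Y) / support(X).
    return support(x | y) / support(x)

x, y = {"computer"}, {"software"}
print(f"support = {support(x | y):.0%}, confidence = {confidence(x, y):.0%}")
# support = 50%, confidence = 67% on this toy data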
Classification and Prediction:
Classification is the process of finding a set of models or functions that describe and distinguish
data classes or concepts, for the purpose of being able to use the model to predict the class of objects.
How is the derived model presented?
The derived model may be represented in various forms, such as classification if-then rules, decision
trees, mathematical etc,. A decision tree is a flow chart like a tree structure where each node denotes a test
on an attribute value, each branch represents an outcome of the test and tree leaves represent classes or class
distributions.
Prediction: Users may wish to predict some missing or unavailable data values rather than class
labels.
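As a toy illustration of the if-then rule form, the Python sketch below hard-codes a rule set of the kind a
decision tree learner might produce; the attribute thresholds and class labels are hypothetical.

def classify_customer(age, income):
    # Each test corresponds to an internal node of a decision tree;
    # each return value corresponds to a leaf (class label).
    if age < 30:
        return "budget_spender" if income < 40000 else "big_spender"
    return "big_spender" if income >= 60000 else "budget_spender"

print(classify_customer(age=25, income=50000))  # big_spender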
Cluster Analysis: The objects are clustered or grouped based on the principle of maximizing the intraclass
similarity and minimizing the interclass similarity.
Outlier Analysis:

A database may contain data objects that do not comply with the general behavior or model of the
data. These data objects are outliers. Most data mining methods discard outliers as noise or exceptions.
(Figure: data points forming clusters, i.e. groupings of items of the same type, with outliers lying
outside the clusters.)
Evolution analysis:
Data evolution analysis describes and models regularities or trends for objects whose behavior
changes over time.
Discuss about classification of data mining system (or)
Write notes on classification of data mining system. (5 Marks)
Classification according to the kinds of databases mined:
A data mining system can be classified according to the kinds of databases mined. Database systems
themselves can be classified according to different criteria. For example: relational, transactional, object
oriented, or data warehouse.
Classification according to the kinds of knowledge mined:
Data mining systems can be categorized based on data mining functionalities, such as
characterization, discrimination, association, classification, clustering, outlier analysis and evolution
analysis.
Classification according to the kinds of techniques utilized:
These techniques can be described according to the degree of user interaction involved or the
methods of data analysis employed visualization, pattern recognition, and neural networks and so on.

(Figure: data mining as a confluence of multiple disciplines: database technology, statistics, machine
learning, information science, visualization and other disciplines.)

Classification according to the applications adapted:


Data mining systems can also be categorized according to the applications they adapt. For example:
stock markets, e-mail, telecommunication and so on.

Write about major issues in data mining. (or)


Discuss about major issues in data mining. (5Marks)
Mining methodology and user interaction issues:
These reflect the kinds of knowledge mined, the ability to mine knowledge at multiple granularities.

1) Mining different kinds of knowledge in database: Since different users can be interested in
different kinds of knowledge, data mining should cover a wide spectrum of data analysis and
knowledge discovery tasks.
2) Interactive mining of knowledge at multiple levels of abstraction: Since it is difficult to know
exactly what can be discovered within a database, the data mining process should be interactive. For
databases containing a huge amount of data, appropriate sampling techniques can first be applied to
facilitate interactive data exploration.
3) Incorporation of background knowledge: Background knowledge, or information regarding the
domain under study, may be used to guide the discovery process and allow discovered patterns to be
expressed in concise terms and at different levels of abstraction.
4) Data mining query languages and ad hoc data mining: A relational query language allows users
to pose ad hoc queries for data retrieval. In a similar vein, high level data mining query languages
need to be developed to allow users to describe ad hoc data mining tasks by facilitating the
specification of the relevant sets of data for analysis.
5) Presentation and visualization of data mining results: Discovered knowledge should be expressed
in high level languages, visual representations, or other expressive forms so that the knowledge can
be easily understood and directly usable by humans.
6) Handling noisy or incomplete data: The data stored in a database may reflect noise, exceptional
cases, or incomplete data objects. When mining data regularities, these objects may confuse the
process, causing the knowledge model constructed to overfit the data.
7) Pattern evaluation: The interestingness problem: a data mining system can uncover thousands of
patterns. Many of the patterns discovered may be uninteresting to the given user, representing
common knowledge or lacking novelty.
Performance issues: These include efficiency, scalability and parallelization of data mining algorithms:
1) Efficiency and scalability of data mining algorithms: To effectively extract information from a
huge amount of data in databases, data mining algorithms must be efficient and scalable.
2) Parallel, distributed and incremental mining algorithms: the huge size of many databases, the
wide distribution of data, and the computational complexity of some data mining methods are factors
motivating the development of parallel and distributed data mining algorithms.
Issues relating to the diversity of database types:
1) Handling of relational and complex types of data: Since relational databases and data warehouses
are widely used, the development of efficient and effective data mining systems for such data is
important. It may contain complex data objects, hypertext and multimedia data, spatial data,
temporal data or transaction data.
2) Mining information from heterogeneous databases and global information systems: Local- and
wide-area computer networks connect many sources of data, forming huge, distributed and
heterogeneous databases.

Data Warehouse and OLAP technology for data mining:


What is data warehouse? (or) Define data warehouse. (5 Marks)
A data warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of data in
support of management's decision-making process. The four keywords (subject-oriented, integrated,
time-variant and nonvolatile) distinguish data warehouses from other data repository systems such as
relational database systems, transaction processing systems and file systems.
Subject-oriented:
A data warehouse is organized around major subjects, such as customer, supplier, product and sales.
Integrated:
A data warehouse is usually constructed by integrating multiple heterogeneous sources, such as
relational databases, flat files, and on-line transaction records.
Time variant:
Data are stored to provide information from a historical perspective. Every key structure in the data
warehouse contains, either implicitly or explicitly, an element of time.

Nonvolatile:
A data warehouse is always a physically separate store of data transformed from the application data
found in the operational environment.
How are organizations using the information from data warehouses?
1) Increasing customer focus, which includes the analysis of customer buying patterns.
2) Repositioning products and managing product portfolios by comparing the performance of sales by
quarter, by year and by geographic regions.
3) Analyzing operations and looking for sources of profit
4) Managing the customer relationships, making environmental corrections.

Write the difference between operational databases systems and data warehouses (5 Marks)
Databases system: On line transaction processing (OLTP)
Data warehouse: On line analytical processing(OLAP)

Feature                      | OLTP                                                        | OLAP
Users and system orientation | Customer-oriented; used for transaction and query processing | Market-oriented; used for data analysis by knowledge workers
Data contents                | Manages current data                                        | Manages large amounts of historical data
Database design              | Entity-relationship based, application-oriented design      | Star or snowflake model, subject-oriented design
View                         | Focuses mainly on the current data                          | Spans multiple versions of a database schema
Access patterns              | Short, atomic transactions                                  | Mostly read-only operations
Characteristic               | Operational processing                                      | Informational processing
Orientation                  | Transaction                                                 | Analysis
User                         | Clerk, DBA, database professional                           | Knowledge worker
Function                     | Day-to-day operations                                       | Long-term informational requirements, decision support
View                         | Detailed, flat relational                                   | Summarized, multidimensional
Unit of work                 | Short, simple transaction                                   | Complex query
Number of records accessed   | Tens                                                        | Millions
Operations                   | Index/hash on primary key                                   | Lots of scans
Priority                     | High performance, high availability                         | High flexibility, end-user autonomy
Write short notes on multidimensional data model (or)
Write the uses of multidimensional data model (or)
Explain about data cubes (10 Marks)
Data warehouses and OLAP tools are based on a multidimensional data model. This model views data
in the form of a data cube.
From tables and spreadsheets to data cubes
What is data cube?
A data cube allows data to be modeled and viewed in multiple dimensions. It is defined by
dimensions and facts. Dimensions are the perspectives or entities with respect to which an organization
wants to keep records. For example, a store may create a sales data cube in order to keep records of the
store's sales with respect to the dimensions time, item, branch and location. These dimensions allow the
store to keep track of things like monthly sales of items, and the branches and locations at which the items
were sold.
Each dimension may have a table associated with it, called a dimension table. A multidimensional
data model is typically organized around a central theme, like sales. This theme is represented by a fact
table. Facts are numerical measures.
The fact table contains the names of the facts, or measures, as well as keys to each of the related
dimension tables.
In the data warehousing research literature, a data cube such as each of the above is referred to as a
cuboid. Given a set of dimensions, we can construct a lattice of cuboids, each showing the data at a different
level of summarization, or group-by. The lattice of cuboids is then referred to as a data cube.
The cuboid that holds the lowest level of summarization is called base cuboid. The 0-D cuboid
which holds the highest level of summarization is called the apex cuboid. The apex cuboid is typically
denoted by all.
Lattice of cuboids, forming a 4-D data cube for the dimensions time, item, location and supplier:
all                                                              0-D (apex) cuboid
time; item; location; supplier                                   1-D cuboids
(time, item); (time, location); (time, supplier);
(item, location); (item, supplier); (location, supplier)         2-D cuboids
(time, item, location); (time, item, supplier);
(time, location, supplier); (item, location, supplier)           3-D cuboids
(time, item, location, supplier)                                 4-D (base) cuboid
Stars, snowflakes and fact constellations: schemas for multidimensional databases:
The most popular data model for a data warehouse is a multidimensional model. Such a model can
exist in the form of a star schema, a snowflake schema, or a fact constellation schema.
Star Schema: The most common modeling paradigm is Star Schema, in which Data Warehouse contains,
1. a large central table (Fact table)
2. a set of smaller attendant tables (Dimension Table)
The advantage of a star schema is that it is easy to understand, easy to define hierarchies on, reduces
the number of physical joins, requires low maintenance and has very simple metadata. There is a 1:N
relationship between the fact table and each dimension table.
Disadvantage: Un-normalized dimension table tends to be large and may contain redundant
information.
(Figure: star schema of a data warehouse for sales.)
Fact table sales: time_key, item_key, branch_key, location_key, dollars_sold, units_sold
Dimension table time: time_key, day, day_of_the_week, month, quarter, year
Dimension table item: item_key, item_name, brand, type, supplier_type
Dimension table branch: branch_key, branch_name, branch_type
Dimension table location: location_key, street, city, province_or_state, country
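To illustrate how the fact table's foreign keys resolve against the dimension tables, here is a minimal
pandas sketch of a star-schema query; the tables are tiny hypothetical stand-ins for the schema above.

import pandas as pd

item = pd.DataFrame({"item_key": [1, 2],
                     "item_name": ["laptop", "printer"],
                     "brand": ["A", "B"]})
time = pd.DataFrame({"time_key": [10, 11], "quarter": ["Q1", "Q2"]})
sales = pd.DataFrame({"time_key": [10, 10, 11],
                      "item_key": [1, 2, 1],
                      "dollars_sold": [1200.0, 300.0, 1100.0]})

# Resolve the fact table's keys against the dimension tables, then aggregate.
fact = sales.merge(time, on="time_key").merge(item, on="item_key")
print(fact.groupby(["quarter", "item_name"])["dollars_sold"].sum())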
Snowflake Schema:
This is a variant of the Star Schema model, where some dimension tables are normalized, thereby
further splitting the data into additional tables. The resulting schema graph forms a shape similar to a
snowflake. It consists of a single fact table and multiple dimension tables. The snowflake structure may
reduce the effectiveness of browsing across the tables, due to the larger number of join operations required.
(Figure: snowflake schema of a data warehouse for sales.)
Fact table sales: time_key, item_key, branch_key, location_key, dollars_sold, units_sold
Dimension table time: time_key, day, day_of_week, month, quarter, year
Dimension table item: item_key, item_name, brand, type, supplier_key
Dimension table supplier: supplier_key, supplier_type
Dimension table branch: branch_key, branch_name, branch_type
Dimension table location: location_key, street, city_key
Dimension table city: city_key, city, province_or_state, country
Fact Constellation Schema:
Modern and Sophisticated applications may require multiple fact tables to share dimension tables.
This kind of schema can be viewed as a collection of stars and so it is also called as galaxy schema or a fact
constellation.
(Figure: fact constellation schema of a data warehouse for sales and shipping.)
Fact table sales: time_key, item_key, branch_key, location_key, dollars_sold, units_sold
Fact table shipping: item_key, time_key, shipper_key, from_location, to_location, dollars_cost, units_shipped
Dimension table time: time_key, day, day_of_week, month, quarter, year
Dimension table item: item_key, item_name, brand, type, supplier_type
Dimension table branch: branch_key, branch_name, branch_type
Dimension table location: location_key, street, city, province_or_state, country
Dimension table shipper: shipper_key, shipper_name, location_key, shipper_type

A data mart is a departmental subset of the data warehouse that focuses on selected subjects, and thus
its scope is department-wide. For data marts, the star or snowflake schemas are commonly used, since both
are geared towards modeling single subjects, although the star schema is more popular and efficient.
Examples for defining star, snowflake and fact constellation schemas
Data warehouses and data marts can be defined using two language primitives, one for cube definition
and one for dimension definition. The cube definition statement has the following syntax:
define cube <cube_name> [<dimension_list>]: <measure_list>
The dimension definition statement has the following syntax:
define dimension <dimension_name> as (<attribute_or_subdimension_list>)
Star schema:
define cube sales_star [time, item, branch, location]: dollars_sold = sum(sales_in_dollars),
units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
Snowflake schema:
define cube sales_snowflake [time, item, branch, location]: dollars_sold = sum(sales_in_dollars),
units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier (supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city (city_key, city, province_or_state, country))
Fact constellation schema:
define cube sales [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
define cube shipping [time, item, shipper, from_location, to_location]:
dollars_cost = sum(cost_in_dollars), units_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
A define cube statement is used to define data cubes for sales and shipping corresponding to the two
fact tables of the schema.
Measures: their categorization and computation:
A data cube measure is a numerical function that can be evaluated at each point in the data cube
space. A measure value is computed for a given point by aggregating the data corresponding to the
respective dimension-value pairs defining the given point.
Measures can be organized into three categories, based on the kind of aggregate functions used.
Distributive:
An aggregate function is distributive if it can be computed in a distributed manner. Count(),
sum(), min() and max() are distributive aggregate functions.
Algebraic:
An aggregate function is algebraic if it can be computed by an algebraic function with M arguments
each of which is obtained by applying a distributive aggregate function. min_N(), max_N() and standard
deviation. A measure is algebraic if it is obtained by applying an algebraic aggregate function.
Holistic:

An aggregate function is holistic if there is no constant bound on the storage size needed to describe a
subaggregate. Examples include rank(), median() and mode(). A measure is holistic if it is obtained by
applying a holistic aggregate function.
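The following minimal Python sketch contrasts the three categories on two hypothetical data partitions
(e.g., two chunks of a fact table): sum() combines per-partition results directly, avg() needs only the
bounded pair (sum, count) per partition, while median() cannot be assembled from any constant-size
subaggregate.

import statistics

part_a, part_b = [3, 5, 8], [2, 9]

# Distributive: the global result is computable from per-partition results.
total = sum([sum(part_a), sum(part_b)])      # same as sum(part_a + part_b)

# Algebraic: avg() is computable from a bounded set of distributive results
# (sum and count from each partition).
s = sum(part_a) + sum(part_b)
n = len(part_a) + len(part_b)
avg = s / n

# Holistic: median() has no constant-size subaggregate; the partitions'
# medians alone are not enough, so all values must be seen together.
med = statistics.median(part_a + part_b)

print(total, avg, med)   # 27 5.4 5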

Explain the various OLAP Operations. (5 Marks)


(or) Write short notes on OLAP Operations.
OLAP (On-Line Analytical Processing)
In the multidimensional model, data are organized into a multidimensional data cube. It is necessary
to explore the different analytical tools with which to perform complex analysis of the data. These data
analysis tools are called OLAP (On-Line Analytical Processing) tools.
There exists a number of OLAP Operations on data cubes which allow interactive querying and
analysis of data.
The various operations of OLAP are,
 Slicing
 Dicing
 Drilling
 Drill-Up (Roll-Up)
 Drill-Down (Roll-Down)
 Drill-Within
 Drill-Across
 Pivot (Rotate)
 Other OLAP operations
Slicing: Slicing and Dicing are used for reducing the data cube by one or more dimensions. The Slice
operation performs a selection on one dimension of the given cube, resulting in a subcube.
    slice (time = 'Q2') : C[quarter, city, product] -> C[city, product]
Dicing: This operation is for selecting a smaller data cube and analyzing it from different perspectives. The
dice operation defines a subcube by performing a selection on two or more dimensions.
    dice (time in {'Q1','Q2'} and city in {'Mumbai','Pune'}) : C[quarter, city, product] -> C[quarter', city', product]
Drilling: This operation is meant for moving up and down along classification hierarchies. The different
instances of drilling operations are given below.
Drill-Up: This operation deals with switching from a detailed to an aggregated level within the same
classification hierarchy. It is also called Roll-Up, which performs aggregation on a data cube, either by
climbing up a dimension hierarchy or by dimension reduction.
    roll-up (city -> province) : C[quarter, city, product] -> C[quarter, province, product]
Drill-Down: This operation is concerned with switching from an aggregated level to a more detailed level
within the same classification hierarchy. It is the reverse of Roll-Up. Drill-Down can be realized by either
stepping down a dimension hierarchy or introducing additional dimensions.

Drill-Within: It is switching from one classification to a different one within the same dimension.
Drill-Across: It means switching from a classification in one dimension to a different classification in a
different dimension.
Pivot (Rotate): Pivot (also called "rotate") is a visualization operation which rotates the data axes in order
to provide an alternative presentation of the same data. Examples are rotating the axes in a 3-D cube,
or transforming a 3-D cube into a series of 2-D planes.
Other OLAP Operations: Some additional operations include Drill-Through, ranking the top N or bottom N
items in lists, as well as computing moving averages, growth rates, etc.
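As a concrete illustration, the minimal pandas sketch below performs a slice, a dice and a roll-up on a tiny
cube held as a flat table; the data and the city-to-province mapping are hypothetical.

import pandas as pd

cube = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "city":    ["Mumbai", "Pune", "Mumbai", "Delhi"],
    "product": ["TV", "TV", "PC", "PC"],
    "sales":   [100, 80, 120, 60],
})

# Slice: fix one dimension (time = 'Q2'), leaving a subcube over the rest.
slice_q2 = cube[cube["quarter"] == "Q2"]

# Dice: select on two or more dimensions at once.
dice = cube[cube["quarter"].isin(["Q1", "Q2"]) &
            cube["city"].isin(["Mumbai", "Pune"])]

# Roll-up: climb the location hierarchy from city to province and aggregate.
province_of = {"Mumbai": "Maharashtra", "Pune": "Maharashtra", "Delhi": "Delhi"}
rollup = (cube.assign(province=cube["city"].map(province_of))
              .groupby(["quarter", "province"])["sales"].sum())
print(rollup)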
Explain the Architecture of Data Warehouse. (5 Marks)

(or) Write short notes of Data Warehouse Architecture with a neat diagram.
The Data Warehouse Architecture deals with 2 phases such as,

1. Design and Construct a Data Warehouse


2. Three-Tier Data Warehouse Architecture.
Three-Tier Data Warehouse Architecture
Tier 1 is essentially the warehouse server, tier 2 is the OLAP engine for analytical processing, and tier
3 is a client containing reporting tools, visualization tools, data mining tools and querying tools.
There is also a back-end process concerned with extracting data from multiple operational
databases and from external sources; with cleaning, transforming and integrating this data for loading into
the data warehouse server; and, of course, with periodically refreshing the warehouse.
Design and Construction of Data Warehouse:
The design phase is considered from four different views:
• Top-Down View: It allows the selection of the relevant information necessary for the
Data Warehouse. This information matches the current and coming business needs.
• Data Source View: It exposes the information being captured, stored and managed by
operational systems. This information may be documented at various levels of detail and accuracy.
• Data Warehouse View: It includes Fact Tables and Dimension Tables. It represents the
information that is stored inside the Data Warehouse, including precalculated totals and counts.
• Business Query View: It is the perspective of data in the Data Warehouse from the viewpoint
of the end user.
(Figure: three-tier data warehouse architecture. Bottom tier: the data warehouse server, fed by extract,
clean, transform, load and refresh processes from operational databases and external sources, together
with a metadata repository, monitoring and administration tools, and data marts. Middle tier: the OLAP
server. Top tier: front-end tools for query/report, analysis and data mining.)
Write about the steps used to design and construct the data warehouses. (OR) (5 Marks)
Discuss about the basic steps involved in the designing process of data warehouse.
What does the data warehouse provide for business analysts?
 Having a data warehouse may provide a competitive advantage by presenting relevant information
from which to measure performance and make critical adjustments in order to help win over
competitors.
 A data warehouse can enhance business productivity since it is able to quickly and efficiently gather
information.
 A data warehouse facilitates customer relationship management since it provides a consistent view of
customers and items across all lines of business.
 It may bring about cost reduction by tracking trends, patterns and exceptions over long periods of
time in a consistent and reliable manner.
Four different views regarding the design of a data warehouse must be considered: the top down view,
the data source view, the data warehouse view and the business query view.
Top down view: It allows the selection of the relevant information necessary for the data warehouse.
Data source view: It exposes the information being captured, stored and managed by operational
systems.
Data warehouse view: It includes fact tables and dimension tables. It represents the information that is
stored inside the data warehouse, including precalculated totals and counts, as well as information regarding
the source, date and time of origin.
Business Query view: It is the perspective of data in the data warehouse from the viewpoint of the end
user.
The process of data warehouse design
A data warehouse can be built using a top down approach, a bottom up approach or a combination of
both.
Top down approach: It starts with overall design and planning.
Bottom up approach: It starts with experiments and prototypes.
Combined approach: An organization can exploit the planned and strategic nature of the top down
approach while retaining the rapid implementation and opportunistic application of the bottom up approach.

From the software engineering point of view, the design and construction of a data warehouse consist of the following steps:
Planning
Requirements Study
Problem analysis
Warehouse design
Data integration and testing
Deployment of the data warehouse
Large software systems can be developed using two methodologies:
1. Waterfall method 2. Spiral method
Waterfall method: A structured and systematic analysis at each step before proceeding to the next, which is
like a waterfall, falling from one step to the next.
Spiral method: The rapid generation of increasingly functional systems, with short intervals between
successive releases.
In general, the warehouse design process consists of the following steps:
• Choose a business process to model: for example, orders, invoices, shipments, inventory, account
administration, sales. If the business process is organizational and involves multiple complex object
collections, a data warehouse model should be followed.
• Choose the grain of the business process: the grain is the fundamental, atomic level of data
to be represented in the fact table for this process; for example, individual transactions, individual daily
snapshots and so on.
• Choose the dimensions that will apply to each fact table record: typical dimensions are time, item,
customer, supplier and warehouse.
• Choose the measures that will populate each fact table record: typical measures are numeric
additive quantities like dollars_sold and units_sold.
Data warehouse development tools provide functions to define and edit metadata repository contents,
answer queries, output reports and ship metadata to and from relational database system catalogues.
Planning and analysis tools study the impact of schema changes and of refresh performance when
changing refresh rates or time windows.
Three-Tier Data Warehouse Architecture: The Three-Tier consists of
1. Bottom Tier - Warehouse Database Server
2. Middle Tier - OLAP Server
3. Top Tier - Client
Bottom Tier: It is a Warehouse Database Server that is almost always a relational database system. Data
from operational databases and external sources are extracted using application program interfaces known
as gateways (e.g., JDBC).
Middle Tier: It is OLAP Server that is implemented using either
(1) Relational OLAP (ROLAP) model – extended RDBMS that maps multidimensional data to
standard relational operations
(2) Multidimensional OLAP (MOLAP) model – a special-purpose server that directly implements
multidimensional data operations.
Top Tier: It is a client, which contains query and reporting tools, analysis tools and/or data mining tools.
From the architecture point of view, there are three Data Warehouse models. They are:
• Enterprise Warehouse:
It collects all of the information about subjects spanning the entire organization.
• Data Mart:
It contains a subset of corporate-wide data that is of value to a specific group of users.
• Virtual Warehouse:
It is a set of views over operational databases.
Data warehouse Multidimensional view and Implementation
What is a concept hierarchy? (Or) Define concept hierarchy. (5 Marks)
(Figure: a concept hierarchy for the dimension location: cities such as Vancouver and Victoria roll up to
provinces or states such as British Columbia, Ontario, New York and Illinois, which roll up to the countries
Canada and USA, and finally to all.)
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level,
more general concepts. Many concept hierarchies are implicit within the database schema. For example,
suppose that the dimension location is described by the attributes number, street, city, province_or_state,
zipcode and country. These attributes are related by a total order, forming a concept hierarchy such as
"street < city < province_or_state < country".
The attributes of a dimension may also be organized in a partial order, forming a lattice. A concept
hierarchy that is a total or partial order among attributes in a database schema is called a schema hierarchy.
Concept hierarchies may also be defined by discretizing or grouping values for a given dimension or
attribute, resulting in a set-grouping hierarchy.

(Figure: a hierarchy for location: street < city < province_or_state < country; and a lattice for time:
day < {month < quarter, week}, a partial order, since weeks do not nest inside months or quarters.)
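A minimal Python sketch of both kinds of hierarchy follows; the street, city and province names, and the
age ranges, are hypothetical.

# Schema hierarchy street < city < province_or_state < country as chained mappings.
city_of = {"MG Road": "Bangalore", "FC Road": "Pune"}
province_of = {"Bangalore": "Karnataka", "Pune": "Maharashtra"}
country_of = {"Karnataka": "India", "Maharashtra": "India"}

def generalize(street, level):
    # Walk up the hierarchy to the requested level of abstraction.
    city = city_of[street]
    if level == "city":
        return city
    province = province_of[city]
    return province if level == "province_or_state" else country_of[province]

print(generalize("MG Road", "country"))  # India

# A set-grouping hierarchy instead groups raw values into higher-level sets,
# e.g. ages 20-39 -> "young", 40-59 -> "middle_aged".
def age_group(age):
    return "young" if age < 40 else "middle_aged" if age < 60 else "senior"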
Discuss about data warehouse implementation. (Or) (10 Marks)
Write about implementation of data warehousing.
Data warehouses contain huge volumes of data. OLAP servers demand that decision support
queries be answered in the order of seconds.
Efficient computation of Data Cubes
At the core of multidimensional data analysis is the efficient computation of aggregations across many sets
of dimensions. In SQL, these aggregations are referred to as group-by's.
The compute cube operator and its implementation
One approach to cube computation extends SQL so as to include a compute cube operator. The
compute cube operator computes aggregates over all subsets of the dimensions specified in the operation.
Example:
Suppose we would like to create a data cube for AllElectronics sales that contains the following:
item, city, year, and sales_in_dollars. To analyze the data, use the following queries.
“Compute the sum of sales, grouping by item and city”
“Compute the sum of sales, grouping by item”
“Compute the sum of sales, grouping by city”
Taking the three attributes city, item and year as three dimensions, and sales_in_dollars as the
measure, the total number of cuboids, or group-by's, that can be computed for this data cube is 2^3 = 8.
The possible group-by's are the following: {(city, item, year), (city, item), (city, year), (item, year), (city),
(item), (year), ()}, where () means that the group-by is empty.
()                                         0-D (apex) cuboid
(city); (item); (year)                     1-D cuboids
(city, item); (city, year); (item, year)   2-D cuboids
(city, item, year)                         3-D (base) cuboid
An SQL query containing no group-by, such as "compute the sum of total sales", is a zero-
dimensional operation. An SQL query containing one group-by, such as "compute the sum of sales, group
by city", is a one-dimensional operation.
A cube operator on n dimensions is equivalent to a collection of group-by statements, one for each
subset of the n dimensions. Therefore, the cube operator is the n-dimensional generalization of the group-by
operator.
define cube sales [item, city, year]: sum(sales_in_dollars)
For a cube with n dimensions, there are a total of 2^n cuboids, including the base cuboid. The
statement
compute cube sales
explicitly instructs the system to compute the sales aggregate cuboids for all of the eight
subsets of the set {item, city, year}, including the empty subset.
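The 2^n group-by's are just the subsets of the dimension set, as the following minimal Python sketch
enumerates.

from itertools import combinations

dims = ("city", "item", "year")
cuboids = [combo for r in range(len(dims) + 1)
           for combo in combinations(dims, r)]
print(len(cuboids))   # 2**3 = 8
for c in cuboids:
    print(c or "()")  # () is the apex cuboid; (city, item, year) is the base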
Partial materialization: selected computation of cuboids:
Three choices for data cube materialization:
1) Do not precompute any of the "nonbase" cuboids (no materialization).
2) Precompute all of the cuboids (full materialization).
3) Selectively compute a proper subset of the whole set of possible cuboids (partial
materialization).
The first choice leads to computing expensive multidimensional aggregates on the fly which could be
slow. The second choice may require huge amounts of memory space in order to store all of the
precomputed cuboids. The third choice presents an interesting trade off between storage space and response
time.

The partial materialization of cuboids should consider three factors: (1) identify the subset of cuboids to
materialize; (2) exploit the materialized cuboids during query processing; (3) efficiently update the
materialized cuboids during load and refresh.
Multiway Array aggregation in the computation of data cubes
Cuboids may be stored on secondary storage and accessed when necessary. Hence it is important to
explore efficient methods for computing all of the cuboids making up a data cube, that is, for full
materialization.
ROLAP cube computation uses the following major optimization techniques:
1) Sorting, hashing and grouping operations are applied to the dimension attributes in order to
reorder and cluster related tuples.
2) Grouping is performed on some sub aggregates as a “partial grouping step”.
3) Aggregates may be computed from previously computed aggregates, rather than from the base
fact tables.
MOLAP uses direct array addressing, where dimension values are accessed via the position or index of
their corresponding array locations. Therefore, a different approach should be developed for the array-based
cube construction of MOLAP:
1) Partition the array into chunks. A Chunk is a subcube that is small enough to fit into the memory
available for cube computation. Chunking is a method for dividing an n-dimensional array into
small n-dimensional chunks, where each chunk is stored as an object on disk.
2) Compute aggregates by visiting cube cells. The order in which cells are visited can be optimized
so as to minimize the number of times that each cell must be revisited.
3) Because this chunking technique involves "overlapping" some of the aggregation computations, it
is referred to as multiway array aggregation in data cube computation.
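The following minimal Python sketch conveys the idea of multiway aggregation on a tiny 2-D array: each
chunk is scanned once, and every cell visited contributes simultaneously to both 1-D aggregates, so no cell
is revisited. The array contents and chunk size are hypothetical.

array = [[1, 2, 3, 4],
         [5, 6, 7, 8]]           # dimensions: 2 rows x 4 columns
CHUNK = 2                        # chunk width along the column dimension

row_totals = [0] * len(array)        # aggregate over columns
col_totals = [0] * len(array[0])     # aggregate over rows

for start in range(0, len(array[0]), CHUNK):   # one chunk at a time
    for i, row in enumerate(array):
        for j in range(start, start + CHUNK):
            row_totals[i] += row[j]            # each cell contributes to both
            col_totals[j] += row[j]            # aggregates in a single visit

print(row_totals, col_totals)   # [10, 26] [6, 8, 10, 12]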
Indexing OLAP Data:
The bitmap indexing method is popular in OLAP products because it allows quick searching in data
cubes. The bitmap index is an alternative representation of the record_ID (RID) list.
The join indexing method gained popularity from its use in relational database query processing.
Traditional indexing maps the value in a given column to a list of rows having that value; join indexing, in
contrast, registers the joinable rows of two relations.
Join indices may span multiple dimensions to form composite join indices.
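A minimal Python sketch of a bitmap index follows; the indexed column and its values are hypothetical.

rows = ["Q1", "Q2", "Q1", "Q3", "Q2"]          # the indexed column

# One bit vector per distinct value, one bit per record (RID).
bitmap = {}
for rid, value in enumerate(rows):
    bitmap.setdefault(value, [0] * len(rows))[rid] = 1

print(bitmap["Q1"])            # [1, 0, 1, 0, 0]

# Comparison and selection operations reduce to fast bitwise logic,
# e.g. records where quarter is Q1 OR Q2:
q1_or_q2 = [a | b for a, b in zip(bitmap["Q1"], bitmap["Q2"])]
print(q1_or_q2)                # [1, 1, 1, 0, 1]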
Efficient processing of OLAP Queries:
1) Determine which operations should be performed on the available cuboids.
2) Determine to which materialized cuboids the relevant operations should be applied.
Metadata Repository:
Metadata are data about data; when used in a data warehouse, metadata are the data that define
warehouse objects. Metadata are created for the data and captured for timestamping any extracted data, the
source of the extracted data, and missing fields that have been added by cleaning or integration processes. It
consists of:
1) A description of the structure of the data warehouse
2) Operational metadata
3) The algorithm used for summarization
4) The mapping from the operational environment to the data warehouse
5) Data related to system performance
6) Business metadata.
Data warehouse back end tools and utilities:
1) Data extraction: which typically gathers data from multiple, heterogeneous and external sources
2) Data cleaning: which detects errors in the data and rectifies them when possible
3) Data transformation: which converts data from legacy or host format to warehouse format
4) Load: which sorts, summarizes, consolidates, computes views, checks integrity, and builds indices
and partitions.
5) Refresh: which propagates the updates from the data sources to the warehouse.

Write notes on data warehousing to data mining. (Or) (5 Marks)


Describe about “From data warehousing to data mining”.
On-line analytical mining (OLAM) is a powerful paradigm that integrates OLAP with data mining
technology.
Data warehouse usage:
Data warehouses and data marts are used in a wide range of applications. Business executives in
almost every industry use the data collected, integrated, preprocessed and stored in data warehouses and data
marts to perform data analysis and make strategic decisions.
The data warehouse is mainly used for generating reports and answering predefined queries. It is
used to analyze summarized and detailed data, where the results are presented in the form of reports and
charts.
The tools for data warehousing can be categorized into access and retrieval tools, database
reporting tools, data analysis tools and data mining tools.
Business users need to have the means to know what exists in the data warehouse, how to access the
contents of the data warehouse, how to examine the contents using analysis tools, and how to present the
results of such analysis.
Three kinds of data warehouse applications:
 Information Processing –Supports querying, and reporting using crosstabs, tables, charts or graphs.
 Analytical processing – Supports basic OLAP operations
 Data mining – supports knowledge discovery by finding hidden patterns and associations,
constructing analytical models.

From on line Analytical processing to on line analytical mining


Online Analytical Mining (OLAM) integrates on-line analytical processing with data mining, mining
knowledge in multidimensional databases. It is important for the following reasons:
• High quality of data in data warehouses
• Available information processing infrastructure surrounding data warehouses
• OLAP-based exploratory data analysis
• Online selection of data mining functions
Architecture for On-line Analytical Mining:

An OLAM server performs analytical mining in data cubes in a similar manner as an OLAP server
performs on-line analytical processing. The OLAM and OLAP servers both accept user on-line
queries via a graphical user interface API and work with the data cube in data analysis via a cube API. A
metadata directory is used to guide the access of the data cube.
The data cube can be constructed by accessing and integrating multiple databases via an MDDB API
and by filtering a data warehouse via a database API that may support OLE DB and ODBC connections.
Since an OLAM server may perform multiple data mining tasks, such as concept description,
association, classification, prediction, clustering, time series analysis and so on, it usually consists of
multiple integrated data mining modules and is more sophisticated than an OLAP server.

(Figure: architecture for on-line analytical mining. Layer 4, the user interface, passes constraint-based
mining queries down and mining results up. Layer 3 contains the OLAM engine and the OLAP engine,
reached through a graphical user interface API. Layer 2 is the multidimensional database (MDDB) with its
metadata, reached through a cube API. Layer 1 is the data repository: databases and a data warehouse,
populated via data filtering, integration and cleaning through a database API.)
*************End of Unit I***********
