Data Mining Basics
Introduction
• Data mining is a technique for uncovering interesting data patterns hidden in large data sets.
Write short notes on Data mining (or)
Define data mining (or) What is data mining? (5 Marks)
Data mining: It refers to extracting or mining knowledge from large amounts of data. Data mining is a
technique to extract hidden information from data warehouses.
• Data mining is the process of analyzing data from different perspectives and summarizing it into
useful information.
Example: The mining of gold from rocks or sand is referred to as gold mining rather than rock or sand
mining.
Data mining can also be defined as the process of discovering meaningful new correlations, patterns and
trends by sifting through large amounts of data stored in repositories, using pattern recognition techniques as
well as statistical and mathematical techniques.
Data mining should have been more appropriately named “Knowledge mining from data”.
Applications:
Business management,
Production control,
Market analysis,
Engineering design,
Science Exploration.
The evolution of database technology:
• Data collection and database creation – primitive file processing (1960s)
• Advanced database management systems (mid-1980s) – object-oriented and object-relational models, spatial and multimedia databases
• Data warehousing and data mining (late 1980s) – data warehouse and OLAP technology, data mining and knowledge discovery
• Web-based database systems (1990s) – XML-based database systems, web mining
[Figure: Data mining as a step in knowledge discovery — Cleaning and Integration → Data Warehouse → Selection and Transformation → Data Mining → Patterns → Evaluation and Presentation → Knowledge]
Database, Data Warehouse or other information repository:
This is one or a set of databases, data warehouses, spreadsheets or other kinds of information
repositories. Data Cleaning and Data Integration techniques may be performed on the data.
[Figure: Architecture of a typical data mining system — a Graphical User Interface on top of a Pattern Evaluation module and a Data Mining Engine, both consulting a Knowledge Base; below them, a Database or Data Warehouse Server drawing on the underlying Database and Data Warehouse]
Data mining should be applicable to any kind of information repository. This includes relational
databases, data warehouses, transactional databases, advanced database systems, flat files, and the World
Wide Web.
Relational databases: A database system, also called a database management system (DBMS) consists of a
collection of interrelated data known as a database and a set of software programs to manage and access the
data.
A relational database is a collection of tables, each of which is assigned a unique name. Each table
consists of a set of attributes and usually stores a large set of tuples.
Each tuple in a relational table represents an object identified by a unique key and described by a set
of attribute values.
A semantic data model, such as an entity-relationship (ER) data model, which models the database
as a set of entities and their relationships, is often constructed for relational databases. Relational data can be
accessed by database queries written in a relational query language such as SQL.
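As a small illustration, the following sketch uses Python's built-in sqlite3 module to create a relational table and retrieve tuples with an SQL query; the customer table and its attributes are made up for this example.
import sqlite3

# Throwaway in-memory database; the customer table is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                 [(1, "Smith", "Vancouver"), (2, "Jones", "Victoria")])

# A relational query: each returned row is one tuple of attribute values.
for row in conn.execute("SELECT cust_id, name FROM customer WHERE city = ?", ("Vancouver",)):
    print(row)   # (1, 'Smith')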
Data warehouses:
A data warehouse is a repository of information collected from multiple sources, stored under a
unified schema and which usually resides at a single site. Data warehouses are constructed via a process of
data cleaning, data transformation, data integration, data loading and periodic data refreshing.
[Figure: Typical framework of a data warehouse — data sources are cleaned, transformed, integrated, and loaded into the data warehouse, which clients access through query and analysis tools]
A data warehouse is usually modeled by a multidimensional database structure, where each dimension
corresponds to an attribute or a set of attributes in the schema, and each cell stores the value of some
aggregate measure such as count or sales_amount.
It provides a multidimensional view of data and allows the precomputation and fast accessing of
summarized data.
Data mart:
A data mart is the access layer of the data warehouse environment that is used to get data out to the
users. The DM is a subset of the DW, usually oriented to a specific business line or team.
Data warehouses collect information about subjects that span an entire organization, and thus their scope is
enterprise wide. A data mart is a departmental subset of a data warehouse; it focuses on selected subjects,
and its scope is department wide.
On line Analytical Processing:
OLAP operations make use of background knowledge regarding the domain of the data being studied
in order to allow the presentation of data at different levels of abstraction.
Transactional Databases:
Transactional databases consist of a file where each record represents a transaction. A transaction
typically includes a unique transaction identity number (trans_ID) and a list of the items making up the transaction.
Trans_id List of item_id
T100 I1,I3,I4
T200 I2,I8
Advanced Database Systems and Advanced Database Applications:
The new database applications include handling spatial data, engineering design data, hypertext and
multimedia data, time-related data, and the World Wide Web.
These applications require efficient data structures and scalable methods for handling complex object
structures, variable-length records, semi-structured or unstructured data, text and multimedia data, and
database schemas with complex structures and dynamic changes.
Because such databases or information repositories require sophisticated facilities to efficiently store,
retrieve, and update large amounts of complex data, they also provide fertile grounds and raise many
challenging research and implementation issues for data mining.
Object-Oriented Databases:
Object-oriented databases are based on the object-oriented programming paradigm, where, in general terms,
each entity is considered as an object. Data and code relating to an object are encapsulated into a single unit.
Each object has associated with it the following:
1. A set of variables that describe the object.
2. A set of messages that the object can use to communicate with other objects or with the rest of the
database system.
3. A set of methods, where each method holds the code to implement a message.
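A minimal sketch of this idea in Python follows; the Customer class, its variables, and its deposit message/method are hypothetical, chosen only to illustrate the three components above.
class Customer:
    def __init__(self, name, balance):
        # (1) a set of variables that describe the object
        self.name = name
        self.balance = balance

    # (3) a method holding the code that implements a message
    def deposit(self, amount):
        self.balance += amount
        return self.balance

c = Customer("Smith", 100.0)
print(c.deposit(50.0))   # (2) sending the object the "deposit" message -> 150.0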
Object Relational databases:
Object-relational databases are constructed based on an object-relational data model. This model
extends the relational model by providing rich data types for handling complex objects and object
orientation.
Spatial databases:
These contain spatial-related information; such databases include geographic (map) databases, VLSI
chip design databases, and medical and satellite image databases.
Temporal Databases and Time Series Databases:
Temporal databases and time series databases both store time-related data. A temporal database
usually stores relational data that include time-related attributes. These attributes may involve several
timestamps, each having different semantics.
A time series database stores sequences of values that change with time, such as data collected
regarding the stock exchange.
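The following minimal sketch (with made-up daily closing prices) models a time series as a sequence of (timestamp, value) pairs and retrieves the values inside a time window.
from datetime import date

# Daily closing prices (values are made up).
series = [
    (date(2024, 1, 1), 101.2),
    (date(2024, 1, 2), 102.8),
    (date(2024, 1, 3), 100.9),
    (date(2024, 1, 4), 103.5),
]

# Retrieve the values that fall inside a time window.
start, end = date(2024, 1, 2), date(2024, 1, 3)
window = [value for (t, value) in series if start <= t <= end]
print(window)   # [102.8, 100.9]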
Sequence Database:
A sequence database stores sequences of ordered events, with or without a concrete notion of time;
e.g., customer shopping sequences and Web click streams.
Write short notes on data mining functionalities. (Or) Explain about what kinds of patterns can be
mined. (10 Marks)
Data mining tasks can be classified into two categories: Descriptive and predictive.
Descriptive: Descriptive mining tasks characterize the general properties of the data in the database.
Predictive: Predictive mining tasks perform inference on the current data in order to make predictions.
Concept/Class Description: Characterization and Discrimination:
Data can be associated with classes or concepts. For example, in an AllElectronics store, classes of
items include computers and printers, and concepts of customers include big spenders and budget spenders.
These descriptions can be derived via:
1. Data characterization, by summarizing the data of the class under study, called the target class.
2. Data discrimination, by comparison of the target class with one or a set of comparative classes, called
the contrasting classes.
Data characterization
It is a summarization of the general characteristics or features of a target class of data. The data
corresponding to the user-specified class are typically collected by a database query.
The data-cube-based OLAP roll-up operation can be used to perform user-controlled data
summarization along a specified dimension.
An attribute-oriented induction technique can be used to perform data generalization and
characterization without step-by-step user interaction. The output of data characterization can be presented
in various forms: pie charts, bar charts, curves, and multidimensional data cubes.
Data Discrimination
It is a comparison of the general features of target class data objects with the general features of
objects from one or a set of contrasting classes. The target and contrasting classes can be specified by the
user, and the corresponding data objects are retrieved through database queries.
The output of data discrimination includes comparative measures that help distinguish between the target
and contrasting classes.
Association analysis:
Association analysis is the discovery of association rules showing attribute-value conditions that
occur frequently together in a given set of data. Association analysis is widely used for market basket or
transaction data analysis.
Example: Association rules are of the form X ⇒ Y, that is, "A1 ∧ A2 ∧ … ∧ Am → B1 ∧ B2 ∧ … ∧ Bn",
where Ai (for i ∈ {1, 2, …, m}) and Bj (for j ∈ {1, 2, …, n}) are attribute-value pairs. That is, "database tuples
that satisfy the conditions in X are also likely to satisfy the conditions in Y."
Example: age(X, "20…29") ∧ income(X, "20K…29K") ⇒ buys(X, "CD player")
[support = 2%, confidence = 60%]
Adopting the terminology used in multidimensional databases, where each attribute is referred to as a
dimension, the above rule can be referred to as a multidimensional association rule.
Example: contain(T,”computer”)=>contains(T,”software)
[support=1%,confidence=50%]
Association rules that contain a single predicate are referred to as single dimensional association
rules.
Computer=>software[1%,50%]”.
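A minimal sketch of how support and confidence can be computed, using the small transactional database (T100, T200) shown earlier; the rule {I1} ⇒ {I3} is chosen only for illustration.
transactions = {
    "T100": {"I1", "I3", "I4"},
    "T200": {"I2", "I8"},
}

def support_confidence(X, Y, db):
    n = len(db)
    both = sum(1 for items in db.values() if X <= items and Y <= items)
    x_count = sum(1 for items in db.values() if X <= items)
    support = both / n                               # fraction containing X and Y
    confidence = both / x_count if x_count else 0.0  # of those with X, fraction with Y
    return support, confidence

# Rule {I1} => {I3}: together in 1 of 2 transactions; in every transaction with I1.
print(support_confidence({"I1"}, {"I3"}, transactions))   # (0.5, 1.0)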
Classification and Prediction:
Classification is the process of finding a set of models or functions that describe and distinguish
data classes or concepts, for the purpose of being able to use the model to predict the class of objects
whose class label is unknown.
How is the derived model presented?
The derived model may be represented in various forms, such as classification (IF-THEN) rules, decision
trees, mathematical formulae, or neural networks. A decision tree is a flow-chart-like tree structure, where
each node denotes a test on an attribute value, each branch represents an outcome of the test, and tree
leaves represent classes or class distributions.
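The following sketch shows a hand-built model in both forms at once: nested IF-THEN tests acting as the internal nodes of a tiny decision tree. The attribute tests and class labels are hypothetical, echoing the buys(X, "CD player") rule above.
def classify(customer):
    # Each nested test is an internal node of the tree;
    # each return statement is a leaf holding a class label.
    if 20 <= customer["age"] <= 29:
        if customer["income"] >= 20000:
            return "buys_CD_player"
        return "does_not_buy"
    return "does_not_buy"

print(classify({"age": 25, "income": 25000}))   # buys_CD_player
print(classify({"age": 45, "income": 80000}))   # does_not_buy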
Prediction: Users may wish to predict some missing or unavailable data values rather than class
labels.
Cluster Analysis: The objects are clustered or grouped based on the principle of maximizing the intraclass
similarity and minimizing the interclass similarity.
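A minimal one-dimensional k-means sketch of this principle; the data points, the choice of k = 2, and the naive initialization are all assumptions made for illustration.
points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
centroids = [points[0], points[3]]   # naive initialization: one seed per cluster

for _ in range(10):   # a few assignment/update passes
    clusters = [[], []]
    for p in points:
        # assign each point to its nearest centroid (high intraclass similarity)
        nearest = min((0, 1), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # recompute each centroid as the mean of its cluster
    centroids = [sum(c) / len(c) if c else centroids[i]
                 for i, c in enumerate(clusters)]

print(clusters)   # [[1.0, 1.2, 0.8], [8.0, 8.3, 7.9]]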
Outlier Analysis:
A database may contain data objects that do not comply with the general behavior or model of the
data. These data objects are outliers. Most data mining methods discard outliers as noise or exceptions.
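One simple (assumed) statistical approach is to flag values that lie far from the mean; the sample values and the two-standard-deviation threshold below are illustrative only.
from statistics import mean, stdev

values = [10, 12, 11, 13, 12, 95]   # 95 does not comply with the general model
mu, sigma = mean(values), stdev(values)
outliers = [v for v in values if abs(v - mu) > 2 * sigma]
print(outliers)   # [95]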
Evolution analysis:
Data evolution analysis describes and models regularities or trends for objects whose behavior
changes over time.
Discuss about classification of data mining system (or)
Write notes on classification of data mining system. (5 Marks)
Classification according to the kinds of databases mined:
A data mining system can be classified according to the kinds of databases mined. Database systems
themselves can be classified according to different criteria. For example: relational, transactional, object
oriented, or data warehouse.
Classification according to the kinds of knowledge mined:
Data mining systems can be categorized based on data mining functionalities, such as
characterization, discrimination, association, classification, clustering, outlier analysis and evolution
analysis.
Classification according to the kinds of techniques utilized:
These techniques can be described according to the degree of user interaction involved or the
methods of data analysis employed, e.g., visualization, pattern recognition, neural networks, and so on.
[Figure: Data mining as a confluence of multiple disciplines, including database technology and statistics]
Major issues in data mining — mining methodology and user interaction issues:
1) Mining different kinds of knowledge in database: Since different users can be interested in
different kinds of knowledge, data mining should cover a wide spectrum of data analysis and
knowledge discovery tasks.
2) Interactive mining of knowledge at multiple levels of abstraction: Since it is difficult to know
exactly what can be discovered within a database, the data mining process should be interactive. For
databases containing a huge amount of data, appropriate sampling techniques can first be applied to
facilitate interactive data exploration.
3) Incorporation of background knowledge: Background knowledge, or information regarding the
domain under study, may be used to guide the discovery process and allow discovered patterns to be
expressed in concise terms and at different levels of abstraction.
4) Data mining query languages and ad hoc data mining: A relational query language allows users
to pose ad hoc queries for data retrieval. In a similar vein, high level data mining query languages
need to be developed to allow users to describe ad hoc data mining tasks by facilitating the
specification of the relevant sets of data for analysis.
5) Presentation and visualization of data mining results: Discovered knowledge should be expressed
in high level languages, visual representations, or other expressive forms so that the knowledge can
be easily understood and directly usable by humans.
6) Handling noisy or incomplete data: The data stored in a database may reflect noise, exceptional
cases, or incomplete data objects. When mining data regularities, these objects may confuse the
process, causing the knowledge model constructed to overfit the data.
7) Pattern evaluation: The interestingness problem: a data mining system can uncover thousands of
patterns. Many of the patterns discovered may be uninteresting to the given user, representing
common knowledge or lacking novelty.
Performance issues: These include efficiency, scalability and parallelization of data mining algorithms:
1) Efficiency and scalability of data mining algorithms: To effectively extract information from a
huge amount of data in databases, data mining algorithms must be efficient and scalable.
2) Parallel, distributed and incremental mining algorithms: The huge size of many databases, the
wide distribution of data, and the computational complexity of some data mining methods are factors
motivating the development of parallel and distributed data mining algorithms.
Issues relating to the diversity of database types:
1) Handling of relational and complex types of data: Since relational databases and data warehouses
are widely used, the development of efficient and effective data mining systems for such data is
important. It may contain complex data objects, hypertext and multimedia data, spatial data,
temporal data or transaction data.
2) Mining information from heterogeneous databases and global information systems: Local- and
wide-area computer networks connect many sources of data, forming huge, distributed and
heterogeneous databases.
Nonvolatile:
A data warehouse is always a physically separate store of data transformed from the application data
found in the operational environment.
How are organizations using the information from data warehouses?
1) Increasing customer focus, which includes the analysis of customer buying patterns.
2) Repositioning products and managing product portfolios by comparing the performance of sales by
quarter, by year and by geographic regions.
3) Analyzing operations and looking for sources of profit.
4) Managing the customer relationships, making environmental corrections.
Write the difference between operational databases systems and data warehouses (5 Marks)
Database systems: online transaction processing (OLTP).
Data warehouses: online analytical processing (OLAP).
[Figure: A lattice of cuboids for the dimensions time, item, location, and supplier — 2-D cuboids such as (time, item), (time, location), (item, location), (time, supplier), (item, supplier), and (location, supplier), up to the 4-D base cuboid (time, item, location, supplier)]
Stars, snowflakes and fact constellations: schemas for multidimensional databases:
The most popular data model for a data warehouse is a multidimensional model. Such a model can
exist in the form of a star schema, a snowflake schema, or a fact constellation schema.
Star Schema: The most common modeling paradigm is the star schema, in which the data warehouse contains
1. a large central table (the fact table), and
2. a set of smaller attendant tables (the dimension tables).
The advantages of a star schema are that it is easy to understand, makes it easy to define hierarchies,
reduces the number of physical joins, requires low maintenance and has very simple metadata. There is a
1:N relationship between the fact table and each dimension table.
Disadvantage: Un-normalized dimension tables tend to be large and may contain redundant
information.
[Figure: Star schema of a data warehouse for sales — a central sales fact table (time_key, item_key, branch_key, location_key, dollars_sold, units_sold) joined to four dimension tables: time (time_key, day, day_of_the_week, month, quarter, year), item (item_key, item_name, brand, type, supplier_type), branch (branch_key, branch_name, branch_type), and location (location_key, street, city, province_or_state, country)]
Snowflake Schema:
This is a variant of the star schema model, where some dimension tables are normalized, thereby
further splitting the data into additional tables. The resulting schema graph forms a shape similar to a
snowflake. It consists of a single fact table and multiple dimension tables. The snowflake structure may
reduce the effectiveness of browsing across the tables, owing to the larger number of join operations.
[Figure: Snowflake schema of a data warehouse for sales — the same sales fact table, but with normalized dimensions: item (item_key, item_name, brand, type, supplier_key) linked to a supplier dimension table (supplier_key, supplier_type), and location (location_key, street, city_key) linked to a city dimension table (city_key, city, province_or_state, country); the time (time_key, day, day_of_week, month, quarter, year) and branch (branch_key, branch_name, branch_type) dimension tables are as in the star schema]
Fact Constellation Schema:
Modern and Sophisticated applications may require multiple fact tables to share dimension tables.
This kind of schema can be viewed as a collection of stars and so it is also called as galaxy schema or a fact
constellation.
[Figure: Fact constellation schema of a data warehouse for sales and shipping — the sales fact table (time_key, item_key, branch_key, location_key, dollars_sold, units_sold) and a shipping fact table (item_key, time_key, shipper_key, from_location, to_location, dollars_cost, units_shipped) share the time, item, and location dimension tables; shipping also uses a shipper dimension table (shipper_key, shipper_name, location_key, shipper_type)]
A data mart is a departmental subset of the data warehouse that focuses on selected subjects, and thus
its scope is department wide. For data marts, the star or snowflake schema is commonly used, since both are
geared toward modeling single subjects, although the star schema is more popular and efficient.
Examples for defining star, snowflake and fact constellation schemas
Data warehouses and data marts can be defined using two language primitives, one for cube definition
and one for dimension definition. The cube definition statement has the following syntax:
define cube <cube_name> [<dimension_list>]: <measure_list>
The dimension definition statement has the following syntax:
define dimension <dimension_name> as (<attribute_or_subdimension_list>)
Star schema:
define cube sales_star [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
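For comparison, here is a relational sketch of the same star schema as SQL DDL issued through Python's sqlite3; the column types are assumptions, since the DMQL definitions above do not specify them.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time     (time_key INTEGER PRIMARY KEY, day TEXT, day_of_week TEXT,
                       month TEXT, quarter TEXT, year INTEGER);
CREATE TABLE item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                       type TEXT, supplier_type TEXT);
CREATE TABLE branch   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                       province_or_state TEXT, country TEXT);
-- the central fact table: one foreign key per dimension plus the two measures
CREATE TABLE sales    (time_key INTEGER REFERENCES time,
                       item_key INTEGER REFERENCES item,
                       branch_key INTEGER REFERENCES branch,
                       location_key INTEGER REFERENCES location,
                       dollars_sold REAL, units_sold INTEGER);
""")
print("star schema created")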
Snowflake schema:
define cube sales_snowflake [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city(city_key, city, province_or_state, country))
Fact constellation schema:
define cube sales [time, item, branch, location]: dollars_sold = sum(sales_in_dollars), units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
define cube shipping [time, item, shipper, from_location, to_location]:
dollars_cost = sum(cost_in_dollars), units_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
A define cube statement is used to define data cubes for sales and shipping corresponding to the two
fact tables of the schema.
Measures: their categorization and computation:
A data cube measure is a numerical function that can be evaluated at each point in the data cube
space. A measure value is computed for a given point by aggregating the data corresponding to the
respective dimension-value pairs defining the given point.
Measures can be organized into three categories, based on the kind of aggregate functions used.
Distributive:
An aggregate function is distributive if it can be computed in a distributed manner: the data are
partitioned, the function is applied to each partition, and the partial results are combined. count(),
sum(), min() and max() are distributive aggregate functions.
Algebraic:
An aggregate function is algebraic if it can be computed by an algebraic function with M arguments
(where M is a bounded positive integer), each of which is obtained by applying a distributive aggregate
function. For example, avg() can be computed from sum() and count(); min_N(), max_N() and standard
deviation are also algebraic. A measure is algebraic if it is obtained by applying an algebraic aggregate function.
Holistic:
An aggregate function is holistic if there is no constant bound on the storage size needed to describe a
subaggregate: rank(), median() and mode() are examples. A measure is holistic if it is obtained by applying
a holistic aggregate function.
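The following sketch illustrates the three categories on partitioned data (the partitions and values are made up): sum() combines per-partition results directly, avg() is derived from the distributive sum() and count(), while median() needs all of the values at once.
from statistics import median

partitions = [[4, 8, 15], [16, 23, 42]]   # raw values split across two partitions

# Distributive: the sum of per-partition sums equals the overall sum.
total = sum(sum(p) for p in partitions)                     # 108

# Algebraic: avg() derived from two distributive results, sum() and count().
count = sum(len(p) for p in partitions)
average = total / count                                     # 18.0

# Holistic: median() needs all of the values; per-partition medians do not suffice.
overall_median = median(v for p in partitions for v in p)   # 15.5

print(total, average, overall_median)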
Drill-Within: It is switching from one classification to a different one within the same dimension.
Drill-Across: It means switching from a classification in one dimension to a different classification in a
different dimension.
Pivot (Rotate): Pivot (also called "rotate") is a visualization operation that rotates the data axes in order
to provide an alternative presentation of the same data. Examples include rotating the axes in a 3-D cube,
or transforming a 3-D cube into a series of 2-D planes.
Other OLAP Operations: Additional operations include drill-across, drill-through, and so on. OLAP may
also include ranking the top N or bottom N items in lists, as well as computing moving averages, growth
rates, etc.
Explain the Architecture of Data Warehouse (or) Write short notes on Data Warehouse Architecture
with a neat diagram. (5 Marks)
The data warehouse architecture involves the following back-end steps for populating and maintaining the warehouse:
Extract
Clean
Transform
Load
Refresh
[Figure: Data flows from Operational Databases and External Sources through these steps into the Data Warehouse]
Write about the steps used to design and construct the data warehouses. (OR) (5 Marks)
Discuss about the basic steps involved in the designing process of data warehouse.
What does the data warehouse provide for business analysts?
Having a data warehouse may provide a competitive advantage by presenting relevant information
from which to measure performance and make critical adjustments in order to help win over
competitors.
A data warehouse can enhance business productivity since it is able to quickly and efficiently gather
information.
A data warehouse facilitates customer relationship management since it provides a consistent view of
customers and items across all lines of business.
It may bring about cost reduction by tracking trends, patterns and exceptions over long periods of
time in a consistent and reliable manner.
Four different views regarding the design of a data warehouse must be considered: the top down view,
the data source view, the data warehouse view and the business query view.
Top down view: It allows the selection of the relevant information necessary for the data warehouse.
Data source view: It exposes the information being captured, stored and managed by operational
systems.
Data warehouse view: It includes fact tables and dimension tables. It represents the information that is
stored inside the data warehouse, including precalculated totals and counts, as well as information regarding
the source, date and time of origin.
Business Query view: It is the perspective of data in the data warehouse from the viewpoint of the end
user.
The process of data warehouse design
A data warehouse can be built using a top down approach, a bottom up approach or a combination of
both.
Top down approach: It starts with overall design and planning.
Bottom up approach: It starts with experiments and prototypes.
Combined approach: An organization can exploit the planned and strategic nature of the top down
approach while retaining the rapid implementation and opportunistic application of the bottom up approach.
From the software engineering point of view, the design and construction of a data warehouse consist of the following steps:
Planning
Requirements Study
Problem analysis
Warehouse design
Data integration and testing
Deployment of the data warehouse
Large software systems can be developed using two methodologies:
1. Waterfall method 2. Spiral method
Waterfall method: A structured and systematic analysis at each step before proceeding to the next, which is
like a waterfall, falling from one step to the next.
Spiral method: The rapid generation of increasingly functional systems, with short intervals between
successive releases.
In general, the warehouse design process consists of the following steps:
Choose a business process to model: for example, orders, invoices, shipments, inventory, account
administration, sales. If the business process is organizational and involves multiple complex object
collections, a data warehouse model should be followed.
Choose the grain of the business process: the grain is the fundamental, atomic level of data to be
represented in the fact table for this process; for example, individual transactions, individual daily
snapshots, and so on.
Choose the dimensions that will apply to each fact table record: typical dimensions are time, item,
customer, supplier and warehouse.
Choose the measures that will populate each fact table record: typical measures are numeric
additive quantities such as dollars_sold and units_sold.
Data warehouse development tools provide functions to define and edit metadata repository contents,
answer queries, output reports, and ship metadata to and from relational database system catalogues.
Planning and analysis tools study the impact of schema changes and of refresh performance when
changing refresh rates or time windows.
Three-Tier Data Warehouse Architecture: The Three-Tier consists of
1. Bottom Tier - Warehouse Database Server
2. Middle Tier - OLAP Server
3. Top Tier - Client
Bottom Tier: It is a warehouse database server that is always a relational database system. Data from
operational databases and external sources are extracted using application program interfaces known as
gateways (e.g., ODBC, JDBC).
Middle Tier: It is an OLAP server that is implemented using either
(1) a Relational OLAP (ROLAP) model – an extended RDBMS that maps operations on multidimensional
data to standard relational operations, or
(2) a Multidimensional OLAP (MOLAP) model – a special-purpose server that directly implements
multidimensional data and operations.
Top Tier: It is a client layer, which contains query and reporting tools, analysis tools, and/or data mining tools.
From the architecture point of view, there are three data warehouse models. They are:
Enterprise Warehouse:
It collects all of the information about subjects of the entire organisation.
Data Mart:
It contains a subset of corporate-wide data that is of value to a specific group of users.
Virtual Warehouse:
It is a set of views over operational databases.
Data warehouse Multidimensional view and Implementation
What is a concept hierarchy? (Or) Define concept hierarchy. (5 Marks)
[Figure: A concept hierarchy for the dimension location — the topmost level all, countries (Canada, USA), and cities (Vancouver, Victoria, ...)]
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level,
more general concepts. Many concept hierarchies are implicit within the database schema. For example,
suppose that the dimension location is described by the attributes number, street, city, province_or_state,
zipcode and country. These attributes are related by a total order, forming a concept hierarchy such as
"street < city < province_or_state < country".
The attributes of a dimension may also be organized in a partial order, forming a lattice. A concept
hierarchy that is a total or partial order among attributes in a database schema is called a schema hierarchy.
Concept hierarchies may also be defined by discretizing or grouping values for a given dimension or
attribute, resulting in a set-grouping hierarchy.
[Figure: A schema hierarchy for location (street < city < province_or_state < country) and a partial order (lattice) for time involving week, month, and quarter]
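A minimal sketch of rolling a low-level concept up such a hierarchy; the city-to-province and province-to-country mappings are illustrative examples.
city_to_province = {"Vancouver": "British Columbia", "Victoria": "British Columbia",
                    "Chicago": "Illinois", "New York": "New York"}
province_to_country = {"British Columbia": "Canada",
                       "Illinois": "USA", "New York": "USA"}

def roll_up(city):
    # map a low-level concept (city) to higher-level, more general concepts
    province = city_to_province[city]
    return city, province, province_to_country[province]

print(roll_up("Vancouver"))   # ('Vancouver', 'British Columbia', 'Canada')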
[Figure: Lattice of cuboids forming a data cube for the dimensions city, item, and year — 2-D cuboids (city, item), (city, year) and (item, year), and the 3-D base cuboid (city, item, year)]
An SQL query containing no group-by, such as "compute the sum of total sales", is a zero-
dimensional operation. An SQL query containing one group-by, such as "compute the sum of sales, group
by city", is a one-dimensional operation.
A cube operator on n dimensions is equivalent to a collection of group-by statements, one for each
subset of the n dimensions. Therefore, the cube operator is the n-dimensional generalization of the group-by
operator.
define cube sales [item, city, year]: sum(sales_in_dollars)
For a cube with n dimensions, there are a total of 2^n cuboids, including the base cuboid. The
statement
compute cube sales
explicitly instructs the system to compute the sales aggregate cuboids for all of the eight
subsets of the set {item, city, year}, including the empty subset.
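The following sketch (with made-up sales rows) makes the equivalence concrete by enumerating all 2^3 = 8 group-by subsets of {item, city, year} and aggregating sales_in_dollars over each.
from itertools import combinations

rows = [   # (item, city, year, sales_in_dollars)
    ("TV",    "Vancouver", 2024, 400.0),
    ("TV",    "Victoria",  2024, 150.0),
    ("phone", "Vancouver", 2023, 250.0),
]
dims = ("item", "city", "year")

for r in range(len(dims) + 1):              # every subset, including the empty one
    for subset in combinations(range(len(dims)), r):
        cuboid = {}
        for row in rows:                    # one group-by per subset of dimensions
            key = tuple(row[i] for i in subset)
            cuboid[key] = cuboid.get(key, 0.0) + row[3]
        print([dims[i] for i in subset], cuboid)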
Partial materialization: selected computation of cuboids:
Three choices for data cube materialization:
1) Do not precompute any of the "nonbase" cuboids (no materialization).
2) Precompute all of the cuboids (full materialization).
3) Selectively compute a proper subset of the whole set of possible cuboids (partial
materialization).
The first choice leads to computing expensive multidimensional aggregates on the fly, which can be
slow. The second choice may require huge amounts of memory space in order to store all of the
precomputed cuboids. The third choice presents an interesting trade-off between storage space and response
time.
An OLAM (on-line analytical mining) server performs analytical mining in data cubes in a similar manner
as an OLAP server performs on-line analytical processing. The OLAM and OLAP servers both accept user
on-line queries via a graphical user interface API and work with the data cube in the data analysis via a
cube API. A metadata directory is used to guide the access of the data cube.
The data cube can be constructed by accessing and integrating multiple databases via an MDDB API
and/or by filtering a data warehouse via a database API that may support OLE DB or ODBC connections.
Since an OLAM server may perform multiple data mining tasks, such as concept description,
association, classification, prediction, clustering, time series analysis and so on, it usually consists of
multiple integrated data mining modules and is more sophisticated than an OLAP server.
[Figure: An integrated OLAM/OLAP architecture (lower layers) — Layer 1 (data repository): databases and a data warehouse built via data cleaning, data integration and data filtering, accessed through a database API; Layer 2 (multidimensional database): the data cube, accessed through a cube API]