Data Warehousing
3 Meta data
Metadata is data about data. It is used for maintaining, managing, and using the data warehouse. It is
classified into two types:
Technical Meta data: Information about warehouse data, used by warehouse designers and
administrators to carry out development and management tasks. It includes:
Info about data stores
Transformation descriptions, i.e., the mapping methods from the operational database to the
warehouse database
Warehouse Object and data structure definitions for target data
The rules used to perform clean up, and data enhancement
Data mapping operations
Access authorization, backup history, archive history, info delivery history, data
acquisition history, data access etc.,
Business Meta data: Information that helps users understand the data stored in the data warehouse. It
includes:
Subject areas, and info object type including queries, reports, images, video, audio clips
etc.
Internet home pages
Info related to info delivery system
Data warehouse operational info such as ownerships, audit trails etc.,
4 Access tools
Its purpose is to provide info to business users for decision making. There are five categories:
Data query and reporting tools
Application development tools
Executive info system tools (EIS)
OLAP tools
Data mining tools
5 Data marts
Data marts are departmental subsets that focus on selected subjects. They are independent and are used
by a dedicated user group for rapid delivery of enhanced decision-support functionality to end
users. A data mart is used in the following situations:
Extremely urgent user requirement
The absence of a budget for a full scale data warehouse strategy
The decentralization of business needs
The attraction of easy-to-use tools and a mid-sized project
Data mart presents two problems:
1. Scalability: A small data mart can grow quickly along multiple dimensions, so while
designing it the organization has to pay close attention to system scalability, consistency
and manageability issues
2. Data integration
6 Data warehouse admin and management
The management of data warehouse includes,
Security and priority management
Monitoring updates from multiple sources
Data quality checks
Managing and updating meta data
Auditing and reporting data warehouse usage and status
Purging data
Replicating, sub setting and distributing data
Backup and recovery
Data warehouse storage management which includes capacity planning, hierarchical
storage management and purging of aged data etc.,
7 Information delivery system
• It enables users to subscribe for data warehouse information.
• Information is delivered to one or more destinations according to a specified scheduling algorithm.
Shared nothing systems are concerned with access to disks, not access to memory.
Nonetheless, adding more PUs and disks can improve scaleup. Oracle Parallel Server can access
the disks on a shared nothing system as long as the operating system provides transparent disk
access, but this access is expensive in terms of latency.
Shared nothing systems have advantages and disadvantages for parallel processing:
Advantages
Shared nothing systems provide for incremental growth.
System growth is practically unlimited.
MPPs are good for read-only databases and decision support applications.
Failure is local: if one node fails, the others stay up.
Disadvantages
More coordination is required.
More overhead is required for a process working on a disk belonging to another node.
If there is a heavy workload of updates or inserts, as in an online transaction processing
system, it may be worthwhile to consider data-dependent routing to alleviate contention.
In the relational context, there are three basic schemas that are used in dimensional
modeling:
1. Star schema
2. Snowflake schema
3. Fact constellation schema
Star schema
The multidimensional view of data that is expressed using relational database semantics is
provided by the database schema design called the star schema. The basic premise of the star schema is that
information can be classified into two groups:
Facts
Dimension
Star schema has one large central table (fact table) and a set of smaller tables (dimensions)
arranged in a radial pattern around the central table.
Facts are core data element being analyzed while dimensions are attributes about the facts.
The determination of which schema model should be used for a data warehouse should be based
upon the analysis of project requirements, accessible tools and project team preferences.
What is star schema? The star schema architecture is the simplest data warehouse schema. It is
called a star schema because the diagram resembles a star, with points radiating from a center. The
center of the star consists of fact table and the points of the star are the dimension tables. Usually
the fact tables in a star schema are in third normal form (3NF), whereas dimensional tables are de-
normalized. Despite the fact that the star schema is the simplest architecture, it is most commonly
used nowadays and is recommended by Oracle.
Fact Tables:
A fact table is a table that contains summarized numerical and historical data (facts) and a multipart
index composed of foreign keys from the primary keys of related dimension tables.
A fact table typically has two types of columns: foreign keys to dimension tables, and measures
(columns that contain numeric facts). A fact table can contain facts at the detail or aggregated level.
Dimension Tables:
Dimensions are categories by which summarized data can be viewed. E.g. a profit summary in a
fact table can be viewed by a Time dimension (profit by month, quarter, year), Region dimension
(profit by country, state, city), Product dimension (profit for product1, product2).
A dimension is a structure usually composed of one or more hierarchies that categorizes data. If a
dimension has no hierarchies and levels, it is called a flat dimension or list. The primary keys
of each of the dimension tables are part of the composite primary key of the fact table. Dimensional
attributes help to describe the dimensional value. They are normally descriptive, textual values.
Dimension tables are generally much smaller in size than the fact table.
Typical fact tables store data about sales, while dimension tables store data about geographic regions
(markets, cities), clients, products, times, and channels.
Measures:
a. Measures are numeric data based on columns in a fact table. They are the primary data which
end users are interested in. E.g. a sales fact table may contain a profit measure which
represents profit on each sale.
b. Aggregations are pre calculated numeric data. By calculating and storing the answers to a
query before users ask for it, the query processing time can be reduced. This is key in
providing fast query performance in OLAP.
c. Cubes are data processing units composed of fact tables and dimensions from the data
warehouse. They provide multidimensional views of data, querying and analytical
capabilities to clients.
The main characteristics of star schema:
Simple structure -> easy to understand schema
Great query effectiveness -> small number of tables to join
Relatively long time for loading data into dimension tables -> de-normalization and
data redundancy can make the tables large
The most commonly used in data warehouse implementations -> widely
supported by a large number of business intelligence tools
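To make the structure concrete, the following is an illustrative sketch only (not part of the original material): a tiny star schema built with Python's sqlite3 module, using hypothetical table and column names (sales_fact, dim_time, dim_product, dim_store).

```python
# Illustrative sketch only: a tiny star schema with Python's built-in sqlite3 module.
# Table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time    (time_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_id INTEGER PRIMARY KEY, city TEXT, country TEXT);

-- The fact table holds the measures plus foreign keys to every dimension table.
CREATE TABLE sales_fact (
    time_id    INTEGER REFERENCES dim_time(time_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    units_sold INTEGER,
    revenue    REAL
);
""")

conn.executemany("INSERT INTO dim_time VALUES (?,?,?,?)",
                 [(1, "2024-01-05", "2024-01", 2024), (2, "2024-02-10", "2024-02", 2024)])
conn.executemany("INSERT INTO dim_product VALUES (?,?,?)",
                 [(1, "Laptop", "Electronics"), (2, "Desk", "Furniture")])
conn.executemany("INSERT INTO dim_store VALUES (?,?,?)",
                 [(1, "Chennai", "India"), (2, "Toronto", "Canada")])
conn.executemany("INSERT INTO sales_fact VALUES (?,?,?,?,?)",
                 [(1, 1, 1, 10, 5000.0), (2, 2, 2, 4, 1200.0), (2, 1, 1, 3, 1500.0)])

# A typical star-join query: summarize a measure by attributes of two dimensions.
for row in conn.execute("""
    SELECT p.category, s.country, SUM(f.revenue) AS total_revenue
    FROM sales_fact f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_store   s ON f.store_id   = s.store_id
    GROUP BY p.category, s.country
"""):
    print(row)
```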
Snowflake schema: is the result of decomposing one or more of the dimensions. The many-to-
one relationships among sets of attributes of a dimension can be separated into new dimension tables,
forming a hierarchy. The decomposed snowflake structure visualizes the hierarchical structure of
dimensions very well.
Fact constellation schema: For each star schema it is possible to construct a fact constellation
schema (for example, by splitting the original star schema into several star schemas, each of which
describes facts at another level of the dimension hierarchies). The fact constellation architecture
contains multiple fact tables that share many dimension tables.
The main shortcoming of the fact constellation schema is a more complicated design because many
variants for particular kinds of aggregation must be considered and selected. Moreover, dimension
tables are still large.
Multi relational Database:
The relational implementations of multidimensional database systems are referred to as multi-
relational database systems.
Data Cleaning
➢ Remove noise
➢ Correct inconsistencies.
➢ By filling missing values, smoothing noisy data
➢ Identify outliers
➢ Resolve inconsistencies
Metadata:
• Location & Description of data warehouse system and data components.
• Names, definitions, structure & content of data warehouse.
• Identification of authorized sources.
• Subscription information
• Data warehouse operational information
• Security authorization & access control list
• Integration & transformation rules.
• Mapping methods
– Meta data used for maintaining, building, managing and using data warehouse.
– Helps the user to understand the data
Meta Data Interchange Initiative:
• The Meta Data Interchange Initiative was formed to develop a standard specification for the meta data
interchange format and its support mechanism.
Goal of a Charter:
• Creating a vendor independent, industry defined and maintained standard access
mechanism.
• Enabling users to control and manage access to and manipulation of the meta data by using
interchange-compliant tools.
• Allow the users to build the tools to meet their needs
• Enabling the tools to satisfy their meta data access requirements freely.
• Define clear and simple infrastructure.
• Minimize the amount of modification.
• Creating a procedure for establishing, maintaining, extending, updating.
Core set components:
• Represent the minimum common denominator of the meta data element.
• Represent minimum point of integration.
• Approve optional and extension components.
Defines 2 distinct meta models.
• Application meta model.
– Load the meta data for particular application.
• Meta data meta model.
– Set of objects to describe meta data.
Meta Data Interchange Standard Framework:
• Approaches:
– Procedural approach.
– ASCII batch approach
– Hybrid approach.
Meta Data Repository Examples:
2. R & O Repository
a. Client/server repository
b. Supports many platforms.
c. Performance: sub second response time.
d. Scalability → Laptop to mainframe
e. Capacity: support very large repository.
f. Drawback: meta data storage.
3. Prism
a. Prism’s Directory Manager is a meta data directory; it can:
i. Import business models
ii. Import meta data from the warehouse manager.
iii. Export meta data into catalogs.
iv. Create flexible customized views.
b. Directory manager has three components:
i. Information builder → all entries
ii. Directory builder → meta data relationships
iii. Directory navigator → provides extensive view & gives better understanding.
Prism Directory Manager answers questions like:
• What data exists?
• Where can the data be found?
• What are the original sources of the data?
Advantages of single integration point
• Simplifies the population of data
• Accuracy
• Navigation and access become much easier.
4. Logic Works Universal Directory
• Meta data repository tool
• Acts as a hub
• Centralizing meta data
– Improves the way of managing.
• Activities:
– Inventorying source data
– Designing the data warehouse
– Mapping source to target data
– Populating the warehouse
– Analyzing the data in the data warehouse
– Evolving the data warehouse content.
Universal directory components:
• Universal explorer
– Provides the users with a set of tools (browsing)
– Act as a search engine
– Allow the user to visualize the data warehouse using STAR schema.
• Directory administrator → maintains security
UNIT II
BUSINESS ANALYSIS
The data warehouse is accessed using an end-user query and reporting tool from Business
Objects. Business Objects provides several tools to securely access the data warehouse or personal
data files with a point-and-click interface including the following:
BusinessObjects (Reporter and Explorer) - a Microsoft Windows based query and reporting tool.
InfoView - a web based tool that allows reports to be refreshed on demand (but cannot be used to create new
reports).
InfoBurst - a web based server tool, that allows reports to be refreshed, scheduled and distributed.
It can be used to distribute reports and data to users or servers in various formats (e.g. Text, Excel,
PDF, HTML, etc.). For more information, see the documentation below:
o InfoBurst Usage Notes (PDF)
o InfoBurst User Guide (PDF)
Data Warehouse List Upload - a web based tool, that allows lists of data to be uploaded into the
data warehouse for use as input to queries. For more information, see the documentation below:
o Data Warehouse List Upload Instructions (PDF)
WSU has negotiated a contract with Business Objects for purchasing these tools at a
discount. View BusObj Rates.
Selecting your Query Tool:
a. The query tools discussed in the next several slides represent the most commonly used
query tools at Penn State.
b. A Data Warehouse user is free to select any query tool, and is not limited to the ones
mentioned.
c. What is a "Query Tool"?
d. A query tool is a software package that allows you to connect to the data warehouse from
your PC and develop queries and reports.
There are many query tools to choose from. Below is a listing of what is currently being
used on the PC:
1. Microsoft Access
2. Microsoft Excel
3. Cognos Impromptu
Data Warehousing Tools and Technologies
a. Building a data warehouse is a complex task because there is no vendor that provides
an 'end-to-end' set of tools.
b. Necessitates that a data warehouse is built using multiple products from different
vendors.
c. Ensuring that these products work well together and are fully integrated is a major
challenge.
Cognos Impromptu
Query Tabs:
Data
✓ Identify what to query
✓ Click and drag
Sort
✓ Hierarchy presentation
✓ Ascending or descending order
Group
✓ Summarized data by group order
Filter
✓ Defines criteria
✓ Specifies query range
Administrative
✓ Access
✓ Profile
✓ Client/Server
Generating reports:
Edit features on the toolbar allow changes to the report data after the query has been
completed
➢ Modify existing data
➢ Format numerical and date fields
➢ Perform calculations
➢ Group data
➢ Sort columns.
Catalog Query:
Cognos Impromptu
What is impromptu?
Impromptu is an interactive database reporting tool. It allows Power Users to query data without
programming knowledge. When using the Impromptu tool, no data is written or changed in the
database. It is only capable of reading the data.
The multidimensional data model is an integral part of On-Line Analytical Processing, or OLAP.
Because OLAP is on-line, it must provide answers quickly; analysts pose iterative queries during
interactive sessions, not in batch jobs that run overnight. And because OLAP is also analytic, the
queries are complex. The multidimensional data model is designed to solve complex queries in
real time.
One way to view the multidimensional data model is as a cube. The table on the left contains detailed sales
data by product, market and time. The cube on the right associates sales numbers (units sold) with
dimensions - product type, market and time - with the unit variables organized as cells in an array.
This cube can be expanded to include another array - price - which can be associated with all or only
some dimensions. As the number of dimensions increases, the number of the cube's cells increases
exponentially.
Dimensions are hierarchical in nature, i.e., the time dimension may contain hierarchies for years,
quarters, months, weeks and days; GEOGRAPHY may contain country, state, city, etc.
In this cube we can observe, that each side of the cube represents one of the elements of the
question. The x-axis represents the time, the y-axis represents the products and the z-axis
represents different centers. The cells in the cube represent the number of products sold, or can
represent the price of the items.
This figure also gives a different understanding of the drill-down operation: the relations involved
need not be directly related to one another; they can be related indirectly.
As the size of the dimensions increases, the size of the cube also increases exponentially. The response
time of a query against the cube depends on the size of the cube.
• Aggregation (roll-up)
– dimension reduction: e.g., total sales by city
– summarization over aggregate hierarchy: e.g., total sales by city and year -> total
sales by region and by year
• Selection (slice) defines a subcube
– e.g., sales where city = Palo Alto and date = 1/15/96
• Navigation to detailed data (drill-down)
– e.g., (sales - expense) by city, top 3% of cities by average income
• Visualization Operations (e.g., Pivot or dice)
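As an illustration (not from the original text), the operations listed above can be imitated on a small table with pandas; the DataFrame, column names, and figures below are made up.

```python
# Illustrative sketch of roll-up, slice, drill-down, and pivot using pandas.
# The DataFrame and its values are hypothetical.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["West", "West", "East", "East"],
    "city":    ["Palo Alto", "Seattle", "Boston", "Boston"],
    "product": ["books", "books", "clothing", "books"],
    "year":    [1996, 1996, 1996, 1997],
    "amount":  [100, 150, 200, 120],
})

# Roll-up (aggregation): total sales by city, then at the coarser region/year level.
by_city = sales.groupby("city")["amount"].sum()
by_region_year = sales.groupby(["region", "year"])["amount"].sum()

# Slice: restrict dimensions to single values, defining a sub-cube.
palo_alto_1996 = sales[(sales["city"] == "Palo Alto") & (sales["year"] == 1996)]

# Drill-down: go back from the region/year summary to the detailed city rows.
east_detail = sales[sales["region"] == "East"]

# Pivot: rotate the axes so that products become columns.
pivoted = sales.pivot_table(index="city", columns="product",
                            values="amount", aggfunc="sum")

print(by_city, by_region_year, palo_alto_1996, east_detail, pivoted, sep="\n\n")
```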
OLAP
OLAP stands for Online Analytical Processing. It uses database tables (fact and dimension
tables) to enable multidimensional viewing, analysis and querying of large amounts of data. E.g.
OLAP technology could provide management with fast answers to complex queries on their
operational data or enable them to analyze their company's historical data for trends and patterns.
Online Analytical Processing (OLAP) applications and tools are those that are designed to ask
"complex queries of large multidimensional collections of data." For this reason, OLAP is closely
associated with data warehousing.
Need
The key driver of OLAP is the multidimensional nature of the business problem. These problems
are characterized by retrieving a very large number of records that can reach gigabytes and
terabytes, and summarizing this data into information that can be used by business analysts.
One limitation of SQL is that it cannot easily represent these complex problems: a single analytical query
may be translated into several SQL statements involving multiple joins, intermediate tables, sorting,
aggregations, and a huge amount of temporary memory to store those tables. These procedures require a great
deal of computation and therefore a long execution time. A second limitation of SQL is its inability to use
mathematical models within SQL statements. Even if an analyst could express these complex problems using SQL
statements, a large amount of computation and memory would still be needed. Therefore the use of OLAP is
preferable for solving this kind of problem.
This methodology (used by ROLAP tools) relies on manipulating the data stored in the relational database to
give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing
and dicing is equivalent to adding a "WHERE" clause to the SQL statement; the data remain stored in
relational tables.
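A rough, hypothetical sketch of how such a slice/dice request could be turned into a SQL statement; the function, table, and column names are invented and do not come from any particular ROLAP product.

```python
# Hypothetical illustration: each slice/dice filter becomes a predicate in the WHERE clause.
def dice_to_sql(table, measures, filters):
    """Build a SQL statement for a dice defined by dimension = value filters."""
    select_list = ", ".join(f"SUM({m}) AS {m}" for m in measures)
    where = " AND ".join(f"{dim} = '{value}'" for dim, value in filters.items())
    return f"SELECT {select_list} FROM {table} WHERE {where}"

# Dicing on the city and year dimensions of a hypothetical sales table:
print(dice_to_sql("sales", ["revenue", "units_sold"],
                  {"city": "Palo Alto", "year": "1996"}))
# Produces: SELECT SUM(revenue) AS revenue, SUM(units_sold) AS units_sold FROM sales WHERE city = 'Palo Alto' AND year = '1996'
```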
Advantages:
Can handle large amounts of data: The data size limitation of ROLAP technology is the
limitation on data size of the underlying relational database. In other words, ROLAP itself
places no limitation on data amount.
Can leverage functionalities inherent in the relational database: Often, relational database
already comes with a host of functionalities. ROLAP technologies, since they sit on top of
the relational database, can therefore leverage these functionalities.
Disadvantages:
Performance can be slow: Because each ROLAP report is essentially a SQL query (or
multiple SQL queries) in the relational database, the query time can be long if the
underlying data size is large.
Limited by SQL functionalities: Because ROLAP technology mainly relies on generating
SQL statements to query the relational database, and SQL statements do not fit all needs
(for example, it is difficult to perform complex calculations using SQL), ROLAP
technologies are therefore traditionally limited by what SQL can do. ROLAP vendors have
mitigated this risk by building into the tool out-of-the-box complex functions as well as the
ability to allow users to define their own functions.
Examples: PowerPlay (Cognos), Brio, Microsoft Analysis Services, Oracle Advanced Analytic
Services.
OLAP Guidelines:
Dr. E. F. Codd, the "father" of the relational model, created a list of rules to deal with OLAP
systems. Users should prioritize these rules according to their needs to match their business
requirements (Reference 3). These rules are:
1) Multidimensional conceptual view: The OLAP should provide an appropriate
multidimensional Business model that suits the Business problems and Requirements.
2) Transparency: The OLAP tool should provide transparency to the input data for the users.
3) Accessibility: The OLAP tool should access only the data required for the analysis
needed.
4) Consistent reporting performance: The size of the database should not affect reporting
performance in any way.
5) Client/server architecture: The OLAP tool should use the client server architecture to
ensure better performance and flexibility.
6) Generic dimensionality: Every data dimension should be equivalent in its structure and
operational capabilities.
7) Dynamic sparse matrix handling: The OLAP tool should be able to manage the sparse
matrix and so maintain the level of performance.
8) Multi-user support: The OLAP tool should allow several users to work concurrently on the
same model.
9) Unrestricted cross-dimensional operations: The OLAP tool should be able to perform
operations across the dimensions of the cube.
10) Intuitive data manipulation. "Consolidation path re-orientation, drilling down across
columns or rows, zooming out, and other manipulation inherent in the consolidation path
outlines should be accomplished via direct action upon the cells of the analytical model,
and should neither require the use of a menu nor multiple trips across the user
interface." (Reference 4)
11) Flexible reporting: It is the ability of the tool to present the rows and column in a manner
suitable to be analyzed.
12) Unlimited dimensions and aggregation levels: This depends on the kind of Business, where
multiple dimensions and defining hierarchies can be made.
In addition to these guidelines an OLAP system should also support:
Comprehensive database management tools: This gives database management the ability to
control distributed business data.
The ability to drill down to detail source record level: This requires that the OLAP
tool allow smooth transitions in the multidimensional database.
Incremental database refresh: The OLAP tool should provide partial refresh.
Structured Query Language (SQL) interface: the OLAP system should be able to
integrate effectively into the surrounding enterprise environment.
OLTP vs OLAP
OLTP stands for On Line Transaction Processing and is a data modeling approach typically
used to facilitate and manage usual business applications. Most of the applications you see and use are
OLTP based. OLTP technology is used to perform updates on operational or transactional systems
(e.g., point-of-sale systems).
OLAP stands for On Line Analytic Processing and is an approach to answer multi-
dimensional queries. OLAP was conceived for Management Information Systems and Decision
Support Systems. OLAP technology is used to perform complex analysis of the data in a data
warehouse.
The following table summarizes the major differences between OLTP and OLAP system
design.
Representation of Multi-Dimensional Data:
➢ OLAP database servers use multi-dimensional structures to store data and relationships between data.
➢ Multi-dimensional structures are best-visualized as cubes of data, and cubes within cubes of data. Each
side of a cube is a dimension.
➢ Multi-dimensional databases are a compact and easy-to-understand way of visualizing and
manipulating data elements that have many inter-relationships.
➢ The cube can be expanded to include another dimension, for example, the number of sales staff in
each city.
➢ The response time of a multi-dimensional query depends on how many cells have to be added on-
the-fly.
➢ As the number of dimensions increases, the number of the cube's cells increases exponentially.
Multi-dimensional OLAP supports common analytical operations, such as:
o Consolidation: involves the aggregation of data such as 'roll-ups' or complex expressions
involving interrelated data. For example, branch offices can be rolled up to cities and
cities rolled up to countries.
o Drill-Down: is the reverse of consolidation and involves displaying the detailed data that
comprises the consolidated data.
o Slicing and dicing: refers to the ability to look at the data from different viewpoints.
Slicing and dicing is often performed along a time axis in order to analyze trends and find
patterns.
Relational OLAP (ROLAP)
➢ ROLAP is the fastest-growing type of OLAP tools.
➢ ROLAP supports RDBMS products through the use of a metadata layer, thus avoiding the
requirement to create a static multi-dimensional data structure.
➢ This facilitates the creation of multiple multi-dimensional views of the two-dimensional relation.
➢ To improve performance, some ROLAP products have enhanced SQL engines to support the
complexity of multi-dimensional analysis, while others recommend, or require, the use of highly
denormalized database designs such as the star schema.
➢ The development issues associated with ROLAP technology:
◆ Performance problems associated with the processing of complex queries that require
multiple passes through the relational data.
◆ Development of middleware to facilitate the development of multi-dimensional
applications.
◆ Development of an option to create persistent multi-dimensional structures, together with
facilities to assist in the administration of these structures.
(Figure, not reproduced here: the relation of the MOLAP architecture to the server and end user, and the corresponding ROLAP architecture.)
OLAP Tools and the internet:
The two most pervasive themes in computing have been the Internet and data warehousing; hence the
integration of these two giant technologies is a necessity. The advantages of using the Web for access
are evident (Reference 3). These advantages are:
The internet provides connectivity between countries acting as a free resource.
The web eases administrative tasks of managing scattered locations.
The Web allows users to store and manage data and applications on servers that can be managed,
maintained and updated centrally.
These reasons indicate the importance of the Web in data storage and manipulation. The Web-enabled
data access has many significant features, such as:
The first generation of Web sites
The second generation of Web sites
The emerging third generation of Web sites
HTML publishing
Helper applications
Plug-ins
Server-centric components
Java and ActiveX applications
DATA MINING
What motivated data mining? Why is it important?
The major reason that data mining has attracted a great deal of attention in the information industry in recent
years is due to the wide availability of huge amounts of data and the imminent need for turning such data
into useful information and knowledge. The information and knowledge gained can be used for applications
ranging from business management, production control, and market analysis, to engineering design and
science exploration.
Data mining refers to extracting or "mining" knowledge from large amounts of data. There are many other
terms related to data mining, such as knowledge mining, knowledge extraction, data/pattern analysis, data
archaeology, and data dredging. Many people treat data mining as a synonym for another popularly used
term, "Knowledge Discovery in Databases", or KDD.
Knowledge discovery as a process is depicted in following figure and consists of an iterative sequence of
the following steps:
• data cleaning: to remove noise or irrelevant data
• data integration: where multiple data sources may be combined
• data selection: where data relevant to the analysis task are retrieved from the database
• data transformation: where data are transformed or consolidated into forms appropriate for mining
by performing summary or aggregation operations
• data mining :an essential process where intelligent methods are applied in order to extract data
patterns
• Pattern evaluation to identify the truly interesting patterns representing knowledge based on some
interestingness measures knowledge presentation: where visualization and knowledge
representation techniques are used to present the mined knowledge to the user.
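The chain of steps above can be pictured as a small script. This is only a schematic sketch with made-up data and column names, not a real KDD system.

```python
# Schematic sketch of the KDD steps with pandas; all data and column names are hypothetical.
import pandas as pd

# Hypothetical source 1: student master data (e.g. from an operational database).
students = pd.DataFrame({"student_id": [1, 2, 3],
                         "department": ["CSE", "CSE", "ECE"]})
# Hypothetical source 2: exam results, containing a duplicate row and a missing value.
results = pd.DataFrame({"student_id": [1, 1, 2, 3],
                        "semester":   [1, 1, 1, 1],
                        "percentage": [82.0, 82.0, None, 68.0]})

# Data integration: combine the sources on a shared key.
data = students.merge(results, on="student_id")

# Data cleaning: remove duplicates and fill the missing mark with the column mean.
data = data.drop_duplicates()
data["percentage"] = data["percentage"].fillna(data["percentage"].mean())

# Data selection and transformation: keep relevant attributes and consolidate by aggregation.
summary = data.groupby(["student_id", "department"])["percentage"].mean().reset_index()

# "Data mining" placeholder: a trivial pattern - students averaging above 75%.
pattern = summary[summary["percentage"] > 75]

# Pattern evaluation / knowledge presentation: inspect the mined result.
print(pattern)
```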
Data mining is the process of discovering interesting knowledge from large amounts of data stored either
in databases, data warehouses, or other information repositories. Based on this view, the architecture of a
typical data mining system may have the following major components:
1. A database, data warehouse, or other information repository, which consists of the set of
databases, data warehouses, spreadsheets, or other kinds of information repositories containing
the student and course information.
2. A database or data warehouse server which fetches the relevant data based on the users' data
mining requests.
3. A knowledge base that contains the domain knowledge used to guide the search or to evaluate
the interestingness of resulting patterns. For example, the knowledge base may contain
metadata which describes data from multiple heterogeneous sources.
4. A data mining engine, which consists of a set of functional modules for tasks such as
characterization, association, classification, cluster analysis, and evolution and deviation analysis.
5. A pattern evaluation module that works in tandem with the data mining modules by employing
interestingness measures to help focus the search towards interestingness patterns.
6. A graphical user interface that allows the user an interactive approach to the data mining
system.
How is a data warehouse different from a database? How are they similar?
• Differences between a data warehouse and a database: A data warehouse is a repository of information
collected from multiple sources, over a history of time, stored under a unified schema, and used for data
analysis and decision support; whereas a database is a collection of interrelated data that represents the
current status of the stored data. There could be multiple heterogeneous databases where the schema of one
database may not agree with the schema of another. A database system supports ad-hoc query and on-line
transaction processing. For more details, please refer to the section “Differences between operational
database systems and data warehouses.”
• Similarities between a data warehouse and a database: Both are repositories of information, storing huge
amounts of persistent data.
Describe the following advanced database systems and applications: object-relational databases,
spatial databases, text databases, multimedia databases, the World Wide Web.
In principle, data mining should be applicable to any kind of information repository. This includes
relational databases, data warehouses, transactional databases, advanced database systems,
flat files, and the World-Wide Web. Advanced database systems include object-oriented and object-
relational databases, and specific application-oriented databases, such as spatial databases, time-series
databases, text databases, and multimedia databases.
Flat files: Flat files are actually the most common data source for data mining algorithms, especially at the
research level. Flat files are simple data files in text or binary format with a structure known by the data
mining algorithm to be applied. The data in these files can be transactions, time-series data, scientific
measurements, etc.
Relational Databases: a relational database consists of a set of tables containing either values of entity
attributes, or values of attributes from entity relationships. Tables have columns and rows, where columns
represent attributes and rows represent tuples. A tuple in a relational table corresponds to either an object
or a relationship between objects and is identified by a set of attribute values representing a unique key.
The following figure presents some relations, Customer, Items, and Borrow, representing business activity in
a video store. These relations are just a subset of what could be a database for the video store and are given
as an example.
The most commonly used query language for relational database is SQL, which allows retrieval and
manipulation of the data stored in the tables, as well as the calculation of aggregate functions such as
average, sum, min, max and count. For instance, an SQL query to select the videos grouped by category
would be:
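The query itself is not reproduced in these notes. A plausible reconstruction, sketched with Python's sqlite3 module and hypothetical column names, might look like this:

```python
# Hypothetical reconstruction of the "videos grouped by category" query using sqlite3.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Items (item_id INTEGER PRIMARY KEY, title TEXT, category TEXT)")
conn.executemany("INSERT INTO Items VALUES (?,?,?)",
                 [(1, "Dead Poets Society", "Drama"),
                  (2, "Toy Story", "Animation"),
                  (3, "Jurassic Park", "Action"),
                  (4, "Braveheart", "Drama")])

# Count the videos in each category, grouped by the category attribute.
for category, count in conn.execute(
        "SELECT category, COUNT(*) FROM Items GROUP BY category ORDER BY category"):
    print(category, count)
```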
Data mining algorithms using relational databases can be more versatile than data mining algorithms
specifically written for flat files, since they can take advantage of the structure inherent to relational
databases. While data mining can benefit from SQL for data selection, transformation and consolidation, it
goes beyond what SQL could provide, such as predicting, comparing, detecting deviations, etc.
Data warehouses
A data warehouse is a repository of information collected from multiple sources, stored under a unified
schema, and which usually resides at a single site. Data warehouses are constructed via a process of data
cleansing, data transformation, data integration, data loading, and periodic data refreshing. The figure shows
the basic architecture of a data warehouse
In order to facilitate decision making, the data in a data warehouse are organized around major subjects,
such as customer, item, supplier, and activity. The data are stored to provide information from a historical
perspective and are typically summarized.
A data warehouse is usually modeled by a multidimensional database structure, where each dimension
corresponds to an attribute or a set of attributes in the schema, and each cell stores the value of some
aggregate measure, such as count or sales amount. The actual physical structure of a data warehouse may
be a relational data store or a multidimensional data cube. It provides a multidimensional view of data and
allows the precomputation and fast accessing of summarized data.
The data cube structure that stores the primitive or lowest level of information is called a base cuboid. Its
corresponding higher level multidimensional (cube) structures are called (non-base) cuboids. A base cuboid
together with all of its corresponding higher level cuboids form a data cube. By providing multidimensional
data views and the precomputation of summarized data, data warehouse systems are well suited for On-
Line Analytical Processing, or OLAP. OLAP operations make use of background knowledge regarding the
domain of the data being studied in order to allow the presentation of data at different levels of abstraction.
Such operations accommodate different user viewpoints. Examples of OLAP operations include drill-down
and roll-up, which allow the user to view the data at differing degrees of summarization, as illustrated in
above figure.
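As a rough illustration (not from the text), every subset of the dimensions defines one cuboid; the toy data set and dimension names below are invented.

```python
# Illustrative sketch: compute every cuboid of a tiny cube with pandas.
# The data set and dimension names (item, location, year) are made up.
from itertools import combinations
import pandas as pd

facts = pd.DataFrame({
    "item":     ["book", "book", "phone", "phone"],
    "location": ["Chennai", "Toronto", "Chennai", "Toronto"],
    "year":     [2023, 2023, 2024, 2024],
    "sales":    [10, 20, 30, 40],
})
dims = ["item", "location", "year"]

# The base cuboid is the group-by over all dimensions (the most detailed level);
# dropping dimensions one by one gives the higher-level (non-base) cuboids,
# up to the apex cuboid (no dimensions at all, i.e. the grand total).
for k in range(len(dims), -1, -1):
    for group in combinations(dims, k):
        if group:
            cuboid = facts.groupby(list(group))["sales"].sum()
        else:
            cuboid = facts["sales"].sum()          # apex cuboid: total sales
        print(f"cuboid {group or '(apex)'}:\n{cuboid}\n")
```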
Transactional databases
In general, a transactional database consists of a flat file where each record represents a transaction. A
transaction typically includes a unique transaction identity number (trans ID), and a list of the items making
up the transaction (such as items purchased in a store) as shown below:
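The sample records are not reproduced here; a minimal, invented illustration of the trans_ID / item-list layout is:

```python
# Invented example of transactional records: a transaction id plus the list of items bought.
transactions = [
    ("T100", ["bread", "milk", "butter"]),
    ("T200", ["bread", "jelly", "pretzels"]),
    ("T300", ["milk", "diapers", "beer"]),
]

for trans_id, items in transactions:
    print(trans_id, items)
```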
• A spatial database contains spatial-related data, which may be represented in the form of raster or vector
data. Raster data consists of n-dimensional bit maps or pixel maps, and vector data are represented by lines,
points, polygons or other kinds of processed primitives. Some examples of spatial databases include
geographical (map) databases, VLSI chip designs, and medical and satellite image databases.
• Time-Series Databases: Time-series databases contain time-related data such as stock market data or
logged activities. These databases usually have a continuous flow of new data coming in, which sometimes
causes the need for a challenging real time analysis. Data mining in such databases commonly includes the
study of trends and correlations between evolutions of different variables, as well as the prediction of trends
and movements of the variables in time.
• A text database is a database that contains text documents or other word descriptions in the form of long
sentences or paragraphs, such as product specifications, error or bug reports, warning messages, summary
reports, notes, or other documents.
• A multimedia database stores images, audio, and video data, and is used in applications such as picture
content-based retrieval, voice-mail systems, video-on-demand systems, the World Wide Web, and speech-
based user interfaces.
• The World-Wide Web provides rich, world-wide, on-line information services, where data objects are
linked together to facilitate interactive access. Some examples of distributed information services associated
with the World-Wide Web include America Online, Yahoo!, AltaVista, and Prodigy.
Data mining functionalities/Data mining tasks: what kinds of patterns can be mined?
Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks. In
general, data mining tasks can be classified into two categories:
• Descriptive
• predictive
Descriptive mining tasks characterize the general properties of the data in the database. Predictive mining
tasks perform inference on the current data in order to make predictions.
Describe data mining functionalities, and the kinds of patterns they can discover
(or)
Define each of the following data mining functionalities: characterization, discrimination, association and
correlation analysis, classification, prediction, clustering, and evolution analysis. Give examples of each
data mining functionality, using a real-life database that you are familiar with.
Data can be associated with classes or concepts. It describes a given set of data in a concise and summarative
manner, presenting interesting general properties of the data. These descriptions can be derived via
1. data characterization, by summarizing the data of the class under study (often called the
target class)
2. data discrimination, by comparison of the target class with one or a set of comparative
classes
3. both data characterization and discrimination
Data characterization
Example:
A data mining system should be able to produce a description summarizing the characteristics of a student
who has obtained more than 75% in every semester; the result could be a general profile of the student.
Data Discrimination is a comparison of the general features of target class data objects with the general
features of objects from one or a set of contrasting classes.
Example
The general features of students with high GPAs may be compared with the general features of students
with low GPAs. The resulting description could be a general comparative profile of the students, such as:
75% of the students with high GPAs are fourth-year computing science students, while 65% of the students
with low GPAs are not.
The output of data characterization can be presented in various forms. Examples include pie charts, bar
charts, curves, multidimensional data cubes, and multidimensional tables, including crosstabs. The resulting
descriptions can also be presented as generalized relations, or in rule form called characteristic rules.
2 Association
It is the discovery of association rules showing attribute-value conditions that occur frequently together in
a given set of data. For example, a data mining system may find association rules like

major(X, "computing science") => owns(X, "personal computer") [support = 12%, confidence = 98%]

where X is a variable representing a student. The rule indicates that of the students under study, 12%
(support) major in computing science and own a personal computer. There is a 98% probability (confidence,
or certainty) that a student in this group owns a personal computer.
Example:
A grocery store retailer wants to decide whether to put bread on sale. To help determine the impact of this decision,
the retailer generates association rules that show what other products are frequently purchased with bread.
He finds that 60% of the time that bread is sold, pretzels are also sold, and that 70% of the time jelly is also sold. Based
on these facts, he tries to capitalize on the association between bread, pretzels, and jelly by placing some
pretzels and jelly at the end of the aisle where the bread is placed. In addition, he decides not to place either
of these items on sale at the same time.
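Support and confidence for a rule such as bread => pretzels can be computed directly from a transaction list. The sketch below uses invented transactions, not the figures from the example above.

```python
# Illustrative computation of support and confidence for the rule bread => pretzels.
# The transaction data are made up for this sketch.
transactions = [
    {"bread", "pretzels", "jelly"},
    {"bread", "pretzels"},
    {"bread", "milk"},
    {"milk", "jelly"},
    {"bread", "jelly"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Estimate P(rhs | lhs): support of the combined itemset over support of the antecedent."""
    return support(lhs | rhs) / support(lhs)

print("support(bread => pretzels)    =", support({"bread", "pretzels"}))       # 2/5 = 0.4
print("confidence(bread => pretzels) =", confidence({"bread"}, {"pretzels"}))  # 2/4 = 0.5
```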
3 Classification and prediction
Classification:
Classification can be defined as the process of finding a model (or function) that describes and distinguishes
data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose
class label is unknown. The derived model is based on the analysis of a set of training data (i.e., data objects
whose class label is known).
Example:
An airport security screening station is used to determine if passengers are potential terrorists or criminals.
To do this, the face of each passenger is scanned and its basic pattern (distance between eyes, size and shape
of mouth, head, etc.) is identified. This pattern is compared to entries in a database to see if it matches any
patterns that are associated with known offenders.
The derived model may be represented in various forms, such as:
1) IF-THEN rules,
2) Decision trees,
3) Neural networks.
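A classification model in IF-THEN rule form (the first representation listed above) can be sketched as follows; the rules, attribute names, and thresholds are invented for illustration.

```python
# Invented sketch of a classifier expressed as IF-THEN rules (one of the model forms above).
def classify_passenger(record):
    """Return a class label for a passenger record using hand-written rules."""
    # IF the scanned pattern closely matches a known offender THEN flag for screening.
    if record["pattern_match_score"] >= 0.9:
        return "flag for screening"
    # IF the match is moderate AND the passenger bought a one-way ticket THEN secondary check.
    if record["pattern_match_score"] >= 0.6 and record["one_way_ticket"]:
        return "secondary check"
    # Default rule.
    return "clear"

print(classify_passenger({"pattern_match_score": 0.95, "one_way_ticket": False}))  # flag for screening
print(classify_passenger({"pattern_match_score": 0.70, "one_way_ticket": True}))   # secondary check
print(classify_passenger({"pattern_match_score": 0.20, "one_way_ticket": False}))  # clear
```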
Prediction:
Finding some missing or unavailable data values, rather than class labels, is referred to as prediction. Although
prediction may refer to both data value prediction and class label prediction, it is usually confined to data
value prediction and thus is distinct from classification. Prediction also encompasses the identification of
distribution trends based on the available data.
Example:
Predicting flooding is a difficult problem. One approach uses monitors placed at various points in the river.
These monitors collect data relevant to flood prediction: water level, rain amount, time, humidity, etc. The
water level at a potential flooding point in the river can then be predicted based on the data collected by the
sensors upriver from this point. The prediction must be made with respect to the time the data were
collected.
Classification differs from prediction in that the former is to construct a set of models (or functions) that
describe and distinguish data class or concepts, whereas the latter is to predict some missing or unavailable,
and often numerical, data values. Their similarity is that they are both tools for prediction: Classification is
used for predicting the class label of data objects and prediction is typically used for predicting missing
numerical data values.
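Prediction of a numeric value can be sketched with a simple least-squares fit; the sensor readings below are invented and the model is deliberately trivial.

```python
# Invented sketch: predict a downstream water level from an upstream sensor reading
# using a least-squares straight-line fit (numpy.polyfit).
import numpy as np

upstream_level   = np.array([1.2, 1.8, 2.5, 3.1, 3.9])   # metres, measured earlier upriver
downstream_level = np.array([0.9, 1.4, 2.0, 2.6, 3.3])   # metres, measured later downstream

slope, intercept = np.polyfit(upstream_level, downstream_level, deg=1)

new_upstream = 4.5
predicted = slope * new_upstream + intercept
print(f"predicted downstream level for upstream {new_upstream} m: {predicted:.2f} m")
```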
4 Clustering analysis
The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and
minimizing the interclass similarity.
Clustering can also facilitate taxonomy formation, that is, the organization of observations into a
hierarchy of classes that group similar events together as shown below:
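Grouping objects so that intraclass similarity is maximized can be sketched with a tiny k-means loop; the points and the choice of k below are invented.

```python
# Minimal k-means sketch on invented 2-D points: objects in the same cluster end up
# close to their cluster centre, i.e. intraclass similarity is maximized.
import numpy as np

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(loc, 0.3, size=(20, 2))
                    for loc in ([0, 0], [3, 3], [0, 4])])

k = 3
centres = points[rng.choice(len(points), size=k, replace=False)]
for _ in range(10):                       # a fixed number of refinement steps
    # Assign each point to its nearest centre (Euclidean distance).
    dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Move each centre to the mean of its assigned points (keep it if the cluster is empty).
    centres = np.array([points[labels == c].mean(axis=0) if np.any(labels == c) else centres[c]
                        for c in range(k)])

print("cluster sizes:", np.bincount(labels, minlength=k))
print("cluster centres:\n", centres)
```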
Are All the “Discovered” Patterns Interesting?
➢ A data mining system/query may generate thousands of patterns; not all of them are interesting.
o Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
o Subjective: based on the user's belief in the data, e.g., unexpectedness, novelty, actionability, etc.
o Approaches
▪ First generate all the patterns and then filter out the uninteresting ones.
➢ This rule is very accurate and comprehensible, but it is not interesting, since it represents the
obvious. Another example comes from a real-world data set:
➢ Rule (1) is a general and obvious rule, but rule (2) contradicts the knowledge represented by
rule (1), and so contradicts the user's belief. This kind of knowledge is unexpected given the user's
preset beliefs, and it is always interesting to extract such surprising knowledge from data sets.
“Unexpectedness” means knowledge which is unexpected given the beliefs of the users, i.e., a
decision rule is considered interesting (or surprising) if it represents knowledge that was not only
previously unknown to the users but also contradicts their original beliefs.
First, discover the general rules and then discover exceptions to these general rules. For
example:
➢ A general rule: If bird, then fly.
➢ However, there are a few exceptional birds, like the emu and the penguin, that do not fly. It would
definitely be valuable to discover such exceptions along with the rule, making the rule more
accurate, comprehensible, and interesting.
Classification of data mining systems
There are many data mining systems available or being developed. Some are specialized systems dedicated
to a given data source or are confined to limited data mining functionalities; others are more versatile and
comprehensive. Data mining systems can be categorized according to various criteria; among others, the
following classifications can be used:
· Classification according to the type of data source mined: this classification categorizes data mining
systems according to the type of data handled such as spatial data, multimedia data, time-series data, text
data, World Wide Web, etc.
· Classification according to the data model drawn on: this classification categorizes data mining
systems based on the data model involved such as relational database, object-oriented database, data
warehouse, transactional, etc.
· Classification according to the kind of knowledge discovered: this classification categorizes data
mining systems based on the kind of knowledge discovered or data mining functionalities, such as
characterization, discrimination, association, classification, clustering, etc. Some systems tend to be
comprehensive systems offering several data mining functionalities together.
· Classification according to mining techniques used: Data mining systems employ and provide
different techniques. This classification categorizes data mining systems according to the data analysis
approach used such as machine learning, neural networks, genetic algorithms, statistics, visualization,
database oriented or data warehouse-oriented, etc. The classification can also take into account the degree
of user interaction involved in the data mining process such as query-driven systems, interactive exploratory
systems, or autonomous systems. A comprehensive system would provide a wide variety of data mining
techniques to fit different situations and options, and offer different degrees of user interaction.
• Task-relevant data: This primitive specifies the data upon which mining is to be performed. It involves
specifying the database and tables or data warehouse containing the relevant data, conditions for selecting
the relevant data, the relevant attributes or dimensions for exploration, and instructions regarding the
ordering or grouping of the data retrieved.
• Knowledge type to be mined: This primitive specifies the specific data mining function to be performed,
such as characterization, discrimination, association, classification, clustering, or evolution analysis. As
well, the user can be more specific and provide pattern templates that all discovered patterns must match.
These templates or meta patterns (also called meta rules or meta queries), can be used to guide the discovery
process.
• Background knowledge: This primitive allows users to specify knowledge they have about the domain
to be mined. Such knowledge can be used to guide the knowledge discovery process and evaluate the
patterns that are found. Of the several kinds of background knowledge, this chapter focuses on concept
hierarchies.
• Pattern interestingness measure: This primitive allows users to specify functions that are used to
separate uninteresting patterns from knowledge and may be used to guide the mining process, as well as to
evaluate the discovered patterns. This allows the user to confine the number of uninteresting patterns
returned by the process, as a data mining process may generate a large number of patterns. Interestingness
measures can be specified for such pattern characteristics as simplicity, certainty, utility and novelty.
• Visualization of discovered patterns: This primitive refers to the form in which discovered patterns are
to be displayed. In order for data mining to be effective in conveying knowledge to users, data mining
systems should be able to display the discovered patterns in multiple forms such as rules, tables, cross tabs
(cross-tabulations), pie or bar charts, decision trees, cubes or other visual representations.
Major issues in data mining concern mining methodology, user interaction, performance, and diverse
data types:
_ Mining different kinds of knowledge in databases: Since different users can be interested in different
kinds of knowledge, data mining should cover a wide spectrum of data analysis and knowledge discovery
tasks, including data characterization, discrimination, association, classification, clustering, trend and
deviation analysis, and similarity analysis. These tasks may use the same database in different ways and
require the development of numerous data mining techniques.
_ Interactive mining of knowledge at multiple levels of abstraction: Since it is difficult to know exactly
what can be discovered within a database, the data mining process should be interactive.
_ Data mining query languages and ad-hoc data mining: Just as relational query languages
(such as SQL) allow users to pose ad-hoc queries for data retrieval, high-level data mining query
languages are needed to support ad-hoc mining tasks.
_ Presentation and visualization of data mining results: Discovered knowledge should be expressed in
high-level languages, visual representations, so that the knowledge can be easily understood and directly
usable by humans
_ Handling outlier or incomplete data: The data stored in a database may reflect outliers: noise,
exceptional cases, or incomplete data objects. These objects may confuse the analysis process, causing over
fitting of the data to the knowledge model constructed. As a result, the accuracy of the discovered patterns
can be poor. Data cleaning methods and data analysis methods which can handle outliers are required.
_ Pattern evaluation: refers to interestingness of pattern: A data mining system can uncover thousands
of patterns. Many of the patterns discovered may be uninteresting to the given user, representing common
knowledge or lacking novelty. Several challenges remain regarding the development of techniques to assess
the interestingness of discovered patterns,
2. Performance issues. These include efficiency, scalability, and parallelization of data mining
algorithms.
_ Efficiency and scalability of data mining algorithms: To effectively extract information from a huge
amount of data in databases, data mining algorithms must be efficient and scalable.
_ Parallel, distributed, and incremental updating algorithms: Such algorithms divide the data into
partitions, which are processed in parallel. The results from the partitions are then merged.
_ Handling of relational and complex types of data: Since relational databases and data warehouses are
widely used, the development of efficient and effective data mining systems for such data is important.
_ Mining information from heterogeneous databases and global information systems: Local and wide-
area computer networks (such as the Internet) connect many sources of data, forming huge, distributed, and
heterogeneous databases. The discovery of knowledge from different sources of structured, semi-structured,
or unstructured data with diverse data semantics poses great challenges to data mining.
Data Integration
Data integration involves combining data from several disparate sources, which are stored using various
technologies, in order to provide a unified view of the data. Data integration becomes increasingly important in
cases of merging systems of two companies or consolidating applications within one company to provide a
unified view of the company's data assets.
Data Migration
Data Migration is the process of transferring data from one system to another while changing the storage,
database or application. In reference to the ETL (Extract-Transform-Load) process, data migration always
requires at least Extract and Load steps.
Data Synchronization
Data Synchronization is a process of establishing consistency among systems and subsequent continuous
updates to maintain consistency. The word 'continuous' should be stressed here as the data synchronization
should not be considered as a one-time task.
ETL
ETL comes from Data Warehousing and stands for Extract-Transform-Load. ETL covers a process of how
the data are loaded from the source system to the data warehouse.
Business Intelligence
Business Intelligence (BI) is a set of tools supporting the transformation of raw data into useful information
which can support decision making. Business Intelligence provides reporting functionality,
tools for identifying data clusters, support for data mining techniques, business performance management
and predictive analysis.
DW and OLAP are essential elements of decision support. They are complementary technologies
- a DW stores and manages data while OLAP transforms the data into possibly strategic information.
Decision support usually requires consolidating data from many heterogeneous sources: these might include
external sources in addition to several operational databases. The different sources might contain data of
varying quality, or use inconsistent representations, codes and formats, which have to be reconciled. Since
data warehouses contain consolidated data, perhaps from several operational databases, over potentially
long periods of time, they tend to be orders of magnitude larger than operational databases. In data
warehouses, historical and summarized data is more important than detailed data. Many organizations want to
implement an integrated enterprise warehouse that collects information about all subjects spanning the
whole organization. Some organizations are settling for data marts instead, which are departmental subsets
focused on selected subjects (e.g., a marketing data mart, personnel data mart).
A popular conceptual model that influences the front-end tools, database design, and the query
engines for OLAP is the multidimensional view of data in the warehouse. In a multidimensional data model,
there is a set of numeric measures that are the objects of analysis. Examples of such measures are sales,
revenue, and profit. Each of the numeric measures depends on a set of dimensions, which provide the
context for the measure. For example, the dimensions associated with a sale amount can be the store,
product, and the date when the sale was made.
The dimensions together are assumed to uniquely determine the measure. Thus, the
multidimensional data views a measure as a value in the multidimensional space of dimensions. Often,
dimensions are hierarchical; time of sale may be organized as a day-month-quarter-year hierarchy, product
as a product-category-industry hierarchy [4].
Typical OLAP operations, also called cubing operations, include rollup (increasing the level of
aggregation) and drill-down (decreasing the level of aggregation or increasing detail) along one or more
dimension hierarchies, slicing (the extraction from a data cube of summarized data for a given dimension-
value), dicing (the extraction of a “sub-cube”, or intersection of several slices), and pivot (rotating of the
axes of a data cube so that one may examine a cube from different angles). Other technology, that can be
used for querying the warehouse is Data Mining.
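The cubing operations above can be imitated on a small scale with ordinary group-by aggregation. Below is a minimal, purely illustrative sketch using pandas; the sales table and all column names are invented for the example.

```python
# Illustrative only: roll-up, drill-down, slice, dice and pivot on a tiny sales table.
import pandas as pd

sales = pd.DataFrame({
    "store":   ["Sofia", "Sofia", "Toronto", "Toronto"],
    "item":    ["books", "clothing", "books", "clothing"],
    "quarter": ["Q1", "Q1", "Q1", "Q2"],
    "amount":  [100, 250, 80, 300],
})

# Roll-up: increase the level of aggregation (item detail -> totals per store).
rollup_store = sales.groupby("store")["amount"].sum()

# Drill-down: decrease the level of aggregation (store -> store and item).
drilldown = sales.groupby(["store", "item"])["amount"].sum()

# Slice: fix one dimension value (only Q1) and look at the remaining sub-cube.
slice_q1 = sales[sales["quarter"] == "Q1"]

# Dice: intersection of several slices (Q1 sales of books).
dice = sales[(sales["quarter"] == "Q1") & (sales["item"] == "books")]

# Pivot: rotate the axes to view stores against items (margins give "ALL" totals).
pivot = sales.pivot_table(values="amount", index="store",
                          columns="item", aggfunc="sum", margins=True)

print(rollup_store, drilldown, slice_q1, dice, pivot, sep="\n\n")
```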
Knowledge Discovery (KD) is a nontrivial process of identifying valid, novel, potentially useful,
and ultimately understandable patterns from large collections of data [6]. One of the KD steps is Data
Mining (DM). DM is the step that is concerned with the actual extraction of knowledge from data, in
contrast to the KD process that is concerned with many other things like understanding and preparation of
the data and verification of the mining results. In practice, however, the terms DM and KD are often used
synonymously.
[Figure: example data cube for sales, with a store dimension (Sofia, Toronto, ALL) and an item dimension (books, clothing, ALL); the apex cell (ALL, ALL, ALL) aggregates over all dimensions.]
Major issues in data mining include:
➢ Examining the data from different angles.
➢ Incorporation of background knowledge.
➢ Handling noisy data so that discovered patterns do not over-fit the data (outlier analysis and data cleaning methods can be applied).
➢ Pattern evaluation: the interestingness problem.
Performance Issues
➢ Running time and high cost.
➢ Computational complexity.
➢ Web mining: uncovering knowledge about web content, web structure, web usage and web dynamics.
Data preprocessing
Data preprocessing describes any type of processing performed on raw data to prepare it for another
processing procedure. Commonly used as a preliminary data mining practice, data preprocessing transforms
the data into a format that will be more easily and effectively processed for the purpose of the user.
Why Data Preprocessing?
Data in the real world is dirty: it can be incomplete, noisy, and inconsistent. Such data needs to be
preprocessed in order to improve the quality of the data and, in turn, the quality of the mining results.
❖ Without quality data there can be no quality mining results; quality decisions must be based on quality data.
❖ If there is much irrelevant and redundant information, or noisy and unreliable data, knowledge discovery
during the training phase becomes more difficult.
Incomplete data: lacking attribute values, lacking certain attributes of interest, or containing only
aggregate data. e.g., occupation=“ ”.
❖ Data cleaning
➢ Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
❖ Data integration
❖ Data transformation
❖ Data reduction
➢ Obtains reduced representation in volume but produces the same or similar analytical results
❖ Data discretization
➢ Part of data reduction but with particular importance, especially for numerical data
Forms of Data Preprocessing
Data cleaning:
Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and
correct inconsistencies in the data.
The various methods for handling the problem of missing values in data tuples include:
(a) Ignoring the tuple: This is usually done when the class label is missing (assuming the mining task
involves classification or description). This method is not very effective unless the tuple contains several
attributes with missing values. It is especially poor when the percentage of missing values per attribute
varies considerably.
(b) Manually filling in the missing value: In general, this approach is time-consuming and may not be a
reasonable task for large data sets with many missing values, especially when the value to be filled in is not
easily determined.
(c) Using a global constant to fill in the missing value: Replace all missing attribute values by the same
constant, such as a label like “Unknown,” or −∞. If missing values are replaced by, say, “Unknown,” then
the mining program may mistakenly think that they form an interesting concept, since they all have a value
in common — that of “Unknown.” Hence, although this method is simple, it is not recommended.
(d) Using the attribute mean for quantitative (numeric) values or attribute mode for categorical
(nominal) values, for all samples belonging to the same class as the given tuple: For example, if
classifying customers according to credit risk, replace the missing value with the average income value for
customers in the same credit risk category as that of the given tuple.
(e) Using the most probable value to fill in the missing value: This may be determined with regression,
inference-based tools using Bayesian formalism, or decision tree induction. For example, using the other
customer attributes in your data set, you may construct a decision tree to predict the missing values for
income.
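As a rough sketch of strategies (c) and (d) above, the following hypothetical example fills missing income values first with a global constant and then with the class-wise attribute mean; the table and the credit_risk class column are invented.

```python
# Illustrative sketch of missing-value handling; the table and its
# column names (credit_risk, income) are invented for this example.
import pandas as pd

df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income":      [52000, None, 18000, None, 61000],
})

# (c) Global constant: mark missing values with a label such as "Unknown".
filled_constant = df["income"].astype("object").fillna("Unknown")

# (d) Class-wise attribute mean: replace a missing income with the mean
#     income of tuples in the same credit-risk class.
class_means = df.groupby("credit_risk")["income"].transform("mean")
filled_class_mean = df["income"].fillna(class_means)

print(filled_constant.tolist())
print(filled_class_mean.tolist())
```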
Noisy data:
Noise is a random error or variance in a measured variable. Data smoothing tech is used for removing such
noisy data.
1 Binning methods: Binning methods smooth a sorted data value by consulting its "neighborhood", that is, the
values around it. The sorted values are distributed into a number of 'buckets', or bins. Because binning
methods consult the neighborhood of values, they perform local smoothing.
In this technique:
o Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
o Partition into (equi-depth) bins of depth 4 (each bin contains four values):
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. For example, the
mean of the values 4, 8, 9, and 15 in Bin 1 is 9. Therefore, each original value in this bin is replaced by the
value 9. Similarly, smoothing by bin medians can be employed, in which each bin value is replaced by the
bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are
identified as the bin boundaries. Each bin value is then replaced by the closest boundary value.
Suppose that the data for analysis include the attribute age. The age values for the data tuples are (in
increasing order): 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40,
45, 46, 52, 70.
(a) Use smoothing by bin means to smooth the above data, using a bin depth of 3. Illustrate your steps.
Comment on the effect of this technique for the given data.
The following steps are required to smooth the above data using smoothing by bin means with a bin
depth of 3.
• Step 1: Sort the data. (This step is not required here as the data are already sorted.)
• Step 2: Partition the data into equi-depth bins of depth 3.
• Step 3: Calculate the arithmetic mean of each bin.
• Step 4: Replace each of the values in each bin by the arithmetic mean calculated for the bin.
Bin 1: 14.67, 14.67, 14.67  Bin 2: 18.33, 18.33, 18.33  Bin 3: 21, 21, 21
Bin 4: 24, 24, 24  Bin 5: 26.67, 26.67, 26.67  Bin 6: 33.67, 33.67, 33.67
Bin 7: 35, 35, 35  Bin 8: 40.33, 40.33, 40.33  Bin 9: 56, 56, 56
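A small plain-Python sketch of smoothing by bin means for the age data above; the printed bin means should match the values listed, up to rounding.

```python
# Smoothing by bin means with a bin depth of 3 (equi-depth binning).
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
        30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

depth = 3
smoothed = []
for start in range(0, len(ages), depth):                  # Step 2: partition into bins
    bin_values = ages[start:start + depth]
    mean = sum(bin_values) / len(bin_values)              # Step 3: mean of each bin
    smoothed.extend([round(mean, 2)] * len(bin_values))   # Step 4: replace the values

print(smoothed)
# Bin means: 14.67, 18.33, 21.0, 24.0, 26.67, 33.67, 35.0, 40.33, 56.0
```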
2 Clustering: Outliers in the data may be detected by clustering, where similar values are organized into
groups, or 'clusters'. Values that fall outside of the set of clusters may be considered outliers.
3 Regression: Data can be smoothed by fitting the data to a function, such as with regression.
• Linear regression involves finding the best line to fit two variables, so that one variable can be used to
predict the other.
• Multiple linear regression is an extension of linear regression, where more than two variables are involved
and the data are fit to a multidimensional surface.
Using regression to find a mathematical equation to fit the data helps smooth out the noise.
Field overloading is a kind of error source that typically occurs when developers compress new attribute
definitions into unused portions of already defined attributes.
Unique rule: a rule stating that each value of the given attribute must be different from all other values of
that attribute.
Consecutive rule: a rule stating that there can be no missing values between the lowest and highest values
of the attribute, and that all values must also be unique.
Null rule: specifies the use of blanks, question marks, special characters, or other strings that may indicate
the null condition, and how such values should be handled.
UNIT IV
A typical example of frequent itemset mining is market basket analysis. This process analyzes customer
buying habits by finding associations between the different items that customers place in their "shopping baskets"
(Figure 5.1). The discovery of such associations can help retailers develop marketing strategies by gaining insight
into which items are frequently purchased together by customers. For instance, if customers are buying milk, how
likely are they to also buy bread (and what kind of bread) on the same trip to the supermarket? Such information
can lead to increased sales by helping retailers do selective marketing and plan their shelf space.
If we think of the universe as the set of items available at the store, then each item has a Boolean variable
representing the presence or absence of that item. Each basket can then be represented by a Boolean vector of
values assigned to these variables. The Boolean vectors can be analyzed for buying patterns that reflect items that
are frequently associated or purchased together. These patterns can be represented in the form of association rules.
For example, the information that customers who purchase computers also tend to buy antivirus software at the
same time is represented in Association Rule (5.1) below:
computer ⇒ antivirus_software [support = 2%, confidence = 60%]     (5.1)
Rule support and confidence are two measures of rule interestingness. They respectively reflect the usefulness and
certainty of discovered rules. A support of 2% for Association Rule (5.1) means that 2% of all the transactions under analysis
show that computer and antivirus software are purchased together. A confidence of 60% means that 60% of the customers
who purchased a computer also bought the software. Typically, association rules are considered interesting if they
satisfy both a minimum support threshold and a minimum confidence threshold. Such thresholds can be set by users or domain
experts. Additional analysis can be performed to uncover interesting statistical correlations between associated items.
• Rules that satisfy both a minimum support threshold (min sup) and a minimum confidence
threshold (min conf) are called Strong Association Rules.
1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently as a
predetermined minimum support count, min_sup.
2. Generate strong association rules from the frequent itemsets: By definition, these rules must satisfy
minimum support and minimum confidence.
Once the frequent itemsets from transactions in a database D have been found, it is straightforward to generate strong
association rules from them (where strong association rules satisfy both minimum support and minimum confidence).
This can be done using Equation (5.4) for confidence, which we show again here for completeness:
confidence(A ⇒ B) = P(B | A) = support_count(A ∪ B) / support_count(A)     (5.4)
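The two steps (finding frequent itemsets, then generating strong rules with the confidence formula above) can be sketched in a few lines of plain Python. This is a naive brute-force enumeration for illustration only, with an invented toy transaction set, not an efficient Apriori implementation.

```python
# Naive frequent-itemset mining and rule generation (illustrative sketch).
from itertools import combinations

transactions = [
    {"computer", "antivirus"},
    {"computer", "antivirus", "printer"},
    {"computer", "printer"},
    {"antivirus"},
]
min_sup, min_conf = 0.5, 0.6
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / n

# Step 1: find all frequent itemsets (brute-force enumeration of candidates).
items = sorted(set().union(*transactions))
frequent = {}
for k in range(1, len(items) + 1):
    for candidate in combinations(items, k):
        s = support(frozenset(candidate))
        if s >= min_sup:
            frequent[frozenset(candidate)] = s

# Step 2: generate strong rules A => B with confidence = sup(A u B) / sup(A).
for itemset, sup in frequent.items():
    if len(itemset) < 2:
        continue
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            conf = sup / frequent[antecedent]
            if conf >= min_conf:
                consequent = itemset - antecedent
                print(f"{set(antecedent)} => {set(consequent)} "
                      f"[support={sup:.2f}, confidence={conf:.2f}]")
```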
FP-Growth Method: Mining Frequent Itemsets without Candidate Generation
As we have seen, in many cases the Apriori candidate generate-and-test method significantly reduces the size of
candidate sets, leading to good performance gains. However, it can still suffer from the cost of generating a huge
number of candidate sets and of repeatedly scanning the database to check them.
An interesting method that avoids candidate generation is called frequent-pattern growth, or simply FP-growth, which adopts a divide-
and-conquer strategy as follows. First, it compresses the database representing frequent items into a frequent-pattern
tree, or FP-tree, which retains the itemset association information. It then divides the compressed database into a set
of conditional databases (a special kind of projected database), each associated with one frequent item or "pattern
fragment," and mines each such database separately. The following example shows how it works.
Example 5.5 FP-growth (finding frequent itemsets without candidate generation). We re-examine the mining of
transaction database, D, of Table 5.1 in Example 5.3 using the frequent pattern growth approach.
Figure 5.7 An FP-tree registers compressed, frequent pattern information.
Figure 5.9 The FP-growth algorithm for discovering frequent item sets without candidate generation.
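A minimal sketch of the first FP-growth step, building the FP-tree that compresses the transactions; the transaction list and minimum count are invented, and the mining of conditional pattern bases is omitted.

```python
# Building an FP-tree: keep only frequent items, order them by descending
# global frequency, and insert shared prefixes into a tree (illustrative).
from collections import Counter

transactions = [["f", "a", "c", "d", "g"], ["a", "b", "c", "f"],
                ["b", "f", "h"], ["b", "c", "k"], ["a", "f", "c", "e"]]
min_count = 3

counts = Counter(item for t in transactions for item in t)
frequent = {item for item, c in counts.items() if c >= min_count}

class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

root = Node(None)
for t in transactions:
    # Global frequency order (ties broken alphabetically for determinism).
    ordered = sorted((i for i in t if i in frequent),
                     key=lambda i: (-counts[i], i))
    node = root
    for item in ordered:
        node = node.children.setdefault(item, Node(item))
        node.count += 1          # shared prefixes are counted, not duplicated

def show(node, indent=0):
    """Print the tree; its size can be compared with the transaction list."""
    for child in node.children.values():
        print(" " * indent + f"{child.item}:{child.count}")
        show(child, indent + 2)

show(root)
```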
Mining Various Kinds of Association Rules
1) Mining Multilevel Association Rules
For many applications, it is difficult to find strong associations among data items at low or primitive levels of
abstraction due to the sparsity of data at those levels. Strong associations discovered at high levels of abstraction may represent
commonsense knowledge. Moreover, what may represent common sense to one user may seem novel to another. Therefore,
data mining systems should provide capabilities for mining association rules at multiple levels of abstraction, with sufficient
flexibility for easy traversal among different abstraction spaces.
Mining multilevel association rules. Suppose we are given the task-relevant set of transactional data in Table for
sales in an AllElectronics store, showing the items purchased for each transaction. The concept hierarchy for the items is
shown in Figure 5.10. A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher level,
more general concepts. Data can be generalized by replacing low-level concepts
within the data by their higher-level concepts, or ancestors, from a concept hierarchy.
Association rules generated from mining data at multiple levels of abstraction are called multiple-level or multilevel
association rules. Multilevel association rules can be mined efficiently using concept hierarchies under a support- confidence
framework. In general, a top-down strategy is employed, where counts are accumulated for the calculation of frequent itemsets
at each concept level, starting at the concept level 1 and working downward in the hierarchy toward the more specific concept
levels, until no more frequent itemsets can be found. For each level, any algorithm for discovering frequent itemsets may be
used, such as Apriori or its variations.
• Using uniform minimum support for all levels (referred to as uniform support): The same minimum
support threshold is used when mining at each level of abstraction. For example, in Figure 5.11, a
minimum support threshold of 5% is used throughout (e.g., for mining from “computer” down to “laptop
computer”). Both “computer” and “laptop computer” are found to be frequent, while “desktop computer”
is not.
When a uniform minimum support threshold is used, the search procedure is simplified. The method is also
simple in that users are required to specify only one minimum support threshold. An Apriori-like
optimization technique can be adopted, based on the knowledge that an ancestor is a superset of its
descendants: The search avoids examining itemsets containing any item whose ancestors do not have
minimum support.
• Using reduced minimum support at lower levels (referred to as reduced support): Each level of
abstraction has its own minimum support threshold. The deeper the level of abstraction, the smaller the
corresponding threshold is. For example, in Figure, the minimum support thresholds for levels 1 and 2 are
5% and 3%, respectively. In this way, “computer,” “laptop computer,” and “desktop computer” are all
considered frequent.
• Using item or group-based minimum support (referred to as group-based support): Because users or
experts often have insight as to which groups are more important than others, it is sometimes more desirable
to set up user-specific, item, or group based minimal support thresholds when mining multilevel rules. For
example, a user could set up the minimum support thresholds based on product price, or on items of interest,
such as by setting particularly low support thresholds for laptop computers and flash drives in order to pay
particular attention to the association patterns containing items in these categories.
2) Mining Multidimensional Association Rules from Relational Databases and DataWarehouses
We have studied association rules that imply a single predicate, that is, the predicate buys. For instance, in mining our
AllElectronics database, we may discover the Boolean association rule
buys(X, "computer") ⇒ buys(X, "antivirus software")
Following the terminology used in multidimensional databases, we refer to each distinct predicate in a rule as a dimension.
Hence, we can refer to Rule above as a single dimensional or intra dimensional association rule because it contains a single
distinct predicate (e.g., buys) with multiple occurrences (i.e., the predicate occurs more than once within the rule). As we have
seen in the previous sections of this chapter, such rules are commonly mined from transactional data.
Considering each database attribute or warehouse dimension as a predicate, we can therefore mine association rules
containing multiple predicates, such as
age(X, "20…29") ∧ occupation(X, "student") ⇒ buys(X, "laptop")
Association rules that involve two or more dimensions or predicates can be referred to as multidimensional
association rules. Rule above contains three predicates (age, occupation, and buys), each of which occurs only once in the
rule. Hence, we say that it has no repeated predicates. Multidimensional association rules with no repeated predicates are
called inter dimensional association rules. We can also mine multidimensional association rules with repeated predicates,
which contain multiple occurrences of some predicates. These rules are called hybrid-dimensional association rules. An
example of such a rule is the following, where the predicate buys is repeated:
age(X, "20…29") ∧ buys(X, "laptop") ⇒ buys(X, "HP printer")
Note that database attributes can be categorical or quantitative. Categorical attributes have a finite number of possible
values, with no ordering among the values (e.g., occupation, brand, color). Categorical attributes are also called nominal
attributes, because their values are "names of things." Quantitative attributes are numeric and have an implicit ordering
among values (e.g., age, income, price). Techniques for mining multidimensional association rules can be categorized into
two basic approaches regarding the treatment of quantitative attributes.
Quantitative association rules are multidimensional association rules in which the numeric attributes are dynamically
discretized. Such rules take the form
Aquan1 ∧ Aquan2 ⇒ Acat
where Aquan1 and Aquan2 are tests on quantitative attribute intervals (where the intervals are dynamically determined), and Acat
tests a categorical attribute from the task-relevant data. Such rules have been referred to as two-dimensional quantitative
association rules, because they contain two quantitative dimensions. For instance, suppose you are curious about the
association relationship between pairs of quantitative attributes, like customer age and income, and the type of television (such
as high-definition TV, i.e., HDTV) that customers like to buy. An example of such a 2-D quantitative association rule is
age(X, "30…39") ∧ income(X, "42K…48K") ⇒ buys(X, "HDTV")
Binning: Quantitative attributes can have a very wide range of values defining their domain. Just think about how big a 2-D
grid would be if we plotted age and income as axes, where each possible value of age was assigned a unique position on one
axis, and similarly, each possible value of income was assigned a unique position on the other axis! To keep grids down to a
manageable size, we instead partition the ranges of quantitative attributes into intervals. These intervals are dynamic in that
they may later be further combined during the mining process. The partitioning process is referred to as binning, that is,
the intervals are treated as "bins." Three common binning strategies are as follows: equal-width binning, where the interval
size of each bin is the same; equal-frequency (equi-depth) binning, where each bin has approximately the same number of
tuples assigned to it; and clustering-based binning, where clustering is performed on the quantitative attribute to group
neighboring points into the same bin.
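The equal-width and equal-frequency strategies can be illustrated with pandas (the age values below are invented; clustering-based binning is omitted):

```python
# Equal-width vs. equal-frequency binning of a quantitative attribute.
import pandas as pd

ages = pd.Series([22, 25, 27, 31, 34, 38, 42, 47, 53, 61])

# Equal-width binning: each bin spans the same interval size.
equal_width = pd.cut(ages, bins=3)

# Equal-frequency (equi-depth) binning: each bin gets roughly the same
# number of tuples.
equal_freq = pd.qcut(ages, q=3)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```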
Finding frequent predicate sets: Once the 2-D array containing the count distribution for each category is set up, it can be
scanned to find the frequent predicate sets (those satisfying minimum support) that also satisfy minimum confidence. Strong
association rules can then be generated from these predicate sets, using a rule generation algorithm.
From Association Mining to Correlation Analysis
Most association rule mining algorithms employ a support-confidence framework. Often, many interesting rules can
be found using low support thresholds. Although minimum support and confidence thresholds help weed out or exclude the
exploration of a good number of uninteresting rules, many rules so generated are still not interesting to the users.
Unfortunately, this is especially true when mining at low support thresholds or mining for long patterns. This has been one of
the major bottlenecks for successful application of association rule mining.
Whether or not a rule is interesting can be assessed either subjectively or objectively. Ultimately, only the user can
judge if a given rule is interesting, and this judgment, being subjective, may differ from one user to another. However,
objective interestingness measures, based on the statistics "behind" the data, can be used as one step toward the goal of
weeding out uninteresting rules from presentation to the user.
The support and confidence measures are insufficient at filtering out uninteresting association rules. To tackle this
weakness, a correlation measure can be used to augment the support-confidence framework for association rules. This leads
to correlation rules of the form
A ⇒ B [support, confidence, correlation]
That is, a correlation rule is measured not only by its support and confidence but also by the correlation between itemsets
A and B. There are many different correlation measures from which to choose. In this section, we study various correlation
measures to determine which would be good for mining large data sets.
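One widely used correlation measure is lift, lift(A, B) = P(A ∪ B) / (P(A) P(B)): a value near 1 suggests independence, above 1 positive correlation, below 1 negative correlation. A tiny sketch with counts similar to the common games-and-videos textbook illustration (the exact numbers here are only illustrative):

```python
# Lift as a correlation measure for an association rule A => B.
# Illustrative counts: 10,000 transactions, 6,000 contain A,
# 7,500 contain B, 4,000 contain both.
n_total, n_a, n_b, n_ab = 10_000, 6_000, 7_500, 4_000

p_a, p_b, p_ab = n_a / n_total, n_b / n_total, n_ab / n_total

support = p_ab                      # P(A and B)
confidence = p_ab / p_a             # P(B | A)
lift = p_ab / (p_a * p_b)           # > 1: positive correlation, < 1: negative

print(f"support={support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
# Here lift = 0.4 / 0.45, about 0.89 < 1, so A and B are negatively correlated
# even though the confidence (about 0.67) looks high.
```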
where P1 and P2 are predicate variables that are instantiated to attributes from the given database during the mining
process, X is a variable representing a customer, and Y and W take on values of the attributes assigned to P1 and P2,
respectively. Typically, a user will specify a list of attributes to be considered for instantiation with P1 and P2. Otherwise, a
default set may be used.
Our association mining query is to “Find the sales of which cheap items (where the sum of the prices is less than $100)
may promote the sales of which expensive items (where the minimum price is $500) of the same group for Chicago customers
in 2004.” This can be expressed in the DMQL data mining query language as follows,
Classification and Prediction
A bank loans officer needs analysis of her data in order to learn which loan applicants are "safe" and which
are "risky" for the bank. A marketing manager at AllElectronics needs data analysis to help guess whether a
customer with a given profile will buy a new computer. A medical researcher wants to analyze breast cancer data
in order to predict which one of three specific treatments a patient should receive. In each of these examples, the
data analysis task is classification, where a model or classifier is constructed to predict categorical labels, such as
"safe" or "risky" for the loan application data; "yes" or "no" for the marketing data; or "treatment A,"
"treatment B," or "treatment C" for the medical data. These categories can be represented by discrete values,
where the ordering among values has no meaning. For example, the values 1, 2, and 3 may be used to represent
treatments A, B, and C, where there is no ordering implied among this group of treatment regimes.
Suppose that the marketing manager would like to predict how much a given customer will spend during
a sale at AllElectronics. This data analysis task is an example of numeric prediction, where the model constructed
predicts a continuous-valued function, or ordered value, as opposed to a categorical label. This model is a predictor.
How does classification work? Data classification is a two-step process, as shown for the loan application
data of Figure 6.1. (The data are simplified for illustrative purposes. In reality, we may expect many more attributes
to be considered.) In the first step, a classifier is built describing a predetermined set of data classes or concepts.
This is the learning step (or training phase), where a classification algorithm builds the classifier by analyzing or
"learning from" a training set made up of database tuples and their associated class labels.
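A minimal sketch of the two-step process using scikit-learn (assumed to be available); the tiny loan-style training table and its feature values are invented for illustration.

```python
# Two-step classification: (1) learn a classifier from labelled training
# tuples, (2) use it to predict class labels for new, unseen tuples.
from sklearn.tree import DecisionTreeClassifier

# Invented training data: [age, income] and a class label per applicant.
X_train = [[25, 20_000], [40, 60_000], [35, 30_000], [50, 90_000], [23, 15_000]]
y_train = ["risky", "safe", "risky", "safe", "risky"]

# Step 1 (learning / training phase): build the classifier.
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X_train, y_train)

# Step 2 (classification phase): predict labels for previously unseen tuples.
X_new = [[30, 25_000], [45, 80_000]]
print(clf.predict(X_new))
```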
Issues Regarding Classification and Prediction
•Data cleaning: This refers to the preprocessing of data in order to remove or reduce noise (by applying
smoothing techniques, for example) and the treatment of missing values (e.g., by replacing a missing value
with the most commonly occurring value for that attribute, or with the most probable value based on
statistics). Although most classification algorithms have some mechanisms for handling noisy or missing
data, this step can help reduce confusion during learning.
• Relevance analysis: Many of the attributes in the data may be redundant. Correlation analysis can be used
to identify whether any two given attributes are statistically related. For example, a strong correlation
between attributes A1 and A2 would suggest that one of the two could be removed from further analysis.
A database may also contain irrelevant attributes. Attribute subset selection4 can be used in these cases to
find a reduced set of attributes such that the resulting probability distribution of the data classes is as close
as possible to the original distribution obtained using all attributes. Hence, relevance analysis, in the form
of correlation analysis and attribute subset selection, can be used to detect attributes that do not contribute
to the classification or prediction task. Including such attributes may otherwise slow down, and possibly
mislead, the learning step. Ideally, the time spent on relevance analysis, when added to the time spent on
learning from the resulting "reduced" attribute (or feature) subset, should be less than the time that would
have been spent on learning from the original set of attributes. Hence, such analysis can help improve
classification efficiency and scalability.
• Data transformation and reduction: The data may be transformed by normalization, particularly when
neural networks or methods involving distance measurements are used in the learning step. Normalization
involves scaling all values for a given attribute so that they fall within a small specified range, such as -
1.0 to 1.0, or 0.0 to 1.0. In methods that use distance measurements, for example, this would prevent
attributes with initially large ranges (like, say, income) from outweighing attributes with initially smaller
ranges (such as binary attributes).
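Min-max normalization, one common normalization method, rescales a value v of attribute A to a new range via v' = (v − min_A) / (max_A − min_A) × (new_max − new_min) + new_min. A short sketch with invented income values:

```python
# Min-max normalization of an attribute to the range [0.0, 1.0].
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    scale = (new_max - new_min) / (old_max - old_min)
    return [(v - old_min) * scale + new_min for v in values]

incomes = [12_000, 35_000, 54_000, 73_000, 98_000]
print(min_max_normalize(incomes))
# Large-range attributes like income no longer outweigh small-range
# (e.g. binary) attributes in distance computations.
```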
Comparing Classification and Prediction Methods
Classification and prediction methods can be compared and evaluated according to the following criteria:
• Accuracy
• Speed
• Robustness
• Scalability
• Interpretability
Learning Systems
◼ Linear discriminant model - mathematical equation (p = ax1 + bx2 + cx3 + dx4 + ex5).
◼ Presentation comprehensibility
Decision Trees
1. Let T be the set of training instances.
2. Choose an attribute that best differentiates the instances contained in T.
3. Create a tree node whose value is the chosen attribute. Create child links from this node where each link
represents a unique value for the chosen attribute. Use the child link values to further subdivide the instances
into subclasses.
a. If the instances in the subclass satisfy predefined criteria or if the set of remaining attribute choices for
this path of the tree is null, specify the classification for new instances following this decision path.
b. If the subclass does not satisfy the predefined criteria and there is at least one attribute to further
subdivide the path of the tree, let T be the current set of subclass instances and return to step 2.
Information gain
ID3 uses information gain as its attribute selection measure.
Information gain is defined as the difference between the original information requirement (i.e., based on just the
proportion of classes) and the new requirement (i.e., obtained after partitioning on A). That is,
Gain(A) = Info(D) − Info_A(D)
where Info(D) = −Σ p_i log2(p_i) is the entropy of the data set D (p_i being the proportion of tuples in D belonging to
class C_i), and Info_A(D) = Σ_j (|D_j| / |D|) × Info(D_j) is the expected information requirement after partitioning D on
attribute A into partitions D_1, …, D_v.
In other words, Gain(A) tells us how much would be gained by branching on A. It is the expected reduction in the information
requirement caused by knowing the value of A. The attribute A with the highest information gain, Gain(A), is chosen as the
splitting attribute at node N.
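A short sketch that computes Info(D), Info_A(D) and Gain(A) from the formulas above for an invented two-class data set:

```python
# Information gain of a candidate splitting attribute A (illustrative sketch).
from collections import Counter
from math import log2

def info(labels):
    """Entropy Info(D) = -sum(p_i * log2(p_i)) over the class proportions."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gain(rows, attr_index, labels):
    """Gain(A) = Info(D) - Info_A(D) for the attribute at attr_index."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    info_a = sum(len(part) / total * info(part) for part in partitions.values())
    return info(labels) - info_a

# Invented data: attribute 0 = outlook, attribute 1 = windy; class = play.
rows   = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"),
          ("rain", "yes"), ("overcast", "no"), ("overcast", "yes")]
labels = ["no", "no", "yes", "no", "yes", "yes"]

print(f"Gain(outlook) = {gain(rows, 0, labels):.3f}")
print(f"Gain(windy)   = {gain(rows, 1, labels):.3f}")
```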
Decision Tree Rules
General Considerations
Here is a list of a few of the many advantages decision trees have to offer.
Decision trees are easy to understand and map nicely to a set of production rules.
Decision trees make no prior assumptions about the nature of the data.
Decision trees are able to build models with datasets containing numerical as well as categorical data.
As with all data mining algorithms, there are several issues surrounding decision tree usage. Specifically,
• Output attributes must be categorical, and multiple output attributes are not allowed.
• Decision tree algorithms are unstable in that slight variations in the training data can result in different
attribute selections at each choice point within the tree. The effect can be significant as attribute choices
affect all descendent subtrees.
• Trees created from numeric datasets can be quite complex as attribute splits for numeric data are
typically binary.
• Decision tree induction has two main phases:
– Tree construction
– Tree pruning
• To classify an unknown sample, test its attribute values against the decision tree.
• Classification is a classical problem extensively studied by statisticians and machine learning researchers.
• Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed.
– Decision tree methods can use SQL queries for accessing databases and give classification accuracy
comparable with other methods.
Bayesian Classification: Why?
➢ Probabilistic learning: Calculate explicit probabilities for hypothesis, among the most practical
approaches to certain types of learning problems
➢ Incremental: Each training example can incrementally increase/decrease the probability that a
hypothesis is correct. Prior knowledge can be combined with observed data.
➢ Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of
optimal decision making against which other methods can be measured
Bayesian Theorem
➢ Given training data D, posteriori probability of a hypothesis h, P(h|D) follows the Bayes theorem
P(h|D) = P(D|h) P(h) / P(D)
➢ MAP (maximum a posteriori) hypothesis:
h_MAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h)
➢ Practical difficulty: requires initial knowledge of many probabilities, and significant computational cost.
Naïve Bayes Classifier:
➢ Greatly reduces the computation cost: only the class distributions need to be counted, because the attributes
are assumed to be conditionally independent given the class:
o P(x1, …, xk | C) = P(x1 | C) · … · P(xk | C)
o For example, with 14 training tuples of which 5 belong to class n, the prior P(n) = 5/14.
(A minimal counting-based naïve Bayes sketch is given after the list below.)
o Bayesian networks, that combine Bayesian reasoning with causal relationships between
attributes
o Decision trees, that reason on one attribute at the time, considering most important
attributes first
➢ Advantages
➢ Criticism
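A minimal counting-based naïve Bayes sketch under the conditional-independence assumption above; the miniature weather-style data set is invented and no smoothing is applied, so it is only an illustration.

```python
# Naive Bayes by counting: P(C|x) is proportional to P(C) * product of P(x_i|C).
from collections import Counter, defaultdict

# Invented weather data: (outlook, windy) -> play?
data = [(("sunny", "no"), "no"), (("sunny", "yes"), "no"),
        (("overcast", "no"), "yes"), (("rain", "no"), "yes"),
        (("rain", "yes"), "no"), (("overcast", "yes"), "yes"),
        (("sunny", "no"), "yes"), (("rain", "no"), "yes")]

class_counts = Counter(label for _, label in data)
# value_counts[class][attribute index][value] = frequency
value_counts = defaultdict(lambda: defaultdict(Counter))
for features, label in data:
    for i, value in enumerate(features):
        value_counts[label][i][value] += 1

def predict(features):
    scores = {}
    total = sum(class_counts.values())
    for label, c_count in class_counts.items():
        score = c_count / total                               # prior P(C)
        for i, value in enumerate(features):
            score *= value_counts[label][i][value] / c_count  # P(x_i | C)
        scores[label] = score
    return max(scores, key=scores.get), scores

print(predict(("sunny", "no")))
```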
A Neuron:
➢ The n-dimensional input vector x is mapped into variable y by means of the scalar product and a
nonlinear function mapping
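The mapping just described, a scalar product of inputs and weights plus a bias followed by a nonlinear activation, in a few lines of numpy; the weights and inputs are arbitrary illustrative numbers.

```python
# A single artificial neuron: y = f(w . x + b), with a sigmoid nonlinearity.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])    # n-dimensional input vector
w = np.array([0.4, 0.1, -0.6])    # connection weights
b = 0.2                           # bias term

y = sigmoid(np.dot(w, x) + b)     # scalar product + nonlinear mapping
print(y)
```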
Network Training:
o obtain a set of weights that makes almost all the tuples in the training data classified correctly
➢ Steps
▪ Compute the net input to the unit as a linear combination of all the inputs to the unit
Multi-Layer Perceptron
➢ Network pruning
o N input nodes, h hidden nodes and m output nodes lead to h(m+N) weights
o Pruning: Remove some of the links without affecting classification accuracy of the network
o Discretize activation values; replace individual activation value by the cluster average
maintaining the network accuracy
o Enumerate the output from the discretized activation values to find rules between activation
value and output
o Combine the above two to have rules relating the output to input.
Outline
• Performance measurements
Classification
– At what speed?
Classification tasks
• Learning Task
• Classification Task
• Amount of data.
• Use computers – data processing, statistical analysis, try to learn patterns from the data (Machine Learning)
Black box view of Machine Learning
Training data: expression patterns from cancer patients plus expression data from healthy persons.
Model: the model can distinguish between healthy and sick persons and can be used for prediction.
1) k-Nearest-Neighbor Classifiers
The k-nearest-neighbor method was first described in the early 1950s. The method is labor intensive when
given large training sets, and did not gain popularity until the 1960s when increased computing power became
available. It has since been widely used in the area of pattern recognition.
Nearest-neighbor classifiers are based on learning by analogy, that is, by comparing a given test tuple with
training tuples that are similar to it. The training tuples are described by n attributes. Each tuple represents a point
in an n-dimensional space. In this way, all of the training tuples are stored in an n-dimensional pattern space. When
given an unknown tuple, a k-nearest-neighbor classifier searches the pattern space for the k training tuples that are
closest to the unknown tuple. These k training tuples are the k "nearest neighbors" of the unknown tuple.
"Closeness" is defined in terms of a distance metric, such as Euclidean distance. The Euclidean distance
between two points or tuples, say, X1 = (x11, x12, …, x1n) and X2 = (x21, x22, …, x2n), is
dist(X1, X2) = sqrt( (x11 − x21)^2 + (x12 − x22)^2 + … + (x1n − x2n)^2 )
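A compact k-nearest-neighbor sketch using the Euclidean distance above; the training points are invented and classification is by majority vote among the k closest tuples.

```python
# k-nearest-neighbor classification with Euclidean distance (illustrative).
from collections import Counter
from math import dist   # Euclidean distance, Python 3.8+

training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
            ((5.0, 5.2), "B"), ((4.8, 5.5), "B"), ((5.1, 4.9), "B")]

def knn_predict(x, k=3):
    # Sort the training tuples by distance to the unknown tuple x.
    neighbors = sorted(training, key=lambda pair: dist(pair[0], x))[:k]
    # Majority vote among the k nearest neighbors.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_predict((1.1, 0.9)))   # expected "A"
print(knn_predict((5.0, 5.0)))   # expected "B"
```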
2) Case-Based Reasoning
Case-based reasoning (CBR) classifiers use a database of problem solutions to solve new problems. Unlike
nearest-neighbor classifiers, which store training tuples as points in Euclidean space, CBR stores the tuples or
"cases" for problem solving as complex symbolic descriptions. Business applications of CBR include problem
resolution for customer service help desks, where cases describe product-related diagnostic problems. CBR has
also been applied to areas such as engineering and law, where cases are either technical designs or legal rulings,
respectively. Medical education is another area for CBR, where patient case histories and treatments are used to
help diagnose and treat new patients.
When given a new case to classify, a case-based reasoner will first check if an identical training case exists.
If one is found, then the accompanying solution to that case is returned. If no identical case is found, then the case-
based reasoner will search for training cases having components that are similar to those of the new case.
Conceptually, these training cases may be considered as neighbors of the new case. If cases are represented as
graphs, this involves searching for subgraphs that are similar to subgraphs within the new case. The case- based
reasoner tries to combine the solutions of the neighboring training cases in order to propose a solution for the new
case. If incompatibilities arise with the individual solutions, then backtracking to search for other
solutions may be necessary. The case-based reasoner may employ background knowledge and problem-solving
strategies in order to propose a feasible combined solution.
Instance-Based Methods
➢ Instance-based learning:
o Store training examples and delay the processing (“lazy evaluation”) until a new instance
must be classified
➢ Typical approaches
o k-nearest neighbor approach
▪ Instances represented as points in a Euclidean space.
o Locally weighted regression
▪ Constructs local approximation
o Case-based reasoning
▪ Uses symbolic representations and knowledge-based inference
◼ Cluster analysis
◼ Typical applications
◼ Pattern Recognition
◼ Image Processing
◼ WWW
◼ Document classification
◼ Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge
to develop targeted marketing programs
◼ Land use: Identification of areas of similar land use in an earth observation database
◼ Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
◼ City-planning: Identifying groups of houses according to their house type, value, and geographical
location
◼ Earth-quake studies: Observed earth quake epicenters should be clustered along continent faults
◼ The quality of a clustering result depends on both the similarity measure used by the method and its
implementation.
◼ The quality of a clustering method is also measured by its ability to discover some or all of the hidden
patterns.
◼ Scalability
◼ High dimensionality
The process of grouping a set of physical or abstract objects into classes of similar objects is called
clustering. A cluster is a collection of data objects that are similar to one another within the same cluster
and are dissimilar to the objects in other clusters. A cluster of data objects can be treated collectively as one
group and so may be considered as a form of data compression. Although classification is an effective
means for distinguishing groups or classes of objects, it requires the often costly collection and labeling of
a large set of training tuples or patterns, which the classifier uses to model each group. It is often more
desirable to proceed in the reverse direction: First partition the set of data into groups based on data
similarity (e.g., using clustering), and then assign labels to the relatively small number of groups. Additional
advantages of such a clustering-based process are that it is adaptable to changes and helps single out useful
features that distinguish different groups.
Clustering is a challenging field of research in which its potential applications pose their own special
requirements. The following are typical requirements of clustering in data mining: Scalability: Many
clustering algorithms work well on small data sets containing fewer than several hundred data objects;
however, a large database may contain millions of objects. Clustering on a sample of a given large data set
may lead to biased results. Highly scalable clustering algorithms are needed.
Ability to deal with different types of attributes: Many algorithms are designed to cluster interval-based
(numerical) data. However, applications may require clustering other types of data, such as binary,
categorical (nominal), and ordinal data, or mixtures of these data types.
Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters based on
Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find
spherical clusters with similar size and density. However, a cluster could be of any shape. It is important to
develop algorithms that can detect clusters of arbitrary shape.
Minimal requirements for domain knowledge to determine input parameters: Many clustering algorithms
require users to input certain parameters in cluster analysis (such as the number of desired clusters). The
clustering results can be quite sensitive to input parameters. Parameters are often difficult to determine,
especially for data sets containing high-dimensional objects. This not only burdens users, but it also makes
the quality of clustering difficult to control.
Ability to deal with noisy data: Most real-world databases contain outliers or missing, unknown, or
erroneous data. Some clustering algorithms are sensitive to such data and may lead to clusters of poor
quality.
Incremental clustering and insensitivity to the order of input records: Some clustering algorithms
cannot incorporate newly inserted data (i.e., database updates) into existing clustering structures and,
instead, must determine a new clustering from scratch. Some clustering algorithms are sensitive to the order
of input data. That is, given a set of data objects, such an algorithm may return dramatically different
clusterings depending on the order of presentation of the input objects. It is important to develop
incremental clustering algorithms and algorithms that are insensitive to the order of input.
High dimensionality: A database or a data warehouse can contain several dimensions or attributes. Many
clustering algorithms are good at handling low-dimensional data, involving only two to three dimensions.
Human eyes are good at judging the quality of clustering for up to three dimensions. Finding clusters of
data objects in high-dimensional space is challenging, especially considering that such data can be sparse
and highly skewed.
Constraint-based clustering: Real-world applications may need to perform clustering under various kinds
of constraints. Suppose that your job is to choose the locations for a given number of new automatic
banking machines (ATMs) in a city. To decide upon this, you may cluster households while considering
constraints such as the city's rivers and highway networks, and the type and number of customers per
cluster. A challenging task is to find groups of data with good clustering behavior that satisfy specified
constraints.
Interpretability and usability: Users expect clustering results to be interpretable, comprehensible, and
usable. That is, clustering may need to be tied to specific semantic interpretations and applications. It is
important to study how an application goal may influence the selection of clustering features and methods.
Type of data in clustering analysis
Data Structures:
◼ Data matrix (two modes): an n × p matrix (n objects, p variables) in which entry x_if is the value of
variable f for object i:
[ x11 ... x1f ... x1p ]
[ ... ... ... ... ... ]
[ xi1 ... xif ... xip ]
[ ... ... ... ... ... ]
[ xn1 ... xnf ... xnp ]
◼ Dissimilarity matrix (one mode): an n × n table that stores d(i, j), the dissimilarity between objects i and j:
[ 0 ]
[ d(2,1) 0 ]
[ d(3,1) d(3,2) 0 ]
[ : : : ]
[ d(n,1) d(n,2) ... 0 ]
◼ The definitions of distance functions are usually very different for interval-scaled, boolean, categorical,
ordinal and ratio variables.
◼ Weights should be associated with different variables based on applications and data semantics.
◼ Interval-scaled variables:
◼ Binary variables:
Interval-valued variables:
◼ Standardize data
◼ Distances are normally used to measure the similarity or dissimilarity between two data objects.
◼ Minkowski distance:
d(i, j) = ( |xi1 − xj1|^q + |xi2 − xj2|^q + … + |xip − xjp|^q )^(1/q)
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and q is a positive integer.
◼ If q = 1, d is the Manhattan distance:
d(i, j) = |xi1 − xj1| + |xi2 − xj2| + … + |xip − xjp|
◼ If q = 2, d is the Euclidean distance.
◼ Properties
◼ d(i, j) ≥ 0
◼ d(i, i) = 0
◼ d(i, j) = d(j, i)
◼ Also one can use weighted distance, parametric Pearson product moment correlation, or other
dissimilarity measures.
Nominal Variables:
◼ A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
Ordinal Variables:
◼ Replace the value x_if by its rank r_if ∈ {1, …, M_f}, where M_f is the number of ordered states of variable f,
and map the rank onto [0, 1] by
z_if = (r_if − 1) / (M_f − 1)
◼ Partitioning algorithms: Construct various partitions and then evaluate them by some criterion
◼ Hierarchy algorithms: Create a hierarchical decomposition of the set of data (or objects) using some
criterion
◼ Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of the
data to the given model.
◼ Partitioning approach:
◼ Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum
of square errors
◼ Hierarchical approach:
◼ Create a hierarchical decomposition of the set of data (or objects) using some criterion
◼ Density-based approach:
◼ Based on connectivity and density functions
◼ Grid-based approach:
◼ Model-based:
◼ A model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to
the given model
◼ Frequent pattern-based:
◼ User-guided or constraint-based:
◼ Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(Ki,
Kj) = min dist(tip, tjq)
◼ Complete link: largest distance between an element in one cluster and an element in the other, i.e.,
dis(Ki, Kj) = max dist(tip, tjq)
◼ Average: average distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) =
avg dist(tip, tjq)
◼ Centroid: distance between the centroids of two clusters, i.e., dis(Ki, Kj) = dis(Ci, Cj)
◼ Medoid: distance between the medoids of two clusters, i.e., dis(Ki, Kj) = dis(Mi, Mj)
◼ Centroid: the "middle" point of a cluster of N points
Cm = ( Σ_{i=1}^{N} t_ip ) / N
◼ Radius: square root of average distance from any point of the cluster to its centroid
Rm = sqrt( Σ_{i=1}^{N} (t_ip − Cm)^2 / N )
◼ Diameter: square root of average mean squared distance between all pairs of points in the cluster
Dm = sqrt( Σ_{i=1}^{N} Σ_{j=1}^{N} (t_ip − t_jq)^2 / (N(N−1)) )
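The three measures above, computed directly from their definitions for a small invented one-dimensional cluster:

```python
# Centroid, radius, and diameter of a small cluster (illustrative sketch).
from math import sqrt

points = [2.0, 4.0, 4.0, 6.0, 9.0]
N = len(points)

centroid = sum(points) / N
radius = sqrt(sum((p - centroid) ** 2 for p in points) / N)
diameter = sqrt(sum((points[i] - points[j]) ** 2
                    for i in range(N) for j in range(N) if i != j)
                / (N * (N - 1)))

print(f"Cm={centroid:.2f}  Rm={radius:.2f}  Dm={diameter:.2f}")
```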
Partitioning Methods:
◼ Partitioning method: Construct a partition of a database D of n objects into a set of k clusters, such that
some objective is minimized, e.g., the sum of squared distances in k-means:
E = Σ_{m=1}^{k} Σ_{t_mi ∈ K_m} (Cm − t_mi)^2
◼ Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
◼ k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each cluster is
represented by one of the objects in the cluster
◼ Compute seed points as the centroids of the clusters of the current partition (the centroid is the
center, i.e., mean point, of the cluster)
◼ Assign each object to the cluster with the nearest seed point
◼ Comment: Often terminates at a local optimum. The global optimum may be found using techniques
such as deterministic annealing and genetic algorithms. (A minimal k-means sketch is given after the
Weakness list below.)
◼ Weakness
◼ Applicable only when mean is defined, then what about categorical data?
◼ Dissimilarity calculations
◼ Since an object with an extremely large value may substantially distort the distribution of the
data.
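A minimal k-means sketch following the procedure above (assign each object to the nearest centroid, recompute centroids, repeat); the data and k are invented and empty-cluster handling is simplistic.

```python
# Minimal k-means on 1-D data (illustrative sketch; no empty-cluster handling).
data = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0, 25.0, 26.0]
k = 3
centroids = data[:k]                       # naive initial seeds

for _ in range(20):                        # fixed maximum number of iterations
    # Assignment step: each object goes to the cluster with the nearest seed.
    clusters = [[] for _ in range(k)]
    for x in data:
        nearest = min(range(k), key=lambda j: abs(x - centroids[j]))
        clusters[nearest].append(x)
    # Update step: recompute each centroid as the mean of its cluster.
    new_centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    if new_centroids == centroids:         # converged (a local optimum)
        break
    centroids = new_centroids

print(centroids, clusters)
```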
◼ K-Medoids: Instead of taking the mean value of the object in a cluster as a reference point, medoids can
be used, which is the most centrally located object in a cluster.
◼ starts from an initial set of medoids and iteratively replaces one of the medoids by one of
the non-medoids if it improves the total distance of the resulting clustering
◼ PAM works effectively for small data sets, but does not scale well for large data sets
◼ PAM is more robust than k-means in the presence of noise and outliers because a medoid is less
influenced by outliers or other extreme values than a mean
◼ CLARA (Clustering LARge Applications): because PAM works efficiently for small data sets but does not
scale well for large ones, CLARA draws multiple samples of the data set, applies PAM on each sample,
and gives the best clustering as the output
◼ Weakness:
◼ Efficiency depends on the sample size
◼ A good clustering based on samples will not necessarily represent a good clustering of the
whole data set if the sample is biased
◼ The clustering process can be presented as searching a graph where every node is a potential
solution, that is, a set of k medoids
◼ If the local optimum is found, CLARANS starts with new randomly selected node in search for a
new local optimum
◼ Focusing techniques and spatial access structures may further improve its performance (Ester et
al.’95)
Hierarchical Clustering:
◼ Use distance matrix. This method does not require the number of clusters k as an input, but needs
a termination condition
◼ Go on in a non-descending fashion
Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the dendrogram at the desired level, then each
connected component forms a cluster.
◼ Do not scale well: time complexity of at least O(n^2), where n is the total number of objects
◼ BIRCH (1996): uses CF-tree and incrementally adjusts the quality of sub-clusters
BIRCH (1996):
◼ Birch: Balanced Iterative Reducing and Clustering using Hierarchies (Zhang, Ramakrishnan &
Livny, SIGMOD’96)
◼ Phase 1: scan the database to build an initial in-memory CF-tree (a multilevel compression of the
data that tries to preserve its inherent clustering structure)
◼ Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
◼ Scales linearly: finds a good clustering with a single scan and improves the quality with a few
additional scans
◼ Weakness: handles only numeric data, and sensitive to the order of the data record.
CF-Tree in BIRCH:
◼ Clustering feature:
◼ a summary of the statistics for a given subcluster: the 0-th, 1st and 2nd moments of the
subcluster from the statistical point of view, i.e., CF = (N, LS, SS), where N is the number of
points, LS is their linear sum, and SS is their square sum
◼ registers crucial measurements for computing clusters and utilizes storage efficiently
A CF-tree is a height-balanced tree that stores the clustering features for a hierarchical clustering
◼ Major ideas
◼ Not distance-based
◼ Computational complexity:
◼ Experiments
◼ Traditional measures for categorical data may not work well, e.g., Jaccard coefficient
◼ C1. <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d}, {b,
c, e}, {b, d, e}, {c, d, e}
◼ C2. <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
◼ Jaccard coefficient-based similarity function: Sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|
◼ For example, with T1 = {a, b, c} and T2 = {c, d, e}: Sim(T1, T2) = |{c}| / |{a, b, c, d, e}| = 1/5 = 0.2
Link Measure in ROCK:
◼ C1 <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d}, {b,
c, e}, {b, d, e}, {c, d, e}
◼ Two clusters are merged only if the interconnectivity and closeness (proximity) between
two clusters are high relative to the internal interconnectivity of the clusters and
closeness of items within the clusters
◼ A two-phase algorithm
◼ Use a graph partitioning algorithm: cluster objects into a large number of relatively small
sub-clusters
◼ Major features:
◼ Handle noise
◼ One scan
◼ If p is a border point, no points are density-reachable from p and DBSCAN visits the next point
of the database.
◼ Continue the process until all of the points have been processed.
◼ STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997)
◼ Each cell at a high level is partitioned into a number of smaller cells in the next lower
level
◼ Statistical info of each cell is calculated and stored beforehand and is used to answer
queries
◼ Parameters of higher level cells can be easily calculated from parameters of lower level
cell
◼ For each cell in the current level compute the confidence interval
Comments on STING:
◼ When finish examining the current layer, proceed to the next lower level
◼ Advantages:
◼ Disadvantages:
◼ All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is
detected
Model-Based Clustering?:
◼ Attempt to optimize the fit between the given data and some mathematical model
◼ Typical methods
◼ Statistical approach
◼ COBWEB, CLASSIT
Conceptual Clustering:
◼ Conceptual clustering
◼ Each node refers to a concept and contains a probabilistic description of that concept
◼ New objects are distributed to the cluster whose exemplar is the most similar according to
some distance measure
◼ Typical methods
◼ Competitive learning
◼ Partition the data space and find the number of points that lie inside each cell of the partition.
◼ Identify the subspaces that contain clusters using the Apriori principle
◼ Identify clusters
◼ Determine maximal regions that cover a cluster of connected dense units for each cluster
◼ CLIQUE, ProClus
◼ Typical methods
◼ Less parameters but more user-desired constraints, e.g., an ATM allocation problem: obstacle &
desired clusters
◼ User-specified constraints
◼ Example: Locating k delivery centers, each serving at least m valued customers and n ordinary
ones
◼ Proposed approach
◼ Find an initial “solution” by partitioning the data set into k groups and satisfying user-
constraints
◼ Iteratively refine the solution by micro-clustering relocation (e.g., moving δ μ-clusters
from cluster Ci to Cj) and “deadlock” handling (break the microclusters when necessary)
◼ E.g., having approximately same number of valued customers in each cluster?! — Can
you solve it?
◼ Outliers: a set of objects that are considerably dissimilar from the remainder of the data
◼ Applications:
◼ Customer segmentation
◼ Medical analysis
Assume a model of the underlying distribution that generates the data set (e.g., a normal distribution)
◼ data distribution
◼ Drawbacks
◼ Index-based algorithm
◼ Nested-loop algorithm
◼ Cell-based algorithm
◼ Ex. C1 contains 400 loosely distributed points, C2 has 100 tightly condensed points, 2 outlier
points o1, o2
◼ simulates the way in which humans can distinguish unusual objects from among a series
of supposedly like objects
Summary:
◼ Cluster analysis groups objects based on their similarity and has wide applications
◼ Outlier detection and analysis are very useful for fraud detection, etc. and can be performed by
statistical, distance-based or deviation-based approaches
◼ Current clustering techniques do not address all the requirements adequately, still an active area
of research