DM Notes
PART – A
UNIT – 1 6 Hours
Data Warehousing: Introduction, Operational Data Stores (ODS), Extraction Transformation Loading (ETL), Data
Warehouses. Design Issues, Guidelines for Data Warehouse Implementation, Data Warehouse Metadata
UNIT – 2 6 Hours
Online Analytical Processing (OLAP): Introduction, Characteristics of OLAP systems, Multidimensional view and Data
cube, Data Cube Implementations, Data Cube operations, Implementation of OLAP and overview on OLAP Softwares.
UNIT – 3 6 Hours
Data Mining: Introduction, Challenges, Data Mining Tasks, Types of Data, Data Preprocessing, Measures of Similarity and
Dissimilarity, Data Mining Applications
UNIT – 4 8 Hours
Association Analysis: Basic Concepts and Algorithms: Frequent Itemset Generation, Rule Generation, Compact
Representation of Frequent Itemsets, Alternative methods for generating Frequent Itemsets, FP Growth Algorithm, Evaluation
of Association Patterns
PART - B
UNIT – 5 6 Hours
Classification -1 : Basics, General approach to solve classification problem, Decision Trees, Rule Based Classifiers, Nearest
Neighbor Classifiers.
UNIT – 6 6 Hours
Classification - 2 : Bayesian Classifiers, Estimating Predictive accuracy of classification methods, Improving accuracy of
classification methods, Evaluation criteria for classification methods, Multiclass Problem.
UNIT – 7 8 Hours
Clustering Techniques: Overview, Features of cluster analysis, Types of Data and Computing Distance, Types of Cluster
Analysis Methods, Partitional Methods, Hierarchical Methods, Density Based Methods, Quality and Validity of Cluster
Analysis
UNIT – 8 6 Hours
Web Mining: Introduction, Web content mining, Text Mining, Unstructured Text, Text clustering, Mining Spatial and
Temporal Databases.
Text Books:
1. Pang-Ning Tan, Michael Steinbach, Vipin Kumar: Introduction to Data Mining, Addison-Wesley, 2005.
2. G. K. Gupta: Introduction to Data Mining with Case Studies, 3rd Edition, PHI, New Delhi, 2009.
INTRODUCTION
• A large company might have the following systems:
→Human resources(HR) →Financials →Billing
→Sales leads →Web sales →Customer support
Such systems are called OLTP systems (OnLine Transaction Processing).
• The systems are mostly relational database systems designed for transaction processing.
• The performance of OLTP systems is usually very important,
since such systems are used to support users(i.e. staff) who provide service to customers.
• The systems must be able to deal with
→ insert & update operations and
→ quickly answering queries
• The systems are not designed to handle management-queries efficiently (since management-
queries are often complex, requiring many joins & aggregations).
• Focus of operational-managers is on improving business-processes across various business-units
(for example: customer support, inventory, and marketing). To achieve this, they require:
→ a single sign-on path to the enterprise-information
→ a single version of the enterprise-information
→ a high level of data accuracy
→ a user-friendly interface to the information
→ easy sharing of data across business-units of enterprise
• Following are some solutions to meet the needs of management-staff in an enterprise:
1) Managers pose queries of interest to a mediator-system that
→ decomposes each query into appropriate subqueries for the systems
→ obtains results from those systems and
→ then combines & presents the result to the user
This is called lazy (or on-demand) query processing.
Advantage: The user is provided with up-to-date information.
Disadvantage: The management-queries may generate a heavy load on some OLTP
systems which may degrade the performance of the systems.
2) Collect the most common queries that the managers ask and
then run them regularly and
finally have the results available when a manager poses one of those queries.
This approach is called the eager query processing.
Advantage: Provides quick response.
Disadvantage: The information may not be up-to-date.
3) This approach
→ involves creation of a separate database that only stores information that is of
interest to the management-staff and
→ involves the following 2 steps:
i) The information needs of management-staff are analyzed and the systems
that store some or all of the information are identified.
ii) The new database is then used to answer management-queries and the
OLTP systems are not accessed for such queries.
ODS
• ODS stands for Operational Data Store.
• This is defined as a subject-oriented, integrated, volatile, current-valued data store, containing
only corporate-detailed data.
→ ODS is subject-oriented i.e. it is organized around the major data-subjects of an enterprise
→ ODS is integrated i.e. it is a collection of data from a variety of systems to provide an
enterprise-wide view of the data
→ ODS is volatile i.e. data in the ODS changes frequently as new information refreshes ODS
→ ODS is current-valued i.e. it is up-to-date and reflects the current status of the information
→ ODS is detailed i.e. it is detailed enough to serve needs of management-staff in enterprise
• Benefits of ODS to an enterprise:
1) An ODS is the unified-operational view of enterprise. It provides operational-managers
improved access to important operational-data. This view may assist in better understanding
of business & customer.
2) The ODS can be much more effective in generating current reports without having to
access the OLTP.
3) The ODS can shorten the time required to implement & populate a data warehouse
system.
• An ODS may be of following types:
1) An ODS that is essentially a reporting-tool for administrative purposes may be fairly
simple. Such an ODS is usually updated daily. (It provides reports about business
transactions for that day, such as sales totals or orders filled).
2) An ODS may be designed to track more complex information such as product & location
codes. Such an ODS is usually updated hourly.
3) An ODS may be designed to support CRM (Customer Relationship Management).
WHY A SEPARATE DATABASE?
• Q: Why an ODS should be separate from the operational-databases?
Ans: Because from time to time, complex queries are likely to degrade performance of OLTP
systems. The OLTP systems have to provide a quick response to operational-users as businesses
cannot afford to have response time suffer when a manager is running a complex query.
ZLE
• ZLE stands for Zero Latency Enterprise.
• This is used for near real-time integration of operational-data so that there is no significant delay
in getting information from one system to another system in an enterprise.
• The heart of a ZLE system is an ODS.
• A ZLE data store(ZDS) is something like an ODS that is integrated & up-to-date.
• Aim of a ZLE data store is to:
→ allow management a single view of enterprise-information by bringing together relevant data
in real-time &
→ provide management a "360-degree" view of the customer
• Characteristics of ZLE:
→ It has a unified view of the enterprise operational-data.
→ It has a high level of availability and it involves online refreshing of information.
• Since a ZLE needs to support a large number of concurrent users (for example call centre users), a
fast turnaround time for transactions and 24/7 availability is required.
• Q: How does a ZLE data store fit into enterprise-information architecture?
Ans: Either a ZLE data store can be used almost like an ODS that is refreshed in real-time or
a ZLE can take over some of the roles of the data-warehouse.
ETL
• The ETL process involves extracting, transforming and loading data from multiple source-systems.
• The process is complex and tedious, and may require significant resources to implement.
• Different data-sources tend to have
→ different conventions for coding information &
→ different standards for the quality of information
• Building an ODS requires data filtering, data cleaning and integration.
• Data-errors at least partly arise because of unmotivated data-entry staff.
• Successful implementation of an ETL system involves resolving following issues:
1) What are the source systems? These systems may include relational database systems,
legacy systems.
2) To what extent are the source systems and the target system interoperable? The more
different the sources and target, the more complex the ETL process.
3) What ETL technology will be used?
4) How big is the ODS likely to be at the beginning and in the long term? Database systems
tend to grow with time. Consideration may have to be given to whether some of the data
→ from the ODS will be archived regularly as the data becomes old and
→ is no longer needed in the ODS
5) How frequently will the ODS be refreshed or reloaded?
6) How will the quality and integrity of the data be monitored? Data cleaning will often be
required to deal with issues like missing values, data formats, code values, primary keys
and referential integrity.
7) How will a log be maintained? A dispute may arise about the origin of some data. It is
therefore necessary to be able to not only log which information came from where but also
when the information was last updated.
8) How will recovery take place?
9) Would the extraction process only copy data from the source systems and not delete the
original data?
10) How will the transformation be carried out? Transformation could be done in either
source OLTP system, ODS or staging area.
11) How will data be copied from non-relational legacy systems that are still operational?
ETL FUNCTIONS
• The ETL process consists of
→ data extraction from source systems
→ data transformation which includes data cleaning and
→ data loading in the ODS or the data warehouse
• Data cleaning deals with detecting & removing errors/inconsistencies from the data, in particular
the data that is sourced from a variety of computer systems.
• Building an integrated database from a number of source-systems may involve solving some of the
following problems:
Instance Identity Problem
• The same customer may be represented slightly differently in different source-
systems.
• There is a possibility of mismatching between the different systems that needs to be
identified & corrected.
• Checking for homonyms & synonyms is necessary.
• Achieving very high consistency in names & addresses requires a huge amount of
resources.
Data Errors
• Following are different types of data errors
→ data may have some missing attribute values
→ there may be duplicate records
→ there may be wrong aggregations
→ there may be inconsistent use of nulls, spaces and empty values
→ some attribute values may be inconsistent(i.e. outside their domain)
→ there may be non-unique identifiers
Record Linkage Problem
• This deals with problem of linking information from different databases that relates
to the same customer.
• Record linkage can involve a large number of record comparisons to ensure linkages
that have a high level of accuracy.
Semantic Integration Problem
• This deals with integration of information found in heterogeneous OLTP & legacy
sources.
→ Some of the sources may be relational.
→ Some may be even in text documents
→ Some data may be character strings while others may be integers.
Data Integrity Problem
• This deals with issues like
→ referential integrity → null values → domain of values
• Data cleaning should be based on the following 5 steps:
1) Parsing
• This involves
→ identifying various components of the source data-files and
→ then establishing relationships between those and the fields in the target
files.
2) Correcting
• Correcting the identified components is usually based on a variety of sophisticated
techniques including mathematical algorithms.
3) Standardizing
• Business rules of enterprise may be used to transform the data to standard form.
4) Matching
• Much of the data extracted from a number of source-systems is likely to be related.
Such data needs to be matched.
5) Consolidating
• All corrected, standardized and matched data can now be consolidated to build a
single version of the enterprise-data.
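• A minimal sketch of how these cleaning steps might look in practice (assuming pandas is available; the table and column names below are hypothetical, and the surname-based match key is only an illustration, not a real record-linkage method):

import pandas as pd

# Hypothetical customer records extracted from two source systems
src = pd.DataFrame({
    "cust_name": ["J. Smith ", "john smith", "A. Rao", None],
    "city":      ["new york", "New York", "mumbai", "Mumbai"],
    "spend":     [120.0, 80.0, None, 55.0],
})

# Correcting: fill missing measures and drop records with no usable name
src["spend"] = src["spend"].fillna(0.0)
src = src.dropna(subset=["cust_name"])

# Standardizing: apply simple business rules to bring values to a standard form
src["cust_name"] = src["cust_name"].str.strip().str.title()
src["city"] = src["city"].str.title()

# Matching: build a naive match key (here, the surname) to link records that
# probably refer to the same customer across sources
src["match_key"] = src["cust_name"].str.split().str[-1].str.lower()

# Consolidating: merge matched records into a single version of the data
consolidated = src.groupby("match_key", as_index=False).agg(
    cust_name=("cust_name", "first"),
    city=("city", "first"),
    spend=("spend", "sum"),
)
print(consolidated)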
DATA WAREHOUSE
• This is a subject-oriented, integrated, time-variant and non-volatile repository of information in
support of management's decision making process.
• In other words, it is a process of integrating enterprise-wide data, originating from a variety of
sources, into a single repository.
• ODS is a current-valued data store
while a DW is a time-variant repository of data (Table: 7.6).
• Benefits of implementing a data warehouse:
1) To provide a single version of truth about enterprise information
2) To speed up adhoc reports and queries that involves aggregations across many attributes
3) To provide a system in which managers who do not have a strong technical background
are able to run complex queries
4) To provide a database that stores relatively clean data
5) To provide a DB that stores historical data that may have been deleted from OLTP systems
• Q: Why a data warehouse must be separate from OLTP systems?
Ans: Complex queries involving number of aggregations are likely to degrade performance of
systems.
• A smaller data warehouse is called a data mart (Figure: 7.3).
• A data-mart stores information for a limited number of subject-areas.
• The data-mart approach is attractive since beginning with a single data-mart is relatively
inexpensive and easier to implement.
DATA WAREHOUSE DESIGN
• A dimension (essentially an attribute) is an ordinate within a multidimensional structure consisting
of a list of ordered values (called members).
• A member is a distinct value (or name or identifier) for the dimension.
• Values that depend on the dimensions are called measures.
• The fact table is a collection of related data items, consisting of values of dimensions of interest
and the value of measure (Figure: 7.8).
• A data warehouse model consists of
→ a central fact table &
→ a set of surrounding dimensions tables on which the facts depend. Such a model is called a
star scheme (Figure: 7.6).
• The characteristic of a star scheme is that all the dimensions directly link to the fact table.
• Star schemas may be refined into snowflake schemas if we wish to provide support for dimension
hierarchies by allowing dimension tables to have subtables to represent the hierarchies (Fig: 7.7).
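• A small illustrative sketch of a star scheme query (all table and column names are hypothetical; pandas is used here as a stand-in for a relational engine): the fact table holds one column per dimension plus the measure, and is joined to the dimension tables to answer an aggregate query.

import pandas as pd

# Dimension tables (hypothetical)
dim_degree = pd.DataFrame({"degree_id": [1, 2], "degree_name": ["BSc", "BCom"]})
dim_country = pd.DataFrame({"country_id": [10, 20], "country": ["India", "UK"]})

# Fact table: one column per dimension and one column for the measure
fact = pd.DataFrame({
    "degree_id":  [1, 1, 2, 2],
    "country_id": [10, 20, 10, 20],
    "students":   [30, 5, 12, 7],
})

# A typical star-schema query: total students per degree name
result = (fact.merge(dim_degree, on="degree_id")
               .merge(dim_country, on="country_id")
               .groupby("degree_name", as_index=False)["students"].sum())
print(result)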
DW IMPLEMENTATION STEPS
1) Requirements Analysis & Capacity Planning
• The first step involves
→ defining enterprise needs
→ defining architecture
→ carrying out capacity planning and
→ selecting the hardware and software tools
• This step also involves consulting
→ with senior management &
→ with the various stakeholders
2) Hardware Integration
• Both hardware and software need to be put together by integrating
→ servers
→ storage devices &
→ client software tools
3) Modeling
• This involves designing the warehouse schema and views. This may involve using a
modeling tool if the data warehouse is complex.
4) Physical Modeling
• This involves designing
→ data warehouse organization
→ data placement
→ data partitioning
→ deciding on access methods and indexing
5) Sources
• This involves identifying and connecting the sources using gateways.
6) ETL
• This involves
→identifying a suitable ETL tool vendor and
→purchasing and implementing the tool
• This may include customizing the tool to suit the needs of the enterprise.
7) Populate DW
• This involves testing the required ETL tools, perhaps using a staging area. Then,
ETL tools may be used in populating the warehouse.
8) User Applications
• This involves designing and implementing applications required by the end users.
9) Roll-out the DW and Applications
DW METADATA
• Metadata is data about data or documentation about the data that is needed by the users.
• This is not the actual data warehouse, but answers the "who, what, where, when, why and how"
questions about the data warehouse.
• This describes the characteristics of a resource.
• Metadata may be classified into two groups: back room metadata and front room metadata.
1) Much important information is included in the back room metadata that is process related,
for example, the ETL processes.
Furthermore, this could include
→ data extraction, cleaning and loading specifications
→ dates on which the ETL process was carried out and
→ a list of data that were loaded
2) The front room metadata is more descriptive and could include information needed by the
users, for example,
→ user security privileges
→ various statistics about usage
DW IMPLEMENTATION GUIDELINES
Build Incrementally
• Firstly, a data mart may be built with one particular project in mind.
• Then, a number of other sections of enterprise may be implemented in a similar
manner.
• An enterprise data warehouse can then be implemented in an iterative manner,
allowing all data marts to extract information from the data warehouse.
Need a Champion
• The project must have a champion who is willing to carry out considerable research
into expected costs and benefits of the project.
• The projects require inputs from many units in an enterprise and therefore need to
be driven by someone who is capable of interacting with people in the enterprise.
Senior Management Support
• Given the resource intensive nature of such projects and the time they can take to
implement, the project calls for a sustained commitment from senior management.
Ensure Quality
• Only data that has been cleaned and is of acceptable quality should be loaded into the data
warehouse.
Corporate Strategy
• The project must fit with corporate strategy and business objectives.
Business Plan
• Financial costs, expected benefits & a project plan must be clearly outlined and
understood by all stakeholders.
Training
• The users must be trained
→ to use the warehouse and
→ to understand its capabilities
Adaptability
• Project should build in adaptability so that changes may be made to DW if & when
required.
Joint Management
• The project must be managed by both IT and business professionals in the
enterprise.
EXERCISES
1) Explain ODS along with its structure. (6)
2) Explain ETL. List out the issues that need to be resolved for successful
implementation of an ETL system. (6)
3) What is data cleaning? Explain the 5 steps in data cleaning. (4)
4) Explain the problems that need to be solved for building an integrated database
from a number of source systems. (6)
5) Write a short note on
i) Data warehouse (4)
ii) ZLE (4)
6) Compare the following:
i) Lazy query vs. eager query processing (4)
ii) OLTP vs. DW (6)
iii) ODS vs. DW (6)
7) Explain ODS & DW architecture along with its structure. (4)
8) What are star & snowflake schemas? (2)
9) Explain the steps for DW implementation. (6)
10) Explain the guidelines for DW implementation. (6)
11) Write a short note on DW metadata. (4)
INTRODUCTION
• A dimension is an attribute within a multidimensional model consisting of a list of values (called
members).
• A fact is defined by a combination of dimension values for which a non-null value exists.
• The non-null values of facts are the numerical values stored in each data cube cell. They are called
measures.
• A data cube computes aggregates over all subsets of dimensions specified in the cube.
OLAP
• OLAP stands for Online Analytical Processing.
• This is primarily a software-technology concerned with fast analysis of enterprise-information.
• In other words, OLAP is the dynamic enterprise analysis required to create, manipulate, animate &
synthesize information from exegetical, contemplative and formulaic data analysis models.
• Business-Intelligence(BI) is used to mean both data-warehousing and OLAP.
• In other words, BI is defined as a user-centered process of
→ exploring data, data-relationships and trends
→ thereby helping to improve overall decision-making.
CHARACTERISTICS OF OLAP SYSTEMS
1) Users
• OLTP systems are designed for office-workers (100-1000 users).
Whereas, OLAP systems are designed for decision-makers(few business-users).
2) Functions
• OLTP systems are mission-critical. They support an enterprise's day-to-day operations.
They are mostly performance-driven.
Whereas, OLAP systems are management-critical. They support an enterprise's decision-
functions using analytical-investigations.
3) Nature
• OLTP systems are designed to process one record at a time, for ex, a record related to the
customer.
Whereas, OLAP systems involve queries that deal with many records at a time and
provide aggregate data to a manager.
4) Design
• OLTP systems are designed to be application-oriented.
Whereas, OLAP systems are designed to be subject-oriented.
• OLTP systems view the enterprise-data as a collection of tables (based on ER model).
Whereas, OLAP systems view enterprise-information as multidimensional model.
5) Data
• OLTP systems normally deal only with the current-status of information. The old information
may have been archived and may not be accessible online.
Whereas, OLAP systems require historical-data over several years.
6) Kind of use
• OLTP systems are used for read & write operations.
Whereas, OLAP systems normally do not update the data.
FASMI CHARACTERISTICS OF OLAP SYSTEMS
Fast
• Most queries should be answered very quickly, perhaps within seconds.
• The performance of the system has to be like that of a search-engine.
• The data-structures must be efficient.
• The hardware must be powerful enough for
→ amount of data &
→ number of users
• One approach can be
→ pre-compute the most commonly queried aggregates and
→ compute the remaining aggregates on-the-fly
Analytic
• The system must provide rich analytic-functionality.
• Most queries should be answered without any programming.
• The system should be able to cope with any relevant queries for application & user.
Shared
• The system is
→ likely to be accessed initially by only a few business-analysts and
→ may later be used by thousands of users
• Being a shared system, the OLAP software should provide adequate security for
confidentiality & integrity.
• Concurrency-control is obviously required if users are writing or updating data in the
database.
Multidimensional
• This is the basic requirement.
• OLAP software must provide a multidimensional conceptual view of the data.
• A dimension often has hierarchies that show parent/child relationships between the
members of dimensions. The multidimensional structure should allow such hierarchies.
Information
• The system should be able to handle a large amount of input-data.
• The capacity of system to handle information and its integration with the data
warehouse may be critical.
Treatment of Missing Values
• The system should ignore all missing-values regardless of their source.
Generic Dimensionality
• The system should treat each dimension as equivalent in both its structure and
operational capabilities.
Unlimited Dimensions and Aggregation Levels
• The system should allow unlimited dimensions and aggregation levels.
• In practice, number of dimensions is rarely more than 10 and
number of hierarchies rarely more than 6.
• We illustrate the multidimensional view of data using an example of a simple OLTP database
consisting of the three tables:
student(Student_id, Student_name, Country, DOB, Address)
enrolment(Student_id, Degree_id, SSemester)
degree(Degree_id, Degree_name, Degree_length, Fee, Department)
• It is clear that the information given in Tables 8.3, 8.4 and 8.5, although suitable for a student
enrolment OLTP system, is not suitable for efficient management decision making.
• The managers do not need information about the individual students, the degree they are enrolled
in, and the semester they joined the university.
• What the managers need are the trends in student numbers in different degree programs and from
different countries.
• We first consider only two dimensions. Let us say we are primarily interested in finding out how
many students from each country came to do a particular degree (Table: 8.6). Therefore we may
visualize the data as two-dimensional, i.e., Country x Degree
• Using this two-dimensional view we are able to find the number of students joining any degree
from any country (only for semester 2000-01). Other queries that we are quickly able to answer are:
→How many students started BIT in 2000-01?
→How many students joined from Singapore in 2000-01?
• Tables 8.6, 8.7 and 8.8 together now form a three-dimensional cube. The table 8.7 provides totals
for the two semesters and we are able to “drill-down” to find numbers in individual semesters.
• Putting the three tables together gives a cube of 8 x 6 x 3 (= 144) cells including the totals along
every dimension.
• A cube could be represented by:
Country x Degree x Semester
• In the three-dimensional cube, the following eight types of aggregations or queries are
possible:
1. null (e.g. how many students are there? Only 1 possible query)
2. degrees (e.g. how many students are doing BSc? 5 possible queries if we assume 5
different degrees)
3. semester (e.g. how many students entered in semester 2000-01? 2 possible queries if we
only have data about 2 semesters)
4. country (e.g. how many students are from the USA? 7 possible queries if there are 7
countries)
5. degrees, semester (e.g. how many students entered in 2000-01 to enroll in BCom? With 5
degrees and 2 different semesters 10 queries)
6. semester, country (e.g. how many students from the UK entered in 2000-01? With 7
countries and 2 different semesters 14 queries)
7. degrees, country (e.g. how many students from Singapore are enrolled in BCom? 7*5=35
queries)
8. all (e.g. how many students from Malaysia entered in 2000-01 to enroll in BCom?
7*5*2=70 queries)
• Each cell in the cube represents a measure or aggregation.
• 2^n types of aggregations are possible for n dimensions.
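• The 2^n grouping combinations can be enumerated directly. A minimal sketch (pandas assumed; the data and column names are made up) that computes every aggregation of a cube with n = 3 dimensions:

from itertools import combinations
import pandas as pd

# Hypothetical enrolment facts: one row per (country, degree, semester) cell
df = pd.DataFrame({
    "country":  ["UK", "UK", "USA", "USA"],
    "degree":   ["BSc", "BCom", "BSc", "BSc"],
    "semester": ["2000-01", "2000-02", "2000-01", "2000-02"],
    "students": [3, 5, 2, 4],
})

dims = ["country", "degree", "semester"]

# Enumerate all 2^n = 8 subsets of the dimensions and aggregate the measure
for k in range(len(dims) + 1):
    for subset in combinations(dims, k):
        if subset:
            agg = df.groupby(list(subset))["students"].sum()
        else:
            agg = df["students"].sum()   # the "null" (grand total) aggregation
        print(subset, "->")
        print(agg, "\n")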
DATA CUBE IMPLEMENTATION
Pre-compute and store all
• Millions of aggregates need to be computed and stored.
• This is the best solution as far as query response-time is concerned.
• This solution is impractical for a large data-cube,
since resources required to compute & store the aggregates will be
prohibitively large
• Indexing large amounts of data is also expensive.
Pre-compute(and store) none
• The aggregates are computed on-the-fly using the raw data whenever a query is
posed.
• This approach does not require additional space for storing the cube.
• The query response-time is likely to be very poor for large data-cubes.
Pre-compute and store some
• We pre-compute and store the most frequently queried aggregates and compute
others as the need arises.
• We may also be able to derive some of the remaining aggregates using the
aggregates that have already been computed (Figure 8.3). .
• The more aggregates we are able to pre-compute, the better the query performance
• Data-cube products use different techniques for pre-computing aggregates and
storing them. They are generally based on one of two implementation models.
i) ROLAP (relational OLAP)
ii) MOLAP (multidimensional OLAP)
ROLAP
• This uses relational or extended-relational DBMS to store and manage warehouse data.
• This may be considered a bottom-up approach to OLAP.
• This is typically based on using a data warehouse that has been designed using a star scheme.
• The data warehouse provides the multidimensional capabilities by representing data in fact-table
and dimension-tables.
• The fact-table contains one column for each dimension and one column for each measure and
every row of the table provides one fact.
• A fact is then represented as a row of the fact table, with the last column giving the value of the
measure (for example, 30). An OLAP tool is then provided to manipulate the data in these data
warehouse tables.
• This tool essentially groups the fact-table to find aggregates and uses some of the aggregates
already computed to find new aggregates.
• Advantages:
→ This is more easily used with existing relational DBMS and
→ The data can be stored efficiently using tables.
→ Greater scalability
• Disadvantage:
→ Poor query performance
• Some products in this category are Oracle OLAP mode and OLAP Discoverer.
MOLAP
• This is based on using a multidimensional DBMS rather than a data-warehouse to store and access
data.
• This may be considered as a top-down approach to OLAP.
• This does not have a standard approach to storing and maintaining their data.
• This often uses special-purpose file systems or indexes that store pre-computation of all
aggregations in the cube.
• Advantages:
→ Implementation is usually exceptionally efficient
→ Easier to use and therefore may be more suitable for inexperienced users
→ Fast indexing to pre-computed summarized-data
• Disadvantages:
→ More expensive than ROLAP
→ Data is not always current
→ Difficult to scale a MOLAP system for very large OLAP problems
→ Storage utilization may be low if the data-set is sparse
• Some MOLAP products are Hyperion Essbase and Applix iTM1.
DATA CUBE OPERATIONS
• A number of operations may be applied to data cubes. The common ones are:
→ roll-up
→ drill-down
→ pivot or rotate
→ slice & dice
ROLL-UP
• This is like zooming out on the data cube.
• This is required when the user needs further abstraction or less detail.
• Initially the concept hierarchy was "street < city < province < country".
• On rolling up the data is aggregated by ascending the location hierarchy from the level of city to
level of country.
• The data is then grouped into countries rather than cities.
DRILL DOWN
• This is like zooming in on the data and is therefore the reverse of roll-up.
• This is an appropriate operation when the user needs further details or when the user wants to
partition more finely or wants to focus on some particular values of certain dimensions.
• This adds more details to the data.
• Initially the concept hierarchy was "day < month < quarter < year."
• On drilling down, the time dimension is descended from the level of quarter to the level of month.
• When the drill-down operation is performed, one or more dimensions may also be added to the
data cube view.
PIVOT OR ROTATE
• This is used when the user wishes to re-orient the view of the data cube.
• This may involve
→ swapping the rows and columns, or
→ moving one of the row dimensions into the column dimension
• In this, the item and location axes in 2-D slice are rotated.
SLICE & DICE
• These are operations for browsing the data in the cube. The terms refer to the ability to look at
information from different viewpoints.
• A slice is a subset of cube corresponding to a single value for 1 or more members of dimensions.
• The slice operation is performed for the dimension time using the criterion time ="Q1".
• The dice operation is similar to slice but dicing does not involve reducing number of dimensions.
• A dice is obtained by performing a selection on two or more dimensions.
• The dice operation on cube based on the following selection criteria that involve three dimensions.
(location = "Toronto" or "Vancouver")
(time = "Q1" or "Q2")
(item =" Mobile" or "Modem").
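• A brief sketch of roll-up, slice and dice on a small cube represented as a table of cells (pandas assumed; locations, items and sales figures are made up):

import pandas as pd

# Cube cells: (location, item, time) -> sales
cube = pd.DataFrame({
    "location": ["Toronto", "Toronto", "Vancouver", "Delhi"],
    "item":     ["Mobile", "Modem", "Mobile", "Modem"],
    "time":     ["Q1", "Q2", "Q1", "Q1"],
    "sales":    [100, 60, 80, 40],
})

# Roll-up: aggregate away the item dimension (less detail)
rollup = cube.groupby(["location", "time"], as_index=False)["sales"].sum()

# Slice: a single value for one dimension, time = "Q1"
slice_q1 = cube[cube["time"] == "Q1"]

# Dice: a selection on two or more dimensions
dice = cube[cube["location"].isin(["Toronto", "Vancouver"])
            & cube["time"].isin(["Q1", "Q2"])
            & cube["item"].isin(["Mobile", "Modem"])]

print(rollup, slice_q1, dice, sep="\n\n")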
OLAP SOFTWARE
• SQL Server 2000 Analysis Services from Microsoft. This is the successor to the OLAP Services
component of SQL Server 7.0.
• BI2M(Business Intelligence to Marketing and Management) from B&M Service has 3 modules, one
of which is for OLAP. The OLAP module allows database exploring including slice and dice, roll-up,
drill-down and displays results as 2D Charts, 3D charts and tables.
• Business Objects OLAP Intelligence from BusinessObjects allows access to OLAP servers from
Microsoft, Hyperion, IBM and SAP. BusinessObjects also has widely used Crystal Analysis and
Reports.
• Usual operations like slice and dice, and drill directly on multidimensional sources are possible.
• ContourCube from Contour Components is an OLAP product that enables users to slice and dice,
roll up, drill down and pivot efficiently.
• DB2 Cube Views from IBM includes features and functions for managing and deploying
multidimensional data.
• Express and the Oracle OLAP option. Express is a multidimensional database and application
development environment for building OLAP applications.
EXERCISES
1) What is OLAP? Explain the motivations for using OLAP. (4)
2) Compare the following:
i) OLTP vs. OLAP (6) ii) ROLAP vs. MOLAP (4)
3) Explain the FASMI characteristics of OLAP systems. (6)
4) Explain the Codd's OLAP characteristics. (6)
5) Explain the 3 methods for data cube implementation. (6)
6) Explain various operations on data cube. (6)
7) Explain the guidelines for OLAP implementation. (6)
• The input-data is stored in various formats such as flat files, spread sheet or relational tables.
• Purpose of preprocessing: to transform the raw input-data into an appropriate format for
subsequent analysis.
• The steps involved in data-preprocessing include
→ combine data from multiple sources
→ clean data to remove noise & duplicate observations, and
→ select records & features that are relevant to the DM task at hand
• Data-preprocessing is perhaps the most time-consuming step in the overall knowledge discovery
process.
• “Closing the loop" refers to the process of integrating DM results into decision support systems.
• Such integration requires a postprocessing step. This step ensures that only valid and useful
results are incorporated into the decision support system.
• An example of postprocessing is visualization.
Visualization can be used to explore data and DM results from a variety of viewpoints.
• Statistical measures can also be applied during postprocessing to eliminate bogus DM results.
DATA MINING TASKS
• DM tasks are generally divided into 2 major categories.
Predictive Tasks
• The objective is to predict the value of a particular attribute based on the values of other
attributes.
• The attribute to be predicted is commonly known as the target or dependent variable,
while the attributes used for making the prediction are known as the explanatory or
independent variables.
Descriptive Tasks
• The objective is to derive patterns (correlations, trends, clusters, trajectories and
anomalies) that summarize the relationships in data.
• Descriptive DM tasks are often exploratory in nature and frequently require postprocessing
techniques to validate and explain the results.
3) Association Analysis
• This is used to discover patterns that describe strongly associated features in the data.
• The goal is to extract the most interesting patterns in an efficient manner.
• Useful applications include
→ finding groups of genes that have related functionality or
→ identifying web pages that are accessed together
• Ex: market based analysis
We may discover the rule that {diapers} -> {Milk}, which suggests that customers who buy
diapers also tend to buy milk.
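• A tiny sketch (made-up transactions) of how the strength of a rule such as {diapers} -> {milk} is usually measured, via its support and confidence:

# Hypothetical market-basket transactions
transactions = [
    {"diapers", "milk", "bread"},
    {"diapers", "milk"},
    {"bread", "butter"},
    {"diapers", "beer"},
]

antecedent, consequent = {"diapers"}, {"milk"}

n = len(transactions)
n_ante = sum(1 for t in transactions if antecedent <= t)
n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)

support = n_both / n            # fraction of transactions containing both sides
confidence = n_both / n_ante    # how often the rule is found to be true

print(f"support={support:.2f}, confidence={confidence:.2f}")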
4) Anomaly Detection
• This is the task of identifying observations whose characteristics are significantly different
from the rest of the data. Such observations are known as anomalies or outliers.
• The goal is
→ to discover the real anomalies and
→ to avoid falsely labeling normal objects as anomalous.
• Applications include the detection of fraud, network intrusions, and unusual patterns of
disease.
Example 1.4 (Credit Card Fraud Detection).
• A credit card company records the transactions made by every credit card holder,
along with personal information such as credit limit, age, annual income, and address.
• Since the number of fraudulent cases is relatively small compared to the number of
legitimate transactions, anomaly detection techniques can be applied to build a profile
of legitimate transactions for the users.
• When a new transaction arrives, it is compared against the profile of the user.
• If the characteristics of the transaction are very different from the previously created
profile, then the transaction is flagged as potentially fraudulent
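• A minimal sketch of this profile-and-compare idea (the amounts and the threshold are purely illustrative): a per-user profile of legitimate transaction amounts is built, and a new transaction is flagged if it deviates strongly from that profile.

import statistics

# Hypothetical history of legitimate transaction amounts for one card holder
history = [25.0, 40.0, 32.0, 28.0, 35.0, 30.0]

mean = statistics.mean(history)
stdev = statistics.pstdev(history)

def is_suspicious(amount, threshold=3.0):
    """Flag a transaction whose amount is far from the user's usual profile."""
    z = abs(amount - mean) / stdev
    return z > threshold

print(is_suspicious(31.0))    # False: close to the usual behaviour
print(is_suspicious(900.0))   # True: very different from the profile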
EXERCISES
1. What is data mining? Explain Data Mining and Knowledge Discovery? (10)
2. What are different challenges that motivated the development of DM? (10)
3. Explain Origins of data mining (5)
4. Discuss the tasks of data mining with suitable examples. (10)
5. Explain Anomaly Detection .Give an Example? (5)
6. Explain Descriptive tasks in detail? (10)
7. Explain Predictive tasks in detail by example? (10)
ASYMMETRIC ATTRIBUTES
• Binary attributes where only non-zero values are important are called asymmetric binary
attributes.
• Consider a data-set where each object is a student and each attribute records whether or not a
student took a particular course at a university.
• For a specific student, an attribute has a value of 1 if the student took the course associated with
that attribute and a value of 0 otherwise.
• Because students take only a small fraction of all available courses, most of the values in such a
data-set would be 0.
• Therefore, it is more meaningful and more efficient to focus on the non-zero values.
• This type of attribute is particularly important for association analysis.
RECORD DATA
• Data-set is a collection of records.
Each record consists of a fixed set of attributes.
• Every record has the same set of attributes.
• There is no explicit relationship among records or attributes.
• The data is usually stored either
→ in flat files or → in relational databases
GRAPH BASED DATA
• Sometimes, a graph can be a convenient and powerful representation for data.
• We consider 2 specific cases:
1) Data with Relationships among Objects
• The relationships among objects frequently convey important information.
• In particular, the data-objects are mapped to nodes of the graph,
while relationships among objects are captured by link properties such as
direction & weight.
• For ex, in web, the links to & from each page provide a great deal of information
about the relevance of a web-page to a query, and thus, must also be taken into
consideration.
2) Data with Objects that are Graphs
• If the objects contain sub-objects that have relationships, then such objects are
frequently represented as graphs.
• For ex, the structure of chemical compounds can be represented by a graph,
where nodes are atoms and
links between nodes are chemical bonds.
AGGREGATION
• This refers to combining 2 or more attributes (or objects) into a single attribute (or object).
For example, merging daily sales figures to obtain monthly sales figures
• Motivations for aggregation:
1) Data reduction: The smaller data-sets require
→ less memory → less processing time.
Because of aggregation, a more expensive algorithm can be used.
2) Aggregation can act as a change of scale by providing a high-level view of the data instead
of a low-level view. E.g. Cities aggregated into districts, states, countries, etc
3) The behavior of groups of objects is often more stable than that of individual objects.
• Disadvantage: The potential loss of interesting details.
SAMPLING
• This is a method used for selecting a subset of the data-objects to be analyzed.
• This is often used for both
→ preliminary investigation of the data → final data analysis
• Q: Why sampling?
Ans: Obtaining & processing the entire set of “data of interest” is too expensive or time consuming.
• Sampling can reduce the data-size to the point where a better but more expensive algorithm can be used.
• Key principle for effective sampling: Using a sample will work almost as well as using entire data-
set, if the sample is representative.
Sampling Methods
1) Simple Random Sampling
• There is an equal probability of selecting any particular object.
• There are 2 variations on random sampling:
i) Sampling without Replacement
• As each object is selected, it is removed from the population.
ii) Sampling with Replacement
• Objects are not removed from the population as they are selected for the sample.
• The same object can be picked up more than once.
• When the population consists of different types(or number) of objects, simple random
sampling can fail to adequately represent those types of objects that are less frequent.
2) Stratified Sampling
• This starts with pre-specified groups of objects.
• In the simplest version, equal numbers of objects are drawn from each group even though
the groups are of different sizes.
• In another variation, the number of objects drawn from each group is proportional to the
size of that group.
3) Progressive Sampling
• If proper sample-size is difficult to determine then progressive sampling can be used.
• This method starts with a small sample, and
then increases the sample-size until a sample of sufficient size has been obtained.
• This method requires a way to evaluate the sample to judge if it is large enough.
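• A short sketch contrasting simple random sampling (with and without replacement) and proportional stratified sampling (numpy assumed; the population and group labels are made up):

import numpy as np

rng = np.random.default_rng(seed=0)
population = np.arange(100)                   # 100 hypothetical objects
groups = np.where(population < 80, "A", "B")  # group A is much larger than group B

# Simple random sampling
without_repl = rng.choice(population, size=10, replace=False)
with_repl = rng.choice(population, size=10, replace=True)

# Stratified sampling: number drawn from each group proportional to its size
sample = []
for g in np.unique(groups):
    members = population[groups == g]
    k = max(1, round(10 * len(members) / len(population)))
    sample.extend(rng.choice(members, size=k, replace=False))

print(without_repl, with_repl, sample, sep="\n")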
DIMENSIONALITY REDUCTION
• Key benefit: many DM algorithms work better if the dimensionality is lower.
Purpose
• May help to eliminate irrelevant features or reduce noise.
• Can lead to a more understandable model (which can be easily visualized).
• Reduce amount of time and memory required by DM algorithms.
• Avoid curse of dimensionality.
The Curse of Dimensionality
• Data-analysis becomes significantly harder as the dimensionality of the data increases.
• For classification, this can mean that there are not enough data-objects to allow the
creation of a model that reliably assigns a class to all possible objects.
• For clustering, the definitions of density and the distance between points (which are critical
for clustering) become less meaningful.
• As a result, we get
→ reduced classification accuracy &
→ poor quality clusters.
DISCRETIZATION AND BINARIZATION
• Some DM algorithms (especially classification algorithms) require that the data be in the form of
categorical attributes.
• Algorithms that find association patterns require that the data be in the form of binary attributes.
• Transforming continuous attributes into a categorical attribute is called discretization.
And transforming continuous & discrete attributes into binary attributes is called binarization.
Binarization
• A simple technique to binarize a categorical attribute is the following: If there are m
categorical values, then uniquely assign each original value to an integer in interval [0,m-1].
• Next, convert each of these m integers to a binary number.
• Since n = ⌈log2(m)⌉ binary digits are required to represent these integers, represent each of these
binary numbers using n binary attributes.
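• A small sketch of this binarization: m categorical values are mapped to integers 0..m-1, and each integer is then represented using n = ⌈log2(m)⌉ binary attributes (the example values are made up):

import math

values = ["awful", "poor", "OK", "good", "great"]   # m = 5 categorical values
m = len(values)
n = math.ceil(math.log2(m))                         # n = 3 binary attributes

def binarize(value):
    """Map a categorical value to its n-bit binary representation."""
    code = values.index(value)                      # integer in [0, m-1]
    return [int(bit) for bit in format(code, f"0{n}b")]

for v in values:
    print(v, binarize(v))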
VARIABLE TRANSFORMATION
• This refers to a transformation that is applied to all the values of a variable.
• Ex: converting a floating point value to an absolute value.
• Two types are:
1) Simple Functions
• A simple mathematical function is applied to each value individually.
• If x is a variable, then examples of transformations include ex, 1/x, log(x), sin(x).
2) Normalization (or Standardization)
• The goal is to make an entire set of values have a particular property.
• A traditional example is that of "standardizing a variable" in statistics.
• If x̄ is the mean of the attribute values and sx is their standard deviation, then the
transformation x' = (x − x̄)/sx creates a new variable that has a mean of 0 and a
standard deviation of 1.
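• A quick sketch of standardizing a variable so it has mean 0 and standard deviation 1 (numpy assumed; the values are made up):

import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])      # hypothetical attribute values

x_std = (x - x.mean()) / x.std()        # x' = (x - mean) / standard deviation

print(x_std.mean(), x_std.std())        # approximately 0.0 and 1.0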
DISSIMILARITIES BETWEEN DATA OBJECTS
Distances
• The Euclidean distance, d, between 2 points, x and y, in one-, two-, three- or higher-dimensional
space, is given by the familiar formula
d(x, y) = sqrt( Σ_k (x_k − y_k)^2 )
where x_k and y_k are the k-th attributes of x and y.
• This is generalized by the Minkowski distance
d(x, y) = ( Σ_k |x_k − y_k|^r )^(1/r)
where r is a parameter.
• The following are the three most common examples of Minkowski distance:
r=1. City block (Manhattan, L1 norm) distance.
A common example is the Hamming distance, which is the number of bits that are different
between two objects that have only binary attributes, i.e. between two binary vectors.
r=2. Euclidean distance (L2 norm).
r=∞. Supremum (L∞ or Lmax norm) distance. This is the maximum difference between any
attribute of the objects. It is defined by
d(x, y) = max_k |x_k − y_k|
• If d(x,y) is the distance between two points, x and y, then the following properties hold
1) Positivity
d(x,y) >= 0 for all x and y, and d(x,y) = 0 only if x = y
2) Symmetry
d(x,y)=d(y,x) for all x and y.
3) Triangle inequality
d(x,z)<=d(x,y)+d(y,z) for all points x,y and z.
• Measures that satisfy all three properties are known as metrics.
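• A compact sketch computing the three Minkowski distances named above between two made-up points (numpy assumed):

import numpy as np

x = np.array([1.0, 0.0, 3.0])
y = np.array([4.0, 2.0, 3.0])

manhattan = np.sum(np.abs(x - y))          # r = 1, L1 norm
euclidean = np.sqrt(np.sum((x - y) ** 2))  # r = 2, L2 norm
supremum = np.max(np.abs(x - y))           # r = infinity, Lmax norm

print(manhattan, euclidean, supremum)      # 5.0, ~3.606, 3.0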
SIMILARITIES BETWEEN DATA OBJECTS
• For similarities, the triangle inequality typically does not hold, but symmetry and positivity typically do.
• If s(x,y) is the similarity between points x and y, then the typical properties of similarities are the
following
1) s(x,y)=1 only if x=y
2) s(x,y)=s(y,x) for all x and y. (Symmetry)
• For ex, cosine and Jaccard similarity.
JACCARD COEFFICIENT
• Jaccard coefficient is frequently used to handle objects consisting of asymmetric binary attributes.
• The Jaccard coefficient is given by the following equation:
J = f11 / (f01 + f10 + f11)
where f01 = the number of attributes where x is 0 and y is 1, f10 = the number of attributes where
x is 1 and y is 0, and f11 = the number of attributes where both x and y are 1.
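• A small sketch computing the Jaccard coefficient for two made-up binary vectors (only the non-zero matches count):

x = [1, 0, 0, 0, 0, 0, 1, 0, 0, 1]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)

jaccard = f11 / (f01 + f10 + f11)   # 0-0 matches are ignored
print(jaccard)                      # 2 / 3 here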
COSINE SIMILARITY
• Documents are often represented as vectors, where each attribute represents the frequency with
which a particular term (or word) occurs in the document.
• This is more complicated, since certain common words are ignored and various processing
techniques are used to account for
→ different forms of the same word
→ differing document lengths and
→ different word frequencies.
• The cosine similarity is one of the most common measure of document similarity.
• If x and y are two document vectors, then
cos(x, y) = (x · y) / (||x|| ||y||)
where · indicates the vector dot product and ||x|| is the length (norm) of vector x.
• As indicated by Figure 2.16, cosine similarity really is a measure of the angle between x and y.
• Thus, if the cosine similarity is 1, the angle between x and y is 0°, and x and y are the same except
for magnitude (length).
• If the cosine similarity is 0, then the angle between x and y is 90° and they do not share any terms.
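• A brief sketch of cosine similarity between two made-up term-frequency vectors (numpy assumed):

import numpy as np

x = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])   # term frequencies of document x
y = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])   # term frequencies of document y

cosine = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(cosine)   # close to 1 means similar direction; 0 means no shared terms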
DATA MINING APPLICATIONS
Prediction & Description
• Data mining may be used to answer questions like
→ "would this customer buy a product" or
→ "is this customer likely to leave?”
• DM techniques may also be used for sales forecasting and analysis.
Relationship Marketing
• Customers have a lifetime value, not just the value of a single sale.
• Data mining can help
→ in analyzing customer profiles and improving direct marketing plans
→ in identifying critical issues that determine client loyalty and
→ in improving customer retention
Customer Profiling
• This is the process of using the relevant and available information
→ to describe the characteristics of a group of customers
→ to identify their discriminators from ordinary consumers and
→ to identify drivers for their purchasing decisions
• This can help an enterprise identify its most valuable customers
so that the enterprise may differentiate their needs and values.
Outliers Identification & Detecting Fraud
• For this, examples include:
→ identifying unusual expense claims by staff
→ identifying anomalies in expenditure between similar units of an enterprise
→ identifying fraud involving credit cards
Customer Segmentation
• This is a way to assess & view individuals in market based on their status & needs.
• Data mining may be used
→ to understand & predict customer behavior and profitability
→ to develop new products & services and
→ to effectively market new offerings
Web site Design & Promotion
• Web mining may be used to discover how users navigate a web site and the results
can help in improving the site design.
• Web mining may also be used in cross-selling by suggesting to a web customer,
items that he may be interested in.
EXERCISES
1. Explain Data set. Give an Example? (5)
2. Explain 4 types of attributes by giving appropriate example? (10)
3. With example, explain
i) Continuous ii) Discrete iii) Asymmetric attributes. (10)
4. Explain general characteristics of data sets. (6)
5. Explain record data & its types. (10)
6. Explain graph based data. (6)
7. Explain ordered data & its types. (10)
8. Explain shortly any five data pre-processing approaches. (10)
9. What is sampling? Explain simple random sampling vs. stratified
sampling vs. progressive sampling. (10)
10. Write a short note on the following: (15)
i)Dimensionality reduction ii)Variable transformation iii)Feature selection
11. Distinguish between
i) SMC & Jaccard coefficient ii) Discretization & binarization (6)
12. Explain various distances used for measuring dissimilarity between objects. (6)
13. Consider the following 2 binary vectors
X=(1,0,0,0,0,0,0,0,0,0) Y=(0,0,0,0,0,0,1,0,0,1)
Find i)hamming distance ii)SMC iii)Jaccard coefficient (4)
14. List out issues in proximity calculation. (4)
15. List any 5 applications of data mining. (8)
CLASSIFICATION VS. CLUSTER ANALYSIS
• Classification is used mostly as a supervised learning method
while clustering is used as unsupervised learning.
Classification
• The classes are predefined (Table 4.1).
• The user already knows what classes there are.
• Some training data that is already labeled by their class-membership is available to build a model.
• The classification-problem then is to build a model that would be able to classify newly
encountered data.
Cluster Analysis
• One does not know what classes or clusters exist (Table 4.2).
• The problem to be solved is to group the given data into meaningful clusters.
• The aim of cluster analysis is to find meaningful groups with
→ small within-group variations &
→ large between-group variation
• Most of the algorithms developed are based on some concept of similarity or distance.
• Drawbacks:
→ This process may be prohibitively expensive for large sets.
→ Cost of computing distances between groups of objects grows as no. of attributes grows.
→ Computing distances between categorical attributes is more difficult (compared to
computing distances between objects with numeric attributes)
DESIRED FEATURES OF CLUSTER ANALYSIS METHOD
Scalability
• Data-mining problems can be large and therefore a cluster-analysis method should
be able to deal with large problems gracefully.
• Ideally, performance should be linear with data-size.
• The method should also scale well to datasets in which number of attributes is large.
Only one Scan of the Dataset
• For large problems, data must be stored on disk, so cost of I/O disk can become
significant in solving the problem.
• Therefore, the method should not require more than one scan of disk-resident data.
Ability to Stop & Resume
• For large dataset, cluster-analysis may require huge processor-time to complete the
task.
• In such cases, task should be able to be stopped & then resumed when convenient.
Minimal Input Parameters
• The method should not expect too much guidance from the data-mining analyst.
• Therefore, the analyst should not be expected
→ to have domain knowledge of data and
→ to possess insight into clusters that might exist in the data
Robustness
• Most data obtained from a variety of sources has errors.
• Therefore, the method should be able to deal with noise, outlier & missing values
gracefully.
Ability to Discover Different Cluster-Shapes
• Clusters appear in different shapes and not all clusters are spherical.
• Therefore, method should be able to discover cluster-shapes other than spherical.
Different Data Types
• Many problems have a mixture of data types, for e.g. numerical, categorical & even
textual.
• Therefore, the method should be able to deal with
→ numerical data → boolean data → categorical data
Result Independent of Data Input Order
• Therefore, the method should not be sensitive to data input-order.
• Irrespective of input-order, result of cluster-analysis of the same data should be
same.
TYPES OF DATA
Numerical Data
• Examples include weight, marks, height, price, salary, and count.
• There are a number of methods for computing similarity between these data.
E.g. Euclidean distance, manhattan distance.
Binary Data
• Examples include gender, marital status.
• A simple method involves
→ counting how many attribute values of 2 objects are different amongst n
attributes &
→ using this as an indication of distance
Qualitative Nominal Data
• This is similar to binary data which may take more than 2 values but has no natural
order. Examples include religion, foods or colors.
Qualitative Ranked Data
• This is similar to qualitative nominal data except that data has an order associated
with it.
• Examples include: 1) grades A, B, C, and D 2) sizes S, M, L and XL
• One method of computing distance involves transferring the values to numeric
values according to their rank. For example, grades A, B, C, D could be transformed to
4.0, 3.0, 2.0 and 1.0.
COMPUTING DISTANCE
• Most cluster-analysis methods are based on measuring similarity between objects.
• Distances are normally used to measure the similarity or dissimilarity between 2 objects.
• Let the distance between 2 points x and y be D(x,y).
• Distance has following simple properties:
1) Distance is always positive. i.e. D(x,y) >= 0
2) Distance from point x to itself is always 0. i.e. D(x,x) = 0
3) Distance from point x to point y is never more than the sum of the distance from x to
some other point z and the distance from z to y. i.e. D(x,y) <= D(x,z) + D(z,y)
4) Distance from x to y is always the same as from y to x. i.e. D(x,y) = D(y,x)
• Following are some of the distance measures:
1) Euclidean distance (L2 norm of difference vector)
2) Manhattan distance (L1 norm of the difference vector)
3) Chebychev distance (L∞ norm of the difference vector)
4) Categorical data distance
Euclidean Distance
• This is most commonly used to compute distances and has an intuitive appeal.
• The largest valued attribute may dominate the distance.
• Requirement: The attributes should be properly scaled.
• This metric is more appropriate when the data is not standardized.
Manhattan Distance
• In most cases, the result obtained by this measure is similar to those obtained by using the
Euclidean distance.
• The largest valued attribute may dominate the distance.
Chebychev Distance
• This metric is based on the maximum attribute difference.
TYPES OF CLUSTER ANALYSIS METHODS
Partitional Method
• The objects are divided into non-overlapping clusters (or partitions)
such that each object is in exactly one cluster (Figure 4.1a).
• The method obtains a single-level partition of objects.
• The analyst has
→ to specify the number of clusters (k) in advance → to specify the starting seeds of the clusters
• The analyst may have to use iterative approach in which he has to run the method many times
→ specifying different numbers of clusters & different starting seeds
→ then selecting the best solution
• The method converges to a local minimum rather than the global minimum.
PARTITIONAL METHODS
• The objects are divided into non-overlapping clusters (or partitions)
such that each object is in exactly one cluster (Figure 4.1a).
• The method obtains a single-level partition of objects.
• The method requires the analyst
→ to specify the number of clusters (k) in advance
→ to specify the starting seeds of the clusters
• The analyst may have to use an iterative approach, running the method many times
→ specifying different numbers of clusters & different starting seeds
→ then selecting the best solution
• The method converges to a local minimum rather than the global minimum.
• These are popular since they tend to be computationally efficient and are more easily adapted for
very large datasets.
• The aim of partitional method is
→ to reduce the variance within each cluster &
→ to have large variance between the clusters
Figure 4.1c
Example 4.1
• Consider the data about students given in Table 4.3.
Step 1 and 2:
• Let the three seeds be the first three students as shown in Table 4.4.
Step 3 and 4:
• Now compute the distances using the 4 attributes and using the sum of absolute differences for
simplicity. The distance values for all the objects are given in Table 4.5.
Step 5:
• Table 4.6 compares the cluster means of clusters found in Table 4.5 with the original seeds.
Step 3 and 4:
• Use the new cluster means to re-compute the distance of each object to each of the means, again
allocating each object to the nearest cluster. Table 4.7 shows the second iteration.
• Number of students in cluster 1 is again 2 and the other two clusters still have 4 students each.
• A more careful look shows that the clusters have not changed at all. Therefore, the method has
converged rather quickly for this very simple dataset.
• The cluster membership is as follows
Cluster C1= {S1, S9} Cluster C2= {S2, S5, S6, S10} Cluster C3= {S3, S4, S7, S8}
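• A minimal Python sketch of the K-means procedure followed in this example. The student data of Table 4.3 is not reproduced here, so the rows below are made up; as in the example, the sum of absolute differences is used as the distance and the first three objects are the seeds.

import numpy as np

def kmeans(data, seeds, max_iter=100):
    centroids = np.array(seeds, dtype=float)
    for _ in range(max_iter):
        # allocate each object to the nearest cluster (sum of absolute differences)
        dist = np.abs(data[:, None, :] - centroids[None, :, :]).sum(axis=2)
        labels = dist.argmin(axis=1)
        # re-compute the cluster means
        new_centroids = centroids.copy()
        for k in range(len(centroids)):
            members = data[labels == k]
            if len(members) > 0:
                new_centroids[k] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break   # cluster membership no longer changes: the method has converged
        centroids = new_centroids
    return labels, centroids

# hypothetical student records: (age, mark1, mark2, mark3)
data = np.array([[18, 73, 75, 57], [18, 79, 85, 75], [23, 70, 70, 52],
                 [20, 55, 55, 55], [22, 85, 86, 87], [19, 91, 90, 89]], dtype=float)
labels, centroids = kmeans(data, seeds=data[:3])
print(labels)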
EXPECTATION MAXIMIZATION METHOD
• The assumption is that the objects in the dataset have attributes whose values are distributed according to some linear combination of simple probability distributions.
• The K-means method assigns objects to clusters so as to minimize within-group variation, whereas the EM method assigns objects to different clusters with certain probabilities, in an attempt to maximize the expectation (or likelihood) of the assignment.
• The EM method consists of a two-step iterative algorithm:
1) The estimation-step (or E-step) involves estimating the probability distributions of the
clusters given the data.
2) The maximization-step(M-step) involves finding the model parameters that maximize the
likelihood of the solution.
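• In practice the EM method for clustering is usually applied through a Gaussian mixture model. A minimal sketch using scikit-learn (assuming scikit-learn is installed; the data values are made up):

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[1.0, 2.0], [1.2, 1.8], [0.9, 2.2],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.3]])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)   # E-step and M-step run internally
print(gmm.predict_proba(X))   # probability of each object belonging to each cluster
print(gmm.predict(X))         # hard assignment to the most likely cluster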
HIERARCHICAL METHODS
• A set of nested clusters is organized as a hierarchical tree (Figure 4.1d).
• This approach allows clusters to be found at different levels of granularity.
Figure 4.1d
• There are 2 types of hierarchical methods: agglomerative and divisive.
1. Agglomerative approach: Each object is initially placed in a cluster of its own. The two closest clusters are then repeatedly merged until a stopping criterion is reached or all the objects are in one cluster (Figure 4.1e).
Figure 4.1e
2. Divisive approach: All the objects are put in a single cluster to start. The method then repeatedly splits a cluster, resulting in smaller and smaller clusters, until a stopping criterion is reached or each cluster has only one object in it (Figure 4.1f).
Figure 4.1f
DISTANCE BETWEEN CLUSTERS
• The hierarchical methods require distances between clusters to be computed. These distance
metrics are often called linkage metrics.
• Following methods are used for computing distances between clusters:
1) Single-link(nearest neighbor) 2) Complete-link(farthest neighbor)
3) Centroid 4) Average 5) Ward's minimum variance
SINGLE-LINK
• The distance between 2 clusters is defined as the minimum of the distances between all pairs of points (xi, yj),
where xi is from the first cluster &
yj is from the second cluster, i.e. D(A, B) = min D(xi, yj)
• If there are m elements in one cluster and n in another cluster,
all mn pairwise distances must be computed and
the smallest is chosen (Figure 4.2).
• Disadvantages:
→ Each cluster may have an outlier and the 2 outliers may be nearby and so the distance
between the 2 clusters would be computed to be small.
→ Single link can form a chain of objects as clusters are combined since there is no constraint
on distance between objects that are far away from each other.
COMPLETE-LINK
• The distance between 2 clusters is defined as the maximum of the distances between all pairs of points (xi, yj), i.e. D(A, B) = max D(xi, yj)
• This is strongly biased towards compact clusters (Figure 4.3).
• Disadvantages:
→ Each cluster may have an outlier and the 2 outliers may be very far away and so the
distance between the 2 clusters would be computed to be large.
→ If a cluster is naturally of a shape, say, like a banana then perhaps this method is not
appropriate.
CENTROID
• The distance between 2 clusters is defined as the distance between the centroids of the clusters.
• Usually, the squared Euclidean distance between the centroids is used, i.e. D(A, B) = D(Ci, Cj), where Ci and Cj are the centroids of the two clusters.
• Advantages:
→ This method is easy and generally works well.
→ This method is more tolerant of somewhat longer clusters than complete link algorithm.
AVERAGE
• The distance between 2 clusters is defined as the average of all pairwise distances between
→ an object from one cluster and
→ another from the other cluster, i.e. D(A, B) = avg D(xi, yj)
• Therefore, if there are m elements in one cluster and n in the other,
there are mn distances to be computed, added and divided by mn.
• Advantages:
→ This method tends to join clusters with small variances.
→ This method is easy and generally works well.
→ This method is more tolerant of somewhat longer clusters than complete link algorithm.
WARD'S MINIMUM VARIANCE
• The distance between 2 clusters is defined as the increase in the within-cluster sum of squares that would result from merging the 2 clusters. It may be computed as
DW(A, B) = ( NA * NB / (NA + NB) ) * Dc(A, B)
where
DW(A,B) = Ward's minimum variance distance between clusters A and B with NA & NB objects in them respectively.
Dc(A,B) = centroid distance between the 2 clusters, computed as the squared Euclidean distance between the centroids.
• Advantages:
→ This method generally works well and results in creating small tight clusters
→ This method produces clusters with roughly the same number of objects
→ This method tends to join clusters with a small number of objects
• Disadvantage:
→ The distance measure can be sensitive to outliers
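• A minimal sketch of computing the single-link, complete-link, average and centroid distances between two small made-up clusters, using SciPy's cdist for the pairwise Euclidean distances:

import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[1.0, 1.0], [2.0, 1.0], [1.5, 2.0]])   # cluster A with m objects
B = np.array([[6.0, 5.0], [7.0, 6.0]])               # cluster B with n objects

pairwise = cdist(A, B)                  # m x n matrix of all pairwise distances
single_link = pairwise.min()            # minimum of all pairwise distances
complete_link = pairwise.max()          # maximum of all pairwise distances
average_link = pairwise.mean()          # mn distances added and divided by mn
centroid_dist = ((A.mean(axis=0) - B.mean(axis=0)) ** 2).sum()   # squared Euclidean distance between the centroids
print(single_link, complete_link, average_link, centroid_dist)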
AGGLOMERATIVE METHOD
• This method is basically a bottom-up approach.
• The algorithm is as follows:
1) Allocate each point to a cluster of its own. Thus, we start with n clusters for n objects.
2) Create a distance-matrix by computing distances between all pairs of clusters (either using
the single link metric or the complete link metric). Sort these distances in ascending order.
3) Find the 2 clusters that have the smallest distance between them.
4) Remove the pair of clusters from the distance-matrix and merge them.
5) If there is only one cluster left then stop.
6) Compute all distances from the new cluster and update the distance-matrix after the
merger and go to step 3.
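• A minimal sketch of this bottom-up procedure using SciPy's hierarchical clustering (the data values are made up; other linkage metrics such as 'single', 'complete', 'average' or 'ward' can be substituted):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0],
              [5.2, 4.8], [9.0, 1.0], [9.1, 0.8]])

Z = linkage(X, method='centroid')                 # repeatedly merges the two closest clusters
labels = fcluster(Z, t=3, criterion='maxclust')   # cut the resulting tree into 3 clusters
print(labels)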
Example 4.3
• We now use agglomerative technique for clustering the data given in Table 4.10.
Steps 1 and 2:
• Allocate each point to a cluster and compute the distance-matrix using the centroid method.
• The distance-matrix is symmetric, so we only show half of it in Table 4.11.
• Table 4.11 gives the distance of each object with every other object.
Steps 3 and 4:
• The smallest distance is 8 between objects S4 and S8. They are combined and removed and we put
the combined cluster (C1) where the object S4 was.
• Table 4.12 is now the new distance-matrix. All distances except those with cluster C1 remain unchanged.
Steps 5 and 6:
• The smallest distance now is 15, between objects S5 and S6. They are combined into cluster C2, and S5 and S6 are removed.
Steps 3, 4, 5 and 6:
• Table 4.13 is the updated distance-matrix.
• The result of using the agglomerative method could be something like that shown in Figure 4.6.
DIVISIVE HIERARCHICAL METHODS
• These methods
→ start with the whole dataset as one cluster
→ then proceed to recursively divide the cluster into two sub-clusters and
→ continue until each cluster has only one object.
• Two types of divisive methods are:
1) Monothetic: This splits a cluster using only one attribute at a time.
An attribute that has the most variation could be selected.
2) Polythetic: This splits a cluster using all of the attributes together.
Two clusters far apart could be built based on the distance between objects.
• The algorithm is as follows:
1) Decide on a method of measuring the distance between 2 objects. Also, decide a threshold
distance.
2) Create a distance-matrix by computing distances between all pairs of objects within the
cluster. Sort these distances in ascending order.
3) Find the 2 objects that have the largest distance between them. They are the most
dissimilar objects.
4) If the distance between the 2 objects is smaller than the pre-specified threshold and there
is no other cluster that needs to be divided then stop, otherwise continue.
5) Use the pair of objects as seeds of a K-means method to create 2 new clusters.
6) If there is only one object in each cluster then stop otherwise continue with step 2.
• We need to resolve the following 2 issues:
1) Which cluster to split next?
i) Split the cluster in some sequential order.
ii) Split the cluster that has the largest number of objects.
iii) Split the cluster that has the largest variation within it.
2) How to split a cluster?
A distance-matrix is created and the 2 most dissimilar objects are selected as seeds
of 2 new clusters. The K-means method is then used to split the cluster.
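• A minimal sketch of one split step of the polythetic approach described above (made-up data; scikit-learn's KMeans is used with the two most dissimilar objects as seeds):

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

def split_cluster(X):
    dist = squareform(pdist(X))                           # distance-matrix of the cluster
    i, j = np.unravel_index(dist.argmax(), dist.shape)    # the 2 most dissimilar objects
    seeds = X[[i, j]]
    km = KMeans(n_clusters=2, init=seeds, n_init=1).fit(X)
    return km.labels_

X = np.array([[1.0, 1.0], [1.2, 0.9], [8.0, 8.0], [8.5, 7.5], [4.0, 4.0]])
print(split_cluster(X))   # 0/1 labels marking the two new sub-clusters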
Example 4.4
• Consider the distance-matrix in Table 4.14.
• The largest distance is 115, between objects S8 and S9. They become the seeds of 2 new clusters.
K-means is used to split the group into 2 clusters.
• The distance-matrix of C1 is given in Table 4.17. The largest distance is 98, between S8 and S10. C1 can therefore be split with S8 and S10 as seeds.
• The method continues like this until the stopping criteria is met.
DENSITY-BASED METHODS
• A cluster is a dense region of points which is separated from other regions of high density by regions of low density.
• Typically, for each data-point in a cluster, at least a minimum number of points must exist within a
given radius.
• Data that is not within such high-density clusters is regarded as outliers or noise.
• DBSCAN is one example of a density-based method for clustering.
• DBSCAN stands for Density Based Spatial Clustering of Applications with Noise.
• This requires 2 input parameters:
→ size of the neighborhood(R) &
→ minimum points in the neighborhood(N).
• The point-parameter N
→ determines the density of acceptable clusters &
→ determines which objects will be labeled outliers or noise.
• The size-parameter R determines the size of the clusters found.
• If R is big enough, there would be one big cluster and no outliers.
If R is small, there will be small dense clusters and there might be many outliers.
• We define a number of concepts that are required in the DBSCAN method:
1) Neighbourhood: The neighborhood of an object y is defined as all the objects that are
within the radius R from y.
2) Core-object: An object y is called a core-object if there are N objects within its
neighborhood.
3) Proximity: Two objects are defined to be in proximity to each other if they belong to the
same cluster. Object x1 is in proximity to object x2 if two conditions are satisfied:
i) The objects are close enough to each other, i.e. within a distance of R.
ii) x2 is a core object.
4) Connectivity: Two objects x1 and xn are connected if there is a chain of objects x1, x2, ..., xn from x1 to xn such that each xi+1 is in proximity to object xi.
• The algorithm is as follows
1) Select values of R and N (Figure 4.7).
2) Arbitrarily select an object p.
3) Retrieve all objects that are connected to p, given R and N.
4) If p is a core object, a cluster is formed.
5) If p is a border object, no objects are in its proximity. Choose another object. Go to step 3.
6) Continue the process until all of the objects have been processed.
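• In scikit-learn's DBSCAN implementation, R corresponds to the eps parameter and N to min_samples. A minimal sketch (made-up data and parameter values):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.2],
              [8.0, 8.0], [8.1, 8.1], [7.9, 8.2], [20.0, 20.0]])

db = DBSCAN(eps=0.5, min_samples=3).fit(X)   # eps = R, min_samples = N
print(db.labels_)   # cluster ids; -1 marks objects regarded as outliers or noise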
DEALING WITH LARGE DATABASES
• A method requiring multiple scans of data that is disk-resident could be quite inefficient for large
problems.
K MEANS METHOD FOR LARGE DATABASES
• The method
→ first picks the number of clusters & their seed centroids and
→ then attempts to classify each object as belonging to one of the following 3 groups:
1) Those that are certain to belong to a cluster. These objects together are called discard-set.
Some information about these objects is computed and saved. This includes
→ number of objects n
→ a vector sum of all attribute values of the n objects and
→ a vector sum of squares of all attribute values of n objects
2) Those that are sufficiently close to each other to be replaced by their summary. These
objects together are called the compression-set.
The objects are however sufficiently far away from each cluster's centroid that they
cannot be put in the discard set.
3) The remaining objects that are too difficult to assign to either of the 2 groups above.
These objects are called the retained-set.
These are stored as individual objects.
They cannot be replaced by a summary.
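• A minimal sketch of the summary kept for the discard-set (the object values are made up); the centroid and per-attribute variance of the cluster can be recovered from this summary without keeping the objects themselves:

import numpy as np

# hypothetical objects that are certain to belong to one cluster (the discard-set)
objects = np.array([[2.0, 3.0], [2.5, 2.5], [3.0, 3.5]])

n = len(objects)                            # number of objects n
vector_sum = objects.sum(axis=0)            # vector sum of all attribute values
vector_sumsq = (objects ** 2).sum(axis=0)   # vector sum of squares of all attribute values

centroid = vector_sum / n                       # centroid recovered from the summary
variance = vector_sumsq / n - centroid ** 2     # per-attribute variance recovered from the summary
print(n, vector_sum, vector_sumsq, centroid, variance)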
QUALITY AND VALIDITY OF CLUSTER ANALYSIS METHODS
• Let the number of clusters = k
Let the clusters = Ci, i = 1, ..., k
Let the total number of objects = N
Let the number of objects in cluster Ci = Mi, so that M1 + M2 + ... + Mk = N
• The within-cluster variation of the objects in a cluster is defined as the average squared distance of each object from the centroid of the cluster.
• If mi is the centroid of cluster Ci, then the mean of the cluster is given by
mi = (1/Mi) * (sum of x over all objects x in Ci)
and the within-cluster variation of Ci may be written as
(1/Mi) * (sum of D(x, mi)^2 over all objects x in Ci)
• The between-cluster distance E is the average sum of squares of pairwise distances between the centroids of the k clusters. We may write E as
E = (2 / (k(k-1))) * (sum of D(mi, mj)^2 over all pairs of clusters i < j)
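• A minimal sketch of computing these quantities for a given clustering (the data and cluster assignment are made up; squared Euclidean distance is used):

import numpy as np

X = np.array([[1.0, 1.0], [1.5, 1.5], [5.0, 5.0], [5.5, 4.5], [9.0, 9.0], [8.5, 9.5]])
labels = np.array([0, 0, 1, 1, 2, 2])   # k = 3 clusters
k = labels.max() + 1

centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])

# within-cluster variation: average squared distance of each object from its cluster centroid
within = [np.mean(((X[labels == i] - centroids[i]) ** 2).sum(axis=1)) for i in range(k)]

# between-cluster distance E: average squared distance between pairs of cluster centroids
pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
E = np.mean([((centroids[i] - centroids[j]) ** 2).sum() for i, j in pairs])
print(within, E)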
EXERCISES
1) What is cluster analysis? What are its applications? (2)
2) Compare classification vs. cluster analysis. (6)
3) List out and explain desired features of cluster analysis method. (6)
4) List out and explain different types of data. (4)
5) List out and explain different distance measures. (4)
6) List out and explain different types of cluster analysis methods. (6)
7) Write algorithm for k-means method. (6)
8) Apply k-means method for clustering the data given in Table 4.3. (6)
9) List out disadvantages of k-means method. (6)
10) Explain scaling and weighting. (4)
11) Explain expectation maximization method. (4)
12) Compare agglomerative approach vs. divisive approach. (4)
13) Explain different methods used for computing distances between clusters. (6)
14) Write algorithm for agglomerative approach. (6)
15) Apply agglomerative technique for clustering data given in Table 4.10. (6)
16) Write algorithm for divisive approach. (6)
17) List out advantages and disadvantages of hierarchical methods. (6)
18) Explain DBSCAN with its algorithm. (6)
19) Explain K means method for large databases. (4)
20) Explain hierarchical method for large databases. (6)
21) Explain quality and validity of cluster analysis methods (6)
WEB MINING
• Web-mining is the application of data-mining techniques to extract knowledge from web-data. (i.e.
web-content, web-structure, and web-usage data).
• We interact with the web for the following purposes:
1) Finding Relevant Information
• We use the search-engine to find specific information on the web.
• Query triggered process: We specify a simple keyword-query and the response from a
search-engine is a list of pages, ranked by their similarity to the query.
• Search tools have the following problems:
i) Low precision: This is due to the irrelevance of many of the search-results.
We may get many pages of information which are not really relevant to our query.
ii) Low recall: This is due to the inability to index all the information available on the web, i.e. some of the relevant pages are not properly indexed.
2) Discovering New Knowledge from the Web
• Data triggered process: This assumes that
→ we already have a collection of web-data and
→ we want to extract potentially useful knowledge out of it
3) Personalized Web Page Synthesis
• We may wish to synthesize a web-page for different individuals from the available set of
web-pages.
• While interacting with the web, individuals have their own preferences for the style of the
content and presentation.
4) Learning about Individual Users
• This is about knowing
→what the customers do and
→what the customers want
• Within this problem, there are sub-problems such as
→ problems related to effective web-site design and management
→ problems related to marketing etc
• Techniques from web-mining can be used to solve these problems.
Other related techniques from different research areas, such as DB(database),
IR(information retrieval) & NLP(natural language processing), can also be used.
• Web-mining has 3 main operations:
→ clustering (e.g. finding natural groupings of users, pages)
→ associations (e.g. which URLs tend to be requested together)
→ sequential analysis (e.g. the order in which URLs tend to be accessed)
• Web mining techniques can be classified into 3 areas of interest (Figure 10.1):
→ web-content mining (e.g. text, image, records, etc.)
→ web-structure mining (e.g. hyperlinks, tags, etc)
→ web-usage mining (e.g. http logs, app server logs, etc.)
WEB STRUCTURE MINING
• The structure of a typical web-graph consists of web-pages as nodes, and hyperlinks as edges
connecting related pages.
• Web-Structure mining is the process of discovering structure information from the web.
• This type of mining can be performed either at the (intra-page) document level or at the (inter-
page) hyperlink level.
• This can be used to classify web-pages.
• This can be used to generate information such as the similarity & relationship between different
web-sites.
PageRank
• PageRank is a metric for ranking hypertext documents based on their quality.
• The key idea is that a page has a high rank if it is pointed to by many highly ranked pages.
• The PageRank of a page A is given by
PR(A) = (1 - d) + d * ( PR(T1)/C(T1) + ... + PR(Tn)/C(Tn) )
where T1, ..., Tn are the pages that point to A, C(Ti) is the number of links going out of page Ti, and d is a damping factor (typically set to about 0.85).
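• A minimal sketch of computing PageRank for a small made-up web-graph by applying the formula above repeatedly until the ranks stabilize (d = 0.85 is assumed):

# hypothetical link structure: page -> list of pages it links to
links = {'A': ['B', 'C'], 'B': ['C'], 'C': ['A'], 'D': ['C']}

d = 0.85
pr = {page: 1.0 for page in links}   # initial ranks

for _ in range(50):   # iterate until the ranks stabilize
    new_pr = {}
    for page in links:
        incoming = [p for p in links if page in links[p]]   # pages that point to this page
        new_pr[page] = (1 - d) + d * sum(pr[p] / len(links[p]) for p in incoming)
    pr = new_pr

print(pr)   # pages pointed to by many highly ranked pages receive high ranks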
• Two related measures of how strongly a pair of nodes p and q in the web-graph are associated are:
1) Co-citation
For a pair of nodes p and q, the co-citation is equal to the number of nodes that have links to both p and q.
2) Bibliographic coupling
For a pair of nodes p and q, the bibliographic coupling is equal to the number of nodes that have links from both p and q.
Social Network Analysis
• This can be used to measure the relative standing or importance of individuals in a network.
• The basic idea is that if a web-page points a link to another web-page, then the former is, in some sense, endorsing the importance of the latter.
• Links in the network may have different weights, corresponding to the strength of
endorsement.
TEXT MINING
• This is concerned with various tasks, such as
→ extraction of information implicitly contained in the collection of documents or
→ similarity-based structuring
• Text-collection lacks the imposed structure of a traditional database.
• The text expresses a vast range of information, but encodes the information in a form that is
difficult to decipher automatically.
• Traditional data-mining techniques have been designed to operate on structured-databases.
• In structured-databases, it is easy to define the set of items and hence, it becomes easy to employ
the traditional mining techniques.
In textual-database, identifying individual items (or terms) is not so easy.
• Text-mining techniques have to be developed to process the unstructured textual-data to aid in
knowledge discovery.
• The inherent nature of textual data, namely its unstructured character, motivates the development of separate text-mining techniques.
• Following approaches can be used for text-mining:
1) One way is to impose a structure on the textual-database and use any of the known data-
mining techniques meant for structured-databases.
2) The other approach would be to develop a very specific technique for mining that exploits
the inherent characteristics of textual-databases.
UNSTRUCTURED TEXT
• Unstructured-documents are free texts, such as news stories.
• To convert an unstructured-document to a structured form, following features can be extracted:
Word Occurrences
• The vector-representation (or bag of words) takes single words found in the
training-set as features, ignoring the sequence in which the words occur.
• The feature is said to be boolean, if we consider whether a word either occurs or
does not occur in a document.
• The feature is said to be frequency-based, if the frequency of the word in a
document is taken into consideration.
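• A minimal sketch of building boolean and frequency-based word-occurrence features with scikit-learn's CountVectorizer (assuming scikit-learn is available; the two documents are made up):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["data mining extracts knowledge from data",
        "web mining applies data mining to web data"]

freq = CountVectorizer()                 # frequency-based features
boolean = CountVectorizer(binary=True)   # boolean features: a word occurs or does not occur

print(freq.fit_transform(docs).toarray())      # word counts per document
print(boolean.fit_transform(docs).toarray())   # 1 if the word occurs in the document, else 0
print(freq.get_feature_names_out())            # the single-word vocabulary (bag of words)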
Stopword Removal
• Stopwords are frequently occurring and insignificant words in a language that help
construct sentences but do not represent any content of the documents.
• Common stopwords in English include:
a, about, an, are, as, at, be, by, for, from, how, in, is, of, on, or,
that, the, these, this, to, was, what, when, where, who, will, with
• Such words can be removed before documents are indexed and stored.
Stemming
• Stemming refers to the process of reducing words to their morphological roots or
stems
• A stem is the portion of a word that is left after removing its prefixes and suffixes.
• For example, "informing", "information", "informer" & "informed" are reduced to
"inform".
POS
• POS stands for Part of Speech.
• There can be 25 possible values for POS tags.
• Most common tags are noun, verb, adjective and adverb.
• Thus, we can assign a number 1,2,3,4 or 5, depending on whether the word is a
noun, verb, adjective, adverb or any other, respectively.
Positional Collocations
• The values of this feature are the words that occur one or two positions to the right or left of the given word.
n-gram
• An n-gram is a contiguous sequence of n items from a given sequence of text.
• This can be used for predicting the next item in a sequence.
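• A minimal sketch of extracting word n-grams from a piece of text:

def ngrams(tokens, n):
    # all contiguous sequences of n items from the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "data mining extracts knowledge from data".split()
print(ngrams(tokens, 2))   # bigrams such as ('data', 'mining'), ('mining', 'extracts'), ...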
LSI
• LSI stands for Latent Semantic Indexing.
• LSI is an indexing and retrieval method to identify the relationships between the
terms and concepts contained in an unstructured collection of text.
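• LSI is commonly implemented as a truncated singular value decomposition of a term-document matrix. A minimal sketch using scikit-learn (made-up documents; 2 latent concepts are assumed):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["data mining discovers patterns in data",
        "web mining mines web usage data",
        "text mining analyses unstructured text"]

tfidf = TfidfVectorizer().fit_transform(docs)              # term-document (TF-IDF) matrix
lsi = TruncatedSVD(n_components=2).fit_transform(tfidf)    # project documents onto 2 latent concepts
print(lsi)   # each row gives a document's coordinates in the latent concept space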
EXERCISES
• Write a short note on following (5 marks each)
1) Web mining
2) Web content mining
3) Web structure mining
4) Web usage mining
5) Text mining
6) IR and IE
7) Unstructured text
8) Text clustering
TEMPORAL ASSOCIATION RULES
• Association rules identify whether a particular subset of items is supported by an adequate number
of transactions.
• The presence of a temporal association rule may suggest a number of interpretations such as
→ The earlier event plays some role in causing the later event
→ There is a third set of reasons that cause both events
→ The confluence of events is coincidental
• Temporal association rules are sometimes viewed in the literature as causal rules.
• Causal rules describe relationships, where changes in one event cause subsequent changes in
other parts of the domain.
• They are common targets of scientific investigation within the medical domain, where the search
for factors that may cause or aggravate a particular disease is important.
• The static properties, such as gender, and the temporal properties, such as medical treatments,
are taken into account during mining.
SPATIAL DATABASE
• A spatial database stores a large amount of space-related data(or objects), such as
→ map navigation data
→ preprocessed remote sensing data or
→ medical imaging data
• Spatial objects are objects that occupy space.
SPATIAL DATA MINING
• This refers to the extraction of knowledge, spatial relationships, or other interesting patterns not
explicitly stored in spatial-databases.
• Consider a map of the city of Mysore containing various natural and man-made geographic
features, and clusters of points (where each point marks the location of a particular house).
• The houses might be important because of their size, or their current market value.
• Clustering algorithms can be used to assign each point to exactly one cluster, with the number of
clusters being defined by the user.
• We can mine varieties of information by identifying likely relationships.
• For ex, "the land-value of cluster of residential area around ‘Mysore Palace’ is high".
• Such information could be of value to realtors, investors, or prospective home buyers.
• This problem is not so simple because there may be a large number of features to consider.
• We need to be able to detect relationships among large numbers of geo-referenced objects without
incurring significant overheads.
SPATIAL TREND
• This can be defined as a regular change of one or more non-spatial attributes, when moving away
from a given spatially referenced-object.
• For instance, "when we move away northwards from the 'Highway Circle’, the rentals of residential
houses decreases approximately at the rate of 5% per kilometer".
• One can also show the spatial trends pictorially by overlaying the direction of trend on a map.
EXERCISES
• Write a short note on following (5 marks each)
1) Temporal data mining
2) Temporal data mining tasks
3) Temporal association rules
4) Spatial mining
5) Spatial mining tasks
6) Spatial clustering
7) Spatial trend
8) Spatio Temporal data mining