LECTURE NOTES ON
Data Warehousing and Data Mining
JNTU World
www.alljntuworld.in
SYLLABUS:
Module I
Data Mining overview, Data Warehouse and OLAP Technology, Data Warehouse Architecture, Steps for the Design and Construction of Data Warehouses, A Three-Tier Data Warehouse Architecture, OLAP, OLAP queries, metadata repository, Data Preprocessing, Data Integration and Transformation, Data Reduction, Data Mining Primitives: What Defines a Data Mining Task? Task-Relevant Data, The Kind of Knowledge to be Mined, KDD
Module II
Module III
Module IV
Chapter-1

Data mining is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.
Data mining derives its name from the similarities between searching for valuable business information in a large database (for example, finding linked products in gigabytes of store scanner data) and mining a mountain for a vein of valuable ore. Both processes require either sifting through an immense amount of material, or intelligently probing it to find exactly where the value resides. Given databases of sufficient size and quality, data mining technology can generate new business opportunities by providing these capabilities:
Automated prediction of trends and behaviors. Data mining automates the process of finding predictive information in large databases. Questions that traditionally required extensive hands-on analysis can now be answered directly from the data, quickly. A typical example of a predictive problem is targeted marketing. Data mining uses data on past promotional mailings to identify the targets most likely to maximize return on investment in future mailings. Other predictive problems include forecasting bankruptcy and other forms of default, and identifying segments of a population likely to respond similarly to given events.
Automated discovery of previously unknown patterns. Data mining tools sweep through databases and identify previously hidden patterns in one step. An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together. Other pattern discovery problems include detecting fraudulent credit card transactions and identifying anomalous data that could represent data-entry keying errors worth further investigation.
Clustering is the task of discovering groups and structures in the data that are in some
way or another "similar", without using known structures in the data.
Classification is the task of generalizing known structure to apply to new data. For
example, an e-mail program might attempt to classify an e-mail as "legitimate" or as
"spam".
Regression attempts to find a function which models the data with the least error.
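The regression task described above can be illustrated with a minimal least-squares line fit. This is only a sketch; the sample points are invented for illustration.

```python
# Minimal sketch of the regression task: fit y = a*x + b with least squared error.
# The data points are invented for illustration.

def fit_line(points):
    """Return slope a and intercept b minimizing the sum of squared errors."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

points = [(1, 2.1), (2, 3.9), (3, 6.0), (4, 8.1)]
a, b = fit_line(points)
print(round(a, 2), round(b, 2))  # 2.01 0.0
```

The closed-form solution used here is the standard one for a single predictor; classification and clustering tasks would use different objective functions.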
DEPT OF CSE & IT
A typical data mining system may have the following major components.
1. Knowledge Base:
This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns.
2. Data Mining Engine:
This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.
3. Pattern Evaluation Module:
This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns. It may use interestingness thresholds to filter out discovered patterns. Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining method used. For efficient data mining, it is highly recommended to push the evaluation of pattern interestingness as deep as possible into the mining process so as to confine the search to only the interesting patterns.
4. User interface:
This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results. In addition, this component allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.
The general experimental procedure adapted to data-mining problems involves the following
steps:
This step is concerned with how the data are generated and collected. In general, there are
two distinct possibilities. The first is when the data-generation process is under the
control of an expert (modeler): this approach is known as a designed experiment. The
second possibility is when the expert cannot influence the data- generation process: this is
known as the observational approach. An observational setting, namely, random data
generation, is assumed in most data-mining applications. Typically, the sampling
distribution is completely unknown after data are collected, or it is partially and implicitly
given in the data-collection procedure. It is very important, however, to understand how
data collection affects its theoretical distribution, since such a priori knowledge can be
very useful for modeling and, later, for the final interpretation of results. Also, it is
important to make sure that the data used for estimating a model and the data used later for testing and applying a model come from the same, unknown, sampling distribution. If this is not the case, the estimated model cannot be successfully used in a final application of the results.
In the observational setting, data are usually "collected" from the existing databases, data warehouses, and data marts. Data preprocessing usually includes at least two common tasks:
1. Outlier detection (and removal): Outliers are unusual data values that are not consistent with most observations. Commonly, outliers result from measurement errors, coding and recording errors, and, sometimes, are natural, abnormal values. Such nonrepresentative samples can seriously affect the model produced later. There are two strategies for dealing with outliers: detect and eventually remove the outliers as a part of the preprocessing phase, or develop robust modeling methods that are insensitive to outliers.
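A common way to implement the detect-and-remove strategy is to flag values far from the mean in standard-deviation units. The threshold of 2.0 and the sample readings below are illustrative choices, not part of the notes.

```python
# Sketch of simple statistical outlier detection: flag values whose
# z-score (distance from the mean in standard deviations) exceeds a threshold.
# The threshold of 2.0 and the sample values are illustrative assumptions.
from statistics import mean, stdev

def find_outliers(values, threshold=2.0):
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) / s > threshold]

readings = [12, 13, 12, 14, 13, 12, 98, 13]  # 98 looks like a keying error
print(find_outliers(readings))  # [98]
```

Real preprocessing pipelines often use more robust measures (e.g., median and interquartile range), since the mean and standard deviation are themselves distorted by the outliers.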
Data-preprocessing steps should not be considered completely independent from other data-mining phases. In every iteration of the data-mining process, all activities, together, could define new and improved data sets for subsequent iterations. Generally, a good preprocessing method provides an optimal representation for a data-mining technique by incorporating a priori knowledge in the form of application-specific scaling and encoding.
The selection and implementation of the appropriate data-mining technique is the main
task in this phase. This process is not straightforward; usually, in practice, the
implementation is based on several models, and selecting the best one is an additional
task. The basic principles of learning and discovery from data are given in Chapter 4 of this book. Later, Chapters 5 through 13 explain and analyze specific techniques that are applied to perform a successful learning process from data and to develop an appropriate model.
In most cases, data-mining models should help in decision making. Hence, such models
need to be interpretable in order to be useful because humans are not likely to base their
decisions on complex "black-box" models. Note that the goals of accuracy of the model
and accuracy of its interpretation are somewhat contradictory. Usually, simple models are
more interpretable, but they are also less accurate. Modern data-mining methods are
expected to yield highly accurate results using high-dimensional models. The problem of interpreting these models, also very important, is considered a separate task, with specific techniques to validate the results. A user does not want hundreds of pages of numeric results. He does not understand them; he cannot summarize, interpret, and use them for successful decision making.
The data mining system can be classified according to the following criteria:
Database Technology
Statistics
Machine Learning
Information Science
Visualization
Other Disciplines
We can classify a data mining system according to the kind of databases mined. Database systems can be classified according to different criteria such as data models, types of data, etc., and the data mining system can be classified accordingly. For example, if we classify the database according to the data model, then we may have a relational, transactional, object-relational, or data warehouse mining system.
We can classify a data mining system according to the kind of knowledge mined; that is, data mining systems are classified on the basis of functionalities such as:
Characterization
Discrimination
Classification
Prediction
Clustering
Outlier Analysis
Evolution Analysis
We can classify a data mining system according to the kind of techniques utilized; these techniques can be described according to the degree of user interaction involved or the methods of analysis employed.

We can also classify a data mining system according to the applications adapted, as follows:
Finance
DNA
Stock Markets
E-mail
Telecommunications

These applications are not all the same, and different users may be interested in different kinds of knowledge. Therefore, it is necessary for data mining to cover a broad range of knowledge discovery tasks.
Interactive mining of knowledge at multiple levels of abstraction. - The data mining process
needs to be interactive because it allows users to focus the search for patterns, providing and
refining data mining requests based on returned results.
Incorporation of background knowledge. - Background knowledge can be used to guide the discovery process and to express the discovered patterns, not only in concise terms but at multiple levels of abstraction.
Data mining query languages and ad hoc data mining. - Data Mining Query language that
allows the user to describe ad hoc mining tasks, should be integrated with a data warehouse
query language and optimized for efficient and flexible data mining.
Presentation and visualization of data mining results. - Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable by the users.
Handling noisy or incomplete data. - Data cleaning methods are required that can handle the noise and incomplete objects while mining the data regularities. If data cleaning methods are not there, then the accuracy of the discovered patterns will be poor.
Pattern evaluation. - This refers to the interestingness of the problem. The patterns discovered should be interesting to the user, rather than representing common knowledge or lacking novelty.
Efficiency and scalability of data mining algorithms. - In order to effectively extract the
information from huge amount of data in databases, data mining algorithm must be efficient
and scalable.
Parallel, distributed, and incremental mining algorithms. - Factors such as the huge size of databases, wide distribution of data, and complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are further processed in parallel; then the results from the partitions are merged. Incremental algorithms update databases without having to mine the data again from scratch.
Some people treat data mining the same as knowledge discovery, while others view data mining as an essential step in the process of knowledge discovery. Here is the list of steps involved in the knowledge discovery process:
Data Cleaning - In this step, the noise and inconsistent data are removed.
Data Transformation - In this step, data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
Data Mining - In this step, intelligent methods are applied in order to extract data patterns.
Architecture of KDD
Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For
example, "sales" can be a particular subject.
Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product.
Time-Variant: Historical data is kept in a data warehouse. This contrasts with a transactions system, where often only the most recent data is kept. For example, a transaction system may hold the most recent address of a customer, whereas a data warehouse can hold all addresses associated with a customer.
Non-volatile: Once data is in the data warehouse, it will not change; historical data in a data warehouse should never be altered.

A data warehouse can be built using a top-down approach, a bottom-up approach, or a combination of both.
The top-down approach starts with the overall design and planning. It is useful in cases
where the technology is mature and well known, and where the business problems that must
be solved are clear and well understood.
The bottom-up approach starts with experiments and prototypes. This is useful in the early
stage of business modeling and technology development. It allows an organization to move
forward at considerably less expense and to evaluate the benefits of the technology before
making significant commitments.
In the combined approach, an organization can exploit the planned and strategic nature of
the top-down approach while retaining the rapid implementation and opportunistic
application of the bottom-up approach.
Choose a business process to model. If the business process is organizational and involves multiple complex object collections, a data warehouse model should be followed. However, if the process is departmental and focuses on the analysis of one kind of business process, a data mart model should be chosen.
Choose the grain of the business process. The grain is the fundamental, atomic level of data to be represented in the fact table for this process, for example, individual transactions, individual daily snapshots, and so on.
Choose the dimensions that will apply to each fact table record. Typical dimensions are
time, item, customer, supplier, warehouse, transaction type, and status.
Choose the measures that will populate each fact table record. Typical measures are numeric additive quantities such as dollars_sold and units_sold.
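The grain, dimension, and measure choices above can be sketched as a minimal star schema. The table and column names (sales_fact, time_dim, item_dim, dollars_sold, units_sold) are illustrative assumptions, not prescribed by the notes.

```python
# Minimal star-schema sketch in SQLite: one fact table whose grain is the
# individual sale, two dimension tables, and numeric additive measures.
# Table and column names are illustrative assumptions.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim,
    item_key INTEGER REFERENCES item_dim,
    dollars_sold REAL,   -- numeric additive measure
    units_sold  INTEGER  -- numeric additive measure
);
""")
con.execute("INSERT INTO time_dim VALUES (1, '2024-01-05', '2024-01')")
con.execute("INSERT INTO item_dim VALUES (10, 'printer')")
con.execute("INSERT INTO sales_fact VALUES (1, 10, 250.0, 2)")
con.execute("INSERT INTO sales_fact VALUES (1, 10, 125.0, 1)")

# Additive measures aggregate over any combination of dimensions.
total = con.execute(
    "SELECT SUM(dollars_sold), SUM(units_sold) FROM sales_fact"
).fetchone()
print(total)  # (375.0, 3)
```

Because the measures are additive, the same fact rows can be rolled up by time, by item, or by both without restating the data.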
Tier-1:
The bottom tier is a warehouse database server that is almost always a relational database system. Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other external sources (such as customer profile information provided by external consultants). These tools and utilities perform data extraction, cleaning, and transformation (e.g., to merge similar data from different sources into a unified format), as well as load and refresh functions to update the data warehouse. The data are extracted using application program interfaces known as gateways. A gateway is supported by the underlying DBMS and allows client programs to generate SQL code to be executed at a server.
Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity).
This tier also contains a metadata repository, which stores information about the data warehouse and its contents.
Tier-2:
The middle tier is an OLAP server that is typically implemented using either a relational OLAP (ROLAP) model or a multidimensional OLAP (MOLAP) model.
Tier-3:
The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
1. Enterprise warehouse:
An enterprise warehouse collects all of the information about subjects spanning the entire
organization.
It provides corporate-wide data integration, usually from one or more operational systems or external information providers, and is cross-functional in scope.
It typically contains detailed data as well as summarized data, and can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
2. Data mart:
A data mart contains a subset of corporate-wide data that is of value to a specific group of users. The scope is confined to specific selected subjects. For example, a marketing data mart may confine its subjects to customer, item, and sales. The data contained in data marts tend to be summarized.
3. Virtual warehouse:
A virtual warehouse is a set of views over operational databases. A virtual warehouse is easy to build but requires excess capacity on operational database servers.
Metadata are data about data. When used in a data warehouse, metadata are the data that define warehouse objects. Metadata are created for the data names and definitions of the given warehouse. Additional metadata are created and captured for timestamping any extracted data, the source of the extracted data, and missing fields that have been added by data cleaning or integration processes.
A metadata repository should contain the following:
A description of the structure of the data warehouse, which includes the warehouse
schema, view, dimensions, hierarchies, and derived data definitions, as well as data mart
locations and contents.
Operational metadata, which include data lineage (history of migrated data and the
sequence of transformations applied to it), currency of data (active, archived, or purged),
and monitoring information (warehouse usage statistics, error reports, and audit trails).
The algorithms used for summarization, which include measure and dimension definition algorithms, data on granularity, partitions, subject areas, aggregation, summarization, and predefined queries and reports.
The mapping from the operational environment to the data warehouse, which includes source databases and their contents, gateway descriptions, data partitions, data extraction, cleaning, transformation rules and defaults, and data refresh and purging rules.
Data related to system performance, which include indices and profiles that improve data access and retrieval performance, in addition to rules for the timing and scheduling of refresh, update, and replication cycles.
Business metadata, which include business terms and definitions, data ownership information, and charging policies.
OLAP tools enable users to analyze multidimensional data interactively from multiple
perspectives.
Consolidation (roll-up) involves the aggregation of data that can be accumulated and computed in one or more dimensions. For example, all sales offices are rolled up to the sales department or sales division to anticipate sales trends.
Drill-down is a technique that allows users to navigate through the details. For instance, users can view the sales by the individual products that make up a region's sales.
Slicing and dicing is a feature whereby users can take out (slicing) a specific set of data of the OLAP cube and view (dicing) the slices from different viewpoints.
ROLAP works directly with relational databases. The base data and the dimension tables are stored as relational tables, and new tables are created to hold the aggregated information. It depends on a specialized schema design.
This methodology relies on manipulating the data stored in the relational database to
give the appearance of traditional OLAP's slicing and dicing functionality. In essence,
each action of slicing and dicing is equivalent to adding a "WHERE" clause in the
SQL statement.
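The point that each slice or dice corresponds to a WHERE clause can be shown on a toy relational sales table. The table, regions, and figures below are invented for illustration.

```python
# Sketch of ROLAP-style slicing: selecting one member of a dimension
# is equivalent to adding a WHERE clause to the SQL query.
# The sales table and its rows are invented for illustration.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, quarter TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("East", "Q1", 100.0), ("East", "Q2", 150.0),
    ("West", "Q1", 80.0),  ("West", "Q2", 120.0),
])

# Slice on region = 'East': the slice becomes a WHERE predicate.
rows = con.execute(
    "SELECT quarter, SUM(amount) FROM sales "
    "WHERE region = 'East' GROUP BY quarter ORDER BY quarter"
).fetchall()
print(rows)  # [('Q1', 100.0), ('Q2', 150.0)]
```

Dicing on several dimensions at once simply adds more predicates to the WHERE clause (e.g., `WHERE region = 'East' AND quarter = 'Q1'`).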
ROLAP tools do not use pre-calculated data cubes but instead pose the query to the
standard relational database and its tables in order to bring back the data required to
answer the question.
ROLAP tools feature the ability to ask any question because the methodology is not limited to the contents of a cube. ROLAP also has the ability to drill down to the lowest level of detail in the database.
MOLAP is the 'classic' form of OLAP and is sometimes referred to as just OLAP. MOLAP stores data in an optimized multi-dimensional array storage, rather than in a relational database.
MOLAP tools generally utilize a pre-calculated data set referred to as a data cube. The data cube contains all the possible answers to a given range of questions.
MOLAP tools have a very fast response time and the ability to quickly write back data into the data set.
There is no clear agreement across the industry as to what constitutes Hybrid OLAP, except that a database will divide data between relational and specialized storage.
For example, for some vendors, a HOLAP database will use relational tables to hold
the larger quantities of detailed data, and use specialized storage for at least some
aspects of the smaller quantities of more-aggregate or less-detailed data.
HOLAP addresses the shortcomings of MOLAP and ROLAP by combining the
capabilities of both approaches.
HOLAP tools can utilize both pre-calculated cubes and relational data sources.
1. Entity identification problem:
How can the data analyst or the computer be sure that customer_id in one database and customer_number in another refer to the same attribute?
2. Redundancy:
An attribute may be redundant if it can be derived from another attribute or set of attributes.
3. Detection and resolution of data value conflicts:
For the same real-world entity, attribute values from different sources may differ.
In data transformation, the data are transformed or consolidated into forms appropriate for mining.
Aggregation, where summary or aggregation operations are applied to the data. For example, the daily sales data may be aggregated so as to compute monthly and annual total amounts. This step is typically used in constructing a data cube for analysis of the data at multiple granularities.
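The daily-to-monthly aggregation step described above can be sketched as follows; the dates and amounts are invented for illustration.

```python
# Sketch of the aggregation step: roll daily sales up to monthly totals.
# The daily figures are invented for illustration.
from collections import defaultdict

daily_sales = {
    "2024-01-03": 120.0, "2024-01-17": 80.0,
    "2024-02-05": 200.0, "2024-02-20": 50.0,
}

monthly = defaultdict(float)
for day, amount in daily_sales.items():
    monthly[day[:7]] += amount  # key by "YYYY-MM"

print(dict(monthly))  # {'2024-01': 200.0, '2024-02': 250.0}
```

The same grouping, repeated at a coarser key (the year), would yield the annual totals mentioned in the text.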
Generalization of the data, where low-level or primitive (raw) data are replaced by higher-level concepts through the use of concept hierarchies. For example, categorical attributes, like street, can be generalized to higher-level concepts, like city or country.
Normalization, where the attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0, or 0.0 to 1.0.
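Scaling attribute values into such a range is commonly done with min-max normalization, sketched below on invented sample values.

```python
# Min-max normalization: scale attribute values into [new_min, new_max].
# The sample values are invented for illustration.
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    return [(v - old_min) / span * (new_max - new_min) + new_min
            for v in values]

print(min_max_normalize([20, 30, 40, 60]))  # [0.0, 0.25, 0.5, 1.0]
```

Note that min-max scaling is sensitive to outliers, which is one reason the outlier-detection step usually precedes it.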
Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data. That is, mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results.
Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube.
Dimensionality reduction, where encoding mechanisms are used to reduce the data set size.
Chapter-2
Association rule mining is a popular and well-researched method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using different measures of interestingness. Based on the concept of strong rules, Rakesh Agrawal et al. introduced association rules.
Problem Definition:
Let I = {i1, i2, ..., in} be a set of n binary attributes called items, and let D = {t1, t2, ..., tm} be a set of transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form X => Y, where X, Y ⊆ I and X ∩ Y = ∅. The itemsets X and Y are called the antecedent (left-hand side, LHS) and consequent (right-hand side, RHS) of the rule.
Example:
To illustrate the concepts, we use a small example from the supermarket domain. The set of items is I = {milk, bread, butter, beer}, and a small database contains the items (1 codes presence and 0 absence of an item in a transaction). An example rule for the supermarket could be {butter, bread} => {milk}, meaning that if butter and bread are bought, customers also buy milk.
Example database with five transactions (1 codes presence, 0 absence):

Transaction ID   milk   bread   butter   beer
1                1      1       0        0
2                0      0       1        0
3                0      0       0        1
4                1      1       1        0
5                0      1       0        0

The support supp(X) of an itemset X is defined as the proportion of transactions in the data set which contain the itemset. In the example database, the itemset {milk, bread, butter} has a support of 1/5 = 0.2, since it occurs in 20% of all transactions.
The confidence of a rule is defined as conf(X => Y) = supp(X ∪ Y) / supp(X). For example, the rule {butter, bread} => {milk} has a confidence of 0.2/0.2 = 1.0 in the database, which means that for 100% of the transactions containing butter and bread the rule is correct (100% of the times a customer buys butter and bread, milk is bought as well). Confidence can be interpreted as an estimate of the probability P(Y | X), the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS.
The lift of a rule is defined as lift(X => Y) = supp(X ∪ Y) / (supp(X) × supp(Y)), or the ratio of the observed support to that expected if X and Y were independent. The rule {milk, bread} => {butter} has a lift of 0.2 / (0.4 × 0.4) = 1.25.
The conviction of a rule is defined as conv(X => Y) = (1 - supp(Y)) / (1 - conf(X => Y)). The rule {milk, bread} => {butter} has a conviction of (1 - 0.4) / (1 - 0.5) = 1.2, and can be interpreted as the ratio of the expected frequency that X occurs without Y (that is to say, the frequency that the rule makes an incorrect prediction) if X and Y were independent divided by the observed frequency of incorrect predictions.
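The measures above can be checked directly, assuming the five transactions of the supermarket example are {milk, bread}, {butter}, {beer}, {milk, bread, butter}, and {bread}.

```python
# Computing support, confidence, lift, and conviction for the
# supermarket example, assuming the five transactions listed below.
transactions = [
    {"milk", "bread"},
    {"butter"},
    {"beer"},
    {"milk", "bread", "butter"},
    {"bread"},
]

def supp(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def conf(lhs, rhs):
    return supp(lhs | rhs) / supp(lhs)

def lift(lhs, rhs):
    return supp(lhs | rhs) / (supp(lhs) * supp(rhs))

def conviction(lhs, rhs):
    return (1 - supp(rhs)) / (1 - conf(lhs, rhs))

print(supp({"milk", "bread", "butter"}))                   # 0.2
print(conf({"butter", "bread"}, {"milk"}))                 # 1.0
print(round(lift({"milk", "bread"}, {"butter"}), 2))       # 1.25
print(round(conviction({"milk", "bread"}, {"butter"}), 2)) # 1.2
```

Each function is a direct transcription of the corresponding formula, so the printed values match the ones derived in the text.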
Market Basket Analysis:
This process analyzes customer buying habits by finding associations between the different items that customers place in their shopping baskets. The discovery of such associations can help retailers develop marketing strategies by gaining insight into which items are frequently purchased together by customers. For instance, if customers are buying milk, how likely are they to also buy bread (and what kind of bread) on the same trip to the supermarket? Such information can lead to increased sales by helping retailers do selective marketing and plan their shelf space.
Example:
If customers who purchase computers also tend to buy antivirus software at the same time, then placing the hardware display close to the software display may help increase the sales of both items. In an alternative strategy, placing hardware and software at opposite ends of the store may entice customers who purchase such items to pick up other items along the way. For instance, after deciding on an expensive computer, a customer may observe security systems for sale while heading toward the software display to purchase antivirus software, and may decide to purchase a home security system as well. Market basket analysis can also help retailers plan which items to put on sale at reduced prices. If customers tend to purchase computers and printers together, then having a sale on printers may encourage the sale of printers as well as computers.
We can mine the complete set of frequent itemsets, the closed frequent itemsets, and the
maximal frequent itemsets, given a minimum support threshold.
We can also mine constrained frequent itemsets, approximate frequent itemsets, near-match frequent itemsets, top-k frequent itemsets, and so on.
Some methods for association rule mining can find rules at differing levels of abstraction. For example, suppose that a set of association rules mined includes the following rules, where X is a variable representing a customer:
(1) buys(X, "computer") => buys(X, "HP printer")
(2) buys(X, "laptop computer") => buys(X, "HP printer")
In rules (1) and (2), the items bought are referenced at different levels of abstraction (e.g., computer is a higher-level abstraction of laptop computer).
If the items or attributes in an association rule reference only one dimension, then it is a single-dimensional association rule.
If a rule references two or more dimensions, such as the dimensions age, income, and buys, then it is a multidimensional association rule. The following rule is an example of a multidimensional rule:
age(X, "30...39") ∧ income(X, "42K...48K") => buys(X, "high resolution TV")
Frequent pattern analysis can generate various kinds of rules and other interesting
relationships.
Association rule mining can generate a large number of rules, many of which are redundant or do not indicate a correlation relationship among itemsets.
Many kinds of frequent patterns can be mined from different kinds of data sets.
Sequential pattern mining searches for frequent subsequences in a sequence data set,
where a sequence records an ordering of events.
For example, with sequential pattern mining, we can study the order in which items are frequently purchased. For instance, customers may tend to first buy a PC, followed by a digital camera, and then a memory card.
Structured pattern mining searches for frequent substructures in a structured data set.
Single items are the simplest form of structure.
Apriori Algorithm
Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining frequent itemsets for Boolean association rules. The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent itemset properties.
Apriori employs an iterative approach known as a level-wise search, where k-itemsets are
used to explore (k+1)-itemsets.
First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item, and collecting those items that satisfy minimum support. The resulting set is denoted L1. Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found.
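The level-wise search can be sketched as follows. This is a simplified version of Apriori: candidate generation uses the plain join of frequent (k-1)-itemsets without the full subset-based pruning optimization of the original algorithm.

```python
# Simplified sketch of Apriori's level-wise search: frequent k-itemsets (Lk)
# are used to generate candidate (k+1)-itemsets, which are then counted
# against the database. Candidate generation here is the plain join,
# without the full subset-pruning step of the original algorithm.

def apriori(transactions, min_sup):
    items = sorted({i for t in transactions for i in t})
    # L1: frequent 1-itemsets
    level = [frozenset([i]) for i in items
             if sum(i in t for t in transactions) >= min_sup]
    frequent = list(level)
    k = 2
    while level:
        # Join step: candidates are k-item unions of frequent (k-1)-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Scan the database to keep only candidates with enough support.
        level = [c for c in candidates
                 if sum(c <= t for t in transactions) >= min_sup]
        frequent.extend(level)
        k += 1
    return frequent

# Transaction database from the worked example below (min_sup = 2).
db = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
      {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
      {"I1", "I2", "I3"}]
result = apriori(db, min_sup=2)
print(frozenset({"I1", "I2", "I3"}) in result)  # True
```

The `while` loop is the level-wise search: it stops as soon as some level Lk comes out empty, exactly as described in the text.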
Example:
TID     List of item IDs
T100    I1, I2, I5
T200    I2, I4
T300    I2, I3
T400    I1, I2, I4
T500    I1, I3
T600    I2, I3
T700    I1, I3
T800    I1, I2, I3, I5
T900    I1, I2, I3
Steps:
1. In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1. The algorithm simply scans all of the transactions in order to count the number of occurrences of each item.
2. Suppose that the minimum support count required is 2, that is, min_sup = 2. The set of frequent 1-itemsets, L1, can then be determined. It consists of the candidate 1-itemsets satisfying minimum support. In our example, all of the candidates in C1 satisfy minimum support.
3. To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ⋈ L1 to generate a candidate set of 2-itemsets, C2. No candidates are removed from C2 during the prune step because each subset of the candidates is also frequent.
4. Next, the transactions in D are scanned and the support count of each candidate itemset in C2 is accumulated.
5. The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.
6. The generation of the set of candidate 3-itemsets, C3: from the join step, we first get C3 = L2 ⋈ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent, and so they are removed from C3.
7. The transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
For many applications, it is difficult to find strong associations among data items at low
or primitive levels of abstraction due to the sparsity of data at those levels.
Strong associations discovered at high levels of abstraction may represent commonsense
knowledge.
Therefore, data mining systems should provide capabilities for mining association rules at multiple levels of abstraction, with sufficient flexibility for easy traversal among different abstraction spaces.
Association rules generated from mining data at multiple levels of abstraction are called multiple-level or multilevel association rules.
Multilevel association rules can be mined efficiently using concept hierarchies under a
support-confidence framework.
In general, a top-down strategy is employed, where counts are accumulated for the calculation of frequent itemsets at each concept level, starting at concept level 1 and working downward in the hierarchy toward the more specific concept levels, until no more frequent itemsets can be found.
Data can be generalized by replacing low-level concepts within the data by their higher-level concepts, or ancestors, from a concept hierarchy.
The concept hierarchy has five levels, respectively referred to as levels 0 to 4, starting with level 0 at the root node for all.
Level 3 includes IBM desktop computer, . . . , Microsoft office software, and so on.
Level 4 is the most specific abstraction level of this hierarchy.
1. Uniform Minimum Support:
The same minimum support threshold is used when mining at each level of abstraction. When a uniform minimum support threshold is used, the search procedure is simplified. The method is also simple in that users are required to specify only one minimum support threshold.
The uniform support approach, however, has some difficulties. It is unlikely that items at lower levels of abstraction will occur as frequently as those at higher levels of abstraction. If the minimum support threshold is set too high, it could miss some meaningful associations occurring at low abstraction levels. If the threshold is set too low, it may generate many uninteresting associations occurring at high abstraction levels.
2. Reduced Minimum Support:
The deeper the level of abstraction, the smaller the corresponding threshold is. For example, the minimum support thresholds for levels 1 and 2 are 5% and 3%, respectively.
In this way, computer, laptop computer, and desktop computer are all considered frequent.
3. Group-Based Minimum Support:
Because users or experts often have insight as to which groups are more important than others, it is sometimes more desirable to set up user-specific, item-based, or group-based minimum support thresholds when mining multilevel rules.
For example, a user could set up the minimum support thresholds based on product price, or on items of interest, such as by setting particularly low support thresholds for laptop computers and flash drives in order to pay particular attention to the association patterns containing items in these categories.
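To make the reduced-support idea concrete, here is a minimal sketch; the item names, counts, and per-level thresholds below are hypothetical, chosen to mirror the 5%/3% example in the text:

```python
# Sketch of multilevel mining with reduced minimum support:
# each concept level gets its own (smaller) threshold.
# Item names, counts, and thresholds are hypothetical examples.
def frequent_at_level(item_counts, total_transactions, min_support):
    """Return items whose support meets the level's threshold."""
    return {item for item, count in item_counts.items()
            if count / total_transactions >= min_support}

total = 100
# Level 1 (general concepts) uses a 5% threshold, level 2 uses 3%.
level1_counts = {"computer": 10, "printer": 4}
level2_counts = {"laptop computer": 6, "desktop computer": 4}

level1_frequent = frequent_at_level(level1_counts, total, 0.05)
level2_frequent = frequent_at_level(level2_counts, total, 0.03)
```

With these counts, computer (10%), laptop computer (6%), and desktop computer (4%) all pass their level's threshold, while printer (4%) fails the stricter level-1 threshold.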
Association rules that contain a single predicate (e.g., buys) with multiple occurrences (i.e., the predicate occurs more than once within the rule) are called single-dimensional or intradimensional association rules.
Association rules that involve two or more dimensions or predicates can be referred to
as multidimensional association rules.
Multidimensional association rules with no repeated predicates are called interdimensional association rules.
We can also mine multidimensional association rules with repeated predicates, which contain multiple occurrences of some predicates. These rules are called hybrid-dimensional association rules. An example of such a rule is the following, where the predicate buys is repeated:
Quantitative association rules are multidimensional association rules in which the numeric
attributes are dynamically discretized during the mining process so as to satisfy some mining
criteria, such as maximizing the confidence or compactness of the rules mined.
In this section, we focus specifically on how to mine quantitative association rules having two quantitative attributes on the left-hand side of the rule and one categorical attribute on the right-hand side of the rule.
That is, a correlation rule is measured not only by its support and confidence but also by the correlation between itemsets A and B. There are many different correlation measures from which to choose. In this section, we study various correlation measures to determine which would be good for mining large data sets.
Lift is a simple correlation measure that is given as follows. The occurrence of itemset A is independent of the occurrence of itemset B if P(A ∪ B) = P(A)P(B); otherwise, itemsets A and B are dependent and correlated as events. This definition can easily be extended to more than two itemsets. The lift between the occurrence of A and B is computed as lift(A, B) = P(A ∪ B) / (P(A)P(B)).
If lift(A, B) is less than 1, then the occurrence of A is negatively correlated with the occurrence of B. If the resulting value is greater than 1, then A and B are positively correlated, meaning that the occurrence of one implies the occurrence of the other. If the value is equal to 1, then A and B are independent and there is no correlation between them.
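The lift measure can be computed directly from transaction data. A minimal sketch (the toy transactions are hypothetical):

```python
# Sketch: computing lift(A, B) = P(A u B) / (P(A) * P(B)) from transactions.
# The toy transactions and item names are hypothetical examples.
def lift(transactions, a, b):
    n = len(transactions)
    p_a = sum(a <= t for t in transactions) / n        # P(A): A appears in t
    p_b = sum(b <= t for t in transactions) / n        # P(B)
    p_ab = sum((a | b) <= t for t in transactions) / n  # P(A u B): both appear
    return p_ab / (p_a * p_b)

transactions = [
    {"computer", "software"},
    {"computer", "software"},
    {"computer"},
    {"software"},
]
result = lift(transactions, {"computer"}, {"software"})
```

Here P(A) = P(B) = 0.75 and P(A ∪ B) = 0.5, so lift ≈ 0.89 < 1, indicating a negative correlation between the two items.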
Chapter-3
Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends.
Classification predicts categorical (discrete, unordered) labels, whereas prediction models continuous-valued functions.
Regression analysis is a statistical methodology that is most often used for numeric
prediction.
Many classification and prediction methods have been proposed by researchers in machine
learning, pattern recognition, and statistics.
Most algorithms are memory resident, typically assuming a small data size. Recent data mining research has built on such work, developing scalable classification and prediction techniques capable of handling large, disk-resident data.
(i) Data cleaning:
This refers to the preprocessing of data in order to remove or reduce noise (by applying smoothing techniques) and the treatment of missing values (e.g., by replacing a missing value with the most commonly occurring value for that attribute, or with the most probable value based on statistics).
Although most classification algorithms have some mechanisms for handling noisy or missing data, this step can help reduce confusion during learning.
(ii) Relevance analysis:
Correlation analysis can be used to identify whether any two given attributes are statistically related.
For example, a strong correlation between attributes A1 and A2 would suggest that one of
the two could be removed from further analysis.
A database may also contain irrelevant attributes. Attribute subset selection can be used in these cases to find a reduced set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes.
Hence, relevance analysis, in the form of correlation analysis and attribute subset
selection, can be used to detect attributes that do not contribute to the classification or
prediction task.
(iii) Data transformation and reduction:
The data may be generalized to higher-level concepts; concept hierarchies may be used for this purpose. This is particularly useful for continuous-valued attributes.
For example, numeric values for the attribute income can be generalized to discrete
ranges, such as low, medium, and high. Similarly, categorical attributes, like street, can
be generalized to higher-level concepts, like city.
Data can also be reduced by applying many other methods, ranging from wavelet transformation and principal components analysis to discretization techniques, such as binning, histogram analysis, and clustering.
Accuracy:
The accuracy of a classifier refers to the ability of a given classifier to correctly predict the class label of new or previously unseen data (i.e., tuples without class label information).
The accuracy of a predictor refers to how well a given predictor can guess the value of the predicted attribute for new or previously unseen data.
Speed:
This refers to the computational costs involved in generating and using the given classifier or predictor.
Robustness:
This is the ability of the classifier or predictor to make correct predictions given noisy data or data with missing values.
Scalability:
This refers to the ability to construct the classifier or predictor efficiently given large amounts of data.
Interpretability:
This refers to the level of understanding and insight that is provided by the classifier or predictor.
The construction of decision tree classifiers does not require any domain knowledge or parameter setting, and is therefore appropriate for exploratory knowledge discovery.
The learning and classification steps of decision tree induction are simple and fast. In general, decision tree classifiers have good accuracy.
Decision tree induction algorithms have been used for classification in many application areas, such as medicine, manufacturing and production, financial analysis, astronomy, and molecular biology.
(Figure: basic decision tree induction algorithm; its inputs include the data partition D, the attribute list, and an Attribute selection method.)
If the tuples in D are all of the same class, then node N becomes a leaf and is labeled with that class. All of the terminating conditions are explained at the end of the algorithm.
Otherwise, the algorithm calls Attribute selection method to determine the splitting
criterion.
The splitting criterion tells us which attribute to test at node N by determining the best way to separate or partition the tuples in D into individual classes.
There are three possible scenarios. Let A be the splitting attribute. A has v distinct values based on the training data.
1. A is discrete-valued:
In this case, the outcomes of the test at node N correspond directly to the known values of A. A branch is created for each known value, aj, of A and labeled with that value. A need not be considered in any future partitioning of the tuples.
2. A is continuous-valued:
In this case, the test at node N has two possible outcomes, corresponding to the conditions A <= split_point and A > split_point, where split_point is the split point returned by Attribute selection method as part of the splitting criterion.
3. A is discrete-valued and a binary tree must be produced:
The test at node N is of the form A ∈ SA, where SA is the splitting subset for A, returned by Attribute selection method as part of the splitting criterion. It is a subset of the known values of A.
(Figure: three possibilities for partitioning tuples based on the splitting criterion: (a) A is discrete-valued; (b) A is continuous-valued; (c) A is discrete-valued and a binary tree must be produced.)
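The excerpt leaves Attribute selection method unspecified; one common choice is information gain, as used by ID3. A sketch under that assumption (the training tuples and class labels are hypothetical):

```python
import math
from collections import Counter

# Sketch of one common Attribute selection method: information gain (as in ID3).
# The excerpt does not fix a particular measure; this is an illustrative choice.
def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Expected reduction in entropy from splitting on a discrete attribute."""
    base = entropy(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in partitions.values())
    return base - remainder

# Hypothetical tuples: (age_group, income); class label = buys_computer
rows = [("youth", "high"), ("youth", "low"), ("senior", "high"), ("senior", "low")]
labels = ["no", "no", "yes", "yes"]
gain_age = information_gain(rows, labels, 0)     # splitting on age_group
gain_income = information_gain(rows, labels, 1)  # splitting on income
```

In this toy data, splitting on age_group separates the classes perfectly (gain 1.0 bit), while splitting on income gives no gain, so age_group would be chosen as the test at node N.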
Let H be some hypothesis, such as that the data tuple X belongs to a specified class C.
For classification problems, we want to determine P(H|X), the probability that the hypothesis
H holds given the evidence or observed data tuple X.
P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X.
Bayes' theorem is useful in that it provides a way of calculating the posterior probability, P(H|X), from P(H), P(X|H), and P(X): P(H|X) = P(X|H) P(H) / P(X).
1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is represented by an n-dimensional attribute vector, X = (x1, x2, ..., xn), depicting n measurements made on the tuple from n attributes, respectively, A1, A2, ..., An.
2. Suppose that there are m classes, C1, C2, ..., Cm. Given a tuple, X, the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X.
That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i. Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis.
3. As P(X) is constant for all classes, only P(X|Ci)P(Ci) need be maximized. If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, that is, P(C1) = P(C2) = ... = P(Cm), and we would therefore maximize P(X|Ci). Otherwise, we maximize P(X|Ci)P(Ci).
4. Given data sets with many attributes, it would be extremely computationally expensive to compute P(X|Ci). In order to reduce computation in evaluating P(X|Ci), the naive assumption of class-conditional independence is made. This presumes that the values of the attributes are conditionally independent of one another, given the class label of the tuple. Thus, P(X|Ci) = ∏(k=1..n) P(xk|Ci).
If Ak is categorical, then P(xk|Ci) is the number of tuples of class Ci in D having the value xk for Ak, divided by |Ci,D|, the number of tuples of class Ci in D.
If Ak is continuous-valued, then we need to do a bit more work, but the calculation is pretty straightforward.
5. In order to predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class Ci.
The classifier predicts that the class label of tuple X is the class Ci if and only if P(X|Ci)P(Ci) > P(X|Cj)P(Cj) for 1 ≤ j ≤ m, j ≠ i.
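Steps 1 through 5 can be sketched for categorical attributes as follows; the training tuples and labels are hypothetical:

```python
from collections import Counter, defaultdict

# Sketch of a naive Bayesian classifier for categorical attributes,
# following steps 1-5 above. Training data below are hypothetical.
def train(tuples, labels):
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, attr_index) -> value counts
    for x, c in zip(tuples, labels):
        for k, v in enumerate(x):
            value_counts[(c, k)][v] += 1
    return class_counts, value_counts

def predict(x, class_counts, value_counts):
    n = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, count in class_counts.items():
        score = count / n  # P(Ci)
        for k, v in enumerate(x):
            score *= value_counts[(c, k)][v] / count  # P(xk|Ci)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

tuples = [("youth", "high"), ("youth", "medium"), ("senior", "low"), ("senior", "medium")]
labels = ["no", "no", "yes", "yes"]
model = train(tuples, labels)
pred = predict(("senior", "medium"), *model)
```

For the query tuple ("senior", "medium"), P(X|yes)P(yes) = 1 × 0.5 × 0.5 = 0.25 beats the score for class "no" (which is 0, since no "no" tuple is "senior"), so "yes" is predicted.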
The inputs to the network correspond to the attributes measured for each training tuple. The
inputs are fed simultaneously into the units making up the input layer. These inputs pass
through the input layer and are then weighted and fed simultaneously to a second layer
known as a hidden layer.
The outputs of the hidden layer units can be input to another hidden layer, and so on. The
number of hidden layers is arbitrary.
The weighted outputs of the last hidden layer are input to units making up the output layer, which emits the network's prediction for given tuples.
Neural networks involve long training times and are therefore more suitable for applications where this is feasible.
Backpropagation learns by iteratively processing a data set of training tuples, comparing the network's prediction for each tuple with the actual known target value.
The target value may be the known class label of the training tuple (for classification problems) or a continuous value (for prediction). For each training tuple, the weights are modified so as to minimize the mean squared error between the network's prediction and the actual target value. These modifications are made in the "backwards" direction, that is, from the output layer, through each hidden layer down to the first hidden layer; hence the name backpropagation.
Although it is not guaranteed, in general the weights will eventually converge, and the
learning process stops.
Advantages:
These include a high tolerance of noisy data as well as the ability to classify patterns on which the network has not been trained.
They can be used when you may have little knowledge of the relationships between attributes and classes.
They are well suited for continuous-valued inputs and outputs, unlike most decision tree algorithms.
They have been successful on a wide array of real-world data, including handwritten
character recognition, pathology and laboratory medicine, and training a computer to
pronounce English text.
Neural network algorithms are inherently parallel; parallelization techniques can be used to speed up the computation process.
Process:
First, the training tuple is fed to the input layer of the network. The inputs pass through the input units, unchanged. That is, for an input unit j, its output, Oj, is equal to its input value, Ij. Next, the net input and output of each unit in the hidden and output layers are computed. The net input to a unit in the hidden or output layers is computed as a linear combination of its inputs.
Each such unit has a number of inputs to it that are, in fact, the outputs of the units connected to it in the previous layer. Each connection has a weight. To compute the net input to the unit, each input connected to the unit is multiplied by its corresponding weight, and this is summed.
Given a unit j in a hidden or output layer, the net input is Ij = Σi wij Oi + θj, where wij is the weight of the connection from unit i in the previous layer to unit j; Oi is the output of unit i from the previous layer; and θj is the bias of the unit, which acts as a threshold in that it serves to vary the activity of the unit.
Each unit in the hidden and output layers takes its net input and then applies an activation function to it, typically the logistic, or sigmoid, function: Oj = 1 / (1 + e^(−Ij)).
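The net input and sigmoid activation for a single unit can be sketched as follows; the weights, bias, and previous-layer outputs are hypothetical:

```python
import math

# Sketch of computing net input and sigmoid output for one hidden/output unit:
# I_j = sum_i(w_ij * O_i) + theta_j, then O_j = 1 / (1 + e^(-I_j)).
# Weights, inputs, and bias below are hypothetical.
def unit_output(inputs, weights, bias):
    net_input = sum(w * o for w, o in zip(weights, inputs)) + bias  # I_j
    return 1.0 / (1.0 + math.exp(-net_input))                      # O_j (sigmoid)

outputs_prev_layer = [1.0, 0.0, 1.0]   # O_i from the previous layer
weights = [0.2, -0.3, 0.4]             # w_ij into unit j
bias = -0.4                            # theta_j
o_j = unit_output(outputs_prev_layer, weights, bias)
```

Here the net input is 0.2 + 0 + 0.4 − 0.4 = 0.2, and the sigmoid squashes it to an output of about 0.55.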
For a unit j in the output layer, the error Errj is computed as Errj = Oj (1 − Oj)(Tj − Oj), where Oj is the actual output of unit j, and Tj is the known target value of the given training tuple.
For a unit j in a hidden layer, the error is Errj = Oj (1 − Oj) Σk Errk wjk, where wjk is the weight of the connection from unit j to a unit k in the next higher layer, and Errk is the error of unit k.
Weights are updated by the following equations, where Δwij is the change in weight wij: Δwij = (l) Errj Oi and wij = wij + Δwij, where l is the learning rate, a constant typically having a value between 0.0 and 1.0.
k-Nearest-Neighbor Classifiers:
Nearest-neighbor classifiers are based on learning by analogy. When given an unknown tuple, a k-nearest-neighbor classifier searches the pattern space for the k training tuples that are closest to the unknown tuple. These k training tuples are the k nearest neighbors of the unknown tuple.
"Closeness" is defined in terms of a distance metric, such as Euclidean distance: dist(X1, X2) = sqrt(Σi (x1i − x2i)²). In other words, for each numeric attribute, we take the difference between the corresponding values of that attribute in tuple X1 and in tuple X2, square this difference, and accumulate it. The square root is taken of the total accumulated distance count.
When k = 1, the unknown tuple is assigned the class of the training tuple that is closest to it in pattern space.
Nearest-neighbor classifiers can also be used for prediction, that is, to return a real-valued prediction for a given unknown tuple. In this case, the classifier returns the average value of the real-valued labels associated with the k nearest neighbors of the unknown tuple.
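The procedure above can be sketched as a small k-nearest-neighbor classifier; the training tuples and labels are hypothetical:

```python
import math
from collections import Counter

# Sketch of a k-nearest-neighbor classifier using Euclidean distance,
# as described above. The training tuples and labels are hypothetical.
def euclidean(x1, x2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def knn_classify(train, labels, x, k):
    # Sort training tuples by distance to x, take the k closest,
    # and return the majority class among them.
    nearest = sorted(zip(train, labels), key=lambda tl: euclidean(tl[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (7.5, 8.2)]
labels = ["A", "A", "B", "B"]
pred = knn_classify(train, labels, (7.9, 7.8), k=3)
```

The query point (7.9, 7.8) has two "B" tuples and one "A" tuple among its three nearest neighbors, so the majority vote assigns class "B".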
Genetic Algorithms:
In general, genetic learning starts as follows. An initial population is created consisting of randomly generated rules. Each rule can be represented by a string of bits. As a simple example, suppose that samples in a given
training set are described by two Boolean attributes, A1 and A2, and that there are two
classes,C1 andC2.
The rule "IF A1 AND NOT A2 THEN C2" can be encoded as the bit string "100", where the two leftmost bits represent attributes A1 and A2, respectively, and the rightmost bit
represents the class.
Similarly, the rule "IF NOT A1 AND NOT A2 THEN C1" can be encoded as "001". If an attribute has k values, where k > 2, then k bits may be used to encode the attribute's values.
Classes can be encoded in a similar fashion.
Based on the notion of survival of the fittest, a new population is formed to consist of
the fittest rules in the current population, as well as offspring of these rules.
Offspring are created by applying genetic operators such as crossover and mutation.
In crossover, substrings from pairs of rules are swapped to form new pairs of rules. In mutation, randomly selected bits in a rule's string are inverted.
The process of generating new populations based on prior populations of rules continues until a population, P, evolves where each rule in P satisfies a prespecified fitness threshold.
Genetic algorithms are easily parallelizable and have been used for classification as
well as other optimization problems. In data mining, they may be used to evaluate the
fitness of other algorithms.
Fuzzy Set Approaches:
Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership that a certain value has in a given category. Each category then represents a fuzzy set. Fuzzy logic systems typically provide graphical tools to assist users in converting attribute values to fuzzy truth values.
Fuzzy set theory is also known as possibility theory.
Unlike the notion of traditional "crisp" sets, where an element either belongs to a set S or to its complement, in fuzzy set theory, elements can belong to more than one fuzzy set.
Fuzzy set theory is useful for data mining systems performing rule-based classification.
It provides operations for combining fuzzy measurements.
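A minimal sketch of fuzzy membership, where a single value belongs to two fuzzy sets at once; the income categories and breakpoints are hypothetical:

```python
# Sketch of fuzzy membership: a value can belong to more than one fuzzy set
# with degrees in [0.0, 1.0]. The income categories and breakpoints are hypothetical.
def membership_medium(income):
    """Triangular membership for a 'medium income' fuzzy set (40k-70k, peak at 55k)."""
    if income <= 40_000 or income >= 70_000:
        return 0.0
    if income <= 55_000:
        return (income - 40_000) / 15_000
    return (70_000 - income) / 15_000

def membership_high(income):
    """Ramp membership for a 'high income' fuzzy set rising from 55k to 85k."""
    if income <= 55_000:
        return 0.0
    if income >= 85_000:
        return 1.0
    return (income - 55_000) / 30_000

# An income of 64k belongs partially to both fuzzy sets at once.
m_medium = membership_medium(64_000)
m_high = membership_high(64_000)
```

An income of 64k is "medium" with degree 0.4 and "high" with degree 0.3; a rule-based classifier could then combine these degrees rather than forcing the value into a single crisp category.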
Several procedures exist for translating the resulting fuzzy output into a defuzzified or crisp value that is returned by the system. Fuzzy logic systems have been used in numerous areas for classification, including market research, finance, and health care.
In straight-line regression, data are modeled using a straight line, y = w0 + w1x, where w0 and w1 are regression coefficients specifying the Y-intercept and slope of the line, respectively.
The regression coefficients can also be thought of as weights. These coefficients can be solved for by the method of least squares, which estimates the best-fitting straight line as the one that minimizes the error between the actual data and the estimate of the line.
Let D be a training set consisting of values of predictor variable, x, for some population and their associated values for response variable, y. The training set contains |D| data points of the form (x1, y1), (x2, y2), ..., (x|D|, y|D|).
The regression coefficients can be estimated using this method with the following equations: w1 = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)² and w0 = ȳ − w1 x̄,
where x̄ is the mean value of x1, x2, ..., x|D|, and ȳ is the mean value of y1, y2, ..., y|D|. The coefficients w0 and w1 often provide good approximations to otherwise complicated regression equations.
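The least-squares estimates of w0 and w1 can be sketched as follows; the data points are hypothetical and lie exactly on y = 1 + 2x so the recovered coefficients are easy to check:

```python
# Sketch of straight-line regression by the method of least squares,
# implementing the w1 and w0 estimates described above.
# The data points are hypothetical.
def least_squares(points):
    n = len(points)
    x_mean = sum(x for x, _ in points) / n
    y_mean = sum(y for _, y in points) / n
    # w1 = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2)
    w1 = (sum((x - x_mean) * (y - y_mean) for x, y in points)
          / sum((x - x_mean) ** 2 for x, _ in points))
    w0 = y_mean - w1 * x_mean  # intercept from the means
    return w0, w1

# Points lying exactly on y = 1 + 2x should recover w0 = 1, w1 = 2.
w0, w1 = least_squares([(1, 3), (2, 5), (3, 7), (4, 9)])
```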
Multiple regression problems are instead commonly solved with the use of statistical software packages, such as SAS, SPSS, and S-Plus. By applying transformations to the variables, we can convert the nonlinear model into a linear one that can then be solved by the method of least squares.
Polynomial regression is a special case of multiple regression. That is, the addition of high-order terms like x², x³, and so on, which are simple functions of the single variable, x, can be considered equivalent to adding new independent variables.
Transformation of a polynomial regression model to a linear regression model:
Consider a cubic polynomial relationship given by
y = w0 + w1x + w2x² + w3x³.
Define the new variables x1 = x, x2 = x², x3 = x³.
The model can then be converted to linear form by applying the above assignments, resulting in the equation
y = w0 + w1x1 + w2x2 + w3x3,
which is easily solved by the method of least squares using software for regression analysis.
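The variable substitution can be sketched as a simple feature expansion; any multiple linear regression solver can then be applied to the expanded rows:

```python
# Sketch of the polynomial-to-linear transformation: each tuple's single
# predictor x is expanded into the new variables x1 = x, x2 = x^2, x3 = x^3,
# after which any multiple linear regression solver can be applied.
def expand_cubic(xs):
    return [(x, x ** 2, x ** 3) for x in xs]

rows = expand_cubic([1.0, 2.0, 3.0])
```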
True positives are the positive tuples that were correctly labeled by the classifier, while true negatives are the negative tuples that were correctly labeled by the classifier. False positives are the negative tuples that were incorrectly labeled as positive, while false negatives are the positive tuples that were incorrectly labeled as negative. To assess how well the classifier can recognize the tuples of each class, the sensitivity and specificity measures can be used: sensitivity = t_pos / pos (the true positive rate) and specificity = t_neg / neg (the true negative rate).
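The sensitivity, specificity, and accuracy measures can be sketched as follows; the confusion counts are hypothetical:

```python
# Sketch: sensitivity, specificity, and accuracy from confusion counts.
# The counts below are hypothetical.
def sensitivity(t_pos, pos):
    return t_pos / pos          # true positive (recognition) rate

def specificity(t_neg, neg):
    return t_neg / neg          # true negative rate

def accuracy(t_pos, t_neg, pos, neg):
    # Accuracy weights sensitivity and specificity by the class proportions.
    return (sensitivity(t_pos, pos) * pos + specificity(t_neg, neg) * neg) / (pos + neg)

sens = sensitivity(90, 100)    # 90 of 100 positive tuples recognized
spec = specificity(190, 200)   # 190 of 200 negative tuples recognized
acc = accuracy(90, 190, 100, 200)
```

With these counts, sensitivity is 0.90, specificity is 0.95, and overall accuracy is 280/300 ≈ 0.93.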
Chapter-4
A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters.
A cluster of data objects can be treated collectively as one group and so may be considered
as a form of data compression.
Cluster analysis tools based on k-means, k-medoids, and several other methods have also been built into many statistical analysis software packages or systems, such as S-Plus, SPSS, and SAS.
4.1.1 Applications:
Cluster analysis has been widely used in numerous applications, including market research, pattern recognition, data analysis, and image processing.
Clustering can also be used for outlier detection. Applications of outlier detection include the detection of credit card fraud and the monitoring of criminal activities in electronic commerce.
Typical requirements of clustering in data mining include the following:
Scalability:
Many clustering algorithms work well on small data sets containing fewer than several hundred data objects; however, a large database may contain millions of objects. Highly scalable clustering algorithms are needed.
Ability to deal with different types of attributes:
Many algorithms are designed to cluster numerical data. However, applications may require clustering other types of data, such as binary, categorical (nominal), and ordinal data, or mixtures of these data types.
Discovery of clusters with arbitrary shape:
Many clustering algorithms determine clusters based on Euclidean or Manhattan distance measures and so tend to find spherical clusters with similar size and density. However, a cluster could be of any shape, so it is important to develop algorithms that can detect clusters of arbitrary shape.
Minimal requirements for domain knowledge to determine input parameters:
Many clustering algorithms require users to input certain parameters in cluster analysis (such as the number of desired clusters). The clustering results can be quite sensitive to
input parameters. Parameters are often difficult to determine, especially for data sets
containing high-dimensional objects. This not only burdens users, but it also makes the
quality of clustering difficult to control.
Constraint-based clustering:
Real-world applications may need to perform clustering under various kinds of constraints.
Suppose that your job is to choose the locations for a given number of new automatic
banking machines (ATMs) in a city. To decide upon this, you may cluster households
while considering constraints such as the city's rivers and highway networks, and the type
and number of customers per cluster. A challenging task is to find groups of data with good
clustering behavior that satisfy specified constraints.
Interpretability and usability:
Users expect clustering results to be interpretable, comprehensible, and usable. That is,
clustering may need to be tied to specific semantic interpretations and applications. It is
important to study how an application goal may influence the selection of clustering
features and methods.
The major clustering methods can be classified into the following categories:
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Methods
Partitioning Methods:
Given a database of n objects, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k <= n. That is, it classifies the data into k groups, which together satisfy the following requirements: each group must contain at least one object, and each object must belong to exactly one group.
The general criterion of a good partitioning is that objects in the same cluster are close or related to each other, whereas objects of different clusters are far apart or very different.
Hierarchical Methods:
A hierarchical method creates a hierarchical decomposition of the given set of data objects. A hierarchical method can be classified as being either agglomerative or divisive, based on how the hierarchical decomposition is formed.
The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group. It successively merges the objects or groups that are close to one another, until all of the groups are merged into one or until a termination condition holds.
The divisive approach, also called the top-down approach, starts with all of the objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters, until eventually each object is in one cluster, or until a termination condition holds.
Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never be undone. This rigidity is useful in that it leads to smaller computation costs by not having to worry about a combinatorial number of different choices. The quality of hierarchical clustering can be improved by analyzing object linkages at each hierarchical partitioning, as in Chameleon, or by integrating hierarchical agglomeration and other approaches, by first using a hierarchical agglomerative algorithm and then refining the result using iterative relocation.
Density-Based Methods:
Most partitioning methods cluster objects based on the distance between objects. Such methods can find only spherical-shaped clusters and encounter difficulty at discovering clusters of arbitrary shapes.
Other clustering methods have been developed based on the notion of density. Their
general idea is to continue growing the given cluster as long as the density in the
neighborhood exceeds some threshold; that is, for each data point within a given
cluster, the neighborhood of a given radius has to contain at least a minimum number of
points. Such a method can be used to filter out noise (outliers) and discover clusters of
arbitrary shape.
DBSCAN and its extension, OPTICS, are typical density-based methods that grow clusters according to a density-based connectivity analysis. DENCLUE is a method that clusters objects based on the analysis of the value distributions of density functions.
Grid-Based Methods:
Grid-based methods quantize the object space into a finite number of cells that form a grid structure.
All of the clustering operations are performed on the grid structure i.e., on the quantized
space. The main advantage of this approach is its fast processing time, which is
typically independent of the number of data objects and dependent only on the number
of cells in each dimension in the quantized space.
STING is a typical example of a grid-based method. WaveCluster applies wavelet transformation for clustering analysis; it is both grid-based and density-based.
Model-Based Methods:
Model-based methods hypothesize a model for each of the clusters and find the best fit of the data to the given model. Such methods can automatically determine the number of clusters based on standard statistics, taking noise or outliers into account and thus yielding robust clustering methods.
Constraint-Based Clustering
CLIQUE and PROCLUS are two influential subspace clustering methods, which search for clusters in subspaces of the data, rather than over the entire data space.
Frequent pattern-based clustering, another clustering methodology, extracts distinct frequent patterns among subsets of dimensions that occur frequently. It uses such patterns to group objects and generate meaningful clusters.
Constraint-based clustering performs clustering by incorporating user-specified or application-oriented constraints, guides the clustering results, and provides an effective means for communicating with the clustering process to meet application requirements. Examples include spatial clustering in the presence of obstacles, clustering under user-specified constraints, and semi-supervised clustering, which employs weak supervision such as pairwise constraints on objects.
The k-means algorithm takes the input parameter, k, and partitions a set of n objects into k clusters so that the resulting intracluster similarity is high but the intercluster similarity is low. Cluster similarity is measured in regard to the mean value of the objects in a cluster, which can be viewed as the cluster's centroid or center of gravity.
First, it randomly selects k of the objects, each of which initially represents a cluster
mean or center.
For each of the remaining objects, an object is assigned to the cluster to which it is the
most similar, based on the distance between the object and the cluster mean.
It then computes the new mean for each cluster.
This process iterates until the criterion function converges. Typically, the square-error criterion is used, defined as E = Σ(j=1..k) Σ(p∈Cj) |p − mj|², where E is the sum of the square error for all objects in the data set, p is the point in space representing a given object, and mj is the mean of cluster Cj.
(Figure: the k-means algorithm for partitioning, where each cluster's center is represented by the mean value of the objects in the cluster.)
The k-means algorithm is sensitive to outliers because an object with an extremely large
value may substantially distort the distribution of data. This effect is particularly
exacerbated due to the use of the square-error function.
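The k-means procedure described above can be sketched in one dimension; the data values and initial means are hypothetical:

```python
# Sketch of the k-means procedure described above: pick k initial means,
# assign each object to the nearest mean, recompute means, repeat.
# The 1-D data values and initial means are hypothetical.
def kmeans_1d(values, means, iterations=10):
    for _ in range(iterations):
        clusters = [[] for _ in means]
        # Assign each object to the cluster with the most similar (nearest) mean.
        for v in values:
            nearest = min(range(len(means)), key=lambda i: abs(v - means[i]))
            clusters[nearest].append(v)
        # Compute the new mean for each cluster.
        means = [sum(c) / len(c) if c else m for c, m in zip(clusters, means)]
    return means, clusters

values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
means, clusters = kmeans_1d(values, means=[1.0, 10.0])
```

On this toy data the algorithm converges to means 2.0 and 11.0, splitting the values into the two obvious groups; a fixed iteration count stands in for the convergence test of the criterion function.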
Instead of taking the mean value of the objects in a cluster as a reference point, we can pick
actual objects to represent the clusters, using one representative object per cluster. Each
remaining object is clustered with the representative object to which it is the most similar.
The partitioning method is then performed based on the principle of minimizing the sum of the dissimilarities between each object and its corresponding reference point. That is, an absolute-error criterion is used, defined as E = Σ(j=1..k) Σ(p∈Cj) |p − oj|,
where E is the sum of the absolute error for all objects in the data set, p is the point in space representing a given object in cluster Cj, and oj is the representative object of Cj.
The initial representative objects are chosen arbitrarily. The iterative process of replacing representative objects by nonrepresentative objects continues as long as the quality of the resulting clustering is improved.
This quality is estimated using a cost function that measures the average dissimilarity between an object and the representative object of its cluster.
To determine whether a nonrepresentative object, o_random, is a good replacement for a current representative object, oj, the following four cases are examined for each of the nonrepresentative objects.
Case 1:
p currently belongs to representative object, oj. If oj is replaced by o_random as a representative object and p is closest to one of the other representative objects, oi, i ≠ j, then p is reassigned to oi.
Case 2:
p currently belongs to representative object, oj. If oj is replaced by o_random as a representative object and p is closest to o_random, then p is reassigned to o_random.
Case 3:
p currently belongs to representative object, oi, i ≠ j. If oj is replaced by o_random as a representative object and p is still closest to oi, then the assignment does not change.
Case 4:
p currently belongs to representative object, oi, i ≠ j. If oj is replaced by o_random as a representative object and p is closest to o_random, then p is reassigned to o_random.
4.4.2 The k-Medoids Algorithm:
The k-medoids method is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean. However, its processing is more costly than the k-means method.
A hierarchical clustering method works by grouping data objects into a tree of clusters. Hierarchical clustering methods cannot perform adjustment once a merge or split decision has been executed. That is, if a particular merge or split decision later turns out to have been a poor choice, the method cannot backtrack and correct it. Hierarchical clustering methods can be further classified as either agglomerative or divisive, depending on whether the hierarchical decomposition is formed in a bottom-up (merging) or top-down (splitting) fashion.
This bottom-up strategy starts by placing each object in its own cluster and then merges
these atomic clusters into larger and larger clusters, until all of the objects are in a single
cluster or until certain termination conditions are satisfied.
Most hierarchical clustering methods belong to this category. They differ only in their
definition of intercluster similarity.
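The agglomerative (bottom-up) strategy can be sketched with single-linkage as the intercluster similarity definition; the 1-D points are hypothetical:

```python
# Sketch of agglomerative (bottom-up) hierarchical clustering with
# single-linkage as the intercluster similarity definition.
# The 1-D points and target cluster count are hypothetical.
def single_link_distance(c1, c2):
    return min(abs(a - b) for a in c1 for b in c2)

def agglomerate(points, target_clusters):
    clusters = [[p] for p in points]  # start: each object in its own cluster
    while len(clusters) > target_clusters:
        # Find and merge the closest pair of clusters.
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_link_distance(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

result = agglomerate([1.0, 1.5, 5.0, 5.2, 9.0], target_clusters=3)
```

Using a target cluster count as the termination condition, the atomic clusters are merged pairwise until three clusters remain: {1.0, 1.5}, {5.0, 5.2}, and {9.0}.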
Constraints on individual objects:
We can specify constraints on the objects to be clustered. In a real estate application, for example, one may like to spatially cluster only those luxury mansions worth over a million dollars. This constraint confines the set of objects to be clustered. It can easily be handled by preprocessing, after which the problem reduces to an instance of unconstrained clustering.
Constraints on the selection of clustering parameters:
A user may like to set a desired range for each clustering parameter. Clustering parameters are usually quite specific to the given clustering algorithm. Examples of parameters include k, the desired number of clusters in a k-means algorithm, or ε (the radius) and the minimum number of points in the DBSCAN algorithm. Although such user-specified parameters may strongly influence the clustering results, they are usually confined to the algorithm itself. Thus, their fine tuning and processing are usually not considered a form of constraint-based clustering.
Constraints on distance or similarity functions:
We can specify different distance or similarity functions for specific attributes of the objects to be clustered, or different distance measures for specific pairs of objects. When clustering sportsmen, for example, we may use different weighting schemes for height, body weight, age, and skill level. Although this will likely change the mining results, it may not alter the clustering process per se. However, in some cases, such changes may make the evaluation of the distance function nontrivial, especially when it is tightly intertwined with the clustering process.
Semi-supervised clustering based on partial supervision:
The quality of unsupervised clustering can be significantly improved using some weak form of supervision. This may be in the form of pairwise constraints (i.e., pairs of objects labeled as belonging to the same or different clusters). Such a constrained clustering process is called semi-supervised clustering.
There exist data objects that do not comply with the general behavior or model of the data. Such data objects, which are grossly different from or inconsistent with the remaining set of data, are called outliers.
Many data mining algorithms try to minimize the influence of outliers or eliminate them altogether. This, however, could result in the loss of important hidden information because one person's noise could be another person's signal. In other words, the outliers may be of particular interest, such as in the case of fraud detection, where outliers may indicate fraudulent activity. Thus, outlier detection and analysis is an interesting data mining task, referred to as outlier mining.
It can be used in fraud detection, for example, by detecting unusual usage of credit cards or
telecommunication services. In addition, it is useful in customized marketing for
identifying the spending behavior of customers with extremely low or extremely high incomes.
Outlier mining can be described as follows: Given a set of n data points or objects and k, the expected number of outliers, find the top k objects that are considerably dissimilar, exceptional, or inconsistent with respect to the remaining data. The outlier-mining problem can be viewed as two subproblems:
Define what data can be considered as inconsistent in a given data set, and
Find an efficient method to mine the outliers so defined.
The statistical distribution-based approach assumes a distribution or probability model for the given data set (e.g., a normal or Poisson distribution) and then identifies outliers with respect to the model using a discordancy test. Application of the test requires knowledge of the data set parameters, knowledge of distribution parameters such as the mean and variance, and the expected number of outliers.
A statistical discordancy test examines two hypotheses: a working hypothesis and an alternative hypothesis. A working hypothesis, H, is a statement that the entire data set of n objects comes from an initial distribution model, F.
some statistic, T, has been chosen for discordancy testing, and the value of the statistic
for object oi is vi, then the distribution of T is constructed. Significance probability,
SP(vi)=Prob(T > vi), is evaluated. If SP(vi) is sufficiently small, then oi is discordant and
the working hypothesis is rejected.
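As a rough sketch of this procedure (the notes do not fix a particular statistic T; here a simple two-sided z-score under an assumed normal F is used, with parameters estimated from the sample), an object is declared discordant when its significance probability falls below a chosen threshold alpha:

```python
import math

def normal_tail(z):
    # P(Z > z) for a standard normal variable, via the complementary error function.
    return 0.5 * math.erfc(z / math.sqrt(2))

def is_discordant(data, x, alpha=0.05):
    # Working hypothesis H: the data come from a normal distribution F.
    n = len(data)
    mean = sum(data) / n
    var = sum((v - mean) ** 2 for v in data) / n
    z = abs(x - mean) / math.sqrt(var)   # statistic T = |z-score|
    sp = 2 * normal_tail(z)              # two-sided significance probability
    return sp < alpha                    # small SP -> reject H, x is discordant

sample = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 25.0]
print(is_discordant(sample, 25.0))  # -> True
print(is_discordant(sample, 10.0))  # -> False
```

Note that the extreme value inflates the estimated variance (the masking effect), which is one reason real discordancy tests, such as Grubbs' test, use tabulated critical values instead of the raw normal tail.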
An alternative hypothesis, H', which states that oi comes from another distribution model,
G, is adopted. The result is very much dependent on which model F is chosen because
oi may be an outlier under one model and a perfectly valid value under another. The
alternative distribution is very important in determining the power of the test, that is, the
probability that the working hypothesis is rejected when oi is really an outlier.
There are different kinds of alternative distributions.
Inherent alternative distribution:
In this case, the working hypothesis that all of the objects come from distribution F is
rejected in favor of the alternative hypothesis that all of the objects arise from another
distribution, G:
H' : oi ∈ G, where i = 1, 2, ..., n
F and G may be different distributions, or they may differ only in parameters of the same
distribution.
There are constraints on the form of the G distribution in that it must have potential to
produce outliers. For example, it may have a different mean or dispersion, or a longer
tail.
Mixture alternative distribution:
The mixture alternative states that discordant values are not outliers in the F population,
but contaminants from some other population, G. In this case, the alternative hypothesis is
H' : oi ∈ (1 − λ)F + λG, where i = 1, 2, ..., n.
Slippage alternative distribution:
This alternative states that all of the objects (apart from some prescribed small number)
arise independently from the initial model, F, with its given parameters, whereas the
remaining objects are independent observations from a modified version of F in which
the parameters have been shifted.
There are two basic types of procedures for detecting outliers:
Block procedures:
In this case, either all of the suspect objects are treated as outliers or all of them are accepted
as consistent.
Consecutive procedures:
An example of such a procedure is the inside-out procedure. Its main idea is that the object
that is least likely to be an outlier is tested first. If it is found to be an outlier, then all of the
more extreme values are also considered outliers; otherwise, the next most extreme object is
tested, and so on. This procedure tends to be more effective than block procedures.
Distance-based outlier detection:
Rather than relying on statistical tests, we can think of distance-based outliers as those
objects that do not have enough neighbors, where neighbors are defined based on
distance from the given object. In comparison with statistical methods, distance-based
outlier detection generalizes the ideas behind discordancy testing for various standard
distributions.
For many discordancy tests, it can be shown that if an object, o, is an outlier according to the
given test, then o is also a DB(pct, dmin)-outlier for some suitably defined pct and dmin.
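The DB(pct, dmin) definition — o is an outlier if at least a fraction pct of the objects lie at a distance greater than dmin from o — can be checked directly with a brute-force scan. The sketch below (illustrative only, for 1-D data) is the naive O(n^2) version that the index-based, nested-loop, and cell-based algorithms described next all optimize:

```python
# Brute-force DB(pct, dmin)-outlier detection for 1-D data: o is an
# outlier if at least a fraction pct of the other objects lie at a
# distance greater than dmin from o.
def db_outliers(data, pct=0.9, dmin=5.0):
    n = len(data)
    outliers = []
    for i, o in enumerate(data):
        # Count objects lying farther than dmin from o.
        far = sum(1 for j, x in enumerate(data) if j != i and abs(x - o) > dmin)
        if far >= pct * (n - 1):
            outliers.append(o)
    return outliers

points = [1.0, 1.5, 2.0, 1.2, 1.8, 20.0]
print(db_outliers(points))  # -> [20.0]
```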
For example, if objects that lie three or more standard deviations from the mean
are considered to be outliers, assuming a normal distribution, then this definition can
be unified by a corresponding DB(pct, dmin)-outlier definition.
Index-based algorithm:
Given a data set, the index-based algorithm uses multidimensional indexing structures, such
as R-trees or k-d trees, to search for neighbors of each object o within radius dmin around that
object. Let M be the maximum number of objects within the dmin-neighborhood of an outlier.
Therefore, once M+1 neighbors of object o are found, it is clear that o is not an outlier. This
algorithm has a worst-case complexity of O(k · n^2), where n is the number of objects in the data
set and k is the dimensionality.
Nested-loop algorithm:
The nested-loop algorithm has the same computational complexity as the index-based
algorithm but avoids index structure construction and tries to minimize the number of I/Os. It
divides the memory buffer space into two halves and the data set into several logical blocks.
By carefully choosing the order in which blocks are loaded into each half, I/O efficiency can
be achieved.
Cell-based algorithm:
To avoid O(n^2) computational complexity, a cell-based algorithm was developed for
memory-resident data sets. Its complexity is O(c^k + n), where c is a constant depending on the number
of cells and k is the dimensionality.
In this method, the data space is partitioned into cells with a side length equal to
dmin/(2·√k). Each cell has two layers surrounding it. The first layer is one cell thick, while the second is
2·√k − 1 cells thick, rounded up to the closest integer. The algorithm counts outliers on a
cell-by-cell rather than an object-by-object basis. For a given cell, it accumulates three
counts: the number of objects in the cell, in the cell and the first layer together, and in the
cell and both layers together. Let's refer to these counts as cell count, cell + 1 layer count, and
cell + 2 layers count, respectively.
Let M be the maximum number of outliers that can exist in the dmin-neighborhood of an
outlier.
An object, o, in the current cell is considered an outlier only if cell + 1 layer count is less
than or equal to M. If this condition does not hold, then all of the objects in the cell can be
removed from further investigation as they cannot be outliers.
If cell + 2 layers count is less than or equal to M, then all of the objects in the cell are
considered outliers. Otherwise, if this number is more than M, then it is possible that some
of the objects in the cell may be outliers. To detect these outliers, object-by-object
processing is used where, for each object, o, in the cell, objects in the second layer of o
are examined. For objects in the cell, only those objects having no more than M points in
their dmin-neighborhoods are considered outliers.
A variation to the algorithm is linear with respect to n and guarantees that no more than three
passes over the data set are required. It can be used for large disk-resident data sets, yet does
not scale well for high dimensions.
Density-based local outlier detection:
Statistical and distance-based outlier detection both depend on the overall, or global,
distribution of the given set of data points and work best when the data are uniformly
distributed. These methods encounter difficulties when analyzing data with rather different
density distributions.
To define the local outlier factor of an object, we need to introduce the concepts of
k-distance, k-distance neighborhood, reachability distance, and local reachability density.
These are defined as follows:
The k-distance of an object p is the maximal distance that p gets from its k nearest
neighbors. This distance is denoted as k-distance(p). It is defined as the distance,
d(p, o), between p and an object o ∈ D, such that for at least k objects, o' ∈ D, it holds that
d(p, o') ≤ d(p, o). That is, there are at least k objects in D that are as close as or closer to p than
o. And for at most k−1 objects, o'' ∈ D, it holds that d(p, o'') < d(p, o).
That is, there are at most k−1 objects that are closer to p than o. You may be wondering at this
point how k is determined. The LOF method links to density-based clustering in that it sets k
to the parameter MinPts, which specifies the minimum number of points for use in identifying
clusters based on density.
Intuitively, if an object p is far away from o, then the reachability distance between the two is simply
their actual distance. However, if they are sufficiently close (i.e., where p is within the
MinPts-distance neighborhood of o), then the actual distance is replaced by the
MinPts-distance of o: reach_dist_MinPts(p, o) = max{MinPts-distance(o), d(p, o)}. This
helps to significantly reduce the statistical fluctuations of d(p, o) for all of the p close to o.
The higher the value of MinPts is, the more similar is the reachability distance for objects
within the same neighborhood.
Intuitively, the local reachability density of p is the inverse of the average reachability
distance based on the MinPts-nearest neighbors of p.
The local outlier factor (LOF) of p captures the degree to which we call p an outlier.
It is defined as
LOF_MinPts(p) = ( Σ over o ∈ N_MinPts(p) of lrd_MinPts(o) / lrd_MinPts(p) ) / |N_MinPts(p)|
That is, it is the average of the ratio of the local reachability density of p's
MinPts-nearest neighbors to that of p itself. It is easy to see that the lower p's local
reachability density is, and the higher the local reachability densities of p's
MinPts-nearest neighbors are, the higher LOF(p) is.
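The definitions above can be put together in a small brute-force sketch for tiny 2-D data sets (function names are illustrative, not a standard API; here the parameter k plays the role of MinPts):

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def knn(p, data, k):
    # The k nearest neighbors of p, excluding p itself.
    return sorted((q for q in data if q != p), key=lambda q: dist(p, q))[:k]

def k_distance(p, data, k):
    # Distance from p to its k-th nearest neighbor.
    return dist(p, knn(p, data, k)[-1])

def reach_dist(p, o, data, k):
    # Reachability distance of p w.r.t. o: max of o's k-distance and d(p, o).
    return max(k_distance(o, data, k), dist(p, o))

def lrd(p, data, k):
    # Local reachability density: inverse of the average reachability distance.
    nbrs = knn(p, data, k)
    return len(nbrs) / sum(reach_dist(p, o, data, k) for o in nbrs)

def lof(p, data, k):
    # Average ratio of the neighbors' lrd to p's own lrd.
    nbrs = knn(p, data, k)
    return sum(lrd(o, data, k) for o in nbrs) / (len(nbrs) * lrd(p, data, k))

data = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
print(lof((5, 5), data, 2))  # well above 1: a local outlier
print(lof((0, 0), data, 2))  # close to 1: an inlier
```

For real data sets, library implementations such as scikit-learn's LocalOutlierFactor avoid this O(n^2) recomputation of neighborhoods.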
Deviation-based outlier detection:
Deviation-based outlier detection does not use statistical tests or distance-based measures
to identify exceptional objects. Instead, it identifies outliers by examining the main
characteristics of objects in a group; objects that deviate from this description are considered
outliers. Hence, in this approach the term deviations is typically used to refer to outliers. In
this section, we study two techniques for deviation-based outlier detection. The first
sequentially compares objects in a set, while the second employs an OLAP data cube
approach.
Sequential exception technique:
The sequential exception technique simulates the way in which humans can
distinguish unusual objects from among a series of supposedly like objects. It uses implicit
redundancy of the data. Given a data set, D, of n objects, it builds a sequence of subsets,
{D1, D2, ..., Dm}, of these objects with 2 ≤ m ≤ n, such that Dj−1 ⊂ Dj, where Dj ⊆ D.
Dissimilarities are assessed between subsets in the sequence. The technique introduces the
following key terms.
Exception set:
or
This is the set of deviations, or outliers. It is defined as the smallest subset of objects whose
removal results in the greatest reduction of dissimilarity in the residual set.
Dissimilarity function:
This function does not require a metric distance between the objects. It is any function that,
given a set of objects, returns a low value if the objects are similar to one another. The greater
the dissimilarity among the objects, the higher the value returned by the function. The
dissimilarity of a subset is incrementally computed based on the subset prior to it in the
sequence. Given a subset of n numbers, {x1, ..., xn}, a possible dissimilarity function is the
variance of the numbers in the set, (1/n) · Σ (xi − x̄)²,
where x̄ is the mean of the n numbers in the set. For character strings, the dissimilarity
function may be in the form of a pattern string (e.g., containing wildcard characters) that is
used to cover all of the patterns seen so far. The dissimilarity increases when the pattern
covering all of the strings in Dj−1 does not cover any string in Dj that is not in Dj−1.
Cardinality function:
This is typically the count of the number of objects in a given set.
Smoothing factor:
This function is computed for each subset in the sequence. It assesses how much the
dissimilarity can be reduced by removing the subset from the original set of objects.
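A much-simplified sketch of these ideas (an illustration only, not the full technique): taking variance as the dissimilarity function and single objects as candidate exception sets, the smoothing factor of a candidate is simply the drop in dissimilarity when that object is removed, and the exception is the object with the largest smoothing factor:

```python
# Simplified sequential-exception sketch: dissimilarity is the variance
# of a set of numbers; the smoothing factor of a candidate (here a single
# object, a simplification of the full method) is the reduction in
# dissimilarity achieved by removing it.
def dissimilarity(values):
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def exception_set(data):
    base = dissimilarity(data)
    best, best_sf = None, 0.0
    for i, x in enumerate(data):
        rest = data[:i] + data[i + 1:]
        sf = base - dissimilarity(rest)   # smoothing factor for {x}
        if sf > best_sf:
            best, best_sf = x, sf
    return best

data = [2, 3, 2, 4, 3, 90]
print(exception_set(data))  # -> 90
```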