DWDM All Units
Introduction: Why Data Mining? What Is Data Mining? What Kinds of Data Can Be Mined?
What Kinds of Patterns Can Be Mined? Which Technologies Are Used? Which Kinds of
Applications Are Targeted? Major Issues in Data Mining. Data Objects and Attribute Types,
Basic Statistical Descriptions of Data, Data Visualization, Measuring Data Similarity and
Dissimilarity
Data mining is the process of discovering interesting patterns and knowledge from large amounts of
data. The data sources can include databases, data warehouses, the Web, other information
repositories, or data that are streamed into the system dynamically.
The knowledge discovery process consists of data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation. Steps 1 through 4 are different forms of data preprocessing, where data are prepared for mining. The data mining step may interact with the user or a knowledge base. The interesting patterns are presented to the user and may be stored as new knowledge in the knowledge base.
Architecture of Data Mining System: - The architecture of a data mining system may have the
following components
1. Data Sources or Repositories: - This component represents multiple data sources such as
database, data warehouse, or any other information repository. Data cleaning and data integration
techniques may be performed on the data.
2. Database server or data warehouse server: - The database or data warehouse server is
responsible for fetching the relevant data, based on the user's data mining request.
3. Knowledge base: - This is the domain knowledge that is used to guide the search, or to perform
analysis of the resulting patterns.
Data mining can be applied to any kind of data as long as the data are meaningful for a target
application.
I. Database Data: A database system, also called a database management system (DBMS),
consists of a collection of interrelated data, known as a database, and a set of software
programs to manage and access the data. A relational database is a collection of tables,
each of which is assigned a unique name. Each table consists of a set of attributes
(columns or fields) and usually stores a large number of tuples (records or rows). Each
tuple in a relational table represents a record identified by a unique key and described by
a set of attribute values.
III. Transactional databases: - A transactional database consists of a file where each record
represents a transaction. A transaction typically includes a unique transaction identity
number (Trans ID) and a list of the items making up the transaction. The transactional
database may have additional tables associated with it, which contain other information
regarding the sale, such as the date of the transaction, the customer ID number, the ID
number of the salesperson and of the branch at which the sale occurred, and so on.
Data discrimination is a comparison of the general features of the target class data objects
against the general features of objects from one or multiple contrasting classes. The target and
contrasting classes can be specified by a user, and the corresponding data objects can be
retrieved through database queries. For example, a user may want to compare the general
features of software products with sales that increased by 10% last year against those with sales
that decreased by at least 30% during the same period. The methods used for data discrimination
are similar to those used for data characterization.
Outlier Analysis
A database may contain data objects that do not comply with the general behavior or Model of
the data. These data objects are outliers. Most data mining methods discard outliers as noise or
exceptions. However, in some applications such as fraud detection, the rare events can be more
interesting than the more regularly occurring ones. The analysis of outlier data is referred to as
outlier mining.
Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases of
extremely large amounts for a given account number in comparison to regular charges incurred
by the same account.
Evolution and deviation analysis: - Data evolution analysis describes and models regularities or
trends for objects whose behavior changes over time. Although this may include
characterization, discrimination, association, classification, or clustering of time-related data,
distinct features of such an analysis include time-series data analysis, sequence or periodicity
pattern matching, and similarity-based data analysis.
As a highly application-driven domain, data mining has incorporated many techniques from
other domains such as statistics, machine learning, pattern recognition, database and data
warehouse systems, information retrieval, visualization, algorithms, high performance
computing, and many application domains.
Statistics
Statistics studies the collection, analysis, interpretation or explanation, and presentation of data.
Data mining has an inherent connection with statistics.
A statistical model is a set of mathematical functions that describe the behavior of the objects in
a target class in terms of random variables and their associated probability distributions.
Machine learning investigates how computers can learn (or improve their performance) based
on data. A main research area is for computer programs to automatically learn to recognize
complex patterns and make intelligent decisions based on data. For example, a typical machine
learning problem is to program a computer so that it can automatically recognize handwritten
postal codes on mail after learning from a set of examples. Machine learning is a fast-growing
discipline. Here, we illustrate classic problems in machine learning that are highly related to data
mining.
Active learning is a machine learning approach that lets users play an active role in the
learning process. An active learning approach can ask a user (e.g., a domain expert) to
label an example, which may be from a set of unlabeled examples or synthesized by the
learning program. The goal is to optimize the model quality by actively acquiring
knowledge from human users, given a constraint on how many examples they can be
asked to label.
A data warehouse integrates data originating from multiple sources and various timeframes. It
consolidates data in multidimensional space to form partially materialized data cubes. The data
cube model not only facilitates OLAP in multidimensional databases but also promotes
multidimensional data mining.
Information Retrieval
Information retrieval (IR) is the science of searching for documents or information in documents.
Documents can be text or multimedia, and may reside on the Web. The differences between
traditional information retrieval and database systems are twofold: Information retrieval assumes
that (1) the data under search are unstructured; and (2) the queries are formed mainly by
keywords, which do not have complex structures (unlike SQL queries in database systems).
Web search engines are essentially very large data mining applications. Various data mining
techniques are used in all aspects of search engines, ranging from crawling (e.g., deciding
which pages should be crawled and the crawling frequencies) and indexing (e.g., selecting pages to
be indexed and deciding to which extent the index should be constructed) to searching (e.g.,
deciding how pages should be ranked, which advertisements should be added, and how the
search results can be personalized or made "context aware").
Data mining is a dynamic and fast-expanding field with great strengths. In this section, we
briefly outline the major issues in data mining research, partitioning them into five groups:
mining methodology, user interaction, efficiency and scalability, diversity of data types, and data
mining and society.
Mining Methodology
o Mining various and new kinds of knowledge
Data mining covers a wide spectrum of data analysis and knowledge
discovery tasks, from data characterization and discrimination to
association and correlation analysis, classification, regression, clustering,
outlier analysis, sequence analysis, and trend and evolution analysis.
o Mining knowledge in multi-dimensional space
When searching for knowledge in large data sets, we can explore the data
in multidimensional space. That is, we can search for interesting patterns
among combinations of dimensions (attributes) at varying levels of
abstraction.
o Data mining: An interdisciplinary effort
The power of data mining can be substantially enhanced by integrating
new methods from multiple disciplines. For example, to mine data with
natural language text, it makes sense to fuse data mining methods with
methods of information retrieval and natural language processing.
o Boosting the power of discovery in a networked environment
Most data objects reside in a linked or interconnected environment,
whether it is the Web, database relations, files, or documents. Semantic
links across multiple data objects can be used to advantage in data mining.
o Handling noise, uncertainty, and incompleteness of data
Data often contain noise, errors, exceptions, or uncertainty, or are
incomplete. Errors and noise may confuse the data mining process, leading
to the derivation of erroneous patterns. Data cleaning, data preprocessing,
outlier detection and removal, and uncertainty reasoning are examples of
techniques that need to be integrated with the data mining process.
o Pattern evaluation and pattern- or constraint-guided mining
Not all the patterns generated by data mining processes are interesting.
What makes a pattern interesting may vary from user to user. Therefore,
techniques are needed to assess the interestingness of discovered patterns
based on subjective measures.
User Interaction
o Interactive mining
The data mining process should be highly interactive. Thus, it is important
to build flexible user interfaces and an exploratory mining environment,
facilitating the user’s interaction with the system. A user may like to first
sample a set of data, explore general characteristics of the data, and
estimate potential mining results.
o Incorporation of background knowledge
Background knowledge, constraints, rules, and other information
regarding the domain under study should be incorporated into the
knowledge discovery process.
o Ad hoc data mining and data mining query languages:
Query languages (e.g., SQL) have played an important role in flexible
searching because they allow users to pose ad hoc queries. Similarly, high-
level data mining query languages or other high-level flexible user
interfaces will give users the freedom to define ad hoc data mining tasks.
o Presentation and visualization of data mining results
How can a data mining system present data mining results, vividly and
flexibly, so that the discovered knowledge can be easily understood and
directly usable by humans? This is especially crucial if the data mining
process is interactive.
Efficiency and Scalability
o Efficiency and scalability of data mining algorithms
Data mining algorithms must be efficient and scalable in order to
effectively extract information from huge amounts of data in many data
repositories or in dynamic data streams.
o Parallel, distributed, stream, and incremental mining methods
The humongous size of many data sets, the wide distribution of data, and
the computational complexity of some data mining methods are factors
that motivate the development of parallel and distributed data-intensive
mining algorithms.
o Cloud computing and cluster computing
Cloud computing and cluster computing, which use computers in a distributed
and collaborative way to tackle very large-scale computational tasks, are also
active research themes in parallel data mining.
Diversity of data types
o Handling complex types of data
Diverse applications generate a wide spectrum of new data types, from
structured data such as relational and data warehouse data to semi-structured and
unstructured data; from stable data repositories to dynamic data streams; and from
simple data objects to temporal data, biological sequence data, and other complex data objects.
Data Objects and Attribute Types
i. Nominal Attributes: Nominal means "relating to names." The values of a nominal attribute
are symbols or names of things. Each value represents some kind of category, code, or
state, and so nominal attributes are also referred to as categorical. The values do not have
any meaningful order. In computer science, the values are also known as enumerations.
ii. Binary Attributes: A binary attribute is a nominal attribute with only two categories or
states: 0 or 1, where 0 typically means that the attribute is absent, and 1 means that it is
present. Binary attributes are referred to as Boolean if the two states correspond to true
and false.
Example:
A binary attribute is a nominal attribute with only two states (0 and 1).
Symmetric binary: both outcomes are equally important, e.g., gender.
Asymmetric binary: outcomes are not equally important, e.g., a medical test (positive vs.
negative). Convention: assign 1 to the most important outcome (e.g., HIV positive).
iii. Ordinal Attributes: An ordinal attribute is an attribute with possible values that have a
meaningful order or ranking among them, but the magnitude between successive values
is not known.
Example: Values have a meaningful order (ranking) but magnitude between successive
values is not known.
Size = {small, medium, large}, grades, army rankings
iv. Numeric Attributes: A numeric attribute is quantitative; that is, it is a measurable quantity,
represented in integer or real values. Numeric attributes can be interval-scaled or ratio-
scaled.
a. Interval-Scaled Attributes: Interval-scaled attributes are measured on a scale of
equal-size units. The values of interval-scaled attributes have order and can be
positive, 0, or negative. Thus, in addition to providing a ranking of values, such
attributes allow us to compare and quantify the difference between values.
Example: Interval
Measured on a scale of equal-sized units.
Values have order, e.g., temperature in °C or °F, calendar dates.
No true zero-point.
b. Ratio-Scaled Attributes: A ratio-scaled attribute is a numeric attribute with an
inherent zero-point. That is, if a measurement is ratio-scaled, we can speak of a value
as being a multiple (or ratio) of another value. In addition, the values are ordered, and
we can also compute the difference between values, as well as the mean, median, and
mode.
Example: Ratio
Inherent zero-point.
We can speak of values as being an order of magnitude larger than the unit of
measurement (10 K is twice as high as 5 K).
E.g., temperature in Kelvin, length, counts, monetary quantities.
Mean: The most common and effective numeric measure of the "center" of a set of data is the
(arithmetic) mean. Let x1, x2, …, xN be a set of N values or observations, such as for some
numeric attribute X, like salary. The mean of this set of values is
x̄ = (x1 + x2 + … + xN) / N
Example: Suppose we have the following values for salary (in thousands of dollars), shown in
increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, and 110. Using above equation we
have
x̄ = (30 + 36 + 47 + 50 + 52 + 52 + 56 + 60 + 63 + 70 + 70 + 110)/12 = 696/12 = 58
Thus, the mean salary is $58,000.
Sometimes, each value xi in a set may be associated with a weight wi for i = 1, …, N. The weights
reflect the significance, importance, or occurrence frequency attached to their respective values.
In this case, we can compute
x̄ = (w1x1 + w2x2 + … + wNxN) / (w1 + w2 + … + wN)
This is called the weighted arithmetic mean or the weighted average. It corresponds to the
built-in aggregate function average (avg() in SQL) provided in relational database systems.
Median: The median is the middle value if there is an odd number of values, or the average of the
middle two values otherwise. The median is expensive to compute when we have a large number of
observations. For grouped data it can be estimated by interpolation:
median ≈ L1 + ((N/2 − (Σfreq)l) / freqmedian) × width
where L1 is the lower boundary of the median interval, N is the number of values, (Σfreq)l is the sum
of the frequencies of all intervals below the median interval, freqmedian is the frequency of the
median interval, and width is the interval width.
Mode: The mode is another measure of central tendency. The mode for a set of data is the value
that occurs most frequently in the set. Therefore, it can be determined for qualitative and
quantitative attributes. Data sets with one, two, or three modes are respectively called unimodal,
bimodal, and trimodal. In general, a data set with two or more modes is multimodal
Example: The data shown in increasing order: 30K, 36K, 47K, 50K, 52K, 52K, 56K, 60K, 63K,
70K, 70K, and 110K. The two modes are $52,000 and $70,000.
For unimodal numeric data that are moderately skewed (asymmetrical), we have the following
empirical relation:
mean − mode ≈ 3 × (mean − median)
Midrange: The midrange is the average of the largest and smallest values in the data set.
Example: The midrange of the data 30K, 36K, 47K, 50K, 52K, 52K, 56K, 60K, 63K, 70K, 70K,
and 110K is (30,000 + 110,000)/2 = $70,000.
In a unimodal frequency curve with perfect symmetric data distribution, the mean, median, and
mode are all at the same center value, as shown in below Figure (a). Data in most real
applications are not symmetric. They may instead be either positively skewed, where the mode
occurs at a value that is smaller than the median (Figure b), or negatively skewed, where the
mode occurs at a value greater than the median (Figure c).
(ii) Measuring the Dispersion of Data: Range, Quartiles, Variance, Standard Deviation
and IQR (Inter quartile Range)
The distance between the first and third quartiles is a simple measure of spread that gives the
range covered by the middle half of the data. This distance is called the interquartile range
(IQR) and is defined as
IQR = Q3 - Q1.
Example: The data of Example 2.6 contain 12 observations, already sorted in increasing
order. Thus, the quartiles for this data are the third, sixth, and ninth values, respectively, in
the sorted list. Therefore, Q1 is $47,000 and Q3 is $63,000. Thus, the interquartile range is
IQR = 63-47 = $16,000.
Box plots: Box plots are a popular way of visualizing a distribution. A box plot incorporates
the five-number summary as follows:
Typically, the ends of the box are at the quartiles so that the box length is the inter
quartile range.
The median is marked by a line within the box.
Two lines (called whiskers) outside the box extend to the smallest (Minimum) and
largest (Maximum) observations.
Variance and Standard Deviation: Variance and standard deviation are measures of data
dispersion. They indicate how spread out a data distribution is. A low standard deviation
means that the data observations tend to be very close to the mean, while a high standard
deviation indicates that the data are spread out over a large range of values.
Example: For the given salaries information, we found mean=$58,000, Calculate the
variance and Standard deviation of the given dataset of size N=12.
Variance, σ² = (1/12)(30² + 36² + 47² + 50² + 52² + 52² + 56² + 60² + 63² + 70² + 70² + 110²) − 58²
= 379.17
Standard deviation, σ = √379.17 ≈ 19.47
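As a quick cross-check, the following Python sketch (standard library only; an illustration, not part of the original text) reproduces the basic statistical descriptions computed above for the salary data.

```python
# A minimal sketch (standard library only) that reproduces the basic
# statistical descriptions for the salary data used in the examples above.
import math
from collections import Counter

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # in $1000s
n = len(salaries)

mean = sum(salaries) / n                                       # 58.0
s = sorted(salaries)
median = (s[n // 2 - 1] + s[n // 2]) / 2 if n % 2 == 0 else s[n // 2]   # 54.0

counts = Counter(s)
max_count = max(counts.values())
modes = [v for v, c in counts.items() if c == max_count]       # [52, 70] -> bimodal

variance = sum((x - mean) ** 2 for x in s) / n                 # ~379.17 (population variance)
std_dev = math.sqrt(variance)                                  # ~19.47

q1, q3 = s[2], s[8]     # 3rd and 9th of the 12 sorted values, as in the text
iqr = q3 - q1           # 63 - 47 = 16

print(mean, median, modes, round(variance, 2), round(std_dev, 2), iqr)
```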
Quantile Plot: A quantile plot is a simple and effective way to have a first look at a univariate
data distribution. First, it displays all of the data for the given attribute (allowing the user to
assess both the overall behavior and unusual occurrences). Second, it plots quantile information.
For data x1, x2, …, xN sorted in increasing order, fi indicates that approximately 100·fi% of
the data are below or equal to the value xi, where
fi = (i − 0.5) / N
Histogram Analysis: A histogram is a graphical display of tabulated frequencies, shown as bars. If X
is numeric, the term histogram is preferred. The range of values for X is partitioned into disjoint
consecutive subranges. The subranges, referred to as buckets or bins, are disjoint subsets of the data
distribution for X. The range of a bucket is known as its width.
Dissimilarity matrix
o Stores, for n data points, only the pairwise distances d(i, j).
o It is a triangular matrix, since d(i, j) = d(j, i) and d(i, i) = 0.
o It is single mode: it contains only one kind of entity (dissimilarities).
2. Manhattan distance
Another well-known measure is the Manhattan (or city block) distance, named so because it is the
distance in blocks between any two points in a city (such as 2 blocks down and 3 blocks over for a
total of 5 blocks). It is defined as
d(i, j) = |xi1 − xj1| + |xi2 − xj2| + … + |xip − xjp|
Both the Euclidean and the Manhattan distance satisfy the following mathematical properties:
non-negativity, identity of indiscernibles, symmetry, and the triangle inequality.
Example: Euclidean distance and Manhattan distance. Let x1 = (1, 2) and x2 = (3, 5). The Euclidean
distance between the two is √((3 − 1)² + (5 − 2)²) = √13 ≈ 3.61. The Manhattan distance between the
two is 2 + 3 = 5.
3. Minkowski distance
Minkowski distance is a generalization of the Euclidean and Manhattan distances. It is defined as
d(i, j) = (|xi1 − xj1|^h + |xi2 − xj2|^h + … + |xip − xjp|^h)^(1/h)
where h ≥ 1. It reduces to the Manhattan distance when h = 1 and to the Euclidean distance when h = 2.
Cosine Similarity
A document can be represented by thousands of attributes, each recording the frequency of a
particular word (such as a keyword) or phrase in the document. The cosine similarity between two
such vectors x and y is
cos(x, y) = (x · y) / (||x|| ||y||)
where · denotes the dot product and ||x|| is the Euclidean length of vector x.
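The following Python sketch (illustrative only) implements the proximity measures discussed above and verifies the worked example with x1 = (1, 2) and x2 = (3, 5).

```python
# Illustrative sketch of the proximity measures discussed above.
import math

def minkowski(x, y, h):
    """Minkowski distance; h = 1 gives Manhattan, h = 2 gives Euclidean."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

def supremum(x, y):
    """Supremum (Chebyshev) distance: the largest difference over all attributes."""
    return max(abs(a - b) for a, b in zip(x, y))

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

x1, x2 = (1, 2), (3, 5)
print(round(minkowski(x1, x2, 2), 2))   # Euclidean  -> 3.61
print(minkowski(x1, x2, 1))             # Manhattan  -> 5.0
print(supremum(x1, x2))                 # Supremum   -> 3
print(round(cosine_similarity(x1, x2), 3))
```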
2. Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
(a) Compute the Euclidean distance between the two objects.
(b) Compute the Manhattan distance between the two objects.
(c) Compute the Minkowski distance between the two objects, using q = 3.
(d) Compute the supremum distance between the two objects.
Why Pre-process the Data: - Today's real-world databases are highly susceptible to noisy, missing,
and inconsistent data because of their huge size. Data preprocessing is done to improve the quality
of the data. Preprocessed data improve the efficiency and ease of the mining process. There are a
number of data preprocessing techniques:
1. Data cleaning can be applied to remove noise and correct inconsistencies in the data.
2. Data integration merges data from multiple sources into a single data store, such as a data
warehouse or a data cube.
3. Data transformations, such as normalization, may be applied. Normalization may improve
the accuracy and efficiency of mining algorithms.
4. Data reduction can reduce the data size by aggregating, eliminating redundant features, or
clustering.
Noisy data: -
Data are said to be noisy if attribute values are invalid or incorrect.
Noise is a random error or variance in a measured variable. Smoothing techniques are used to
"smooth" out the data and remove the noise. Some commonly used data smoothing techniques are:
1. Binning methods: - Binning methods smooth a sorted data value by consulting its "neighborhood,"
that is, the values around it. In binning, the sorted values are distributed into a number of
"buckets," or bins. Because binning methods consult the neighborhood of values, they perform local
smoothing.
Commonly used binning methods are
a) Smoothing by bin means: - In smoothing by bin means, each value in a bin is replaced
by the mean value of the bin.
b) Smoothing by bin median: - In smoothing by bin medians, each bin value is replaced
by the bin median.
c) Smoothing by bin boundaries: - In smoothing by bin boundaries, the minimum and
maximum values in a given bin are identified as the bin boundaries. Each bin value is
then replaced by the closest boundary value.
Example: - Smooth out the following prices 21, 8, 28, 4, 34, 21, 15, 25, 24.
Data for price are first sorted and then partitioned into equal-depth (equal-frequency) bins of depth 3.
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into equal-depth bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by bin means:
Bin 1: 9, 9, 9,
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29
Smoothing by bin median:
Bin 1: 8, 8, 8
Bin 2: 21, 21, 21
Bin 3: 28, 28, 28
Smoothing by bin boundaries:
Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
Figure 3.2: Binning methods for data smoothing.
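To make the procedure concrete, here is a small Python sketch (an illustrative implementation, not taken from the text) of the three smoothing variants applied to the sorted price data.

```python
# Illustrative sketch of smoothing by bin means, medians, and boundaries
# for the sorted price data, using equal-depth bins of depth 3.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
depth = 3
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

def smooth_by_means(bins):
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_medians(bins):
    return [[sorted(b)[len(b) // 2]] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        # replace each value by the closest bin boundary
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

print(smooth_by_means(bins))       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_medians(bins))     # [[8, 8, 8], [21, 21, 21], [28, 28, 28]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```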
2. Clustering: - Outliers may be detected by clustering, where similar values are organized into
groups, or clusters. Values that fall outside all clusters may be considered outliers.
3. Combined computer and human inspection: - Outliers may be identified through a combination of
computer and human inspection. This is much faster than having to manually search through the
entire database. The garbage patterns can then be removed from the database.
4. Regression: - Data can be smoothed by using regression. Linear regression involves finding the
best line to fit two variables, so that one variable can be used to predict the other. Multiple linear
regressions are an extension of linear regression, where more than two variables are involved to
predict the unknown value.
5. Inconsistent data: - Data inconsistencies may be corrected manually using external references. For
example, errors made at data entry may be corrected by performing a paper trace. Knowledge
engineering tools may also be used to detect the violation of known data constraints.
3. Data integration:-Data integration combines data from multiple sources into a single data store,
such as large database or data warehouse. Major Issues that are to be considered during data
integration are
Entity identification problem: - Sometimes customer_id in one database and cust_number in another
refer to the same entity. Data analyst or computer decides whether they both refer to the same entity
by examining the metadata of the data warehouse. Metadata is data about the data. Such metadata
can be used to avoid errors in schema integration.
Redundancy: -
Redundancy is another important issue in data integration. An attribute (such as annual revenue, for
instance) may be redundant if it can be ―derived‖ from another attribute or set of attributes.
Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data
set.
Some redundancies can be detected by correlation analysis. Given two attributes, such analysis can
measure how strongly one attribute implies the other, based on the available data. For nominal data,
we use the χ² (chi-square) test:
χ² = Σi Σj (oij − eij)² / eij
where oij is the observed frequency (i.e., actual count) of the joint event (Ai, Bj) and eij is the
expected frequency of (Ai, Bj), which can be computed as
eij = (count(A = ai) × count(B = bj)) / n
where n is the number of data tuples.
χ² (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data
distribution in the two categories):
For this 2 × 2 table, the degrees of freedom are (2 − 1) × (2 − 1) = 1. For 1 degree of freedom, the χ²
value needed to reject the hypothesis at the 0.001 significance level is 10.828. Since our computed
value is above this, we can reject the hypothesis that gender and preferred reading are independent
and conclude that the two attributes are (strongly) correlated for the given group of people.
For numeric attributes, we can evaluate the correlation between two attributes, A and B, by
computing the correlation coefficient (also known as Pearson's product moment coefficient, named
after its inventor, Karl Pearson):
rA,B = (Σ aibi − n·Ā·B̄) / (n σA σB)
where n is the number of tuples, ai and bi are the respective values of A and B in tuple i, Ā and B̄ are
the respective means of A and B, σA and σB are the respective standard deviations of A and B, and
Σ aibi is the sum of the AB cross-product.
If we compare the equation for rA,B with the equation for covariance, we see that
rA,B = Cov(A, B) / (σA σB)
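The following short Python function (an illustrative sketch; the attribute values shown are hypothetical) computes rA,B directly from two lists of numeric values.

```python
# Illustrative sketch: Pearson's correlation coefficient for two numeric attributes.
import math

def pearson_r(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    # population standard deviations, matching the 1/n convention used above
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    return cov / (sd_a * sd_b)

# hypothetical attribute values, for illustration only
a = [6, 5, 4, 3, 2]
b = [20, 10, 14, 5, 5]
r = pearson_r(a, b)
print(round(r, 3))  # r > 0: positive correlation, r < 0: negative, r = 0: none
```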
Normalization
1. Min-max normalization: Performs a linear transformation on the original data. Suppose that
minA and maxA are the minimum and maximum values of an attribute, A. Min-max normalization
maps a value, vi, of A to vi′ in the range [new_minA, new_maxA] by computing
vi′ = ((vi − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
Ex. Let income range from $12,000 to $98,000 and be normalized to [0.0, 1.0]. Then $73,600 is
mapped to (73,600 − 12,000)/(98,000 − 12,000) × (1.0 − 0) + 0 = 0.716.
2. Z-score normalization (or zero-mean normalization): The values for an attribute, A, are
normalized based on the mean (i.e., average) and standard deviation of A. A value, vi, of A is
normalized to vi′ by computing
vi′ = (vi − Ā) / σA
where Ā and σA are the mean and standard deviation of A.
Ex. Suppose that the mean and standard deviation of the values for the attribute income are $54,000
and $16,000, respectively. With z-score normalization, a value of $73,600 for income is transformed
to (73,600 − 54,000)/16,000 = 1.225.
3. Normalization by decimal scaling: Normalizes by moving the decimal point of values of attribute
A. The number of decimal points moved depends on the maximum absolute value of A. A value, vi,
of A is normalized to vi′ by computing
vi′ = vi / 10^j
where j is the smallest integer such that max(|vi′|) < 1.
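A compact Python sketch of the three normalization methods (illustrative only; the income values are those from the examples above, and the decimal-scaling values are hypothetical):

```python
# Illustrative sketch of the three normalization methods described above.
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    return (v - mean_a) / std_a

def decimal_scaling(values):
    # j is the smallest integer such that max(|v / 10**j|) < 1
    j = len(str(int(max(abs(v) for v in values))))
    return [v / 10 ** j for v in values]

print(round(min_max(73600, 12000, 98000), 3))   # 0.716
print(round(z_score(73600, 54000, 16000), 3))   # 1.225
print(decimal_scaling([-986, 917]))             # [-0.986, 0.917]  (hypothetical values)
```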
4. Data Reduction
Data reduction obtains a reduced representation of the data set that is much smaller in volume yet
produces the same (or almost the same) analytical results. Why data reduction? A database or data
warehouse may store terabytes of data, and complex data analysis may take a very long time to run
on the complete data set. The following data reduction strategies can be applied to the data.
Data reduction techniques can be applied to obtain a reduced representation of the data set that is
much smaller in volume, yet closely maintains the integrity of the original data. That is, mining on
the reduced data set should be more efficient yet produce the same (or almost the same) analytical
results. Strategies for data reduction include the following.
1. Data cube aggregation: - In Data cube aggregation, aggregation operations are applied to the
data in the construction of a data cube.
Suppose AllElectronics has its sales data per quarter for the years 1997 to 1999, as shown in
Figure (a), but management is interested in the annual sales (total per year) rather than the total per
quarter. The data can thus be aggregated so that the resulting data summarize the total sales per year
instead of per quarter.
2. Attribute subset selection: - Attribute subset selection reduces the data set size by removing
irrelevant or redundant attributes. Commonly used heuristic methods include the following.
a. Step-wise forward selection:- The procedure starts with an empty set of attributes. The best of the
original attributes is determined and added to the set. At each subsequent iteration or step, the best of
the remaining original attributes is added to the set.
b. Step-wise backward elimination:- The procedure starts with the full set of attributes. At each step,
it removes the worst attribute remaining in the set.
3. Dimensionality reduction: - In dimensionality reduction, data encoding or transformations are
applied to obtain a reduced or "compressed" representation of the original data.
a. Wavelet transforms: - The discrete wavelet transform (DWT) is a linear signal processing
technique that, when applied to a data vector D, transforms it to a numerically different vector, D′,
of wavelet coefficients. The two vectors are of the same length; the technique is useful for data
reduction because the wavelet-transformed data can be truncated, retaining only a small fraction of
the strongest coefficients.
The DWT is closely related to the discrete Fourier transform (DFT), a signal processing technique
involving sines and cosines. In general the DWT achieves better lossy compression. That is, if the
same number of coefficients are retained for a DWT and a DFT of a given data vector, the DWT
version will provide a more accurate approximation of the original data.
Popular wavelet transforms include the Daubechies-4 and the Daubechies-6 transforms. Wavelet
transforms can be applied to multidimensional data, such as a data cube. Wavelet transforms give
good results on sparse or skewed data, and data with ordered attributes.
There is only one DFT, yet there are several DWTs. The general algorithm for a discrete wavelet
transform is as follows.
1. The length, L, of the input data vector must be an integer power of two. This condition can be met
by padding the data vector with zeros, as necessary.
2. Each transform involves applying two functions. The first applies some data smoothing, such as a
sum or weighted average. The second performs a weighted difference.
3. The two functions are applied to pairs of the input data, resulting in two sets of data of length L/2.
In general, these respectively represent a smoothed version of the input data and the high-frequency
content of it.
4. The two functions are recursively applied to the sets of data obtained in the previous loop, until
the resulting data sets obtained are of desired length.
5. A selection of values from the data sets obtained in the above iterations are designated the wavelet
coefficients of the transformed data.
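As a rough illustration of steps 1 through 5, here is a toy Haar-style transform in Python. It is a simplified sketch, not the Daubechies transforms mentioned above: each pass replaces pairs of values with their pairwise averages (the smoothed version) and pairwise differences (the high-frequency detail).

```python
# Toy Haar-style discrete wavelet transform (illustrative sketch only).
# Each pass replaces pairs of values with pairwise averages (smoothing)
# and pairwise differences (detail / high-frequency content).
def haar_transform(data):
    assert len(data) & (len(data) - 1) == 0, "length must be a power of two (pad with zeros)"
    output = list(data)
    length = len(output)
    while length > 1:
        averages = [(output[i] + output[i + 1]) / 2 for i in range(0, length, 2)]
        details = [(output[i] - output[i + 1]) / 2 for i in range(0, length, 2)]
        output[:length] = averages + details
        length //= 2
    return output  # the wavelet coefficients

coeffs = haar_transform([2, 2, 0, 2, 3, 5, 4, 4])
print(coeffs)
# Keeping only the few largest-magnitude coefficients (and zeroing the rest)
# gives a lossy, compressed approximation of the original vector.
```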
4. Numerosity reduction
Can we reduce the data volume by choosing alternative, `smaller' forms of data representation?"
Techniques of numerosity reduction can indeed be applied for this purpose. These techniques may
be parametric or non-parametric.
For parametric methods, a model is used to estimate the data, so that typically only the data
parameters need be stored, instead of the actual data. (Outliers may also be stored). Log-linear
models, which estimate discrete multidimensional probability distributions, are an example. Non-
parametric methods for storing reduced representations of the data include histograms, clustering,
and sampling.
Let's have a look at each of the numerosity reduction techniques mentioned above.
6. Histograms
Histograms use binning to approximate data distributions and are a popular form of data reduction.
A histogram for an attribute A partitions the data distribution of A into disjoint subsets, or buckets.
The buckets are displayed on a horizontal axis, while the height (and area) of a bucket typically
represents the average frequency of the values represented by the bucket. If each bucket represents
only a single attribute-value/frequency pair, the buckets are called singleton buckets. Often, buckets
instead represent continuous ranges for the given attribute.
7. Clustering
Clustering techniques consider data tuples as objects. They partition the objects into groups, or
clusters, so that objects within a cluster are "similar" to one another and "dissimilar" to objects in
other clusters. Similarity is commonly defined in terms of how "close" the objects are in space, based
on a distance function. The "quality" of a cluster may be represented by its diameter, the maximum
distance between any two objects in the cluster.
8. Sampling:
Sampling can be used as a data reduction technique since it allows a large data set to be represented
by a much smaller random sample (or subset) of the data. Suppose that a large data set, D, contains
N tuples. Let's have a look at some possible samples for D.
1. Simple random sample without replacement (SRSWOR) of size n: This is created by drawing
n of the N tuples from D (n < N), where the probability of drawing any tuple in D is 1/N, i.e., all
tuples are equally likely.
2. Simple random sample with replacement (SRSWR) of size n: This is similar to SRSWOR,
except that each time a tuple is drawn from D, it is recorded and then replaced. That is, after a tuple
is drawn, it is placed back in D so that it may be drawn again.
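A small Python sketch (standard library only) of the two sampling schemes:

```python
# Illustrative sketch of simple random sampling without and with replacement.
import random

def srswor(data, n):
    """Simple random sample without replacement: each tuple drawn at most once."""
    return random.sample(data, n)

def srswr(data, n):
    """Simple random sample with replacement: a tuple may be drawn more than once."""
    return [random.choice(data) for _ in range(n)]

D = list(range(1, 101))   # a hypothetical data set of N = 100 tuple IDs
print(srswor(D, 5))       # 5 distinct tuples
print(srswr(D, 5))        # 5 tuples, possibly with repeats
```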
Exercise Problems
1. Suppose a group of 12 sales price records has been sorted as follows:
5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215.
Partition them into three bins by each of the following methods:
(a) equal-frequency (equal-depth) partitioning
(b) equal-width partitioning
(c) Smoothing by bin means, bin median and bin boundaries
2. For the attribute age: 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35,
35, 35, 36, 40, 45, 46, 52, 70.
(a) Use min-max normalization to transform the value 35 for age onto the range [0.0, 1.0].
(b) Use z-score normalization to transform the value 35 for age, where the standard deviation
of age is 12.94 years.
(c) Use normalization by decimal scaling to transform the value 35 for age.
(d) Comment on which method you would prefer to use for the given data, giving reasons
as to why.
3. . For the attribute age: 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35,
35, 35, 36, 40, 45, 46, 52, 70.
(a) Plot an equal-width histogram of width 10.
(b) Sketch examples of each of the following sampling techniques: SRSWOR, SRSWR,
cluster sampling, and stratified sampling. Use samples of size 5 and the strata
"youth," "middle-aged," and "senior."
Each technique employs a learning algorithm to identify a model that best fits the relationship
between the attribute set and class label of the input data. The model generated by a learning
algorithm should both fit the input data well and correctly predict the class labels of records it has
never seen before. Therefore, a key objective of the learning algorithm is to build models with good
generalization capability; i.e., models that accurately predict the class labels of previously unknown
records.
The following figure shows a general approach for solving classification problems.
First, a training set consisting of records whose class labels are known must be provided. The
training set is used to build a classification model, which is subsequently applied to the test set,
which consists of records with unknown class labels.
Evaluation of the performance of a classification model is based on the counts of test records
correctly and incorrectly predicted by the model. These counts are tabulated in a table known as a
confusion matrix as shown below.
Based on the entries in the confusion matrix, the total number of correct predictions made by the
model is (f11 + f00) and the total number of incorrect predictions is (f10 + f01).
Although a confusion matrix provides the information needed to determine how well a classification
model performs, summarizing this information with a single number makes it more convenient to
compare the performance of different models. This can be done using a performance metric such as
accuracy, which is defined as follows:
Accuracy = (f11 + f00) / (f11 + f10 + f01 + f00)
Equivalently, the performance of a model can be expressed in terms of its error rate, which is given
by the following equation:
Error rate = (f10 + f01) / (f11 + f10 + f01 + f00)
Most classification algorithms seek models that attain the highest accuracy, or equivalently, the
lowest error rate when applied to the test set.
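The following sketch computes accuracy and error rate from the four confusion-matrix entries; the counts used here are hypothetical.

```python
# Illustrative sketch: accuracy and error rate from a binary confusion matrix.
def accuracy(f11, f10, f01, f00):
    return (f11 + f00) / (f11 + f10 + f01 + f00)

def error_rate(f11, f10, f01, f00):
    return (f10 + f01) / (f11 + f10 + f01 + f00)

# hypothetical counts: fij = records of actual class i predicted as class j
f11, f10, f01, f00 = 40, 10, 5, 45
print(accuracy(f11, f10, f01, f00))    # 0.85
print(error_rate(f11, f10, f01, f00))  # 0.15
```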
The first question we may ask is whether the species is cold- or warm-blooded. If it is cold-blooded,
then it is definitely not a mammal. Otherwise, it is either a bird or a mammal. In the latter case, we
need to ask a follow-up question: Do the females of the species give birth to their young? Those that
do give birth are definitely mammals, while those that do not are likely to be non-mammals.
The previous example illustrates how we can solve a classification problem by asking a series of
carefully crafted questions about the attributes of the test record. Each time we receive an answer, a
follow-up question is asked until we reach a conclusion about the class label of the record. The series
of questions and their possible answers can be organized in the form of a decision tree, which is a
hierarchical structure consisting of nodes and directed edges.
The following Figure shows the decision tree for the mammal classification problem. The tree has
three types of nodes:
• A root node that has no incoming edges and zero or more outgoing edges.
DWDM III B. TECH II Semester R16 2020-21 46
• Internal nodes, each of which has exactly one incoming edge and two or more outgoing edges.
• Leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing edges.
In a decision tree, each leaf node is assigned a class label. The non-terminal nodes, which include the
root and other internal nodes, contain attribute test conditions to separate records that have different
characteristics.
For example, the root node shown in above figure uses the attribute body Temperature to separate
warm-blooded from cold-blooded vertebrates. Since all cold-blooded vertebrates are non-mammals,
a leaf node labeled Non-mammals is created as the right child of the root node. If the vertebrate is
warm-blooded, a subsequent attribute, Gives Birth, is used to distinguish mammals from other
warm-blooded creatures, which are mostly birds.
Classifying a test record is straightforward once a decision tree has been constructed. Starting from
the root node, we apply the test condition to the record and follow the appropriate branch based on
the outcome of the test.
Step 1: If all the records in Dt belong to the same class yt, then t is a leaf node labeled as yt.
Step 2: If Dt contains records that belong to more than one class, an attribute test condition is
selected to partition the records into smaller subsets. A child node is created for each outcome of the
test condition and the records in Dt are distributed to the children based on the outcomes. The
algorithm is then recursively applied to each child node.
In the example shown in the following Figure, each record contains the personal information of a
borrower along with a class label indicating whether the borrower has defaulted on loan payments.
Figure: Training set for predicting borrowers who will default on loan payments.
The initial tree for the classification problem contains a single node with class label Defaulted = No
(see following Figure (a)), which means that most of the borrowers successfully repaid their loans.
The tree, however, needs to be refined since the root node contains records from both classes. The
records are subsequently divided into smaller subsets based on the outcomes of the Home Owner
test condition, as shown in following Figure (b). Hunt's algorithm is then applied recursively to each
child of the root node. From the training set given in above Figure, notice that all borrowers who are
homeowners successfully repaid their loans. The left child of the root is, therefore, a leaf node
labeled Defaulted = No (see following Figure (b)). For the right child, we need to continue applying
the recursive step of Hunt's algorithm until all the records belong to the same class. The trees
resulting from each recursive step are shown in following Figures (c) and (d).
Continuous Attributes: For continuous attributes, the test condition can be expressed as a
comparison test (A < v) or (A ≥ v) with binary outcomes, or a range query with outcomes of the
form vi ≤ A < vi+1, for i = 1, …, k. The difference between these approaches is shown in the
following Figure.
The expected information needed to classify a given sample is
I(s1, s2, …, sm) = −Σi (si/s) log2(si/s)
where si is the number of samples in class Ci and s is the total number of samples. The entropy, or
expected information based on the partitioning into subsets by A, is given by
E(A) = Σj ((s1j + s2j + … + smj)/s) × I(s1j, s2j, …, smj)
The smaller the entropy value is, the greater the purity of the subset partitions. The encoding
information that would be gained by branching on A is
Gain(A) = I(s1, s2, …, sm) − E(A)
Example: Induction of a decision tree. The following Table presents a training set of data tuples
taken from the AllElectronics customer database. (The data are adapted from [Quinlan 1986b].) The
class label attribute, buys_computer, has two distinct values (namely {yes, no}); therefore, there are
two distinct classes (m = 2). Let class C1 correspond to yes and class C2 correspond to no. There
are 9 samples of class yes and 5 samples of class no. To compute the information gain of each
attribute, we first use the first equation to compute the expected information needed to classify a
given sample. This is
I(s1, s2) = I(9, 5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
Next, we need to compute the entropy of each attribute. Let's start with the attribute age. We need to
look at the distribution of yes and no samples for each value of age and compute the expected
information for each of these distributions. For age = "<=30" there are 2 yes and 3 no samples, for
age = "31…40" there are 4 yes and 0 no samples, and for age = ">40" there are 3 yes and 2 no
samples. Hence
E(age) = (5/14) I(2, 3) + (4/14) I(4, 0) + (5/14) I(3, 2) = 0.694
and Gain(age) = I(9, 5) − E(age) = 0.940 − 0.694 = 0.246.
Similarly, we can compute Gain (income) = 0.029, Gain(student) = 0.151, and Gain(credit rating) =
0.048. Since age has the highest information gain among the attributes, it is selected as the test
attribute. A node is created and labeled with age, and branches are grown for each of the attribute's
values. The samples are then partitioned accordingly, as shown in following Figure.
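A minimal Python sketch of the entropy and information-gain computation above; the per-branch (yes, no) counts for age are those assumed from the standard AllElectronics example.

```python
# Illustrative sketch: expected information (entropy) and information gain.
import math

def info(counts):
    """I(s1, ..., sm): expected information for a class-count distribution."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain(parent_counts, partitions):
    """Information gain of splitting parent_counts into the given partitions."""
    total = sum(parent_counts)
    expected = sum(sum(p) / total * info(p) for p in partitions)
    return info(parent_counts) - expected

parent = [9, 5]                         # 9 "yes" and 5 "no" samples
age_partitions = [[2, 3], [4, 0], [3, 2]]   # per-branch (yes, no) counts for age
print(round(info(parent), 3))               # 0.940
print(round(gain(parent, age_partitions), 3))
# 0.247 (the text quotes 0.246 because it rounds intermediate values)
```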
The impurity of a node t can also be measured as
Entropy(t) = −Σi p(i|t) log2 p(i|t)
Gini(t) = 1 − Σi [p(i|t)]²
Classification error(t) = 1 − maxi [p(i|t)]
where the sums run over the c classes, p(i|t) denotes the fraction of records belonging to class i at
node t, and 0 log2 0 = 0 in entropy calculations. We provide several examples of computing the
different impurity measures.
Similarly, for the second binary grouping of {Sports} and {Family, Luxury}, the weighted average
Gini index is 0.167. The second grouping has a lower Gini index because its corresponding subsets
are much purer.
For the multiway split, the Gini index is computed for every attribute value. Since Gini({Family}) =
0.375, Gini({Sports}) = 0, and Gini({Luxury}) = 0.219, the overall Gini index for the multiway
split is equal to 4/20 × 0.375 + 8/20 × 0 + 8/20 × 0.219 = 0.163.
The multiway split has a smaller Gini index compared to both two-way splits.
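A short sketch of the Gini computations for the multiway split; the per-value (class 0, class 1) counts are the ones assumed to reproduce the Gini values quoted above.

```python
# Illustrative sketch: Gini index of a node and of a multiway split.
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_gini(partitions):
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)

# assumed per-value (class 0, class 1) counts for the Car Type attribute
family, sports, luxury = [1, 3], [8, 0], [1, 7]
print(gini(family))                                      # 0.375
print(gini(sports))                                      # 0.0
print(round(gini(luxury), 3))                            # 0.219
print(round(weighted_gini([family, sports, luxury]), 4)) # 0.1625, i.e., ~0.163
```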
1. The createNode() function extends the decision tree by creating a new node. A node in the
decision tree either has a test condition, denoted as node.test_cond, or a class label, denoted as
node.label.
2. The find_best_split() function determines the attribute test condition for partitioning the training
instances associated with a node. The splitting attribute chosen depends on the impurity measure
used. The popular measures include entropy and the Gini index.
3. The Classify() function determines the class label to be assigned to a leaf node. For each leaf
node t, let p(i|t) denote the fraction of training instances from class i associated with the node t.
The label assigned to the leaf node is typically the class with the largest fraction, i.e.,
leaf.label = argmax_i p(i|t).
4. The stopping_cond() function is used to terminate the tree-growing process by checking whether
all the instances have identical class label or attribute values. Since decision tree classifiers employ a
top-down, recursive partitioning approach for building a model, the number of training instances
associated with a node decreases as the depth of the tree increases.
(a) Compute the Gini index for the overall collection of training examples.
(b) Compute the Gini index for the Customer ID attribute.
(c) Compute the Gini index for the Gender attribute
(d) Compute the Gini index for the Car Type attribute using multiway split.
(e) Compute the Gini index for the Shirt Size attribute using multiway split.
(f) Which attribute is better, Gender, Car Type, or Shirt Size?
(g) Explain why Customer ID should not be used as the attribute test condition even though it has the
lowest Gini.
2. Naive Bayesian Classification: The naive Bayesian classifier, or simple Bayesian classifier, works
as follows:
1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is represented
by an n-dimensional attribute vector, X = (x1, x2, …, xn), depicting n measurements made on the
tuple from n attributes, respectively, A1, A2, …, An.
2. Suppose that there are m classes, C1, C2, …, Cm. Given a tuple, X, the classifier will predict that X
belongs to the class having the highest posterior probability, conditioned on X. That is, the naive
Bayesian classifier predicts that tuple X belongs to the class Ci if and only if
P(Ci|X) > P(Cj|X) for j ≠ i.
Thus, we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the maximum
posteriori hypothesis. By Bayes' theorem,
P(Ci|X) = P(X|Ci) P(Ci) / P(X).
3. As P(X) is constant for all classes, only P(X|Ci)P(Ci) needs to be maximized. If the class prior
probabilities are not known, then it is commonly assumed that the classes are equally likely, that is,
P(C1) = P(C2) = … = P(Cm), and we would therefore maximize P(X|Ci).
4. Given data sets with many attributes, it would be extremely computationally expensive to compute
P(X|Ci). To reduce computation in evaluating P(X|Ci), the naive assumption of class-conditional
independence is made, so that
P(X|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci).
5. To predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class Ci. The classifier predicts
that the class label of tuple X is the class Ci if and only if
P(X|Ci)P(Ci) > P(X|Cj)P(Cj) for 1 ≤ j ≤ m, j ≠ i.
In other words, the predicted class label is the class Ci for which P(X|Ci)P(Ci) is the maximum.
“How effective are Bayesian classifiers?” Various empirical studies of this classifier in comparison to
decision tree and neural network classifiers have found it to be comparable in some domains. In
theory, Bayesian classifiers have the minimum error rate in comparison to all other classifiers.
Problem: Players will play if the weather is sunny. Is this statement correct? We can check it using
the above method of computing the posterior probability.
P(Yes | Sunny) = P(Sunny | Yes) × P(Yes) / P(Sunny)
Here we have P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, and P(Yes) = 9/14 = 0.64.
Now, P(Yes | Sunny) = 0.33 × 0.64 / 0.36 ≈ 0.60, which is the higher posterior probability, so the
prediction is that the players will play when the weather is sunny.
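The same posterior computation as a tiny Python sketch (the counts are those assumed in the example: 9 of the 14 days labeled Yes, 5 sunny days, 3 of them labeled Yes):

```python
# Illustrative sketch: posterior probability P(Yes | Sunny) via Bayes' theorem.
p_sunny_given_yes = 3 / 9    # P(Sunny | Yes)
p_yes = 9 / 14               # P(Yes)
p_sunny = 5 / 14             # P(Sunny)

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))   # 0.6 -> "play" is the more probable class
```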
Figure: Simple Bayesian belief network. (a) A proposed causal model, represented by a directed
acyclic graph. (b) The conditional probability table for the values of the variable LungCancer (LC)
showing each possible combination of the values of its parent nodes, FamilyHistory (FH) and Smoker
(S).
The arcs in Figure (a) allow a representation of causal knowledge. For example, having lung cancer is
influenced by a person‘s family history of lung cancer, as well as whether or not the person is a
smoker. Note that the variable PositiveXRay is independent of whether the patient has a family history
of lung cancer or is a smoker, given that we know the patient has lung cancer. In other words, once we
know the outcome of the variable LungCancer, then the variables FamilyHistory and Smoker do not
provide any additional information regarding PositiveXRay. The arcs also show that the variable
LungCancer is conditionally independent of Emphysema, given its parents, FamilyHistory and
Smoker.
Exercise Problem
1. Consider a binary classification problem with the following set of attributes and attribute values:
• Air Conditioner = {Working, Broken}, • Engine = {Good, Bad}, • Mileage = {High, Medium, Low}
• Rust = {Yes, No}
Suppose a rule-based classifier produces the following rule set:
Mileage = High →Value = Low
Mileage = Low →Value = High
Air Conditioner = Working, Engine = Good →Value = High
Air Conditioner = Working, Engine = Bad → Value = Low
Air Conditioner = Broken → Value = Low
(a) Are the rules mutually exclusive? Answer: No
(b) Is the rule set exhaustive? Answer: Yes
(c) Is ordering needed for this set of rules?
Answer: Yes because a test instance may trigger more than one rule.
(d) Do you need a default class for the rule set?
Answer: No because every instance is guaranteed to trigger at least one rule.
(a) Estimate the conditional probabilities for P (A|+), P (B|+), P (C|+), P (A|−), P (B|−), and P (C|−).
Answer:
P (A = 1|−) = 2/5 = 0.4, P (B = 1|−) = 2/5 = 0.4,
P(C = 1|−) = 1, P (A = 0|−) = 3/5 = 0.6,
P (B = 0|−) = 3/5 = 0.6, P(C = 0|−) = 0; P (A = 1|+) = 3/5 = 0.6,
P (B = 1|+) = 1/5 = 0.2, P(C = 1|+) = 2/5 = 0.4,
P (A = 0|+) = 2/5 = 0.4, P (B = 0|+) = 4/5 = 0.8,
P (C = 0|+) = 3/5 = 0.6.
UNIT-V
Association analysis: problem definition, frequent itemset generation: The Apriori principle,
frequent itemset generation in the Apriori algorithm, candidate generation and pruning, support
counting, rule generation, compact representation of frequent itemsets, FP-Growth algorithms.
1. Association analysis:
Association rule mining is a popular and well researched method for discovering interesting
relations between variables in large databases. It is intended to identify strong rules discovered in
databases using different measures of interestingness. Given a set of transactions, find rules that
will predict the occurrence of an item based on the occurrences of other items in the transaction.
The uncovered relationships can be represented in the form of association rules or sets of frequent
items. The following table illustrates an example of Market Basket Transactions.
TID Items
1 {Bread, Milk}
2 {Bread, Diaper, Butter, Eggs}
3 {Milk, Diaper, Butter, Cola}
4 {Bread, Milk, Diaper, Butter }
5 {Bread, Milk, Diaper, Cola}
Table. An example of market basket transactions
Problem Definition: Definition: The problem of association rule mining is defined as:
Let I = {i1, i2, …, in} be a set of n binary attributes called items.
Let D = {t1, t2, …, tm} be a set of transactions called the database.
Each transaction in D has a unique transaction ID and contains a subset of the items in I. Binary
Representation Market basket data can be represented in a binary format as shown in the following
Table, where each row corresponds to a transaction and each column corresponds to an item. An item
can be treated as a binary variable whose value is one if the item is present in a transaction and
zero otherwise. Because the presence of an item in a transaction is often considered more important
than its absence, an item is an asymmetric binary variable.
In association analysis, a collection of zero or more items is termed an itemset. If an itemset contains k
items, it is called a k-itemset. For instance, {Butter, Diapers, Milk} is an example of a 3-item set. The
null (or empty) set is an item set that does not contain any items. The transaction width is defined as
the number of items present in a transaction.
Example:
Consider the rule {Milk, Diapers} → {Butter}. The support of a rule X → Y is σ(X ∪ Y)/N and its
confidence is σ(X ∪ Y)/σ(X), where σ(·) denotes the support count and N the number of transactions.
Since the support count for {Milk, Diapers, Butter} is 2 and the total number of transactions is 5, the
rule's support is 2/5 = 0.4.
The rule's confidence is obtained by dividing the support count for {Milk, Diapers, Butter} by the
support count for {Milk, Diapers}. Since there are 3 transactions that contain Milk and Diapers, the
confidence for this rule is 2/3 = 0.67.
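A small Python sketch that computes these two measures for the market basket transactions listed above:

```python
# Illustrative sketch: support and confidence of an association rule X -> Y.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Butter", "Eggs"},
    {"Milk", "Diaper", "Butter", "Cola"},
    {"Bread", "Milk", "Diaper", "Butter"},
    {"Bread", "Milk", "Diaper", "Cola"},
]

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

def support(x, y):
    return support_count(x | y) / len(transactions)

def confidence(x, y):
    return support_count(x | y) / support_count(x)

X, Y = {"Milk", "Diaper"}, {"Butter"}
print(support(X, Y))                 # 0.4
print(round(confidence(X, Y), 2))    # 0.67
```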
If the itemset is infrequent, then all six candidate rules can be pruned immediately without having to
compute their confidence values. Therefore, a common strategy adopted by many association rule
mining algorithms is to decompose the problem into two major subtasks:
1. Frequent Itemset Generation, whose objective is to find all the itemsets that satisfy the minsup
threshold, these item sets are called frequent itemsets.
2. Rule Generation, whose objective is to extract all the high-confidence rules from the frequent
itemsets found in the previous step. These rules are called strong rules.
A brute-force approach for finding frequent itemsets is to determine the support count for every
candidate itemset in the lattice structure. To do this, we need to compare each candidate against every
transaction, an operation that is shown in following Figure. If the candidate is contained in a
transaction, its support count will be incremented. For example, the support for {Bread, Milk} is
incremented three times because the itemset is contained in transactions 1, 4, and 5.Such an approach
can be very expensive because it requires O(NMw) comparisons, where N is the number of
transactions, M = 2^k − 1 is the number of candidate itemsets (for k items), and w is the maximum
transaction width.
There are several ways to reduce the computational complexity of frequent item set generation
1. Reduce the number of candidate itemsets (M).
2. Reduce the number of comparisons.
The following Figure provides a high-level illustration of the frequent itemset generation part of the
Apriori algorithm for the transactions shown in the first Table. We assume that the support threshold
is 60%, which is equivalent to a minimum support count equal to 3 (since there are 5 transactions).
Initially, every item is considered as a candidate l-itemset. After counting their supports, the candidate
itemsets {Cola} and {Eggs} are discarded because they appear in fewer than three transactions. In the
next iteration, candidate 2-itemsets are generated using only the frequent l-itemsets because the
Apriori principle ensures that all supersets of the infrequent l-itemsets must be infrequent. Because
there are only four frequent l-itemsets, the number of candidate 2-itemsets generated by the algorithm
is 6. Two of these six candidates, {Butter, Bread} and {Butter, Milk}, are subsequently found to be
infrequent after computing their support values. The remaining four candidates are frequent, and thus
will be used to generate candidate 3-itemsets. Without support-based pruning, there are 20 candidate
3-itemsets that can be formed using the six items given in this example. With the Apriori principle, we
only need to keep candidate 3-itemsets whose subsets are frequent. The only candidate that has this
property is {Bread, Diapers, Milk}.
The effectiveness of the Apriori pruning strategy can be shown by counting the number of candidate
itemsets generated. A brute-force strategy of enumerating all itemsets (up to size 3) as candidates will
produce 6 + 15 + 20 = 41 candidates. With the Apriori principle, this number decreases to
6 + 6 + 1 = 13 candidates, which represents a 68% reduction in the number of candidate itemsets even
in this simple example. The pseudo code for the frequent itemset generation part of the Apriori
algorithm is shown in the Algorithm.
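For illustration, here is a compact and deliberately unoptimized Python sketch of Apriori-style frequent itemset generation for the five transactions above, with a minimum support count of 3; it is a sketch of the idea, not the textbook pseudo code.

```python
# Illustrative, unoptimized sketch of Apriori frequent itemset generation.
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Butter", "Eggs"},
    {"Milk", "Diaper", "Butter", "Cola"},
    {"Bread", "Milk", "Diaper", "Butter"},
    {"Bread", "Milk", "Diaper", "Cola"},
]
min_count = 3

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

# frequent 1-itemsets
items = sorted({i for t in transactions for i in t})
frequent = [{frozenset([i]) for i in items if support_count(frozenset([i])) >= min_count}]

k = 1
while frequent[-1]:
    prev = frequent[-1]
    # candidate generation: merge frequent k-itemsets, then prune by the Apriori principle
    candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
    candidates = {c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, k))}
    # support counting
    frequent.append({c for c in candidates if support_count(c) >= min_count})
    k += 1

for level, itemsets in enumerate(frequent, start=1):
    for s in sorted(map(sorted, itemsets)):
        print(level, s)
# Frequent 1-itemsets: Bread, Butter, Diaper, Milk; frequent 2-itemsets: 4 pairs.
# The only candidate 3-itemset whose subsets are all frequent is {Bread, Diaper, Milk},
# but its support count (2) is below 3, so no frequent 3-itemsets result.
```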
Brute-Force Method: The brute-force method considers every k-itemset as a potential candidate and
then applies the candidate pruning step to remove any unnecessary candidates.
Fk-1 × F1 Method: An alternative method for candidate generation is to extend each frequent
(k − 1)-itemset with other frequent items. The following Figure illustrates how a frequent 2-itemset
such as {Butter, Diapers} can be augmented with a frequent item such as Bread to produce a
candidate 3-itemset {Butter, Diapers, Bread}.
Figure: Generating and pruning candidate k itemsets by merging a frequent (k - 1)-itemset with a
frequent item.
Note that some of the candidates are unnecessary because their subsets are infrequent. While this
procedure is a substantial improvement over the brute-force method, it can still produce a large
number of unnecessary candidates. For example, the candidate itemset obtained by merging {Butter,
Diapers} with {Milk} is unnecessary because one of its subsets, {Butter, Milk}, is infrequent.
Fk-1 × Fk-1 Method: The candidate generation procedure in the apriori-gen function merges a pair of
frequent (k − 1)-itemsets only if their first k − 2 items are identical. Let A = {a1, a2, …, ak-1} and
B = {b1, b2, …, bk-1} be a pair of frequent (k − 1)-itemsets. A and B are merged if they satisfy the
following conditions:
ai = bi for i = 1, 2, …, k − 2, and ak-1 ≠ bk-1.
In the above Figure the frequent itemsets {Bread, Diapers} and {Bread, Milk} are merged to form a
candidate 3-itemset {Bread, Diapers, Milk}. The algorithm does not have to merge {Butter, Diapers}
with {Diapers, Milk} because the first item in both itemsets is different. Indeed, if {Butter, Diapers,
Milk} is a viable candidate, it would have been obtained by merging {Butter, Diapers} with {Butter,
Milk} instead.
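As a rough illustration of this merge condition (a sketch, not the book's apriori-gen code), assume each frequent (k-1)-itemset is stored as a sorted tuple of items:

def merge_candidates(freq_k_minus_1, k):
    # Merge two frequent (k-1)-itemsets only if their first k-2 items are identical
    # and their last items differ; the union is then a candidate k-itemset.
    itemsets = sorted(freq_k_minus_1)      # each element is a sorted (k-1)-tuple
    candidates = []
    for i in range(len(itemsets)):
        for j in range(i + 1, len(itemsets)):
            a, b = itemsets[i], itemsets[j]
            if a[:k - 2] == b[:k - 2] and a[k - 2] != b[k - 2]:
                candidates.append(tuple(sorted(set(a) | set(b))))
    return candidates

For example, with k = 3 the pair ('Bread', 'Diapers') and ('Bread', 'Milk') is merged into the candidate ('Bread', 'Diapers', 'Milk'), while ('Butter', 'Diapers') and ('Diapers', 'Milk') are not merged because their first items differ.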
5. Support Counting
Support counting determines the frequency of occurrence of every candidate itemset that survives the
candidate pruning step. One approach is to compare each transaction against every candidate itemset and
update the support counts of candidates contained in the transaction. This approach is computationally
expensive, especially when the numbers of transactions and candidate itemsets are large.
An alternative approach is to enumerate the itemsets contained in each transaction and use them to
update the support counts of their respective candidate itemsets. To illustrate, consider a transaction t
that contains five items, {1, 2, 3, 5, 6}. There are 10 itemsets of size 3 contained in this transaction.
The following Figure shows a systematic way of enumerating the 3-itemsets contained in t, assuming
that each itemset keeps its items in increasing lexicographic order.
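One simple way to realize this enumeration (a sketch, not the textbook's hash-tree technique) is to generate the size-3 subsets of each transaction in lexicographic order and look each one up in a dictionary of candidate counts; the names below are illustrative.

from itertools import combinations

def count_supports(transactions, candidate_counts, k=3):
    # candidate_counts: dict mapping a sorted k-tuple of items to its current count
    for t in transactions:
        for subset in combinations(sorted(t), k):   # lexicographic k-subsets of t
            if subset in candidate_counts:
                candidate_counts[subset] += 1
    return candidate_counts

For t = {1, 2, 3, 5, 6}, combinations(sorted(t), 3) yields exactly the 10 subsets (1,2,3), (1,2,5), (1,2,6), (1,3,5), (1,3,6), (1,5,6), (2,3,5), (2,3,6), (2,5,6), and (3,5,6).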
6. Rule Generation
An association rule can be extracted by partitioning a frequent itemset Y into two non-empty subsets, X and
Y - X, such that the rule X → Y - X satisfies the confidence threshold.
Example: Let X = {1, 2, 3} be a frequent itemset. There are six candidate association rules that can be
generated from X: {1,2} → {3}, {1,3} → {2}, {2,3} → {1}, {1} → {2,3}, {2} → {1,3}, and {3} →
{1,2}. As the support of each rule is identical to the support of X, all of these rules automatically satisfy
the support threshold.
Theorem: If a rule X → Y - X does not satisfy the confidence threshold, then any rule X' → Y - X',
where X' is a subset of X, must not satisfy the confidence threshold as well.
For example, if {acd} → {b} and {abd} → {c} are high-confidence rules, then the candidate rule {ad}
→ {bc} is generated by merging the consequents of both rules. The following Figure shows a lattice
structure for the association rules generated from the frequent itemset {a, b, c, d}. If any node in the
lattice has low confidence, then according to the above Theorem, the entire subgraph spanned by the node
can be pruned immediately. Suppose the confidence for {bcd} → {a} is low. All the rules containing
item a in their consequent, including {cd} → {ab}, {bd} → {ac}, {bc} → {ad}, and {d} → {abc}, can be
discarded.
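A minimal sketch of confidence-based rule generation from a single frequent itemset is shown below; it simply enumerates all antecedents, whereas the actual Apriori rule generation additionally exploits the theorem above to skip rules that cannot pass the threshold. A lookup table of support counts for all frequent itemsets (support_count) is assumed to be available.

from itertools import combinations

def rules_from_itemset(itemset, support_count, minconf):
    # itemset: a frozenset known to be frequent
    # support_count: dict mapping frozensets to their support counts
    rules = []
    for r in range(1, len(itemset)):                   # size of the antecedent X
        for antecedent in combinations(itemset, r):
            X = frozenset(antecedent)
            conf = support_count[itemset] / support_count[X]
            if conf >= minconf:
                rules.append((X, itemset - X, conf))   # rule X -> (Y - X)
    return rules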
Closed Frequent Itemsets: An itemset X is closed if none of its immediate supersets has exactly the
same support count as X.
Examples of closed itemsets are shown in following Figure. To better illustrate the support count of
each itemset, we have associated each node (itemset) in the lattice with a list of its corresponding
transaction IDs.
An itemset is a closed frequent itemset if it is closed and its support is greater than or equal to minsup.
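As a small illustration (assuming support counts for all frequent itemsets are available; the names are illustrative), closedness can be checked by comparing an itemset's count with the counts of its immediate supersets:

def is_closed(itemset, support_count, all_items):
    # Closed: no immediate superset has exactly the same support count
    s = support_count[itemset]
    for item in all_items - itemset:
        if support_count.get(itemset | frozenset([item]), 0) == s:
            return False
    return True

def is_closed_frequent(itemset, support_count, all_items, minsup_count):
    return support_count[itemset] >= minsup_count and is_closed(itemset, support_count, all_items)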
8. FP-Growth Algorithm
This section presents an alternative algorithm called FP-growth that takes a radically different
approach to discovering frequent itemsets. The algorithm does not subscribe to the generate-and-test
paradigm of Apriori. Instead, it encodes the data set using a compact data structure called an FP-tree
and extracts frequent itemsets directly from this structure.
FP-Tree Representation
An FP-tree is a compressed representation of the input data. It is constructed by reading the data set
one transaction at a time and mapping each transaction onto a path in the FP-tree. As different
transactions can have several items in common, their paths may overlap. The more the paths overlap
with one another, the more compression we can achieve using the FP-tree structure.
1. After reading the first transaction, {a, b}, the nodes labeled a and b are created. A path is then
formed from null → a → b to encode the transaction. Every node along the path has a frequency count
of 1.
2. After reading the second transaction, {b, c, d}, a new set of nodes is created for items b, c, and d. A
path is then formed to represent the transaction by connecting the nodes null → b → c → d.
3. This process continues until every transaction has been mapped onto one of the paths given in the
FP-tree. The resulting FP-tree after reading all the transactions is shown at the bottom of Figure.
FP-growth finds all the frequent itemsets ending with a particular suffix by employing a divide-and-
conquer strategy to split the problem into smaller subproblems.
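The FP-tree figures are not reproduced here; the following is a minimal Python sketch of how transactions are mapped onto (possibly overlapping) paths of an FP-tree. The class and field names are illustrative, and the header-table links used by the mining phase are omitted.

class FPNode:
    def __init__(self, item, parent=None):
        self.item = item            # item label ('null' for the root)
        self.count = 1              # number of transactions passing through this node
        self.parent = parent
        self.children = {}          # item -> FPNode

def build_fp_tree(transactions, item_order):
    # item_order: frequent items sorted by decreasing support,
    # used to order the items of every transaction consistently
    root = FPNode('null')
    root.count = 0
    for t in transactions:
        node = root
        for item in [i for i in item_order if i in t]:
            if item in node.children:               # shared prefix: increment count
                node.children[item].count += 1
            else:                                   # new branch of the tree
                node.children[item] = FPNode(item, parent=node)
            node = node.children[item]
    return root

Reading {a, b} first creates the path null → a → b with counts of 1; reading {b, c, d} next adds a separate branch null → b → c → d, as described in steps 1 and 2 above.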
Exercise Problems
1. Consider the data set shown in Table
Customer_ID Transaction_ID Items_Bought
1 1 {a, d, e}
1 24 {a, b, c, e}
2 12 {a, b, d, e}
2 31 {a, c, d, e}
3 15 {b, c, e}
3 22 {b, d, e}
4 29 {c, d}
4 40 {a, b, c}
5 33 {a, d, e}
5 38 {a, b, e}
(a) Compute the support for itemsets {e}, {b, d}, and {b, d, e} by treating each transaction ID as a
market basket. Answer: s({e}) = 8/10 = 0.8, s({b, d}) = 2/10 = 0.2, s({b, d, e}) = 2/10 = 0.2
(b) Use the results in part (a) to compute the confidence for the association rules {b, d} → {e}
and {e} → {b, d}. Is confidence a symmetric measure?
Answer: c(bd → e) = 0.2/0.2 = 100%, c(e → bd) = 0.2/0.8 = 25%. No, confidence is not a symmetric
measure, as the two values show.
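The arithmetic above can be reproduced with a short script; the transaction list is transcribed from the table, and the function names are illustrative.

# Transactions from the table above, keyed by Transaction_ID
transactions = {
    1: {'a', 'd', 'e'},        24: {'a', 'b', 'c', 'e'},
    12: {'a', 'b', 'd', 'e'},  31: {'a', 'c', 'd', 'e'},
    15: {'b', 'c', 'e'},       22: {'b', 'd', 'e'},
    29: {'c', 'd'},            40: {'a', 'b', 'c'},
    33: {'a', 'd', 'e'},       38: {'a', 'b', 'e'},
}

def support(itemset):
    return sum(itemset <= t for t in transactions.values()) / len(transactions)

def confidence(lhs, rhs):
    return support(lhs | rhs) / support(lhs)

print(support({'e'}), support({'b', 'd'}), support({'b', 'd', 'e'}))  # 0.8 0.2 0.2
print(confidence({'b', 'd'}, {'e'}), confidence({'e'}, {'b', 'd'}))   # 1.0 0.25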
2. Different Types of Clustering: A clustering is a set of clusters. There are two types of clustering
techniques, hierarchical clustering and partitional clustering. The important distinction between
hierarchical and partitional sets of clusters is:
● Partitional Clustering – A division of data objects into non-overlapping subsets (clusters) such that
each data object is in exactly one subset
● Hierarchical Clustering – A set of nested clusters organized as a hierarchical tree
Example:
The quality of a partitional (K-means) clustering is commonly evaluated with the Sum of Squared Error (SSE):
SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist^2(m_i, x)
where x is a data point in cluster Ci and mi is the representative point for cluster Ci
o One can show that mi corresponds to the center (mean) of the cluster
Given two sets of clusters, we prefer the one with the smallest error
One easy way to reduce SSE is to increase K, the number of clusters
o A good clustering with smaller K can have a lower SSE than a poor
clustering with higher K
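A direct computation of the SSE defined above is sketched here; clusters are assumed to be lists of points (tuples of coordinates) and centroids their corresponding representative points.

def sse(clusters, centroids):
    # clusters[i]: list of points assigned to cluster Ci; centroids[i]: its centroid mi
    total = 0.0
    for points, m in zip(clusters, centroids):
        for x in points:
            total += sum((xj - mj) ** 2 for xj, mj in zip(x, m))  # squared distance to mi
    return total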
Limitations of K-means
K-means has problems when clusters are of differing
o Sizes
o Densities
o Non-globular shapes
K-means has problems when the data contains outliers.
(a) Select an initial partition of k clusters containing randomly chosen samples and compute
their centroids. Say one selects two clusters and assigns the samples as C1 = {x1, x2, x4} and C2 =
{x3, x5}. Next, the centroids of the two clusters are determined:
(c) Generate a new partition by assigning each sample to the closest cluster center
For example, the distance of x1 from the centroid M1 is d(M1, x1) = (1.66² + 1.34²)^(1/2) = 2.14,
while that for d(M2, x1) = 3.40. Thus, object x1 will be assigned to the group which has the
smaller distance, namely C1. Similarly, one can compute distance measures of all other objects,
and assign each object as shown in Table.
(d) Compute new cluster centers as centroids of the clusters. The new cluster centers are M1 =
{0.5, 0.67} and M2 = {5.0, 1.0}
(e) Repeat steps (c) and (d) until an optimum value is found or until the cluster membership stabilizes
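The data set and intermediate tables of this example are not reproduced here; the following is a minimal K-means sketch over user-supplied points that follows the same steps: assign each sample to the nearest centroid, recompute the centroids, and repeat until the membership stabilizes.

import math

def kmeans(points, centroids, max_iters=100):
    # points, centroids: lists of coordinate tuples; centroids are the initial centers
    for _ in range(max_iters):
        # Assign each sample to the closest cluster center (step (c) above)
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [math.dist(p, m) for m in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Recompute cluster centers as centroids of the new clusters (step (d))
        new_centroids = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else m
            for c, m in zip(clusters, centroids)
        ]
        if new_centroids == centroids:   # membership (and centers) have stabilized
            break
        centroids = new_centroids
    return centroids, clusters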
3. Suppose frequent itemsets are to be mined under each of the following constraints. State the type of
each constraint and explain how such patterns can be mined efficiently.
Answer:
(a) Containing at least one Nintendo game
The constraint is succinct and monotonic. This constraint can be mined efficiently using
FP-growth as follows.
• All frequent Nintendo games are listed at the end of the list of frequent items L.
• Only those conditional pattern bases and FP-trees for frequent Nintendo games need to be
derived from the global FP-tree and mined recursively.
(b) Containing items the sum of whose prices is less than $150
The constraint is antimonotonic. This constraint can be mined efficiently using Apriori as follows.
Only candidates the sum of whose prices is less than $150 need to be checked.
(c) Containing one free item and other items the sum of whose prices is at least $200
The constraint is monotonic. (Or, the sub-constraints "containing one free item" and "the sum of
whose prices is at least $200" are succinct and monotonic, respectively.) This constraint can be mined
efficiently using FP-growth as follows.
• Put all frequent free items at the end of the list of frequent items L.
• Only conditional pattern bases and FP-trees for frequent free items need to be derived from
the global FP-tree and mined recursively. Other free items should be excluded from these
conditional pattern bases and FP-trees.
• Once a pattern with items the sum of whose prices is at least $200 is found, no further constraint
checking for total price is needed in recursive mining.
• A pattern as well as its conditional pattern base can be pruned if the sum of the price of
items in the pattern and the frequent ones in the pattern base is less than $200.
(d) Where the average price of all the items is between $100 and $500
The constraint is nonconvertible. (Or, the sub-constraints "the average price is at least $100"
and "the average price is at most $500" are convertible.) This constraint can be mined
efficiently using FP-growth as follows.
• All frequent items are listed in price descending order. (If you use ascending order, you must
rewrite the following two steps.)
• A pattern as well as its conditional pattern base can be pruned if the average price of items in
the pattern and those frequent ones in the pattern base with prices greater than $100 is less than
$100.
• A pattern as well as its conditional pattern base can also be pruned if the average price of
items in the pattern is greater than $500.
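As a rough illustration of how an anti-monotonic constraint such as (b) prunes the search (a sketch with a hypothetical price table, not code from the text): once the price sum of a candidate reaches the bound, every superset of that candidate can be discarded as well.

def violates_price_bound(itemset, price, max_total=150):
    # Anti-monotone: if the sum of prices already reaches the bound, every
    # superset of this itemset must also violate the constraint, so prune it.
    return sum(price[i] for i in itemset) >= max_total

def prune_candidates(candidates, price, max_total=150):
    return [c for c in candidates if not violates_price_bound(c, price, max_total)]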
Star Schema
Each dimension in a star schema is represented with only one-dimension table.
This dimension table contains the set of attributes.
The following diagram shows the sales data of a company with respect to the four dimensions,
namely time, item, branch, and location.
There is a fact table at the center. It contains the keys to each of four dimensions.
The fact table also contains the attributes, namely dollars sold and units sold.
Note − Each dimension has only one dimension table and each table holds a set of attributes. For
example, the location dimension table contains the attribute set {location_key, street, city,
province_or_state, country}. This constraint may cause data redundancy. For example, "Vancouver"
and "Victoria" are both cities in the Canadian province of British Columbia. The entries for such
cities may cause data redundancy along the attributes province_or_state and country.
Snowflake Schema
Some dimension tables in the Snowflake schema are normalized.
The normalization splits up the data into additional tables.
Unlike the star schema, the dimension tables in a snowflake schema are normalized. For example, the
item dimension table of the star schema is normalized and split into two dimension tables, namely the
item and supplier tables.
Now the item dimension table contains the attributes item_key, item_name, type, brand, and
supplier_key.
The supplier key is linked to the supplier dimension table. The supplier dimension table contains the
attributes supplier_key and supplier_type.
Note − Due to normalization in the snowflake schema, the redundancy is reduced and therefore it
becomes easier to maintain and saves storage space.
Schema Definition: Multidimensional schema is defined using Data Mining Query Language (DMQL). The
two primitives, cube definition and dimension definition, can be used for defining the data warehouses and
data marts.
Syntax for Cube Definition
define cube < cube_name > [ < dimension_list > ]: < measure_list >
Syntax for Dimension Definition
define dimension < dimension_name > as ( < attribute_or_dimension_list > )
Online Analytical Processing Server: OLAP is based on the multidimensional data model. It allows
managers and analysts to gain insight into the information through fast, consistent, and interactive
access to information.
Types of OLAP Servers: We have four types of OLAP servers
Relational OLAP - ROLAP
Multidimensional OLAP - MOLAP
Hybrid OLAP - HOLAP
Specialized SQL Servers
Relational OLAP
ROLAP servers are placed between the relational back-end server and client front-end tools. To store
and manage warehouse data, ROLAP uses a relational or extended-relational DBMS. ROLAP includes the
following:
Implementation of aggregation navigation logic
Optimization for each DBMS back end
Additional tools and services
Multidimensional OLAP
MOLAP uses array-based multidimensional storage engines for multidimensional views of data. With
multidimensional data stores, the storage utilization may be low if the data set is sparse. Therefore, many
MOLAP servers use two levels of data storage representation to handle dense and sparse data sets.
Hybrid OLAP
Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher scalability of ROLAP and
the faster computation of MOLAP. HOLAP servers allow storing large volumes of detailed
information. The aggregations are stored separately in the MOLAP store.
Specialized SQL Servers
Specialized SQL servers provide advanced query language and query processing support for SQL
queries over star and snowflake schemas in a read-only environment.
OLAP operations in the multidimensional data model:- In the multidimensional model, data are
organized into multiple dimensions and each dimension contains multiple levels of abstraction defined
by concept hierarchies. This organization provides users with the flexibility to view data from different
perspectives. Some of the OLAP data cube operations are
Roll-up: The roll-up operation (also called the "drill-up" operation) performs aggregation on a
data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction.
The roll-up operation shown aggregates the data by ascending the location hierarchy from the
level of city to the level of country.
Drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed data to more
detailed data. Drill-down can be realized by either stepping down a concept hierarchy for a
dimension or introducing additional dimensions. The drill-down operation shown descends the
time hierarchy from the level of quarter to the more detailed level of month.
Slice and Dice: The slice operation performs a selection on one dimension of the given cube,
resulting in a subcube. Figure shows a slice operation where the sales data are selected from the
central cube for the dimension time using the criterion time = "Q2". The dice operation defines a
subcube by performing a selection on two or more dimensions. Figure shows a dice operation on
the central cube based on the following selection criteria, which involve three dimensions:
(location = "Montreal" or "Vancouver") and (time = "Q1" or "Q2") and (item = "home
entertainment" or "computer").
Pivot (Rotate): Pivot (also called "rotate") is a visualization operation which rotates the data
axes in view in order to provide an alternative presentation of the data.
Drill-across: Executes queries involving more than one fact table.
Drill-through: This operation makes use of relational SQL by converting multidimensional
data queries into standard relational operations.
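The cube figures referenced above are not reproduced here. As a rough analogy only (a hypothetical pandas DataFrame rather than an OLAP server, with made-up values), the four basic operations can be illustrated as follows; the column names mirror the sales cube described earlier.

import pandas as pd

# Hypothetical fact data: one row per (time, item, location) cell
sales = pd.DataFrame({
    'time':         ['Q1', 'Q1', 'Q2', 'Q2'],
    'item':         ['computer', 'home entertainment', 'computer', 'home entertainment'],
    'location':     ['Vancouver', 'Montreal', 'Vancouver', 'Montreal'],
    'dollars_sold': [1000, 800, 1200, 900],
})

# Roll-up: aggregate away the item dimension (dimension reduction)
rollup = sales.groupby(['time', 'location'])['dollars_sold'].sum()

# Slice: select a single value on one dimension (time = "Q2")
slice_q2 = sales[sales['time'] == 'Q2']

# Dice: select on two or more dimensions
dice = sales[sales['time'].isin(['Q1', 'Q2']) & (sales['item'] == 'computer')]

# Pivot (rotate): swap the axes used to present the data
pivot = sales.pivot_table(index='location', columns='time',
                          values='dollars_sold', aggfunc='sum')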