Unit I-1: Data Mining Introduction


STA552: Knowledge Discovery and Data Mining

UNIT I

Introduction to databases, tasks in building a data mining database, data warehouses, online analytical processing (OLAP), data mining and machine learning, supervised and unsupervised learning.

Linear dimensionality reduction: principal component analysis for linear feature space,
scree plot and its use for determining the number of principal components to retain,
basic idea of nonparametric kernel density estimation, non-linear principal component
analysis.
UNIT II

Clustering: Similarity and distance measures, Outliers, Minimum spanning tree, squared
error clustering, K-means clustering, Hierarchical clustering, Block clustering and two-way
clustering: Hartigan's block clustering algorithm, Biclustering, Plaid models for
biclustering.

UNIT III

Artificial Neural Networks and extensions of regression models: McCulloch-Pitts neuron (Threshold Logic Unit), Rosenblatt's single-layer perceptron, single-unit perceptron gradient descent learning algorithm, multilayer perceptron, feed-forward and backpropagation learning algorithms, self-organizing maps (SOM) or Kohonen neural networks, on-line and batch versions of the SOM algorithm, U-matrix.
UNIT IV

Classification and Regression Trees (CART): Classification trees, node impurity function and entropy function, choosing the best split, pruning algorithm for classification trees.
Regression trees, terminal node value and splitting strategy, pruning the tree and best
pruned subtree.

Committee Machine: Bagging tree-based classifiers and regression tree predictors, Boosting, AdaBoost algorithm for binary classification.

UNIT V

Support vector machine (SVM) with linear class boundaries, multiclass SVM, Latent
variable models for blind source separation: Independent component analysis (ICA)
and its applications, linear mixing and noiseless ICA, FastICA algorithm for
determining single source component, deflation and parallel FastICA algorithm for
extracting multiple independent source components.

Books Recommended

Izenman, A.J. (2008). Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning. Springer.

Han, J. and Kamber, M. (2006). Data Mining: Concepts and Techniques, 2nd edition. Morgan Kaufmann.

Dunham, M. H. (2003). Data Mining: Introductory and Advanced Topics. Pearson Education.

Sheskin, D. J. (2004). The Handbook of Parametric and Nonparametric Statistical Procedures, 3rd edition. Chapman and Hall/CRC.
What is data mining?
A dramatic increase has taken place in the amount of information or data being
stored in electronic format.
Data storage became easier with the availability of large amounts of computing power
at low cost.
The new machine learning methods for knowledge representation, based on logic
programming, are computationally intensive and require more processing power.
Database management systems gave access to the stored data, but this was only a
small part of what could be gained from it. Decision-makers could make use of the
stored data to gain valuable insight.
On-line transaction processing systems (OLTPs) mainly aim at putting data into
databases quickly, safely and efficiently, but are not concerned with delivering
meaningful analysis in return. Analyzing data can provide further knowledge by
going beyond the data explicitly stored. This is where Data Mining, or Knowledge
Discovery in Databases (KDD), has obvious benefits.
Some of the definitions of Data Mining, or Knowledge Discovery in Databases:
Data Mining, or Knowledge Discovery in Databases (KDD) is the nontrivial
extraction of implicit, previously unknown, and potentially useful
information from data. This encompasses a number of different technical
approaches, such as clustering, data summarization, learning classification
rules, finding dependency networks, analyzing changes, and detecting
anomalies.
Data mining is the search for relationships and global patterns that exist in
large databases but are `hidden' among the vast amount of data. For
example, a relationship between patient data and their medical diagnosis.
Data mining refers to "using a variety of techniques to identify nuggets of
information or decision-making knowledge in bodies of data, and extracting
these in such a way that they can be put to use in the areas such as decision
support, prediction, forecasting and estimation. The data is often
voluminous, but as it stands is of low value as no direct use can be made of it;
it is the hidden information in the data that is useful"
Basically data mining is concerned with the analysis of data and the use of software
techniques for finding patterns and regularities in sets of data.
The idea is that it is possible to strike gold in unexpected places as the data mining
software extracts patterns not previously observed.
The analysis process starts with a set of data and uses a methodology to develop an
optimal representation of the structure of the data, during which knowledge is
acquired. The acquired knowledge can be extended to larger sets of data, working on
the assumption that the larger data set has a structure similar to the sample data.
Some of the stages/processes identified in data mining and knowledge discovery:
We start with the raw data and finish with the extracted knowledge, which is
acquired as a result of the following stages:
Selection - selecting or segmenting the data according to some criteria e.g. all
those people who own a car. In this way subsets of the data can be determined.
Preprocessing - this is the data cleansing stage where certain information is
removed which is unnecessary and may slow down queries. For example, it is
unnecessary to note the sex of a patient when studying pregnancy. There is a
possibility of inconsistent formats because the data is drawn from several
sources, e.g. sex may be recorded as f or m and also as 1 or 0. A consistent
format is ensured at this stage (a small sketch of this and the transformation
stage follows the list).
Transformation - the data is transformed to make it usable and navigable.
Data mining and Extraction - this stage is concerned with the extraction of
patterns from the data. A pattern can be defined as follows: given a set of facts (data) F,
a language L, and some measure of certainty C, a pattern is a statement S in L that
describes relationships among a subset Fs of F with a certainty C, such that S is
simpler in some sense than the enumeration of all the facts in Fs.
Interpretation and evaluation - the patterns identified by the system are
interpreted into knowledge, which can then be used to support decision-making
e.g. prediction and classification tasks, summarizing the contents of a database
or explaining observed phenomena.
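As a minimal sketch of the preprocessing and transformation stages, the fragment below assumes the records sit in a hypothetical pandas DataFrame; the column names and the chosen recoding are illustrative assumptions, not part of the original notes.

import pandas as pd

# Hypothetical patient records drawn from two sources with inconsistent coding.
raw = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "sex": ["f", "m", 1, 0],           # one source uses f/m, another uses 1/0
    "red_cell_count": [4.2, 5.1, 4.8, 5.6],
    "favourite_colour": ["red", "blue", "green", "red"],  # irrelevant attribute
})

# Preprocessing: remove unnecessary information and enforce a consistent format
# (assumption: the second source coded 1 as female and 0 as male).
clean = raw.drop(columns=["favourite_colour"])
clean["sex"] = clean["sex"].map({"f": "f", "m": "m", 1: "f", 0: "m"})

# Transformation: make the data usable and navigable, e.g. index it by patient
# so later mining stages can retrieve records directly.
clean = clean.set_index("patient_id")
print(clean)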
Data mining background
Data mining research has drawn on a number of other fields such as inductive
learning, machine learning, statistics etc.
Inductive learning
Induction is the inference of information from data and inductive learning is the
model building process where the database is analyzed with a view to finding
patterns. Similar objects are grouped in classes and rules formulated whereby it is
possible to predict the class of unseen objects.
Inductive learning has two main strategies:
Supervised learning - Supervised learning (SL), or data classification, provides
a mapping from attributes to specified classes or concept groups. The classes are
identified and pre-labeled in the data prior to learning. The system has to find a
description of each class. Once the description has been formulated, the
description and the class form a classification rule, which can be used to predict
the class of previously unseen objects.
Examples of SL techniques: regression models; neural networks; decision
trees; k-nearest neighbor classifiers.
Unsupervised learning – Unsupervised learning (USL) amounts to discovering
a number of patterns, subsets or segments within the data, without any prior
knowledge of target classes or concepts, i.e., learning without supervision. The
data mine system is supplied with objects but no classes are defined so it has to
observe the examples and recognize patterns (i.e. class description) by itself.
This system results in a set of class descriptions, one for each class discovered
in the environment.
Examples of USL techniques: K-means clustering; self-organizing maps. (A small sketch contrasting the two strategies follows.)
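The sketch below contrasts the two strategies on the same data, assuming scikit-learn is available; the data set, model choices and parameters are illustrative assumptions, not part of the original notes.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised learning: the classes (y) are pre-labeled, and the system learns a
# description of each class that can predict the class of unseen objects.
clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
print("predicted class of one object:", clf.predict(X[:1]))

# Unsupervised learning: no classes are supplied; the system groups the objects
# into segments purely from the patterns it observes in the data.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("discovered segment of the same object:", km.labels_[0])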
The model produced by inductive learning methods could be used to predict the
outcome of future situations.
Given a set of examples the system can construct multiple models some of which
will be simpler than others.
The principle of Ockham's (or Occam's) razor states that if there are multiple
explanations for a particular phenomenon, it makes sense to choose the simplest
because it is more likely to capture the nature of the phenomenon.
Statistics:
Statistical analysis software can be used to detect unusual patterns and explain
patterns using statistical models such as linear models.
Machine Learning
Machine learning is a type of artificial intelligence (AI) that provides computers
with the ability to learn without being explicitly programmed. Machine learning
focuses on the development of computer programs that can change when exposed to
new data. It is the automation of a learning process. Learning is concerned with the
construction of rules based on observations of environmental states and transitions.
Machine learning examines previous examples and their outcomes and learns how to
reproduce these and make generalizations about new cases.
A machine learning system uses an entire finite set of observations, called the
training set, at once. This set contains observations coded in some machine-readable
form. The training set is finite hence not all concepts can be learned exactly.
Differences between Data Mining and Machine Learning
Knowledge Discovery in Databases (KDD) or Data Mining, and the part of Machine
Learning (ML) dealing with learning from examples overlap in the algorithms used
and the problems addressed.
The main differences are:
KDD is concerned with finding understandable knowledge, while ML is
concerned with improving performance of an agent.
KDD is concerned with very large, real-world databases, while ML typically
looks at smaller data sets.
ML is a broader field, which includes not only learning from examples, but also
reinforcement learning, learning with a teacher, etc.
KDD is that part of ML which is concerned with finding understandable knowledge
in large sets of real-world examples.
The database is often designed for purposes different from data mining and so
properties or attributes that would simplify the learning task are not present.
Databases are usually contaminated by errors, so the data-mining algorithm has to
cope with noise, whereas ML typically works with laboratory-type examples, i.e. as near
perfect as possible.
Practical KDD systems are expected to include three interconnected phases:
Translation of standard database information into a form suitable for use by
learning facilities;
Using machine learning techniques to produce knowledge bases from databases;
and
Interpreting the knowledge produced to solve user problems and/or reduce data
spaces (data spaces being the number of examples).

Predictive and descriptive data mining:

Predictive data mining: Predictive data mining methods are supervised. They are
used to induce models or theories (such as decision trees) from class-labeled data.
The induced models can be used for prediction and classification.
Descriptive data mining: Descriptive data mining methods are typically
unsupervised. They are used to induce interesting patterns (such as association rules)
from unlabeled data. The induced patterns are useful in exploratory data analysis.

Data Mining Models


Two types of models or modes of operation, which may be used to unearth
information of interest to the user, are:
Verification Model
The verification model takes a hypothesis from the user and tests the validity of it
against the data. The problem with this model is that no new information is created
in the retrieval process but rather the queries will always return records to verify or
negate the hypothesis.
Discovery Model: In the discovery model, the system automatically discovers
important information hidden in the data. The discovery or data mining tools aim to
reveal a large number of facts about the data in as short a time as possible.
An example of such a model is a bank database, which is mined to discover many
groups of customers to target for a mailing campaign. The data is searched with no
hypothesis in mind other than for the system to group the customers according to the
common characteristics found.
Data Warehousing
A data warehouse is a relational database management system (RDBMS)
designed specifically to support decision making and analysis rather than
transaction processing. Data warehousing is a powerful technique making it possible
to extract archived operational data and overcome inconsistencies between different
legacy data formats.
The data warehouse provides data that is already transformed and summarized.
Characteristics of a data warehouse
There are generally four characteristics that describe a data warehouse:
Subject-oriented: data are organized according to subject instead of
application. For example, an insurance company using a data warehouse would
organize their data by customer, premium, and claim, instead of by different
products (auto, life, etc.).
Integrated: When data resides in many separate applications in the operational
environment, encoding of data is often inconsistent. For instance, in one
application gender might be coded as "m" and "f" and in another by 0 and 1.
When data are moved from the operational environment into the data
warehouse, they assume a consistent coding convention e.g. gender data is
transformed to "m" and "f".
Time-variant: The data warehouse contains a place for storing data that are five
to 10 years old, or older, to be used for comparisons, trends, and forecasting.
These data are not updated.
Non-volatile: Data are not updated or changed in any way once they enter the
data warehouse, but are only loaded and accessed.
Meta Data: The information that describes the model and definition of the source
data elements is called "metadata". The end-user finds and understands the data in
the warehouse through metadata, which is an important part of the warehouse.
The metadata should contain (an illustrative sketch follows the list):
The structure of the data.
The algorithm used for summarization.
The mapping from the operational environment to the data warehouse.
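Purely as an illustration of these three elements, a single metadata entry might be recorded as a simple structure like the one below; the field names and values are hypothetical, not a standard metadata schema.

# Hypothetical metadata record for one warehouse element; all field names and
# values are illustrative only.
metadata_entry = {
    # The structure of the data as stored in the warehouse.
    "structure": {
        "table": "monthly_claims",
        "columns": ["customer_id", "month", "total_claimed"],
    },
    # The algorithm used for summarization.
    "summarization": "sum of claim amounts grouped by customer and month",
    # The mapping from the operational environment to the data warehouse.
    "source_mapping": {"system": "claims_oltp", "table": "claims", "column": "amount"},
}

print(metadata_entry["summarization"])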
Data cleansing is an important aspect of creating an efficient data warehouse in that
it is the removal of certain aspects of operational data, which slow down the query
times. Data should be extracted from production sources at regular intervals and
pooled centrally but the cleansing process has to remove duplication and reconcile
differences between various styles of data collection.
Once the data has been cleaned it is transferred to the data warehouse.
Data warehousing and OLTP systems
The data warehouse offers the potential to retrieve and analyze information quickly
and easily. The data in an OLTP system can be changed, whereas the data in the
warehouse is descriptive and cannot be changed; thus OLTP systems serve day-to-day
operations while the warehouse serves analysis and decision support.
The Data Warehouse model
The data within the warehouse itself has a distinct structure, with the emphasis
on different levels of summarization:
The current detail data
reflects the most recent happenings, which are usually the most interesting;
is voluminous as it is stored at the lowest level of granularity;
is stored on disk storage, which is fast to access but expensive and complex to
manage.
Older detail data is stored on some form of mass storage; it is infrequently accessed
and stored at a level of detail consistent with current detailed data.
Lightly summarized data is data distilled from the low level of detail found at the
current detailed level and generally is stored on disk storage.
Highly summarized data is compact and easily accessible and can even be found
outside the warehouse.
Metadata is the final component of the data warehouse and is used as:
A directory to help the DSS analyst locate the contents of the data warehouse,
A guide to the mapping of data as the data is transformed from the operational
environment to the data warehouse environment,
A guide to the algorithms used for summarization between the current detailed
data and the lightly summarized data, and between the lightly summarized data and the
highly summarized data, etc.
Criteria for a data warehouse
Red Brick Systems established criteria for a relational database
management system (RDBMS) suitable for data warehousing, documenting 10
specialized requirements for an RDBMS to qualify as a relational data warehouse:
1. Load Performance - Data warehouses require incremental loading of new data
on a periodic basis within narrow time windows; performance of the load
process should be measured in hundreds of millions of rows and gigabytes per
hour.
2. Load Processing - Many steps must be taken to load new or updated data into
the data warehouse including data conversions, filtering, reformatting, integrity
checks, physical storage, indexing, and metadata update.
3. Data Quality Management - The warehouse must ensure local consistency,
global consistency, and referential integrity despite "dirty" sources and massive
database size.
4. Query Performance - Fact-based management and ad-hoc analysis must not be
slowed or inhibited by the performance of the data warehouse RDBMS; large,
complex queries for key business operations must complete in seconds not days.
5. Terabyte Scalability - Data warehouse sizes are growing at astonishing rates.
The RDBMS must support modular and parallel management. It must support
continued availability in the event of a point failure, and must provide a
fundamentally different mechanism for recovery. The query performance must
not be dependent on the size of the database, but rather on the complexity of the
query.
6. Mass User Scalability - The RDBMS server must support hundreds, even
thousands, of concurrent users while maintaining acceptable query performance.
7. Networked Data Warehouse - Multiple data warehouse systems cooperate in a
larger network of data warehouses. The server must include tools that coordinate
the movement of subsets of data between warehouses. Users must be able to
look at and work with multiple warehouses from a single client workstation.
Warehouse managers have to manage and administer a network of warehouses
from a single physical location.
8. Warehouse Administration - The RDBMS must provide controls for
implementing resource limits, chargeback accounting to allocate costs back to
users, and query prioritization to address the needs of different user classes and
activities. The RDBMS must also provide for workload tracking and tuning so
system resources may be optimized for maximum performance.
9. Integrated Dimensional Analysis - The power of multidimensional views is
widely accepted, and dimensional support must be inherent in the warehouse
RDBMS to provide the highest performance for relational OLAP tools. The
RDBMS must support fast, easy creation of pre-computed summaries common
in large data warehouses.
10. Advanced Query Functionality - End users require advanced analytic
calculations, sequential and comparative analysis, and consistent access to
detailed and summarized data. The RDBMS must provide a complete set of
analytic operations including core sequential and statistical operations.
Data mining problems/issues
Limited Information
Inconclusive data causes problems because if some attributes essential to knowledge
about the application domain are not present in the data, it may be impossible to
discover significant knowledge about a given domain. For example, one cannot diagnose
malaria from a patient database if that database does not contain the patients' red
blood cell counts.
Noise and missing values
Databases are usually contaminated by errors (noise). Errors in either the values of
attributes or class information are known as noise.
Missing data can be treated by discovery systems in a number of ways, such as the following (a small sketch follows the list):
Simply disregard missing values
Omit the corresponding records
Infer missing values from known values
Treat missing data as a special value to be included additionally in the attribute
domain
Or average over the missing values using Bayesian techniques.
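A minimal sketch of three of these treatments, assuming the records live in a hypothetical pandas DataFrame; the columns and fill values are illustrative assumptions only.

import pandas as pd

# Hypothetical records with some missing attribute values.
df = pd.DataFrame({
    "age": [34, 51, None, 42],
    "income": [42000, 58000, 61000, None],
})

# Omit the corresponding records entirely.
dropped = df.dropna()

# Infer missing values from known values, e.g. fill with the column mean.
imputed = df.fillna(df.mean(numeric_only=True))

# Treat missing data as a special value added to the attribute domain.
flagged = df.fillna({"age": -1, "income": -1})

print(len(dropped), "records remain after dropping incomplete rows")
print(imputed.round(1))
print(flagged)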
Statistical methods can treat problems of noisy data, and separate different types of
noise.
Uncertainty
Uncertainty refers to the severity of the error and the degree of noise in the data.
Potential Applications
Data mining has many and varied fields of application some of which are listed
below.
Retail/Marketing
Identify buying patterns from customers
Find associations among customer demographic characteristics
Predict response to mailing campaigns
Market basket analysis (a modelling technique based upon the theory that if you
buy a certain group of items, you are more or less likely to buy another group of
items; a small counting sketch follows this list).
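As a toy sketch of the counting behind market basket analysis, the fragment below tallies how often pairs of items appear together in made-up transactions; a real analysis would use an association-rule algorithm such as Apriori on far larger data.

from collections import Counter
from itertools import combinations

# Made-up shopping baskets, one set of items per transaction.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"beer", "crisps"},
]

# Count how often each pair of items is bought together and report its support,
# i.e. the fraction of all transactions that contain the pair.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

for pair, count in pair_counts.most_common(3):
    print(pair, "support =", round(count / len(baskets), 2))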
Banking
Detect patterns of fraudulent credit card use
Identify `loyal' customers
Predict customers likely to change their credit card affiliation
Determine credit card spending by customer groups
Find hidden correlations between different financial indicators
Identify stock trading rules from historical market data
Insurance and Health Care
Claims analysis - i.e. which medical procedures are claimed together
Predict which customers will buy new policies
Identify behavior patterns of risky customers
Identify fraudulent behavior
Transportation
Determine the distribution schedules among outlets
Analyze loading patterns
Medicine
Characterize patient behavior to predict office visits
Identify successful medical therapies for different illnesses.
Bioinformatics
Analyzing gene expression data using various data mining techniques such as
clustering, visualization (transformation of higher-dimensional microarray data
to a lower-dimensional, human-understandable form), and string matching.
