Introduction To Data Mining: Dr. Dipti Chauhan Assistant Professor SCSIT, SUAS Indore
WHY DATA MINING?
DATA MINING AS THE EVOLUTION OF INFORMATION TECHNOLOGY
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
Data mining, data warehousing, multimedia databases, and Web databases
2000s:
Stream data management and mining
Data mining and its applications
Web technology (XML, data integration) and global information systems
WHAT IS DATA MINING?
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge extraction,
data/pattern analysis, data archeology, data dredging, information
harvesting, business intelligence, etc.
KNOWLEDGE DISCOVERY (KDD) PROCESS
This is a view from typical database systems and data warehousing communities.
Data mining plays an essential role in the knowledge discovery process.
KNOWLEDGE DISCOVERY FROM DATA (KDD)
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed and consolidated into forms
appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract data
patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge
based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques
are used to present mined knowledge to users)
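The seven steps above can be sketched as a chain of small functions. This is only an illustrative toy, not a real KDD system: the sales records, the function names, and the "average amount per item" pattern are all assumptions made for the example.

```python
from statistics import mean

# Hypothetical raw records from two sources (illustrative data only).
store_a = [
    {"customer": "c1", "item": "table", "amount": 120.0},
    {"customer": "c2", "item": "chair", "amount": None},  # missing value
    {"customer": "c1", "item": "chair", "amount": 40.0},
]
store_b = [
    {"customer": "c3", "item": "table", "amount": 110.0},
    {"customer": "c3", "item": "lamp", "amount": 25.0},
]

def clean(records):
    """Step 1: drop records with a missing amount."""
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    """Step 2: combine several data sources into one list."""
    return [r for source in sources for r in source]

def select(records, items):
    """Step 3: keep only the records relevant to the analysis task."""
    return [r for r in records if r["item"] in items]

def transform(records):
    """Step 4: aggregate the data into the average amount per item."""
    amounts = {}
    for r in records:
        amounts.setdefault(r["item"], []).append(r["amount"])
    return {item: mean(values) for item, values in amounts.items()}

def mine(summary):
    """Step 5: a trivial 'pattern': items ranked by average amount."""
    return sorted(summary.items(), key=lambda kv: kv[1], reverse=True)

def evaluate(patterns, min_amount):
    """Step 6: keep only patterns that pass an interestingness threshold."""
    return [(item, avg) for item, avg in patterns if avg >= min_amount]

def present(patterns):
    """Step 7: render the surviving patterns as readable text."""
    return [f"{item}: average amount {avg:.2f}" for item, avg in patterns]

def run_kdd():
    data = integrate(clean(store_a), clean(store_b))
    data = select(data, {"table", "chair"})
    patterns = mine(transform(data))
    return present(evaluate(patterns, min_amount=50.0))

for line in run_kdd():
    print(line)
```

In practice each step is far more involved, and the process is usually iterative: results from pattern evaluation often send the analyst back to the selection or transformation steps.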
EXAMPLE: A DATA MINING FRAMEWORK
DATA MINING: ON WHAT KINDS OF DATA?
Data mining can be applied to any kind of data as long as the data are meaningful
for a target application.
The basic repositories include:
Relational database
Advanced database systems include:
Object-oriented relational databases
Data mining functionalities are used to specify the kinds of patterns to be found
in data mining tasks.
Data mining tasks can be classified into two categories: descriptive and predictive.
1. Descriptive: Descriptive mining tasks characterize the properties of the data in
a target data set.
These tasks present the general properties of the data stored in a database. Descriptive tasks
are used to find patterns in the data, such as clusters, correlations, trends, and anomalies.
3. Clustering:
Clustering is the process of partitioning a set of objects into groups called clusters, so that
objects in the same cluster are more similar (in some sense or another) to each other than to
those in other clusters. Clustering is used in many fields, including machine learning, pattern
recognition, bioinformatics, image analysis, and information retrieval.
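As a minimal sketch of the idea, the following uses k-means, one common clustering method (the slides do not name a specific algorithm, so this choice is an assumption). The sample points and the starting centroids are made up for illustration, and similarity is taken to be Euclidean distance.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

def kmeans(points, centroids, max_iter=100):
    """Group points around the nearest centroid, then recompute centroids.

    `centroids` holds the starting guesses (an assumption of this sketch;
    real implementations usually pick them randomly or heuristically).
    """
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no movement, so we have converged
            break
        centroids = new_centroids
    return clusters, centroids

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
clusters, centroids = kmeans(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
for cluster, centre in zip(clusters, centroids):
    print(centre, cluster)
```

The three points near the origin end up in one cluster and the three points near (8.5, 9) in the other, which matches the definition above: members of a cluster are closer to each other than to members of the other cluster.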
4. Mining Frequent Patterns, Associations and Correlations:
A frequent pattern is a pattern (a set of items, a subsequence, a substructure, etc.)
that appears frequently in the data. A frequent itemset is a set of items that occur
together frequently in a transaction data set, for example a table and a chair.
A subsequence is an ordered series of events, such as buying a computer system first, then a
UPS, and thereafter a printer. If such a subsequence appears frequently in a shopping history
database, it is called a frequent sequential pattern. A substructure refers to a particular
structural form, such as a subgraph or subtree. If a substructure appears frequently, it is
called a frequent structural pattern. Discovering such frequent patterns plays an important
role in mining associations, correlations, clustering, and other data mining tasks.
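To make the notion of a frequent itemset concrete, here is a small sketch that counts how often each itemset occurs and keeps those whose support (the fraction of transactions containing the itemset) reaches a threshold. It enumerates itemsets by brute force, which is only practical for tiny examples; it is not Apriori or any other optimized algorithm. The transactions, the `min_support` value, and the `max_size` limit are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets of up to `max_size` items."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # canonical order, no duplicates
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    n = len(transactions)
    return {itemset: count / n
            for itemset, count in counts.items()
            if count / n >= min_support}

# Hypothetical shopping baskets.
transactions = [
    ["table", "chair", "lamp"],
    ["table", "chair"],
    ["table", "printer"],
    ["chair", "table", "computer"],
]

result = frequent_itemsets(transactions, min_support=0.75)
for itemset, support in sorted(result.items()):
    print(itemset, support)
```

With these baskets, the pair (chair, table) appears in three of four transactions, so it is reported as frequent at a 0.75 support threshold, while items such as lamp or printer are not.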
DATA MINING FUNCTIONALITIES (CONTD.): WHAT KINDS OF PATTERNS CAN BE MINED
5. Outlier Analysis:
A data set may contain objects that do not comply with the general behavior or
model of the data. These data objects are outliers. Many data mining methods
discard outliers as noise or exceptions. However, in some applications (e.g., fraud
detection) the rare events can be more interesting than the more regularly occurring
ones. The analysis of outlier data is referred to as outlier analysis or anomaly
mining.
Outliers may be detected using statistical tests that assume a distribution or
probability model for the data, or using distance measures where objects that are
remote from any other cluster are considered outliers. Rather than using statistical or
distance measures, density-based methods may identify outliers in a local region,
although they look normal from a global statistical distribution view.
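The statistical approach can be illustrated with a simple z-score check: a value is flagged when it lies more than a chosen number of standard deviations from the mean. This is just one simple test, assuming the data are roughly normally distributed; the sample values and the threshold of 2.0 are made up for the example.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values whose absolute z-score exceeds `threshold`."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:          # all values identical, so nothing stands out
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical transaction amounts with one suspicious entry.
amounts = [10, 12, 11, 13, 12, 11, 95]
print(zscore_outliers(amounts))
```

Note that a single extreme value also inflates the mean and standard deviation used to judge it, which is one reason distance-based and density-based methods are used as alternatives.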