Data Mining


Data Mining: Concepts and Techniques (3rd ed.)

— Chapter 1 —

Chapter 1. Introduction
◼ Why Data Mining?

◼ What Is Data Mining?

◼ A Multi-Dimensional View of Data Mining

◼ What Kind of Data Can Be Mined?

◼ What Kinds of Patterns Can Be Mined?

◼ What Technologies Are Used?

◼ What Kinds of Applications Are Targeted?

◼ Major Issues in Data Mining

◼ A Brief History of Data Mining and Data Mining Society

◼ Summary
Why Data Mining?
◼ The Explosive Growth of Data: from terabytes to petabytes
◼ In decimal (SI) units, 1 terabyte (TB) equals 1,000 gigabytes (GB) or
1,000,000 megabytes (MB).
◼ Likewise, 1 petabyte (PB) equals 1,000 TB and 1 exabyte (EB) equals 1,000 PB.
In binary units each step is a factor of 1,024 instead (1 PiB = 1,024 TiB).
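As a quick check on these magnitudes, the following minimal sketch converts between units under both the decimal (factor 1,000) and binary (factor 1,024) conventions. The helper name `bytes_in` is hypothetical and chosen only for illustration.

```python
# Unit names in increasing order of size.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB"]

def bytes_in(unit: str, base: int = 1000) -> int:
    """Bytes in one `unit`; base=1000 is decimal (SI), base=1024 is binary."""
    return base ** UNITS.index(unit)

print(bytes_in("TB") // bytes_in("GB"))              # 1000
print(bytes_in("TB") // bytes_in("MB"))              # 1000000
print(bytes_in("PB", 1024) // bytes_in("TB", 1024))  # 1024
```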

◼ Data collection and data availability
◼ Data mining turns a large collection of data into knowledge.
◼ Data Mining as the Evolution of Information Technology
◼ Data mining can be viewed as a result of the natural evolution of
information technology.
◼ Major sources of abundant data
◼ Business: Web, e-commerce, transactions, stocks, …
◼ Science: Remote sensing, bioinformatics, scientific simulation, …
◼ Society and everyone: news, digital cameras, YouTube
◼ We are drowning in data, but starving for knowledge!

Evolution of Sciences
◼ Before 1600: empirical science, which is based on observation and
experience.

◼ 1600-1950s: theoretical science, based on theories and hypotheses.

◼ Each discipline has grown a theoretical component. Theoretical models
often motivate experiments and generalize our understanding.

◼ 1950s-1990s: computational science

◼ Over the last 50 years, most disciplines have grown a third, computational
branch (e.g., empirical, theoretical, and computational ecology, or physics,
or linguistics).
◼ Computational science is a discipline concerned with the design,
implementation, and use of mathematical models to analyze and solve
scientific problems.
◼ Computational science traditionally meant simulation.
◼ 1990-now: data science, which is the study of data to extract meaningful
insights for business.

◼ It is a multidisciplinary approach that combines principles and practices
from the fields of mathematics, statistics, artificial intelligence, and
computer engineering to analyze large amounts of data.
◼ The flood of data from new scientific instruments and simulations
◼ The ability to economically store and manage petabytes of data online
◼ The Internet and computing Grid that make all these archives
universally accessible

Evolution of Database Technology
◼ 1960s:
◼ Data collection, database creation, IMS and network DBMS
◼ 1970s:
◼ Relational data model, relational DBMS implementation
◼ 1980s:
◼ RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
◼ Application-oriented DBMS (spatial, scientific, engineering, etc.)
◼ 1990s:
◼ Data mining, data warehousing, multimedia databases, and Web databases
◼ 2000s:
◼ Stream data management and mining
◼ Data mining and its applications
◼ Web technology (XML, data integration) and global information systems

What Is Data Mining?

◼ Data mining (knowledge discovery from data) is the process of
discovering interesting patterns from massive amounts of data.
◼ As a knowledge discovery process, it typically involves
data cleaning, data integration, data selection, data
transformation, pattern discovery, pattern evaluation, and
knowledge presentation.
◼ Alternative names
◼ Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information harvesting,
business intelligence, etc.

◼ The knowledge discovery process includes the following:

1. Data cleaning (to remove noise and inconsistent data);

2. Data integration (where multiple data sources may be combined);
◼ A popular trend in the information industry is to perform data
cleaning and data integration as a preprocessing step, where
the resulting data are stored in a data warehouse.

3. Data selection (where data relevant to the analysis task are
retrieved from the database);

4. Data transformation (where data are transformed and
consolidated into forms appropriate for mining by performing
summary or aggregation operations);

5. Data mining (an essential process where intelligent methods
are applied to extract data patterns);

6. Pattern evaluation (to identify the truly interesting patterns
representing knowledge based on interestingness measures);
◼ A pattern is interesting if it is valid on test data with some degree of
certainty, novel, potentially useful, and easily understood by humans.
Interesting patterns represent knowledge.

7. Knowledge presentation (where visualization and knowledge
representation techniques are used to present mined
knowledge to users).
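The seven steps can be sketched as a toy pipeline. This is only an illustration under stated assumptions: the purchase records, the function names, and the choice of "item pairs bought by the same customer" as the pattern are all made up, and each function is a deliberately simplistic stand-in for its step.

```python
from collections import Counter

# Two hypothetical sources of purchase records: (customer, item, amount).
store_a = [("ann", "computer", 900.0), ("bob", "software", 50.0), ("ann", "software", None)]
store_b = [("cat", "computer", 1100.0), ("bob", "computer", 950.0), ("cat", "software", 60.0)]

def clean(records):
    # 1. Data cleaning: drop records with a missing amount.
    return [r for r in records if r[2] is not None]

def integrate(*sources):
    # 2. Data integration: combine several sources into one list.
    return [r for src in sources for r in src]

def select(records, items):
    # 3. Data selection: keep only the items relevant to the task.
    return [r for r in records if r[1] in items]

def transform(records):
    # 4. Data transformation: aggregate into one item set per customer.
    baskets = {}
    for customer, item, _ in records:
        baskets.setdefault(customer, set()).add(item)
    return baskets

def mine(baskets):
    # 5. Data mining: count how often each item pair occurs together.
    pairs = Counter()
    for items in baskets.values():
        ordered = sorted(items)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                pairs[(a, b)] += 1
    return pairs

def evaluate(pairs, n_baskets, min_support):
    # 6. Pattern evaluation: keep pairs whose support meets the threshold.
    return {p: c / n_baskets for p, c in pairs.items() if c / n_baskets >= min_support}

data = select(integrate(clean(store_a), clean(store_b)), {"computer", "software"})
baskets = transform(data)
patterns = evaluate(mine(baskets), len(baskets), min_support=0.5)
# 7. Knowledge presentation: report the surviving patterns.
for pair, support in patterns.items():
    print(f"{pair[0]} + {pair[1]}: support {support:.0%}")
```

With this made-up data, two of the three customers bought both items, so the single reported pattern has a support of about 67%.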

Data Mining in Business Intelligence

[Figure: pyramid of layers. The potential to support business decisions
increases from bottom to top; the role typically working at each layer is
given in parentheses.]
◼ Decision Making (End User)
◼ Data Presentation: visualization techniques (Business Analyst)
◼ Data Mining: information discovery (Data Analyst)
◼ Data Exploration: statistical summary, querying, and reporting (Data Analyst)
◼ Data Preprocessing/Integration, Data Warehouses (DBA)
◼ Data Sources: paper, files, Web documents, scientific experiments,
database systems (DBA)
KDD Process: A Typical View from ML and Statistics

[Figure: Input Data → Data Pre-Processing → Data Mining → Post-Processing]
◼ Data pre-processing: data integration, normalization, feature selection,
dimension reduction
◼ Data mining: pattern discovery, association & correlation, classification,
clustering, outlier analysis
◼ Post-processing: pattern evaluation, pattern selection, pattern
interpretation, pattern visualization

◼ This is a view from typical machine learning and statistics communities

Example: Medical Data Mining

◼ Health care and medical data mining often adopts such a view from
statistics and machine learning:
◼ Preprocessing of the data (including feature
extraction and dimension reduction)
◼ Classification and/or clustering processes
◼ Post-processing for presentation

Multi-Dimensional View of Data Mining
◼ Data to be mined
◼ Database data (extended-relational, object-oriented,
heterogeneous, legacy), data warehouse, transactional
data, stream, spatiotemporal, time-series, sequence,
text and web, multi-media, graphs & social and
information networks.

◼ Knowledge to be mined (or: Data mining functions)
◼ Characterization, discrimination, association, classification,
clustering, trend/deviation, outlier analysis, etc.

◼ Descriptive data mining vs. predictive data mining

◼ Descriptive data mining is a data mining technique that
identifies what happened in the past by analyzing the stored
past data. The main objective of descriptive data mining is
the summarization and transformation of data into useful
information for monitoring and reporting purposes.


◼ Predictive data mining is the analysis done to predict a
future event or other data or trends. Business analysts can
use predictive data mining to make better decisions and add
value to the analytics team’s efforts.

◼ Multiple/integrated functions and mining at multiple levels

◼ Techniques utilized
◼ Data-intensive, data warehouse (OLAP), machine
learning, statistics, pattern recognition, visualization,
high-performance, etc.

◼ Applications adapted
◼ Retail, telecommunication, banking, fraud analysis,
bio-data mining, stock market analysis, text mining,
Web mining, etc.

Data Mining: On What Kinds of Data?

◼ As a general technology, data mining can be applied to
any kind of data as long as the data are meaningful for a
target application.
◼ Database-oriented data sets and applications
◼ A database system, also called a database management
system (DBMS), consists of a collection of interrelated data,
known as a database, and a set of software programs to
manage and access the data.

◼ A relational database is a collection of tables, each of which is
assigned a unique name. Each table consists of a set of
attributes (columns or fields) and usually stores a large set of
tuples (records or rows). Each tuple in a relational table
represents an object identified by a unique key and described
by a set of attribute values.
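To make the terms concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `customer` table, its attributes, and the rows are hypothetical and exist only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
# A table (relation) with a unique name; cust_id is the unique key.
conn.execute(
    "CREATE TABLE customer ("
    " cust_id INTEGER PRIMARY KEY,"
    " name TEXT, age INTEGER, income REAL)"
)
# Each tuple (row) describes one customer by its attribute values.
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?, ?)",
    [(1, "Ann", 42, 58000.0), (2, "Bob", 27, 41000.0), (3, "Cat", 35, 73000.0)],
)
# A relational query retrieves the tuples relevant to an analysis task.
rows = conn.execute(
    "SELECT name, age FROM customer WHERE income > ? ORDER BY cust_id", (50000,)
).fetchall()
print(rows)  # [('Ann', 42), ('Cat', 35)]
conn.close()
```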

◼ A data warehouse is a repository of information collected from multiple sources,
stored under a unified schema, and usually residing at a single site.
◼ Data warehouses are constructed via a process of data cleaning, data
integration, data transformation, data loading, and periodic data refreshing.
◼ A data warehouse is usually modeled by a multidimensional data structure,
called a data cube, in which each dimension corresponds to an attribute or a
set of attributes in the schema, and each cell stores the value of some
aggregate measure such as count or sum(sales amount).
◼ A data cube provides a multidimensional view of data and allows the
precomputation and fast access of summarized data.
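The idea of precomputing an aggregate for every cell can be sketched in a few lines. This is a naive illustration that materializes the full cube in memory; the fact records, the three dimensions (item, city, quarter), and the use of "*" for an aggregated-away dimension are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact records: (item, city, quarter, sales_amount).
sales = [
    ("computer", "Chicago", "Q1", 1000.0),
    ("computer", "Chicago", "Q2", 1200.0),
    ("software", "Chicago", "Q1", 200.0),
    ("computer", "Toronto", "Q1", 800.0),
]

def build_cube(records):
    """Precompute the sum of sales for every subset of the dimensions."""
    cube = defaultdict(float)
    for *dims, amount in records:
        for k in range(len(dims) + 1):
            for kept in combinations(range(len(dims)), k):
                # "*" marks a dimension that has been aggregated away.
                cell = tuple(d if i in kept else "*" for i, d in enumerate(dims))
                cube[cell] += amount
    return cube

cube = build_cube(sales)
print(cube[("computer", "*", "*")])   # 3000.0
print(cube[("*", "Chicago", "Q1")])   # 1200.0
print(cube[("*", "*", "*")])          # 3200.0
```

After the one-time build, each summarized value is a single dictionary lookup, which is the fast access that precomputation buys.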


◼ In general, each record in a transactional database captures a transaction,
such as a customer’s purchase, a flight booking, or a user’s clicks on a web
page.
◼ A transactional database may have additional tables, which contain other
information related to the transactions, such as item description,
information about the salesperson or the branch, and so on.


◼ Advanced data sets and advanced applications


◼ Data streams and sensor data
◼ Time-series data, temporal data, sequence data (incl. bio-sequences)
◼ Structured data, graphs, social networks and multi-linked data
◼ Object-relational databases
◼ Heterogeneous databases and legacy databases
◼ Spatial data and spatiotemporal data
◼ Multimedia databases
◼ Text databases
◼ The World-Wide Web

Data Mining Function: (1) Characterization and Discrimination
◼ Data characterization is a summarization of the
general characteristics or features of a target class of
data.
◼ The data corresponding to the user-specified class are
typically collected by a query.
◼ E.g. Summarize the characteristics of customers who spend
more than $5000 a year at Walmart.
◼ The result is a general profile of these customers, such
as that they are 40 to 50 years old, employed, and have
excellent credit ratings.
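A characterization of this kind amounts to a query followed by a summary. The sketch below assumes a small in-memory list of customer records; the field names and all values are invented for illustration.

```python
from collections import Counter
from statistics import mean

# Hypothetical customer records; all values are made up.
customers = [
    {"age": 45, "employed": True,  "credit": "excellent", "yearly_spend": 6200},
    {"age": 48, "employed": True,  "credit": "excellent", "yearly_spend": 7100},
    {"age": 23, "employed": False, "credit": "fair",      "yearly_spend": 900},
    {"age": 41, "employed": True,  "credit": "good",      "yearly_spend": 5400},
]

# Step 1: collect the target class with a query-like filter.
target = [c for c in customers if c["yearly_spend"] > 5000]

# Step 2: summarize the general features of the target class.
profile = {
    "count": len(target),
    "age_range": (min(c["age"] for c in target), max(c["age"] for c in target)),
    "mean_age": mean(c["age"] for c in target),
    "employed_share": sum(c["employed"] for c in target) / len(target),
    "most_common_credit": Counter(c["credit"] for c in target).most_common(1)[0][0],
}
print(profile)
```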

◼ Data discrimination is a comparison of the general
features of the target class data objects against the
general features of objects from one or multiple
contrasting classes.
◼ The target and contrasting classes can be specified by a user, and the
corresponding data objects can be retrieved through database queries.
◼ For example, a user may want to compare the general features of software
products with sales that increased by 10% last year against those with
sales that decreased by at least 30% during the same period.
◼ For example, a customer relationship manager at Samsung may want to
compare two groups of customers: those who shop for computer
products regularly (e.g., more than twice a month) and those who rarely
shop for such products (e.g., fewer than three times a year).
Data Mining Function: (2) Association and Correlation Analysis
◼ Frequent patterns (or frequent itemsets)
◼ Frequent patterns are patterns that occur
frequently in data.
◼ What items are frequently purchased together in
your Walmart?
◼ Association analysis
◼ Suppose that, as a marketing manager at Walmart,
you want to know which items are frequently
purchased together (i.e., within the same
transaction).
◼ An example of such a rule, mined from a transactional database, is
buys(X, "computer") ⇒ buys(X, "software") [support = 1%, confidence = 50%],
where X is a variable representing a customer.


◼ A confidence, or certainty, of 50% means that if a customer buys a computer,
there is a 50% chance that she will buy software as well.
◼ A 1% support means that 1% of all the transactions under analysis show that
computer and software are purchased together.

◼ This association rule involves a single attribute or predicate (i.e., buys) that
repeats. Association rules that contain a single predicate are referred to as
single-dimensional association rules.
◼ Dropping the predicate notation, the rule can be written simply as
“computer→software [1%, 50%].”
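Support and confidence can be computed directly from their definitions. The sketch below uses four made-up transactions, so its numbers (25% support, 50% confidence) differ from the 1% and 50% of the rule discussed above; the function names are illustrative.

```python
# Hypothetical transactions; each is the set of items bought together.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"software", "printer"},
]

def support(itemset, transactions):
    """Fraction of all transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Share of transactions with the antecedent that also hold the consequent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

s = support({"computer", "software"}, transactions)
c = confidence({"computer"}, {"software"}, transactions)
print(f"computer -> software [{s:.0%}, {c:.0%}]")  # computer -> software [25%, 50%]
```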
◼ Suppose, instead, that we are given the Samsung relational database related to
purchases. A data mining system may find association rules like:
age(X, "20..29") ∧ income(X, "40K..49K") ⇒ buys(X, "laptop")
[support = 2%, confidence = 60%]

◼ The rule indicates that of the Samsung customers under study, 2% are 20
to 29 years old with an income of $40,000 to $49,000 and have purchased
a laptop (computer) at Samsung. There is a 60% probability that a
customer in this age and income group will purchase a laptop.
◼ Note that this is an association involving more than one attribute
or predicate (i.e., age, income, and buys).
◼ The above rule can be referred to as a multidimensional
association rule.

Data Mining Function: (3) Classification

◼ Classification and label prediction


◼ Classification is the process of finding a model (or function) that
describes and distinguishes data classes or concepts.
◼ The model is derived based on the analysis of a set of training data (i.e.,
data objects for which the class labels are known).
◼ The model is used to predict the class label of objects for which the class
label is unknown.
◼ E.g., classify countries based on (climate), or classify cars based on (gas
mileage)

◼ Typical methods
◼ Decision trees, naïve Bayesian classification, support vector machines,
neural networks, rule-based classification, pattern-based classification,
logistic regression.
◼ A decision tree is a flowchart-like tree structure, where each
node denotes a test on an attribute value, each branch
represents an outcome of the test, and tree leaves represent
classes or class distributions.
◼ A neural network, when used for classification, is typically a
collection of neuron-like processing units with weighted
connections between the units.
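The flowchart-like structure of a decision tree maps naturally onto nested dictionaries. The tree below is written by hand rather than learned from training data, and its attributes and class labels are hypothetical; the sketch only shows how a record is routed from the root to a leaf.

```python
# Internal node: attribute to test plus one branch per outcome.
# Leaf: a plain string holding the class label.
tree = {
    "attribute": "age",
    "branches": {
        "youth": {
            "attribute": "student",
            "branches": {"yes": "buys", "no": "does_not_buy"},
        },
        "middle_aged": "buys",
        "senior": {
            "attribute": "credit_rating",
            "branches": {"excellent": "buys", "fair": "does_not_buy"},
        },
    },
}

def classify(node, record):
    """Follow the branch matching each test outcome until a leaf is reached."""
    while isinstance(node, dict):
        outcome = record[node["attribute"]]
        node = node["branches"][outcome]
    return node

print(classify(tree, {"age": "youth", "student": "yes"}))          # buys
print(classify(tree, {"age": "senior", "credit_rating": "fair"}))  # does_not_buy
```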

◼ Regression analysis is a statistical methodology that is most
often used for numeric prediction, although other methods
exist as well. Regression also encompasses the identification
of distribution trends based on the available data.
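As one simple instance of numeric prediction, the sketch below fits a straight line by ordinary least squares. The data points are made up and lie exactly on y = 2x + 1, so the fitted coefficients are easy to verify by hand; `fit_line` is an illustrative name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)         # 2.0 1.0
print(slope * 5.0 + intercept)  # numeric prediction for x = 5
```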

Data Mining Function: (4) Cluster Analysis

◼ Clustering analyzes data objects without consulting class labels.
◼ Unsupervised learning (i.e., the class label is unknown)
◼ Group data to form new categories (i.e., clusters), e.g., cluster
houses to find distribution patterns.
◼ Clustering can also facilitate taxonomy formation, that is, the
organization of observations into a hierarchy of classes that group
similar events together.
◼ These clusters may represent individual target groups for marketing.
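One common clustering method is k-means, sketched below for 2-D points. The slides do not name a specific algorithm, so k-means is an assumption chosen for brevity; the house coordinates and starting centroids are made up, and fixed starting centroids keep the run deterministic.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return clusters, centroids

# Made-up house locations forming two obvious groups.
houses = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
clusters, centroids = kmeans(houses, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(clusters[0])  # the three houses in the lower-left group
print(clusters[1])  # the three houses in the upper-right group
```

No class labels are supplied anywhere: the two groups emerge from the distances alone.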

Data Mining Function: (5) Outlier Analysis

◼ Outlier analysis
◼ Outlier: A data object that does not comply with the general
behavior of the data.
◼ Noise or exception? One person’s garbage could be
another person’s treasure.
◼ Methods: by-product of clustering or regression analysis, …
◼ Useful in fraud detection, rare events analysis
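A minimal sketch of flagging outliers follows. It uses a simple z-score test, which is swapped in here for brevity and is not one of the clustering- or regression-based methods named above; the threshold of 2 standard deviations and the transaction amounts are assumptions.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Made-up transaction amounts with one unusually large charge.
amounts = [20, 22, 19, 21, 23, 20, 18, 500]
print(zscore_outliers(amounts))  # [500]
```

Whether the flagged value is noise to discard or a fraud case worth investigating depends on the application.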

Data Mining: Confluence of Multiple Disciplines


◼ Statistics studies the collection, analysis,
interpretation or explanation, and presentation of data.
◼ For example, we can use statistics to model noise and
missing data values.
◼ Statistics research develops tools for prediction and
forecasting using data and statistical models.

◼ Machine learning investigates how computers can
learn (or improve their performance) based on data.
◼ A main research area is for computer programs to
automatically learn to recognize complex patterns and make
intelligent decisions based on data.


◼ Database systems research focuses on the creation,
maintenance, and use of databases for organizations
and end-users.
◼ Database systems are often well known for their high
scalability in processing very large, relatively structured data
sets.
◼ A data warehouse integrates data originating from
multiple sources and various timeframes.
◼ It consolidates data in multidimensional space to form
partially materialized data cubes.
◼ Information retrieval (IR) is the science of searching
for documents or information in documents.
◼ Documents can be text or multimedia, and may reside on the
Web.
Applications of Data Mining
◼ Business intelligence (BI) technologies provide
historical, current, and predictive views of business
operations.
◼ Examples include reporting, online analytical processing, business
performance management, competitive intelligence, benchmarking, and
predictive analytics.

◼ A Web search engine is a specialized computer server
that searches for information on the Web.
◼ The search results of a user query are often returned as a list (sometimes
called hits).
◼ The hits may consist of web pages, images, and other types of files.

◼ Web page analysis: from web page classification, clustering to
PageRank & HITS algorithms
◼ Collaborative analysis & recommender systems
◼ Basket data analysis to targeted marketing
◼ Biological and medical data analysis: classification, cluster
analysis (microarray data analysis), biological sequence
analysis, biological network analysis
◼ Data mining and software engineering (e.g., IEEE Computer, Aug.
2009 issue)
◼ From major dedicated data mining systems/tools (e.g., SAS, MS
SQL-Server Analysis Manager, Oracle Data Mining Tools) to
invisible data mining
Major Issues in Data Mining (1)

◼ Mining Methodology
◼ Mining various and new kinds of knowledge
◼ Mining knowledge in multi-dimensional space
◼ Data mining: An interdisciplinary effort
◼ Boosting the power of discovery in a networked environment
◼ Handling noise, uncertainty, and incompleteness of data
◼ Pattern evaluation and pattern- or constraint-guided mining
◼ User Interaction
◼ Interactive mining
◼ Incorporation of background knowledge
◼ Presentation and visualization of data mining results

Major Issues in Data Mining (2)

◼ Efficiency and Scalability


◼ Efficiency and scalability of data mining algorithms
◼ Parallel, distributed, stream, and incremental mining
methods
◼ Diversity of data types
◼ Handling complex types of data
◼ Mining dynamic, networked, and global data repositories
◼ Data mining and society
◼ Social impacts of data mining
◼ Privacy-preserving data mining
◼ Invisible data mining
A Brief History of Data Mining Society

◼ 1989 IJCAI Workshop on Knowledge Discovery in Databases


◼ Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
◼ 1991-1994 Workshops on Knowledge Discovery in Databases
◼ Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-
Shapiro, P. Smyth, and R. Uthurusamy, 1996)
◼ 1995-1998 International Conferences on Knowledge Discovery in Databases and Data
Mining (KDD’95-98)
◼ Journal of Data Mining and Knowledge Discovery (1997)
◼ ACM SIGKDD conferences since 1998 and SIGKDD Explorations
◼ More conferences on data mining
◼ PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
◼ ACM Transactions on KDD starting in 2007

Conferences and Journals on Data Mining

◼ KDD Conferences
◼ ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases
and Data Mining (KDD)
◼ SIAM Data Mining Conf. (SDM)
◼ (IEEE) Int. Conf. on Data Mining (ICDM)
◼ European Conf. on Machine Learning and Principles and Practices of
Knowledge Discovery and Data Mining (ECML-PKDD)
◼ Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
◼ Int. Conf. on Web Search and Data Mining (WSDM)

◼ Other related conferences
◼ DB conferences: ACM SIGMOD, VLDB, ICDE, EDBT, ICDT, …
◼ Web and IR conferences: WWW, SIGIR, WSDM
◼ ML conferences: ICML, NIPS
◼ PR conferences: CVPR

◼ Journals
◼ Data Mining and Knowledge Discovery (DAMI or DMKD)
◼ IEEE Trans. on Knowledge and Data Eng. (TKDE)
◼ KDD Explorations
◼ ACM Trans. on KDD

Where to Find References? DBLP, CiteSeer, Google

◼ Data mining and KDD (SIGKDD: CDROM)


◼ Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
◼ Journal: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
◼ Database systems (SIGMOD: ACM SIGMOD Anthology—CD ROM)
◼ Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
◼ Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
◼ AI & Machine Learning
◼ Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
◼ Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI,
etc.
◼ Web and IR
◼ Conferences: SIGIR, WWW, CIKM, etc.
◼ Journals: WWW: Internet and Web Information Systems,
◼ Statistics
◼ Conferences: Joint Stat. Meeting, etc.
◼ Journals: Annals of statistics, etc.
◼ Visualization
◼ Conference proceedings: CHI, ACM-SIGGraph, etc.
◼ Journals: IEEE Trans. visualization and computer graphics, etc.
Summary
◼ Data mining is the process of discovering interesting patterns from massive
amounts of data.
◼ A pattern is interesting if it is valid on test data with some degree of certainty,
novel, potentially useful and easily understood by humans.
◼ Data mining can be conducted on any kind of data as long as the data are
meaningful for a target application, such as database data, data warehouse
data, transactional data, and advanced data types.
◼ A data warehouse is a repository for long-term storage of data from multiple
sources, organized so as to facilitate management decision-making.
◼ Data mining functionalities are used to specify the kinds of patterns or
knowledge to be found in data mining tasks.
◼ There are many challenging issues in data mining research. Areas include
mining methodology, user interaction, efficiency and scalability, and dealing
with diverse data types.
Recommended Reference Books
◼ S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann,
2002
◼ R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2ed., Wiley-Interscience, 2000
◼ T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
◼ U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data
Mining. AAAI/MIT Press, 1996
◼ U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan
Kaufmann, 2001
◼ J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd ed., 2011
◼ D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001
◼ T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and
Prediction, 2nd ed., Springer-Verlag, 2009
◼ B. Liu, Web Data Mining, Springer 2006.
◼ T. M. Mitchell, Machine Learning, McGraw Hill, 1997
◼ G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991
◼ P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005
◼ S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998
◼ I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java
Implementations, Morgan Kaufmann, 2nd ed. 2005
◼ https://www.nature.com/subjects/computationalscience#:~:text=Computational%20science%20is%20a%20discipline,a%20scientific%20system%20or%20process.

Activity 1
◼ What is data mining? In your answer, address the following: (5 pts. each)
◼ a) Is it another hype?

◼ b) Is it a simple transformation or application of technology developed
from databases, statistics, machine learning, and pattern recognition?
◼ c) We have presented a view that data mining is the result of the
evolution of database technology. Do you think that data mining is also
the result of the evolution of machine learning research? Can you
present such views based on the historical progress of this discipline?
Do the same for the fields of statistics and pattern recognition.
◼ d) Describe the steps involved in data mining when viewed as a
process of knowledge discovery
◼ Present an example where data mining is crucial to the success of a
business. (10 pts. each)
◼ What data mining functionalities does this business need (e.g., think of
the kinds of patterns that could be mined)?
◼ Can such patterns be generated alternatively by data query processing
or simple statistical analysis?

February 18, 2024 Data Mining: Concepts and Techniques 61
