Introduction To Data Mining
We live in a world where vast amounts of data are collected daily. Analyzing such data is an
important need. “We are living in the information age” is a popular saying; however, we are
actually living in the data age. Terabytes or petabytes of data pour into our computer networks,
the World Wide Web (WWW), and various data storage devices every day from business, society,
science and engineering, medicine, and almost every other aspect of daily life. This explosive
growth of available data volume is a result of the computerization of our society and the fast
development of powerful data collection and storage tools.
1. Businesses worldwide generate gigantic data sets, including sales transactions, stock
trading records, product descriptions, sales promotions, company profiles and
performance, and customer feedback.
2. Scientific and engineering practices generate high orders of petabytes of data in a
continuous manner, from remote sensing, process measuring, scientific experiments,
system performance, engineering observations, and environment surveillance.
3. Global backbone telecommunication networks carry tens of petabytes of data traffic every
day.
4. The medical and health industry generates tremendous amounts of data from medical
records, patient monitoring, and medical imaging.
5. Billions of Web searches supported by search engines process tens of petabytes of data
daily.
6. Communities and social media have become increasingly important data sources,
producing digital pictures and videos, blogs, Web communities, and various kinds of social
networks.
The list of sources that generate huge amounts of data is endless.
Powerful and versatile tools are badly needed to automatically uncover valuable information
from the tremendous amounts of data and to transform such data into organized knowledge. This
necessity has led to the birth of data mining.
Steps of KDD
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined). A popular trend in the
information industry is to perform data cleaning and data integration as a preprocessing
step, where the resulting data are stored in a data warehouse.
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed and consolidated into forms appropriate
for mining by performing summary or aggregation operations). Sometimes data
transformation and consolidation are performed before the data selection process,
particularly in the case of data warehousing. Data reduction may also be performed to
obtain a smaller representation of the original data without sacrificing its integrity.
5. Data mining (an essential process where intelligent methods are applied to extract data
patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based
on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques
are used to present mined knowledge to users)
Steps 1 through 4 are different forms of data preprocessing, where data are prepared for mining.
The data mining step may interact with the user or a knowledge base. The interesting patterns are
presented to the user and may be stored as new knowledge in the knowledge base.
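The four preprocessing steps can be sketched as a small pipeline. This is only an illustration under stated assumptions: the record fields, the two sources `store_a` and `store_b`, and the function names are hypothetical and do not come from any particular system.

```python
from collections import Counter

# Hypothetical raw records from two sources (fields are illustrative only).
store_a = [
    {"customer": "c1", "item": "computer", "amount": 900.0},
    {"customer": "c1", "item": "printer", "amount": 120.0},
    {"customer": "c2", "item": "computer", "amount": None},    # noisy: missing amount
]
store_b = [
    {"customer": "c3", "item": "Computer ", "amount": 950.0},  # inconsistent spelling
    {"customer": "c3", "item": "software", "amount": 60.0},
]


def clean(records):
    """Step 1: drop records with a missing amount and normalize item names."""
    return [
        {**r, "item": r["item"].strip().lower()}
        for r in records
        if r["amount"] is not None
    ]


def integrate(*sources):
    """Step 2: combine several cleaned sources into one collection."""
    return [r for source in sources for r in source]


def select(records, min_amount):
    """Step 3: keep only the records relevant to the analysis task."""
    return [r for r in records if r["amount"] >= min_amount]


def transform(records):
    """Step 4: consolidate by aggregating the total amount per item."""
    totals = Counter()
    for r in records:
        totals[r["item"]] += r["amount"]
    return dict(totals)


integrated = integrate(clean(store_a), clean(store_b))
prepared = transform(select(integrated, min_amount=100.0))
print(prepared)
```

The output of `transform` is the kind of consolidated data that the data mining step (step 5) would then work on.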
Database Data: When mining relational databases, we can go beyond simple querying by searching for trends
or data patterns. For example, data mining systems can analyze customer data to predict the credit
risk of new customers based on their income, age, and previous credit information. Data mining
systems may also detect deviations—that is, items with sales that are far from those expected in
comparison with the previous year. Such deviations can then be further investigated. For example,
data mining may discover that there has been a change in packaging of an item or a significant
increase in price.
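Deviation detection of this kind can be sketched as a comparison of each item's sales against the previous year. The function name `find_deviations`, the 50% threshold, and the sales figures are assumptions chosen for illustration.

```python
def find_deviations(last_year, this_year, threshold=0.5):
    """Return items whose relative change in sales exceeds the threshold."""
    deviations = {}
    for item, expected in last_year.items():
        if expected == 0:
            continue  # relative change is undefined without a baseline
        actual = this_year.get(item, 0)
        change = (actual - expected) / expected
        if abs(change) > threshold:
            deviations[item] = change
    return deviations


# Hypothetical unit sales per item.
last_year = {"printer": 200, "laptop": 150, "camera": 80}
this_year = {"printer": 210, "laptop": 60, "camera": 85}

deviations = find_deviations(last_year, this_year)
print(deviations)
```

Only the flagged items would then be investigated further, for example by checking for a change in packaging or price.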
Data Warehouse: Although data warehouse tools help support data analysis, additional tools for
data mining are often needed for in-depth analysis. Multidimensional data mining (also called
exploratory multidimensional data mining) performs data mining in multidimensional space in
an OLAP style. That is, it allows the exploration of multiple combinations of dimensions at varying
levels of granularity in data mining, and thus has greater potential for discovering interesting
patterns representing knowledge.
Transactional Data: A traditional database system is not able to perform market basket data
analysis. Fortunately, data mining on transactional data can do so by mining frequent itemsets, that
is, sets of items that are frequently sold together. It answers questions such as “Which items sold well together?” This kind of
market basket data analysis would enable you to bundle groups of items together as a strategy for
boosting sales. For example, given the knowledge that printers are commonly purchased together
with computers, you could offer certain printers at a steep discount (or even for free) to customers
buying selected computers, in the hopes of selling more computers (which are often more
expensive than printers).
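A minimal sketch of market basket analysis counts how often each pair of items appears in the same transaction. The transactions and the `min_count` threshold below are made up for illustration, and only pairs are counted here. Full frequent itemset algorithms also handle larger sets.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions, each a set of items bought together.
transactions = [
    {"computer", "printer", "software"},
    {"computer", "printer"},
    {"computer", "software"},
    {"printer", "paper"},
]


def frequent_pairs(transactions, min_count):
    """Count co-occurring item pairs and keep those seen at least min_count times."""
    counts = Counter()
    for basket in transactions:
        # Sorting gives each pair one canonical order, e.g. ("computer", "printer").
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


pairs = frequent_pairs(transactions, min_count=2)
print(pairs)
```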
Other Kinds of Data: Many other kinds of data appear in applications: time-related or
sequence data (e.g., historical records, stock exchange data, and time-series and biological sequence
data), data streams (e.g., video surveillance and sensor data, which are continuously transmitted),
spatial data (e.g., maps), engineering design data (e.g., the design of buildings, system components,
or integrated circuits), hypertext and multimedia data (including text, image, video, and audio
data), graph and networked data (e.g., social and information networks), and the Web (a huge,
widely distributed information repository made available by the Internet). These applications bring
about new challenges, like how to handle data carrying special structures (e.g., sequences, trees,
graphs, and networks) and specific semantics (such as ordering, image, audio and video contents,
and connectivity), and how to mine patterns that carry rich structures and semantics.
Data Mining tasks can be classified into two categories: descriptive and predictive.
Descriptive mining: Descriptive tasks characterize the properties of the data, typically through
correlation, cross-tabulation, frequency counts, and similar summaries. These techniques are used
to determine similarities in the data and to find existing patterns. Descriptive analysis can also
identify interesting subgroups within the bulk of the available data. It emphasizes summarizing
and transforming the data into meaningful information for reporting and monitoring.
Predictive Data Mining: The main goal of predictive mining is to say something about future
results rather than current behavior. It uses supervised learning functions to predict a target
value. Methods in this category include classification, time-series analysis, and regression.
Predictive analysis requires building a model of the data: it uses the known values of some
variables to predict unknown or future values of other variables.
Data discrimination is a comparison of the general features of the target class data objects against
the general features of objects from one or multiple contrasting classes. The target and contrasting
classes can be specified by a user, and the corresponding data objects can be retrieved through
database queries. For example, a user may want to compare the general features of software
products with sales that increased by 10% last year against those with sales that decreased by at
least 30% during the same period. The methods used for data discrimination are similar to those
used for data characterization.
Frequent patterns, as the name suggests, are patterns that occur frequently in data. There are
many kinds of frequent patterns, including frequent itemsets, frequent subsequences (also known
as sequential patterns), and frequent substructures. A frequent itemset typically refers to a set of
items that often appear together in a transactional data set—for example, milk and bread, which
are frequently bought together in grocery stores by many customers. A frequently occurring
subsequence, such as the pattern that customers tend to purchase first a laptop, followed by a
digital camera, and then a memory card, is a (frequent) sequential pattern. A substructure can refer
to different structural forms (e.g., graphs, trees, or lattices) that may be combined with itemsets or
subsequences. If a substructure occurs frequently, it is called a (frequent) structured pattern.
Mining frequent patterns leads to the discovery of interesting associations and correlations within
data. Frequent itemset mining is a fundamental form of frequent pattern mining.
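To make the idea of a sequential pattern concrete, the sketch below measures the fraction of purchase histories that contain a given pattern in order, though not necessarily consecutively. The histories and the function names are hypothetical.

```python
def contains_subsequence(sequence, pattern):
    """True if the pattern's items appear in the sequence in the same order."""
    remaining = iter(sequence)
    # Each membership test consumes the iterator up to the match,
    # so later items must occur after earlier ones.
    return all(item in remaining for item in pattern)


def sequence_support(sequences, pattern):
    """Fraction of sequences that contain the pattern as a subsequence."""
    matches = sum(contains_subsequence(s, pattern) for s in sequences)
    return matches / len(sequences)


# Hypothetical purchase histories, one list per customer, in time order.
histories = [
    ["laptop", "digital camera", "memory card"],
    ["laptop", "mouse", "digital camera", "memory card"],
    ["digital camera", "laptop", "memory card"],
    ["laptop", "memory card"],
]
pattern = ["laptop", "digital camera", "memory card"]

pattern_support = sequence_support(histories, pattern)
print(pattern_support)
```

A pattern would be reported as frequent when this fraction reaches a user-chosen minimum support.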
Classification is the process of finding a model (or function) that describes and distinguishes data
classes or concepts. The model is derived based on the analysis of a set of training data (i.e., data
objects for which the class labels are known). The model is used to predict the class label of objects
for which the class label is unknown. The derived model can be represented in various forms,
such as decision trees or neural networks.
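As a rough illustration of deriving a model from labeled training data, the sketch below trains a decision stump, that is, a decision tree with a single test on one numeric attribute. The income values, the class names, and the function `train_stump` are assumptions. The sketch expects exactly two classes and at least two distinct attribute values.

```python
def train_stump(values, labels):
    """Find the threshold on one numeric attribute with the fewest training errors."""
    classes = sorted(set(labels))          # assumes exactly two classes
    distinct = sorted(set(values))
    # Candidate thresholds lie midway between neighboring attribute values.
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    best = None
    for threshold in candidates:
        for low_label, high_label in (classes, classes[::-1]):
            errors = sum(
                (low_label if v <= threshold else high_label) != y
                for v, y in zip(values, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, low_label, high_label)
    _, threshold, low_label, high_label = best

    def predict(value):
        return low_label if value <= threshold else high_label

    return predict


# Hypothetical training data: yearly income (in thousands) and known credit risk.
incomes = [20, 25, 30, 60, 75, 90]
risks = ["risky", "risky", "risky", "safe", "safe", "safe"]

predict = train_stump(incomes, risks)
print(predict(40), predict(70))
```

The returned `predict` function plays the role of the derived model: it assigns a class label to objects whose label is unknown.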
Regression models continuous-valued functions. That is, regression is used to predict missing
or unavailable numerical data values rather than (discrete) class labels. The term prediction refers
to both numeric prediction and class label prediction. Regression analysis is a statistical
methodology that is most often used for numeric prediction, although other methods exist as well.
Regression also encompasses the identification of distribution trends based on the available data.
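One common form of numeric prediction is simple linear regression fitted by least squares. The sketch below is an illustration with made-up data points. The function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations that happen to lie on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
predicted = slope * 5 + intercept   # numeric prediction for an unseen x
print(slope, intercept, predicted)
```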
Clustering analyzes data objects without consulting class labels. In many cases, class-labeled data
may simply not exist at the beginning. Clustering can be used to generate class labels for a group
of data. The objects are clustered or grouped based on the principle of maximizing the intraclass
similarity and minimizing the interclass similarity. Each cluster so formed can be viewed as a class
of objects, from which rules can be derived.
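The text does not name a specific clustering algorithm. As one possible illustration, the sketch below uses k-means on one-dimensional data, which groups each point with its nearest center and then moves each center to the mean of its group. The data points and initial centers are assumptions.

```python
def assign(points, centers):
    """Group each point with the nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans_1d(points, centers, iterations=10):
    """Alternate between assigning points and recomputing centers."""
    centers = list(centers)
    for _ in range(iterations):
        clusters = assign(points, centers)
        # An empty cluster keeps its previous center.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, assign(points, centers)


# Hypothetical unlabeled values with two visible groups.
points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]

centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
print(centers, clusters)
```

Each resulting cluster can then be treated as a class, with the cluster index serving as a generated class label.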
A data set may contain objects that do not comply with the general behavior or model of the data.
These data objects are outliers. Many data mining methods discard outliers as noise or exceptions.
However, in some applications (e.g., fraud detection) the rare events can be more interesting than
the more regularly occurring ones. The analysis of outlier data is referred to as outlier analysis or
anomaly mining. Outliers may be detected using statistical tests that assume a distribution or
probability model for the data, or using distance measures where objects that are remote from any
other cluster are considered outliers. Rather than using statistical or distance measures, density-
based methods may identify outliers in a local region, although they look normal from a global
statistical distribution view.
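A simple statistical test for outliers flags values that lie many standard deviations from the mean (a z-score test). The sketch below is one such test, not the only option. The threshold of 2.0 and the transaction amounts are assumptions.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return values whose distance from the mean exceeds threshold standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical transaction amounts with one unusually large charge.
amounts = [20, 22, 19, 21, 23, 20, 500]

outliers = zscore_outliers(amounts)
print(outliers)
```

In a fraud detection setting, the flagged values are the interesting ones rather than noise to discard.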
Several objective measures of pattern interestingness exist. These are based on the structure of
discovered patterns and the statistics underlying them.
Support: Support of a rule is a measure of how frequently the items involved in it occur together.
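Using probability notation: support(A ⇒ B) = P(A ∪ B), the probability that a transaction contains
both A and B. For example, for the rule computer ⇒ software, a support of 1% means that 1% of all
the transactions under analysis show that computer and software are purchased together.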
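Support can be computed directly as the fraction of transactions that contain all the items of a rule. The transactions below are made up for illustration, and the function name `support` is hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in the given set."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= t)  # subset test
    return hits / len(transactions)


# Hypothetical transactions, each a set of purchased items.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"computer", "software", "printer"},
    {"printer"},
]

rule_support = support(transactions, {"computer", "software"})
print(rule_support)
```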
Other objective interestingness measures include accuracy and coverage for classification (IF-
THEN) rules. In general terms, accuracy tells us the percentage of data that are correctly
classified by a rule. Coverage is similar to support, in that it tells us the percentage of data to
which a rule applies.
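The two measures can be sketched for a single IF-THEN rule. The sketch assumes that coverage is the fraction of all records satisfying the IF part, and that accuracy is measured among those covered records only. The records, the rule, and the function name are hypothetical.

```python
def rule_coverage_and_accuracy(records, condition, predicted_label):
    """Evaluate the rule: IF condition(record) THEN label = predicted_label."""
    covered = [r for r in records if condition(r)]
    coverage = len(covered) / len(records)
    correct = sum(1 for r in covered if r["label"] == predicted_label)
    accuracy = correct / len(covered) if covered else 0.0
    return coverage, accuracy


# Hypothetical labeled records: yearly income (in thousands) and credit risk.
records = [
    {"income": 80, "label": "safe"},
    {"income": 70, "label": "safe"},
    {"income": 65, "label": "risky"},
    {"income": 30, "label": "risky"},
    {"income": 20, "label": "risky"},
]

# Rule under evaluation: IF income >= 60 THEN safe
coverage, accuracy = rule_coverage_and_accuracy(
    records, lambda r: r["income"] >= 60, "safe"
)
print(coverage, accuracy)
```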
Subjective interestingness measures are based on user beliefs in the data. These measures find
patterns interesting if the patterns are unexpected (contradicting a user's belief) or offer strategic
information on which the user can act. In the latter case, such patterns are referred to as actionable.
For example, patterns like “a large earthquake often follows a cluster of small quakes” may be
highly actionable if users can act on the information to save lives. Patterns that are expected can
be interesting if they confirm a hypothesis that the user wishes to validate or they resemble a user's
hunch.
Web search engines are essentially very large data mining applications. Various data mining
techniques are used in all aspects of search engines, ranging from crawling (e.g., deciding which
pages should be crawled and the crawling frequencies) and indexing (e.g., selecting pages to be
indexed and deciding to what extent the index should be constructed) to searching (e.g.,
deciding how pages should be ranked, which advertisements should be added, and how the search
results can be personalized or made “context aware”). Search engines often need to use computer
clouds, which consist of thousands or even hundreds of thousands of computers that
collaboratively mine the huge amount of data.
Web search engines often have to deal with online data, fast-growing data streams and queries that
are asked only a very small number of times.
2. User Interaction
a. Interactive mining
b. Incorporation of background knowledge
c. Ad hoc data mining and data mining query languages
d. Presentation and visualization of data mining results