Data Mining

Data mining is the process of sorting through large amounts of data and picking out relevant
information. It is used mainly by business intelligence organizations and financial analysts, but
is increasingly used in the sciences to extract information from the enormous data sets
generated by modern experimental and observational methods. It has been described as "the
nontrivial extraction of implicit, previously unknown, and potentially useful information from
data"[1] and "the science of extracting useful information from large data sets or databases."[2]
Data mining in relation to enterprise resource planning is the statistical and logical analysis of
large sets of transaction data, looking for patterns that can aid decision making.[3]

Background
Traditionally, business analysts have performed the task of extracting useful information from
recorded data, but the increasing volume of data in modern business and science calls for
computer-based approaches. As data sets have grown in size and complexity, there has been a
shift away from direct hands-on data analysis toward indirect, automatic data analysis using
more complex and sophisticated tools. The modern technologies of computers, networks, and
sensors have made data collection and organization much easier. However, the captured data
needs to be converted into information and knowledge to become useful. Data mining is the
entire process of applying computer-based methodology, including new techniques for
knowledge discovery, to data.[4]

Data mining identifies trends within data that go beyond simple analysis. Through the use of
sophisticated algorithms, non-statistician users have the opportunity to identify key attributes of
business processes and target opportunities. However, ceding control of this process from the
statistician to the machine may result in false positives or in no useful results at all.

Although data mining is a relatively new term, the technology is not. For many years, businesses
have used powerful computers to sift through volumes of data such as supermarket scanner data
to produce market research reports (although reporting is not considered to be data mining).
Continuous innovations in computer processing power, disk storage, and statistical software are
dramatically increasing the accuracy and usefulness of data analysis.

The term data mining is often applied to two separate processes: knowledge discovery and
prediction. Knowledge discovery provides explicit information that has a readable form and can
be understood by a user. Forecasting, or predictive modeling, provides predictions of future
events; it may be transparent and readable in some approaches (e.g., rule-based systems) and
opaque in others, such as neural networks. Moreover, some data-mining systems, such as neural
networks, are inherently geared towards prediction and pattern recognition rather than
knowledge discovery.

Metadata, or data about a given data set, are often expressed in a condensed data-minable
format, or one that facilitates the practice of data mining. Common examples include executive
summaries and scientific abstracts.

Data mining relies on real-world data. Such data is extremely vulnerable to collinearity
precisely because data from the real world may have unknown interrelations. An unavoidable
weakness of data mining is that the critical data that might expose a relationship may never
have been observed. Experiment-based alternatives, such as choice modelling for
human-generated data, may be used instead. In these approaches, inherent correlations are either
controlled for or removed altogether through the construction of an experimental design.
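
As a rough illustration of how collinearity might be detected before mining, the sketch below computes pairwise Pearson correlations between columns and flags pairs above a cutoff. The column names, the sample values, and the 0.9 threshold are all hypothetical and chosen only for the example; this is one simple screening approach, not a complete treatment of multicollinearity.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def collinear_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for each column pair with |r| >= threshold."""
    names = list(columns)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                flagged.append((a, b, r))
    return flagged


# Hypothetical observational data: two of the columns move together.
data = {
    "ad_spend": [10, 20, 30, 40, 50],
    "store_visits": [12, 22, 29, 41, 52],
    "rainfall_mm": [5, 1, 4, 2, 3],
}
flagged = collinear_pairs(data)
for a, b, r in flagged:
    print(f"{a} and {b} are strongly correlated (r = {r:.3f})")
```

A flagged pair does not say which variable drives the outcome; it only warns that the observed data cannot separate their effects, which is the gap a designed experiment is meant to close.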

There have been some efforts to define standards for data mining, for example the CRISP-DM
standard for analysis processes and the Java Data-Mining Standard. Independent of these
standardization efforts, freely available open-source software systems such as RapidMiner and
Weka have become an informal standard for defining data-mining processes.

Privacy concerns
There are also privacy and human rights concerns associated with data mining, specifically
regarding the source of the data analyzed. Data mining provides information that may be difficult
to obtain otherwise. When the data collected involves individual people, there are many
questions concerning privacy, legality, and ethics. In particular, data mining government or
commercial data sets for national security or law enforcement purposes has raised privacy
concerns.

Notable uses of data mining


Combating Terrorism

Data mining has been cited as the method by which the U.S. Army unit Able Danger identified
Mohamed Atta, the leader of the September 11, 2001 attacks, and three other 9/11 hijackers as
possible members of an Al Qaeda cell operating in the U.S. more than a year before the attacks.
It has been suggested that both the Central Intelligence Agency and the Canadian Security
Intelligence Service have employed this method.

Previous U.S. government data-mining programs intended to stop terrorism include the Terrorism
Information Awareness (TIA) program, the Computer-Assisted Passenger Prescreening System
(CAPPS II), Analysis, Dissemination, Visualization, Insight, and Semantic Enhancement
(ADVISE), the Multistate Anti-Terrorism Information Exchange (MATRIX), and the Secure Flight
program [Security-MSNBC]. These programs have been discontinued due to controversy over
whether they violate the Fourth Amendment to the U.S. Constitution.

Games

Since the early 1960s, oracles, also called tablebases, have become available for certain
combinatorial games: for example, 3x3 chess with any beginning configuration, small-board
dots-and-boxes, small-board hex, and certain endgames in chess, dots-and-boxes, and hex. This
has opened up a new area for data mining: the extraction of human-usable strategies from these
oracles. Current pattern recognition approaches do not seem to have the high level of
abstraction required to be applied successfully. Instead, extensive experimentation with the
tablebases, combined with an intensive study of tablebase answers to well-designed problems
and with knowledge of prior art (i.e., pre-tablebase knowledge), is used to yield insightful
patterns. Berlekamp (in dots-and-boxes, among other games) and John Nunn (in chess endgames)
are notable examples of researchers doing this work, though they were not and are not involved
in tablebase generation.

Business

Data mining in customer relationship management applications can contribute significantly to
the bottom line.[citation needed] Rather than contacting every prospect or customer through a
call center or by mail, a business contacts only the prospects predicted to have a high likelihood
of responding to an offer. More sophisticated methods may be used to optimize across campaigns,
predicting which channel and which offer an individual is most likely to respond to across all
potential offers. Finally, in cases where many people will take an action without an offer, uplift
modeling can be used to determine which people will show the greatest increase in response if
given an offer. Data clustering can also be used to automatically discover the segments or groups
within a customer data set.
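
The idea behind uplift modeling can be sketched with a simple two-group comparison: for each customer segment, subtract the response rate of people who received no offer from the response rate of people who did. The segment names and records below are hypothetical, and practical uplift models are usually fitted per individual rather than per segment; this is only a minimal illustration of the quantity being estimated.

```python
from collections import defaultdict


def uplift_by_segment(records):
    """Estimate uplift per segment.

    records: iterable of (segment, got_offer, responded) tuples.
    Returns {segment: treated response rate - control response rate}.
    """
    # For each segment, keep [responses, total] for offer and no-offer groups.
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for segment, got_offer, responded in records:
        cell = counts[segment][got_offer]
        cell[0] += 1 if responded else 0
        cell[1] += 1

    uplift = {}
    for segment, groups in counts.items():
        (t_resp, t_n), (c_resp, c_n) = groups[True], groups[False]
        if t_n and c_n:  # both groups are needed for a comparison
            uplift[segment] = t_resp / t_n - c_resp / c_n
    return uplift


# Hypothetical campaign data: "loyal" customers respond with or without an
# offer, while "lapsed" customers respond only when given one.
records = (
    [("loyal", True, True)] * 3 + [("loyal", True, False)] * 1
    + [("loyal", False, True)] * 3 + [("loyal", False, False)] * 1
    + [("lapsed", True, True)] * 2 + [("lapsed", True, False)] * 2
    + [("lapsed", False, False)] * 4
)
uplift = uplift_by_segment(records)
best = max(uplift, key=uplift.get)
print(uplift, "-> target:", best)
```

In this made-up data the "loyal" segment has the higher raw response rate but no uplift, which is why ranking by predicted response alone can waste offers on people who would have acted anyway.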

Businesses employing data mining quickly see a return on investment, but they also recognize
that the number of predictive models can quickly become very large. Rather than one model to
predict which customers will churn, a business could build a separate model for each region and
customer type. Then, instead of sending an offer to all people who are likely to churn, it may
want to send offers only to customers who are likely to accept the offer. Finally, it may also
want to determine which customers are going to be profitable over a window of time and send
the offers only to those likely to be profitable. To maintain this quantity of models, businesses
need to manage model versions and move to automated data mining.
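
The chain of filters described above (likely to churn, likely to accept, likely to be profitable) can be sketched as a selection rule over model outputs. The field names, the 0.5 churn cutoff, and the offer cost are assumptions made for this example; in practice each score would come from its own predictive model.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    name: str
    p_churn: float        # predicted probability of churning
    p_accept: float       # predicted probability of accepting the offer
    future_profit: float  # predicted profit over the chosen time window


def select_for_offer(customers, offer_cost, churn_cutoff=0.5):
    """Keep likely churners whose expected retained profit exceeds the offer cost."""
    chosen = []
    for c in customers:
        expected_gain = c.p_accept * c.future_profit
        if c.p_churn >= churn_cutoff and expected_gain > offer_cost:
            chosen.append(c.name)
    return chosen


# Hypothetical scored customers.
customers = [
    Customer("ann", p_churn=0.8, p_accept=0.5, future_profit=200.0),
    Customer("bob", p_churn=0.9, p_accept=0.05, future_profit=100.0),
    Customer("cy", p_churn=0.1, p_accept=0.9, future_profit=500.0),
    Customer("dee", p_churn=0.7, p_accept=0.6, future_profit=10.0),
]
selected = select_for_offer(customers, offer_cost=20.0)
print(selected)
```

Each additional score in the rule is another model to train, version, and refresh, which is where the management burden described in the text comes from.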

Data mining can also be helpful to human-resources departments in identifying the
characteristics of their most successful employees. Information obtained, such as universities
attended by highly successful employees, can help HR focus recruiting efforts accordingly.
Additionally, Strategic Enterprise Management applications help a company translate corporate-
level goals, such as profit and margin share targets, into operational decisions, such as
production plans and workforce levels.

Another example of data mining, often called market basket analysis, relates to its use in
retail sales. If a clothing store records the purchases of customers, a data-mining system could
identify those customers who favour silk shirts over cotton ones. Although such relationships
may be difficult to explain, taking advantage of them is easier. The example deals with
association rules within transaction-based data. Not all data are transaction-based, and logical or
inexact rules may also be present within a database. In a manufacturing application, an inexact
rule may state that 73% of products that have a specific defect or problem will develop a
secondary problem within the next six months.
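
Association rules of this kind are commonly scored by support (the share of all transactions containing both items) and confidence (the share of transactions containing the first item that also contain the second); the 73% figure above reads as a confidence value. The sketch below mines pairwise rules from a handful of hypothetical baskets. The item names and both thresholds are invented for illustration, and practical systems use algorithms such as Apriori to handle larger item sets efficiently.

```python
from collections import Counter
from itertools import combinations


def association_rules(transactions, min_support=0.4, min_confidence=0.8):
    """Return (lhs, rhs, support, confidence) for item pairs that pass both cutoffs."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # ignore duplicates within one basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Test the rule in both directions: a -> b and b -> a.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules


# Hypothetical clothing-store baskets.
baskets = [
    ["silk shirt", "tie"],
    ["silk shirt", "tie"],
    ["silk shirt", "tie", "belt"],
    ["cotton shirt", "belt"],
    ["cotton shirt", "tie"],
]
rules = association_rules(baskets)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support {support:.0%}, confidence {confidence:.0%}")
```

Here every basket with a silk shirt also contains a tie, so the rule "silk shirt -> tie" passes, while the reverse rule falls below the confidence cutoff because ties are also bought with cotton shirts.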

The following is a list of the top eight data-mining software vendors in 2008, as published in a
Gartner study:

• Angoss Software
• Infor CRM Epiphany
• Portrait Software
• SAS
• SPSS
• ThinkAnalytics
• Unica
• Viscovery
