Unit 1 Datamining For Business Intelligence


BA4027 DATAMINING FOR BUSINESS

INTELLIGENCE
UNIT I INTRODUCTION

Data mining, Text mining, Web mining, Spatial mining, Process mining, Data warehouse and data marts.
Data mining
Data mining is the process of extracting and discovering patterns in large data sets involving
methods at the intersection of machine learning, statistics, and database systems.

Data mining is an interdisciplinary subfield of computer science and statistics with an overall goal of extracting
information (with intelligent methods) from a data set and transforming the information into a comprehensible
structure for further use.

Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD. Aside from the raw
analysis step, it also involves database and data management aspects, data pre-processing, model and inference
considerations, interestingness metrics, complexity considerations, post-processing of discovered structures,
visualization, and online updating.

The process of extracting information to identify patterns, trends, and useful data that allows
a business to make data-driven decisions from huge sets of data is called Data Mining.
In other words, Data Mining is the process of investigating hidden patterns of information
from various perspectives and categorizing them into useful data, which is collected and
assembled in particular areas such as data warehouses. This supports efficient analysis and
decision making, ultimately helping to cut costs and generate revenue.

Data mining is the act of automatically searching large stores of information to find
trends and patterns that go beyond simple analysis procedures. Data mining utilizes
complex mathematical algorithms to segment data and evaluate the probability of future
events. Data Mining is also called Knowledge Discovery in Databases (KDD).

Data Mining is a process used by organizations to extract specific data from huge
databases to solve business problems. It primarily turns raw data into useful information.
Data mining involves six common classes of tasks:
● Anomaly detection (outlier/change/deviation detection) – The identification of unusual data
records that might be interesting, or data errors that require further investigation. (A minimal
outlier-detection sketch in Python appears after this list.)
● Association rule learning (dependency modeling) – Searches for relationships between
variables. For example, a supermarket might gather data on customer purchasing habits.
Using association rule learning, the supermarket can determine which products are
frequently bought together and use this information for marketing purposes. This is
sometimes referred to as market basket analysis.
● Clustering – is the task of discovering groups and structures in the data that are in some
way or another "similar", without using known structures in the data.
● Classification – is the task of generalizing known structure to apply to new data. For
example, an e-mail program might attempt to classify an e-mail as "legitimate" or as
"spam".
● Regression – attempts to find a function that models the data with the least error, that is,
for estimating the relationships among data or datasets.
● Summarization – providing a more compact representation of the data set, including
visualization and report generation.
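As a rough illustration of the anomaly detection task above, the following minimal Python sketch flags unusual records with a simple z-score rule; the daily sales figures and the threshold of 2 are invented for illustration only.

```python
# Minimal anomaly-detection sketch: flag records that deviate strongly
# from the mean using a z-score rule. All values are illustrative.
import statistics

daily_sales = [120, 118, 125, 122, 119, 480, 121, 117]  # one suspicious spike

mean = statistics.mean(daily_sales)
stdev = statistics.stdev(daily_sales)

for day, value in enumerate(daily_sales, start=1):
    z = (value - mean) / stdev
    if abs(z) > 2:  # a common rule of thumb for outliers
        print(f"Day {day}: value {value} looks anomalous (z = {z:.1f})")
```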
Data Mining Architecture
The significant components of data mining systems are a data source, data mining engine, data warehouse server, the
pattern evaluation module, graphical user interface, and knowledge base.
Data Source:

The actual source of data is the Database, data warehouse, World Wide Web (WWW),
text files, and other documents.

You need a huge amount of historical data for data mining to be successful. Organizations
typically store data in databases or data warehouses.

Data warehouses may comprise one or more databases, text files, spreadsheets, or
other repositories of data. Sometimes, even plain text files or spreadsheets may contain
information. Another primary source of data is the World Wide Web or the internet.
Different processes:
Before passing the data to the database or data warehouse server, the data must be
cleaned, integrated, and selected.
As the information comes from various sources and in different formats, it can't be
used directly for the data mining procedure because the data may not be complete and
accurate. So, the data first needs to be cleaned and unified.
More information than needed will be collected from various data sources, and only the
data of interest will have to be selected and passed to the server.
These procedures are not as easy as we think. Several methods may be performed on
the data as part of selection, integration, and cleaning.
Database or Data Warehouse Server:
The database or data warehouse server consists of the original data that is ready to be
processed. Hence, the server is responsible for retrieving the relevant data for
data mining as per the user's request.

Data Mining Engine:

The data mining engine is a major component of any data mining system. It contains
several modules for operating data mining tasks, including association,
characterization, classification, clustering, prediction, time-series analysis, etc.
In other words, the data mining engine is the core of the data mining architecture. It
comprises instruments and software used to obtain insights and knowledge from data
collected from various data sources and stored within the data warehouse.
Pattern Evaluation Module:
The pattern evaluation module is primarily responsible for measuring how interesting a
discovered pattern is, using a threshold value. It collaborates with the data mining engine to focus
the search on interesting patterns.
This segment commonly employs interestingness measures that cooperate with the data mining
modules to focus the search towards interesting patterns. It might utilize an interestingness
threshold to filter out discovered patterns.
Alternatively, the pattern evaluation module might be integrated with the mining
module, depending on the implementation of the data mining techniques used. For efficient
data mining, it is generally recommended to push the evaluation of pattern interestingness as deeply as
possible into the mining procedure, so as to confine the search to only interesting patterns.

Graphical User Interface:

The graphical user interface (GUI) module communicates between the data
mining system and the user.

This module helps the user to easily and efficiently use the system without
knowing the complexity of the process. This module cooperates with the data
mining system when the user specifies a query or a task, and it displays the results.
Knowledge Base:
The knowledge base is helpful in the entire process of data mining. It
might be helpful to guide the search or evaluate the interestingness of the resulting
patterns.
The knowledge base may even contain user views and data from user
experiences that might be helpful in the data mining process.
The data mining engine may receive inputs from the knowledge base to
make the result more accurate and reliable.
The pattern assessment module regularly interacts with the knowledge
base to get inputs, and also update it.
KDD- Knowledge Discovery in Databases
The term KDD stands for Knowledge Discovery in Databases. It refers to the broad
procedure of discovering knowledge in data and emphasizes the high-level
applications of specific Data Mining techniques. It is a field of interest to researchers
in various fields, including artificial intelligence, machine learning, pattern recognition,
databases, statistics, knowledge acquisition for expert systems, and data visualization.
The main objective of the KDD process is to extract information from data in the
context of large databases. It does this by using Data Mining algorithms to identify
what is deemed knowledge.
Advantages of Data Mining

● The Data Mining technique enables organizations to obtain knowledge-based data.


● Data mining enables organizations to make lucrative modifications in operation and
production.
● Compared with other statistical data applications, data mining is cost-efficient.
● Data Mining helps the decision-making process of an organization.
● It facilitates the automated discovery of hidden patterns as well as the prediction of
trends and behaviors.
● It can be induced in the new system as well as the existing platforms.
● It is a quick process that makes it easy for new users to analyze enormous amounts of
data in a short time.
Disadvantages of Data Mining
● There is a probability that organizations may sell useful customer data to other
organizations for money. It has been reported that American Express sold its customers'
credit card purchase data to other organizations.
● Many data mining analytics software packages are difficult to operate and need advanced training to work on.
● Different data mining instruments operate in distinct ways due to the different algorithms used in
their design. Therefore, the selection of the right data mining tools is a very challenging task.
● The data mining techniques are not always precise, which may lead to serious consequences in certain
conditions.
Challenges of Implementation in Data mining

Incomplete and noisy data:


The process of extracting useful data from large volumes of data is data mining.
Real-world data is heterogeneous, incomplete, and noisy. Data in huge
quantities will usually be inaccurate or unreliable. These problems may occur due
to errors in the measuring instruments or because of human errors.
Data Distribution:
Real-world data is usually stored on various platforms in a distributed computing
environment. It might be in a database, individual systems, or even on the internet.
Practically, it is quite a tough task to bring all the data into a centralized data
repository, mainly due to organizational and technical concerns.
Complex Data:

Real-world data is heterogeneous, and it could be multimedia data, including


audio and video, images, complex data, spatial data, time series, and so on.
Managing these various types of data and extracting useful information is a tough
task. Most of the time, new technologies, new tools, and methodologies would
have to be refined to obtain specific information.

Performance:

The data mining system's performance relies primarily on the efficiency of


algorithms and techniques used. If the designed algorithm and techniques are
not up to the mark, then the efficiency of the data mining process will be affected
adversely.
Data Privacy and Security:
Data mining usually leads to serious issues in terms of data security, governance,
and privacy. For example, if a retailer analyzes the details of the purchased items,
then it reveals data about buying habits and preferences of the customers
without their permission.
Data Visualization:
In data mining, data visualization is a very important process because it is the
primary method that shows the output to the user in a presentable way. The
extracted data should convey the exact meaning of what it intends to express. But
many times, representing the information to the end-user in a precise and easy
way is difficult.
Data Mining Applications
Data Mining in Healthcare:

Data mining in healthcare has excellent potential to improve the health system. It uses data
and analytics for better insights and to identify best practices that will enhance health care
services and reduce costs. Analysts use data mining approaches such as Machine learning,
Multi-dimensional database, Data visualization, Soft computing, and statistics. Data Mining
can be used to forecast patients in each category. The procedures ensure that the patients
get intensive care at the right place and at the right time. Data mining also enables
healthcare insurers to recognize fraud and abuse.

Data Mining in Market Basket Analysis:

Market basket analysis is a modeling method based on a hypothesis. If you buy a specific
group of products, then you are more likely to buy another group of products. This technique
may enable the retailer to understand the purchase behavior of a buyer. This data may
assist the retailer in understanding the requirements of the buyer and altering the store's
layout accordingly. Using a different analytical comparison of results between various
stores, between customers in different demographic groups can be done.
Data mining in Education:

Education data mining is a newly emerging field, concerned with developing techniques that
explore knowledge from the data generated from educational Environments. EDM objectives
are recognized as affirming student's future learning behavior, studying the impact of
educational support, and promoting learning science. An organization can use data mining
to make precise decisions and also to predict the results of the student. With the results,
the institution can concentrate on what to teach and how to teach.

Data Mining in Manufacturing Engineering:

Knowledge is the best asset possessed by a manufacturing company. Data mining tools can
be beneficial to find patterns in a complex manufacturing process. Data mining can be used
in system-level designing to obtain the relationships between product architecture, product
portfolio, and data needs of the customers. It can also be used to forecast the product
development period, cost, and expectations among the other tasks.
Data Mining in CRM (Customer Relationship Management):

Customer Relationship Management (CRM) is all about obtaining and holding Customers,
also enhancing customer loyalty and implementing customer-oriented strategies. To get a
decent relationship with the customer, a business organization needs to collect data and
analyze the data. With data mining technologies, the collected data can be used for
analytics.

Data Mining in Fraud detection:

Billions of dollars are lost to fraud. Traditional methods of fraud detection are
time-consuming and complex. Data mining provides meaningful patterns and
turns data into information. An ideal fraud detection system should protect the data of all
the users. Supervised methods use a collection of sample records that
are classified as fraudulent or non-fraudulent. A model is constructed using this data, and
the technique is then used to identify whether a new record is fraudulent or not.
Data Mining in Lie Detection:

Apprehending a criminal is not a big deal, but bringing out the truth from him is a very challenging
task. Law enforcement may use data mining techniques to investigate offenses, monitor
suspected terrorist communications, etc. This technique includes text mining also, and it seeks
meaningful patterns in data, which is usually unstructured text. The information collected from the
previous investigations is compared, and a model for lie detection is constructed.

Data Mining Financial Banking:

The digitalization of the banking system generates an enormous amount of data
with every new transaction. Data mining techniques can help bankers solve business-
related problems in banking and finance by identifying trends, causalities, and correlations in
business information and market prices that are not immediately evident to managers or executives
because the data volume is too large or is produced too rapidly for experts to analyze. These
findings can then be used for better targeting, acquiring, retaining, and segmenting
profitable customers.
Types of Data Mining

Each of the following data mining techniques serves several different business
problems and provides a different insight into each of them. However,
understanding the type of business problem you need to solve will also help in
knowing which technique will be best to use, which will yield the best results. The
Data Mining types can be divided into two basic parts that are as follows:

1. Predictive Data Mining Analysis


2. Descriptive Data Mining Analysis
1. Predictive Data Mining
As the name signifies, Predictive Data-Mining analysis works on the data
that may help to know what may happen later (or in the future) in
business. Predictive Data-Mining can also be further divided into four
types that are listed below:

● Classification Analysis
● Regression Analysis
● Time Series Analysis
● Prediction Analysis
2. Descriptive Data Mining

The main goal of the Descriptive Data Mining tasks is to summarize or turn
given data into relevant information. The Descriptive Data-Mining Tasks can
also be further divided into four types that are as follows:

● Clustering Analysis
● Summarization Analysis
● Association Rules Analysis
● Sequence Discovery Analysis
1. CLASSIFICATION ANALYSIS

This type of data mining technique is generally used in fetching or retrieving important
and relevant information about the data and metadata. It is also used to categorize
data into different classes. Like clustering, classification groups data
segments into classes. However, unlike clustering, in classification analysis
the data analyst has prior knowledge of the different classes or clusters.

This technique is usually very helpful for retailers, who can use it to study the buying
habits of their different customers. Retailers can also study past sales data and
then look for products that customers usually buy together. After that,
they can place those products near each other in their retail stores to help
customers save time as well as to increase sales.
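A minimal classification sketch, assuming the scikit-learn library is available; the e-mail features, labels, and the spam/legitimate task are invented for illustration and are not part of the original notes.

```python
# Minimal classification sketch using scikit-learn (assumed installed).
# Toy task: label e-mails as spam (1) or legitimate (0) from two simple
# numeric features; the feature values here are invented.
from sklearn.tree import DecisionTreeClassifier

# features: [number of links in the e-mail, count of words like "free"/"winner"]
X_train = [[0, 0], [1, 0], [8, 5], [7, 4], [2, 1], [9, 6]]
y_train = [0, 0, 1, 1, 0, 1]          # known classes (the "known structure")

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

print(model.predict([[6, 3], [1, 0]]))   # classify two new, unseen e-mails
```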
2. REGRESSION ANALYSIS
In statistical terms, regression analysis is a process usually used to identify and
analyze the relationship among variables. It means one variable is dependent on
another, but not vice versa. It is generally used for prediction and forecasting
purposes. It can also help you understand how the value of the
dependent variable changes when any of the independent variables is varied.
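A minimal regression sketch, assuming scikit-learn is installed; the advertising-spend and sales figures are invented, and the fitted line simply illustrates how a dependent variable is modeled from an independent one.

```python
# Minimal regression sketch: fit a straight line relating an independent
# variable (advertising spend) to a dependent variable (sales).
from sklearn.linear_model import LinearRegression

spend = [[10], [20], [30], [40], [50]]      # independent variable
sales = [120, 190, 270, 330, 410]           # dependent variable

reg = LinearRegression().fit(spend, sales)
print("slope:", reg.coef_[0], "intercept:", reg.intercept_)
print("predicted sales at spend=60:", reg.predict([[60]])[0])
```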
3. Time Series Analysis
A time series is a sequence of data points recorded at specific points in time,
most often at regular intervals (seconds, hours, days, months, etc.).
Almost every organization generates a high
volume of data every day, such as sales figures, revenue, traffic, or operating cost.
Time series data mining can help in generating valuable information for long-term
business decisions, yet they are underutilized in most organizations.
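A minimal time series sketch, assuming pandas is installed; the monthly sales figures are invented, and a 3-month moving average is used only to illustrate how regularly spaced data can be smoothed to expose a trend.

```python
# Minimal time-series sketch: monthly sales figures (invented) smoothed
# with a 3-month moving average to expose the underlying trend.
import pandas as pd

sales = pd.Series(
    [100, 110, 95, 120, 130, 125, 140, 150],
    index=pd.date_range("2023-01-31", periods=8, freq="M"),
)

trend = sales.rolling(window=3).mean()   # 3-month moving average
print(trend)
```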
4. Prediction Analysis
This technique is generally used to predict the relationship that exists between
both the independent and dependent variables as well as the independent
variables alone. It can also be used to predict the profit that can be achieved in the future
depending on sales. Let us imagine that profit and sale are the dependent and
independent variables, respectively. Now, on the basis of the past sales data,
we can make a prediction of future profit using a regression curve.
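A minimal prediction sketch using NumPy; the past sales and profit pairs are invented, and a straight-line regression curve is fitted to predict profit for a projected sales figure.

```python
# Minimal prediction sketch: fit a regression curve to past (sales, profit)
# pairs and predict profit for a projected sales figure. Numbers are invented.
import numpy as np

past_sales  = np.array([100, 150, 200, 250, 300])
past_profit = np.array([12, 20, 27, 36, 44])

slope, intercept = np.polyfit(past_sales, past_profit, deg=1)  # straight-line fit
projected_sales = 350
print("predicted profit:", slope * projected_sales + intercept)
```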
5. Clustering Analysis

In Data Mining, this technique is used to create meaningful object clusters that contain
the same characteristics. It is often confused with classification, but
the two are easy to distinguish once it is understood how both techniques
actually work. Unlike classification, which assigns objects to predefined classes,
clustering places objects into classes that it defines itself. To understand it in more
detail, consider the following example:
Example

Suppose you are in a library that is full of books on different topics. Now the real challenge for you is to organize
those books so that readers don't face any problem finding out books on any particular topic. So here, we can use
clustering to keep books with similarities in one particular shelf and then give those shelves a meaningful name or
class. Therefore, a reader looking for books on a particular topic can go straight to that shelf and
won't be required to roam the entire library to find the book he wants to read.
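A minimal clustering sketch, assuming scikit-learn is installed; the customer features and the choice of two clusters are invented for illustration; no classes are predefined, and the algorithm discovers the groups itself.

```python
# Minimal clustering sketch: group customers by two invented features
# without any predefined classes.
from sklearn.cluster import KMeans

# features: [annual spend, number of visits per month]
customers = [[200, 2], [220, 3], [800, 12], [760, 10], [210, 2], [790, 11]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)        # cluster assignment discovered for each customer
```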
6. SUMMARIZATION ANALYSIS
The Summarization analysis is used to store a group (or a set ) of data in a more
compact way and an easier-to-understand form. We can easily understand it with
the help of an example:

You might have used Summarization to create graphs or calculate averages from
a given set (or group) of data. This is one of the most familiar and accessible
forms of data mining.
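A minimal summarization sketch, assuming pandas is installed; the transaction table is invented, and the group-wise averages and totals stand in for the compact summaries described above.

```python
# Minimal summarization sketch: collapse an invented transaction table
# into per-region counts, averages, and totals.
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "East"],
    "amount": [250, 300, 150, 175, 400],
})

summary = sales.groupby("region")["amount"].agg(["count", "mean", "sum"])
print(summary)       # a compact, easier-to-understand view of the raw rows
```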
7. ASSOCIATION RULE LEARNING

In general, it can be considered a method that can help us identify some interesting
relations (dependency modeling) between different variables in large databases. This
technique can also help us to unpack some hidden patterns in the data, which can be
used to identify the variables within the data. It also helps in detecting the concurrence
of different variables that appear very frequently in the dataset.
Association rules are generally used for examining and forecasting the behavior of
customers. They are highly recommended in retail industry analysis. This technique
is also used in shopping basket data analysis, catalogue design, product
clustering, and store layout. In IT, programmers also use association rules to
create programs capable of machine learning. In short, this data
mining technique helps to find the association between two or more items. It discovers
hidden patterns in the data set.
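A minimal market basket sketch in plain Python; the baskets are invented, and only pairwise support counting is shown, whereas real association rule mining (e.g., Apriori) also derives confidence and handles larger itemsets.

```python
# Minimal market-basket sketch: count how often pairs of items are bought
# together and report their support. Baskets are invented for illustration.
from itertools import combinations
from collections import Counter

baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

for pair, count in pair_counts.most_common():
    support = count / len(baskets)     # fraction of baskets containing the pair
    print(pair, f"support={support:.2f}")
```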
8. Sequence Discovery Analysis
The primary goal of sequence discovery analysis is to discover interesting
patterns in data on the basis of some subjective or objective measure of how
interesting it is. Usually, this task involves discovering frequent sequential
patterns with respect to a frequency support measure.
Sequence discovery analysis is often confused with time series analysis, as both
deal with adjacent observations that are order dependent. Looking a little more
closely, however, the confusion is easily resolved: time series analysis
deals with numerical data, whereas sequence discovery analysis deals with
discrete values or data.
Text mining

Text mining is the process of exploring and analyzing large amounts of unstructured
text data aided by software that can identify concepts, patterns, topics, keywords and
other attributes in the data.

It's also known as text analytics, although some people draw a distinction between the two
terms; in that view, text analytics refers to the application that uses text mining techniques to
sort through data sets.

Text mining has become more practical for data scientists and other users due to the
development of big data platforms and deep learning algorithms that can analyze massive
sets of unstructured data.
How text mining works
Text mining is similar in nature to data mining, but with a focus on text instead of more structured forms of data. However, one of the first steps in the text mining process is to organize and structure the data in some fashion so it can be subjected to both qualitative and quantitative analysis.

Doing so typically involves the use of natural language processing (NLP) technology, which applies computational linguistics principles to parse and interpret data sets.

The upfront work includes categorizing, clustering and tagging text; summarizing data sets; creating taxonomies; and extracting information about things like word frequencies and relationships between data entities. Analytical models are then run to generate findings that can help drive business strategies and operational actions.

In the past, NLP algorithms were primarily based on statistical or rules-based models that provided direction on what to look for in data sets. In the mid-2010s, though, deep learning models that work in a less supervised way emerged as an alternative approach for text analysis and other advanced analytics applications involving large data sets. Deep learning uses neural networks to analyze data using an iterative method that's more flexible and intuitive than what conventional machine learning supports.

As a result, text mining tools are now better equipped to uncover underlying similarities and associations in text data, even if data scientists don't have a good understanding of what they're likely to find at the start of a project. For example, an unsupervised model could organize data from text documents or emails into a group of topics without any guidance from an analyst.
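A minimal text-to-structure sketch, assuming scikit-learn is installed; the three short documents are invented, and TF-IDF weighting stands in for the "organize and structure the data" step described above.

```python
# Minimal text-mining sketch: turn unstructured text into a structured
# term matrix with TF-IDF weights. The documents are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The delivery was late and the package was damaged",
    "Great product, fast delivery, very satisfied",
    "Customer support never answered my emails",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)      # documents become weighted term vectors

print(vectorizer.get_feature_names_out())   # the extracted vocabulary
print(tfidf.toarray().round(2))             # structured, analyzable representation
```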
Applications of text mining
Sentiment analysis is a widely used text mining application that can track customer
sentiment about a company. Also known as opinion mining, sentiment analysis mines
text from online reviews, social networks, emails, call center interactions and
other data sources to identify common threads that point to positive or negative
feelings on the part of customers.
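A minimal sentiment-scoring sketch in plain Python; the word lists and reviews are invented, and real opinion-mining systems use far richer lexicons or trained models.

```python
# Minimal sentiment-scoring sketch: count positive and negative words
# from a small hand-made lexicon. Purely illustrative.
POSITIVE = {"great", "satisfied", "excellent", "love", "fast"}
NEGATIVE = {"late", "damaged", "terrible", "never", "slow"}

def sentiment(review: str) -> str:
    words = review.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("Great product and fast delivery"))       # positive
print(sentiment("The parcel arrived late and damaged"))   # negative
```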

Other common text mining uses include screening job candidates based on the
wording in their resumes, blocking spam emails, classifying website content,
flagging insurance claims that may be fraudulent, analyzing descriptions of
medical symptoms to aid in diagnoses, and examining corporate documents as
part of electronic discovery processes.
Benefits of text mining

Using text mining and analytics to gain insight into customer sentiment can help companies detect product and
business problems and then address them before they become big issues that affect sales. Mining the text in
customer reviews and communications can also identify desired new features to help strengthen product
offerings. In each case, the technology provides an opportunity to improve the overall customer experience, which
will hopefully result in increased revenue and profits.

Text mining can also help predict customer churn, enabling companies to take action to head off potential
defections to business rivals as part of their marketing and customer relationship management programs.
Fraud detection, risk management, online advertising and web content management are other functions
that can benefit from the use of text mining tools.

In healthcare, the technology may be able to help diagnose illnesses and medical conditions in patients
based on the symptoms they report.
Text mining challenges and issues
Text mining can be challenging because the data is often vague, inconsistent and contradictory. Efforts to analyze it are further complicated by ambiguities that result from differences in syntax and semantics, as well as the use of slang, sarcasm, regional dialects and technical language specific to individual vertical industries. As a result, text mining algorithms must be trained to parse such ambiguities and inconsistencies when they categorize, tag and summarize sets of text data.

In addition, the deep learning models used in many text mining applications require large amounts of training data and processing power, which can make them expensive to run. Inherent bias in data sets is another issue that can lead deep learning tools to produce flawed results if data scientists don't recognize the biases during the model development process.

There's also a lot of text mining software to choose from. Dozens of commercial and open source technologies are available, including tools from major software vendors such as IBM, Oracle, SAS, SAP and Tibco.
Web mining

Web mining has a distinctive property in that it deals with various types of data. The
web has multiple aspects that yield different approaches for the mining process:
web pages consist of text, web pages are linked via hyperlinks, and user
activity can be monitored via web server logs. These three features lead to the
differentiation of three areas: web content mining, web structure
mining, and web usage mining.
Types of web mining:
1. Web Content Mining:

Web content mining can be used to extract useful data, information, knowledge
from the web page content. In web content mining, each web page is considered
as an individual document. The individual can take advantage of the semi-
structured nature of web pages, as HTML provides information that concerns not
only the layout but also logical structure. The primary task of content mining is
data extraction, where structured data is extracted from unstructured websites.
The objective is to facilitate data aggregation over various web sites by using the
extracted structured data. Web content mining can be utilized to distinguish
topics on the web. For example, if a user searches for a specific topic on a
search engine, the user will get a list of suggestions.
2. Web Structure Mining:

Web structure mining can be used to discover the structure of hyperlinks, i.e.,
how web pages are linked to one another, either directly or through a network of links. In web
structure mining, one considers the web as a directed graph, with the
web pages being the vertices that are connected by hyperlinks. The most
important application in this regard is the Google search engine, which estimates
the ranking of its results primarily with the PageRank algorithm. It
considers a page to be highly relevant when it is frequently linked to by
other highly relevant pages. Structure and content mining methodologies are
usually combined. For example, web structure mining can help
organizations examine the link network between two commercial sites.
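A minimal PageRank-style sketch in plain Python over an invented four-page link graph; the damping factor of 0.85 is the commonly cited value, and the real Google algorithm is far more elaborate.

```python
# Minimal PageRank sketch (power iteration) over a tiny invented link graph.
links = {                     # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

damping = 0.85
pages = list(links)
rank = {p: 1 / len(pages) for p in pages}

for _ in range(50):           # iterate until the ranks stabilise
    new_rank = {}
    for p in pages:
        incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
        new_rank[p] = (1 - damping) / len(pages) + damping * incoming
    rank = new_rank

print(sorted(rank.items(), key=lambda kv: -kv[1]))   # most-linked page ranks highest
```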
3. Web Usage Mining:

Web usage mining is used to extract useful data, information, knowledge from
the weblog records, and assists in recognizing the user access patterns for web
pages. When mining the usage of web resources, one considers the
records of requests from visitors to a website, which are often collected as web server
logs. While the content and structure of the collection of web pages follow the
intentions of the authors of the pages, the individual requests demonstrate how
the consumers see these pages. Web usage mining may disclose relationships
that were not proposed by the creator of the pages.
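A minimal web usage mining sketch in plain Python; the log lines follow the Common Log Format but are invented, and counting page requests stands in for discovering user access patterns.

```python
# Minimal web-usage-mining sketch: parse invented web server log lines
# and count which pages are requested most often.
import re
from collections import Counter

log_lines = [
    '10.0.0.1 - - [01/Mar/2024:10:00:01 +0000] "GET /home HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Mar/2024:10:00:05 +0000] "GET /products HTTP/1.1" 200 734',
    '10.0.0.1 - - [01/Mar/2024:10:01:12 +0000] "GET /home HTTP/1.1" 200 512',
]

pattern = re.compile(r'"GET (\S+) HTTP')
page_hits = Counter(pattern.search(line).group(1) for line in log_lines)

print(page_hits.most_common())   # access pattern: /home requested twice
```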
Application of Web Mining:

Web mining has an extensive application because of various uses of the web. The
list of some applications of web mining is given below.

● Marketing and conversion tool


● Data analysis of website and application performance.
● Audience behavior analysis
● Advertising and campaign performance analysis.
● Testing and analysis of a site.
Challenges in Web Mining:

● The complexity of web pages:

The site pages don't have a unifying structure. They are extremely complicated as compared to
traditional text documents. There are enormous amounts of documents in the digital library of the
web. These libraries are not organized according to a specific order.

● The web is a dynamic data source:

The data on the internet is quickly updated. For example, news, climate, shopping, financial news,
sports, and so on.

● Diversity of client networks:

The client network on the web is quickly expanding. These clients have different interests,
backgrounds, and usage purposes. There are over a hundred million workstations
connected to the internet, and the number is still increasing tremendously.
● Relevancy of data:

It is considered that a specific person is generally concerned about a small portion of the
web, while the rest of the segment of the web contains the data that is not familiar to the
user and may lead to unwanted results.

● The web is too broad:

The size of the web is tremendous and rapidly increasing. It appears that the web is too
huge for data warehousing and data mining.
Spatial data mining

Spatial data mining is the application of data mining to spatial


models. In spatial data mining, analysts use geographical or spatial
information to produce business intelligence or other results. This
requires specific techniques and resources to get the geographical data
into relevant and useful formats.
It is expected to have broad applications in geographic data systems,
marketing, remote sensing, image database exploration, medical
imaging, navigation, traffic control, environmental studies, and many
other areas where spatial data are used.
A spatial database saves a huge amount of space-related data, including maps, preprocessed
remote sensing or medical imaging records, and VLSI chip design data. Spatial databases have
several features that distinguish them from relational databases. They carry topological and/or
distance information, usually organized by sophisticated, multidimensional spatial indexing
structures that are accessed by spatial data access methods and often require spatial reasoning,
geometric computation, and spatial knowledge representation techniques.

Spatial data mining refers to the extraction of knowledge, spatial relationships, or other
interesting patterns not explicitly stored in spatial databases. Such mining demands the
unification of data mining with spatial database technologies. It can be used for learning spatial
records, discovering spatial relationships and relationships among spatial and nonspatial
records, constructing spatial knowledge bases, reorganizing spatial databases, and optimizing
spatial queries.
A central challenge in spatial data mining is the development of efficient spatial data mining
techniques because of the large amount of spatial data and the complexity of spatial data
types and spatial access methods. Statistical spatial data analysis has been a popular
approach to analyzing spatial data and exploring geographic information.

The term geostatistics is often associated with continuous geographic space, whereas the
term spatial statistics is often associated with discrete space. In a statistical model that
manages non-spatial records, one generally considers statistical independence among
different areas of data.
There is no such independence among spatially distributed records because spatial objects
are in fact interrelated, or more exactly spatially co-located, in the sense that the closer two objects
are located, the more likely they are to share the same properties. For example, natural resources,
climate, temperature, and economic situations are likely to be similar in geographically closely
located regions.
Such a property of close interdependency across nearby space leads to the notion of spatial
autocorrelation. Based on this notion, spatial statistical modeling methods have been developed
with success. Spatial data mining will create spatial statistical analysis methods and extend
them for large amounts of spatial data, with more emphasis on effectiveness, scalability,
cooperation with database and data warehouse systems, enhanced user interaction, and the
discovery of new kinds of knowledge.
The following are examples of the kinds of data mining applications that could benefit from
including spatial information in their processing:

● Business prospecting: Determine if colocation of a business with another franchise


(such as colocation of a Pizza Hut restaurant with a Blockbuster video store) might
improve its sales.
● Store prospecting: Find a good store location that is within 50 miles of a major city
and inside a state with no sales tax. (Although 50 miles is probably too far to drive
to avoid a sales tax, many customers may live near the edge of the 50-mile radius
and thus be near the state with no sales tax.) A minimal distance-filter sketch for this
idea appears after this list.
● Hospital prospecting: Identify the best locations for opening new hospitals based on
the population of patients who live in each neighborhood.
● Spatial region-based classification or personalization: Determine if
southeastern United States customers in a certain age or income category are more
likely to prefer "soft" or "hard" rock music.
● Automobile insurance: Given a customer's home or work location, determine if
it is in an area with high or low rates of accident claims or auto thefts.
● Property analysis: Use colocation rules to find hidden associations between
proximity to a highway and either the price of a house or the sales volume of a
store.
● Property assessment: In assessing the value of a house, examine the values
of similar houses in a neighborhood, and derive an estimate based on
variations and spatial correlation.
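A minimal sketch for the store-prospecting example above; the city coordinates, candidate sites, and the 50-mile radius are invented or taken from the example, and the haversine formula gives the great-circle distance.

```python
# Minimal spatial sketch: keep only candidate sites within 50 miles of a
# city centre using the haversine formula. Coordinates are illustrative.
from math import radians, sin, cos, asin, sqrt

def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in miles."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 3959 * 2 * asin(sqrt(a))        # Earth radius is about 3959 miles

city = (40.7128, -74.0060)                  # a major city centre
candidate_sites = {"Site A": (40.9, -74.3), "Site B": (42.0, -76.5)}

for name, (lat, lon) in candidate_sites.items():
    d = miles_between(*city, lat, lon)
    if d <= 50:
        print(f"{name} is {d:.1f} miles away: within range")
```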
Process mining
Process mining is a family of techniques relating the fields of data science and
process management to support the analysis of operational processes based on
event logs. The goal of process mining is to turn event data into insights and actions.
Process mining is an integral part of data science, fueled by the availability of event
data and the desire to improve processes.

Process mining techniques use event data to show what people, machines, and
organizations are really doing. Process mining provides novel insights that can be used to
identify the executional path taken by operational processes and address their performance
and compliance problems.

There are three categories of process mining techniques.
● Process Discovery: The first step in process mining. The main goal of process discovery is to
transform the event log into a process model. An event log can come from any data storage
system that records the activities in an organisation along with the timestamps for those
activities. Such an event log is required to contain a case id (a unique identifier to recognise the
case to which activity belongs), activity description (a textual description of the activity
executed), and timestamp of the activity execution. The result of process discovery is generally
a process model which is representative of the event log. Such a process model can be
discovered, for example, using techniques such as alpha algorithm (a didactically driven
approach), heuristic miner, or inductive miner.[13] Many established techniques exist for
automatically constructing process models (for example, Petri nets, BPMN diagrams, activity
diagrams, State diagrams, and EPCs) based on an event log.[13][14][15][16][17] Recently, process
mining research has started targeting other perspectives (e.g., data, resources, time, etc.). One
example is the technique described in (Aalst, Reijers, & Song, 2005),[18] which can be used to
construct a social network. Nowadays, techniques such as "streaming process mining" are
being developed to work with continuous online data that has to be processed on the spot. (A
minimal directly-follows discovery sketch in Python appears after this list of techniques.)
● Conformance checking: Helps in comparing an event log with an existing process model
to analyse the discrepancies between them. Such a process model can be constructed
manually or with the help of a discovery algorithm. For example, a process model may
indicate that purchase orders of more than 1 million euros require two checks. Another
example is the checking of the so-called "four-eyes" principle. Conformance checking may
be used to detect deviations (compliance checking), or evaluate the discovery algorithms,
or enrich an existing process model. An example is the extension of a process model with
performance data, i.e., some a priori process model is used to project the potential
bottlenecks. Another example is the decision miner described in (Rozinat & Aalst,
2006b),[19] which takes an a priori process model and analyses every choice in the process
model. The event log is consulted for each option to see which information is typically
available the moment the choice is made. Conformance checking has various techniques
such as "token-based replay", "streaming conformance checking" that are used depending
on the system's needs. Then classical data mining techniques are used to see which data
elements influence the choice. As a result, a decision tree is generated for each choice in
the process.
● Performance Analysis: Used when there is an a priori model. The model is
extended with additional performance information such as processing times,
cycle times, waiting times, costs, etc., so that the goal is not to check
conformance, but rather to improve the performance of the existing model
with respect to certain process performance measures. An example is the
extension of a process model with performance data, i.e., some prior process
model dynamically annotated with performance data. It is also possible to
extend process models with additional information such as decision rules and
organisational information (e.g., roles).
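The directly-follows sketch referenced in the process discovery item above; the event log, with its case id, activity, and timestamp columns, is invented, and counting directly-follows pairs is only the first step that full discovery algorithms (alpha, heuristic, inductive miners) build on.

```python
# Minimal process-discovery sketch: build a directly-follows graph from an
# invented event log of (case_id, activity, timestamp) records.
from collections import Counter

event_log = [
    (1, "receive order", "2024-03-01T09:00"),
    (1, "check stock",   "2024-03-01T09:05"),
    (1, "ship order",    "2024-03-01T11:00"),
    (2, "receive order", "2024-03-01T10:00"),
    (2, "check stock",   "2024-03-01T10:02"),
    (2, "cancel order",  "2024-03-01T10:30"),
]

# group activities into per-case traces, ordered by timestamp
traces = {}
for case_id, activity, ts in sorted(event_log, key=lambda e: (e[0], e[2])):
    traces.setdefault(case_id, []).append(activity)

# count how often one activity directly follows another
directly_follows = Counter()
for activities in traces.values():
    for a, b in zip(activities, activities[1:]):
        directly_follows[(a, b)] += 1

for (a, b), count in directly_follows.items():
    print(f"{a} -> {b}: {count}")
```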
What is Data Warehousing?

Data warehousing is the process of constructing and using a data warehouse. A data
warehouse is constructed by integrating data from multiple heterogeneous sources that
support analytical reporting, structured and/or ad hoc queries, and decision making.
Data warehousing involves data cleaning, data integration, and data consolidations.
Functions of Data Warehouse Tools and Utilities

The following are the functions of data warehouse tools and utilities −

● Data Extraction − Involves gathering data from multiple heterogeneous sources.


● Data Cleaning − Involves finding and correcting the errors in data.
● Data Transformation − Involves converting the data from legacy format to
warehouse format.
● Data Loading − Involves sorting, summarizing, consolidating, checking integrity,
and building indices and partitions.
● Refreshing − Involves updating from data sources to warehouse.
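A minimal sketch of the extraction, cleaning, transformation, and loading functions listed above, assuming pandas is installed; the source tables, column names, and output file are invented for illustration.

```python
# Minimal ETL sketch: extract rows from two "sources", clean and transform
# them, and load the result into a warehouse file. All names are invented.
import pandas as pd

# Extraction: gather data from heterogeneous sources
crm_rows  = pd.DataFrame({"cust": ["Ann", "Bob"], "amount": ["100", "200"]})
shop_rows = pd.DataFrame({"cust": ["Cara", None], "amount": ["150", "90"]})
raw = pd.concat([crm_rows, shop_rows], ignore_index=True)

# Cleaning: drop records with missing customer names
clean = raw.dropna(subset=["cust"])

# Transformation: convert legacy string amounts into the warehouse's numeric format
clean = clean.assign(amount=clean["amount"].astype(float))

# Loading: write the consolidated data into the warehouse area
clean.to_csv("warehouse_sales.csv", index=False)
print(clean)
```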
Integrating Heterogeneous Databases
To integrate heterogeneous databases, we have two approaches −

● Query-driven Approach
● Update-driven Approach

Query-Driven Approach
This is the traditional approach to integrate heterogeneous databases. This approach was used to build wrappers and integrators on top of multiple
heterogeneous databases. These integrators are also known as mediators.

Process of Query-Driven Approach


● When a query is issued on the client side, a metadata dictionary translates the query into an appropriate form for the individual heterogeneous sites
involved.
● Now these queries are mapped and sent to the local query processor.
● The results from heterogeneous sites are integrated into a global answer set.
Disadvantages
● Query-driven approach needs complex integration and filtering processes.
● This approach is very inefficient.
● It is very expensive for frequent queries.
● This approach is also very expensive for queries that require aggregations.

Update-Driven Approach
This is an alternative to the traditional approach. Today's data warehouse systems follow the update-driven approach rather than the traditional
approach discussed earlier. In the update-driven approach, the information from multiple heterogeneous sources is integrated in advance and
stored in a warehouse. This information is available for direct querying and analysis.

Advantages

This approach has the following advantages −

● This approach provides high performance.


● The data is copied, processed, integrated, annotated, summarized and restructured in a semantic data store in advance.
● Query processing does not require an interface to process data at local sources.
Dimensional approach
In a dimensional approach, transaction data is partitioned into "facts", which are generally numeric transaction data, and "dimensions", which are the
reference information that gives context to the facts. For example, a sales transaction can be broken up into facts such as the number of products
ordered and the total price paid for the products, and into dimensions such as order date, customer name, product number, order ship-to and bill-to
locations, and salesperson responsible for receiving the order.
A key advantage of a dimensional approach is that the data warehouse is easier for the user to understand and to use. Also, the retrieval of data from
the data warehouse tends to operate very quickly.[16] Dimensional structures are easy to understand for business users, because the structure is
divided into measurements/facts and context/dimensions. Facts are related to the organization's business processes and operational system
whereas the dimensions surrounding them contain context about the measurement (Kimball, Ralph 2008). Another advantage offered by the dimensional
model is that it does not involve a relational database every time. Thus, this type of modeling technique is very useful for end-user queries in the data
warehouse.
The model of facts and dimensions can also be understood as a data cube.[18] Where the dimensions are the categorical coordinates in a multi-
dimensional cube, the fact is a value corresponding to the coordinates.
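A minimal sketch of the fact-and-dimension idea, assuming pandas is installed; the sales fact table and the product and customer dimensions are invented, and the final group-by plays the role of slicing the data cube.

```python
# Minimal star-schema sketch: a numeric fact table joined to small dimension
# tables that give the facts their context. All table contents are invented.
import pandas as pd

sales_fact = pd.DataFrame({        # facts: numeric transaction data
    "product_id": [1, 2, 1],
    "customer_id": [10, 10, 11],
    "quantity": [2, 1, 5],
    "total_price": [40.0, 15.0, 100.0],
})

product_dim = pd.DataFrame({       # dimension: context for the product_id key
    "product_id": [1, 2],
    "product_name": ["Notebook", "Pen"],
})

customer_dim = pd.DataFrame({      # dimension: context for the customer_id key
    "customer_id": [10, 11],
    "customer_name": ["Ann", "Bob"],
})

cube = (sales_fact
        .merge(product_dim, on="product_id")
        .merge(customer_dim, on="customer_id"))

# slice the "cube": revenue per product
print(cube.groupby("product_name")["total_price"].sum())
```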
The main disadvantages of the dimensional approach are the following:
1. To maintain the integrity of facts and dimensions, loading the data warehouse with data from different operational systems is
complicated.
2. It is difficult to modify the data warehouse structure if the organization adopting the dimensional approach changes the way in which
it does business.
Normalized approach
In the normalized approach, the data in the data warehouse are stored following, to a degree, database normalization rules. Tables are
grouped together by subject areas that reflect general data categories (e.g., data on customers, products, finance, etc.). The normalized
structure divides data into entities, which creates several tables in a relational database. When applied in large enterprises the result is
dozens of tables that are linked together by a web of joins. Furthermore, each of the created entities is converted into separate physical
tables when the database is implemented (Kimball, Ralph 2008). The main advantage of this approach is that it is straightforward to add
information into the database. Some disadvantages of this approach are that, because of the number of tables involved, it can be difficult for
users to join data from different sources into meaningful information and to access the information without a precise understanding of the
sources of data and of the data structure of the data warehouse.
Both normalized and dimensional models can be represented in entity-relationship diagrams as both contain joined relational tables. The
difference between the two models is the degree of normalization (also known as Normal Forms). These approaches are not mutually
exclusive, and there are other approaches. Dimensional approaches can involve normalizing data to a degree (Kimball, Ralph 2008).
In Information-Driven Business,[19] Robert Hillard proposes an approach to comparing the two approaches based on the information needs
of the business problem. The technique shows that normalized models hold far more information than their dimensional equivalents (even
when the same fields are used in both models) but this extra information comes at the cost of usability. The technique measures information
quantity in terms of information entropy and usability in terms of the Small Worlds data transformation measure.[20]
Dimensional versus normalized approach for storage of data
There are three or more leading approaches to storing data in a data warehouse – the most important approaches are the dimensional
approach and the normalized approach.
The dimensional approach refers to Ralph Kimball's approach in which it is stated that the data warehouse should be modeled using a
Dimensional Model/star schema. The normalized approach, also called the 3NF model (Third Normal Form), refers to Bill Inmon's approach
in which it is stated that the data warehouse should be modeled using an E-R model/normalized model.[17]
Data warehouse characteristics
Subject-oriented
Unlike operational systems, the data in the data warehouse revolves around
the subjects of the enterprise. Subject orientation is not database normalization.
Subject orientation can be really useful for decision-making. Gathering the
required data around a subject is what makes the warehouse subject-oriented.

Integrated
The data found within the data warehouse is integrated. Since it comes from
several operational systems, all inconsistencies must be removed. Consistency
applies to naming conventions, measurement of variables, encoding structures,
physical attributes of data, and so forth.
Time-variant
While operational systems reflect current values as they support day-to-day
operations, data warehouse data represents a long time horizon (up to 10 years)
which means it stores mostly historical data. It is mainly meant for data mining and
forecasting. (E.g. if a user is searching for a buying pattern of a specific customer,
the user needs to look at data on the current and past purchases.)[23]
Nonvolatile
The data in the data warehouse is read-only, which means it cannot be updated,
created, or deleted (unless there is a regulatory or statutory obligation to do so).
History
The concept of data warehousing dates back to the late 1980s[10] when IBM researchers Barry Devlin and Paul Murphy developed the
"business data warehouse". In essence, the data warehousing concept was intended to provide an architectural model for the flow of data
from operational systems to decision support environments. The concept attempted to address the various problems associated with this
flow, mainly the high costs associated with it. In the absence of a data warehousing architecture, an enormous amount of redundancy was
required to support multiple decision support environments. In larger corporations, it was typical for multiple decision support environments to
operate independently. Though each environment served different users, they often required much of the same stored data. The process of
gathering, cleaning and integrating data from various sources, usually from long-term existing operational systems (usually referred to as
legacy systems), was typically in part replicated for each environment. Moreover, the operational systems were frequently reexamined as
new decision support requirements emerged. Often new requirements necessitated gathering, cleaning and integrating new data from "data
marts" that was tailored for ready access by users.
Additionally, with the publication of The IRM Imperative (Wiley & Sons, 1991) by James M. Kerr, the idea of managing and putting a dollar
value on an organization's data resources and then reporting that value as an asset on a balance sheet became popular. In the book, Kerr
described a way to populate subject-area databases from data derived from transaction-driven systems to create a storage area where
summary data could be further leveraged to inform executive decision-making. This concept served to promote further thinking of how a data
warehouse could be developed and managed in a practical way within any enterprise.
Key developments in early years of data warehousing:
● 1960s – General Mills and Dartmouth College, in a joint research project, develop the terms dimensions and facts.[11]
● 1970s – ACNielsen and IRI provide dimensional data marts for retail sales. [11]
● 1970s – Bill Inmon begins to define and discuss the term Data Warehouse.[citation needed][12]
● 1975 – Sperry Univac introduces MAPPER (MAintain, Prepare, and Produce Executive Reports), a database management and
reporting system that includes the world's first 4GL. It is the first platform designed for building Information Centers (a forerunner
of contemporary data warehouse technology).
● 1983 – Teradata introduces the DBC/1012 database computer specifically designed for decision support. [13]
● 1984 – Metaphor Computer Systems, founded by David Liddle and Don Massaro, releases a hardware/software package and
GUI for business users to create a database management and analytic system.
● 1988 – Barry Devlin and Paul Murphy publish the article "An architecture for a business and information system" where they
introduce the term "business data warehouse".[14]
● 1990 – Red Brick Systems, founded by Ralph Kimball, introduces Red Brick Warehouse, a database management system
specifically for data warehousing.
● 1991 - James M. Kerr authors The IRM Imperative, which suggests data resources could be reported as an asset on a balance
sheet, furthering commercial interest in the establishment of data warehouses.
● 1991 – Prism Solutions, founded by Bill Inmon, introduces Prism Warehouse Manager, software for developing a data
warehouse.
● 1992 – Bill Inmon publishes the book Building the Data Warehouse.[15]
● 1995 – The Data Warehousing Institute, a for-profit organization that promotes data warehousing, is founded.
● 1996 – Ralph Kimball publishes the book The Data Warehouse Toolkit.[16]
● 2000 – Dan Linstedt releases in the public domain the Data vault modeling, conceived in 1990 as an alternative to Inmon and
Kimball to provide long-term historical storage of data coming in from multiple operational systems, with emphasis on tracing,
auditing and resilience to change of the source data model.
● 2008 – Bill Inmon, along with Derek Strauss and Genia Neushloss, publishes "DW 2.0: The Architecture for the Next Generation
of Data Warehousing", explaining his top-down approach to data warehousing and coining the term, data-warehousing 2.0.
● 2012 – Bill Inmon develops and makes public technology known as "textual disambiguation". Textual disambiguation applies
context to raw text and reformats the raw text and context into a standard data base format. Once raw text is passed through
textual disambiguation, it can easily and efficiently be accessed and analyzed by standard business intelligence technology.
Textual disambiguation is accomplished through the execution of textual ETL. Textual disambiguation is useful wherever raw text
is found, such as in documents, Hadoop, email, and so forth.
Benefits
A data warehouse maintains a copy of information from the source transaction
systems. This architectural complexity provides the opportunity to:
● Integrate data from multiple sources into a single database and data model.
Greater congregation of data into a single database means a single query engine can be
used to present data in an ODS.
● Mitigate the problem of database isolation level lock contention in transaction
processing systems caused by attempts to run large, long-running analysis
queries in transaction processing databases.
● Maintain data history, even if the source transaction systems do not.
● Integrate data from multiple source systems, enabling a central view across the
enterprise. This benefit is always valuable, but particularly so when the
organization has grown by merger.
● Improve data quality, by providing consistent codes and descriptions, flagging
or even fixing bad data.
● Present the organization's information consistently.
● Provide a single common data model for all data of interest regardless of
the data's source.
● Restructure the data so that it makes sense to the business users.
● Restructure the data so that it delivers excellent query performance, even
for complex analytic queries, without impacting the operational systems.
● Add value to operational business applications, notably customer
relationship management (CRM) systems.
● Make decision–support queries easier to write.
● Organize and disambiguate repetitive data.
Benefits of Data Warehouse

1. Understand business trends and make better forecasting decisions.


2. Data Warehouses are designed to perform well with enormous amounts of data.
3. The structure of data warehouses is more accessible for end-users to navigate, understand, and query.
4. Queries that would be complex in many normalized databases could be easier to build and maintain in data
warehouses.
5. Data warehousing is an efficient method to manage demand for lots of information from lots of users.
6. Data warehousing provides the capability to analyze large amounts of historical data.
Data Mart vs. Data Warehouse

A data mart is a subset of a data warehouse oriented to a specific business line.


Data marts contain repositories of summarized data collected for analysis on a
specific section or unit within an organization, for example, the sales department.
A data warehouse is a large centralized repository of data that contains
information from many sources within an organization. The collated data is used to
guide business decisions through analysis, reporting, and data mining tools.
A Data Mart is a subset of an organizational information store, generally oriented to a specific purpose or primary data subject,
which may be distributed to support business needs. Data Marts are analytical record stores designed to focus on
particular business functions for a specific community within an organization. Data marts are derived from subsets of data
in a data warehouse, though in the bottom-up data warehouse design methodology, the data warehouse is created from the
union of organizational data marts.
Reasons for creating a data mart

● Creates collective data by a group of users


● Easy access to frequently needed data
● Ease of creation
● Improves end-user response time
● Lower cost than implementing a complete data warehouses
● Potential clients are more clearly defined than in a comprehensive data warehouse
● It contains only essential business data and is less cluttered.
Types of Data Marts
There are mainly two approaches to designing data marts. These approaches are

● Dependent Data Marts


● Independent Data Marts

Dependent Data Marts


A dependent data mart is a logical or physical subset of a larger data warehouse. According to this technique,
the data marts are treated as subsets of a data warehouse. In this technique, first a data warehouse is created, from
which further data marts can be created. These data marts are dependent on the data warehouse and extract the
essential records from it. In this technique, as the data warehouse creates the data mart, there is no need for data
mart integration. It is also known as the top-down approach.
Independent Data Marts
The second approach is Independent Data Marts (IDM). Here, first the independent data marts are created, and then a data
warehouse is designed using these multiple independent data marts. In this approach, as all the data marts are designed
independently, the integration of data marts is required. It is also termed the bottom-up approach, as the data
marts are integrated to develop a data warehouse.
Other than these two categories, one more type exists that is called "Hybrid Data Marts."

Hybrid Data Marts


It allows us to combine input from sources other than a data warehouse. This could be helpful in many situations,
especially when ad hoc integrations are needed, such as after a new group or product is added to the organization.
Steps in Implementing a Data Mart
Designing

The design step is the first in the data mart process. This phase covers all of the functions
from initiating the request for a data mart through gathering data about the requirements
and developing the logical and physical design of the data mart.
It involves the following tasks:

1. Gathering the business and technical requirements


2. Identifying data sources
3. Selecting the appropriate subset of data
4. Designing the logical and physical architecture of the data mart.
Constructing

This step contains creating the physical database and logical structures associated with the data
mart to provide fast and efficient access to the data.

It involves the following tasks:

1. Creating the physical database and logical structures such as tablespaces associated with
the data mart.
2. Creating the schema objects such as tables and indexes described in the design step.
3. Determining how best to set up the tables and access structures.
Populating
This step includes all of the tasks related to getting data from the source, cleaning it up, modifying it to the right format
and level of detail, and moving it into the data mart.

It involves the following tasks:

1. Mapping data sources to target data sources


2. Extracting data
3. Cleansing and transforming the information.
4. Loading data into the data mart
5. Creating and storing metadata
Accessing
This step involves putting the data to use: querying the data, analyzing it, creating reports, charts
and graphs and publishing them.
It involves the following tasks:

1. Set up an intermediate layer (meta layer) for the front-end tool to use. This layer translates
database operations and object names into business terms so that the end clients can
interact with the data mart using words that relate to the business functions.
2. Set up and manage database structures, like summarized tables, which help queries submitted
through the front-end tools execute rapidly and efficiently.
Managing
This step contains managing the data mart over its lifetime. In this step, management functions are performed as:

1. Providing secure access to the data.


2. Managing the growth of the data.
3. Optimizing the system for better performance.
4. Ensuring the availability of data even with system failures.
Data Mart and Data Warehouse Comparison

Data Mart

● Focus: A single subject or functional organization area


● Data Sources: Relatively few sources linked to one line of business
● Size: Less than 100 GB
● Normalization: No preference between a normalized and denormalized structure
● Decision Types: Tactical decisions pertaining to particular business lines and ways of doing things
● Cost: Typically from $10,000 upwards
● Setup Time: 3-6 months
● Data Held: Typically summarized data
Data Warehouse

● Focus: Enterprise-wide repository of disparate data sources


● Data Sources: Many external and internal sources from different areas of an organization
● Size: 100 GB minimum but often in the range of terabytes for large organizations
● Normalization: Modern warehouses are mostly denormalized for quicker data querying and read performance
● Decision Types: Strategic decisions that affect the entire enterprise
● Cost: Varies but often greater than $100,000; for cloud solutions costs can be dramatically lower as organizations
pay per use
● Setup Time: At least a year for on-premise warehouses; cloud data warehouses are much quicker to set up
● Data Held: Raw data, metadata, and summary data
Inmon vs. Kimball

Two data warehouse pioneers, Bill Inmon and Ralph Kimball differ in their views on how data warehouses should be
designed from the organization's perspective.

Bill Inmon's approach favours a top-down design in which the data warehouse is the centralized data repository and the
most important component of an organization's data systems.

The Inmon approach first builds the centralized corporate data model, and the data warehouse is seen as the physical
representation of this model. Dimensional data marts related to specific business lines can be created from the data
warehouse when they are needed.
In the Inmon model, data in the data warehouse is integrated, meaning the data warehouse is the source of the data that
ends up in the different data marts. This ensures data integrity and consistency across the organization.

Ralph Kimball's data warehouse design starts with the most important business processes. In this approach, an
organization creates data marts that aggregate relevant data around subject-specific areas. The data warehouse is the
combination of the organization’s individual data marts.

With the Kimball approach, the data warehouse is the conglomerate of a number of data marts. This is in contrast to
Inmon's approach, which creates data marts based on information in the warehouse. As Kimball said in 1997, "the data
warehouse is nothing more than the union of all data marts."
