DM Material
Subject Title: DATA MINING AND WAREHOUSING    Semester: V
Subject Code: 21UCA10    Specialization: NA
Type: Core: Theory    L:T:P:C: 71:5:0:4
Unit I (Level K1, 12 sessions) – Introduction: data mining applications – data mining techniques – data mining case studies – the future of data mining – data mining software. Association rules mining: basics – task and a naïve algorithm – Apriori algorithm – improving the efficiency of the Apriori algorithm – mining frequent patterns without candidate generation (FP-growth) – performance evaluation of algorithms.
Unit II (Level K2, 12 sessions) – Classification: introduction – decision tree – overfitting and pruning – DT rules – Naïve Bayes method – estimating predictive accuracy of classification methods – other evaluation criteria for classification methods – classification software.
Unit III (Level K3, 13 sessions) – Cluster analysis: cluster analysis – types of data – computing distances – types of cluster analysis methods – partitional methods – hierarchical methods – density-based methods – dealing with large databases – quality and validity of cluster analysis methods – cluster analysis software.
Learning Resources
Text Books: G.K. Gupta, "Introduction to Data Mining with Case Studies", 2nd Edition, PHI.
DATA MINING
Data mining is one of the most useful techniques that help entrepreneurs, researchers,
and individuals to extract valuable information from huge sets of data. Data mining is
also called Knowledge Discovery in Database (KDD).
The knowledge discovery process includes Data cleaning, Data integration, Data
selection, Data transformation, Data mining, Pattern evaluation, and Knowledge
presentation.
Relational Database:
A relational database is a collection of tables, each having a set of attributes (columns) and storing the data as rows (tuples); it can be queried using SQL.
Data warehouses:
A Data Warehouse is the technology that collects the data from various sources within
the organization to provide meaningful business insights.
The huge amount of data comes from multiple places such as Marketing and Finance.
The extracted data is utilized for analytical purposes and helps in decision-making for a business organization.
The data warehouse is designed for the analysis of data rather than transaction
processing.
Data Repositories:
The Data Repository generally refers to a destination for data storage. However, many IT professionals use the term more specifically to refer to a particular kind of setup within an IT structure.
For example, a group of databases, where an organization has kept various kinds of
information.
Object-Relational Database:
An object-relational database extends the relational model with object-oriented features such as user-defined types, objects and inheritance.
Transactional Database:
A transactional database stores records of transactions, where each transaction has an identifier and a list of the items or operations involved in it, for example a customer purchase in a retail store.
Disadvantages of Data Mining:
o There is a probability that organizations may sell useful data of their customers to other organizations for money. As per reports, American Express has sold the credit card purchases of its customers to other organizations.
o Many data mining analytics software packages are difficult to operate and need advanced training to work with.
o Different data mining instruments operate in distinct ways due to the different algorithms used in their design. Therefore, the selection of the right data mining tool is a very challenging task.
o The data mining techniques are not precise, so they may lead to severe consequences in certain conditions.
The following are areas where data mining is widely used:
Data Mining in Healthcare:
Data mining in healthcare has excellent potential to improve the health system. It uses
data and analytics for better insights and to identify best practices that will enhance
health care services and reduce costs.
Analysts use data mining approaches such as Machine learning, Multi-dimensional
database, Data visualization, Soft computing, and statistics.
Data mining can be used to forecast the number of patients in each category. The procedures ensure that the patients get intensive care at the right place and at the right time.
Data mining also enables healthcare insurers to recognize fraud and abuse.
Data Mining in Fraud Detection:
Billions of dollars are lost to fraud every year. Traditional methods of fraud detection are time-consuming and complex.
Data mining provides meaningful patterns and turns data into information. An ideal fraud detection system should protect the data of all the users.
Supervised methods consist of a collection of sample records, and these records are
classified as fraudulent or non-fraudulent.
A model is constructed using this data, and the technique is made to identify whether
the document is fraudulent or not.
Data Mining in Lie Detection:
Apprehending a criminal is not a big deal, but bringing out the truth from him is a very challenging task.
Law enforcement may use data mining techniques to investigate offenses, monitor
suspected terrorist communications, etc.
This technique includes text mining also, and it seeks meaningful patterns in data,
which is usually unstructured text.
The information collected from the previous investigations is compared, and a model
for lie detection is constructed.
Although data mining is very powerful, it faces many challenges during its execution.
Various challenges could be related to performance, data, methods, and techniques,
etc.
The process of data mining becomes effective when the challenges or problems are
correctly recognized and adequately resolved.
Incomplete and Noisy Data:
The process of extracting useful data from large volumes of data is data mining. Real-world data is heterogeneous, incomplete, and noisy. Data in huge quantities will usually be inaccurate or unreliable. These problems may occur due to errors of the data-measuring instruments or because of human errors. Suppose a retail chain collects the phone numbers of customers who spend more than $500, and the accounting employees enter the information into their system. A person may make a digit mistake when entering a phone number, which results in incorrect data. Some customers may not be willing to disclose their phone numbers, which results in incomplete data. The data could also get changed due to human or system errors. All these consequences (noisy and incomplete data) make data mining challenging.
Data Distribution:
Real-world data is usually stored on different platforms in distributed computing environments, such as separate databases or individual systems. Integrating all this data into a centralized repository for mining is difficult for both technical and organizational reasons.
Complex Data:
Real-world data is heterogeneous, and it could be multimedia data, including audio and
video, images, complex data, spatial data, time series, and so on. Managing these various
types of data and extracting useful information is a tough task. Most of the time, new
technologies, new tools, and methodologies would have to be refined to obtain specific
information.
Performance:
The data mining system's performance relies primarily on the efficiency of algorithms and
techniques used. If the designed algorithm and techniques are not up to the mark, then the
efficiency of the data mining process will be affected adversely.
Data Privacy and Security:
Data mining usually leads to serious issues in terms of data security, governance, and privacy. For example, if a retailer analyzes the details of purchased items, it reveals data about the buying habits and preferences of the customers without their permission.
Data Visualization:
In data mining, data visualization is a very important process because it is the primary
method that shows the output to the user in a presentable way. The extracted data should
convey the exact meaning of what it intends to express. But many times, representing the information to the end-user in a precise and easy way is difficult. Since both the input data and the output information can be complicated, very efficient and successful data visualization processes need to be implemented.
The process of extracting information to identify patterns, trends, and useful data that would
allow the business to take the data-driven decision from huge sets of data is called Data
Mining.
In other words, Data Mining is the process of investigating hidden patterns of information from various perspectives and categorizing it into useful data. This data is collected and assembled in particular areas such as data warehouses, and is then analyzed efficiently with data mining algorithms to help in decision making, cut costs, and generate revenue.
Data mining is the act of automatically searching for large stores of information to find trends
and patterns that go beyond simple analysis procedures. Data mining utilizes complex mathematical algorithms to segment the data and evaluate the probability of future events.
Data Mining is also called Knowledge Discovery of Data (KDD).
Data Mining is a process used by organizations to extract specific data from huge databases
to solve business problems. It primarily turns raw data into useful information.
Data Mining is similar to Data Science: it is carried out by a person, in a specific situation, on a particular data set, with an objective. This process includes various types of services such as text mining, web mining, audio and video mining, pictorial data mining, and social media mining. It is done through software that may be simple or highly specialized. By outsourcing data mining, all the work can be done faster with lower operating costs. Specialized firms can also use new technologies to collect data that is impossible to locate manually. There are tons of
information available on various platforms, but very little knowledge is accessible. The
biggest challenge is to analyze the data to extract important information that can be used to
solve a problem or for company development. There are many powerful instruments and
techniques available to mine data and find better insight from it.
Data mining includes the utilization of refined data analysis tools to find previously
unknown, valid patterns and relationships in huge data sets.
These tools can incorporate statistical models, machine learning techniques, and
mathematical algorithms, such as neural networks or decision trees.
Thus, data mining incorporates analysis and prediction.
Drawing on various methods and technologies from the intersection of machine learning, database management, and statistics, professionals in data mining have devoted their careers to better understanding how to process and draw conclusions from huge amounts of data. But what are the methods they use to make it happen?
In recent data mining projects, various major data mining techniques have been
developed and used, including association, classification, clustering, prediction,
sequential patterns, and regression.
1. Classification:
This technique is used to obtain important and relevant information about data and metadata.
This data mining technique helps to classify data in different classes.
i. Classification of Data mining frameworks as per the type of data sources mined:
This classification is as per the type of data handled, for example multimedia, spatial data, text data, time-series data, World Wide Web data, and so on.
ii. Classification of data mining frameworks as per the database involved:
This classification is based on the data model involved, for example object-oriented database, transactional database, relational database, and so on.
iii. Classification of data mining frameworks as per the kind of knowledge
discovered:
This classification depends on the types of knowledge discovered or the data mining functionalities, for example discrimination, classification, clustering, characterization, etc. Some frameworks tend to be comprehensive, offering several data mining functionalities together.
iv. Classification of data mining frameworks according to data mining techniques
used:
This classification is as per the data analysis approach utilized, such as neural
networks, machine learning, genetic algorithms, visualization, statistics, data
warehouse-oriented or database-oriented, etc.
The classification can also take into account the level of user interaction involved in
the data mining procedure, such as query-driven systems, autonomous systems, or
interactive exploratory systems.
2. Clustering:
Clustering is the division of data objects into groups (clusters) of similar objects: objects in the same cluster are more similar to each other than to objects in other clusters. Unlike classification, the groups are not predefined.
3. Regression:
Regression analysis is the data mining process used to identify and analyze the relationship between variables in the presence of other factors. It is used to estimate the value or probability of a specific variable.
Regression is primarily a form of planning and modeling. For example, we might use it to project certain costs, depending on other factors such as availability, consumer demand, and competition.
Primarily it gives the exact relationship between two or more variables in the given data set.
4. Association Rules:
This data mining technique helps to discover a link between two or more items. It
finds a hidden pattern in the data set.
Association rules are if-then statements that help to show the probability of interactions between data items within large data sets in different types of databases. Association rule mining has several applications and is commonly used to discover sales correlations in transaction data or in medical data sets.
The way the algorithm works is that you have various data, for example a list of grocery items that you have been buying for the last six months. It calculates the percentage of items being purchased together.
o Lift:
This measurement technique measures the accuracy of the confidence over how often item B is purchased.
Lift = (Confidence of A ⇒ B) / (Support of B)
o Support:
This measurement technique measures how often items A and B are purchased together, compared to the overall dataset.
Support = (Transactions containing both A and B) / (Entire dataset)
o Confidence:
This measurement technique measures how often item B is purchased when item A is purchased as well.
Confidence = (Transactions containing both A and B) / (Transactions containing A)
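The following short Python sketch illustrates the three measures; the transactions here are made up purely for the example.

```python
# Hypothetical transactions used only to illustrate support, confidence and lift.
transactions = [
    {"bread", "cheese", "juice"},
    {"bread", "milk"},
    {"cheese", "juice", "milk"},
    {"bread", "cheese"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(a, b, transactions):
    """How often B is bought when A is bought: support(A and B) / support(A)."""
    return support(a | b, transactions) / support(a, transactions)

def lift(a, b, transactions):
    """Confidence of A => B relative to how often B is bought overall."""
    return confidence(a, b, transactions) / support(b, transactions)

a, b = {"bread"}, {"cheese"}
print("support    =", support(a | b, transactions))   # (A + B) / entire dataset
print("confidence =", confidence(a, b, transactions)) # (A + B) / A
print("lift       =", lift(a, b, transactions))       # confidence / support(B)
```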
5. Outlier detection:
This type of data mining technique relates to the observation of data items in the data set which do not match an expected pattern or expected behavior.
This technique may be used in various domains like intrusion detection, fraud detection, etc. It is also known as Outlier Analysis or Outlier Mining. An outlier is a data point that diverges too much from the rest of the dataset.
The majority of the real-world datasets have an outlier. Outlier detection plays a
significant role in the data mining field.
Outlier detection is valuable in numerous fields like network interruption
identification, credit or debit card fraud detection, detecting outlying in wireless
sensor network data, etc.
6. Sequential Patterns:
The sequential pattern is a data mining technique specialized for evaluating sequential data to discover sequential patterns. It involves finding interesting subsequences in a set of sequences, where the value of a sequence can be measured in terms of different criteria like length, occurrence frequency, etc.
In other words, this technique of data mining helps to discover or recognize similar patterns
in transaction data over some time.
7. Prediction:
Prediction uses a combination of other data mining techniques such as trends, clustering,
classification, etc. It analyzes past events or instances in the right sequence to predict a future
event.
Data Mining is the set of techniques that utilize specific algorithms, statistical analysis,
artificial intelligence, and database systems to analyze data from different dimensions
and perspectives.
Data Mining tools have the objective of discovering patterns/trends/groupings among
large sets of data and transforming data into more refined information.
1. Orange:
Orange is a machine learning and data mining software suite. It supports visualization and is a component-based software package written in the Python computing language, developed at the Bioinformatics Laboratory of the Faculty of Computer and Information Science, Ljubljana University, Slovenia.
As it is component-based software, the components of Orange are called "widgets". These widgets range from preprocessing and data visualization to the assessment of algorithms and predictive modeling.
2. SAS Data Mining:
SAS stands for Statistical Analysis System. It is a product of the SAS Institute created
for analytics and data management. SAS can mine data, change it, manage
information from various sources, and analyze statistics. It offers a graphical UI for
non-technical users.
SAS data miner allows users to analyze big data and provide accurate insight for
timely decision-making purposes. SAS has distributed memory processing
architecture that is highly scalable. It is suitable for data mining, optimization, and
text mining purposes.
3. DataMelt (DMelt):
DMelt is a multi-platform utility written in Java. It can run on any operating system that is compatible with the JVM (Java Virtual Machine). It consists of science and mathematics libraries.
4. Rattle:
Rattle is a GUI-based data mining tool. It uses the R statistical programming language. Rattle exposes the statistical power of R by offering significant data mining features.
While Rattle has a comprehensive and well-developed user interface, it also has an integrated log code tab that produces duplicate (R) code for any GUI operation.
The data set produced by Rattle can be viewed and edited. Rattle gives the other facility to
review the code, use it for many purposes, and extend the code without any restriction.
5. Rapid Miner:
Rapid Miner is one of the most popular predictive analysis systems, created by the company of the same name (RapidMiner).
The tool can be used for a wide range of applications, including business applications, commercial applications, research, education, training, application development, and machine learning.
Data Mining Case Studies
1. Astronomy
Clustering techniques are used in this study. There are 20 million stars, which are classified by their spectra and their surface temperature into clusters such as:
C1 - Star type A
C2 - Star type B
C3 - Star type C
C4 - Star type D
2. Banking
It assesses future loan plans for suitable customers by using their profiles.
3. Climate
It is a study that reports on the atmospheric and oceanic parameters that cause drought in a particular state.
It also checks whether rainfall is likely in a particular area by using these measures.
4. Crime Prevention
5. Direct Marketing
In this case study, data is prepared that includes customers that responded to direct mail and those that did not respond to it.
It also characterizes the customers who are likely to respond to mails/phone calls.
6. Healthcare
Identifying the patients for whom a drug is most applicable can be done effectively using drug testing data.
Small clusters can be detected that describe which patients tend to go doctor shopping.
7. Manufacturing
Hewlett Packard Company has been reported to have used data mining to improve the
quality of their scanners.
9. Telecommunication
It deals with the cost of local calls, international rates, customer satisfaction, etc.
Customers who do not pay their bills on time are also identified.
ASSOCIATION RULES MINING
It is one of the DM techniques. It is used to find the items that are purchased together frequently.
It is also defined as the discovery of association rules that describe which two items are purchased together.
An association rule is of the form
X ⇒ Y {Support = α, Confidence = β}
Other measures:
Support(X) = P(X)
Support(X ⇒ Y) = P(X ∩ Y)
Confidence(X ⇒ Y) = P(Y | X)
Naïve Algorithm
Step 1: List out all possible combinations of itemsets from the transaction DB, level by level.
Step 2: At the first level we get 1-itemsets, followed by 2-itemsets, and so on.
Step 3: Remove the combinations that do not satisfy the specified support & confidence values.
Step 4: The remaining combinations are the frequent itemsets.
Step 5: Generate association rules from the frequent sets obtained in Step 4.
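A minimal Python sketch of this naïve approach, run on the four transactions of the example that follows; the minimum support count of 2 used here is an assumed threshold.

```python
from itertools import combinations

# The four example transactions used in the worked example below.
transactions = {
    100: {"Bread", "Cheese"},
    200: {"Bread", "Cheese", "Juice"},
    300: {"Bread", "Milk"},
    400: {"Cheese", "Juice", "Milk"},
}

MIN_SUPPORT_COUNT = 2  # assumed threshold: an itemset must appear in >= 2 transactions

items = sorted(set().union(*transactions.values()))

# Steps 1-2: list every possible combination of items, level by level.
for k in range(1, len(items) + 1):
    level = []
    for candidate in combinations(items, k):
        count = sum(1 for t in transactions.values() if set(candidate) <= t)
        # Steps 3-4: keep only the combinations meeting the support threshold.
        if count >= MIN_SUPPORT_COUNT:
            level.append((candidate, count))
    if level:
        print(f"{k}-itemsets:", level)
```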
Example
ID Items
100 Bread, Cheese
200 Bread, Cheese, Juice
300 Bread, Milk
400 Cheese, Juice, Milk
We have to find the frequent itemsets as well as discover the association rules using the naïve algorithm.
1st itemset
Itemset Frequency
Bread 3
Cheese 3
Juice 2
Milk 2
2nd itemset
Itemset Frequency
Bread, Cheese 2
Bread, Milk 1
Bread, Juice 1
Cheese, Juice 2
Cheese, Milk 1
Juice, Milk 1
3rd itemset
Itemset Frequency
Bread, Cheese, Juice 1
Cheese, Juice, Milk 1
Candidate set (itemsets meeting the minimum support):
Itemset Frequency
Bread 3
Cheese 3
Juice 2
Milk 2
Bread, Cheese 2
Cheese, Juice 2
The frequent 2-itemsets are {Bread, Cheese} and {Cheese, Juice}, giving the candidate rules:
• Bread-> Cheese
• Cheese-> Bread
• Cheese-> Juice
• Juice-> Cheese
The confidence of each rule is computed as follows:
• Bread -> Cheese
Confidence = P(Bread, Cheese) / P(Bread) = 2/3 = 67%
• Cheese -> Bread
Confidence = P(Cheese, Bread) / P(Cheese) = 2/3 = 67%
• Cheese -> Juice
Confidence = P(Cheese, Juice) / P(Cheese) = 2/3 = 67%
• Juice -> Cheese
Confidence = P(Juice, Cheese) / P(Juice) = 2/2 = 100%
The complete set of item combinations with non-zero frequency also includes the following:
Item combinations
{Bread, Juice}
{Cheese, Juice}
{Bread, Cheese, Juice}
{Cheese, Milk}
{Juice, Milk}
{Cheese, Juice, Milk}
Itemset Frequency
Bread 3
Cheese 3
Juice 2
Milk 2
Bread, Cheese 2
Bread, Juice 1
Bread, Milk 1
Cheese, Juice 2
Juice, Milk 1
Bread, Cheese, Juice 1
Cheese, Juice, Milk 1
Apriori Algorithm
Step 1: Scan all transactions and find all frequent 1-itemsets; let this set be L1.
Step 2: From Lk-1, generate the candidate k-itemsets Ck; the candidates that meet the minimum support value form Lk.
In computing C3 from L2, we organize L2 so that the items are sorted in their lexicographic order.
i. Pruning
Pruning is the process of removing candidate itemsets by checking whether all of their subsets are frequent; if any subset is not frequent, the candidate is removed.
ii. Candidate storage (hash tree)
To improve searching efficiency, the candidate sets are stored in a hash tree. Leaf nodes store the itemsets, and the internal nodes provide a road-map to reach the leaves.
iii. Transaction Storage
ID Items
10 A, B
20 C, D
30 C, B
The transactions can be stored as a bit-map with one row per transaction:
TID A B C D
10 1 1 0 0
20 0 0 1 1
30 0 1 1 0
or, in an inverted layout, with one row per item:
Item TID: 10 20 30
A 1 0 0
B 1 0 1
C 0 1 1
D 0 1 0
Apriori Example
ID Items
200 Bread,Cheese,Juice
400 Bread,Juice,Milk
500 Cheese,Juice,Milk
C1 (candidate 1-itemsets):
Itemset Frequency
Bread 4
Cheese 3
Juice 4
Milk 3
Eggs 1
Chocolate 1
L1 (frequent 1-itemsets):
Itemset Frequency
Bread 4
Cheese 3
Juice 4
Milk 3
C2 (candidate 2-itemsets):
Itemset Frequency
Bread,Cheese 2
Bread, Milk 2
Bread, Juice 3
Cheese, Juice 3
Cheese, Milk 1
Juice,Milk 2
L2 (frequent 2-itemsets) obtained from C2:
Itemset Frequency
Bread, Juice 3
Cheese, Juice 3
The association rules are generated from the frequent 2-itemsets:
1. Bread, Juice
2. Cheese, Juice
Since all the resulting rules have at least 75% confidence, all are valid association rules.
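A compact Python sketch of the Apriori level-wise search for this example. The two transactions not shown in the table above (IDs 100 and 300) are assumed reconstructions, consistent with the item frequency counts, containing {Bread, Cheese, Juice, Eggs} and {Bread, Milk, Chocolate}; with a minimum support of 50% (at least 3 of the 5 transactions) and minimum confidence of 75%, the sketch reproduces the frequent itemsets and rules above.

```python
from itertools import combinations

# Transactions of the worked example; rows 100 and 300 are assumed
# reconstructions consistent with the item frequency table.
transactions = {
    100: {"Bread", "Cheese", "Juice", "Eggs"},
    200: {"Bread", "Cheese", "Juice"},
    300: {"Bread", "Milk", "Chocolate"},
    400: {"Bread", "Juice", "Milk"},
    500: {"Cheese", "Juice", "Milk"},
}
MIN_SUPPORT = 3       # 50% of 5 transactions, rounded up
MIN_CONFIDENCE = 0.75

def count(itemset):
    return sum(1 for t in transactions.values() if itemset <= t)

# L1: frequent 1-itemsets.
L = [{frozenset([i]) for i in set().union(*transactions.values())
      if count(frozenset([i])) >= MIN_SUPPORT}]
frequent = set(L[0])

# Level-wise search: build Ck by joining L(k-1), prune, then test support.
k = 1
while L[-1]:
    prev = L[-1]
    candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
    # Pruning: every k-subset of a candidate must already be frequent.
    candidates = {c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, k))}
    Lk = {c for c in candidates if count(c) >= MIN_SUPPORT}
    frequent |= Lk
    L.append(Lk)
    k += 1

# Generate association rules X -> Y from frequent itemsets of size >= 2.
for itemset in (f for f in frequent if len(f) > 1):
    for r in range(1, len(itemset)):
        for lhs in map(frozenset, combinations(itemset, r)):
            conf = count(itemset) / count(lhs)
            if conf >= MIN_CONFIDENCE:
                print(set(lhs), "->", set(itemset - lhs), f"confidence={conf:.0%}")
```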
Improving the efficiency of Apriori Algorithm
The FP-tree approach does not generate candidate sets; it only tests for frequent patterns.
Only the frequent items are needed to find association rules, so the other items can be ignored.
If the frequent items can be stored in a compact structure, then the original transaction database need not be scanned repeatedly.
If multiple transactions share a set of frequent items, they can be merged into a shared path of the tree.
Generating FP Trees
1. Scan the transaction DB and find the frequency of each item.
2. Remove the items that do not meet the minimum support (non-frequent items).
3. Sort the remaining frequent items in descending order of their frequency.
4. Take the transaction DB again; remove all non-frequent items from each transaction and re-arrange the remaining items according to this frequency order.
5. Use this transformed transaction DB to construct the FP-tree, inserting the transactions one by one from the first to the last.
Example (support = 50%, confidence = 75%):
ID Items
200 Bread,Cheese,Juice
400 Bread,Juice,Milk
500 Cheese,Juice,Milk
Item Frequency
Bread 4
Juice 4
Cheese 3
Milk 3
We remove the items that are not frequent and order the remaining items according to their frequency; for example, transaction 400 becomes:
ID Items
400 Bread, Juice, Milk
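The sketch below builds an FP-tree for this example in Python, assuming the same five transactions as in the Apriori example (the two rows not listed here are assumed reconstructions consistent with the item frequencies). Infrequent items are dropped and the remaining items are inserted in descending frequency order (Bread, Juice, Cheese, Milk), so transactions that share frequent prefixes share branches of the tree.

```python
from collections import defaultdict

# Five transactions as in the Apriori example; two of them are assumed
# reconstructions consistent with the item frequency counts.
transactions = [
    {"Bread", "Cheese", "Juice", "Eggs"},
    {"Bread", "Cheese", "Juice"},
    {"Bread", "Milk", "Chocolate"},
    {"Bread", "Juice", "Milk"},
    {"Cheese", "Juice", "Milk"},
]
MIN_SUPPORT = 3

class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

# Pass 1: count item frequencies and keep only the frequent items.
freq = defaultdict(int)
for t in transactions:
    for item in t:
        freq[item] += 1
frequent = {i: c for i, c in freq.items() if c >= MIN_SUPPORT}
order = sorted(frequent, key=lambda i: (-frequent[i], i))  # Bread, Juice, Cheese, Milk

# Pass 2: insert each transaction (filtered and re-ordered) into the tree.
root = FPNode(None)
for t in transactions:
    node = root
    for item in [i for i in order if i in t]:
        child = node.children.setdefault(item, FPNode(item, node))
        child.count += 1
        node = child

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

show(root)  # shared prefixes such as Bread -> Juice appear on a single branch
```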
Advantages of FP-tree
The FP-tree is compact (transactions that share frequent items share paths in the tree), it requires only two scans of the transaction database, and it avoids the expensive candidate generation step of Apriori.
CLASSIFICATION
It is the process of ordering and dividing objects into predefined groups or classes.
Eg: prediction of customer behaviour.
Properties:
If the classes are defined before examining the data, it is called a priori classification; if the classes are created by looking at the data, it is called a posteriori classification.
When the classification is based on a training set, it is called supervised learning.
A training set is obtained by picking a few objects randomly from the database; all the remaining objects are treated as test data.
Each object has a number of attributes; one attribute tells which class the object belongs to. This attribute is called the dependent/output attribute.
All the remaining attributes are called independent/input attributes.
Types of attributes:
1. Numerical
2. Categorical: may be ordered (eg: grade) or unordered (eg: gender)
Dependent attributes are categorical if the problem is classification; they are numerical for regression problems.
Decision tree
It is a flowchart-like structure in which each internal node denotes a test on an attribute and each branch represents an outcome of the test. All the leaves are classes.
Example: a small decision tree for classifying an animal (eg: a human) — the test "has feathers?" leads to class bird on the yes branch and class mammal on the no branch.
Merits of DT:
Demerits of DT:
Step (1): Let the training data be S. If some attributes are continuous, they must be discretized.
Step (2): Put all of S in a single tree node.
Step (3): If all objects in S belong to the same class, stop the process.
Step (4): Select the best splitting attribute A among all the independent attributes.
Step (5): Split the node according to the values of A.
Step (6): Stop the process if either of the following occurs:
1. All objects in a node belong to the same class.
2. There are no objects left for further splitting.
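A minimal sketch of Step 4 (selecting the best splitting attribute). A common criterion, used by decision-tree methods such as ID3/C4.5, is information gain; the tiny training set below is made up purely for illustration.

```python
import math
from collections import Counter

# Hypothetical training set: each record is (attribute values, class label).
data = [
    ({"gender": "male",   "married": "yes"}, "C"),
    ({"gender": "male",   "married": "no"},  "A"),
    ({"gender": "female", "married": "yes"}, "B"),
    ({"gender": "female", "married": "no"},  "B"),
    ({"gender": "male",   "married": "no"},  "A"),
    ({"gender": "female", "married": "yes"}, "B"),
]

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(data, attribute):
    """Entropy of the whole set minus the weighted entropy after splitting."""
    labels = [cls for _, cls in data]
    base = entropy(labels)
    remainder = 0.0
    for value in {rec[attribute] for rec, _ in data}:
        subset = [cls for rec, cls in data if rec[attribute] == value]
        remainder += len(subset) / len(data) * entropy(subset)
    return base - remainder

attributes = ["gender", "married"]
for a in attributes:
    print(a, "gain =", round(information_gain(data, a), 3))
best = max(attributes, key=lambda a: information_gain(data, a))
print("best splitting attribute:", best)
```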
“Pruning is the process of removing unwanted branches when the overfitting problem occurs.”
After pruning, the accuracy is then computed.
A pruned tree with enough accuracy is selected.
Example: a decision tree with the attribute "gender" at the root; one branch leads directly to a class (Class B), while the other branch applies a further test "Married?", whose yes/no outcomes lead to Class C and Class A.
Naïve Bayes method:
Example: suppose a training set has 10 objects, of which 3 belong to class A, 3 to class B and 4 to class C. The prior probabilities of the classes are:
P(A) = 3/10 = 0.3
P(B) = 3/10 = 0.3
P(C) = 4/10 = 0.4
Posterior probability measures:
The posterior probability of each class, given the attribute values of an object, is obtained by combining these priors with the conditional probabilities of the attribute values within each class; the object is assigned to the class with the highest posterior probability.
Estimating predictive accuracy: if a classification method is applied to T test objects and C of them are classified correctly, then
Error = (T - C) / T
Confusion matrix: consider the matrix below, where the rows are the predicted classes and the columns are the true classes.
Predicted \ True   1   2   3
1                  8   1   1
2                  2   9   2
3                  0   0   7
Using this, we can define the following measures for estimating the sensitivity & specificity of the result.
(1) Sensitivity = TP / (TP + FN)
(2) Specificity = TN / (TN + FP)
where
FP = false positives: objects that did not belong to a class but were allocated to it.
FN = false negatives: objects that did belong to a class but were not allocated to it.
TP = true positives: objects of a class that were correctly classified.
TN = true negatives: measured as TN = (remaining objects) - FP, i.e. the objects outside the class that were not allocated to it.
Consider that there are 30 objects, equally partitioned among classes 1, 2 and 3 (10 objects per class). Let us now calculate the measures for each class.
CLASS 1:
FP = 2, FN = 2 (out of 10); TN = 18, TP = 8 (out of 20)
CLASS 2:
FP = 4, FN = 1 (out of 10); TN = 16, TP = 9 (out of 20)
CLASS 3:
FP = 0, FN = 3 (out of 10); TN = 20, TP = 7 (out of 20)
Sensitivity = (8 + 9 + 7) / [(8 + 9 + 7) + (2 + 1 + 3)] = 24/30 = 80%
Specificity = (18 + 16 + 20) / [(18 + 16 + 20) + (2 + 4 + 0)] = 54/60 = 90%
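The per-class counts and the overall sensitivity and specificity above can be reproduced from the confusion matrix with a short Python sketch (rows of the matrix are the predicted classes, columns the true classes, 10 true objects per class):

```python
# Confusion matrix from the example: rows = predicted class, columns = true class.
matrix = [
    [8, 1, 1],   # predicted class 1
    [2, 9, 2],   # predicted class 2
    [0, 0, 7],   # predicted class 3
]
n_classes = 3
total = sum(sum(row) for row in matrix)  # 30 objects in all

TP = FP = FN = TN = 0
for c in range(n_classes):
    tp = matrix[c][c]                                       # correctly classified
    fp = sum(matrix[c]) - tp                                # allocated to c but not in c
    fn = sum(matrix[r][c] for r in range(n_classes)) - tp   # in c but not allocated to it
    tn = total - tp - fp - fn
    TP, FP, FN, TN = TP + tp, FP + fp, FN + fn, TN + tn
    print(f"class {c + 1}: TP={tp} FP={fp} FN={fn} TN={tn}")

print("sensitivity =", TP / (TP + FN))   # 24/30 = 0.80
print("specificity =", TN / (TN + FP))   # 54/60 = 0.90
```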
Other methods:
(3) K-fold cross-validation:
The available data is divided into k roughly equal parts; each part is used once as the test set while the remaining k-1 parts are used for training, and the k accuracy estimates are averaged.
(4) Leave-one-out method:
This is k-fold cross-validation with k equal to the number of objects: one object is left out as the test set, the model is built on all the others, and this is repeated for every object.
(5) Bootstrap method:
It randomly selects n samples, by sampling with replacement n times from the data set, for building a model.
Once a model is built, it can be shown that only about 63.2% of the original objects are unique in such a sample and are used for building the classifier.
The remaining 36.8% of the data set is used to estimate the model's accuracy.
Speed:
It refers to how fast the method is in producing good results.
Robustness:
It refers to the error rate, that is, the ability of the method to produce good results even when the data contains errors or noise.
Scalability:
It deals with the extendability of the method: as the number of data objects grows, it should be possible to extend the method to handle them.
Interpretability:
It deals with the understandability of the method; the results should be easily interpretable by users at different levels.
TYPES OF DATA
Datasets come in a number of different forms. The data may be quantitative, binary,
nominal or ordinal.
Quantitative (or numerical) data is quite common, for example, weight, marks, height,
price, salary, and count. There are a number of methods for computing similarity between
quantitative data.
Binary data is also quite common, for example, gender, marital status. As we have noted
earlier, computing similarity or distance between categorical variables is not as simple as
for quantitative data but a number of methods have been proposed.
Qualitative nominal data is similar to binary data which may take more than two values
but has no natural order, for example religion, foods or colours.
Qualitative ordinal (or ranked) data is similar to nominal data except that the data has an order associated with it, for example grades A, B, C, D and sizes S, M, L and XL. The problem of measuring distance between ordinal variables is different than for nominal variables, since the order of the values is important. One method of computing distance involves transforming the values to numeric values according to their rank.
Other types of data are also possible. For example, data may include text strings or a
sequence of Web pages visited.
COMPUTING DISTANCE
Cluster analysis methods are based on measuring similarity between objects by computing the distance between each pair. Distance is a well-understood concept that has a number of simple properties.
1. Distance is always positive.
2. Distance from point x to itself is always zero.
3. Distance from point x to point y cannot be greater than the sum of the distance from x to some other point z and the distance from z to y.
4. Distance from x to y is always the same as from y to x.
Let the distance between two points x and y (both vectors) be D(x,y). We now define a number of
distance measures.
1. Euclidean distance
Euclidean distance, or the L2 norm of the difference vector, is most commonly used to compute distances and has an intuitive appeal, but the largest-valued attribute may dominate the distance.
D(x, y) = (Σi (xi - yi)²)^1/2
3. Chebychev distance
This distance metric is based on the maximum attribute difference. It is also called the L∞ norm of the difference vector.
D(x, y) = Maxi |xi - yi|
4. Categorical data distance
This distance measure may be used if many attributes have categorical values with only a small number of values (e.g. binary values). Let N be the total number of categorical attributes.
D(x, y) = (number of attributes for which xi and yi differ) / N
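These distance measures are straightforward to compute; a small Python sketch with made-up vectors is shown below (the Manhattan/L1 distance is also included, as it is presumably the missing measure 2 in the list above).

```python
import math

# Two hypothetical numeric objects and two hypothetical categorical objects.
x = [2.0, 5.0, 1.0]
y = [4.0, 1.0, 3.0]
cat_x = ["male", "married", "graduate", "urban"]
cat_y = ["male", "single",  "graduate", "rural"]

def euclidean(x, y):
    """L2 norm of the difference vector."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    """L1 norm: sum of the absolute attribute differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def chebyshev(x, y):
    """L-infinity norm: maximum attribute difference."""
    return max(abs(xi - yi) for xi, yi in zip(x, y))

def categorical(x, y):
    """Proportion of categorical attributes on which the objects differ."""
    return sum(xi != yi for xi, yi in zip(x, y)) / len(x)

print("Euclidean  :", euclidean(x, y))            # sqrt(4 + 16 + 4)
print("Manhattan  :", manhattan(x, y))            # 2 + 4 + 2 = 8
print("Chebyshev  :", chebyshev(x, y))            # 4
print("Categorical:", categorical(cat_x, cat_y))  # 2 of 4 attributes differ = 0.5
```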
HIERARCHICAL METHODS
The hierarchical methods attempt to capture the structure of the data by constructing a tree of clusters. Two types of hierarchical approaches are possible, as given below.
a) Agglomerative approach:
For merging groups (or bottom-up approach), each object at the start is a cluster by itself
and the nearby clusters are repeatedly merged resulting in larger and larger clusters until
some stopping criterion (often a given number of clusters) is met or
All the objects are merged into a single large cluster which is the highest level of the
hierarchy.
b) Divisive approach (or the top-down approach):
All the objects are put in a single cluster to start. The method then repeatedly performs splitting of clusters, resulting in smaller and smaller clusters, until a stopping criterion is reached or each cluster has only one object in it.
Distance Between Clusters
The hierarchical clustering methods require distances between clusters to be computed. These distance metrics are often called linkage metrics. Computing distances between large clusters can be expensive. We will discuss the following methods for computing distances between clusters:
1. Single-link algorithm
2. Complete-link algorithm
3. Centroid algorithm
4. Average-link algorithm
5. Ward's minimum-variance algorithm
1. Single-link
The single-link (or nearest neighbour) algorithm is the simplest algorithm for computing the distance between two clusters. The algorithm determines the distance between two clusters as the minimum of the distances between all pairs of points (a, x), where a is from the first cluster and x is from the second.
Therefore, if there are m elements in one cluster and n in the other, all mn pairwise distances must be computed and the smallest distance (or the shortest link) found.
2.Complete-link
The complete-link algorithm is also called the farthest neighbours algorithm. In this algorithm, the distance between two clusters is defined as the maximum of the pairwise distances (a, x).
Therefore, if there are m elements in one cluster and n in the other, all mn pairwise distances must be computed and the largest chosen.
Both single-link and complete-link measures have their difficulties.
In the single-link algorithm, each cluster may have an outlier and the two outliers may be
nearby and so the distance between the two clusters would be computed to be small.
Single-link can form a chain of objects as clusters are combined since there is no constraint
on the distance between objects
On the other hand, the two outliers may be very far away although the clusters are nearby
and the complete-link algorithm will compute the distance as large.
3.Centroid
The centroid algorithm computes the distance between two clusters as the distance between
the average point of each of the two clusters.
Usually the squared Euclidean distance between the centroids is used.
4.Average-link
The average-link algorithm computes the distance between two clusters as the average of
all pairwise distances between an object from one cluster and another from the other
cluster.
Therefore if there are m elements in one cluster and n in the other, there are mn distances to
be computed, added and divided by mn.
This approach also generally works well. It tends to join clusters with small variances. Figure 4.5 shows two clusters A and B and the average-link distance between them.
Figure 4.5: The average-link distance between two clusters.
Agglomerative Method
The agglomerative clustering method tries to discover such structure given a dataset.
The basic idea of the agglomerative method is to start out with n clusters for n data points,
that is, each cluster consisting of a single data point.
Using a measure of distance, at each step of the method, the method merges two nearest
clusters, thus reducing the number of clusters and building successively larger clusters.
The process continues until the required number of clusters has been obtained or all the
data points are in one cluster.
The agglomerative method leads to hierarchical clusters in which at each step we build
larger and larger clusters that include increasingly dissimilar objects.
The agglomerative method is basically a bottom-up approach which involves the following
steps.
An implementation however may include some variation of these steps.
1. Allocate each point to a cluster of its own. Thus we start with n clusters for n objects.
2. Create a distance matrix by computing distances between all pairs of clusters either using,for
example, the single-link metric or the complete-link metric. Some other metric may also be
used. Sort these distances in ascending order.
3. Find the two clusters that have the smallest distance between them.
4. Remove the pair of clusters that have the smallest distance from the distance matrix and merge them.
5. If there is only one cluster left then stop.
6. Compute all distances from the new cluster and update the distance matrix after the merger
and go to Step 3.
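A bare-bones Python sketch of this agglomerative procedure using the single-link metric, on a handful of made-up 2-D points; at each step it merges the two clusters whose closest members are nearest, until the desired number of clusters remains.

```python
import math

# Hypothetical 2-D points to cluster.
points = [(1.0, 1.0), (1.5, 1.2), (5.0, 5.0), (5.5, 4.8), (9.0, 1.0)]
TARGET_CLUSTERS = 2   # assumed stopping criterion

def single_link(c1, c2):
    """Single-link distance: minimum distance over all pairs of points."""
    return min(math.dist(a, b) for a in c1 for b in c2)

# Step 1: each point starts in a cluster of its own.
clusters = [[p] for p in points]

# Steps 2-6: repeatedly find and merge the two closest clusters.
while len(clusters) > TARGET_CLUSTERS:
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            d = single_link(clusters[i], clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
    _, i, j = best
    clusters[i] = clusters[i] + clusters[j]   # merge the closest pair
    del clusters[j]

print(clusters)
```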
Divisive Hierarchical Method
The divisive method is the opposite of the agglomerative method.
This method starts with the whole dataset as one cluster, then proceeds to recursively divide the cluster into two sub-clusters, and continues until each cluster has only one object or some other stopping criterion has been reached. There are two types of divisive methods:
1. Monothetic: It splits a cluster using only one attribute at a time. An attribute that has the
most variation could be selected.
2. Polythetic: It splits a cluster using all of the attributes together. Two clusters far apart could
be built based on distance between objects.
A typical polythetic divisive method works like the following:
1. Decide on a method of measuring the distance between two objects.
2. Create a distance matrix by computing distances between all pairs of objects within the
cluster. Sort these distances in ascending order.
3. Find the two objects that have the largest distance between them. They are the most
dissimilar objects.
4. If the distance between the two objects is smaller than the pre-specified threshold and there
is no other cluster that needs to be divided then stop, otherwise continue.
5. Use the pair of objects as seeds of a K-means method to create two new clusters.
6. If there is only one object in each cluster then stop otherwise continue with Step 2.
DENSITY-BASED METHODS
For each data point in a cluster, at least a minimum number of points must exist within a given distance.
Data that is not within such high-density clusters is regarded as outliers or noise.
The idea of density-based clustering is that the clusters are dense regions of probability density in the data set.
DBSCAN (density based spatial clustering of applications with noise) is one example of a
density-based method for clustering.
It requires two input parameters: the size of the neighbourhood (R) and the minimum number of points in the neighbourhood (N). Essentially these two parameters determine the density within the clusters the user is willing to accept, since they specify how many points must be in a region.
The number of points not only determines the density of acceptable clusters but it also
determines which objects will be labelled outliers or noise.
Objects are declared to be outliers if there are few other objects in their neighbourhood. The
size parameter R determines the size of the clusters
found. If R is big enough, there would be one big cluster and no outliers. If R is small, there
will be small dense clusters and there might be many outliers.
Concepts of DBSCAN method
1. Neighbourhood: The neighbourhood of an object y is defined as all the objects that are
within the radius R from y.
2. Core object: An object y is called a core object if there are N objects within its neighbourhood.
3. Proximity: Two objects are defined to be in proximity to each other if they belong to the same cluster. Object x1 is in proximity to object x2 if two conditions are satisfied:
(a) The objects are close enough to each other, i.e. within a distance of R.
(b) x2 is a core object as defined above.
4. Connectivity: Two objects x1 and x2 are connected if there is a path or chain of objects x1, x2, ..., xn from x1 to xn such that each object in the chain is in the proximity of the next object.
Basic Algorithm for Density-based Clustering:
1. Select values of R and N.
2. Arbitrarily select an object p.
3. Retrieve all objects that are connected to p, given R and N.
4. If p is a core object, a cluster is formed.
5. If p is a border object, no objects are in its proximity. Choose another object. Go to Step 3.
6. Continue the process until all of the objects have been processed.
Essentially the above algorithm forms clusters by connecting neighbouring core objects and those non-core objects that lie on the boundaries of the clusters. The remaining non-core objects are labelled outliers.
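A minimal hand-rolled sketch of this density-based procedure in Python, using the radius R and minimum neighbourhood size N described above and made-up points (a library implementation such as scikit-learn's DBSCAN could equally be used; this version just makes the steps explicit).

```python
import math

# Hypothetical 2-D data: two dense groups plus one isolated point.
points = [(1.0, 1.0), (1.2, 1.1), (0.9, 1.3), (1.1, 0.8),
          (5.0, 5.0), (5.2, 5.1), (4.9, 5.3), (5.1, 4.8),
          (9.0, 9.0)]
R, N = 0.8, 3   # neighbourhood radius and minimum number of points (assumed values)

def neighbours(i):
    """Indices of all points within radius R of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= R]

labels = [None] * len(points)   # None = unvisited, -1 = noise, 0,1,... = cluster id
cluster_id = 0

for i in range(len(points)):
    if labels[i] is not None:
        continue
    nbrs = neighbours(i)
    if len(nbrs) < N:            # i is not a core object
        labels[i] = -1           # provisionally label it as noise
        continue
    labels[i] = cluster_id       # i is a core object: start a new cluster
    queue = [j for j in nbrs if j != i]
    while queue:                 # grow the cluster through connected objects
        j = queue.pop()
        if labels[j] == -1:      # previously noise: it becomes a border object
            labels[j] = cluster_id
        if labels[j] is not None:
            continue
        labels[j] = cluster_id
        jn = neighbours(j)
        if len(jn) >= N:         # j is also a core object: keep expanding
            queue.extend(jn)
    cluster_id += 1

for cid in range(cluster_id):
    print("cluster", cid, [points[k] for k, lab in enumerate(labels) if lab == cid])
print("noise/outliers:", [points[k] for k, lab in enumerate(labels) if lab == -1])
```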
DEALING WITH LARGE DATABASES
Most clustering methods implicitly assume that all the data is accessible in the main memory. Often the size of the database is not considered, but a method requiring multiple scans of data that is disk-resident could be quite inefficient for large problems.
K-Means Method for Large Databases
One modification of the K-means method has been proposed for data that is too large to fit in the main memory. The method first picks the number of clusters and their seed centroids and then attempts to classify each object into one of the following three groups:
(a) Those that are certain to belong to a cluster. These objects together are called the discard set.
Some information about these objects is computed and saved. This includes the number of
objects n, a vector sum of all attribute values of the n objects (a vector S) and a vector sum
of squares of all attribute values of the n objects (a vector Q).
(b) Those that are sufficiently far away from each cluster's centroid that they cannot yet be put in the discard set of objects. These objects together are called the compression set.
(c) The remaining objects that are too difficult to assign to either of the two groups above. These objects are called the retained set and are stored as individual objects.
Hierarchical Method for Large Databases—Concept of Fractionation
Dealing with large datasets is difficult using hierarchical methods since the methods require an N
x N distance matrix to be computed for N objects.
It is based on the idea of splitting the data into manageable subsets called "fractions" and then
applying a hierarchical method to each fraction. The concept is called fractionation.
Algorithm:
1. Split the large dataset into fractions of size M.
2. The hierarchical clustering technique being used is applied to each fraction. Let the number
of clusters so obtained from all the fractions be C.
3. For each of the C clusters, compute the mean of the attribute values of the objects in it. Let this mean vector be mi, i = 1, ..., C. These cluster means are called meta-observations.
4. If the number of meta-observations C is still too large (greater than M), go to Step 1; otherwise apply the same hierarchical clustering technique to the meta-observations obtained in Step 3.
5. Allocate each object of the original dataset to the cluster with the nearest mean obtained in
Step 4.
The cost of this algorithm is linear in terms of the size of the dataset. The accuracy of the results
is related to the accuracy of the clustering algorithm used.
QUALITY AND VALIDITY OF CLUSTER ANALYSIS METHODS
Evaluation is particularly difficult since all clustering methods will produce clusters even if
there are no meaningful clusters in the data.
In cluster analysis there is no test data available as there often is in classification. Also, even
if the process has been successful and clusters have been found, two different methods may
produce different clusters.
Different methods are biased towards different types of clusters; for example, a method based on minimizing the maximum distance between the points in a cluster favours compact clusters.
The results of K-means may in the first instance be evaluated by examining each attribute's
mean for each cluster in an attempt to assess how far each cluster is from the other. Another
approach is based on computing within cluster variation (I) and between clusters variation
(E). These variations may be computed as follows:
Let the number of clusters be k and let the clusters be Ci, i = 1, ..., k. Let the total number of objects be N and let the number of objects in cluster Ci be Mi, so that
M1 + M2 + ... + Mk = N
The within-cluster variation of the objects in a cluster Ci is defined as the average squared distance of each object from the centroid of the cluster. That is, if mi is the centroid of cluster Ci, the centroid (mean) is given by
mi = (1/Mi) Σ xj, where the sum is over all objects xj in Ci,
and the within-cluster variation is
Ii = (1/Mi) Σ ||xj - mi||², again summed over the objects of Ci.
The between-cluster variation E may now be computed given the centroids of the clusters.
It is the average of the squared pairwise distances between the centroids of the k clusters.
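One way these within-cluster and between-cluster measures might be computed, sketched in Python for a made-up 1-D clustering:

```python
from itertools import combinations

# Hypothetical clustering of 1-D objects into k = 2 clusters.
clusters = {
    "C1": [2.0, 3.0, 4.0],
    "C2": [10.0, 11.0, 12.0, 13.0],
}

def centroid(cluster):
    return sum(cluster) / len(cluster)

# Within-cluster variation: average squared distance of the objects from the centroid.
for name, cluster in clusters.items():
    m = centroid(cluster)
    within = sum((x - m) ** 2 for x in cluster) / len(cluster)
    print(name, "centroid =", m, "within-cluster variation =", within)

# Between-cluster measure: average squared pairwise distance between the centroids.
centroids = [centroid(c) for c in clusters.values()]
pairs = list(combinations(centroids, 2))
between = sum((a - b) ** 2 for a, b in pairs) / len(pairs)
print("between-cluster variation =", between)
```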
Evaluating the quality of clustering methods or results of a cluster analysis is a challenging task.
The quality of a method involves a number of criteria:
1. Efficiency of the method
2. Ability of the method to deal with noisy and missing data
3. Ability of the method to deal with large problems
4. Ability of the method to deal with a variety of attribute types and magnitudes
Once several clustering results have been obtained, they can be combined, hopefully
providing better results than those from just one clustering of the given data.
Experiments using these techniques suggest that they can lead to improved quality and
robustness of results.
Graph Terminology
Define Directed & undirected graph in terms of web. (2m)
Define Diameter of a graph. (2m)
Explain basic graph terms & the components of the web. (5m)
“Directed graph” is a set of nodes (that corresponds to pages on the web) denoted by V and
edges (which corresponds to links on the web) denoted by E. (i.e.,) a graph is (V,E) where all
edges are directed.
“Undirected graph” may also be represented by nodes and edges (V,E) but the edges have
no direction specified.
Directed graph is like a link that points from one page to another in web, while undirected
graph is not like the pages & links on the web unless we assume the possibility of traversal in
both directions.
A graph may be searched either by a breadth-first search or by a depth-first search.
Breadth-first is based on first searching all the nodes that can be reached from the node
where the search is starting and once these nodes have been searched, searching the nodes at
the next level that can be reached from those nodes and so on.
Depth-first search is based on searching any unvisited descendants of a given node first, then visiting the node, and then any sibling nodes.
“Diameter of a graph” – the maximum of the minimum distances between all possible ordered node pairs (u,v), i.e. the maximum number of links that one would need to follow starting from any page u to reach any page v, assuming that the best path has been followed.
An investigation on the structure of the web was carried out & it was found that the web is
not a ball of highly connected nodes.
It displayed structural properties & the web could be divided into the following 5
components:
1. The Strongly Connected Core (SCC):
This part of the web was found to consist of about 30% of the web, which is still very large given the more than 4 billion pages on the web in 2004.
This is the heart of the web & its main property is that pages in the core can reach
each other following directed edges (i.e., hyperlinks).
2. The IN Group:
This part of the web was found to consist of about 20% of the web.
Main property – pages in the group can reach the SCC but cannot be reached from it.
3. The OUT Group:
This part of the web was found to consist of about 20% of the web.
Main property – pages in this group can be reached from SCC but cannot reach the
SCC.
4. Tendrils:
This part of the web was found to consist of about 20% of the web.
Main property – pages cannot be reached by the SCC and cannot reach the SCC.
It does not imply that these pages have no linkages to pages outside the group since
they could well have linkages from the IN group & to the OUT group.
5. The Disconnected Group:
This part of the web was found to be less than 10% of the web & is essentially
disconnected from the rest of the web world.
Ex: personal pages at many sites that link to no other page and have no links to them.
Web Metrics
Write short notes on web metrics. (5m)
There are a number of properties (other than size & structure of the web) about the web that
are useful to measure.
Some of the measures are based on distances between the nodes of the web graph.
It is possible to define how well connected a node is by using the concept of centrality of a
node.
Centrality may be out-centrality, which is based on distances measured from the node using
its out-links while in-centrality is based on distances measured from other nodes that are
connected to the node using the in-links.
Based on these metrics, it is possible to define the concept of compactness that varies from 0
to 1, 0 for a completely disconnected web graph and 1 for a fully connected web graph.
Perhaps the most important measurements about web pages are about a page’s relevance and
its quality.
Relevance is usually related to a user query since it concerns the user finding pages that are
relevant to his query and may be defined in a number of ways.
A simple approach involves relevance to be measured by the number of query terms that
appear in the page.
Another approach is based on in-links from relevant pages. In this, relevance of a page may
be defined as the sum of the number of query terms in the pages that refer to the page.
Relevance may also use the concept of co-citation. In co-citation, if both pages ‘a’ & ‘b’
point to a page ‘c’ then it is assumed that ‘a’ and ‘b’ have a common interest. Similarly if ‘a’
points to both ‘b’ and ‘c’, then we assume that ‘b’ & ‘c’ also share a common interest.
Quality is not determined by page content, since there is no automatic means of evaluating the quality of content; instead it is determined by the link structure.
Example: if page ‘a’ points to page ‘b’ then it is assumed that page ‘a’ is endorsing page ‘b’
and we can have some confidence in the quality of page ‘b’.
Finger Printing
What do you mean by shingles? (2m)
Define full fingerprinting. (2m)
Explain with an example about fingerprinting documents. (5m / 10m)
An approach for comparing large number of documents is based on the idea of fingerprinting
documents.
A document may be divided into all possible substrings of length L. These substrings are
called shingles. Based on the shingles we can define resemblance R(X,Y) and containment
C(X,Y) between two documents X & Y as follows.
We assume S(X) & S(Y) to be a set of shingles for documents X and Y respectively.
R(X,Y) = |S(X) ∩ S(Y)| / |S(X) ∪ S(Y)|
C(X,Y) = |S(X) ∩ S(Y)| / |S(X)|
Following algorithm may be used to find similar documents:
1. Collect all the documents that one wishes to compare.
2. Choose a suitable shingle width and compute the shingles for each
document.
3. Compare the shingles for each pair of documents.
4. Identify those documents that are similar.
This algorithm is inefficient for a large collection of documents: the web is very large, and the algorithm requires enormous storage for the shingles and takes a long time. This approach is called full fingerprinting.
Example:
We check whether two documents with the following content are similar:
Document1: “the web continues to grow at a fast rate”
Document2: “the web is growing at a fast rate”
First step: making a set of shingles from the documents. We obtain the below shingles if we
assume the shingle length to be 3 words.
Shingles in Doc1: the web continues; web continues to; continues to grow; to grow at; grow at a; at a fast; a fast rate
Shingles in Doc2: the web is; web is growing; is growing at; growing at a; at a fast; a fast rate
Comparing the two sets of shingles, we find that only 2 of them are identical, so the documents are not very similar. The approach can also be illustrated using only a sample of shingles, for example the 3 shingles that are the shortest in number of letters (including spaces).
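The worked example can be reproduced with a few lines of Python; the sketch computes the 3-word shingles, the resemblance R(X,Y) and the containment C(X,Y) for the two documents above.

```python
doc1 = "the web continues to grow at a fast rate"
doc2 = "the web is growing at a fast rate"

def shingles(text, width=3):
    """All contiguous word sequences (shingles) of the given width."""
    words = text.split()
    return {" ".join(words[i:i + width]) for i in range(len(words) - width + 1)}

s1, s2 = shingles(doc1), shingles(doc2)
common = s1 & s2

resemblance = len(common) / len(s1 | s2)   # R(X,Y) = |S(X) n S(Y)| / |S(X) u S(Y)|
containment = len(common) / len(s1)        # C(X,Y) = |S(X) n S(Y)| / |S(X)|

print("shingles in doc1:", len(s1))            # 7
print("shingles in doc2:", len(s2))            # 6
print("identical shingles:", sorted(common))   # 'at a fast' and 'a fast rate'
print("resemblance R =", round(resemblance, 2))
print("containment C =", round(containment, 2))
```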
Example:
Assume a query has been specified. First step has been carried out & a set of pages that
contain most of the authorities has been found.
Let the set be given by the following 10 pages. Their out-links are listed below:
Page A (out-links to G, H, I, J)
Page B (out-links to A, G, H, I, J)
Page C (out-links to B, G, H, I, J)
Page D (out-links to B, G, H, J)
Page E (out-links to B, H, I, J)
Page F (out-links to B, G, I, J)
Page G (out-links to H, I)
Page H (out-links to G, I, J)
Page I (out-links to H)
Page J (out-links to F, G, H, I)
Connection matrix for this information is as follows:
Page A B C D E F G H I J
A 0 0 0 0 0 0 1 1 1 1
B 1 0 0 0 0 0 1 1 1 1
C 0 1 0 0 0 0 1 1 1 1
D 0 1 0 0 0 0 1 1 0 1
E 0 1 0 0 0 0 0 1 1 1
F 0 1 0 0 0 0 1 0 1 1
G 0 0 0 0 0 0 0 1 1 0
H 0 0 0 0 0 0 1 0 1 1
I 0 0 0 0 0 0 0 1 0 0
J 0 0 0 0 0 1 1 1 1 0
Every row in the matrix shows the out-links from the web page that is identified in the first column. Every column shows the in-links to the web page that is identified in the first row of the table.
Another representation is the following table of page weights x and y over two iterations:
Page  x0  y0  x1  y1  x2   y2
A     1   1   1   4   5    118
B     1   1   4   5   17   123
C     1   1   0   5   0    135
D     1   1   0   4   0    104
E     1   1   0   4   0    106
F     1   1   1   4   4    106
G     1   1   7   2   29   60
H     1   1   8   3   29   89
I     1   1   8   1   31   29
J     1   1   7   4   29   93
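These columns can be reproduced from the connection matrix M above by an unnormalized hub-and-authority style iteration: the x weight of a page is the sum of the y weights of the pages that point to it, and the y weight is the sum of the x weights of the pages it points to. The exact sequence that matches the table appears to be x1 = Mᵀy0, y1 = Mx0, then x2 = Mᵀy1, y2 = Mx2 (a full HITS implementation would also normalize the weights at each step). A numpy sketch:

```python
import numpy as np

# Connection matrix from the example: M[i, j] = 1 if page i links to page j.
pages = list("ABCDEFGHIJ")
M = np.array([
    [0,0,0,0,0,0,1,1,1,1],  # A -> G, H, I, J
    [1,0,0,0,0,0,1,1,1,1],  # B -> A, G, H, I, J
    [0,1,0,0,0,0,1,1,1,1],  # C -> B, G, H, I, J
    [0,1,0,0,0,0,1,1,0,1],  # D -> B, G, H, J
    [0,1,0,0,0,0,0,1,1,1],  # E -> B, H, I, J
    [0,1,0,0,0,0,1,0,1,1],  # F -> B, G, I, J
    [0,0,0,0,0,0,0,1,1,0],  # G -> H, I
    [0,0,0,0,0,0,1,0,1,1],  # H -> G, I, J
    [0,0,0,0,0,0,0,1,0,0],  # I -> H
    [0,0,0,0,0,1,1,1,1,0],  # J -> F, G, H, I
])

x0 = np.ones(10, dtype=int)   # initial x weights
y0 = np.ones(10, dtype=int)   # initial y weights

x1 = M.T @ y0   # x: sum of y weights of the pages linking in (here, the in-degree)
y1 = M @ x0     # y: sum of x weights of the pages linked to (here, the out-degree)
x2 = M.T @ y1   # second x update, using the y1 weights
y2 = M @ x2     # second y update, using the new x2 weights

for name, col in [("x1", x1), ("y1", y1), ("x2", x2), ("y2", y2)]:
    print(name, dict(zip(pages, col.tolist())))
```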
Web communities:
A web community is generated by a group of individuals that share a common interest. Ex:
religious group, sports.
Web communities may be discovered by exploring the web as a graph & identifying sub-
graphs that have a strong link structure within them but weak associations with other parts of
the graph. Such subgraphs may be called web-communities or cyber-communities.
The idea of cyber-communities is to find all web communities on any topic by processing the
whole web graph.
SEARCH ENGINES
Define search engines. (2m)
Write the goals of search engine.(2m)
A number of web sites provide a large amount of interesting information about search
engines.
Search engines are huge databases of web pages as well as software packages for indexing
and retrieving the pages that enable users to find information of interest to them.
Normally the search engine databases of web pages are built & updated automatically by web
crawlers.
Every search engine provides advertising (sometimes called sponsored links). There are a
variety of ways businesses advertise on the search engines.
Commonly, whenever a particular keyword is typed by the user, related sponsored links
appear with the results of the search.
Figure: components of a search engine — indexing of the web, query checking, query representation, strategy selection, query execution, user profiling and history maintenance.
The Indexer:
Given the size of the web & the number of documents that current search engines have in
their databases, an indexer is essential to reduce the cost of query evaluation.
Many researchers recommended an inverted file structure in which each document is
essentially indexed on every word other than the most common words. Google uses this
approach.
Indexing can be a very resource intensive process and is often CPU-bound since it involves a
lot of analysis of each page.
Building an index requires document analysis and term extraction. Term extraction may
involve extracting all the words from each page, elimination of stop words (common words
like “the”, “it”, “and”, “that”) and stemming (transforming words like “computer”,
“computing” and “computation” into one word, say “computer”) to help modify, for
example, the inverted file structure.
A good indexing approach is to create an index that will relate keywords to pages. The
overheads of maintaining such indexes can be high.
An inverted index, rather than listing keywords for each page, lists pages for each keyword.
This description is not quite accurate, because an inverted index normally still maintains the collection of pages, including their content, in some way.
As an example for inverted index consider the data structure shown below:
Words Page ID
data 10,18,26,41
mining 26,30,31,41
search 72,74,75,76
engine 72,74,79,82
google 72,74,90
This index specifies a number of keywords and the pages that contain the keywords. If we
were searching for “data mining” we look for pages for each keyword “data” and “mining”
and then find the intersection of the two lists and obtain page numbers 26 & 41.
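A toy Python version of this lookup, using the inverted index shown above: the posting lists for each query keyword are intersected to obtain the matching pages.

```python
# Inverted index from the example: each keyword maps to the pages containing it.
inverted_index = {
    "data":   {10, 18, 26, 41},
    "mining": {26, 30, 31, 41},
    "search": {72, 74, 75, 76},
    "engine": {72, 74, 79, 82},
    "google": {72, 74, 90},
}

def lookup(query):
    """Intersect the posting lists of all keywords in the query."""
    keywords = query.lower().split()
    result = inverted_index.get(keywords[0], set())
    for word in keywords[1:]:
        result &= inverted_index.get(word, set())
    return sorted(result)

print(lookup("data mining"))    # [26, 41]
print(lookup("search engine"))  # [72, 74]
```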
The challenging part is when the two lists are too large. The usual approach is to split the lists across many machines so that the intersection can be found quickly.
When the index is distributed over many machines, the distribution may be based on a local
inverted index or global inverted index.
Local inverted index results in each machine searching for candidate documents for each
query while the global inverted index is so distributed that each machine is responsible for
only some of the index terms.
Query server
First of all, a search engine needs to receive the query and check the spelling of keywords
that the user has typed.
If the search engine cannot recognize the keywords as words in the language or proper nouns
it is desirable to suggest alternative spellings to the user.
Once the keywords are found to be acceptable, the query may need to be transformed.
Stemming is then carried out and stop words are removed unless they form part of a phrase
that the user wants to search for.
Query composition
Query composition is a major problem. Often a user has to submit a query a number of times
using somewhat different keywords before more or less the “right” result is obtained.
A search engine providing query refinement based on user feedback would be useful. Search
engines often cache the results of a query and can then use the cached results if the refined
query is a modification of a query that has already been processed.
Query processing
Search engine query processing is different from query processing in relational database
systems. In database systems, query processing requires that attribute values match exactly
the values provided in the query.
In search engine query processing, an exact match is not always necessary. A partial match
or a fuzzy match may be sufficient.
Some search engines provide an indication of how well the document matches the user
query.
A major search engine will run a collection of machines, several of them replicating the index. A query may then be broadcast to all the machines that are not heavily loaded.
If the query consists of a number of keywords, then the indexes will need to be searched a
number of times (intersection of results to be found).
A search engine may put weights on different words in a query. Ex: if a query has the words “going”, “to”, “paris”, then the search engine will put greater weight on “paris”.
Results presentation
All search engines present results ordered according to relevance. Therefore, before
presentation of results to user, the results must be ranked.
Relevance ranking is based on page popularity in terms of number of back links to the page.
Another issue is how much of what has been found should be presented to the user. Most search engines present the retrieved items as a list on the screen with some information for each item, e.g. the type of document, its URL and a brief extract from the document.
♠ Page rank equations may be written as: PR(A) = 0.2 + 0.4PR(B) + 0.8PR(C)
PR(B) = 0.2 + 0.8PR(A)
PR(C) = 0.2 + 0.4PR(B)
♠ These are three linear equations in three unknowns, so we may solve them. We write them as follows if we replace PR(A) by a and the others similarly.
a – 0.4b – 0.8c = 0.2
b – 0.8a = 0.2
c – 0.4b = 0.2
♠ Solution of the above equations is: a = PR(A) = 1.19 ; b = PR(B) = 1.15; c = PR(C) =
0.66
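The three equations can be solved directly, for example with numpy; a short sketch that reproduces the values above:

```python
import numpy as np

#  a - 0.4b - 0.8c = 0.2
# -0.8a +    b      = 0.2
#       - 0.4b + c  = 0.2
A = np.array([
    [ 1.0, -0.4, -0.8],
    [-0.8,  1.0,  0.0],
    [ 0.0, -0.4,  1.0],
])
b = np.array([0.2, 0.2, 0.2])

pr = np.linalg.solve(A, b)
for page, value in zip("ABC", pr):
    print(f"PR({page}) = {value:.2f}")   # PR(A)=1.19, PR(B)=1.15, PR(C)=0.66
```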
Data Warehouse:
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.
Subject-Oriented: A data warehouse is organized around major subjects, such as customer, product and sales, rather than the day-to-day operations of the organization.
Integrated: A data warehouse integrates data from multiple data sources. For example, source A and source B may have different ways of identifying a product, but in a data warehouse, there will be only a single way of identifying a product.
Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve
data from 3 months, 6 months, 12 months, or even older data from a data warehouse.
This contrasts with a transactions system, where often only the most recent data is kept.
For example, a transaction system may hold the most recent address of a customer,
where a data warehouse can hold all addresses associated with a customer.
Non-volatile: Once data is in the data warehouse, it will not change. So, historical data
in a data warehouse should never be altered.
A data warehouse can be built using a top-down approach, a bottom-up approach, or a combination of both.
The top-down approach starts with the overall design and planning. It is useful in
cases where the technology is mature and well known, and where the business
problems that must be solved are clear and well understood.
The bottom-up approach starts with experiments and prototypes. This is useful in
the early stage of business modeling and technology development. It allows an
organization to move forward at considerably less expense and to evaluate the
benefits of the technology before making significant commitments.
In the combined approach, an organization can exploit the planned and strategic
nature of the top-down approach while retaining the rapid implementation and
opportunistic application of the bottom-up approach.
The warehouse design process consists of the following steps:
Choose a business process to model, for example, orders, invoices, shipments,
inventory, account administration, sales, or the general ledger. If the business
process is organizational and involves multiple complex object collections, a data
warehouse model should be followed. However, if the process is departmental and
focuses on the analysis of one kind of business process, a data mart model should be
chosen.
Choose the dimensions that will apply to each fact table record. Typical
dimensions are time, item, customer, supplier, warehouse, transaction type, and
status.
Choose the measures that will populate each fact table record. Typical measures are
numeric additive quantities like dollars sold and units sold.
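As an illustrative sketch only (the table and column names are assumptions, not prescribed by the text), a star schema for such a sales process could be declared with Python's built-in sqlite3 module:

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables describe the chosen dimensions (time, item, location).
cur.executescript("""
CREATE TABLE dim_time     (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, quarter TEXT, year INTEGER);
CREATE TABLE dim_item     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE dim_location (loc_key  INTEGER PRIMARY KEY, city TEXT, state TEXT, country TEXT);

-- Fact table holds the numeric additive measures, keyed by the dimensions.
CREATE TABLE fact_sales (
    time_key INTEGER REFERENCES dim_time(time_key),
    item_key INTEGER REFERENCES dim_item(item_key),
    loc_key  INTEGER REFERENCES dim_location(loc_key),
    dollars_sold REAL,
    units_sold   INTEGER
);
""")
conn.commit()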
A Three Tier Data Warehouse Architecture:
Tier-1:
The bottom tier is a warehouse database server that is almost always a
relational database system. Back-end tools and utilities are used to feed data into
the bottom tier from operational databases or other external sources (such as
customer profile information provided by external consultants). These tools and
utilities perform data extraction, cleaning, and transformation (e.g., to merge
similar data from different sources into a unified format), as well as load and
refresh functions to update the data warehouse. The data are extracted using
application program interfaces known as gateways.
A gateway is supported by the underlying DBMS and allows client programs to
generate SQL code to be executed at a server.
Examples of gateways include ODBC (Open Database Connectivity) and OLE DB
(Object Linking and Embedding Database) by Microsoft, and JDBC (Java
Database Connectivity).
This tier also contains a metadata repository, which stores information about the
data warehouse and its contents.
Tier-2:
The middle tier is an OLAP server that is typically implemented using either a
relational OLAP (ROLAP) model or a multidimensional OLAP (MOLAP) model.
A ROLAP model is an extended relational DBMS that maps operations on
multidimensional data to standard relational operations.
A MOLAP model is a special-purpose server that directly implements
multidimensional data and operations.
Tier-3:
The top tier is a front-end client layer, which contains query and
reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis,
prediction, and so on).
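As a rough illustration of the gateway idea described for the bottom tier, the sketch below assumes the third-party pyodbc package and a hypothetical ODBC data source name; any ODBC or JDBC client would serve, and the DSN, table and column names are made up for illustration.

import pyodbc  # assumed third-party ODBC client library

# Connect through an ODBC gateway to a hypothetical operational database.
conn = pyodbc.connect("DSN=operational_sales;UID=etl_user;PWD=secret")
cur = conn.cursor()

# The client program generates SQL that is executed at the server via the gateway.
cur.execute("SELECT order_id, customer_id, amount, order_date FROM orders")
rows = cur.fetchall()

# These rows would then be cleaned, transformed, and loaded into the
# bottom-tier warehouse database by the back-end ETL tools.
conn.close()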
2. Data mart: a data mart contains a subset of corporate-wide data that is of value to a specific
group of users, with its scope confined to selected subjects such as sales or marketing.
3. Virtual warehouse: a virtual warehouse is a set of views over operational databases; for
efficient query processing, only some of the possible summary views may be materialized.
Operational Data Store (ODS):
An operational data store is a central database that provides a snapshot of the latest data from
multiple transactional systems, often serving as an interim area for a data warehouse. While in
the ODS, data can be scrubbed, resolved for redundancy and checked
for compliance with the corresponding business rules. An ODS can be used for integrating
disparate data from multiple sources so that business operations, analysis and reporting can
be carried out while business operations are occurring. This is where most of the data used in
current operations is housed before it's transferred to the data warehouse for longer-term
storage or archiving.
ODSes are commonly used in online transaction processing applications, which involve
processing transactional data. These applications use the quick, lightweight processing that
ODS tools provide. ODS systems enable more comprehensive trend analysis reporting and
processing across many different systems simultaneously.
An operational data store usually stores and processes data in real time. An ODS is connected
to multiple data sources and pulls data into a central location.
ETL is often used in conjunction with operational data stores to prep raw data for a data
warehouse.
The way operational data stores work is comparable to the extract, transform and load (ETL)
process. ODS systems import raw data from production systems and store it in its original
form.
In the ETL process, data is extracted from source systems, transformed and loaded into its
destination. In the ODS process, data is not transformed; rather, it is presented as is to
business intelligence (BI) applications for analysis and operational decision-making.
In some cases, data from an ODS is replicated and then ETL is used to transport the
replicated data to a data warehouse.
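A minimal Python sketch contrasting the two paths described above; the record layout and values are hypothetical, used only to show "store as is" versus "transform before loading".

# Hypothetical raw record pulled from a billing application.
raw = {"cust": "C-104", "amount": "1,250.00", "ts": "2023-07-01T10:15:00"}

# ODS path: store the record essentially as-is for operational reporting.
ods_store = []
ods_store.append(dict(raw))          # no transformation, current data only

# ETL path: transform before loading into the warehouse.
def transform(rec):
    return {
        "customer_id": rec["cust"],
        "amount": float(rec["amount"].replace(",", "")),  # cleanse the value
        "timestamp": rec["ts"],
    }

warehouse_staging = [transform(raw)]  # loaded later into the warehouse
print(ods_store[0], warehouse_staging[0])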
As operational data stores ingest data, new incoming data overwrites existing data.
An operational data store typically pulls data from multiple transactional systems for
operational reporting and business reporting. They combine various real-time data sources
together in their original format in a central location.
ODS tools contain up-to-date versions of business data integrated from data sources, which is
useful for BI tasks such as managing logistics, tracking orders and monitoring customer
activity. ODSes are also useful for troubleshooting integration issues with data when they
occur. They can compare recent versions of data to copies on other systems to determine if
there is a continuity error.
ODSes also lend themselves to easy systems integration. Administrators can program rules
into an ODS that synchronize data across multiple systems. When data changes on one system,
it can trigger a change on another system.
ODSes can also facilitate a real-time data stream from data sources into the data pipeline.
For a real-world example, an ODS could pull batches of data from a billing application at
weekly intervals, ingest transaction data in real time and integrate them into a relational
database.
An ODS usually focuses on the operational requirements of a specific business process like
customer service, for example. ODSes allow updates and propagate those updates back to the
operational system that the data originated from.
What are the differences between operational data stores and data warehouses?
Like data warehouses and data lakes, operational data stores can be used as a repository to
import and consolidate different types of operational data from various locations and systems.
However, there are significant differences.
An ODS can be used as an intermediate area for a data warehouse. It can sit in between data
sources and enterprise data warehouses, where it can be used to prepare data for storage in
the data warehouse. In this way, the ODS works as a part of a data warehousing strategy
instead of replacing it. The ODS can become a data source for data warehouses.
Data Warehouse Implementation
1. Requirements analysis and capacity planning: The first process in data warehousing
involves defining enterprise needs, defining architectures, carrying out capacity planning, and
selecting the hardware and software tools. This step involves consulting senior
management as well as the different stakeholders.
2. Hardware integration: Once the hardware and software have been selected, they need to
be put together by integrating the servers, the storage methods, and the user software tools.
3. Modeling: Modelling is a significant stage that involves designing the warehouse schema
and views. This may involve using a modeling tool if the data warehouse is sophisticated.
4. Physical modeling: For the data warehouse to perform efficiently, physical modeling is
needed. This involves designing the physical data warehouse organization, data placement,
data partitioning, deciding on access techniques, and indexing.
5. Sources: The information for the data warehouse is likely to come from several data
sources. This step involves identifying and connecting the sources using gateways, ODBC
drivers, or other wrappers.
6. ETL: The data from the source systems will need to go through an ETL process. The
process of designing and implementing the ETL phase may involve identifying a suitable ETL
tool vendor and purchasing and implementing the tools. This may include customizing the
tool to suit the needs of the enterprise.
7. Populate the data warehouse: Once the ETL tools have been agreed upon, they will need to
be tested, perhaps using a staging area. Once everything is working adequately, the ETL tools
may be used to populate the warehouse given the schema and view definitions.
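As a rough, hypothetical illustration of this populate step, the sketch below bulk-loads already-transformed staging rows into a fact table with Python's sqlite3 module; the schema and values are assumptions carried over from the star-schema sketch earlier, not part of the text.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE fact_sales (
                   time_key INTEGER, item_key INTEGER, loc_key INTEGER,
                   dollars_sold REAL, units_sold INTEGER)""")

# Rows produced by the ETL/staging phase (hypothetical values).
staging_rows = [
    (20230101, 1, 10, 1250.0, 5),
    (20230102, 2, 10,  980.0, 3),
]

# Bulk-load the cleaned, transformed rows into the warehouse fact table.
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?)", staging_rows)
conn.commit()

cur.execute("SELECT COUNT(*), SUM(dollars_sold) FROM fact_sales")
print(cur.fetchone())   # (2, 2230.0)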
8. User applications: For the data warehouse to be helpful, there must be end-user
applications. This step involves designing and implementing the applications required by the
end users.
9. Roll-out the warehouse and applications: Once the data warehouse has been populated
and the end-user applications tested, the warehouse system and the applications may be rolled
out for the user community to use.
Implementation Guidelines
1. Build incrementally: Data warehouses must be built incrementally. Generally, it is
recommended that a data mart be created with one particular project in mind; once
it is implemented, several other sections of the enterprise may also want to implement similar
systems. An enterprise data warehouse can then be implemented in an iterative manner,
allowing all data marts to extract information from the data warehouse.
2. Need a champion: A data warehouse project must have a champion who is willing to
carry out considerable research into the expected costs and benefits of the project. Data
warehousing projects require inputs from many units in an enterprise and therefore need to
be driven by someone who is capable of interacting with people in the enterprise and can
actively persuade colleagues.
3. Senior management support: A data warehouse project must be fully supported by senior
management, given the resources required and the time it can take to show benefits.
4. Ensure quality: Only data that has been cleaned and is of a quality accepted by the
organization should be loaded into the data warehouse.
5. Corporate strategy: A data warehouse project must fit with the corporate strategy and
business objectives. The purpose of the project must be defined before the project
begins.
6. Business plan: The financial costs (hardware, software, and peopleware), expected
benefits, and a project plan for a data warehouse project must be clearly outlined and
understood by all stakeholders. Without such understanding, rumours about expenditure and
benefits can become the only source of information, undermining the project.
7. Training: Data warehouse projects must not overlook training requirements. For a data
warehouse project to be successful, the users must be trained to use the warehouse and to
understand its capabilities.
8. Adaptability: The project should build in flexibility so that changes may be made to the
data warehouse if and when required. Like any system, a data warehouse will need to
change as the needs of the enterprise change.
9. Joint management: The project must be handled by both IT and business professionals in
the enterprise. To ensure proper communication with the stakeholders and that the project is
targeted at assisting the enterprise's business, business professionals must be involved in the
project along with technical professionals.
Data Warehousing - Metadata Concepts
What is Metadata?
Metadata is simply defined as data about data. The data that is used to represent other data is
known as metadata. For example, the index of a book serves as a metadata for the contents in
the book. In other words, we can say that metadata is the summarized data that leads us to
detailed data. In terms of data warehouse, we can define metadata as follows.
Note − In a data warehouse, we create metadata for the data names and definitions of a given
data warehouse. Along with this metadata, additional metadata is also created for time-
stamping any extracted data and for recording the source of the extracted data.
Categories of Metadata
Business Metadata − It has the data ownership information, business definition, and
changing policies.
Technical Metadata − It includes database system names, table and column names
and sizes, data types and allowed values. Technical metadata also includes structural
information such as primary and foreign key attributes and indices.
Operational Metadata − It includes currency of data and data lineage. Currency of
data means whether the data is active, archived, or purged. Lineage of data means the
history of data migrated and transformation applied on it.
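One illustrative way to picture these three categories together is as fields of a single metadata record for a warehouse column; the Python sketch below is an assumption for illustration, not a standard layout.

from dataclasses import dataclass

@dataclass
class ColumnMetadata:
    # Technical metadata
    table_name: str
    column_name: str
    data_type: str
    # Business metadata
    business_definition: str
    data_owner: str
    # Operational metadata
    currency: str        # e.g. "active", "archived", or "purged"
    lineage: str         # source system and transformations applied

rec = ColumnMetadata("fact_sales", "dollars_sold", "REAL",
                     "Total sale value in dollars", "Sales department",
                     "active", "extracted nightly from orders; summed per day")
print(rec.lineage)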
Role of Metadata
Metadata has a very important role in a data warehouse. The role of metadata in a warehouse
is different from the warehouse data, yet it plays an important role. The various roles of
metadata are explained below.
Metadata acts as a directory.
This directory helps the decision support system to locate the contents of the data
warehouse.
Metadata helps the decision support system in mapping data when data is
transformed from the operational environment to the data warehouse environment.
Metadata helps in summarization between current detailed data and highly
summarized data.
Metadata also helps in summarization between lightly detailed data and highly
summarized data.
Metadata is used for query tools.
Metadata is used in extraction and cleansing tools.
Metadata is used in reporting tools.
Metadata is used in transformation tools.
Metadata plays an important role in loading functions.
Metadata Repository
A metadata repository is an integral part of a data warehouse system. It typically holds the
definition of the data warehouse, business metadata, operational metadata, the data used for
mapping from the operational environment to the data warehouse, and the algorithms used for
summarization.
The importance of metadata cannot be overstated. Metadata helps in driving the accuracy of
reports, validates data transformation, and ensures the accuracy of calculations. Metadata also
helps enforce the definition of business terms for business end users. With all these uses,
metadata also brings its own challenges, such as keeping it complete, consistent, and up to date.
OLAP implements multidimensional analysis of business information and supports complex
calculations, trend analysis, and sophisticated data modeling. It is rapidly becoming the
essential foundation for intelligent solutions including Business Performance Management,
Planning, Budgeting, Forecasting, Financial Reporting, Analysis, Simulation Models,
Knowledge Discovery, and Data Warehouse Reporting.
OLAP enables end users to perform ad hoc analysis of data in multiple dimensions,
providing the insight and understanding they require for better decision making.
Typical application areas of OLAP include:
Finance and accounting
o Budgeting
o Activity-based costing
o Financial performance analysis
o Financial modeling
Sales and Marketing
Production
o Production planning
o Defect analysis
OLAP Guidelines
Dr E.F. Codd, the "father" of the relational model, has formulated a list of 12 guidelines and
requirements as the basis for selecting OLAP systems:
3) Accessibility: The system provides access only to the data that is actually required to perform
the particular analysis, presenting a single, coherent, and consistent view to the client. The OLAP
system must map its own logical schema to the heterogeneous physical data stores and
perform any necessary transformations. The OLAP operations should sit between data
sources (e.g., data warehouses) and an OLAP front-end.
4) Consistent Reporting Performance: Users should not experience any significant degradation
in reporting performance as the number of dimensions or the size of the database increases.
That is, the performance of OLAP should not suffer as the number of dimensions is increased.
Users must observe consistent run time, response time, or machine utilization every time a
given query is run.
7) Dynamic Sparse Matrix Handling: The OLAP system should adapt its physical schema to the
specific analytical model being created and loaded so that sparse matrix handling is optimized.
When encountering a sparse matrix, the system must be able to dynamically deduce the
distribution of the data and adjust storage and access so as to obtain and maintain a
consistent level of performance.
8) Multiuser Support: OLAP tools must provide concurrent data access, data integrity, and
access security.
11) Flexible Reporting: The system gives business users the flexibility to organize columns,
rows, and cells in a manner that facilitates easy manipulation, analysis, and synthesis of
data.
12) Unlimited Dimensions and Aggregation Levels: The number of data dimensions should
be unlimited. Each of these common dimensions must allow a practically unlimited number
of customer-defined aggregation levels within any given consolidation path.
Characteristics of OLAP
The term FASMI is derived from the first letters of the following characteristics of OLAP
methods:
Fast
It means that the system is targeted to deliver most responses to the user within about
five seconds, with the simplest analyses taking no more than one second and very few
taking more than 20 seconds.
Analysis
It means that the system can cope with any business logic and statistical analysis that is
relevant for the application and the user, while keeping it easy enough for the target user.
Although some pre-programming may be needed, it is not acceptable if all application
definitions have to be done using a professional development tool. The user must be allowed
to define new ad hoc calculations as part of the analysis and to report on the data in any
desired way, without having to program; products (like Oracle Discoverer) that do not allow
adequate end-user-oriented calculation flexibility are therefore excluded.
Share
It means that the system implements all the security requirements for confidentiality and, if
multiple write access is needed, concurrent update locking at an appropriate level. Not all
applications need users to write data back, but for the increasing number that do, the
system should be able to handle multiple updates in a timely, secure manner.
Multidimensional
This is the basic requirement. An OLAP system must provide a multidimensional conceptual
view of the data, including full support for hierarchies, as this is certainly the most logical
way to analyze businesses and organizations.
Information
The system should be able to hold all the data needed by the applications. Data sparsity
should be handled in an efficient manner.
Benefits of OLAP
OLAP provides a consistent, multidimensional view of business data, supports fast ad hoc
analysis, and aids activities such as planning, budgeting, and forecasting.
Multidimensional Data Model
The dimensions are the perspectives or entities concerning which an organization keeps
records. For example, a shop may create a sales data warehouse to keep records of the store's
sales for the dimensions time, item, and location. These dimensions allow the store to keep
track of things, for example, monthly sales of items and the locations at which the items were
sold. Each dimension has a table related to it, called a dimensional table, which describes the
dimension further. For example, a dimensional table for an item may contain the attributes
item_name, brand, and type.
A multidimensional data model is organized around a central theme, for example, sales. This
theme is represented by a fact table. Facts are numerical measures. The fact table contains the
names of the facts or measures of the related dimensional tables.
Consider the data of a shop for items sold per quarter in the city of Delhi. The data is shown
in the table. In this 2D representation, the sales for Delhi are shown for the time dimension
(organized in quarters) and the item dimension (classified according to the type of item
sold). The fact or measure displayed is rupees_sold (in thousands).
Now suppose we want to view the sales data with a third dimension, location. For example,
the data according to time and item, as well as location, is considered for the cities Chennai,
Kolkata, Mumbai, and Delhi. These 3D data are shown in the table, where the 3D data are
represented as a series of 2D tables.
Conceptually, the same data may also be represented in the form of a 3D data cube, as
shown in the figure.
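A small Python sketch (using pandas, an assumed library, and made-up figures) shows how such 2D and 3D views can be produced from flat sales records:

import pandas as pd

# Hypothetical flat sales records: time, item type, city, rupees sold (thousands).
sales = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "item":    ["phone", "tv", "phone", "tv", "phone", "tv"],
    "city":    ["Delhi", "Delhi", "Delhi", "Delhi", "Mumbai", "Mumbai"],
    "rupees_sold": [605, 825, 680, 952, 510, 730],
})

# 2D view for one city: time down the rows, item across the columns.
delhi_2d = (sales[sales.city == "Delhi"]
            .pivot_table(index="quarter", columns="item",
                         values="rupees_sold", aggfunc="sum"))
print(delhi_2d)

# Adding the city dimension gives a 3D cube, shown as one 2D table per city.
cube_3d = sales.pivot_table(index="quarter", columns=["city", "item"],
                            values="rupees_sold", aggfunc="sum")
print(cube_3d)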
The general idea of this approach is to materialize certain expensive computations that are
frequently queried.
For example, a relation with the schema sales (part, supplier, customer, and sale-price) can
be materialized into a set of eight views as shown in fig, where psc indicates a view
consisting of aggregate function value (such as total-sales) computed by grouping three
attributes part, supplier, and customer, p indicates a view composed of the corresponding
aggregate function values calculated by grouping part alone, etc.
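The eight views correspond to all subsets of the three grouping attributes. A minimal Python sketch enumerating them:

from itertools import combinations

dims = ("part", "supplier", "customer")

# Every subset of the dimensions defines one cuboid (view); the empty
# subset is the apex cuboid holding a single overall total.
cuboids = [combo for r in range(len(dims) + 1)
           for combo in combinations(dims, r)]

for c in cuboids:
    print("-".join(c) or "all")   # e.g. part-supplier-customer, part, ..., all
print(len(cuboids))               # 8 = 2**3 views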
A data cube is created from a subset of attributes in the database. Specific attributes are
chosen to be measure attributes, i.e., the attributes whose values are of interest. Other
attributes are selected as dimensions or functional attributes. The measure attributes are
aggregated according to the dimensions.
For example, XYZ may create a sales data warehouse to keep records of the store's sales for
the dimensions time, item, branch, and location. These dimensions enable the store to keep
track of things like monthly sales of items, and the branches and locations at which the items
were sold. Each dimension may have a table associated with it, known as a dimension table,
which describes the dimension. For example, a dimension table for items may contain the
attributes item_name, brand, and type.
Data cube method is an interesting technique with many applications. Data cubes could be
sparse in many cases because not every cell in each dimension may have corresponding data
in the database.
If a query contains constants at even lower levels than those provided in a data cube, it is not
clear how to make the best use of the precomputed results stored in the data cube.
The multidimensional data model views data in the form of a data cube. OLAP tools are based
on the multidimensional data model. Data cubes usually model n-dimensional data.
Example: In the 2-D representation, we will look at the All Electronics sales data for items
sold per quarter in the city of Vancouver. The measure displayed is dollars_sold (in
thousands).
3-Dimensional Cuboids
Suppose we would like to view the sales data with a third dimension. For example,
suppose we would like to view the data according to time and item, as well as location, for
the cities Chicago, New York, Toronto, and Vancouver. The measure displayed is dollars_sold
(in thousands). These 3-D data are shown in the table. The 3-D data of the table are
represented as a series of 2-D tables.
Conceptually, we may represent the same data in the form of 3-D data cubes, as shown in fig:
Data cube operations:
Data cube operations are used to manipulate data to meet the needs of users. These
operations help to select particular data for the analysis purpose. There are mainly 5
operations listed below-
Roll-up: this operation aggregates data along a dimension, moving up a concept
hierarchy or reducing a dimension. For example, if the data cube displays the daily income
of a customer, we can use a roll-up operation to find his monthly income.
Drill-down: this operation is the reverse of the roll-up operation. It allows us to take
particular information and then subdivide it further for finer granularity analysis. It
zooms into more detail. For example, if India is an attribute of a country column and we
wish to see villages in India, then the drill-down operation splits India into states,
districts, towns, cities and villages and then displays the required information.
Slicing: this operation selects a single value for one of the dimensions, producing a new
sub-cube. For example, the user selects the sales data for one particular quarter only.
Dicing: this operation performs a multidimensional cut; it does not cut only one
dimension but can also go to another dimension and cut a certain range of it. As a result,
it produces a sub-cube out of the whole cube (as depicted in the figure). For
example, the user wants to see the annual salary of Jharkhand state employees.
Pivot: this operation is very important from a viewing point of view. It basically
transforms the data cube in terms of view. It doesn’t change the data present in the data
cube. For example, if the user is comparing year versus branch, using the pivot
operation, the user can change the viewpoint and now compare branch versus item type.
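A hedged Python sketch of these operations on a toy sales table, using pandas (an assumed library); the column names and values are illustrative assumptions, not part of the text.

import pandas as pd

sales = pd.DataFrame({
    "year":   [2023, 2023, 2023, 2023],
    "month":  ["Jan", "Feb", "Jan", "Feb"],
    "branch": ["B1", "B1", "B2", "B2"],
    "item":   ["phone", "phone", "tv", "tv"],
    "amount": [100, 120, 90, 110],
})

# Roll-up: aggregate from month level up to year level.
rollup = sales.groupby("year")["amount"].sum()

# Drill-down: move to finer granularity, e.g. year -> month -> branch.
drilldown = sales.groupby(["year", "month", "branch"])["amount"].sum()

# Slice: fix a single dimension value (one branch).
slice_b1 = sales[sales.branch == "B1"]

# Dice: restrict values on two or more dimensions (a sub-cube).
dice = sales[(sales.branch == "B2") & (sales.month.isin(["Jan", "Feb"]))]

# Pivot: rotate the view, e.g. branch versus item instead of month versus branch.
pivot = sales.pivot_table(index="branch", columns="item",
                          values="amount", aggfunc="sum")
print(rollup, drilldown, slice_b1, dice, pivot, sep="\n\n")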
Advantages of data cubes: