DM Material


BCA Syllabus under CBCS Pattern with effect from 2021-2022 Onwards

Subject Title: DATA MINING AND WAREHOUSING        Semester: V
Subject Code: 21UCA10                              Specialization: NA
Type: Core: Theory                                 L:T:P:C: 71:5:0:4

Unit I (Level K1, 12 sessions)
Introduction: Data mining applications – data mining techniques – data mining case studies – the future of data mining – data mining software. Association rules mining: basics – task and a naïve algorithm – Apriori algorithm – improving the efficiency of the Apriori algorithm – mining frequent patterns without candidate generation (FP-growth) – performance evaluation of algorithms.

Unit II (Level K2, 12 sessions)
Classification: Introduction – decision tree – overfitting and pruning – DT rules – Naïve Bayes method – estimating predictive accuracy of classification methods – other evaluation criteria for classification methods – classification software.

Unit III (Level K3, 13 sessions)
Cluster analysis: cluster analysis – types of data – computing distances – types of cluster analysis methods – partitional methods – hierarchical methods – density-based methods – dealing with large databases – quality and validity of cluster analysis methods – cluster analysis software.

Unit IV (Level K4, 17 sessions)
Web data mining: Introduction – web terminology and characteristics – locality and hierarchy in the web – web content mining – web usage mining – web structure mining – web mining software. Search engines: search engine functionality – search engine architecture – ranking of web pages.

Unit V (Level K5, 17 sessions)
Data warehousing: Introduction – operational data sources – data warehousing – data warehouse design – guidelines for data warehouse implementation – data warehouse metadata. Online analytical processing (OLAP): Introduction – characteristics of OLAP systems – multidimensional view and data cube – data cube implementation – data cube operations – OLAP implementation guidelines.

Learning Resources
Text Books: G.K. Gupta, "Introduction to Data Mining with Case Studies", 2nd Edition, PHI
DATA MINING

 Data mining is one of the most useful techniques that help entrepreneurs, researchers,
and individuals to extract valuable information from huge sets of data. Data mining is
also called Knowledge Discovery in Database (KDD).
 The knowledge discovery process includes Data cleaning, Data integration, Data
selection, Data transformation, Data mining, Pattern evaluation, and Knowledge
presentation.

Types of Data Mining

Data mining can be performed on the following types of data:

Relational Database:

 A relational database is a collection of multiple data sets formally organized by tables,


records, and columns from which data can be accessed in various ways without
having to reorganize the database tables.
 Tables convey and share information, which facilitates data searchability, reporting,
and organization.

Data warehouses:

 A Data Warehouse is the technology that collects the data from various sources within
the organization to provide meaningful business insights.
 The huge amount of data comes from multiple places such as Marketing and Finance.
The extracted data is utilized for analytical purposes and helps in decision- making for
a business organization.
 The data warehouse is designed for the analysis of data rather than transaction
processing.

Data Repositories:

 The Data Repository generally refers to a destination for data storage. However, many
IT professionals use the term more precisely to refer to a specific kind of setup within
an IT structure.
 For example, a group of databases, where an organization has kept various kinds of
information.

Object-Relational Database:

 A combination of an object-oriented database model and relational database model is


called an object-relational model. It supports Classes, Objects, Inheritance, etc.
 One of the primary objectives of the Object-relational data model is to close the gap
between the Relational database and the object-oriented model practices frequently
utilized in many programming languages, for example, C++, Java, C#, and so on.

Transactional Database:

 A transactional database refers to a database management system (DBMS) that has


the potential to undo a database transaction if it is not performed appropriately.
 Although this was once a unique capability, today most relational database
systems support transactional database activities.

Advantages of Data Mining

o The Data Mining technique enables organizations to obtain knowledge-based data.


o Data mining enables organizations to make lucrative modifications in operation and
production.
o Compared with other statistical data applications, data mining is cost-efficient.
o Data Mining helps the decision-making process of an organization.
o It Facilitates the automated discovery of hidden patterns as well as the prediction of
trends and behaviors.
o It can be introduced into new systems as well as existing platforms.
o It is a quick process that makes it easy for new users to analyze enormous amounts of
data in a short time.

Disadvantages of Data Mining

o There is a probability that the organizations may sell useful data of customers to other
organizations for money. As per the report, American Express has sold credit card
purchases of their customers to other organizations.
o Many data mining analytics software packages are difficult to operate and need advanced
training to work with.
o Different data mining instruments operate in distinct ways due to the different
algorithms used in their design. Therefore, the selection of the right data mining tools
is a very challenging task.
o Data mining techniques are not always precise, which may lead to serious consequences
in certain conditions.

Data Mining Applications

 Data Mining is primarily used by organizations with intense consumer demands, such as
retail, communication, financial, and marketing companies, to determine price, consumer
preferences, product positioning, impact on sales, customer satisfaction, and
corporate profits.
 Data mining enables a retailer to use point-of-sale records of customer purchases to
develop products and promotions that help the organization to attract the customer.

These are the following areas where data mining is widely used:

Data Mining in Healthcare:

 Data mining in healthcare has excellent potential to improve the health system. It uses
data and analytics for better insights and to identify best practices that will enhance
health care services and reduce costs.
 Analysts use data mining approaches such as Machine learning, Multi-dimensional
database, Data visualization, Soft computing, and statistics.
 Data Mining can be used to forecast the number of patients in each category. The procedures ensure
that the patients get intensive care at the right place and at the right time.
 Data mining also enables healthcare insurers to recognize fraud and abuse.
Data Mining in Market Basket Analysis:

 Market basket analysis is a modeling technique based on the hypothesis that if you buy a
specific group of products, you are more likely to buy another group of products.
 This technique may enable the retailer to understand the purchase behavior of a buyer.
This data may assist the retailer in understanding the requirements of the buyer and
altering the store's layout accordingly.
 It also allows an analytical comparison of results between various stores and between
customers in different demographic groups.

Data mining in Education:

 Educational data mining is a newly emerging field concerned with developing
techniques that explore knowledge from the data generated in educational
environments.
 EDM objectives include predicting students' future learning behavior,
studying the impact of educational support, and advancing learning science. An
organization can use data mining to make precise decisions and to predict the
results of the students.
 With the results, the institution can concentrate on what to teach and how to teach.

Data Mining in Manufacturing Engineering:

 Knowledge is the best asset possessed by a manufacturing company. Data mining


tools can be beneficial to find patterns in a complex manufacturing process.
 Data mining can be used in system-level designing to obtain the relationships between
product architecture, product portfolio, and data needs of the customers.
 It can also be used to forecast the product development period, cost, and expectations
among the other tasks.

Data Mining in CRM (Customer Relationship Management):

 Customer Relationship Management (CRM) is all about obtaining and holding


Customers, also enhancing customer loyalty and implementing customer-oriented
strategies.
 To get a decent relationship with the customer, a business organization needs to
collect data and analyze the data. With data mining technologies, the collected data
can be used for analytics.

Data Mining in Fraud detection:

 Billions of dollars are lost to fraud. Traditional methods of fraud
detection are somewhat time-consuming and complex.
 Data mining provides meaningful patterns and turns data into information. An ideal
fraud detection system should protect the data of all the users.
 Supervised methods consist of a collection of sample records, and these records are
classified as fraudulent or non-fraudulent.
 A model is constructed using this data, and the technique is made to identify whether
the document is fraudulent or not.

Data Mining in Lie Detection:

 Apprehending a criminal is not a big deal, but bringing out the truth from him is a
very challenging task.
 Law enforcement may use data mining techniques to investigate offenses, monitor
suspected terrorist communications, etc.
 This technique includes text mining also, and it seeks meaningful patterns in data,
which is usually unstructured text.
 The information collected from the previous investigations is compared, and a model
for lie detection is constructed.

Data Mining Financial Banking:

 The Digitalization of the banking system is supposed to generate an enormous amount


of data with every new transaction.
 The data mining technique can help bankers solve business-related problems in
banking and finance by identifying trends, causalities, and correlations in business
information and market prices that are not immediately evident to managers or executives
because the volume of data is too large or is generated too rapidly for experts to
analyse.
 Managers can use this data for better targeting, acquiring, retaining,
segmenting, and maintaining profitable customers.

Challenges of Implementation in Data mining

 Although data mining is very powerful, it faces many challenges during its execution.
Various challenges could be related to performance, data, methods, and techniques,
etc.
 The process of data mining becomes effective when the challenges or problems are
correctly recognized and adequately resolved.

Incomplete and noisy data:

The process of extracting useful data from large volumes of data is data mining. The data in
the real-world is heterogeneous, incomplete, and noisy. Data in huge quantities will usually
be inaccurate or unreliable. These problems may occur due to faults in data measuring instruments or
because of human errors. Suppose a retail chain collects the phone numbers of customers who
spend more than $500, and the accounting employees put the information into their system.
The person may make a digit mistake when entering the phone number, which results in
incorrect data. Some customers may not be willing to disclose their phone numbers,
which results in incomplete data. The data could even get changed due to human or system error.
All these consequences (noisy and incomplete data) make data mining challenging.

Data Distribution:

Real-worlds data is usually stored on various platforms in a distributed computing


environment. It might be in a database, individual systems, or even on the internet.
Practically, it is quite a tough task to bring all the data into a centralized data repository, mainly
due to organizational and technical concerns. For example, various regional offices may have
their own servers to store their data. It is not feasible to store all the data from all the offices on a
central server. Therefore, data mining requires the development of tools and algorithms that
allow the mining of distributed data.

Complex Data:

Real-world data is heterogeneous, and it could be multimedia data, including audio and
video, images, complex data, spatial data, time series, and so on. Managing these various
types of data and extracting useful information is a tough task. Most of the time, new
technologies, new tools, and methodologies would have to be refined to obtain specific
information.

Performance:

The data mining system's performance relies primarily on the efficiency of algorithms and
techniques used. If the designed algorithm and techniques are not up to the mark, then the
efficiency of the data mining process will be affected adversely.

Data Privacy and Security:

Data mining usually leads to serious issues in terms of data security, governance, and
privacy. For example, if a retailer analyzes the details of the purchased items, then it reveals
data about buying habits and preferences of the customers without their permission.

Data Visualization:

In data mining, data visualization is a very important process because it is the primary
method that shows the output to the user in a presentable way. The extracted data should
convey the exact meaning of what it intends to express. But many times, representing the
information to the end-user in a precise and easy way is difficult. Because the input data and the
output information are complicated, very efficient and successful data visualization
processes need to be implemented to make data mining successful.

DATA MINING FUTURE

The process of extracting information to identify patterns, trends, and useful data that would
allow the business to take the data-driven decision from huge sets of data is called Data
Mining.

In other words, we can say that Data Mining is the process of investigating hidden patterns of
information from various perspectives and categorizing it into useful data, which is collected and
assembled in particular areas such as data warehouses, analysed efficiently with data mining
algorithms, and used to aid decision making and other data requirements, eventually cutting costs
and generating revenue.

Data mining is the act of automatically searching for large stores of information to find trends
and patterns that go beyond simple analysis procedures. Data mining utilizes complex
mathematical algorithms for data segments and evaluates the probability of future events.
Data Mining is also called Knowledge Discovery of Data (KDD).

Data Mining is a process used by organizations to extract specific data from huge databases
to solve business problems. It primarily turns raw data into useful information.

Data Mining is similar to Data Science carried out by a person, in a specific situation, on a
particular data set, with an objective. This process includes various types of services such as
text mining, web mining, audio and video mining, pictorial data mining, and social media
mining. It is done through software that is simple or highly specific. By outsourcing data
mining, all the work can be done faster with low operation costs. Specialized firms can also
use new technologies to collect data that is impossible to locate manually. There are tonnes of
information available on various platforms, but very little knowledge is accessible. The
biggest challenge is to analyze the data to extract important information that can be used to
solve a problem or for company development. There are many powerful instruments and
techniques available to mine data and find better insight from it.

Data Mining Techniques

 Data mining includes the utilization of refined data analysis tools to find previously
unknown, valid patterns and relationships in huge data sets.
 These tools can incorporate statistical models, machine learning techniques, and
mathematical algorithms, such as neural networks or decision trees.
 Thus, data mining incorporates analysis and prediction.
 Depending on various methods and technologies from the intersection of machine
learning, database management, and statistics, professionals in data mining have
devoted their careers to better understanding how to process and make conclusions
from the huge amount of data, but what are the methods they use to make it happen?
 In recent data mining projects, various major data mining techniques have been
developed and used, including association, classification, clustering, prediction,
sequential patterns, and regression.

1. Classification:

This technique is used to obtain important and relevant information about data and metadata.
This data mining technique helps to classify data in different classes.

Data mining techniques can be classified by different criteria, as follows:

i. Classification of data mining frameworks as per the type of data sources mined:
This classification is as per the type of data handled, for example, multimedia, spatial
data, text data, time-series data, World Wide Web data, and so on.
ii. Classification of data mining frameworks as per the database involved:
This classification is based on the data model involved, for example, object-oriented
database, transactional database, relational database, and so on.
iii. Classification of data mining frameworks as per the kind of knowledge
discovered:
This classification depends on the types of knowledge discovered or data mining
functionalities, for example, discrimination, classification, clustering,
characterization, etc. Some frameworks tend to be comprehensive, offering several
data mining functionalities together.
iv. Classification of data mining frameworks according to data mining techniques
used:
This classification is as per the data analysis approach utilized, such as neural
networks, machine learning, genetic algorithms, visualization, statistics, data
warehouse-oriented or database-oriented, etc.
The classification can also take into account the level of user interaction involved in
the data mining procedure, such as query-driven systems, autonomous systems, or
interactive exploratory systems.

2. Clustering:

 Clustering is a division of information into groups of connected objects. Describing


the data by a few clusters inevitably loses certain fine details, but achieves
simplification. It models data by its clusters.
 Data modeling puts clustering from a historical point of view rooted in statistics,
mathematics, and numerical analysis. From a machine learning point of view, clusters
relate to hidden patterns, the search for clusters is unsupervised learning, and the
subsequent framework represents a data concept.
 From a practical point of view, clustering plays an extraordinary job in data mining
applications. For example, scientific data exploration, text mining, information
retrieval, spatial database applications, CRM, Web analysis, computational biology,
medical diagnostics, and much more.
 In other words, we can say that Clustering analysis is a data mining technique to
identify similar data. This technique helps to recognize the differences and similarities
between the data.
 Clustering is very similar to the classification, but it involves grouping chunks of data
together based on their similarities.

3. Regression:

 Regression analysis is the data mining technique used to identify and analyze the
relationship between variables in the presence of other factors. It is used to
predict the value of a specific variable.
 Regression is primarily a form of planning and modeling. For example, we might use it
to project certain costs, depending on other factors such as availability, consumer
demand, and competition.
 Primarily it gives the exact relationship between two or more variables in the given
data set.

4. Association Rules:

 This data mining technique helps to discover a link between two or more items. It
finds a hidden pattern in the data set.
 Association rules are if-then statements that help show the probability of
interactions between data items within large data sets in different types of databases.
 Association rule mining has several applications and is commonly used to discover
sales correlations in transactional data or in medical data sets.
 The way the algorithm works is that you have various data, For example, a list of
grocery items that you have been buying for the last six months. It calculates a
percentage of items being purchased together.

These are the three major measurement techniques:

o Lift:
This measure shows how much more often item B is purchased with item A than expected; it is the
confidence of the rule divided by the overall support of item B.
Lift(A -> B) = Confidence(A -> B) / Support(B), where Support(B) = (transactions containing item B) / (entire dataset)
o Support:
This measure shows how often the items are purchased together, compared to the overall dataset.
Support(A, B) = (transactions containing both item A and item B) / (entire dataset)
o Confidence:
This measure shows how often item B is purchased when item A is purchased as well.
Confidence(A -> B) = (transactions containing both item A and item B) / (transactions containing item A)
A short computational example follows.
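A minimal Python sketch of how these three measures can be computed from a list of transactions; the transactions and item names below are made up purely for illustration.

# Toy transactions (hypothetical data, for illustration only)
transactions = [
    {"bread", "cheese"},
    {"bread", "cheese", "juice"},
    {"bread", "milk"},
    {"cheese", "juice", "milk"},
]
n = len(transactions)

def support(itemset):
    # fraction of transactions containing every item in the itemset
    s = set(itemset)
    return sum(1 for t in transactions if s <= t) / n

def confidence(antecedent, consequent):
    # how often the consequent appears in transactions that contain the antecedent
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    # confidence of the rule divided by the support of the consequent
    return confidence(antecedent, consequent) / support(consequent)

print(support({"bread", "cheese"}))       # 0.5
print(confidence({"juice"}, {"cheese"}))  # 1.0
print(lift({"juice"}, {"cheese"}))        # 1.0 / 0.75 = 1.33...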

5. Outlier detection:

 This type of data mining technique relates to the observation of data items in the data
set which do not match an expected pattern or expected behavior.
 This technique may be used in various domains like intrusion detection, fraud
detection, etc. It is also known as Outlier Analysis or Outlier Mining. An outlier is a
data point that diverges too much from the rest of the dataset.
 The majority of the real-world datasets have an outlier. Outlier detection plays a
significant role in the data mining field.
 Outlier detection is valuable in numerous fields like network interruption
identification, credit or debit card fraud detection, detecting outlying in wireless
sensor network data, etc.
6. Sequential Patterns:

Sequential pattern mining is a data mining technique specialized for evaluating sequential
data to discover sequential patterns. It comprises finding interesting subsequences in a set
of sequences, where the value of a sequence can be measured in terms of different criteria such as
length, occurrence frequency, etc.

In other words, this technique of data mining helps to discover or recognize similar patterns
in transaction data over some time.

7. Prediction:

Prediction uses a combination of other data mining techniques such as trends, clustering,
classification, etc. It analyzes past events or instances in the right sequence to predict a future
event.

Data Mining tools

 Data Mining is the set of techniques that utilize specific algorithms, statistical analysis,
artificial intelligence, and database systems to analyze data from different dimensions
and perspectives.
 Data Mining tools have the objective of discovering patterns/trends/groupings among
large sets of data and transforming data into more refined information.

These are the most popular data mining tools:


1. Orange Data Mining:

 Orange is a machine learning and data mining software suite. It supports
visualization and is component-based software written in the Python computing
language, developed at the bioinformatics laboratory of the Faculty of Computer
and Information Science, University of Ljubljana, Slovenia.
 As it is component-based software, the components of Orange are called
"widgets." These widgets range from preprocessing and data visualization to the
assessment of algorithms and predictive modeling.

Widgets deliver significant functionalities such as:

o Displaying data tables and allowing feature selection
o Data reading
o Training predictors and comparing learning algorithms
o Data element visualization, etc.

2. SAS Data Mining:

 SAS stands for Statistical Analysis System. It is a product of the SAS Institute created
for analytics and data management. SAS can mine data, change it, manage
information from various sources, and analyze statistics. It offers a graphical UI for
non-technical users.
 SAS data miner allows users to analyze big data and provide accurate insight for
timely decision-making purposes. SAS has distributed memory processing
architecture that is highly scalable. It is suitable for data mining, optimization, and
text mining purposes.

3. DataMelt Data Mining:

DataMelt is a computation and visualization environment which offers an interactive


structure for data analysis and visualization. It is primarily designed for students, engineers,
and scientists. It is also known as DMelt.

DMelt is a multi-platform utility written in JAVA. It can run on any operating system which
is compatible with JVM (Java Virtual Machine). It consists of Science and mathematics
libraries.

4. Rattle:

Rattle is a GUI-based data mining tool. It uses the R statistical programming language. Rattle
exposes the statistical power of R by offering significant data mining features.
While Rattle has a comprehensive and well-developed user interface, it also has an integrated log
code tab that reproduces the R code for any GUI operation.

The data set produced by Rattle can be viewed and edited. Rattle gives the other facility to
review the code, use it for many purposes, and extend the code without any restriction.

5. Rapid Miner:

RapidMiner is one of the most popular predictive analytics systems, created by the company
of the same name.

It is written in JAVA programming language. It offers an integrated environment for text


mining, deep learning, machine learning, and predictive analysis.

The instrument can be used for a wide range of applications, including company applications,
commercial applications, research, education, training, application development, machine
learning.

Data mining case studies

1. Astronomy

Clustering techniques are used in this study, in which about 20 million stars are
classified by their spectra and their surface temperature.

In clustering, 4 clusters are classified as,

C1-Star type-A

C2-Star type-B // According to its brightness, galaxy place, etc.

C3-Star type-C

C4-Star type-D

2. Banking & Finance

It builds customer profiles to better understand the customer.

It is also used to identify frauds.

It is applicable to forecast stock prices of various products.

It assesses future loan plans for suitable customers by using their profiles.

3. Climate

It is a study that reports on atmospheric and oceanic parameters that cause drought in
a particular state.
It checks whether rainfall is likely in a particular area by using these measures.

4. Crime Prevention

Data mining was used to link serious crimes to groups of offenders.

The data used here related to more than 2000 offences.

Clusters are used to find out the crime patterns.

5. Direct mail service

In this case study, data is prepared that includes customers who responded to direct
mail and those who did not respond to it.

It also characterizes the customers who are likely to respond to mails/phone calls.

6. Healthcare

Drug testing helps identify the patients for whom a drug is most applicable and effective.

Detecting small clusters also describes how many patients tend to go for
doctor shopping.

7. Manufacturing

It predicts failures of critical manufacturing components.

Hewlett Packard Company has been reported to have used data mining to improve the
quality of their scanners.

8. Marketing & Electronic commerce

It recommends to customers what other products they might consider buying.

9. Telecommunication

It involves Churn analysis.

It deals with the cost of local calls, international rates, customer satisfaction, etc.

Customers who do not pay their bills on time are also identified.

Association Rule Mining:


Association Rule Mining

 It is one of the data mining techniques. It is used to find the items that are purchased together
frequently.
 It is also defined as the discovery of association rules that describe which items are
purchased together.
 A rule is of the form
X => Y {Support = α, Confidence = β}

where X, Y => items (itemsets)

Support => indicates how often X and Y appear together in the N transactions

Confidence => indicates the chance of having Y whenever X appears

α, β => threshold values (usually expressed as percentages)

Other Measures

Support(X) = (number of transactions in which X appears) / N = P(X)

Support(XY) = (number of transactions in which X and Y appear together) / N = P(X ∩ Y)

Confidence(X -> Y) = Support(XY) / Support(X) = P(Y|X)

(this is the conditional probability of Y given X)

Naïve Algorithm

It is a brute-force algorithm. It has the following steps:

Step 1: List all possible combinations of itemsets from the transaction DB,
level by level.

Step 2: At the first level we get 1-itemsets, followed by 2-itemsets, and so on.

Step 3: Remove the combinations that do not satisfy the specified
support and confidence values.

Step 4: The remaining sets are treated as frequent itemsets.

Step 5: Generate association rules from the frequent sets obtained in Step 4, as sketched below.
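A small Python sketch of this brute-force procedure, assuming the transactions are available as sets of items; the data used is the example table that follows.

from itertools import combinations

# Transactions from the example that follows (IDs 100-400)
transactions = [
    {"Bread", "Cheese"},
    {"Bread", "Cheese", "Juice"},
    {"Bread", "Milk"},
    {"Cheese", "Juice", "Milk"},
]
n = len(transactions)
min_support = 0.5        # 50%
min_confidence = 0.75    # 75%

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / n

# Steps 1-2: list every possible itemset, level by level
items = sorted(set().union(*transactions))
all_itemsets = [frozenset(c) for k in range(1, len(items) + 1)
                for c in combinations(items, k)]

# Steps 3-4: keep only itemsets meeting the minimum support
frequent = {s: support(s) for s in all_itemsets if support(s) >= min_support}

# Step 5: generate rules X -> Y from frequent itemsets with at least 2 items
for s in frequent:
    if len(s) < 2:
        continue
    for item in s:
        antecedent, consequent = s - {item}, {item}
        conf = frequent[s] / support(antecedent)
        if conf >= min_confidence:
            print(set(antecedent), "->", consequent, "confidence =", conf)

# prints: {'Juice'} -> {'Cheese'} confidence = 1.0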
Example

Consider the following transaction DB

ID Items

100 Bread, Cheese

200 Bread,Cheese,Juice

300 Bread, Milk

400 Cheese,Juice,Milk

We have to find out frequent item sets as well as association rule discovery using naïve
algorithm:

• Finding frequent sets with minimum support say S=50% (min-s)

• Discover association rules with minimum confidence say C=75%

Finding Frequent Sets With Minimum Support

Lets generate all possible combinations:

1st Item set

Item Frequency

Bread 3

Cheese 3

Juice 2

Milk 2
2nd Item set

Itemset Frequency

Bread, Cheese 2

Bread, Milk 1

Bread,juice 1

Cheese, Juice 2

Cheese, Milk 1

Juice, Milk 1

3rd Item set

Itemset Frequency

Bread, Cheese, Juice 1

Bread, Cheese, Milk 0

Bread, Juice, Milk 0

Cheese, Juice, Milk 1

4th Item set

Itemset Frequency

Bread, Cheese, Juice, Milk 0


The list of frequent itemsets for the minimum support of 50% is given below:

Frequent itemsets (candidates for rule generation)

Itemset Frequency

Bread 3

Cheese 3

Juice 2

Milk 2

Bread, Cheese 2

Cheese, Juice 2

Discover Association Rules With Minimum Confidence

There are two itemsets found at this level.

{ Bread, Cheese}

{Cheese, Juice}

From these, we can form 4 possible association rules.

• Bread-> Cheese

• Cheese-> Bread

• Cheese-> Juice

• Juice-> Cheese

Let us examine the 75% confidence requirement for all the rules.

• Bread -> Cheese

C = Support(Bread, Cheese) / Support(Bread)
= 2/3 = 67%

• Cheese -> Bread

C = Support(Cheese, Bread) / Support(Cheese)

= 2/3 = 67%

• Cheese -> Juice

C = Support(Cheese, Juice) / Support(Cheese)

= 2/3 = 67%

• Juice -> Cheese

C = Support(Juice, Cheese) / Support(Juice)

= 2/2 = 100%

Only rule 4 meets the minimum confidence of 75%.

Solution: Juice -> Cheese {Support = 50%, Confidence = 100%}, which satisfies the minimum confidence of 75%.

Improved naïve algorithm

We can improve the naïve algorithm by generating only the itemset combinations that actually occur in the transactions, i.e. by not counting itemsets with zero frequency.

The per-transaction combinations (all with non-zero frequency) are given below:

ID Items Combinations

100 Bread, Cheese {Bread, Cheese}

200 Bread, Cheese, Juice {Bread, Cheese}

{Bread, Juice}

{Cheese, Juice}

{Bread, Cheese, Juice}

300 Bread, Milk {Bread, Milk}

400 Cheese, Juice, Milk {Cheese, Juice}

{Cheese, Milk}

{Juice, Milk}

{Cheese, Juice, Milk}

By considering only combinations that occur in the transactions, we have 12 itemsets instead of all 15 possible itemsets:

Itemset Frequency

Bread 3

Cheese 3

Juice 2

Milk 2

Bread, Cheese 2

Bread, Juice 1

Bread, Milk 1

Cheese, Juice 2

Cheese, Milk 1

Juice, Milk 1

Bread, Cheese, Juice 1

Cheese, Juice, Milk 1

Apriori Algorithm

It is the basic algorithm for finding association rules. It has 2 parts:

Part I: Finding frequent itemsets

Part II: Finding the rules

Part I: Finding Frequent Itemsets

Step 1: Scan all transactions and find all frequent 1-itemsets. Let this set be L1.

Step 2: Generate the candidate set Ck from Lk-1 and keep the candidates that meet the minimum support value; this gives Lk.

Step 3: Candidates are generated by the Apriori-gen function. It works as follows:

It takes the argument Lk-1 and generates Ck from Lk-1.

In computing C3 from L2, we organize L2 so that the items are sorted in their
lexicographic order.

Step 4: Terminate the process when no further frequent itemsets are found;

otherwise repeat Step 2. A sketch of Part I in code follows.
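As a rough illustration of Part I, here is a compact Python sketch of the candidate-generation and counting loop. It is only a sketch of the idea above; refinements such as the hash tree discussed next are omitted.

from itertools import combinations

def apriori(transactions, min_support):
    """Return all frequent itemsets (as frozensets) with their counts."""
    n = len(transactions)
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {s: c for s, c in counts.items() if c / n >= min_support}
    frequent = dict(L)
    k = 2
    while L:
        # Apriori-gen: join L(k-1) with itself, then prune candidates whose
        # (k-1)-subsets are not all frequent
        prev = list(L)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k and all(frozenset(sub) in L
                                           for sub in combinations(union, k - 1)):
                    candidates.add(union)
        # Count the candidates against the transactions and keep the frequent ones
        counts = {c: sum(1 for t in transactions if c <= set(t)) for c in candidates}
        L = {s: c for s, c in counts.items() if c / n >= min_support}
        frequent.update(L)
        k += 1
    return frequent

# Usage with the worked example later in this unit (support = 50%):
# apriori([{"Bread","Cheese","Eggs","Juice"}, {"Bread","Cheese","Juice"},
#          {"Bread","Milk","Chocolate"}, {"Bread","Juice","Milk"},
#          {"Cheese","Juice","Milk"}], 0.5)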

Other Issues in Apriori

i. Pruning

The process of removing some candidate itemsets by checking whether all of their subsets are frequent
or not.

ii. Apriori subset function

To improve searching efficiency, the candidate sets are stored in a hash tree. Leaf
nodes store itemsets and all internal nodes provide a road map to reach the leaves.

iii. Transaction storage

It can be done in a variety of representations.

Consider the Transaction:

ID Items

10 A,B

20 C,D

30 C,B

We can represent its storage in the following two ways:

i. Representing transactions as a binary item list


ID A B C D

10 1 1 0 0

20 0 0 1 1

30 0 1 1 0

ii. Representing transactions as binary columns

Item TID: 10 20 30

A 1 0 0

B 1 0 1

C 0 1 1

D 0 1 0

Apriori Example

ID Items

100 Bread, Cheese,Eggs,Juice

200 Bread,Cheese,Juice

300 Bread, Milk,Chocolate

400 Bread,Juice,Milk

500 Cheese,Juice,Milk

Min Support=50% , Confidence=75%


i. Find frequent itemsets with S = 50%

C1 (all 1-itemsets)

Item Frequency

Bread 4

Cheese 3

Juice 4

Milk 3

Eggs 1

Chocolate 1

Generate L1 (frequent 1-itemsets) from C1 with S ≥ 50%

Item Frequency

Bread 4

Cheese 3

Juice 4

Milk 3

C2 (candidate 2-itemsets generated from L1)

Itemset Frequency

Bread,Cheese 2

Bread, Milk 2

Bread, Juice 3
Cheese, Juice 3

Cheese, Milk 1

Juice,Milk 2

L2 (frequent 2-itemsets from C2)

Itemset Frequency

Bread, Juice 3

Cheese, Juice 3

ii. Finding association rules with confidence of 75%

We get two frequent 2-itemsets from Part I:

1. Bread, Juice

2.Cheese, Juice

Among these set, we get following association rules:

1. Bread-> Juice with C=3/4=75%


2. Juice->Bread with C=3/4=75%
3. Cheese-> Juice with C=3/3=100%
4. Juice->Cheese with C=3/4=75%

Since all the rules have 75% confidence, all are valid association rules.
Improving the efficiency of the Apriori Algorithm

We can improve the efficiency of Apriori by:

1. Reducing the number of candidate itemsets.

2. Using pruning techniques that remove unwanted itemsets.

3. Reducing the number of transactions.

4. Reducing the number of comparisons.

5. Generating candidate sets efficiently.

6. Minimizing the number of scans over the transaction DB.

Mining Frequent Patterns Without Candidate Generation (FP-Growth)

This method does not generate candidate sets; it builds a compact tree structure (the FP-tree) and only tests against it.

Its motivation is:

 Only frequent items are needed to find association rules, so other items can be
ignored.
 If the frequent items can be stored in a compact structure, then the original
database need not be scanned repeatedly.
 If multiple transactions share a set of frequent items, they can be merged.

Generating FP-Trees

The algorithm works as follows:

1. Scan the DB once; find the frequent items with support S.

2. Sort them in descending order of frequency.

3. Start creating the FP-tree with a NULL root.

4. Take each transaction; remove all non-frequent items and re-arrange the remaining items
according to the frequency order.

5. Use these transactions to construct the FP-tree, from the first transaction to the last (a code sketch follows).
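A minimal Python sketch of the tree-building steps above. The node structure (item, count, children) follows the description in the text; the node-link/header table and the mining phase are omitted.

from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item = item        # item name (None for the root)
        self.count = 1          # number of transactions sharing this path
        self.parent = parent
        self.children = {}      # child item -> FPNode

def build_fp_tree(transactions, min_support):
    n = len(transactions)
    # Pass 1: find the frequent items and their counts
    freq = Counter(item for t in transactions for item in t)
    freq = {i: c for i, c in freq.items() if c / n >= min_support}
    root = FPNode(None, None)
    # Pass 2: insert each transaction, keeping only frequent items,
    # ordered by descending frequency
    for t in transactions:
        ordered = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
    return root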

Examplesupport 50%,confidence=75%):
ID Items

100 Bread, Cheese,Eggs,Juice

200 Bread,Cheese,Juice

300 Bread, Milk,Chocolate

400 Bread,Juice,Milk

500 Cheese,Juice,Milk

The frequent items found are given below (in sorted order):

Item Frequency

Bread 4

Juice 4

Cheese 3

Milk 3

We remove the items that are not frequent and order the remaining items according to their frequency:

ID Items

100 Bread, Juice, Cheese

200 Bread, Juice ,Cheese

300 Bread, Milk

400 Bread,Juice,Milk

500 Juice ,Cheese,Milk


The FP-tree for this table is shown below.

We use B, J, C, M for Bread, Juice, Cheese and Milk.

 Each node contains 3 fields: an item name, a count and a node link.


 The count gives the number of occurrences of the path in the DB.
 The tree is built by making a root node labeled NULL.
 For the first transaction, the path B-J-C is inserted and each count is set to 1. The second
transaction is also B-J-C, so the counts along that path are incremented by 1.
 Once the FP-tree is constructed, the transaction DB is not required.

Advantages of FP-tree

 It avoids scanning the database more than twice

 It completely eliminates the costly candidate generations

Performance Evaluation of Algorithms

1. The FP-growth method generally performs better than the Apriori algorithm.


2. Apriori is better when the support value is high, which leads to a small number of frequent items.
3. At a very low support, the number of frequent items becomes large.

Best performance algorithm:

1. An efficient implementation of the FP-tree algorithm


2. An algorithm that combined a number of algorithms using multiple heuristics
Classification

 It is the process of ordering and dividing the objects into predefined groups or classes
 Eg: prediction of customer behaviour

Properties:

 If classes are defined before examining the data, it is called a priori classification; if the classes are
created by looking at the data, it is called posteriori classification.
 When the classification is based on a training set, it is called supervised learning.
 A training set is formed by picking a few objects randomly from the database; all other
remaining objects are treated as test data.
 Each object has a number of attributes; one attribute tells which class the object belongs to. This
attribute is also called the dependent/output attribute.
 All other remaining attributes are called independent/input attributes
 Types of attribute are:
1. Numerical
2. Categorical

Ordered(eg: grade)
Unordered(eg: gender)
 Dependent attributes are categorical if the problem is classification ;otherwise they are
numerical for regression problems.

Decision tree

 It is a flowchart-like structure in which each internal node denotes a test on an attribute and each branch
represents an outcome of the test. All the leaves are classes.

Eg: consider the table

Name Feathers Eggs Sense Class


Cuckoo Yes Yes No Bird
Dog No No No Mammal
Kumar No No Yes Human
Peacock Yes Yes No Bird
Fox No No No Mammal
Crow Yes Yes No Bird
Ramesh No No Yes Human

Its decision tree would be (diagram summarized in text): the root node tests "feathers". If feathers = yes, the class is Bird. If feathers = no, the next node tests "sense": if sense = yes the class is Human, otherwise the class is Mammal.

 It is a model that describes relationships found in the training data.


 Each internal node is a decision node having two or more children.
 The training process that generates the tree is called DT induction.

Merits of DT:

 It is easy to describe and a popular technique.


 This technique is fast unless the data is too large.

Demerits of DT:

 The complexity of the decision tree increases as the number of attributes increases.


 It does not produce accurate results in some situations.

Building a DT-induction algorithm:


 The algorithm works as follows:

Step (1): Let the training data be S. If some attributes are continuous, they must be discretized.
Step (2): Put all of S in a single tree node.
Step (3): If all objects in S belong to the same class, stop the process.
Step (4): Select the best splitting attribute A among all independent attributes.
Step (5): Split the node according to the values of A.
Step (6): Stop the process if either of the following occurs; otherwise repeat from Step (3) for each new node:
1. All objects in a node belong to the same class.
2. There are no objects left for further splitting.
A worked illustration using an off-the-shelf decision tree learner is sketched below.
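As an illustration (not part of the syllabus text above), the animal table from the decision tree example can be fed to an off-the-shelf decision tree learner. This sketch assumes scikit-learn is installed and encodes each yes/no attribute as 1/0.

from sklearn.tree import DecisionTreeClassifier, export_text

# Training data from the animal table: [feathers, eggs, sense], encoded as 1 = yes, 0 = no
X = [[1, 1, 0],   # cuckoo  -> Bird
     [0, 0, 0],   # dog     -> Mammal
     [0, 0, 1],   # Kumar   -> Human
     [1, 1, 0],   # peacock -> Bird
     [0, 0, 0],   # fox     -> Mammal
     [1, 1, 0],   # crow    -> Bird
     [0, 0, 1]]   # Ramesh  -> Human
y = ["Bird", "Mammal", "Human", "Bird", "Mammal", "Bird", "Human"]

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=["feathers", "eggs", "sense"]))
print(tree.predict([[1, 1, 0]]))   # an animal with feathers and eggs -> ['Bird']

The learned tree splits on "feathers" first and then on "sense", matching the hand-drawn tree described above.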

Over fitting and pruning:


Overfitting:

 A DT may fit anomalies (noise) present in the training data.


 “It occurs when the objects being classified have a large number of attributes and a tree of maximal
depth is built.”
 In such situations, the quality of the tree may not be high.
 It may become quite complex, with uneven paths.
 It results in poor accuracy.
Pruning:

 “It is the process of removing unwanted branches when the overfitting problem occurs.”
 After pruning, the accuracy is then computed.
 A pruned tree with sufficient accuracy is selected.

Decision tree rules:

 Each path of the DT from root to leaf describes a series of tests that forms a rule.


 It is very easy to understand the rules given on the DT links.
 In an RDBMS, SQL queries can retrieve such results.
 Eg: sample IF-THEN rules may be derived based on the various paths of the DT:
 IF Gender = "male" THEN class = B
 IF Gender = "female" AND Married = "yes" THEN class = C, ELSE class = A.

(Decision tree: the root tests Gender. Male leads to Class B. Female leads to a test on Married?: yes gives Class C, no gives Class A. These rules are coded below.)
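The IF-THEN rules above translate directly into code; here is a tiny Python function mirroring the two rules (the attribute values are those used in the example).

def classify(gender, married):
    # IF Gender = "male" THEN class = B
    if gender == "male":
        return "B"
    # IF Gender = "female" AND Married = "yes" THEN class = C, ELSE class = A
    if married == "yes":
        return "C"
    return "A"

print(classify("male", "no"))      # B
print(classify("female", "yes"))   # C
print(classify("female", "no"))    # A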

Naïve bayes method:


 In this method, we have a hypothesis that the given data belongs to a particular class; we
then calculate the probability of this hypothesis being true.
 Bayes theorem:
P(Ci|X) = P(X|Ci) · P(Ci) / P(X)

 P(Ci|X) = probability of object X belonging to class Ci

 P(X|Ci) = probability of obtaining attribute values X if we know that the object belongs to class Ci
 P(Ci) = probability of any object belonging to class Ci
 P(X) = probability of obtaining attribute values X whatever class the object belongs to
 Since P(X) is the same for every class, to compute the class of any object we need to calculate only
P(X|Ci) · P(Ci)
Example : consider the table

Owns home Married Gender Employed Credit rating Risk class


Yes Yes M Yes A B
No No F Yes A A
Yes Yes F Yes B C
Yes No M No B B
No Yes F Yes B C
No No F Yes B A
Yes No F Yes A A
No Yes F Yes A C
Yes Yes F Yes A C
No No M No B B

 There are 10 samples and 3 classes. The given (unseen) data is X = {Owns home = yes, Married = no, Gender = F, Employed = yes, Credit rating = A}.


 Class A=3
 Class B=3
 Class C =4

Prior probability measures:

P(A) = 3/10 = 0.3
P(B) = 3/10 = 0.3
P(C) = 4/10 = 0.4
Posterior probability measures:

P(X|Ci) = P({yes, no, F, yes, A} | Ci)
= P(Owns home = yes | Ci)
× P(Married = no | Ci)
× P(Gender = F | Ci)
× P(Employed = yes | Ci)
× P(Credit rating = A | Ci)
 These probabilities are computed below by grouping the objects class-wise:

Owns home Married Gender Employed Credit rating Risk class
No No F Yes A A
No No F Yes B A
Yes No F Yes A A
P(class A): 1/3 3/3=1 3/3=1 3/3=1 2/3
Yes Yes M Yes A B
Yes No M No B B
No No M No B B
P(class B): 2/3 2/3 0 1/3 1/3
Yes Yes F Yes B C
No Yes F Yes B C
No Yes F Yes A C
Yes Yes F Yes A C
P(class C): 2/4=0.5 0 4/4=1 1.0 0.5

 With this we compute:


P(X|A) = 1/3 × 1 × 1 × 1 × 2/3 = 2/9
P(X|B) = 2/3 × 2/3 × 0 × 1/3 × 1/3 = 0
P(X|C) = 0.5 × 0 × 1 × 1.0 × 0.5 = 0
The values for classes B and C are zero.
The value for class A is
P(X|A) · P(A) = (2/9) × 0.3 = 0.0667
So the given X is assigned to class A (a code sketch of this calculation follows).
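A short Python sketch that reproduces the calculation above for X = {yes, no, F, yes, A}, using the 10-row table; the column names in the code follow the table.

from collections import defaultdict

# (owns_home, married, gender, employed, credit_rating, risk_class)
rows = [
    ("Yes", "Yes", "M", "Yes", "A", "B"),
    ("No",  "No",  "F", "Yes", "A", "A"),
    ("Yes", "Yes", "F", "Yes", "B", "C"),
    ("Yes", "No",  "M", "No",  "B", "B"),
    ("No",  "Yes", "F", "Yes", "B", "C"),
    ("No",  "No",  "F", "Yes", "B", "A"),
    ("Yes", "No",  "F", "Yes", "A", "A"),
    ("No",  "Yes", "F", "Yes", "A", "C"),
    ("Yes", "Yes", "F", "Yes", "A", "C"),
    ("No",  "No",  "M", "No",  "B", "B"),
]
x = ("Yes", "No", "F", "Yes", "A")   # the unseen object

by_class = defaultdict(list)
for *attrs, cls in rows:
    by_class[cls].append(attrs)

scores = {}
for cls, objs in by_class.items():
    prior = len(objs) / len(rows)                     # P(Ci)
    likelihood = 1.0
    for i, value in enumerate(x):                     # P(X|Ci) = product of attribute probabilities
        likelihood *= sum(1 for o in objs if o[i] == value) / len(objs)
    scores[cls] = prior * likelihood                  # P(X|Ci) * P(Ci)

print(scores)                        # class A: 0.0667, classes B and C: 0.0
print(max(scores, key=scores.get))   # 'A'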

Estimating the predictive accuracy of classification methods:


 It is the probability of correctly classifying unseen data.
 Accuracy may be estimated using:
1. Sensitivity measure
2. Specificity measure
3. Precision and accuracy measures
 Let us assume that the test data has a total of T objects. When testing a method, we find that C of the T
objects are correctly classified. The error rate may then be defined as

Error = (T − C) / T

 E.g.: Confusion matrix. Consider the matrix below (rows are the predicted class, columns are the true class):

Predicted class True class 1 True class 2 True class 3

1 8 1 1

2 2 9 2

3 0 0 7

 Using this, we can define the following measures for estimating the sensitivity and specificity
of the result:
(1) Sensitivity = TP / (TP + FN)

(2) Specificity = TN / (TN + FP)

 FP = false positives; these are the objects that did not belong to a class but were allocated to
it.
 FN = false negatives; these are the objects that did belong to a class but were not allocated to it.
 TP = true positives; these are the objects that were correctly allocated to their class.
 TN = true negatives; measured as TN = (objects not belonging to the class) − FP.
 Consider that there are 30 objects, equally partitioned among classes 1, 2 and 3 (10 each).
 Let us now calculate the measures for each class; a short code check of these numbers follows the class-wise values below.
 CLASS 1:
FP=2 FN=2 -out of 10
TN=18 TP=8 – out of 20
 CLASS 2
FP=4 FN=1 -out of 10
TN=16 TP=9 -out of 20
 CLASS 3
FP=0 FN=3
TN=20 TP=7
 Sensitivity =(8+9+7)/(8+9+7)+[2+1+3] =24/30 = 80%
 Specificity = (18+16+20)/(18+16+20)+(2+4+0)=54/60= 90%
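A small Python check of these numbers from the confusion matrix above (rows are the predicted class, columns the true class, with 10 objects of each class).

# Confusion matrix: rows = predicted class, columns = true class
m = [[8, 1, 1],
     [2, 9, 2],
     [0, 0, 7]]
k = 3
total = sum(sum(row) for row in m)                                # 30 objects

tp = sum(m[i][i] for i in range(k))                               # correctly classified = 24
fn = total - tp                                                   # not allocated to their true class = 6
fp = [sum(m[i]) - m[i][i] for i in range(k)]                      # wrongly allocated to class i: [2, 4, 0]
true_count = [sum(m[j][i] for j in range(k)) for i in range(k)]   # objects truly in class i: [10, 10, 10]
tn = [total - true_count[i] - fp[i] for i in range(k)]            # [18, 16, 20]

sensitivity = tp / (tp + fn)                                      # 24/30 = 0.80
specificity = sum(tn) / (sum(tn) + sum(fp))                       # 54/60 = 0.90
print(sensitivity, specificity)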
Other methods:

(1) Holdout method:


 In this, the data set is divided into two subsets:
 (1) the training subset, and
 (2) the holdout subset.
 The training subset is used to build the classifier and the holdout subset is used to obtain a
good estimate of its accuracy.

(2) Random sub-sampling method:

 It is similar to the holdout method.


 In this, the holdout procedure is repeated several times to obtain a number of accuracy values.
 Computing the mean over the several trials gives the final accuracy.

(3) K-fold cross-validation:

 In this, the available data is divided into K disjoint subsets of equal size.


 One subset is used as the test set; the remaining K-1 subsets are used for building the classifier.
 This single test set is used to estimate the accuracy of the classifier; the procedure is repeated so that each subset is used once for testing, and the accuracies are averaged.

(4) Leave-one-out method:

 It is a special case of K-fold cross-validation in which K equals the number of objects.


 All objects except one form a single training set used to build the classifier.
 Once the model is built, the remaining single object is used to check its accuracy; this is repeated for every object.

(5) Bootstrap method:

 It randomly selects n samples, by sampling with replacement n times from the n available objects, for building a model.
 Once a model is built, it can be shown that on average only 63.2% of the objects are unique
in such a sample; these are used for building the classifier.
 The remaining 36.8% of the data set is used to estimate its accuracy. A generic cross-validation sketch is given below.
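A generic sketch of the K-fold idea in Python. The functions build_classifier and accuracy are placeholders standing for whatever classification method and scoring are being evaluated.

import random

def k_fold_accuracy(data, k, build_classifier, accuracy):
    """Estimate predictive accuracy by K-fold cross-validation.

    build_classifier(training_set) -> model
    accuracy(model, test_set)      -> fraction correctly classified
    (Both are placeholders for an actual classification method.)
    """
    data = list(data)
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]           # k disjoint subsets of (almost) equal size
    scores = []
    for i in range(k):
        test = folds[i]                               # one subset used for testing
        train = [obj for j, f in enumerate(folds) if j != i for obj in f]
        model = build_classifier(train)               # built on the remaining k-1 subsets
        scores.append(accuracy(model, test))
    return sum(scores) / k                            # mean accuracy over the k trials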

Other evaluation criteria for classification methods:


It includes:
 Speed
 Robustness
 Scalability
 Interpretability
 Goodness of the model
 Flexibility & time complexity

Speed:
 It refers to how quickly the method produces good results.

Robustness:
 It refers to the error rate, that is, the ability of the method to produce good results even when the data contains errors or noise.

Scalability:
 It deals with the extendability of the method: as the size of the data grows, it should be possible to
extend the method.

Interpretability:
 It deals with the understandability of the method; the results should be easily interpretable by
users at different levels.

Goodness of the model:


 The model should effectively fit the problem.

Flexibility & time complexity:


 The method should be portable enough to apply to any kind of problem, and it should not be unduly time-consuming.
CLUSTER ANALYSIS
 Cluster analysis is a collection of methods that assists the user in putting different objects
from a collection of objects into different groups.
 The aim of cluster analysis is exploratory, to find if data naturally falls into meaningful
groups with small within-group variations and large between-group variation. Often we
may not have a hypothesis that we are trying to test.
 The aim is to find any interesting grouping of the data. It is possible to define cluster
analysis as an optimization problem in which a given function consisting of within-cluster
(intra-cluster) similarity and between-cluster (inter-cluster) dissimilarity needs to be
optimized. Clustering methods therefore only try to find an approximate or local optimum
solution.

DESIRED FEATURES OF CLUSTER ANALYSIS


 (For large datasets) Scalability : Data mining problems can be large and therefore it is
desirable that a cluster analysis method be able to deal with small as well as large problems
gracefully. The method should also scale well to datasets in which the number of attributes
is large.
 (For large datasets) Only one scan of the dataset : For large problems, the data must be
stored on the disk and the cost of I/O from the disk can then become significant in solving
the problem.
 (For large datasets) Ability to stop and resume : When the dataset is very large, cluster
analysis may require considerable processor time to complete the task. In such cases, it is
desirable that the task be able to be stopped and then resumed when convenient.
 Minimal input parameters: The cluster analysis method should not expect too much
guidance from the user.
 Robustness : Most data obtained from a variety of sources has errors. It is therefore
desirable that a cluster analysis method be able to deal with noise, outliers and missing
values gracefully.
 Ability to discover different cluster shapes : Clusters come in different shapes and not all
clusters are spherical.Some applications require that various shapes be considered.
 Different data types : Many problems have a mixture of data types, for example, numerical,
categorical and even textual. It is therefore desirable that a cluster analysis method be able
to deal with not only numerical data but also Boolean and categorical data.
 Result independent of data input order : Although this is a simple requirement, not all
methods satisfy it. It is therefore desirable that a cluster analysis method not be sensitive to
data input order.

TYPES OF DATA
 Datasets come in a number of different forms. The data may be quantitative, binary,
nominal or ordinal.
 Quantitative (or numerical) data is quite common, for example, weight, marks, height,
price, salary, and count. There are a number of methods for computing similarity between
quantitative data.
 Binary data is also quite common, for example, gender, marital status. As we have noted
earlier, computing similarity or distance between categorical variables is not as simple as
for quantitative data but a number of methods have been proposed.
 Qualitative nominal data is similar to binary data which may take more than two values
but has no natural order, for example religion, foods or colours.
Qualitative ordinal (or ranked) data is similar to nominal data except that the data has an
order associated with it, for example, grades A, B, C, D and sizes S, M, L, XL. The problem
of measuring distance between ordinal variables is different from that for nominal variables
since the order of the values is important. One method of computing distance involves
transforming the values to numeric values according to their rank.
 Other types of data are also possible. For example, data may include text strings or a
sequence of Web pages visited.
COMPUTING DISTANCE
Cluster analysis methods are based on measuring similarity between objects by
computing the distance between each pair. Distance is a well-understood concept that
has a number of simple properties:
1. Distance is always positive.
2. Distance from point x to itself is always zero.
3. Distance from point x to point y cannot be greater than the sum of the distance from x to
some other point z and the distance from z to y.
4. Distance from x to y is always the same as from y to x.
Let the distance between two points x and y (both vectors) be D(x, y). We now define a number of
distance measures.
1. Euclidean distance
 Euclidean distance, or the L2 norm of the difference vector, is most commonly used to
compute distances and has an intuitive appeal, but the largest-valued attribute may
dominate the distance.

D(x, y) = (Σ (xi − yi)²)^(1/2)


 The Euclidean distance measure is appropriate only when the attributes are on similar scales
(e.g. standardized), since, as noted above, the distance can be greatly affected by the scale of the data.
2. Manhattan distance
 Another commonly used distance metric is the Manhattan distance, or the L1 norm of
the difference vector. In most cases, the results obtained by the Manhattan distance are
similar to those obtained by using the Euclidean distance.
D(x, y) = Σ |xi − yi|

3. Chebychev distance
 This distance metric is based on the maximum attribute difference. It is also called the L∞
norm of the difference vector.
D(x, y) = max |xi − yi|
4. Categorical data distance
 This distance measure may be used if many attributes have categorical values with only a
small number of values (e.g. binary values). Let N be the total number of categorical
attributes.
D(x, y) = (number of attributes for which xi and yi differ) / N
(These measures are sketched in code below.)
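The four distance measures written as plain Python functions; x and y are equal-length sequences of attribute values (category labels, in the categorical case).

import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def chebychev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

def categorical(x, y):
    # fraction of attributes on which the two objects differ
    return sum(1 for a, b in zip(x, y) if a != b) / len(x)

print(euclidean((0, 0), (3, 4)))            # 5.0
print(manhattan((0, 0), (3, 4)))            # 7
print(chebychev((0, 0), (3, 4)))            # 4
print(categorical(("A", "M"), ("B", "M")))  # 0.5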

TYPES OF CLUSTER ANALYSIS METHODS


1. Partitional methods
 Partitional methods obtain a single-level partition of objects. These methods are usually
based on greedy heuristics that are used iteratively to obtain a local optimum solution.
 Each cluster has at least one object and each object belongs to only one cluster. Objects may
be relocated between clusters as the clusters are refined.
 Often these methods require that the number of clusters be specified a priori, and this
number usually does not change during the processing.
2.Hierarchical methods
 Hierarchical methods obtain a nested partition of the objects resulting in a tree of clusters.
 These methods either start with one cluster and then split into smaller and smaller clusters
(called divisive or top down) or start with each object in an individual cluster and then try
to merge similar clusters into larger and larger clusters (called agglomerative or bottom
up).
3. Density-based methods
 In this class of methods, typically for each data point in a cluster, at least a minimum
number of points must exist within a given radius.
4.Grid-based methods
 In this class of methods, the object space rather than the data is divided into a grid.
 Grid partitioning is based on characteristics of the data and such methods can deal with
non-numeric data more easily.Grid-based methods are not affected by data ordering.
5.Model-based methods
 A model is assumed, perhaps based on a probability distribution. Essentially the algorithm
tries to build clusters with a high level of similarity within them and a low level of
similarity between them.
 Similarity measurement is based on the mean values and the algorithm tries to minimize
the squared-error function.
A simple taxonomy of cluster analysis methods is presented in Figure 4.1.

Figure 4.1 Taxonomy of cluster analysis methods.


PARTITIONAL METHODS
 Partitional methods are more easily adapted for very large datasets. The hierarchical
methods tend to be computationally more expensive. Partitional methods are iterative
refinement methods (sometimes called hill-climbing or greedy methods) and they converge
to a local minimum rather than the global minimum.
 These methods not only require specifying the number of clusters a priori, they also
require the user to normally specify the starting states of the clusters. This may be
difficult for a user since the user may not have such knowledge and there is no simple
reliable method for finding initial conditions for the clusters. We do, however, discuss
some possible methods for obtaining initial conditions when the user is not able to
provide them.
The K-Means Method
 K-means is the simplest and most popular classical clustering method and is easy to
implement. The classical method can only be used if the data about all the objects fits
in the main memory.
 The method is called K-means since each of the K clusters is represented by the mean of
the objects (called the centroid) within it. It is also called the centroid method.
 At each step the centroid point of each cluster is assumed to be known and each of the
remaining points is allocated to the cluster whose centroid is closest to it.
 Once this allocation is completed, the centroids of the clusters are recomputed using simple
means and the process of allocating points to each cluster is repeated until there is no
change in the clusters (or some other stopping criterion, e.g. no significant reduction in the
squared error, is met).
The K-means method uses the Euclidean distance measure, which appears to work well with
compact clusters. If instead of the Euclidean distance, the Manhattan distance is used the method
is called the K-median method.
The K-means method may be described as follows:

1. Select the number of clusters. Let this number be k.


2. Pick k seeds as centroids of the k clusters. The seeds may be picked randomly unless the
user has some insight into the data.
3. Compute the Euclidean distance of each object in the dataset from each of the centroids
4. Allocate each object to the cluster it is nearest to based on the distances computed in the
previous step.
5. Compute the centroids of the clusters by computing the means of the attribute values of the
objects in each cluster.
6. Check if the stopping criterion has been met (e.g. the cluster membership is unchanged). If
yes, go to Step 7. If not, go to Step 3.
7. [Optional] One may decide to stop at this stage or to split a cluster or combine two clusters
heuristically until a stopping criterion is met.
The method is scalable and efficient (the time complexity is O(n)) but it is only guaranteed
to find a local minimum. A minimal code sketch of the method is given below.
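A minimal K-means sketch in Python following the steps above (Euclidean distance, random seeds, stop when the membership no longer changes). It assumes Python 3.8+ for math.dist and is not an optimized implementation.

import math, random

def k_means(points, k, max_iter=100):
    # Steps 1-2: pick k random points as the initial centroids (seeds)
    centroids = random.sample(points, k)
    assignment = [None] * len(points)
    for _ in range(max_iter):
        # Steps 3-4: allocate each point to the nearest centroid
        new_assignment = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                          for p in points]
        # Step 6: stop when cluster membership is unchanged
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 5: recompute centroids as the mean of each cluster
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment

centroids, labels = k_means([(1, 1), (1, 2), (8, 8), (9, 8)], k=2)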
Scaling and weighting
 For clustering to be effective, all attributes should be converted to a similar scale unless we
want to give more weight to some attributes that are relatively large in scale. There are a
number of ways to transform the attributes.
 One possibility is to transform them all to a normalized score or to a range (0, 1). Such
transformations are called scaling. Some other approaches to scaling are given below:
1. Divide each attribute by the mean value of that attribute. This reduces the mean of
each attribute to 1. It does not control the variation; some values may still be large,
others small.
2. Divide each attribute by the difference between the largest value and the smallest
value. This reduces the mean of attributes that have a large range of values and
increases the mean of attributes that have a small range of values but does not
reduce each attribute's mean to the same value. The scaling reduces the difference
between the largest value and the smallest value to 1 and therefore does control the
variation.
3. Convert the attribute values to "standardized scores" by subtracting the mean of the
attribute from each attribute value and dividing it by the standard deviation.
The purpose of giving weights to some attributes essentially is to introduce bias into the clustering
process by purposely giving higher weights to some important attributes. These scaling approaches are sketched below.
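The three scaling approaches listed above, sketched for a single numeric attribute held in a plain Python list (the salary values are made up).

def scale_by_mean(values):
    m = sum(values) / len(values)
    return [v / m for v in values]            # mean of the result is 1

def scale_by_range(values):
    r = max(values) - min(values)
    return [v / r for v in values]            # range of the result is 1

def standardize(values):
    m = sum(values) / len(values)
    sd = (sum((v - m) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - m) / sd for v in values]     # standardized scores (mean 0, std dev 1)

salary = [20000, 35000, 50000, 95000]
print(standardize(salary))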
Starting values for the K-means method
 The problem of choosing starting values may be overcome by using an iterative approach. Also, during the iterative
process, if two clusters are found to be close together, it may be desirable to merge them.
Also, a large cluster may be split in two if the variance within the cluster is above some
threshold value.
 Another approach recommends using a hierarchical method like the agglomerative method
on the data first, since that method does not require starting values, and then using the
results of that method as the basis for specifying the number of clusters and starting seeds.
 Another approach is based on using a number of small samples of the given data and using
random seeds for each sample.
Summary of the K-means method
 K-means is an iterative-improvement greedy method. A number of iterations are normally
needed for convergence and therefore the dataset is processed a number of times.
 If the data is very large and cannot be accommodated in the main memory the process may
become inefficient.
 Although the K-means method is most widely known and used, there are a number of
issues related to the method that should be understood:
1. The K-means method needs to compute Euclidean distances and means of the attribute
values of objects within a cluster. The classical algorithm therefore is only suitable for
continuous data.
2. The K-means method implicitly assumes spherical probability distributions.
3. The K-means method can be sensitive to outliers.
4. Although some local optimum solutions discovered by the K-means method are satisfactory, often the local optimum is not as good as the global optimum.
5. The K-means method does not consider the size of the clusters. Some clusters may be large
and some very small.
6. The K-means method does not deal with overlapping clusters.

Expectation Maximization Method
 The Expectation Maximization (EM) method is based on the assumption that the
objects in the dataset have attributes whose values are distributed according to
some(unknown)linear combination (or mixture) of simple probability distributions.
While the K-means method involves assigning objects to clusters to minimize
within-group variation, the EM method assigns objects to different clusters with
certain probabilities in an attempt to maximize expectation (or likelihood) of
assignment.
 For example, with two clusters we may assume that each individual comes from distribution 1 with probability p and therefore from distribution 2 with probability 1 - p. The EM method consists of a two-step iterative algorithm.
1. The first step, called the estimation step or the E-step, involves estimating
the probability distributions of the clusters given the data.
2. The second step, called the maximization step or the M-step, involves finding the model parameters that maximize the likelihood of the solution. The EM method assumes that all attributes are independent random variables. For the two-cluster case it requires that we estimate the following parameters:
1. The mean and standard deviation of the normal distribution for cluster 1
2. The mean and standard deviation of the normal distribution for cluster 2
3. The probability p of a sample belonging to cluster 1 and therefore probability 1 - p of belonging
to cluster 2.

The EM method then works as follows:
1. Guess the initial values of the five parameters (the two means, the two standard deviations
and the probability p) given above.
2. Use the two normal distributions (given the two guesses of means and two guesses of
standard deviations) and compute the probability of each object belonging to each of the
two clusters.
3. Compute the likelihood of the data coming from these two clusters by multiplying together, over all objects, the sum of the two cluster probabilities of each object.
4. Re-estimate the five parameters and go to Step 2 until a stopping criterion has been met.
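The sketch below illustrates this procedure for the simplest case discussed above: one attribute and two clusters with normal distributions. The data and the initial guesses are assumptions for illustration, and the likelihood computation of Step 3 is left out for brevity.

import numpy as np
from scipy.stats import norm

x = np.array([1.0, 1.2, 0.8, 0.9, 5.0, 5.3, 4.8, 5.1])   # assumed 1-D data

# Step 1: guess the five parameters (two means, two standard deviations, p).
mu1, sd1, mu2, sd2, p = 0.5, 1.0, 4.0, 1.0, 0.5

for _ in range(50):
    # E-step (Step 2): probability of each object belonging to each cluster.
    w1 = p * norm.pdf(x, mu1, sd1)
    w2 = (1 - p) * norm.pdf(x, mu2, sd2)
    r1 = w1 / (w1 + w2)          # responsibility of cluster 1 for each object
    r2 = 1 - r1
    # M-step (Step 4): re-estimate the five parameters from the responsibilities.
    mu1, mu2 = np.average(x, weights=r1), np.average(x, weights=r2)
    sd1 = np.sqrt(np.average((x - mu1) ** 2, weights=r1))
    sd2 = np.sqrt(np.average((x - mu2) ** 2, weights=r2))
    p = r1.mean()

print(mu1, sd1, mu2, sd2, p)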

HIERARCHICAL METHODS
 The hierarchical methods attempt to capture the structure of the data by constructing a tree of clusters. Two types of hierarchical approaches are possible, as given below.
a) Agglomerative approach (or the bottom-up approach):
 In this approach each object at the start is a cluster by itself and the nearby clusters are repeatedly merged, resulting in larger and larger clusters, until some stopping criterion (often a given number of clusters) is met or
 all the objects are merged into a single large cluster, which is the highest level of the hierarchy.
b) Divisive approach (or the top-down approach):
 All the objects are put in a single cluster to start. The method then repeatedly splits clusters, resulting in smaller and smaller clusters, until a stopping criterion is reached or each cluster has only one object in it.
Distance Between Clusters
The hierarchical clustering methods require distances between clusters to be computed. These distance measures are often called linkage metrics. Computing distances between large clusters can be expensive. We will discuss the following methods for computing distances between clusters:
1. Single-link algorithm
2. Complete-link algorithm
3. Centroid algorithm
4. Average-link algorithm
5. Ward's minimum-variance algorithm

1.Single-link
 The single-link (or the nearest neighbour) algorithm is the simplest algorithm for
computing distance between two clusters. The algorithm determines the distance between
two clusters as the minimum of the distances between all pairs of points (a, x) where a is
from the first cluster and x is from the second.
 The algorithm therefore requires that all pairwise distances be computed and the smallest distance (or the shortest link) found.
2.Complete-link

 The complete-link algorithm is also called the farthest neighbours algorithm. In this algorithm, the distance between two clusters is defined as the maximum of the pairwise distances (a, x).
 Therefore if there are m elements in one cluster and n in the other, all mn pairwise
distances therefore must be computed and the largest chosen.
 Both single-link and complete-link measures have their difficulties.
 In the single-link algorithm, each cluster may have an outlier and the two outliers may be
nearby and so the distance between the two clusters would be computed to be small.
Single-link can form a chain of objects as clusters are combined since there is no constraint
on the distance between objects
 On the other hand, the two outliers may be very far away although the clusters are nearby
and the complete-link algorithm will compute the distance as large.
3.Centroid
 The centroid algorithm computes the distance between two clusters as the distance between
the average point of each of the two clusters.
 Usually the squared Euclidean distance between the centroids is used.
4.Average-link
 The average-link algorithm computes the distance between two clusters as the average of
all pairwise distances between an object from one cluster and another from the other
cluster.
 Therefore if there are m elements in one cluster and n in the other, there are mn distances to
be computed, added and divided by mn.
 This approach also generally works well. It tends to join clusters with small variances. Figure 4.5 shows two clusters A and B and the average-link distance between them.

Figure 4.5 The average-link distance between two clusters.

5.Ward's minimum-variance method
 The method generally works well and results in creating small tight clusters.
 Ward's distance is the difference between the total within the cluster sum of squares for the
two clusters separately and within the cluster sum of squares resulting from merging the
two clusters.
An expression for Ward's distance may be derived. It may be expressed as follows:

DW(A, B) = NA NB DC(A, B) / (NA + NB)

where DW(A, B) is the Ward's minimum-variance distance between clusters A and B, with NA and NB objects in them respectively, and DC(A, B) is the centroid distance between the two clusters, computed as the squared Euclidean distance between the centroids.
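A small illustration of the formula: given two clusters as arrays of objects, Ward's distance follows directly from the centroid distance. The data used here is assumed for illustration.

import numpy as np

def wards_distance(A, B):
    # DC(A, B): squared Euclidean distance between the two centroids.
    dc = np.sum((A.mean(axis=0) - B.mean(axis=0)) ** 2)
    na, nb = len(A), len(B)
    # DW(A, B) = NA * NB * DC(A, B) / (NA + NB)
    return na * nb * dc / (na + nb)

A = np.array([[1.0, 1.0], [2.0, 1.0], [1.5, 2.0]])
B = np.array([[6.0, 7.0], [7.0, 6.5]])
print(wards_distance(A, B))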

Agglomerative Method
 The agglomerative clustering method tries to discover such a hierarchical structure in a given dataset.
 The basic idea of the agglomerative method is to start out with n clusters for n data points,
that is, each cluster consisting of a single data point.
 Using a measure of distance, at each step of the method, the method merges two nearest
clusters, thus reducing the number of clusters and building successively larger clusters.
 The process continues until the required number of clusters has been obtained or all the
data points are in one cluster.
 The agglomerative method leads to hierarchical clusters in which at each step we build
larger and larger clusters that include increasingly dissimilar objects.
 The agglomerative method is basically a bottom-up approach which involves the following steps.
An implementation, however, may include some variation of these steps; a short sketch using standard library routines is given after the steps.
1. Allocate each point to a cluster of its own. Thus we start with n clusters for n objects.
2. Create a distance matrix by computing distances between all pairs of clusters, using, for example, the single-link metric or the complete-link metric. Some other metric may also be used. Sort these distances in ascending order.
3. Find the two clusters that have the smallest distance between them.
4. Remove the pair of clusters and merge them.
5. If there is only one cluster left then stop.
6. Compute all distances from the new cluster and update the distance matrix after the merger
and go to Step 3.
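Rather than coding these steps by hand, the agglomerative method is usually run through a library. A sketch using SciPy's hierarchical clustering routines is given below; the data and the choice of the single-link metric are illustrative assumptions.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])     # assumed data

# The method values 'single', 'complete', 'average', 'centroid' and 'ward'
# correspond to the linkage metrics discussed above.
Z = linkage(X, method='single')

# Cut the resulting hierarchy to obtain, say, two clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)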
Divisive Hierarchical Method
The divisive method is the opposite of the agglomerative method. It starts with the whole dataset as one cluster and then proceeds to recursively divide clusters into two sub-clusters, continuing until each cluster has only one object or some other stopping criterion has been reached. There are two types of divisive methods:
1. Monothetic: It splits a cluster using only one attribute at a time. An attribute that has the
most variation could be selected.
2. Polythetic: It splits a cluster using all of the attributes together. Two clusters far apart could
be built based on distance between objects.
A typical polythetic divisive method works like the following:
1. Decide on a method of measuring the distance between two objects.
2. Create a distance matrix by computing distances between all pairs of objects within the
cluster. Sort these distances in ascending order.
3. Find the two objects that have the largest distance between them. They are the most
dissimilar objects.
4. If the distance between the two objects is smaller than the pre-specified threshold and there
is no other cluster that needs to be divided then stop, otherwise continue.
5. Use the pair of objects as seeds of a K-means method to create two new clusters.
6. If there is only one object in each cluster then stop otherwise continue with Step 2.
In the above method, we need to resolve the following two issues:
• Which cluster to split next?
• How to split a cluster?
Which cluster to split next?
1. Split the clusters in some sequential order.
2. Split the cluster that has the largest number of objects.
3. Split the cluster that has the largest variation within it.
How to split a cluster?
We used a simple approach for splitting a cluster based on distance between the objects in the
cluster. A distance matrix is created and the two most dissimilar objects are selected as seeds of
two new clusters. The K-means method is then used to split the cluster.
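A sketch of one such split is given below, using the two most dissimilar objects of a cluster as the seeds of a two-cluster K-means run. Scikit-learn is assumed to be available and the data is illustrative.

import numpy as np
from sklearn.cluster import KMeans

cluster = np.array([[1.0, 1.0], [1.3, 0.8], [0.9, 1.2],
                    [6.0, 6.0], [6.2, 5.8], [5.9, 6.1]])   # assumed data

# Distance matrix between all pairs of objects in the cluster.
d = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=2)

# The two most dissimilar objects become the seeds of the two new clusters.
i, j = np.unravel_index(d.argmax(), d.shape)
seeds = cluster[[i, j]]

labels = KMeans(n_clusters=2, init=seeds, n_init=1).fit_predict(cluster)
print(labels)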
Advantages of the Hierarchical Approach
1. Hierarchical methods are conceptually simpler and can be implemented easily.
2. Hierarchical methods can provide clusters at different levels of granularity.
Disadvantages of the Hierarchical Approach
1. The hierarchical methods do not include a mechanism by which objects that have been
incorrectly put in a cluster may be reassigned to another cluster.
2. The time complexity of hierarchical methods can be shown to be O(n^3).
3. The distance matrix requires O(n^2) space and becomes very large for a large number of objects.
4. Different distance metrics and scaling of data can significantly change the results.
DENSITY-BASED METHODS
 In this approach, clusters are high-density collections of data that are separated by large spaces of low-density data (which is assumed to be noise).
 For each data point in a cluster, at least a minimum number of points must exist within a given distance.
 Data that is not within such high-density clusters is regarded as outliers or noise.
 The idea behind density-based clustering is that clusters are dense regions of probability density in the data set.
 DBSCAN (density based spatial clustering of applications with noise) is one example of a
density-based method for clustering.
 It requires two input parameters: the size of the neighbourhood (R) and the minimum number of points in the neighbourhood (N). Essentially these two parameters determine the density within the clusters the user is willing to accept, since they specify how many points must be in a region.
 The number of points not only determines the density of acceptable clusters but it also
determines which objects will be labelled outliers or noise.
 Objects are declared to be outliers if there are few other objects in their neighbourhood. The size parameter R determines the size of the clusters found. If R is big enough, there would be one big cluster and no outliers. If R is small, there will be small dense clusters and there might be many outliers.
Concepts of DBSCAN method
1. Neighbourhood: The neighbourhood of an object y is defined as all the objects that are
within the radius R from y.
2. Core object: An object y is called a core object if there are N objects within its neighbourhood.
3. Proximity: Two objects are defined to be in proximity to each other if they belong to the same cluster. Object x1 is in proximity to object x2 if two conditions are satisfied:
(a) The objects are close enough to each other, i.e. within a distance of R.
(b) x2 is a core object as defined above.
4. Connectivity: Two objects x1 and xn are connected if there is a path or chain of objects x1, x2, ..., xn from x1 to xn such that each consecutive pair of objects in the chain is in proximity.
Basic Algorithm for Density-based Clustering:
1. Select values of R and N.
2. Arbitrarily select an object p.
3. Retrieve all objects that are connected to p, given R and N.
4. If p is a core object, a cluster is formed.
5. If p is a border object, no objects are in its proximity. Choose another object. Go to Step 3.
6. Continue the process until all of the objects have been processed.
Essentially the above algorithm forms clusters by connecting neighbouring core objects and
those non-core objects that are boundaries of clustering. The remaining non-core objects are
labeled outliers.
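A minimal sketch using the DBSCAN implementation in scikit-learn is given below; its eps parameter plays the role of the neighbourhood size R and min_samples the role of N. The data is an illustrative assumption.

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.2],
              [5.0, 5.0], [5.1, 5.1], [4.9, 5.0],
              [9.0, 0.0]])                     # the last point is an outlier

# R = 0.5 and N = 3 in the terminology used above.
model = DBSCAN(eps=0.5, min_samples=3).fit(X)

# Points labelled -1 are the outliers (noise); the other labels are cluster ids.
print(model.labels_)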
DEALING WITH LARGE DATABASES
 Most clustering methods implicitly assume that all data is accessible in the main memory.
 Often the size of the database is not considered but a method requiring multiple scans of
data that is disk-resident could be quite inefficient for large problems.
K-Means Method for Large Databases
One modification of the K-means method handles data that is too large to fit in the main memory. The method first picks the number of clusters and their seed centroids and then attempts to classify each object into one of the following three groups (a small sketch of the summary statistics kept for the discard set follows the list):
(a) Those that are certain to belong to a cluster. These objects together are called the discard set.
Some information about these objects is computed and saved. This includes the number of
objects n, a vector sum of all attribute values of the n objects (a vector S) and a vector sum
of squares of all attribute values of the n objects (a vector Q).
(b) Those that are not certain to belong to a cluster; they are sufficiently far away from each cluster's centroid that they cannot yet be put in the discard set of objects. These objects together are called the compression set.
(c) The remaining objects that are too difficult to assign to either of the two groups above.These
objects are called the retained set and are stored as individual objects.
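The triple (n, S, Q) kept for the discard set is sufficient to recover the mean and variance of the discarded objects without storing the objects themselves, as the following small sketch (with assumed data) shows.

import numpy as np

discarded = np.array([[2.0, 3.0], [2.2, 2.8], [1.8, 3.1]])  # assumed objects

n = len(discarded)                  # number of objects
S = discarded.sum(axis=0)           # vector sum of attribute values
Q = (discarded ** 2).sum(axis=0)    # vector sum of squares of attribute values

mean = S / n
variance = Q / n - mean ** 2        # per-attribute variance recovered from n, S, Q
print(n, mean, variance)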
Hierarchical Method for Large Databases—Concept of Fractionation
Dealing with large datasets is difficult using hierarchical methods since the methods require an N
x N distance matrix to be computed for N objects.
It is based on the idea of splitting the data into manageable subsets called "fractions" and then
applying a hierarchical method to each fraction. The concept is called fractionation.
Algorithm:
1. Split the large dataset into fractions of size M.
2. The hierarchical clustering technique being used is applied to each fraction. Let the number
of clusters so obtained from all the fractions be C.
3. For each of the C clusters, compute the mean of the attribute values of the objects in it. Let this mean vector be mi, i = 1, ..., C. These cluster means are called meta-observations.
4. If the number of meta-observations C is still too large (greater than M), go to Step 1, otherwise apply the same hierarchical clustering technique to the meta-observations obtained in Step 3.
5. Allocate each object of the original dataset to the cluster with the nearest mean obtained in
Step 4.
The cost of this algorithm is linear in terms of the size of the dataset. The accuracy of the results
is related to the accuracy of the clustering algorithm used.
QUALITY AND VALIDITY OF CLUSTER ANALYSIS METHODS
 Evaluation is particularly difficult since all clustering methods will produce clusters even if
there are no meaningful clusters in the data.
 In cluster analysis there is no test data available as there often is in classification. Also, even
if the process has been successful and clusters have been found, two different methods may
produce different clusters.
 Different methods are biased towards different types of clusters. For example, a method that minimizes the maximum distance between the points in a cluster is biased towards compact, roughly spherical clusters.
 The results of K-means may in the first instance be evaluated by examining each attribute's
mean for each cluster in an attempt to assess how far each cluster is from the other. Another
approach is based on computing within cluster variation (I) and between clusters variation
(E). These variations may be computed as follows:
Let the number of clusters be k and let the clusters be Ci, i = 1, ..., k. Let the total number of objects be N and let the number of objects in cluster Ci be Mi, so that
M1 + M2 + ... + Mk = N
The within-cluster variation I for the objects in a cluster Ci is defined as the average squared distance of each object from the centroid of the cluster. If mi is the centroid (mean) of cluster Ci, then

mi = (1/Mi) * (sum of xj over all objects xj in Ci)

and the within-cluster variation of Ci is the average of (xj - mi)^2 over all objects xj in Ci.

The between-cluster variation E may now be computed given the centroids of the clusters. It is the average sum of squares of pairwise distances between the centroids of the k clusters.
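A small sketch computing the within-cluster variation of each cluster and the between-cluster variation E from the centroids, for an assumed clustering:

import numpy as np

clusters = [np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1]]),
            np.array([[5.0, 5.0], [5.3, 4.8]])]            # assumed clusters

centroids = [C.mean(axis=0) for C in clusters]

# Within-cluster variation: average squared distance of objects from the centroid.
within = [np.mean(np.sum((C - m) ** 2, axis=1))
          for C, m in zip(clusters, centroids)]

# Between-cluster variation: average squared pairwise distance between centroids.
pairs = [(i, j) for i in range(len(centroids)) for j in range(i + 1, len(centroids))]
between = np.mean([np.sum((centroids[i] - centroids[j]) ** 2) for i, j in pairs])

print(within, between)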
Evaluating the quality of clustering methods or results of a cluster analysis is a challenging task.
The quality of a method involves a number of criteria:
1. Efficiency of the method
2. Ability of the method to deal with noisy and missing data
3. Ability of the method to deal with large problems
4. Ability of the method to deal with a variety of attribute types and magnitudes
 Once several clustering results have been obtained, they can be combined, hopefully
providing better results than those from just one clustering of the given data.
 Experiments using these techniques suggest that they can lead to improved quality and
robustness of results.

CLUSTER ANALYSIS SOFTWARE
 A more comprehensive list of cluster analysis software is available on the web. ClustanGraphics7 from Clustan offers a variety of clustering methods including K-means, density-based and hierarchical cluster analysis.
 The software provides facilities to display results of clustering, including dendrograms and scatterplots. For more details visit:
http://www.clustan.com/index.html
 CViz Cluster Visualization from IBM is a visualization tool designed for analysing high-
dimensional data in large, complex data sets. It displays the most important factors relating
clusters of records. For more details visit: http://www.alphaworks.ibm.com/tech/cviz
 Cluster 3.0 is open source software originally developed by Michael Eisen at Stanford. It uses the K-means method, with multiple trials to find the best clustering solution. For more details visit:
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/software.htm
 CLUTO provides a set of clustering methods including partitional, agglomerative, and
graph-partitioning based on a variety of similarity/distance metrics.
UNIT – IV
WEB DATA MINING
INTRODUCTION
What do you mean by Web Data Mining? (2m)
Explain about the various categories of web data mining. (5m)
* The web has become the number one source for information for Internet users. The
information is very diverse in content as well as in topics.
* A feature of web is the hyperlink by which one web page provides links to other pages
allowing a user to traverse the web following links of interest to him.
* Web is an open medium. This means that the web has grown exponentially which is its
strength as well as its weakness.
* Strength – one can find information on just about anything
* Weakness – there is the problem of abundance of information.
* Ex: searching for the phrase “data mining” on Google will find you more than 2.5 million
pages. How to find the most useful one? How reliable is the information? It is the aim of web
mining to help with problems like this.
* “Web mining is the application of data mining techniques to find interesting and
potentially useful knowledge from web data. It is normally expected that either the
hyperlink structure of the web or the web log data or both have been used in the mining
process.”
* Web mining may be divided into several categories:
Web Content Mining:
 It deals with discovering useful information or knowledge from web page contents.
 Web content mining focuses on the web page content rather than the links.
 Web content is a rich information resource consisting of many types of information, for
example: unstructured free text, image, audio, video & metadata as well as hyperlinks.
 Content of web pages includes no machine readable semantic information.
 Search engines, subject directories, intelligent agents, cluster analysis & portals are employed
to find what a user might be looking for.

Web Structure Mining:
 It deals with discovering & modeling the link structure of the web. Work has been carried out
to model the web based on the topology of the hyperlinks.
 This can help in discovering similarities between sites or in discovering important sites for a
particular topic or discipline or in discovering web communities.
Web Usage Mining:
 It deals with understanding user behaviour in interacting with the web or with a web site.
 One of the aims is to obtain information that may assist website reorganization or assist site adaptation to better suit the user.
 Mined data often includes data logs of user’s interactions with the web.
 Logs include information about the referring pages, user identification, the time a user spends at a website and the sequence of pages visited.
 Information is also collected via cookie files.
 While web structure mining shows that page A has a link to page B, web usage mining shows
who or how many people took that link, which site they came from & where they went when
they left page B.
 Following are the major differences between searching conventional text and searching the
web:
1. Hyperlink:
 Text documents do not have hyperlinks. Hard copy documents in a library are usually structured, but no linkage between them is identified.
 But web hyperlinks provide important information. A link from page A to page B indicates that the author of page A considered page B to be of some value.
2. Types of information:
Web pages can consist of text, frames, multimedia objects, animation & other types of
information quite different from text documents which mainly consist of text but may
have some other objects like tables, diagrams & some images.
3. Dynamics:
Text documents do not change unless new edition of a book appears while web pages
change frequently because the information on the web including linkage information is
updated all the time & new pages appear almost every second.
4. Quality:
Text documents are generally of high quality, but much of the information on the web is of low quality. Only about 10% of web pages are really useful & of high quality.
5. Huge Size:
The size of the web is approaching 100 terabytes, which is equivalent to about 200 million books.
6. Document Use:
Compared to the use of conventional documents, the use of web documents is quite different. The web users tend to pose short queries, browse perhaps the first page of the results and then move on.

WEB TERMINOLOGY AND CHARACTERISTICS
Define the term URL. (2m)
What do you mean by proxy & domain name server? (2m)
Explain briefly the web terminologies used in web data mining. (5m)
Explain in detail about web terminology & its characteristics. (5m / 10m)
 We define some web terminology based on the work of World Wide Web consortium
(W3C).
 WWW (World Wide Web) is the set of all the nodes which are interconnected by
hypertext links.
 Link expresses one or more relationships between two or more resources. Links may also
be established within a document by using anchors.
 A Web page is a collection of information, consisting of one or more web resources,
intended to be rendered simultaneously & identified by a single URL.
 A Website is a collection of interlinked web pages, including a home page, residing at
the same network location.
 A client browser is the primary user interface to the web. It is a program which allows a
person to view the contents of web pages & for navigating from one page to another.
 A Uniform Resource Locator (URL) is an identifier for an abstract or physical
resource, for example a server and a file path or index. URLs are location dependent &
each URL contains 4 distinct parts:
 Protocol types (usually http)
 Name of web server
 Directory path
 File name
If a file is not specified, index.html is assumed.
 A Web server serves web pages using http to client machines so that a browser can
display them.
 A client is the role adopted by an application when it is retrieving a web resource.
 A proxy is an intermediary which acts as both a server and a client for the purpose of
retrieving resources on behalf of other clients.
 A domain name server is a distributed database of name to address mappings. When a
DNS server looks up a computer name, it either finds it in its list, or asks another DNS
server which knows more names.
 A cookie is the data sent by a web server to a web client, to be stored locally by the client
and sent back to the server on subsequent requests.
 Obtaining information from the web using a search engine is called information “pull”
while information sent to users is called information “push”.
Ex: users may register with a site & then information is sent (pushed) to such users
without them requesting it.

Graph Terminology
Define Directed & undirected graph in terms of web. (2m)
Define Diameter of a graph. (2m)
Explain basic graph terms & the components of the web. (5m)
 “Directed graph” is a set of nodes (that corresponds to pages on the web) denoted by V and
edges (which corresponds to links on the web) denoted by E. (i.e.,) a graph is (V,E) where all
edges are directed.
 “Undirected graph” may also be represented by nodes and edges (V,E) but the edges have
no direction specified.
 Directed graph is like a link that points from one page to another in web, while undirected
graph is not like the pages & links on the web unless we assume the possibility of traversal in
both directions.
 A graph may be searched either by a breadth-first search or by a depth-first search.
 Breadth-first is based on first searching all the nodes that can be reached from the node
where the search is starting and once these nodes have been searched, searching the nodes at
the next level that can be reached from those nodes and so on.
 Depth-first search is based on searching any unvisited descendants of a given node first, then visiting the node and then any brother nodes.
 “Diameter of a graph” – the maximum of the minimum distances between all possible ordered node pairs (u,v), i.e., the maximum number of links that one would need to follow starting from any page u to reach any page v, assuming that the best path has been followed.
 An investigation on the structure of the web was carried out & it was found that the web is
not a ball of highly connected nodes.
 It displayed structural properties & the web could be divided into the following 5
components:
1. The Strongly Connected Core (SCC):
 This part of the web was found to consist of about 30% of the web, which is still very large given the more than 4 billion pages on the web in 2004.
 This is the heart of the web & its main property is that pages in the core can reach
each other following directed edges (i.e., hyperlinks).
2. The IN Group:
 This part of the web was found to consist of about 20% of the web.
 Main property – pages in the group can reach the SCC but cannot be reached from it.
3. The OUT Group:
 This part of the web was found to consist of about 20% of the web.
 Main property – pages in this group can be reached from SCC but cannot reach the
SCC.
4. Tendrils:
 This part of the web was found to consist of about 20% of the web.
 Main property – pages cannot be reached by the SCC and cannot reach the SCC.
 It does not imply that these pages have no linkages to pages outside the group since
they could well have linkages from the IN group & to the OUT group.
5. The Disconnected Group:
 This part of the web was found to be less than 10% of the web & is essentially
disconnected from the rest of the web world.
 Ex: personal pages at many sites that link to no other page and have no links to them.

Size of the Web
List the categories of web pages. (2m)
Write the categories of web pages. (2m)
 The Deep web includes information stored in searchable databases often inaccessible to
search engines. This information can often only be accessed by using each web site’s
interface. Some of this information may be available only to subscribers.
 The Shallow web (or indexed web) is the information on the web that the search engines can
access without accessing the web databases.
 It has been estimated that the deep web is about 500 times the size of the shallow web. The
web pages are very dynamic, changing almost daily. Many new web pages are added to the
web every day.
 Perhaps web pages are used for other purposes as well, for example, communicating
information to small number of individuals via the web rather than via email.
 In many cases, this makes good sense. (Wasting disk storage in mailboxes is avoided)
 But if this grows, then a very large number of such web pages with a short life span & low
connectivity to other pages are generated each day.
 Large numbers of websites disappear every day & create many problems on the web. Links
from even well known sites do not always work.
 Not all results of a search engine search are guaranteed to work. To overcome these
problems, web pages are categorized as follows:
1. A web page that is guaranteed not to change over time.
2. A web page that will not delete any content, may add content / links but the page will
not disappear.
3. A web page that may change content / links but the page will not disappear.
4. A web page without any guarantee.

Web Metrics
Write short notes on web metrics. (5m)
 There are a number of properties (other than size & structure of the web) about the web that
are useful to measure.
 Some of the measures are based on distances between the nodes of the web graph.
 It is possible to define how well connected a node is by using the concept of centrality of a
node.
 Centrality may be out-centrality, which is based on distances measured from the node using
its out-links while in-centrality is based on distances measured from other nodes that are
connected to the node using the in-links.
 Based on these metrics, it is possible to define the concept of compactness that varies from 0
to 1, 0 for a completely disconnected web graph and 1 for a fully connected web graph.
 Perhaps the most important measurements about web pages are about a page’s relevance and
its quality.
 Relevance is usually related to a user query since it concerns the user finding pages that are
relevant to his query and may be defined in a number of ways.
 A simple approach involves relevance to be measured by the number of query terms that
appear in the page.
 Another approach is based on in-links from relevant pages. In this, relevance of a page may
be defined as the sum of the number of query terms in the pages that refer to the page.
 Relevance may also use the concept of co-citation. In co-citation, if both pages ‘a’ & ‘b’
point to a page ‘c’ then it is assumed that ‘a’ and ‘b’ have a common interest. Similarly if ‘a’
points to both ‘b’ and ‘c’, then we assume that ‘b’ & ‘c’ also share a common interest.
 Quality is not determined by page content since there is no automatic means of evaluating the
quality of content. And it is determined by the link structure.
 Example: if page ‘a’ points to page ‘b’ then it is assumed that page ‘a’ is endorsing page ‘b’
and we can have some confidence in the quality of page ‘b’.

LOCALITY AND HIERARCHY IN THE WEB
What are the basic principles in designing the content & structure of web? (2m)
Explain briefly about the locality & hierarchy in the web. (5m)
 The web shows a strong hierarchical structure. A website of any enterprise has the homepage
as the root of the tree and as we go down the tree we find more detailed information about
the enterprise.
 Example: the homepage of a university will provide basic information and then links, for
example, to: Prospective students, Staff, research, information for current students,
information for current staff.
 Prospective students node will have number of links like: courses offered, admission
requirements, scholarships available, semester dates, etc..
 Web also has a strong locality feature to the extent that almost two-thirds of all links are to
sites within the enterprise domain.
 The roughly one-third of links that point to sites outside the enterprise domain have a higher percentage of broken links.
 Websites fetch information from a database to ensure that the information is accurate &
timely.
 Web pages can be classified as:
1. Homepage or the head page: These pages represent an entry point for the web site of
an enterprise (so frequently visited) or a section within the enterprise or an
individual’s web page.
2. Index page: These pages assist the user to navigate through the enterprises website. A
homepage may also act as an index page.
3. Reference page: These pages provide some basic information that is used by a
number of other pages. Ex: each page in a website may have link to a page that
provides enterprise’s privacy policy.
4. Content page: These pages only provide content & have little role in assisting a
user’s navigation. Often they are larger in size, have few out-links & are often the leaf
nodes of a tree.
 Web site structure & content are based on careful analysis of what the most common user
questions are. The content should be organized into categories in a way that, traversing the
site is simple & logical.
 Careful web user data mining can help. A number of simple principles have been developed
to help design the structure & content of a web site.
 Three basic principles are:
1. Relevant linkage principle: It is assumed that links from a page point to other relevant resources. Links are assumed to reflect the judgment of the page creator; by providing a link to another page, the creator is recommending that page as relevant.
2. Topical unity principle: It is assumed that web pages that are co-cited (i.e., linked
from the same pages) are related.
3. Lexical affinity principle: It is assumed that the text and the links within a page are
relevant to each other.
 Unfortunately not all pages follow these basic principles, resulting in difficulties for web
users & web mining researchers.
WEB CONTENT MINING
Define web content mining. (2m)
What are intelligent web agents? (2m)
Explain about web content mining & DIPRE algorithm. (5m)
Explain web content mining in detail. (10m)
 Web content mining deals with discovering useful information from the web.
 When we use search engines like Google to search contents on the web, search engines find
pages based on location & frequency of keywords on the page although some now use the
concept of page rank.
 If we consider the type of queries that are posed to search engines, we find that if the query is very specific we face the scarcity problem.
 If the query is for a broad topic, we face the abundance problem. Because of these problems, keyword search using a search engine is only suggestive of a document’s relevance. The user is then left with the task of finding relevant documents.
 Brin presents an example of finding content on the web. It shows how relevant information
from a wide variety of sources presented in a wide variety of formats may be integrated for
the user.
 Example involves extracting a relation of books in the form of (author, title) from the web
starting with a small sample list.
 Problem is defined as: we wish to build a relation R that has a number of attributes. The
information about tuples of R is found on web pages but is unstructured.
 Aim is to extract the information and structure it with low error rate. Algorithm proposed is
called Dual Iterative Pattern Relation Extraction (DIPRE). It works as follows:
1. Sample: Start with a sample S provided by the user.
2. Occurrence: Find occurrences of tuples starting with those in S. Once tuples are found, the context of every occurrence is saved. Let these be O.
(S → O)
3. Patterns: Generate patterns based on the set of occurrences O. This requires generating patterns with similar contexts.
(O → P)
4. Match patterns: Web is now searched for the patterns.
5. Stop if enough matches found, else go to step 2.
 Here the pattern is defined as a tuple like (order, URL prefix, prefix, middle, suffix). It may
then match strings like (author, middle, title) or (title, middle, author).
 In step 2 it is assumed that the occurrence consists of title and author with a variety of other
information.
 Web pages are usually semi-structured (ex: HTML documents). Database generated HTML
pages are more structured, however web pages consist of unstructured free text data which
makes it more difficult to extract information from them.
 The semi-structured web pages may be represented based on the HTML structures inside the
documents. Some may also use hyperlink structure.
 A different approach called database approach involves inferring the structure of a website
& transforming the website content into a database.
 Web pages are published without any control on them. Some of the sophisticated searching
tools being developed are based on the work of the artificial intelligence community. These
are called Intelligent Web Agents.
 Some web agents replace the work of search engines in finding information. Ex: ShopBot
finds information about a product that the user is searching for.

Web Document Clustering
Define the terms browsable summaries & snippet tolerance. (2m)
How will you cluster web documents? Explain. (5m)
 Web document clustering is an approach to find relevant documents on a topic or about
query keywords. Search engines return huge, unmanageable list of documents, and finding
the useful one is often tedious.
 User could apply clustering to a set of documents returned by search engines with the aim of
finding meaningful clusters that are easier to interpret.
 It is not necessary to insist that a document can only belong to one cluster since in some
cases it is justified to have document belong to two or more clusters.
 Web clustering may be based on content alone, may be based on both content & links or
based only on links.
 One approach that is specifically designed for web document cluster analysis is Suffix Tree Clustering (STC), which uses a phrase-based clustering approach rather than single word frequencies.
 In STC, the key requirements of a web document clustering algorithm include:
 Relevance: This is most obvious requirement. We want clusters that are relevant
to user query.
 Browsable summaries: The clusters must be easy to understand. The user should be able to quickly browse the description of a cluster.
 Snippet tolerance: The clustering method should not require whole documents & should be able to produce relevant clusters based only on the information that the search engine returns.
 Performance: The clustering method should be able to process the results of the search engine quickly & provide the resulting clusters to the user.
 STC algorithm consists of document cleaning, identifying base clusters & combining base
clusters.
 STC only requires the snippets of text that are returned by the search engine and removes
unwanted text that might get in the way of identifying common phrases. (ex: prefixes,
punctuations,..)
 The second step of algorithm uses the cleaned snippets to mark sentence boundaries and then
build a suffix tree to efficiently identify a set of documents that share common phrases.
 Once a suffix tree has been constructed, number of base clusters can be derived, by reading
the tree from root to each leaf which gives a phrase.
 If such phrase belongs to more than two snippets, then the phrase is considered a base cluster.
Each base cluster is given a score based on number of words in the phrase and the number of
snippets that have that phrase.
 Many base clusters will be overlapping, since documents often share more than just one
phrase. So clusters must be consolidated.
 A similarity index for every pair of base clusters is created based on the documents that are
in each cluster. If the base clusters are similar, they are combined.

Finding similar web pages
Define replication & mirroring. (2m)
Write the reasons for existence of identical pages in the web. (2m)
Define resemblance, containment. (2m)
It has been found that almost 30% of all web pages are very similar to other pages. There are
many reasons for identical pages. For example:
1. A local copy may have been made to enable faster access to the material.
2. FAQs are duplicated since such pages may be used frequently.
3. Online documentation of popular software like Unix may be duplicated for
local use.
4. There are mirror sites that copy highly accessed sites to reduce traffic.
In some cases, documents are not identical because different formatting might be used at
different sites. A larger document may be split into smaller ones or a composite document
may be joined together to build a single document.
Copying a single web page is called replication and copying an entire web site is called
mirroring.
Similarity between web pages usually means content-based similarity. It is also possible to
consider link-based & usage-based similarity.
Link-based is related to the concept of co-citation & is used for discovering a core set of web
pages on a topic.
Usage-based is useful in grouping pages or users into meaningful groups.
Content-based is based on comparing textual content of the web pages. Non-text contents are
not considered. We define two concepts:
 Resemblance: The resemblance of two documents is defined to be a number between 0 and 1, with 1 indicating that the two documents are virtually identical & any value close to 1 indicating that the documents are very similar.
 Containment: The containment of one document in another is defined as a number between 0 and 1, with 1 indicating that the first document is completely contained in the second.
A number of approaches exist to assess the similarity of documents. One brute-force approach is to compare two documents using software like ‘diff’ in the Unix OS, which compares two documents as files.
Other string comparison algorithms may be used to find how many characters need to be
deleted, changed or added to transform one document to other, but this is expensive for
comparing millions of documents.
Issues in document matching are:
 If we are looking to compare millions of documents then the storage requirement of the
method should not be large.
 Documents may be in HTML, PDF, MS Word. They need to be converted to text for
comparison, which may introduce errors.
 Method should be robust. i.e., it should not be possible to avoid the matching process
with the changes to a document.

Finger Printing
What do you mean by shingles? (2m)
Define full fingerprinting. (2m)
Explain with an example about fingerprinting documents. (5m / 10m)
 An approach for comparing large number of documents is based on the idea of fingerprinting
documents.
 A document may be divided into all possible substrings of length L. These substrings are
called shingles. Based on the shingles we can define resemblance R(X,Y) and containment
C(X,Y) between two documents X & Y as follows.
 We assume S(X) & S(Y) to be a set of shingles for documents X and Y respectively.
R(X, Y) = |S(X) ∩ S(Y)| / |S(X) ∪ S(Y)|
C(X, Y) = |S(X) ∩ S(Y)| / |S(X)|
 Following algorithm may be used to find similar documents:
1. Collect all the documents that one wishes to compare.
2. Choose a suitable shingle width and compute the shingles for each
document.
3. Compare the shingles for each pair of documents.
4. Identify those documents that are similar.
 This algorithm is inefficient for a large collection of documents. The web is very large & the algorithm requires enormous storage for the shingles and takes a long time. This approach is called full fingerprinting.
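A small sketch of full fingerprinting with word shingles is given below; it computes the resemblance R(X, Y) and containment C(X, Y) defined above for the same two sentences used in the worked example that follows.

def shingles(text, width=3):
    # All substrings of the given width, measured in words.
    words = text.split()
    return {' '.join(words[i:i + width]) for i in range(len(words) - width + 1)}

doc1 = "the web continues to grow at a fast rate"
doc2 = "the web is growing at a fast rate"

s1, s2 = shingles(doc1), shingles(doc2)
common = s1 & s2

resemblance = len(common) / len(s1 | s2)   # R(X, Y)
containment = len(common) / len(s1)        # C(X, Y)
print(sorted(common), resemblance, containment)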
Example:
 We find if the two documents with the following content are similar:
Document1: “the web continues to grow at a fast rate”
Document2: “the web is growing at a fast rate”
 First step: making a set of shingles from the documents. We obtain the below shingles if we
assume the shingle length to be 3 words.
Shingles in Doc1 Shingles in Doc2
the web continues the web is
web continues to web is growing
continues to grow is growing at
to grow at growing at a
grow at a at a fast
at a fast a fast rate
a fast rate

 Comparing two sets of shingles we find that only 2 of them are identical. So the documents
are not very similar. We illustrate the approach using 3 shingles that are the shortest in the
number of letters including spaces.

Shingles in Doc1      Number of letters     Shingles in Doc2     Number of letters
the web continues     17                    the web is           10
web continues to      16                    web is growing       14
continues to grow     17                    is growing at        13
to grow at            10                    growing at a         12
grow at a              9                    at a fast             9
at a fast              9                    a fast rate          11
a fast rate           11
 We select 3 shortest shingles for comparison. For first document, these are “to grow at”,
“grow at a”, “at a fast”.
 For second document, these are “the web is”, “at a fast”, “a fast rate”. There is only one
match out of 3 shingles.
 False positives with the original document would be obtained for documents like “the
Australian economy is growing at a fast rate”.
 False negatives would be obtained for strings like “the web is growing at quite a rapid rate”.
 So, small length shingles cause many false positives while larger shingles result in more false
negatives. A better approach involves randomly selecting shingles of various lengths.
 Issues in comparing documents using fingerprinting are:
 Should the shingle length be in number of words or characters?
 How much storage would be needed to store shingles?
 Should upper & lower-case letters be treated differently?
 Should punctuation marks be ignored?
 Should end of paragraph be ignored?

WEB USAGE MINING
Mention some information obtained from web usage mining. (2m)
Explain web usage mining. (5m)
 The objective in web usage mining is to understand and be able to predict user behaviour in interacting with the web or with a web site, in order to improve the quality of service.
 Other aims are to obtain information & discover usage patterns that may assist web site
design / redesign, perhaps to assist navigation through the site.
 The mined data includes web data repositories such as logs of users’ interactions with the web: web server logs, proxy server logs, browser logs, etc.
 Information collected in the web server logs includes access, referrer and agent information. Access information records the client’s interaction with the server, referrer information is about the referring page & the agent information is about the browser used.
 Some of the routine information may be obtained using tools that are available at a low cost. Using such tools, it is possible to find at least the following information:
1. Number of hits – number of times each page in the web site has been viewed.
2. Number of visitors – number of users who came to the site.
3. Visitor referring web site – web site URL of the site the user came from.
4. Visitor referral web site – web site URL of the site where the user went when he left
the web site.
5. Entry point – which web site page the user entered from.
6. Visitor time & duration – time & day of visit and how long the visitor browsed the
site.
7. Path analysis – list of path of pages that the user took.
8. Visitor IP address – this helps in finding which part of the world the user came from.
9. Browser type
10. Platform
11. Cookies
 This simple information can assist an enterprise to achieve the following:
1. Shorten the paths to high visit pages
2. Eliminate or combine low visit pages
3. Redesign some pages including homepage to help user navigation
4. Redesign some pages so that the search engines can find them
5. Help evaluate effectiveness of an advertising campaign
 Web usage mining may be desirable to collect information on:
 Path traversed: What paths do the customers traverse? What are the most commonly
traversed paths? These patterns need to be analysed & acted upon.
 Conversion rates: What are the basket-to-buy rates for each product?
 Impact of advertising: Which banners are pulling in the most traffic? What is their
conversion rate?
 Impact of promotions: Which promotions generate the most sales?
 Web site design: Which links do the customers click most frequently? What links do
they buy from most frequently?
 Customer segmentation: What are the features of customers who stop without
buying? Where do most profitable customers come from?
 Enterprise search: Which customers use enterprise search? Are they likely to
purchase? What they search?
 Web usage mining also deals with caching & proxy servers. Not all page requests are
recorded in the server log because pages are often cached and served from the cache when
the user requests them again.
 Caching occurs at several levels. Browser itself has a cache to keep copies of recently visited
pages which may be retained only for a short time.
 Enterprises maintain a proxy server to reduce Internet traffic & to improve performance.
 The proxy server intercepts requests for web pages from any user in the enterprise & serves them from the proxy if a copy that has not expired is resident in the proxy.
 Use of caches & proxies reduces the number of hits that the web server log records.
 One may be interested in the behavior of users, not just in one session but over several
sessions. Returning visitors may be recognized using cookies (which is issued by the web
server & stored on the client). The cookie is then presented to web server on return visits.
 Log data analysis has been investigated using the following techniques:
 Using association rules
 Using cluster analysis
 In usage analysis, we want to know the sequence of nodes visited by a user, which is not
possible using association rules analysis.
 It should be noted that, in association rules mining, the order of the items does not play any role. We are only interested in finding which items appear with which other items frequently.
 In fact, if pages A and B appear together in an association rule, it is difficult to tell if A was
traversed first or B was.

WEB STRUCTURE MINING
Mention the two steps in HITS algorithm. (2m)
Explain HITS algorithm in detail. (10m)
 Aim of web structure mining is to discover the link structure or the model that is assumed to
underline the web. Model may be based on the topology of hyperlinks.
 This can help in discovering similarity between sites or in discovering authority sites for a
particular topic or in discovering overview or survey sites that point to many authority sites
(such sites are called hubs).
 The links on web pages provide a useful source of information that may be bound together in
web searches.
 Kleinberg has developed a connectivity analysis algorithm called Hyperlink-Induced Topic
Search (HITS) based on the assumption that links represent human judgement.
 Algorithm also assumes that for any topic, there are a set of “authoritative” sites that are
relevant on the topic and there are “hub” sites that contain useful links to relevant sites on the
topic.
 The HITS algorithm is based on the idea that if the creator of page p provides a link to page q, then p confers some authority on page q.
 But not all links confer authority, since some may be for navigational purposes, e.g. links to the home page. Definitions required for HITS:
 A graph is a set of vertices & edges (V,E). The vertices correspond to the pages & the edges correspond to the links.
 A directed edge (p,q) indicates a link from page p to page q.
 HITS algorithm has 2 major steps:
1. Sampling step: It collects a set of relevant web pages given a topic.
2. Iterative step: It finds hubs & authorities using the information collected during
sampling.
 Example: for a query “web mining” the algorithm involves carrying out the sampling step by
querying a search engine and then using the most highly ranked web pages retrieved for
determining the authorities & hubs.
 Posing a query to a search engine often results in abundance. In some cases, the search
engine may not retrieve all relevant pages for the query.
 Query for Java may not retrieve pages for object-oriented programming, some of which may
also be relevant.

Mention the steps in HITS algorithm. (2m)
Define web communities. (2m)
Explain the problems with the HITS algorithm. (5m)
Explain HITS algorithm in detail with an example. (10m)
Step – 1: Sampling step
 First step involves finding a subset of nodes or a sub graph S, which are relevant
authoritative pages. To obtain such a sub graph, the algorithm starts with a root set of, say
200 pages selected from the result of searching for the query in a traditional search engine.
 Let the root set be R. Starting from R, we wish to obtain a set S that has the following
properties:
1. S is relatively small.
2. S is rich in relevant pages given the query.
3. S contains most of the strongest authorities.
 HITS expands the root set R into a base set S by using the following algorithm (a small sketch of this expansion step is given at the end of this step):
1. Let S=R
2. For each page in S, do steps 3 to 5.
3. Let T be the set of all pages S points to.
4. Let F be the set of all pages that point to S
5. Let S=S+T+ some or all of F. (some if F is large)
6. Delete all links with the same domain name.
7. This S is returned.
 Set S is called base set for the given query. We find the hubs & authorities from the base set
as follows:
 One approach is ordering them by the count of their out-degree & the count of their in-
degree.
 Before starting the hubs & authorities step of the algorithm, HITS removes all links between pages on the same web site (or same domain) in step 6, i.e. links between pages on the same site are assumed to be for navigational purposes, not for conferring authority.
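A minimal sketch of this expansion step is given below, with the link structure held in plain Python dictionaries. The graph, the limit placed on F and the omission of the same-domain filtering of step 6 are simplifications assumed for illustration.

def expand_root_set(R, out_links, in_links, max_in=50):
    # Step 1: start with S = R.
    S = set(R)
    for page in R:
        T = out_links.get(page, set())      # pages the root page points to
        F = in_links.get(page, set())       # pages that point to the root page
        # Step 5: add T and some of F (all of F only if it is small).
        S |= T | set(list(F)[:max_in])
    return S

# Illustrative link structure.
out_links = {'r1': {'a', 'b'}, 'r2': {'b', 'c'}}
in_links = {'r1': {'h1', 'h2'}, 'r2': {'h3'}}
print(expand_root_set({'r1', 'r2'}, out_links, in_links))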

Step – 2: Finding Hubs & Authorities
The algorithm for finding hubs & authorities is (a small sketch follows the steps):
1. Let a page p have a non-negative authority weight xp & a non-negative hub weight yp. Pages with large weights xp will be classified to be the authorities (similarly for hubs).
2. The weights are normalized so that the squared sum of each type of weight is 1, since only the relative weights are important.
3. For a page p, the value of xp is updated to be the sum of yq over all pages q that link to p.
4. For a page p, the value of yp is updated to be the sum of xq over all pages q that p links to.
5. Continue with Step 2 unless a termination condition has been reached.
6. On termination, the output of the algorithm is a set of pages with the largest xp weights, which can be assumed to be authorities, & those with the largest yp weights, which can be assumed to be hubs.
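A sketch of this iterative step on a small assumed link graph is given below; it follows the update and normalisation rules just listed, whereas the worked example that follows iterates without normalisation, so the absolute numbers differ.

import math

out_links = {'A': ['G', 'H'], 'B': ['A', 'G'], 'G': ['H'], 'H': ['G']}

pages = sorted(set(out_links) | {q for qs in out_links.values() for q in qs})
x = {p: 1.0 for p in pages}    # authority weights
y = {p: 1.0 for p in pages}    # hub weights

for _ in range(20):
    # Authority update: xp is the sum of yq over all pages q that link to p.
    x = {p: sum(y[q] for q in pages if p in out_links.get(q, [])) for p in pages}
    # Hub update: yp is the sum of xq over all pages q that p links to.
    y = {p: sum(x[q] for q in out_links.get(p, [])) for p in pages}
    # Normalise so that the squared weights of each type sum to 1.
    nx = math.sqrt(sum(v * v for v in x.values()))
    ny = math.sqrt(sum(v * v for v in y.values()))
    x = {p: v / nx for p, v in x.items()}
    y = {p: v / ny for p, v in y.items()}

print(sorted(x.items(), key=lambda kv: -kv[1]))   # likely authorities
print(sorted(y.items(), key=lambda kv: -kv[1]))   # likely hubs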

Example:
 Assume a query has been specified. First step has been carried out & a set of pages that
contain most of the authorities has been found.
 Let the set be given by the following 10 pages. Their out-links are listed below:
Page A (out-links to G, H, I, J)
Page B (out-links to A, G, H, I, J)
Page C (out-links to B, G, H, I, J)
Page D (out-links to B, G, H, J)
Page E (out-links to B, H, I, J)
Page F (out-links to B, G, I, J)
Page G (out-links to H, I)
Page H (out-links to G, I, J)
Page I (out-links to H)
Page J (out-links to F, G, H, I)
 Connection matrix for this information is as follows:

Page A B C D E F G H I J
A 0 0 0 0 0 0 1 1 1 1
B 1 0 0 0 0 0 1 1 1 1
C 0 1 0 0 0 0 1 1 1 1
D 0 1 0 0 0 0 1 1 0 1
E 0 1 0 0 0 0 0 1 1 1
F 0 1 0 0 0 0 1 0 1 1
G 0 0 0 0 0 0 0 1 1 0
H 0 0 0 0 0 0 1 0 1 1
I 0 0 0 0 0 0 0 1 0 0
J 0 0 0 0 0 1 1 1 1 0
 Every row in the matrix shows the out-links from the web page that is identified in the first
column. Every column shows the in-links to the web page that is identified in the first row
of the table.
 Another representation is:

Web page In-links Out-links


A B G, H, I, J
B C,D,E,F A, G, H, I, J
C None B, G, H, I, J
D None B, G, H, J
E None B, H, I, J
F J B, G, I, J
G A,B,C,D,F,H,J H, I
H A,B,C,D,E,G,I,J G, I, J
I A,B,C,E,F,G,H,J H
J A,B,C,D,E,F,H F, G, H, I
 Next step involves iterations to compute the hubs & authorities. Initially, for ease of
understanding, we carry out the iterations without normalization as in the following table.
We assume all initial weights to be 1.
 Note that the x values are the authority weights and y values are the hub weights.

Page x0 y0 x1 y1 x2 y2
A 1 1 1 4 5 118
B 1 1 4 5 17 123
C 1 1 0 5 0 135
D 1 1 0 4 0 104
E 1 1 0 4 0 106
F 1 1 1 4 4 106
G 1 1 7 2 29 60
H 1 1 8 3 29 89
I 1 1 8 1 31 29
J 1 1 7 4 29 93
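The following is a minimal Python sketch of the hubs & authorities iteration on the same 10-page example. It is an illustration only, not the textbook's code: it uses the standard update order with normalization, so the intermediate numbers may differ from the unnormalized hand-worked table above.

# Minimal HITS sketch for the 10-page example above (illustrative only).
out_links = {
    'A': ['G', 'H', 'I', 'J'], 'B': ['A', 'G', 'H', 'I', 'J'],
    'C': ['B', 'G', 'H', 'I', 'J'], 'D': ['B', 'G', 'H', 'J'],
    'E': ['B', 'H', 'I', 'J'], 'F': ['B', 'G', 'I', 'J'],
    'G': ['H', 'I'], 'H': ['G', 'I', 'J'], 'I': ['H'],
    'J': ['F', 'G', 'H', 'I'],
}

def hits(out_links, iterations=20):
    pages = list(out_links)
    auth = {p: 1.0 for p in pages}   # x_p: authority weights
    hub = {p: 1.0 for p in pages}    # y_p: hub weights
    for _ in range(iterations):
        # x_p = sum of y_q over all pages q that link to p
        auth = {p: sum(hub[q] for q in pages if p in out_links[q]) for p in pages}
        # y_p = sum of x_q over all pages q that p links to
        hub = {p: sum(auth[q] for q in out_links[p]) for p in pages}
        # normalize so that the squared sum of each type of weight is 1
        a_norm = sum(v * v for v in auth.values()) ** 0.5
        h_norm = sum(v * v for v in hub.values()) ** 0.5
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return auth, hub

auth, hub = hits(out_links)
print(sorted(auth, key=auth.get, reverse=True))   # pages ordered by authority weight
print(sorted(hub, key=hub.get, reverse=True))     # pages ordered by hub weight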

Problems with the HITS algorithm:


 Although the algorithm works well for most queries, it does not work well for some others. There
are a number of reasons for this:
Hubs and authorities:
A clear-cut distinction between hubs & authorities may not be appropriate since many
sites are hubs as well as authorities.
Topic drift:
Certain arrangements of tightly connected documents, perhaps due to mutually
reinforcing relationships between hosts, can dominate the HITS computation. These documents may
sometimes not be the most relevant to the query.
Automatically generated links:
Some of the links are computer generated & represent no human judgment but HITS still
gives them equal importance.
Non-relevant documents:
Some queries can return non-relevant documents among the highly ranked results and this
can then lead to erroneous results from HITS.
Efficiency:
The real-time performance of the algorithm is not good, given the steps that involve
finding sites that are pointed to by pages in the root set.
 Proposals to modify HITS are:
o More careful selection of the base set to reduce the possibility of topic drift.
o One may argue that the in-link information is more important than the out-link
information. A hub can become important just by pointing to a lot of authorities.

Web communities:
 A web community is generated by a group of individuals that share a common interest. Ex:
religious group, sports.
 Web communities may be discovered by exploring the web as a graph & identifying sub-
graphs that have a strong link structure within them but weak associations with other parts of
the graph. Such subgraphs may be called web-communities or cyber-communities.
 The idea of cyber-communities is to find all web communities on any topic by processing the
whole web graph.

WEB MINING SOFTWARE


Mention some web mining software packages. (2m)
Some software packages listed below are commercial products:
 123LogAnalyzer – low cost web mining software that provides an overview of a web site’s
performance, statistics about web server activity, web pages viewed, files downloaded,
images that were accessed,..
 Analog – an ultra-fast, scalable, highly configurable tool that works on any operating system & is
free.
 Azure web log analyzer – finds usual web information including popular pages, number of
visitors, and what browser & computer they used.
 ClickTracks – it offers a number of modules to provide web site analysis.
 Datanautics – web mining software for data collection, processing, analysis and reporting.
 NetTracker web analytics – analyses log files. Comprehensive analysis of behaviour of
visitors to Internet, intranet sites.
 Nihuo web log analyzer – provides reports on how many visitors came to the web site, where
they came from, which pages they viewed, how long they spent on the site & more.
 Weblog Expert 3.5 – produces reports that include information on accessed files, paths
through the site, search engines, browsers, OS & more.
 WUM: Web Utilization Miner – an open source project. A Java-based web mining software
for log file preparation, discovery of sequential patterns, etc.

SEARCH ENGINES
Define search engines. (2m)
Write the goals of search engine.(2m)
 A number of web sites provide a large amount of interesting information about search
engines.
 Search engines are huge databases of web pages as well as software packages for indexing
and retrieving the pages that enable users to find information of interest to them.
 Normally the search engine databases of web pages are built & updated automatically by web
crawlers.
 Every search engine provides advertising (sometimes called sponsored links). There are a
variety of ways businesses advertise on the search engines.
 Commonly, whenever a particular keyword is typed by the user, related sponsored links
appear with the results of the search.

Goals of web search:


It has been suggested that the information needs of the user may be divided into three classes:
1. Navigational: Primary information need in these queries is to reach a web site that the
user has in mind. The user may know that the site exists or may have visited the site earlier
but does not know the URL.
2. Informational:Primary information need in such queries is to find a web site that
provides useful information about a topic of interest. User does not have a particular web
site in mind. Ex: user wants to investigate IT career prospects in Kolkata.
3. Transactional:Primary need in such queries is to perform some kind of transaction. User
may or may not know the target web site. Ex: to buy a book, users wish to go to
amazon.com.

Quality of search results:


Results from a search engine ideally should satisfy the following quality requirements:
1. Precision: Only relevant documents should be returned.
2. Recall: All the relevant documents should be returned.
3. Ranking: A ranking of the documents, providing some indication of their relative
relevance, should be returned.
4. First Screen: First page of results should include the most relevant results.
5. Speed: Results should be provided quickly since users have little patience.

SEARCH ENGINES FUNCTIONALITY


Explain the functionalities of search engines. (5m)
 A search engine is a rather complex collection of software modules. A search engine carries
out a variety of tasks. These include:
1. Collecting information:
A search engine would normally collect web pages or information about them by web
crawling or by human submission of pages.
2. Evaluating and categorizing information:
In some cases, for example, when web pages are submitted to a directory, it may be
necessary to evaluate and decide whether a submitted page should be selected.
3. Creating a database and creating indexes:
The information collected needs to be stored either in a database or some kind of file
system. Indexes must be created so that the information may be searched efficiently.
4. Computing ranks of the web documents:
A variety of methods are being used to determine the rank of each page retrieved in
response to a user query. The information used may include frequency of keywords,
value of in-links & out-links from the page and frequency of use of the page.
5. Checking queries and executing them:
Queries posed by users need to be checked, for example, for spelling errors and whether
words in the query are recognizable. Once checked, a query is executed by searching the
search engine database.
6. Presenting results:
How the search engine presents the results to the user is important. The search engine
must determine what results to present and how to display them.
7. Profiling the users:
To improve search performance, the search engines carry out user profiling that deals
with the way users use search engines.

SEARCH ENGINES ARCHITECTURE


What are the major components of search engines? (2m)
Define crawler & Term extraction. (2m)
What are the advantages of proxy cache. (2m)
Explain about the crawler and indexer of search engine. (5m)
With a neat diagram explain the architecture of search engines. (10m)
 No two search engines are exactly the same in terms of size, indexing techniques, page
ranking algorithms or speed of search.
 Typical search engine architecture is shown below. It consists of many components including
the following three major components to carry out the functions that were listed above.
 The crawler and the indexer: It collects pages from the web, creates and maintains the
index.
 The user interface: It allows users to submit queries and enables result presentation.
 The database and the query server: It stores information about the web pages and
processes the query and returns results.
The crawler:
 The crawler (or spider or robot or bot) is an application program that carries out a task
similar to graph traversal.
 It is given a set of starting URLs that it uses to automatically traverse the web by retrieving a
page, initially from the starting set.
 Some search engines use a number of distributed crawlers. Since a crawler has a relatively
simple task, it is not CPU-bound. It is bandwidth bound.
 A web crawler must take into account the load (bandwidth, storage) on the search engine
machines and sometimes also on the machines being traversed in guiding its traversal.
 A web crawler starts with a given set of URLs and fetches the pages that the URLs refer to.
The crawler then reads the out-links of the pages so fetched and fetches those pages. This
continues until no new pages are found.
 Each page found by the crawler is often not stored as a separate file; otherwise four billion
pages would require managing four billion files. Usually many pages are packed together into one
file.
 Crawlers follow an algorithm like the following:
1. Find base URLs – a set of known & working hyperlinks are collected.
2. Build a queue – put the base URLs in the queue and add new URLs to the queue as more
are discovered.
3. Retrieve the next page – retrieve the next page in the queue, process and store in the
search engine database.
4. Add to the queue – check if the out-links of the current page have already been
processed. Add the unprocessed out-links to the queue of URLs.
5. Continue the process until some stopping criteria are met.
6. A crawler would often have a traversal strategy. It may traverse the web breadth first,
depth first or the traversal may be priority based.
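A minimal Python sketch of this queue-based crawl follows (illustrative only: the start URL is a placeholder, and a real crawler must also respect robots.txt, limit its request rate and store many pages per file as described above).

import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

def crawl(base_urls, max_pages=100):
    queue = deque(base_urls)                  # step 2: build a queue of base URLs
    seen = set(base_urls)
    store = {}                                # stands in for the search engine database
    while queue and len(store) < max_pages:   # step 5: stopping criteria
        url = queue.popleft()                 # step 3: retrieve the next page
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode('utf-8', 'ignore')
        except Exception:
            continue
        store[url] = html                     # process and store the page
        for link in re.findall(r'href="([^"#]+)"', html):   # step 4: out-links
            absolute = urljoin(url, link)
            if absolute.startswith('http') and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return store

# pages = crawl(['https://example.com/'])     # breadth-first traversal (step 6)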

[Figure: Typical architecture of a search engine — the web, users, web crawler, page submission, page evaluation, query checking, query representation, strategy selection, user profiling, indexing, query execution, result presentation and history maintenance.]

The Indexer:
 Given the size of the web & the number of documents that current search engines have in
their databases, an indexer is essential to reduce the cost of query evaluation.
 Many researchers recommended an inverted file structure in which each document is
essentially indexed on every word other than the most common words. Google uses this
approach.
 Indexing can be a very resource intensive process and is often CPU-bound since it involves a
lot of analysis of each page.
 Building an index requires document analysis and term extraction. Term extraction may
involve extracting all the words from each page, elimination of stop words (common words
like “the”, “it”, “and”, “that”) and stemming (transforming words like “computer”,
“computing” and “computation” into one word, say “computer”) to help modify, for
example, the inverted file structure.
 A good indexing approach is to create an index that will relate keywords to pages. The
overheads of maintaining such indexes can be high.
 An inverted index, rather than listing keywords for each page, lists pages for each keyword.
 This description is not quite accurate because an inverted database normally still maintains a
collection of pages, including their content, in some way.
 As an example for inverted index consider the data structure shown below:

Words Page ID
data 10,18,26,41
mining 26,30,31,41
search 72,74,75,76
engine 72,74,79,82
google 72,74,90
 This index specifies a number of keywords and the pages that contain the keywords. If we
were searching for “data mining” we look for pages for each keyword “data” and “mining”
and then find the intersection of the two lists and obtain page numbers 26 & 41.
 Challenging part is when the two lists are too large. Usual approach is to split the lists across
many machines to find the intersection quickly.
 When the index is distributed over many machines, the distribution may be based on a local
inverted index or global inverted index.
 Local inverted index results in each machine searching for candidate documents for each
query while the global inverted index is so distributed that each machine is responsible for
only some of the index terms.
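The following short Python sketch (illustrative only; stemming is omitted) builds such an inverted index and answers a two-keyword query by intersecting posting lists, reproducing the "data mining" example from the table above.

from collections import defaultdict

def build_inverted_index(pages):
    # pages: {page_id: text}; returns {word: sorted list of page ids}
    stop_words = {'the', 'it', 'and', 'that'}        # stop word elimination
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            if word not in stop_words:
                index[word].add(page_id)
    return {word: sorted(ids) for word, ids in index.items()}

def search(index, *keywords):
    # intersect the posting lists of all keywords
    lists = [set(index.get(k, [])) for k in keywords]
    return sorted(set.intersection(*lists)) if lists else []

index = {
    'data':   [10, 18, 26, 41],
    'mining': [26, 30, 31, 41],
    'search': [72, 74, 75, 76],
    'engine': [72, 74, 79, 82],
    'google': [72, 74, 90],
}
print(search(index, 'data', 'mining'))    # -> [26, 41], as in the table above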

Updating the index


 As the crawler updates the search engine database, the inverted index must also be updated.
 Depending on how the index is stored, incremental updating may be relatively easy but at
some stage after many incremental updates it may become necessary to rebuild the whole
index, which normally will take substantial resources.
User profiling
 Currently most search engines provide just one type of interface to the user. They provide an
input box in which the user types in the keywords & then waits for the results.
 Interfaces other than those currently provided are being investigated. They include form fill-
in or a menu or even a simple natural language interface.

Query server
 First of all, a search engine needs to receive the query and check the spelling of keywords
that the user has typed.
 If the search engine cannot recognize the keywords as words in the language or proper nouns
it is desirable to suggest alternative spellings to the user.
 Once the keywords are found to be acceptable, the query may need to be transformed.
 Stemming is then carried out and stop words are removed unless they form part of a phrase
that the user wants to search for.

Query composition
 Query composition is a major problem. Often a user has to submit a query a number of times
using somewhat different keywords before more or less the “right” result is obtained.
 A search engine providing query refinement based on user feedback would be useful. Search
engines often cache the results of a query and can then use the cached results if the refined
query is a modification of a query that has already been processed.

Query processing
 Search engine query processing is different from query processing in relational database
systems. In database systems, query processing requires that attribute values match exactly
the values provided in the query.
 In search engine query processing, an exact match is not always necessary. A partial match
or a fuzzy match may be sufficient.
 Some search engines provide an indication of how well the document matches the user
query.
 A major search engine will run a collection of machines, several replicating the index. A
query may then be broadcast to all that are not heavily loaded.
 If the query consists of a number of keywords, then the indexes will need to be searched a
number of times (intersection of results to be found).
 A search engine may put weights on different words in a query. Ex: if query has words
“going” “to” “paris”, then the search engine will put greater weight on “paris”.

Caching query results


 Although more bandwidth is available & cheap, the delay in fetching pages from the web
continues.
 Common approach is to use web caches & proxies as intermediates between the client
browser processes and the machines serving the web pages.
 Web cache or proxy is an intermediary that acts as server on behalf of some other server by
supplying web resources without contacting server.
 Essentially a cache is supposed to catch a user’s request, get the page that the user has
requested if one is not available in the cache & then save a copy of it in the cache.
 It is assumed that if a page is saved, there is a strong likelihood that the same/other user will
request it again.
 Web has very popular sites which are requested very frequently. The proxy satisfies such
requests without going to the web server.
 A proxy cache must have algorithms to replace pages in the hope that the pages being stored
in the cache are fresh and are likely to be pages that users are likely to request.
 Advantages of proxy caching are:
► It reduces average latency in fetching web pages, reduces network traffic & reduces
load on busy web servers.
► Since all user requests go through cache, it is possible for site managers to monitor
who is accessing what information & if necessary block some sites.
 A part of user’s hard disk is dedicated to cache to store pages that the user has browsed.
Browser cache is useful when the user hits the back button since the most recently visited
pages are likely to be in the cache and do not need to be fetched again.
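A toy Python sketch of such a proxy cache with a least-recently-used replacement policy is shown below (illustrative only; a real proxy also checks freshness and expiry information before serving a cached copy).

from collections import OrderedDict

class ProxyCache:
    def __init__(self, capacity, fetch):
        self.capacity = capacity
        self.fetch = fetch                    # function that contacts the web server
        self.cache = OrderedDict()            # URL -> page, ordered by recency

    def get(self, url):
        if url in self.cache:                 # cache hit: no request to the server
            self.cache.move_to_end(url)
            return self.cache[url]
        page = self.fetch(url)                # cache miss: go to the web server
        self.cache[url] = page                # save a copy in the cache
        if len(self.cache) > self.capacity:   # replacement policy: evict the LRU page
            self.cache.popitem(last=False)
        return page

# cache = ProxyCache(capacity=1000, fetch=lambda u: '<html>...</html>')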

Results presentation
 All search engines present results ordered according to relevance. Therefore, before
presentation of results to user, the results must be ranked.
 Relevance ranking is based on page popularity in terms of number of back links to the page.
 How much of what has been found should be presented to the user is another issue. Most search engines
present the retrieved items as a list on the screen with some information for each item, ex:
type of document, its URL & a brief extract from the document.

RANKING OF WEB PAGES


Why is page ranking required? (2m)
Define damping factor. (2m)
Write short notes on ranking of web pages. (5m)
Explain the Page Ranking algorithm with an example. (10m)
► Web contains a large number of documents published without any quality control. It is
important to find methods for determining importance & quality of pages, given a topic.
► Search engines differ in size & so number of documents that they index may be quite
different. No two search engines will have exactly the same pages on a given topic.

Page Rank Algorithm


 Google has the most well-known ranking algorithm, called the Page Rank Algorithm, which
has been claimed to supply top ranking pages that are relevant.
 This is based on using the hyperlinks as indicators of a page’s importance. It is almost like
vote counting in election.
 Every unique page is assigned a Page Rank. If a lot of pages vote for a page by linking to it
then the page that is being pointed to will be considered important.
 Votes cast by a link farm (a page with many links) are given less importance than votes cast
by an article that only links to a few pages.
 Internal site links also count in assessing page rank. Page rank has no relationship with the
content of the page.
 Page Rank was developed based on a probability model of a random surfer visiting a web
page. The probability of a random surfer clicking on a link may be estimated based on the
number of links on that page.
 The probability of the surfer clicking any link is assumed to be inversely proportional to the
number of links on the page. The more links a page has, the less chance that a surfer
will click on a particular link.
 Probability for the random surfer reaching one page is the sum of probabilities for the
random surfer following links to that page.
 Model also introduces a damping factor– that the random surfer sometimes does not click
any link & jumps to another page.
 If the damping factor is d (assumed to be between 0 and 1), then the probability of the surfer
jumping off to some other page is assumed to be 1-d.
 The higher the value of d, the more likely the surfer is to follow one of the links. Given that the
surfer has 1-d probability of jumping to some random page, every page has a minimum page rank of 1-d.
 Algorithm works as follows:
 Let A be the page whose Page Rank PR(A) is required. Let T1, T2, ..., Tn be the pages that link
to A and let C(Ti) be the number of links going out from page Ti. The Page Rank of A is then given by:
PR(A) = (1-d) + d ( PR(T1)/C(T1) + PR(T2)/C(T2) + ... + PR(Tn)/C(Tn) )
where d is the damping factor.
 To compute a page rank of A, we need to know the page rank of each page that points to it.
And the number of out-links from each of those pages.
Example:
♠ Let us consider an example of three pages. We are given the following information:
 Damping factor = 0.8
 Page A has an out-link to B
 Page B has an out-link to A & another to C
 Page C has an out-link to A
 Starting page rank for each page is 1
♠ We may show the links between the pages as below:
[Figure: Three web pages A, B and C with the links listed above.]
♠ Page rank equations may be written as: PR(A) = 0.2 + 0.4PR(B) + 0.8PR(C)
PR(B) = 0.2 + 0.8PR(A)
PR(C) = 0.2 + 0.4PR(B)
♠ These are three linear equations in 3 unknowns, so we may solve them. We write them as
follows if we replace PR(A) by a & the others similarly.
a – 0.4b – 0.8c = 0.2
b – 0.8a = 0.2
c – 0.4b = 0.2
♠ Solution of the above equations is: a = PR(A) = 1.19 ; b = PR(B) = 1.15; c = PR(C) =
0.66
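♠ A minimal Python sketch of the same computation by repeated substitution follows (illustrative only); starting every page at rank 1, the iteration converges to approximately the values obtained algebraically above.

out_links = {'A': ['B'], 'B': ['A', 'C'], 'C': ['A']}

def page_rank(out_links, d=0.8, iterations=50):
    pr = {p: 1.0 for p in out_links}          # starting page rank of 1 for each page
    for _ in range(iterations):
        new_pr = {}
        for p in out_links:
            incoming = [q for q in out_links if p in out_links[q]]
            new_pr[p] = (1 - d) + d * sum(pr[q] / len(out_links[q]) for q in incoming)
        pr = new_pr
    return pr

print(page_rank(out_links))   # approximately {'A': 1.19, 'B': 1.15, 'C': 0.66}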

Weaknesses of the algorithm


Mention some weaknesses of the page rank algorithm. (2m)
♦ One weakness is that a link cannot always be considered a recommendation, & people are now
trying to influence the links.
♦ A major weakness is that this algorithm focuses on the importance of a page rather than on its
relevance to the user query.
♦ A side effect of page rank therefore can be something like this. Ex: suppose amazon.com
decides to get into the real estate business.
♦ Obviously its real estate web site will have links from all other amazon.com sites which have
no relevance whatsoever with real estate. When a user then searches for real estate,
amazon.com real estate site is likely to be served high in the results because of its high page
rank while most other real estate businesses whatever their quality will appear further down
the list.
♦ Because of the nature of the algorithm, page rank does not deal with new pages fairly since it
makes high page rank pages even more popular by serving them at the top of the results.
Thus the rich get richer & the poor get poorer. It takes time for a new page to become
popular, even if the new page is of high quality.
Other issues in Page Ranking
♦ Some page rank algorithms consider the following in determining page rank:
 Page title including keywords
 Page content including links, keywords, keywords in link text, spelling errors, length
of the page.
 Out-links including relevance of links to the content
 In-links & their importance
 In-links text – keywords
 In-linking page content
 Traffic including the number of unique visitors & the time spent on the page.
Yahoo! Web Rank
Yahoo! has developed its own page ranking algorithm. The algorithm is called Web Rank.
Rank is calculated by analyzing the web page text, title & description as well as its associated
links & other unique document characteristics.
UNIT-V

Data Warehouse:
A data warehouse is a subject-oriented, integrated, time-variant and non-volatile
collection ofdata in support of management's decision making process.

Subject-Oriented: A data warehouse can be used to analyze a particular subject area.


For example, "sales" can be a particular subject.

Integrated: A data warehouse integrates data from multiple data sources. For example,
source A and source B may have different ways of identifying a product, but in a data
warehouse, there will be only a single way of identifying a product.
Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve
data from 3 months, 6 months, 12 months, or even older data from a data warehouse.
This contrasts with a transaction system, where often only the most recent data is kept.
For example, a transaction system may hold the most recent address of a customer,
where a data warehouse can hold all addresses associated with a customer.

Non-volatile: Once data is in the data warehouse, it will not change. So, historical data
in a data warehouse should never be altered.

Data Warehouse Design Process:


A data warehouse can be built using a top-down approach, a bottom-up approach, or a

combination of both.

The top-down approach starts with the overall design and planning. It is useful in
cases where the technology is mature and well known, and where the business
problems that must be solved are clear and well understood.
The bottom-up approach starts with experiments and prototypes. This is useful in
the early stage of business modeling and technology development. It allows an
organization to move forward at considerably less expense and to evaluate the
benefits of the technology before making significant commitments.
In the combined approach, an organization can exploit the planned and strategic
nature of the top-down approach while retaining the rapid implementation and
opportunistic application of the bottom-up approach.
The warehouse design process consists of the following steps:
Choose a business process to model, for example, orders, invoices, shipments,
inventory, account administration, sales, or the general ledger. If the business
process is organizational and involves multiple complex object collections, a data
warehouse model should be followed. However, if the process is departmental and
focuses on the analysis of one kind of business process, a data mart model should be
chosen.
Choose the dimensions that will apply to each fact table record. Typical
dimensions are time, item, customer, supplier, warehouse, transaction type, and
status.
Choose the measures that will populate each fact table record. Typical measures are
numeric additive quantities like dollars sold and units sold.
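As a small illustration of these design steps, the Python sketch below shows a fact table holding one record per business event, with keys into the chosen dimensions and the chosen additive measures (all names and figures are invented for the example).

# Illustrative only: a tiny fact table with time, item and location dimension
# keys and the additive measures dollars_sold and units_sold.
fact_sales = [
    # (time_key, item_key, location_key, dollars_sold, units_sold)
    ('Q1', 'phone',  'Delhi',   605, 12),
    ('Q1', 'laptop', 'Delhi',   825,  9),
    ('Q2', 'phone',  'Chennai', 680, 14),
    ('Q2', 'laptop', 'Chennai', 952, 11),
]

dim_time = {'Q1': {'quarter': 'Q1', 'year': 2021},
            'Q2': {'quarter': 'Q2', 'year': 2021}}
dim_item = {'phone':  {'brand': 'X', 'type': 'electronics'},
            'laptop': {'brand': 'Y', 'type': 'electronics'}}

# Aggregating an additive measure along one dimension (total dollars per quarter):
totals = {}
for time_key, item_key, location_key, dollars, units in fact_sales:
    totals[time_key] = totals.get(time_key, 0) + dollars
print(totals)   # {'Q1': 1430, 'Q2': 1632}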
A Three Tier Data Warehouse Architecture:

Tier-1:
The bottom tier is a warehouse database server that is almost always a
relational database system. Back-end tools and utilities are used to feed data into
the bottom tier from operational databases or other external sources (such as
customer profile information provided by external consultants). These tools and
utilities perform data extraction, cleaning, and transformation (e.g., to merge
similar data from different sources into a unified format), as well as load and
refresh functions to update the data warehouse. The data are extracted using
application program interfaces known as gateways.
A gateway is supported by the underlying DBMS and allows client programs to
generate SQL code to be executed at a server.
Examples of gateways include ODBC (Open Database Connectivity) and OLE DB
(Object Linking and Embedding for Databases) by Microsoft, and JDBC (Java
Database Connectivity).
This tier also contains a metadata repository, which stores information about the
data warehouse and its contents.

Tier-2:

The middle tier is an OLAP server that is typically implemented using either a
relational OLAP (ROLAP) model or a multidimensional OLAP (MOLAP) model.
A ROLAP model is an extended relational DBMS that maps operations on
multidimensional data to standard relational operations.
A MOLAP model is a special-purpose server that directly implements
multidimensional data and operations.

Tier-3:
The top tier is a front-end client layer, which contains query and
reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis,
prediction, and so on).

Data Warehouse Models:

There are three data warehouse models.


1. Enterprise warehouse:
An enterprise warehouse collects all of the information about subjects spanning
the entire organization.
It provides corporate-wide data integration, usually from one or more operational
systems or external information providers, and is cross-functional in scope.
It typically contains detailed data as well as summarized data, and can range in
size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
An enterprise data warehouse may be implemented on traditional mainframes,
computer superservers, or parallel architecture platforms. It requires extensive
business modeling and may take years to design and build.

2. Data mart:

A data mart contains a subset of corporate-wide data that is of value to a specific
group of users. The scope is confined to specific selected subjects. For example, a
marketing data mart may confine its subjects to customer, item, and sales.
The data contained in data marts tend to be summarized.
Data marts are usually implemented on low-cost departmental servers that are
UNIX/LINUX- or Windows-based. The implementation cycle of a data mart is
more likely to be measured in weeks rather than months or years. However, it
may involve complex integration in the long run if its design and planning were
not enterprise-wide.
Depending on the source of data, data marts can be categorized as independent
or dependent. Independent data marts are sourced from data captured from one or
more operational systems or external information providers, or from data
generated locally within a particular department or geographic area. Dependent
data marts are sourced directly from enterprise data warehouses.

3. Virtual warehouse:

A virtual warehouse is a set of views over operational databases.
For efficient query processing, only some of the possible summary views
may be materialized.
A virtual warehouse is easy to build but requires excess capacity on operational
database servers.

Operational Data Store (ODS)

What is an operational data store?


An operational data store (ODS) is a type of database that's often used as an interim logical
area for a data warehouse. ODSes are designed to integrate data from multiple sources for
lightweight data processing activities such as operational reporting and real-time analysis.

While in the ODS, data can be scrubbed, resolved for redundancy and checked
for compliance with the corresponding business rules. An ODS can be used for integrating
disparate data from multiple sources so that business operations, analysis and reporting can
be carried out while business operations are occurring. This is where most of the data used in
current operations is housed before it's transferred to the data warehouse for longer-term
storage or archiving.

ODSes are commonly used in online transaction processing applications, which involve
processing transactional data. These applications use the quick, lightweight processing that
ODS tools provide. ODS systems enable more comprehensive trend analysis reporting and
processing across many different systems simultaneously.

How do operational data stores work?

An operational data store usually stores and processes data in real time. An ODS is connected
to multiple data sources and pulls data into a central location.

ETL is often used in conjunction with operational data stores to prep raw data for a data
warehouse.

The way operational data stores work is comparable to the extract, transform and load (ETL)
process. ODS systems import raw data from production systems and store it in its original
form.
In the ETL process, data is extracted from target sources, transformed and loaded to its
destination. In the ODS process, data is not transformed, but rather it's presented as is to
business intelligence (BI) applications for analysis and operational decision-making.

In some cases, data from an ODS is replicated and then ETL is used to transport the
replicated data to a data warehouse.
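A toy Python sketch contrasting the two flows follows (all names and values are invented for illustration): the ODS keeps records in their original form, while ETL cleans and reshapes them before loading them into the warehouse.

def extract(source_rows):
    # pull raw records from an operational source
    return list(source_rows)

def transform(rows):
    # clean and standardize, e.g. unify product codes and drop incomplete rows
    cleaned = []
    for row in rows:
        if row.get('amount') is None:          # data cleaning
            continue
        cleaned.append({'product_id': str(row['product']).upper(),   # unified format
                        'amount': float(row['amount'])})
    return cleaned

def load(rows, warehouse_table):
    warehouse_table.extend(rows)               # append to the warehouse table

ods = [{'product': 'ab12', 'amount': '19.5'},  # held as-is in the ODS
       {'product': 'cd34', 'amount': None}]
warehouse_table = []
load(transform(extract(ods)), warehouse_table)
print(warehouse_table)   # [{'product_id': 'AB12', 'amount': 19.5}]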

As operational data stores ingest data, new incoming data overwrites existing data.

How are operational data stores used?

An operational data store typically pulls data from multiple transactional systems for
operational reporting and business reporting. It combines various real-time data sources
together, in their original format, in a central location.

ODS tools contain up-to-date versions of business data integrated from data sources, which is
useful for BI tasks such as managing logistics, tracking orders and monitoring customer
activity. ODSes are also useful for troubleshooting integration issues with data when they
occur. They can compare recent versions of data to copies on other systems to determine if
there is a continuity error.

ODSes also lend themselves to easy systems integration. Administrators can program rules
into an ODS that synchronize data across multiple systems. When it changes on one system,
it can trigger a change on another system.

ODSes can also facilitate a real-time data stream from data sources into the data pipeline.

For a real-world example, an ODS could pull batches of data from a billing application at
weekly intervals, ingest transaction data in real time and integrate them into a relational
database.

An ODS usually focuses on the operational requirements of a specific business process like
customer service, for example. ODSes allow updates and propagate those updates back to the
operational system that the data originated from.
What are the differences between operational data stores and data warehouses?

Like data warehouses and data lakes, operational data stores can be used as a repository to
import and consolidate different types of operational data from various locations and systems.
However, there are significant differences.

An ODS can be used as an intermediate area for a data warehouse. It can sit in between data
sources and enterprise data warehouses, where it can be used to prepare data for storage in
the data warehouse. In this way, the ODS works as a part of a data warehousing strategy
instead of replacing it. The ODS can become a data source for data warehouses.
Data Warehouse Implementation

There are various implementation in data warehouses which are as follows

1. Requirements analysis and capacity planning: The first process in data warehousing
involves defining enterprise needs, defining architectures, carrying out capacity planning, and
selecting the hardware and software tools. This step involves consulting senior
management as well as the different stakeholders.

2. Hardware integration: Once the hardware and software have been selected, they need to
be put together by integrating the servers, the storage systems, and the user software tools.

3. Modeling: Modeling is a significant stage that involves designing the warehouse schema
and views. This may involve using a modeling tool if the data warehouse is sophisticated.

4. Physical modeling: For the data warehouse to perform efficiently, physical modeling is
needed. This involves designing the physical data warehouse organization, data placement,
data partitioning, deciding on access methods, and indexing.
5. Sources: The information for the data warehouse is likely to come from several data
sources. This step involves identifying and connecting the sources using a gateway, ODBC
drivers, or other wrappers.

6. ETL: The data from the source systems will need to go through an ETL phase. The
process of designing and implementing the ETL phase may involve identifying suitable ETL
tool vendors and purchasing and implementing the tools. It may also involve customizing the
tools to suit the needs of the enterprise.

7. Populate the data warehouse: Once the ETL tools have been agreed upon, testing the
tools will be needed, perhaps using a staging area. Once everything is working adequately,
the ETL tools may be used in populating the warehouse given the schema and view
definitions.

8. User applications: For the data warehouse to be helpful, there must be end-user
applications. This step involves designing and implementing the applications required by the
end-users.

9. Roll-out the warehouse and applications: Once the data warehouse has been populated
and the end-user applications tested, the warehouse system and the applications may be rolled
out for the user community to use.

Implementation Guidelines
1. Build incrementally: Data warehouses must be built incrementally. Generally, it is
recommended that a data mart be created with one particular project in mind, and once
it is implemented, several other sections of the enterprise may also want to implement similar
systems. An enterprise data warehouse can then be implemented in an iterative manner,
allowing all data marts to extract information from the data warehouse.

2. Need a champion: A data warehouse project must have a champion who is willing to
carry out considerable research into the expected costs and benefits of the project. Data
warehousing projects require inputs from many units in an enterprise and therefore need to
be driven by someone who is capable of interacting with people across the enterprise and can
actively persuade colleagues.

3. Senior management support: A data warehouse project must be fully supported by
senior management. Given the resource-intensive nature of such projects and the time they
can take to implement, a warehouse project calls for a sustained commitment from senior
management.

4. Ensure quality: Only data that has been cleaned and is of a quality accepted by
the organization should be loaded into the data warehouse.

5. Corporate strategy: A data warehouse project must fit in with corporate strategy
and business objectives. The purpose of the project must be clearly defined before the
project begins.

6. Business plan: The financial costs (hardware, software, and peopleware), expected
benefits, and a project plan for a data warehouse project must be clearly outlined and
understood by all stakeholders. Without such understanding, rumors about expenditure and
benefits can become the only source of information, undermining the project.

7. Training: A data warehouse project must not overlook training
requirements. For a data warehouse project to be successful, the users must be trained
to use the warehouse and to understand its capabilities.

8. Adaptability: The project should build in flexibility so that changes may be made to the
data warehouse if and when required. Like any system, a data warehouse will need to
change as the needs of the enterprise change.

9. Joint management: The project must be managed by both IT and business professionals in
the enterprise. To ensure proper communication with stakeholders and that the
project remains targeted at assisting the enterprise's business, business professionals must be
involved in the project along with technical professionals.
Data Warehousing - Metadata Concepts

What is Metadata?

Metadata is simply defined as data about data. The data that is used to represent other data is
known as metadata. For example, the index of a book serves as a metadata for the contents in
the book. In other words, we can say that metadata is the summarized data that leads us to
detailed data. In terms of data warehouse, we can define metadata as follows.

 Metadata is the road-map to a data warehouse.


 Metadata in a data warehouse defines the warehouse objects.
 Metadata acts as a directory. This directory helps the decision support system to
locate the contents of a data warehouse.

Note − In a data warehouse, we create metadata for the data names and definitions of a given
data warehouse. Along with this metadata, additional metadata is also created for time-
stamping any extracted data and for recording the source of the extracted data.

Categories of Metadata

Metadata can be broadly categorized into three categories −

 Business Metadata − It has the data ownership information, business definition, and
changing policies.
 Technical Metadata − It includes database system names, table and column names
and sizes, data types and allowed values. Technical metadata also includes structural
information such as primary and foreign key attributes and indices.
 Operational Metadata − It includes currency of data and data lineage. Currency of
data means whether the data is active, archived, or purged. Lineage of data means the
history of data migrated and transformation applied on it.

Role of Metadata

Metadata has a very important role in a data warehouse. The role of metadata in a warehouse
is different from the warehouse data, yet it plays an important role. The various roles of
metadata are explained below.
 Metadata acts as a directory.
 This directory helps the decision support system to locate the contents of the data
warehouse.
 Metadata helps in decision support system for mapping of data when data is
transformed from operational environment to data warehouse environment.
 Metadata helps in summarization between current detailed data and highly
summarized data.
 Metadata also helps in summarization between lightly detailed data and highly
summarized data.
 Metadata is used for query tools.
 Metadata is used in extraction and cleansing tools.
 Metadata is used in reporting tools.
 Metadata is used in transformation tools.
 Metadata plays an important role in loading functions.

The following diagram shows the roles of metadata.

Metadata Repository

Metadata repository is an integral part of a data warehouse system. It has the following
metadata −

 Definition of data warehouse − It includes the description of structure of data


warehouse. The description is defined by schema, view, hierarchies, derived data
definitions, and data mart locations and contents.
 Business metadata − It contains the data ownership information, business
definition, and changing policies.
 Operational Metadata − It includes currency of data and data lineage. Currency of
data means whether the data is active, archived, or purged. Lineage of data means the
history of data migrated and transformation applied on it.
 Data for mapping from operational environment to data warehouse − It includes
the source databases and their contents, data extraction, data partition cleaning,
transformation rules, data refresh and purging rules.
 Algorithms for summarization − It includes dimension algorithms, data on
granularity, aggregation, summarizing, etc.

Challenges for Metadata Management

The importance of metadata cannot be overstated. Metadata helps in driving the accuracy of
reports, validates data transformation, and ensures the accuracy of calculations. Metadata also
enforces the definition of business terms to business end-users. With all these uses of
metadata, it also has its challenges. Some of the challenges are discussed below.

 Metadata in a big organization is scattered across the organization. This metadata is


spread in spreadsheets, databases, and applications.
 Metadata could be present in text files or multimedia files. To use this data for
information management solutions, it has to be correctly defined.
 There are no industry-wide accepted standards. Data management solution vendors
have narrow focus.
 There are no easy and accepted methods of passing metadata.
What is OLAP (Online Analytical Processing)?
OLAP stands for On-Line Analytical Processing. OLAP is a category of software
technology which enables analysts, managers, and executives to gain insight into
information through fast, consistent, interactive access to a wide variety of possible views of
data that has been transformed from raw information to reflect the real dimensionality of the
enterprise as understood by the clients.

OLAP implements the multidimensional analysis of business information and supports the
capability for complex estimations, trend analysis, and sophisticated data modeling. It is
rapidly becoming the essential foundation for Intelligent Solutions including Business
Performance Management, Planning, Budgeting, Forecasting, Financial Reporting,
Analysis, Simulation Models, Knowledge Discovery, and Data Warehouse Reporting.
OLAP enables end-users to perform ad hoc analysis of data in multiple dimensions,
providing the insight and understanding they require for better decision making.

Who uses OLAP and Why?

OLAP applications are used by a variety of the functions of an organization.

Finance and accounting:

o Budgeting
o Activity-based costing
o Financial performance analysis
o And financial modeling
Sales and Marketing

o Sales analysis and forecasting


o Market research analysis
o Promotion analysis
o Customer analysis
o Market and customer segmentation

Production

o Production planning
o Defect analysis

OLAP Guidelines
Dr E.F. Codd, the "father" of the relational model, has formulated a list of 12 guidelines and
requirements as the basis for selecting OLAP systems:

1) Multidimensional Conceptual View: This is the central feature of an OLAP system. By
requiring a multidimensional view, it is possible to carry out operations like slice and dice.
2) Transparency: Make the technology, underlying information repository, computing
operations, and the dissimilar nature of source data totally transparent to users. Such
transparency helps to improve the efficiency and productivity of the users.

3) Accessibility: The OLAP system provides access only to the data that is actually required to perform the
particular analysis, presenting a single, coherent, and consistent view to the clients. The OLAP
system must map its own logical schema to the heterogeneous physical data stores and
perform any necessary transformations. The OLAP operations should sit between data
sources (e.g., data warehouses) and an OLAP front-end.

4) Consistent Reporting Performance: To make sure that the users do not experience any
significant degradation in reporting performance as the number of dimensions or the size
of the database increases. That is, the performance of OLAP should not suffer as the number
of dimensions is increased. Users must observe consistent run time, response time, or
machine utilization every time a given query is run.

5) Client/Server Architecture: Make the server component of OLAP tools sufficiently
intelligent that the various clients can be attached with a minimum of effort and integration
programming. The server should be capable of mapping and consolidating data between
dissimilar databases.

6) Generic Dimensionality: An OLAP method should treat each dimension as equivalent in
both its structure and operational capabilities. Additional operational capabilities may be
granted to selected dimensions, but such additional capabilities should be grantable to any
dimension.

7) Dynamic Sparse Matrix Handling: To adapt the physical schema to the specific
analytical model being created and loaded so that sparse matrix handling is optimized. When
encountering a sparse matrix, the system must be able to dynamically deduce the
distribution of the data and adjust the storage and access paths to obtain and maintain a
consistent level of performance.

8) Multiuser Support: OLAP tools must provide concurrent data access, data integrity, and
access security.

9) Unrestricted cross-dimensional Operations: The system should provide the ability to
recognize dimensional hierarchies and perform roll-up and drill-down operations within a
dimension or across dimensions.

10) Intuitive Data Manipulation: Data manipulation inherent in the consolidation
path, such as reorientation (pivoting), drill-down and roll-up, and other manipulations, should
be accomplished naturally and precisely via point-and-click and drag-and-drop actions on
the cells of the analytical model. It avoids the use of a menu or multiple trips to a user
interface.

11) Flexible Reporting: It gives business users the ability to organize columns,
rows, and cells in a manner that facilitates simple manipulation, analysis, and synthesis of
data.
12) Unlimited Dimensions and Aggregation Levels: The number of data dimensions should
be unlimited. Each of these common dimensions must allow a practically unlimited number
of customer-defined aggregation levels within any given consolidation path.

Characteristics of OLAP
The FASMI characteristics of OLAP methods (the term is derived from the first letters of
the characteristics) are:

Fast

The system is targeted to deliver most responses to the user within about
five seconds, with the simplest analyses taking no more than one second and very few
taking more than 20 seconds.

Analysis

The system should be able to cope with any business logic and statistical analysis that is
relevant for the application and the user, while keeping it easy enough for the target user. Although
some pre-programming may be needed, the system should allow the user to define new ad hoc
calculations as part of the analysis and to report on the data in any desired way, without
having to program. Products (like Oracle Discoverer) that do not allow adequate
end-user-oriented calculation flexibility are therefore excluded.

Share

The system should implement all the security requirements for confidentiality and, if
multiple write access is needed, concurrent update locking at an appropriate level. Not
all applications need users to write data back, but for the increasing number that do, the
system should be able to handle multiple updates in a timely, secure manner.

Multidimensional

This is the basic requirement. The OLAP system must provide a multidimensional conceptual
view of the data, including full support for hierarchies, as this is certainly the most logical
way to analyze businesses and organizations.

Information

The system should be able to hold all the data needed by the applications. Data sparsity
should be handled in an efficient manner.

The main characteristics of OLAP are as follows:

1. Multidimensional conceptual view: OLAP systems let business users have a


dimensional and logical view of the data in the data warehouse. It helps in carrying
slice and dice operations.
2. Multi-User Support: Since OLAP systems are shared, the OLAP operations
should provide normal database operations, including retrieval, update, concurrency
control, integrity, and security.
3. Accessibility: OLAP acts as a mediator between data warehouses and front-end. The
OLAP operations should be sitting between data sources (e.g., data warehouses) and
an OLAP front-end.
4. Storing OLAP results: OLAP results are kept separate from data sources.
5. Uniform reporting performance: Increasing the number of dimensions or the
database size should not significantly degrade the reporting performance of the OLAP
system.
6. OLAP provides for distinguishing between zero values and missing values so that
aggregates are computed correctly.
7. OLAP system should ignore all missing values and compute correct aggregate values.
8. OLAP facilitate interactive query and complex analysis for the users.
9. OLAP allows users to drill down for greater detail or roll up for aggregations of
metrics along a single business dimension or across multiple dimensions.
10. OLAP provides the ability to perform intricate calculations and comparisons.
11. OLAP presents results in a number of meaningful ways, including charts and graphs.

Benefits of OLAP

OLAP holds several benefits for businesses: -

1. OLAP helps managers in decision-making through the multidimensional record views


that it is efficient in providing, thus increasing their productivity.
2. OLAP functions are self-sufficient owing to the inherent flexibility of the
underlying organized databases.
3. It facilitates simulation of business models and problems, through extensive
management of analysis-capabilities.
4. In conjunction with data warehouse, OLAP can be used to support a reduction in the
application backlog, faster data retrieval, and reduction in query drag.

What is Multi-Dimensional Data Model?


A multidimensional model views data in the form of a data-cube. A data cube enables data to
be modeled and viewed in multiple dimensions. It is defined by dimensions and facts.

The dimensions are the perspectives or entities concerning which an organization keeps
records. For example, a shop may create a sales data warehouse to keep records of the store's
sales for the dimensions time, item, and location. These dimensions allow the store to keep
track of things, for example, monthly sales of items and the locations at which the items were
sold. Each dimension has a table related to it, called a dimensional table, which describes the
dimension further. For example, a dimensional table for an item may contain the attributes
item_name, brand, and type.

A multidimensional data model is organized around a central theme, for example, sales. This
theme is represented by a fact table. Facts are numerical measures. The fact table contains the
names of the facts or measures of the related dimensional tables.
Consider the data of a shop for items sold per quarter in the city of Delhi. The data is shown
in the table. In this 2D representation, the sales for Delhi are shown for the time dimension
(organized in quarters) and the item dimension (classified according to the types of an item
sold). The fact or measure displayed is rupees_sold (in thousands).

Now, suppose we want to view the sales data with a third dimension. For example, suppose the data
according to time and item, as well as the location, is considered for the cities Chennai,
Kolkata, Mumbai, and Delhi. These 3D data are shown in the table. The 3D data of the table
are represented as a series of 2D tables.
Conceptually, it may also be represented by the same data in the form of a 3D data cube, as
shown in fig:
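A small Python sketch of the same idea follows (figures are invented): the 3D cube can be held as a mapping keyed by (location, time, item), and each 2D table in the series is just the slice for one location.

sales = {
    ('Delhi',   'Q1', 'phone'): 605, ('Delhi',   'Q1', 'computer'): 825,
    ('Delhi',   'Q2', 'phone'): 680, ('Delhi',   'Q2', 'computer'): 952,
    ('Chennai', 'Q1', 'phone'): 530, ('Chennai', 'Q1', 'computer'): 710,
}

# the 2D table for Delhi (time x item) is one slice of the 3D cube
delhi = {(t, i): v for (loc, t, i), v in sales.items() if loc == 'Delhi'}
print(delhi[('Q1', 'phone')])   # 605

# the full 3D view keeps the location key as well; the "series of 2D tables"
# corresponds to grouping the cube's cells by location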

What is Data Cube?


When data is grouped or combined in multidimensional matrices, it is called a data cube. The data
cube method has a few alternative names or a few variants, such as "multidimensional
databases," "materialized views," and "OLAP (On-Line Analytical Processing)."

The general idea of this approach is to materialize certain expensive computations that are
frequently inquired.
For example, a relation with the schema sales (part, supplier, customer, and sale-price) can
be materialized into a set of eight views as shown in fig, where psc indicates a view
consisting of aggregate function value (such as total-sales) computed by grouping three
attributes part, supplier, and customer, p indicates a view composed of the corresponding
aggregate function values calculated by grouping part alone, etc.
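The eight views correspond to the eight subsets of the three grouping attributes, which can be enumerated directly, as in the short Python sketch below (illustrative only).

from itertools import combinations

dimensions = ['part', 'supplier', 'customer']
views = [combo for r in range(len(dimensions), -1, -1)
         for combo in combinations(dimensions, r)]
for view in views:
    # () is the fully aggregated view, i.e. total sales over all groupings
    print(view or ('ALL',))
# prints 2**3 = 8 views: psc, ps, pc, sc, p, s, c and ALL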

A data cube is created from a subset of attributes in the database. Specific attributes are
chosen to be measure attributes, i.e., the attributes whose values are of interest. Other
attributes are selected as dimensions or functional attributes. The measure attributes are
aggregated according to the dimensions.

For example, XYZ may create a sales data warehouse to keep records of the store's sales for
the dimensions time, item, branch, and location. These dimensions enable the store to keep
track of things like monthly sales of items, and the branches and locations at which the items
were sold. Each dimension may have a table associated with it, known as a dimensional table,
which describes the dimension. For example, a dimension table for items may contain the
attributes item_name, brand, and type.

Data cube method is an interesting technique with many applications. Data cubes could be
sparse in many cases because not every cell in each dimension may have corresponding data
in the database.

Techniques should be developed to handle sparse cubes efficiently.

If a query contains constants at even lower levels than those provided in a data cube, it is not
clear how to make the best use of the precomputed results stored in the data cube.

The multidimensional model views data in the form of a data cube. OLAP tools are based on the
multidimensional data model. Data cubes usually model n-dimensional data.

A data cube enables data to be modeled and viewed in multiple dimensions. A


multidimensional data model is organized around a central theme, like sales and transactions.
A fact table represents this theme. Facts are numerical measures. Thus, the fact table contains
measure (such as Rs_sold) and keys to each of the related dimensional tables.
Dimensions are the entities that define a data cube. Facts are generally quantities, which are used
for analyzing the relationships between dimensions.

Example: In the 2-D representation, we will look at the All Electronics sales data for items
sold per quarter in the city of Vancouver. The measure displayed is dollars_sold (in
thousands).

3-Dimensional Cuboids

Suppose we would like to view the sales data with a third dimension. For example,
suppose we would like to view the data according to time and item, as well as the location, for the
cities Chicago, New York, Toronto, and Vancouver. The measure displayed is dollars_sold (in
thousands). These 3-D data are shown in the table. The 3-D data of the table are represented
as a series of 2-D tables.
Conceptually, we may represent the same data in the form of 3-D data cubes, as shown in fig:
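
Here is a small sketch of the "series of 2-D tables" idea, with made-up numbers and column names: each value of the third dimension (location) yields one 2-D quarter-by-item table.

import pandas as pd

# Assumed toy data: measure dollars_sold (in thousands) per quarter, item and location.
sales = pd.DataFrame({
    "quarter":      ["Q1", "Q1", "Q2", "Q2"] * 2,
    "item":         ["TV", "Phone"] * 4,
    "location":     ["Vancouver"] * 4 + ["Toronto"] * 4,
    "dollars_sold": [605, 825, 680, 952, 818, 746, 894, 769],
})

# One 2-D table per location: together they form the 3-D cube.
for location, group in sales.groupby("location"):
    print(f"--- location = {location} ---")
    print(group.pivot_table(index="quarter", columns="item",
                            values="dollars_sold", aggfunc="sum"))
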
Data cube operations:

Data cube operations are used to manipulate data to meet the needs of users. These
operations help to select particular data for analysis. There are mainly five operations,
listed below; a small pandas sketch illustrating them follows the list.
 Roll-up: this operation aggregates similar data attributes along a dimension, moving to a
coarser level of granularity. For example, if the data cube displays the daily income of a
customer, we can use a roll-up operation to find his monthly income.

 Drill-down: this operation is the reverse of roll-up. It takes particular information and
subdivides it further for finer-granularity analysis; it zooms into more detail. For example,
if India is a value in the country dimension and we wish to see villages in India, the
drill-down operation splits India into states, districts, towns, cities, and villages and then
displays the required information.

 Slicing: this operation filters out the unneeded portions of the cube by selecting a single
value along one dimension, when the user needs only a particular attribute rather than
everything in that dimension. For example, country = "Jamaica" will display only the data for
Jamaica and not for the other countries in the country list.

 Dicing: this operation performs a multidimensional cut; it does not cut only one dimension
but can also select a certain range of values along other dimensions. As a result, it produces
a sub-cube of the whole cube (as depicted in the figure). For example, the user wants to see
the annual salary of Jharkhand state employees.

 Pivot: this operation is important from a viewing point of view. It rotates the view of the
data cube without changing the data present in it. For example, if the user is comparing year
versus branch, the pivot operation lets the user change the viewpoint and compare branch
versus item type instead.
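
The sketch below illustrates all five operations on a small, made-up pandas DataFrame; the column names (year, month, branch, country, income) are assumptions, and each operation is only an approximation of what an OLAP engine would do.

import pandas as pd

# Assumed toy data for demonstrating the cube operations.
sales = pd.DataFrame({
    "year":    [2021, 2021, 2021, 2022, 2022, 2022],
    "month":   ["Jan", "Feb", "Mar", "Jan", "Feb", "Mar"],
    "branch":  ["B1", "B1", "B2", "B1", "B2", "B2"],
    "country": ["India", "Jamaica", "India", "India", "Jamaica", "India"],
    "income":  [120, 95, 140, 150, 110, 160],
})

# Roll-up: aggregate monthly income up to the coarser yearly level.
rollup = sales.groupby("year")["income"].sum()

# Drill-down: go back down to the finer year/month level.
drilldown = sales.groupby(["year", "month"])["income"].sum()

# Slice: keep a single value along one dimension, e.g. country = "Jamaica".
slice_view = sales[sales["country"] == "Jamaica"]

# Dice: select ranges along more than one dimension, giving a sub-cube
# (here: year 2022 and the first two months).
dice_view = sales[(sales["year"] == 2022) & (sales["month"].isin(["Jan", "Feb"]))]

# Pivot: rotate the view, e.g. branch vs. year instead of year vs. branch.
pivot_view = sales.pivot_table(index="year", columns="branch",
                               values="income", aggfunc="sum").T

print(rollup, drilldown, slice_view, dice_view, pivot_view, sep="\n\n")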
Advantages of data cubes:

 Multi-dimensional analysis: Data cubes enable multi-dimensional analysis of business
data, allowing users to view data from different perspectives and levels of detail.
 Interactivity: Data cubes provide interactive access to large amounts of data, allowing
users to easily navigate and manipulate the data to support their analysis.
 Speed and efficiency: Data cubes are optimized for OLAP analysis, enabling fast and
efficient querying and aggregation of data.
 Data aggregation: Data cubes support complex calculations and data aggregation,
enabling users to quickly and easily summarize large amounts of data.
 Improved decision-making: Data cubes provide a clear and comprehensive view of
business data, enabling improved decision-making and business intelligence.
 Accessibility: Data cubes can be accessed from a variety of devices and platforms,
making it easy for users to access and analyze business data from anywhere.
 Helps in giving a summarised view of data.
 Data cubes store large data in a simple way.
 Data cube operations provide quick and better analysis.
 Improves the performance of queries on the data.

Disadvantages of data cubes:

 Complexity: OLAP systems can be complex to set up and maintain, requiring
specialized technical expertise.
 Data size limitations: OLAP systems can struggle with very large data sets and may
require extensive data aggregation or summarization.
 Performance issues: OLAP systems can be slow when dealing with large amounts of
data, especially when running complex queries or calculations.
 Data integrity: Inconsistent data definitions and data quality issues can affect the
accuracy of OLAP analysis.
 Cost: OLAP technology can be expensive, especially for enterprise-level solutions, due
to the need for specialized hardware and software.
 Inflexibility: OLAP systems may not easily accommodate changing business needs and
may require significant effort to modify or extend.
