Data Warehousing & Mining BCA V SEM
Unit 1
Data Warehouse
A Data Warehouse (DW) is a relational database that is designed for query and analysis rather than
transaction processing. It includes historical data derived from transaction data from single and multiple
sources.
A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on providing support
for decision-makers for data modeling and analysis.
Data Warehouse is a group of data specific to the entire organization, not only to a particular group of
users.
It is not used for daily operations and transaction processing but used for making decisions.
A Data Warehouse can be viewed as a data system with the following attributes:
o It is a database designed for investigative tasks, using data from various applications.
o It supports a relatively small number of clients with relatively long interactions.
o It includes current and historical data to provide a historical perspective of information.
o Its usage is read-intensive.
o It contains a few large tables.
"Data Warehouse is a subject-oriented, integrated, and time-variant store of information in support of
management's decisions.
Characteristics of Data Warehouse
Subject-Oriented
A data warehouse targets the modeling and analysis of data for decision-makers. Therefore, data
warehouses typically provide a concise and straightforward view around a particular subject, such as
customer, product, or sales, instead of the global organization's ongoing operations. This is done by
excluding data that are not useful concerning the subject and including all data needed by the users to
understand the subject.
Integrated
A data warehouse integrates various heterogeneous data sources like RDBMS, flat files, and online
transaction records. It requires performing data cleaning and integration during data warehousing to
ensure consistency in naming conventions, attributes types, etc., among different data sources.
Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older from a data warehouse. This contrasts with a transaction system, where often only the most current data is kept.
Non-Volatile
The data warehouse is a physically separate data storage, which is transformed from the source operational
RDBMS. The operational updates of data do not occur in the data warehouse, i.e., update, insert, and
delete operations are not performed. It usually requires only two procedures in data accessing: Initial
loading of data and access to data. Therefore, the DW does not require transaction processing, recovery,
and concurrency capabilities, which allows for a substantial speedup of data retrieval. Non-volatile means that, once entered into the warehouse, the data should not change.
Goals of Data Warehousing
o To help reporting as well as analysis
o Maintain the organization's historical information
o Be the foundation for decision making.
Need for Data Warehouse
Data Warehouse is needed for the following reasons:
1) Business User: Business users require a data warehouse to view summarized data from the past.
Since these people are non-technical, the data may be presented to them in an elementary form.
2) Store historical data: A data warehouse is required to store time-variant data from the past, which can be used for various purposes.
3) Make strategic decisions: Some strategies may depend upon the data in the data warehouse, so the data warehouse contributes to making strategic decisions.
4) For data consistency and quality: By bringing data from different sources to a common place, the organization can achieve uniformity and consistency in its data.
5) High response time: Data warehouse has to be ready for somewhat unexpected loads and types of
queries, which demands a significant degree of flexibility and quick response time.
Benefits of Data Warehouse
1. Understand business trends and make better forecasting decisions.
2. Data warehouses are designed to perform well with enormous amounts of data.
3. The structure of data warehouses is more accessible for end-users to navigate, understand, and
query.
4. Queries that would be complex in many normalized databases could be easier to build and
maintain in data warehouses.
5. Data warehousing is an efficient method to manage demand for lots of information from lots of
users.
6. Data warehousing provides the capability to analyze a large amount of historical data.
Statistical Database
A statistical database (SDB) system is a database system that enables its users to retrieve only aggregate
statistics (e.g., sample mean and count) for a subset of the entities represented in the database.
As a statistical database may contain sensitive individual information, such as salary and health records,
generally, users are only allowed to retrieve aggregate statistics for a subset of the entities represented in
the databases. Common aggregate query operators in SQL include SUM, COUNT, MAX, MIN, and
AVERAGE, though more sophisticated statistical measures may also be supported by some database
systems.
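For illustration, here is a hedged SQL sketch assuming a hypothetical Employee table with name, department, and salary columns. A statistical database would answer the aggregate query but refuse (or restrict) the second query, which targets an individual:

-- Permitted: returns only aggregate statistics for a subset of entities
SELECT department, AVG(salary) AS avg_salary, COUNT(*) AS emp_count
FROM Employee
GROUP BY department;

-- Typically refused or restricted: would reveal an individual's sensitive value
SELECT name, salary
FROM Employee
WHERE name = 'A. Kumar';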
Statistical databases pose unique security concerns, which have been the focus of much research. The key security challenge is ensuring that no user is able to infer private information about any individual from the aggregate statistics that are returned. The main differences between a data warehouse and a statistical database are their purpose, how they store data, and how users access the data:
Purpose
A data warehouse's main purpose is to store and analyze data to help with decision-making, while a
statistical database's main purpose is to store data for statistical analysis and reporting.
Data storage
A data warehouse stores data from various sources, including relational databases and transactional
systems, and is designed to handle large amounts of data. A statistical database stores data that is
organized for easy access and manipulation.
Data access
A data warehouse is accessed by a smaller number of people for specific reasons, while a database is
accessed by a larger number of people with broader needs. A statistical database allows users to retrieve
aggregate statistics, such as sample mean and count, for a subset of the database's entities.
Data Mart
A data mart is a data storage system that contains information specific to an organization's business unit.
It contains a small and selected part of the data that the company stores in a larger storage system.
Companies use a data mart to analyze department-specific information more efficiently. It provides
summarized data that key stakeholders can use to quickly make informed decisions.
For example, a company might store data from various sources, such as supplier information, orders,
sensor data, employee information, and financial records in their data warehouse or data lake. However,
the company stores information relevant to, for instance, the marketing department, such as social media
reviews and customer records, in a data mart.
A data mart is a simple form of data warehouse focused on a single subject or line of business. With a
data mart, teams can access data and gain insights faster, because they don’t have to spend time searching
within a more complex data warehouse or manually aggregating data from different sources.
Characteristics of data marts
o Typically built and managed by the enterprise data team, although they can be built and maintained organically by business unit SMEs as well.
o Business group data stewards maintain the data mart, and end users have read-only access: they can query and view tables but cannot modify them, which prevents less technically-savvy users from accidentally deleting or modifying critical business data.
o Typically uses a dimensional model and star schema.
o Contains a curated subset of data from the larger data warehouse. The data is highly structured, having been cleansed and conformed by the enterprise data team to make it easy to understand and query.
o Designed around the unique needs of a particular line of business or use case.
o Users typically query the data using SQL commands.
Independent Data Marts
The second approach is Independent Data Marts (IDM). Here, independent data marts are created first, and then a data warehouse is designed using these multiple independent data marts. Because all the data marts are designed independently in this approach, integration of the data marts is required. It is also termed a bottom-up approach, as the data marts are integrated to develop the data warehouse.
Other than these two categories, one more type exists that is called "Hybrid Data Marts."
Hybrid Data Marts
It allows us to combine input from sources other than a data warehouse. This can be helpful in many situations, especially when ad hoc integrations are needed, such as after a new group or product is added to the organization.
What is Meta Data?
Metadata is data about the data or documentation about the information which is required by the users. In
data warehousing, metadata is one of the essential aspects.
Metadata includes the following:
1. The location and descriptions of warehouse systems and components.
2. Names, definitions, structures, and content of data-warehouse and end-users views.
3. Identification of authoritative data sources.
4. Integration and transformation rules used to populate data.
5. Integration and transformation rules used to deliver information to end-user analytical tools.
6. Subscription information for information delivery to analysis subscribers.
7. Metrics used to analyze warehouse usage and performance.
8. Security authorizations, access control list, etc.
Metadata is used for building, maintaining, managing, and using the data warehouse. Metadata allows users to understand the content and find the data they need.
Several examples of metadata are:
1. A library catalog may be considered metadata. The catalog metadata consists of several predefined components representing specific attributes of a resource, and each component can have one or more values. These components could be the name of the author, the name of the document, the publisher's name, the publication date, and the subject categories to which it belongs.
2. The table of contents and the index in a book may be treated as metadata for the book.
3. Suppose we say that a data item about a person is 80. This must be defined by noting that it is the person's weight and that the unit is kilograms. Therefore, (weight, kilograms) is the metadata about the data value 80.
4. Other examples of metadata are data about the tables and figures in a report or book. A table has a name (e.g., its title), and the column names of the table may be treated as metadata. The figures also have titles or names.
Types of Metadata
Metadata in a data warehouse fall into three major parts:
o Operational Metadata
o Extraction and Transformation Metadata
o End-User Metadata
Operational Metadata
As we know, data for the data warehouse comes from various operational systems of the enterprise. These
source systems include different data structures. The data elements selected for the data warehouse have
various field lengths and data types.
In selecting information from the source systems for the data warehouse, we split records, combine parts of records from different source files, and deal with multiple coding schemes and field lengths. When we deliver information to the end-users, we must be able to tie it back to the source data sets.
Operational metadata contains all of this information about the operational data sources.
Extraction and Transformation Metadata
Extraction and transformation metadata include data about the extraction of data from the source systems,
namely, the extraction frequencies, extraction methods, and business rules for the data extraction. Also,
this category of metadata contains information about all the data transformation that takes place in the
data staging area.
End-User Metadata
The end-user metadata is the navigational map of the data warehouses. It enables the end-users to find
data from the data warehouses. The end-user metadata allows the end-users to use their business
terminology and look for the information in those ways in which they usually think of the business.
Metadata Repository
The metadata itself is housed in and controlled by the metadata repository. Metadata repository management software can be used to map the source data to the target database, integrate and transform the data, generate code for data transformation, and move data to the warehouse.
Benefits of Metadata Repository
1. It provides a set of tools for enterprise-wide metadata management.
2. It eliminates and reduces inconsistency, redundancy, and underutilization.
3. It improves organization control, simplifies management, and accounting of information assets.
4. It increases coordination, understanding, identification, and utilization of information assets.
5. It enforces CASE development standards with the ability to share and reuse metadata.
6. It leverages investment in legacy systems and utilizes existing applications.
7. It provides a relational model for heterogeneous RDBMS to share information.
8. It provides a useful data administration tool to manage corporate information assets with the data dictionary.
9. It increases the reliability, control, and flexibility of the application development process.
Data Cube
A data cube is created from a subset of attributes in the database. Specific attributes are chosen to be measure attributes, i.e., the attributes whose values are of interest. Other attributes are selected as dimensions or functional attributes. The measure attributes are aggregated according to the dimensions.
For example, XYZ may create a sales data warehouse to keep records of the store's sales for the
dimensions time, item, branch, and location. These dimensions enable the store to keep track of things
like monthly sales of items, and the branches and locations at which the items were sold. Each dimension
may have a table associated with it, known as a dimension table, which describes the dimension. For example, a dimension table for items may contain the attributes item_name, brand, and type.
Data cube method is an interesting technique with many applications. Data cubes could be sparse in many
cases because not every cell in each dimension may have corresponding data in the database.
Techniques should be developed to handle sparse cubes efficiently.
If a query contains constants at even lower levels than those provided in a data cube, it is not clear how
to make the best use of the precomputed results stored in the data cube.
The multidimensional data model views data in the form of a data cube. OLAP tools are based on the multidimensional data model. Data cubes usually model n-dimensional data.
A data cube enables data to be modeled and viewed in multiple dimensions. A multidimensional data
model is organized around a central theme, like sales and transactions. A fact table represents this theme.
Facts are numerical measures. Thus, the fact table contains measure (such as Rs_sold) and keys to each
of the related dimensional tables.
Dimensions are the perspectives or entities that define a data cube. Facts are generally numerical quantities, which are used for analyzing the relationships between dimensions.
Example: In the 2-D representation, we will look at the All Electronics sales data for items sold per quarter in the city of Vancouver. The measure displayed is dollars_sold (in thousands).
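As a hedged SQL sketch (the sales, time_dim, item_dim, and location_dim table names and their columns are illustrative assumptions, not part of the example above), the same 2-D view could be produced with an aggregate query:

-- Dollars sold per item per quarter for the city of Vancouver (a 2-D slice)
SELECT t.quarter, i.item_name, SUM(s.dollars_sold) AS dollars_sold
FROM sales s
JOIN time_dim t     ON s.time_key = t.time_key
JOIN item_dim i     ON s.item_key = i.item_key
JOIN location_dim l ON s.location_key = l.location_key
WHERE l.city = 'Vancouver'
GROUP BY t.quarter, i.item_name;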
3-Dimensional Cuboids
Let us suppose we would like to view the sales data with a third dimension. For example, suppose we would like to view the data according to time and item, as well as location, for the cities Chicago, New York, Toronto, and Vancouver. The measure displayed is dollars_sold (in thousands). These 3-D data are shown in the table. The 3-D data of the table are represented as a series of 2-D tables.
Conceptually, we may represent the same data in the form of 3-D data cubes, as shown in fig:
Let us suppose that we would like to view our sales data with an additional fourth dimension, such as a
supplier.
In data warehousing, the data cubes are n-dimensional. The cuboid which holds the lowest level of
summarization is called a base cuboid.
For example, the 4-D cuboid in the figure is the base cuboid for the given time, item, location, and
supplier dimensions.
The figure shows a 4-D data cube representation of sales data according to the dimensions time, item, location, and supplier. The measure displayed is dollars_sold (in thousands).
The topmost 0-D cuboid, which holds the highest level of summarization, is known as the apex cuboid.
In this example, this is the total sales, or dollars sold, summarized over all four dimensions.
The lattice of cuboids forms a data cube. The figure shows the lattice of cuboids making up a 4-D data cube for the dimensions time, item, location, and supplier. Each cuboid represents a different degree of summarization.
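Many relational databases can compute several cuboids of such a lattice in a single statement using GROUP BY CUBE. The sketch below is only illustrative and reuses the hypothetical table names from the earlier sketch:

-- Aggregates for every combination of (quarter, item, city): the 3-D cuboid,
-- all lower-dimensional cuboids, and the apex cuboid (the grand total)
SELECT t.quarter, i.item_name, l.city, SUM(s.dollars_sold) AS dollars_sold
FROM sales s
JOIN time_dim t     ON s.time_key = t.time_key
JOIN item_dim i     ON s.item_key = i.item_key
JOIN location_dim l ON s.location_key = l.location_key
GROUP BY CUBE (t.quarter, i.item_name, l.city);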
Multidimensional Data Model
The multidimensional data model is a method used for organizing data in the database, with a good arrangement and assembly of the database contents.
The multidimensional data model allows users to ask analytical questions associated with market or business trends, unlike relational databases, which allow users to access data only in the form of queries. It lets users receive answers to their requests rapidly, because the data can be created and examined comparatively fast.
OLAP (online analytical processing) and data warehousing use multidimensional databases. They are used to show multiple dimensions of the data to users.
It represents data in the form of data cubes. Data cubes allow the data to be modeled and viewed from many dimensions and perspectives. A data cube is defined by dimensions and facts and is represented by a fact table. Facts are numerical measures, and fact tables contain the measures of the related dimension tables or the names of the facts.
Fact Tables
A fact table is a table in a star schema that contains facts and is connected to the dimensions. A fact table has two types of columns: those that contain facts and those that are foreign keys to the dimension tables. The primary key of a fact table is generally a composite key made up of all of its foreign keys.
A fact table might contain either detail-level facts or facts that have been aggregated (fact tables that contain aggregated facts are often called summary tables). A fact table generally contains facts at the same level of aggregation.
Dimension Tables
A dimension is a structure usually composed of one or more hierarchies that categorize data. If a dimension does not have hierarchies and levels, it is called a flat dimension or list. The primary key of each dimension table is part of the composite primary key of the fact table. Dimensional attributes help to define the dimensional values; they are generally descriptive, textual values. Dimension tables are usually much smaller in size than fact tables.
Fact tables store data about sales, while dimension tables store data about the geographic region (markets, cities), clients, products, times, and channels.
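A minimal DDL sketch of such a star schema is given below. All table and column names (time_dim, item_dim, location_dim, sales_fact) are assumptions made for illustration; the fact table carries the numeric measures plus a foreign key to every dimension, and its primary key is the composite of those foreign keys:

-- Dimension tables: denormalized, descriptive attributes
CREATE TABLE time_dim (
  time_key      INT PRIMARY KEY,
  day_of_month  INT,
  month_name    VARCHAR(10),
  quarter_name  VARCHAR(2),
  calendar_year INT
);

CREATE TABLE item_dim (
  item_key  INT PRIMARY KEY,
  item_name VARCHAR(100),
  brand     VARCHAR(50),
  type      VARCHAR(50)
);

CREATE TABLE location_dim (
  location_key INT PRIMARY KEY,
  city         VARCHAR(50),
  region       VARCHAR(50),
  country      VARCHAR(50)
);

-- Fact table: numeric measures plus foreign keys to each dimension
CREATE TABLE sales_fact (
  time_key     INT REFERENCES time_dim(time_key),
  item_key     INT REFERENCES item_dim(item_key),
  location_key INT REFERENCES location_dim(location_key),
  rs_sold      DECIMAL(12,2),
  units_sold   INT,
  PRIMARY KEY (time_key, item_key, location_key)  -- composite key built from the foreign keys
);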
Characteristics of Star Schema
The star schema is intensely suitable for data warehouse database design because of the following
features:
o It creates a denormalized database that can quickly provide query responses.
o It provides a flexible design that can be changed easily or added to throughout the development
cycle, and as the database grows.
o It provides a parallel in design to how end-users typically think of and use the data.
o It reduces the complexity of metadata for both developers and end-users.
Query Performance
Because a star schema database has a limited number of tables and clear join paths, queries run faster than they do against OLTP systems. Small single-table queries, frequently against a dimension table, are almost instantaneous. Large join queries that contain multiple tables take only seconds or minutes to run.
In a star schema database design, the dimensions are connected only through the central fact table. When two dimension tables are used in a query, only one join path, intersecting the fact table, exists between those two tables. This design feature enforces accurate and consistent query results.
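For example, a hedged sketch of a star-join query over the hypothetical sales_fact, item_dim, and time_dim tables from the earlier sketch: the two dimension tables meet only through the fact table, so there is exactly one join path:

-- Total units sold per brand per quarter; item_dim and time_dim
-- intersect only through sales_fact
SELECT i.brand, t.quarter_name, SUM(f.units_sold) AS total_units
FROM sales_fact f
JOIN item_dim i ON f.item_key = i.item_key
JOIN time_dim t ON f.time_key = t.time_key
GROUP BY i.brand, t.quarter_name;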
Load performance and administration
Structural simplicity also decreases the time required to load large batches of records into a star schema database. By defining facts and dimensions and separating them into different tables, the impact of a load operation is reduced. Dimension tables can be populated once and occasionally refreshed. We can add new facts regularly and selectively by appending records to the fact table.
Built-in referential integrity
A star schema has referential integrity built-in when information is loaded. Referential integrity is
enforced because each record in the dimension tables has a unique primary key, and all keys in the fact table are legitimate foreign keys drawn from the dimension tables. A record in the fact table that is not related correctly to a dimension cannot be given the correct key value to be retrieved.
Easily Understood
A star schema is simple to understand and navigate, with dimensions joined only through the fact table.
These joins are more significant to the end-user because they represent the fundamental relationship
between parts of the underlying business. Users can also browse dimension table attributes before constructing a query.
In this scenario, the SALES table contains only four columns with IDs from the dimension tables, TIME,
ITEM, BRANCH, and LOCATION, instead of four columns for time data, four columns for ITEM data,
three columns for BRANCH data, and four columns for LOCATION data. Thus, the size of the fact table
is significantly reduced. When we need to change an item, we need only make a single change in the
dimension table, instead of making many changes in the fact table.
We can create even more complex star schemas by normalizing a dimension table into several tables. The
normalized dimension table is called a Snowflake.
Example: Figure shows a snowflake schema with a Sales fact table, with Store, Location, Time, Product,
Line, and Family dimension tables. The Market dimension has two dimension tables with Store as the
primary dimension table, and Location as the outrigger dimension table. The product dimension has three
dimension tables with Product as the primary dimension table, and the Line and Family table are the
outrigger dimension tables.
A star schema stores all attributes for a dimension in one denormalized table. This needs more disk space than a more normalized snowflake schema. Snowflaking normalizes the dimension by moving attributes with low cardinality into separate dimension tables that relate to the core dimension table by using foreign keys. Snowflaking for the sole purpose of minimizing disk space is not recommended, because it can adversely impact query performance.
In a snowflake schema, tables are normalized to remove redundancy. In a snowflake schema, dimension tables are decomposed into multiple dimension tables.
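As a hedged DDL sketch of snowflaking the hypothetical item dimension used earlier (names are illustrative only), the low-cardinality brand attributes move into a separate table that the core dimension references through a foreign key:

-- Outrigger table holding the low-cardinality attributes
CREATE TABLE brand_dim (
  brand_key  INT PRIMARY KEY,
  brand_name VARCHAR(50),
  supplier   VARCHAR(100)
);

-- Core dimension table now references the outrigger instead of repeating its values
CREATE TABLE item_dim_snow (
  item_key  INT PRIMARY KEY,
  item_name VARCHAR(100),
  type      VARCHAR(50),
  brand_key INT REFERENCES brand_dim(brand_key)
);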
The figure shows a simple star schema for sales in a manufacturing company. The sales fact table includes quantity, price, and other relevant metrics. SALESREP, CUSTOMER, PRODUCT, and TIME are the dimension tables.
The star schema for sales, as shown above, contains only five tables, whereas the normalized version extends to eleven tables. Notice that in the snowflake schema, the attributes with low cardinality in each original dimension table are removed to form separate tables. These new tables are connected back to the original dimension table through artificial keys.
A snowflake schema is designed for flexible querying across more complex dimensions and relationships. It is suitable for many-to-many and one-to-many relationships between dimension levels.
Advantage of Snowflake Schema
1. The primary advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and joins against smaller lookup tables.
2. It provides greater scalability in the interrelationship between dimension levels and components.
3. No redundancy, so it is easier to maintain.
Disadvantage of Snowflake Schema
1. The primary disadvantage of the snowflake schema is the additional maintenance effort required due to the increased number of lookup tables.
2. Queries are more complex and hence more difficult to understand.
3. More tables mean more joins, and therefore longer query execution time.
Difference between Star and Snowflake Schemas
Star Schema
o In a star schema, the fact table is at the center and is connected to the dimension tables.
o The tables are completely denormalized in structure.
o SQL query performance is good, as fewer joins are involved.
o Data redundancy is high, so it occupies more disk space.
Snowflake Schema
o A snowflake schema is an extension of the star schema in which the dimension tables are connected to one or more additional dimension tables.
o The tables are partially denormalized in structure.
o The performance of SQL queries is somewhat lower compared to the star schema, as more joins are involved.
o Data redundancy is low, and it occupies less disk space compared to the star schema.
Fact Constellation Schema
A fact constellation schema, also known as a galaxy schema or multi-fact star schema, is a sophisticated database design in which multiple fact tables share dimension tables; because of this, information is more difficult to summarize. A fact constellation schema can be implemented between aggregate fact tables or by decomposing a complex fact table into independent, simpler fact tables.
Example: A fact constellation schema is shown in the figure below.
This schema defines two fact tables, sales and shipping. Sales are modeled along four dimensions, namely time, item, branch, and location. The schema contains a fact table for sales that includes keys to each of the four dimensions, along with two measures: Rupee_sold and units_sold. The shipping table has five dimensions, or keys (item_key, time_key, shipper_key, from_location, and to_location), and two measures: Rupee_cost and units_shipped.
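A hedged DDL sketch of the shipping fact table is shown below; it assumes the item_dim and time_dim tables from the earlier star schema sketch, and the shipper and location keys are left as plain columns that would reference their own dimension tables:

-- Second fact table sharing the conformed time and item dimensions
CREATE TABLE shipping_fact (
  item_key          INT REFERENCES item_dim(item_key),
  time_key          INT REFERENCES time_dim(time_key),
  shipper_key       INT,            -- would reference a shipper dimension table
  from_location_key INT,            -- would reference the location dimension table
  to_location_key   INT,            -- would reference the location dimension table
  rupee_cost        DECIMAL(12,2),
  units_shipped     INT
);
-- Together, the sales fact table and shipping_fact form the constellation:
-- each keeps its own measures but joins to the shared dimension tables.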
The primary disadvantage of the fact constellation schema is that it is a more challenging design because
many variants for specific kinds of aggregation must be considered and selected.
UNIT 2
Data Warehouse Architecture
A data warehouse architecture is a method of defining the overall architecture of data communication, processing, and presentation that exists for end-client computing within the enterprise. Each data
warehouse is different, but all are characterized by standard vital components.
Production applications such as payroll, accounts payable, product purchasing, and inventory control are designed for online transaction processing (OLTP). Such applications gather detailed data from day-to-day operations.
Data Warehouse applications are designed to support users' ad hoc data requirements, an activity
recently dubbed online analytical processing (OLAP). These include applications such as forecasting,
profiling, summary reporting, and trend analysis.
Production databases are updated continuously, either by hand or via OLTP applications. In contrast, a warehouse database is updated from operational systems periodically, usually during off-hours. As OLTP data accumulates in production databases, it is regularly extracted, filtered, and then loaded into a dedicated warehouse server that is accessible to users. As the warehouse is populated, it must be restructured: tables are denormalized, data is cleansed of errors and redundancies, and new fields and keys are added to reflect the needs of users for sorting, combining, and summarizing data.
Data warehouses and their architectures vary depending upon the elements of an organization's situation.
Three common architectures are:
o Data Warehouse Architecture: Basic
o Data Warehouse Architecture: With Staging Area
o Data Warehouse Architecture: With Staging Area and Data Marts
Data Warehouse Architecture: Basic
Operational System
An operational system is a method used in data warehousing to refer to a system that is used to process
the day-to-day transactions of an organization.
Flat Files
A Flat file system is a system of files in which transactional data is stored, and every file in the system
must have a different name.
Meta Data
A set of data that defines and gives information about other data.
Metadata is used in a data warehouse for a variety of purposes, including:
Metadata summarizes necessary information about data, which can make finding and working with particular instances of data easier. For example, author, date created, date modified, and file size are examples of very basic document metadata.
Metadata is used to direct a query to the most appropriate data source.
Lightly and highly summarized data
This area of the data warehouse stores all the predefined lightly and highly summarized (aggregated) data generated by the warehouse manager.
The goal of the summarized information is to speed up query performance. The summarized records are updated continuously as new information is loaded into the warehouse.
End-User access Tools
The principal purpose of a data warehouse is to provide information to the business managers for strategic
decision-making. These customers interact with the warehouse using end-client access tools.
The examples of some of the end-user access tools can be:
o Reporting and Query Tools
o Application Development Tools
o Executive Information Systems Tools
o Online Analytical Processing Tools
o Data Mining Tools
Data Warehouse Architecture: With Staging Area
We must clean and process operational information before putting it into the warehouse.
We can do this programmatically, although most data warehouses use a staging area (a place where data is processed before entering the warehouse).
A staging area simplifies data cleansing and consolidation for operational data coming from multiple
source systems, especially for enterprise data warehouses where all relevant data of an enterprise is
consolidated.
The data warehouse staging area is a temporary location to which records from the source systems are copied.
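A hedged SQL sketch of moving cleansed records from a hypothetical staging table (stg_sales) into a warehouse fact table; the column names and the cleansing rules are purely illustrative:

-- Load only consistent rows from the staging area into the warehouse
INSERT INTO sales_fact (time_key, item_key, location_key, rs_sold, units_sold)
SELECT stg.time_key,
       stg.item_key,
       stg.location_key,
       CAST(stg.amount AS DECIMAL(12,2)),   -- convert to the warehouse data type
       stg.quantity
FROM   stg_sales stg
WHERE  stg.quantity > 0                      -- simple cleansing rule
  AND  stg.item_key IS NOT NULL;             -- drop records that cannot be integrated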
Properties of Data Warehouse Architectures
The following architecture properties are necessary for a data warehouse system:
1. Separation: Analytical and transactional processing should be kept apart as much as possible.
2. Scalability: Hardware and software architectures should be easy to upgrade as the data volume, which has to be managed and processed, and the users' requirements, which have to be met, progressively increase.
3. Extensibility: The architecture should be able to perform new operations and technologies without
redesigning the whole system.
4. Security: Monitoring accesses is necessary because of the strategic data stored in the data warehouse.
5. Administerability: Data Warehouse management should not be complicated.
Types of Data Warehouse Architectures
Single-Tier Architecture
Single-tier architecture is not frequently used in practice. Its purpose is to minimize the amount of data stored; to reach this goal, it removes data redundancies.
The figure shows the only layer physically available is the source layer. In this method, data warehouses
are virtual. This means that the data warehouse is implemented as a multidimensional view of operational
data created by specific middleware, or an intermediate processing layer.
The vulnerability of this architecture lies in its failure to meet the requirement for separation between
analytical and transactional processing. Analysis queries are applied to operational data after the middleware interprets them. In this way, queries affect transactional workloads.
Two-Tier Architecture
The requirement for separation plays an essential role in defining the two-tier architecture for a data
warehouse system, as shown in fig:
Although it is typically called a two-layer architecture to highlight the separation between physically available sources and the data warehouse, it in fact consists of four subsequent data flow stages:
1. Source layer: A data warehouse system uses heterogeneous sources of data. That data is stored initially in corporate relational databases or legacy databases, or it may come from information systems outside the corporate walls.
2. Data Staging: The data stored in the sources should be extracted, cleansed to remove
inconsistencies and fill gaps, and integrated to merge heterogeneous sources into one standard
schema. The so-named Extraction, Transformation, and Loading Tools (ETL) can combine
heterogeneous schemata, extract, transform, cleanse, validate, filter, and load source data into a
data warehouse.
3. Data Warehouse layer: Information is stored in one logically centralized repository: the data warehouse. The data warehouse can be accessed directly, but it can also be used as a source
for creating data marts, which partially replicate data warehouse contents and are designed for
specific enterprise departments. Meta-data repositories store information on sources, access
procedures, data staging, users, data mart schema, and so on.
4. Analysis: In this layer, integrated data is efficiently and flexibly accessed to issue reports, dynamically analyze information, and simulate hypothetical business scenarios. It should feature aggregate information navigators, complex query optimizers, and user-friendly GUIs.
Three-Tier Architecture
The three-tier architecture consists of the source layer (containing multiple source systems), the reconciled layer, and the data warehouse layer (containing both data warehouses and data marts). The reconciled layer sits between the source data and the data warehouse.
The main advantage of the reconciled layer is that it creates a standard reference data model for a whole
enterprise. At the same time, it separates the problems of source data extraction and integration from those
of data warehouse population. In some cases, the reconciled layer is also used directly to better accomplish some operational tasks, such as producing daily reports that cannot be satisfactorily prepared using the corporate applications, or generating data flows to feed external processes periodically so that they benefit from cleaning and integration.
This architecture is especially useful for extensive, enterprise-wide systems. A disadvantage of this structure is the extra storage space used by the redundant reconciled layer. It also moves the analytical tools a little further away from being real-time.
What is OLTP?
OLTP stands for Online Transaction Processing, and its primary objective is the processing of data. An OLTP system administers the day-to-day transactions of an organization, typically under a three-tier architecture, with data stored in normalized (usually 3NF) form.
Each of these transactions involves individual records made up of multiple fields. The main emphasis of
OLTP is fast query processing and data integrity in multi-access environments. Some OLTP examples
are credit card activity, order entry, and ATM transactions.
OLTP Example
The ATM centre is an example of an OLTP system. Assume that a couple has a joint bank account. One day, they arrive at different ATMs simultaneously, and each wants to withdraw the whole amount from the joint account. The OLTP system must process these concurrent transactions correctly so that the account cannot be overdrawn: only one withdrawal of the full amount can succeed.
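A hedged sketch of one such withdrawal as a short OLTP transaction is shown below; the accounts table, column names, and amounts are made up, and the exact transaction syntax varies slightly between databases:

-- One ATM withdrawal as a short, atomic OLTP transaction
BEGIN TRANSACTION;

UPDATE accounts
SET    balance = balance - 5000
WHERE  account_no = '1234567890'
  AND  balance >= 5000;   -- the debit is applied only if funds are sufficient

COMMIT;                   -- if no row was updated, the application rolls back instead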
OLTP vs OLAP: Differences
The main difference between OLTP vs OLAP is that OLTP is operational, whereas OLAP is
informational.
Here is a list of key OLTP vs OLAP features that illustrates both their differences and how they need to work together.
o Challenge: OLTP - Data warehouses can be expensive to build. OLAP - Strong technical knowledge and experience are required.
o Design: OLTP - Designed to have fast processing and low redundancy. OLAP - Designed uniquely to integrate different data sources to build a consolidated database.
o Operations: OLTP - INSERT, DELETE, and UPDATE commands. OLAP - SELECT command.
o Updates: OLTP - Short and fast updates. OLAP - Updates are scheduled and done periodically.
o No. of users: OLTP - Thousands of users allowed at a time. OLAP - Only a few users allowed at a time.
Types of OLAP
There are three main types of OLAP servers:
o Relational OLAP (ROLAP)
o Multidimensional OLAP (MOLAP)
o Hybrid OLAP (HOLAP)
Relational OLAP (ROLAP) Server
ROLAP architecture includes the following components:
o Database server.
o ROLAP server.
o Front-end tool.
Relational OLAP (ROLAP) is the latest and fastest-growing OLAP technology segment in the market.
This method allows multiple multidimensional views of two-dimensional relational tables to be created,
avoiding structuring record around the desired view.
Some products in this segment have supported strong SQL engines to handle the complexity of multidimensional analysis. This includes creating multiple SQL statements to handle user requests, being 'RDBMS' aware, and being capable of generating SQL statements tuned to the optimizer of the DBMS engine.
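For illustration, a hedged example of the kind of SQL a ROLAP engine might generate for a "total sales by region by quarter" request, reusing the hypothetical star schema tables sketched earlier (no specific product's output is implied):

-- Multidimensional request translated into ordinary relational SQL
SELECT l.region, t.quarter_name, SUM(f.rs_sold) AS total_sales
FROM sales_fact f
JOIN location_dim l ON f.location_key = l.location_key
JOIN time_dim t     ON f.time_key = t.time_key
GROUP BY l.region, t.quarter_name
ORDER BY l.region, t.quarter_name;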
Advantages
Can handle large amounts of information: The data size limitation of ROLAP technology depends on the data size of the underlying RDBMS, so ROLAP itself does not restrict the amount of data.
Can leverage functionalities inherent in the relational database: The RDBMS already comes with a lot of features, so ROLAP technologies, which work on top of the RDBMS, can leverage these functionalities.
Disadvantages
Performance can be slow: Because each ROLAP report is essentially an SQL query (or multiple SQL queries) against the relational database, query time can be long if the underlying data size is large.
Limited by SQL functionalities: ROLAP technology relies on generating SQL statements to query the relational database, and SQL statements do not suit all needs.
MOLAP Architecture
MOLAP Architecture includes the following components
o Database server.
o MOLAP server.
o Front-end tool.
MOLAP structure primarily reads the precompiled data. MOLAP structure has limited capabilities to
dynamically create aggregations or to evaluate results which have not been pre-calculated and stored.
Applications requiring iterative and comprehensive time-series analysis of trends are well suited for
MOLAP technology (e.g., financial analysis and budgeting).
Examples include Arbor Software's Essbase, Oracle's Express Server, Pilot Software's Lightship Server, Sinper's TM/1, Planning Science's Gentium, and Kenan Technology's Multiway.
Some of the problems faced by clients are related to maintaining support for multiple subject areas in an RDBMS. Some vendors solve these problems by enabling access from MOLAP tools to detailed data in an RDBMS.
This can be very useful for organizations with performance-sensitive multidimensional analysis
requirements and that have built or are in the process of building a data warehouse architecture that
contains multiple subject areas.
An example would be the creation of sales data measured by several dimensions (e.g., product and sales
region) to be stored and maintained in a persistent structure. This structure would be provided to reduce
the application overhead of performing calculations and building aggregation during initialization. These
structures can be automatically refreshed at predetermined intervals established by an administrator.
Advantages
Excellent Performance: A MOLAP cube is built for fast information retrieval, and is optimal for slicing
and dicing operations.
Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence, complex calculations are not only possible, but they also return quickly.
Disadvantages
Limited in the amount of information it can handle: Because all calculations are performed when the
cube is built, it is not possible to contain a large amount of data in the cube itself.
Requires additional investment: Cube technology is generally proprietary and does not already exist in
the organization. Therefore, to adopt MOLAP technology, chances are other investments in human and
capital resources are needed.
Hybrid OLAP (HOLAP) Server
HOLAP incorporates the best features of MOLAP and ROLAP into a single architecture. HOLAP
systems save more substantial quantities of detailed data in the relational tables while the aggregations
are stored in the pre-calculated cubes. HOLAP also can drill through from the cube down to the relational
tables for delineated data. The Microsoft SQL Server 2000 provides a hybrid OLAP server.
Advantages of HOLAP
1. HOLAP provide benefits of both MOLAP and ROLAP.
2. It provides fast access at all levels of aggregation.
3. HOLAP balances the disk space requirement, as it only stores the aggregate information on the
OLAP server and the detail record remains in the relational database. So no duplicate copy of the
detail record is maintained.
Disadvantages of HOLAP
1. HOLAP architecture is very complicated because it supports both MOLAP and ROLAP servers.
Other Types
There are also less popular types of OLAP that one may stumble upon every so often. Some of the less common variants in the OLAP industry are listed below.
Web-Enabled OLAP (WOLAP) Server
WOLAP refers to an OLAP application that is accessible via a web browser. Unlike traditional
client/server OLAP applications, WOLAP is considered to have a three-tiered architecture which consists
of three components: a client, a middleware, and a database server.
Desktop OLAP (DOLAP) Server
DOLAP permits a user to download a section of the data from the database or source, and work with that
dataset locally, or on their desktop.
Mobile OLAP (MOLAP) Server
Mobile OLAP enables users to access and work on OLAP data and applications remotely through the use
of their mobile devices.
Spatial OLAP (SOLAP) Server
SOLAP includes the capabilities of both Geographic Information Systems (GIS) and OLAP into a single
user interface. It facilitates the management of both spatial and non-spatial data.
The middle tier consists of an OLAP server for fast querying of the data warehouse. The OLAP server is implemented using either
(1) a Relational OLAP (ROLAP) model, i.e., an extended relational DBMS that maps operations on multidimensional data to standard relational operations, or
(2) a Multidimensional OLAP (MOLAP) model, i.e., a special-purpose server that directly implements multidimensional data and operations.
The top tier contains front-end tools for displaying results provided by OLAP, as well as additional tools for data mining of the OLAP-generated data.
The overall Data Warehouse Architecture is shown in fig:
The metadata repository stores information that defines DW objects. It includes the following
parameters and information for the middle and the top-tier applications:
1. A description of the DW structure, including the warehouse schema, dimension, hierarchies, data
mart locations, and contents, etc.
2. Operational metadata, which usually describes the currency level of the stored data, i.e., active,
archived or purged, and warehouse monitoring information, i.e., usage statistics, error reports,
audit, etc.
3. System performance data, which includes indices, used to improve data access and retrieval
performance.
4. Information about the mapping from operational databases, which provides source RDBMSs and
their contents, cleaning and transformation rules, etc.
5. Summarization algorithms, predefined queries, and reports business data, which include business
terms and definitions, ownership information, etc.
Load Performance
Data warehouses require incremental loading of new data on a periodic basis within narrow time windows; performance of the load process should be measured in hundreds of millions of rows and gigabytes per hour, and the load must not artificially constrain the volume of data the business requires.
Load Processing
Many phases must be taken to load new or update data into the data warehouse, including data conversion,
filtering, reformatting, indexing, and metadata update.
Data Quality Management
Fact-based management demands the highest data quality. The warehouse ensures local consistency,
global consistency, and referential integrity despite "dirty" sources and massive database size.
Query Performance
Fact-based management must not be slowed by the performance of the data warehouse RDBMS; large,
complex queries must be complete in seconds, not days.
Terabyte Scalability
Data warehouse sizes are growing at astonishing rates. Today they range from a few gigabytes to hundreds of gigabytes, and even to terabytes.
Data warehouse manager
System management is mandatory for the successful implementation of a data warehouse. The most
important system managers are −
• System configuration manager
• System scheduling manager
• System event manager
• System database manager
• System backup recovery manager
System Configuration Manager
• The system configuration manager is responsible for the management of the setup and
configuration of data warehouse.
• The structure of configuration manager varies from one operating system to another.
• In Unix, the structure of the configuration manager varies from vendor to vendor.
• Configuration managers have a single user interface.
• The interface of the configuration manager allows us to control all aspects of the system.
Note − The most important configuration tool is the I/O manager.
System Scheduling Manager
System Scheduling Manager is responsible for the successful implementation of the data warehouse. Its
purpose is to schedule ad hoc queries. Every operating system has its own scheduler with some form of
batch control mechanism. The list of features a system scheduling manager must have is as follows −
• Work across cluster or MPP boundaries
• Deal with international time differences
• Handle job failure
• Handle multiple queries
• Support job priorities
• Restart or re-queue the failed jobs
• Notify the user or a process when job is completed
• Maintain the job schedules across system outages
• Re-queue jobs to other queues
• Support the stopping and starting of queues
• Log Queued jobs
• Deal with inter-queue processing
Note − The above list can be used as evaluation parameters for the evaluation of a good scheduler.
Some important jobs that a scheduler must be able to handle are as follows −
• Daily and ad hoc query scheduling
• Execution of regular report requirements
• Data load
• Data processing
• Index creation
• Backup
• Aggregation creation
• Data transformation
Note − If the data warehouse is running on a cluster or MPP architecture, then the system scheduling
manager must be capable of running across the architecture.
System Event Manager
The event manager is a kind of software that manages the events defined on the data warehouse system. We cannot manage the data warehouse manually because the structure of a data warehouse is very complex. Therefore, we need a tool that automatically handles all the events without any intervention from the user.
Note − The Event manager monitors the events occurrences and deals with them. The event manager
also tracks the myriad of things that can go wrong on this complex data warehouse system.
Events
Events are the actions that are generated by the user or the system itself. It may be noted that an event is a measurable, observable occurrence of a defined action.
Given below is a list of common events that are required to be tracked.
• Hardware failure
• Running out of space on certain key disks
• A process dying
• A process returning an error
• CPU usage exceeding an 80% threshold
• Internal contention on database serialization points
• Buffer cache hit ratios exceeding or falling below a threshold
• A table reaching the maximum of its size
• Excessive memory swapping
• A table failing to extend due to lack of space
• Disk exhibiting I/O bottlenecks
• Usage of temporary or sort area reaching a certain threshold
• Any other database shared memory usage
The most important thing about events is that they should be capable of executing on their own. Event
packages define the procedures for the predefined events. The code associated with each event is known as an event handler. This code is executed whenever an event occurs.
System and Database Manager
The system manager and the database manager may be two separate pieces of software, but they do the same job. The
objective of these tools is to automate certain processes and to simplify the execution of others. The
criteria for choosing a system and the database manager are as follows −
• monitor and report on space usage
• clean up old and unused file directories
• add or expand space.
System Backup Recovery Manager
The backup and recovery tool makes it easy for operations and management staff to back-up the data.
Note that the system backup manager must be integrated with the schedule manager software being
used. The important features that are required for the management of backups are as follows −
• Scheduling
• Backup data tracking
• Database awareness
Backups are taken only to protect against data loss. Following are the important points to remember −
• The backup software will keep some form of database of where and when the piece of data was
backed up.
• The backup recovery manager must have a good front-end to that database.
• The backup recovery software should be database aware.
• Being aware of the database, the software then can be addressed in database terms, and will not
perform backups that would not be viable.
Process managers are responsible for maintaining the flow of data both into and out of the data
warehouse. There are three different types of process managers −
• Load manager
• Warehouse manager
• Query manager
Data Warehouse Load Manager
Load manager performs the operations required to extract and load the data into the database. The size
and complexity of a load manager varies between specific solutions from one data warehouse to
another.
Load Manager Architecture
The load manager performs the following functions −
• Extract data from the source system.
• Fast-load the extracted data into a temporary data store.
• Perform simple transformations into a structure similar to the one in the data warehouse.
While loading, it may be required to perform simple transformations. After completing the simple transformations, we can do complex checks. Suppose we are loading EPOS sales transactions; we need to perform the following checks (a sketch of these transformations follows the list) −
• Strip out all the columns that are not required within the warehouse.
• Convert all the values to the required data types.
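A hedged sketch of these simple transformations while copying EPOS records from a temporary store into the warehouse (the tmp_epos_load and epos_sales tables and their columns are hypothetical):

-- Strip unneeded columns and convert values to the required data types
INSERT INTO epos_sales (store_key, item_key, sale_date, amount)
SELECT CAST(t.store_id  AS INT),
       CAST(t.item_id   AS INT),
       CAST(t.sale_date AS DATE),         -- convert text dates to DATE
       CAST(t.amount    AS DECIMAL(10,2))
FROM   tmp_epos_load t;                   -- columns not listed are simply dropped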
Warehouse Manager
The warehouse manager is responsible for the warehouse management process. It consists of a third-
party system software, C programs, and shell scripts. The size and complexity of a warehouse manager
varies between specific solutions.
Note − A warehouse Manager analyzes query profiles to determine whether the index and aggregations
are appropriate.
Query Manager
The query manager is responsible for directing the queries to suitable tables. By directing the queries to
appropriate tables, it speeds up the query request and response process. In addition, the query manager
is responsible for scheduling the execution of the queries posted by the user.
Query scheduling via third-party software is also one of the functions performed by the query manager.
Data is typically stored in a data warehouse through an extract, transform and load
(ETL) process. The information is extracted from the source, transformed into high-quality data
and then loaded into the warehouse. Businesses perform this process on a regular basis to keep
data updated and prepared for the next step.
When an organization is ready to use its data for analytics or reporting, the focus shifts from data
warehousing to business intelligence (BI) tools. BI technologies like visual analytics and data
exploration help organizations glean important insights from their business data. On the back
end, it’s important to understand how the data warehouse architecture organizes data and how
the database execution model optimizes queries – so developers can write data applications with
reasonably high performance.
In addition to a traditional data warehouse and ETL process, many organizations use a variety of
other methods, tools and techniques for their workloads. For example:
• Data pipelines can be used to populate cloud data warehouses, which can be fully managed by
the organization or by the cloud provider.
• Continuously streaming data can be stored in a cloud data warehouse.
• A centralized data catalog is helpful in uniting metadata, making it easier to find data and track
its lineage.
• Data warehouse automation tools get new data into warehouses faster.
• Data virtualization solutions create a logical data warehouse so users can view the data from
their choice of tools.
• Online analytical processing (OLAP) is a way of representing data that has been summarized
into multidimensional views and hierarchies. When used with an integrated ETL process, it
allows business users to get reports without IT assistance.
• An operational data store (ODS) holds a subset of near-real-time data that’s used for operational
reporting or notifications
The term data warehouse is used to distinguish a database that is used for business analysis
(OLAP) rather than transaction processing (OLTP). While an OLTP database contains current
low-level data and is typically optimized for the selection and retrieval of records, a data
warehouse typically contains aggregated historical data and is optimized for particular types of
analyses, depending upon the client applications.
The contents of your data warehouse depend on the requirements of your users. They should be
able to tell you what type of data they want to view and at what levels of aggregation they want
to be able to view it.
Your data warehouse will store these types of data:
• Historical data
• Derived data
• Metadata
Historical Data--A data warehouse typically contains several years of historical data. The
amount of data that you decide to make available depends on available disk space and the types
of analysis that you want to support. This data can come from your transactional database
archives or other sources.Some applications might perform analyses that require data at lower
levels than users typically view it. You will need to check with the application builder or the
application's documentation for those types of data requirements.
Derived Data--Derived data is generated from existing data using a mathematical operation or a
data transformation. It can be created as part of a database maintenance operation or generated at
run-time in response to a query.
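For example, derived data can be defined as a view over an existing table and generated at run time when queried. The sketch below is illustrative, and sales_fact_detail with its revenue and cost columns is a hypothetical table:

-- Profit is derived from existing measures by a simple mathematical operation
CREATE VIEW sales_profit AS
SELECT time_key,
       item_key,
       revenue - cost AS profit
FROM   sales_fact_detail;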
Metadata--Metadata is data that describes the data and schema objects, and is used by
applications to fetch and compute the data correctly. OLAP Catalog metadata is designed
specifically for use with Oracle OLAP. It is required by the Java-based Oracle OLAP API, and
can also be used by SQL-based applications to query the database.
Data warehousing is the process of collecting, storing, and analyzing data from various sources
for business intelligence and decision making. One of the key aspects of data warehousing is
ensuring that the data is properly indexed, which means creating and maintaining structures that
allow fast and efficient access to the data. Indexing can improve the performance of queries,
reports, and analytics, as well as reduce the storage space and maintenance costs of the data
warehouse. In this section, you will learn how to properly index data during data warehousing,
including the types, benefits, and challenges of indexing, and some best practices and tips to
follow.
Types of indexes
When it comes to data warehousing, there are various types of indexes that can be used, depending on
the database system, the data model, and the query requirements. Primary key indexes ensure
uniqueness and identity of each row in a table, based on one or more columns that form the primary key.
These indexes are essential for data integrity and referential integrity, as well as for quickly looking up
individual rows by their key values. Foreign key indexes support relationships between tables, based on
one or more columns that reference the primary key of another table. These indexes are important for
enforcing referential integrity and cascading updates and deletes, as well as facilitating join operations
between tables. Bitmap indexes store values of a column as a series of bits, with each bit representing
the presence or absence of a value in a row. They are efficient for columns with low cardinality and can
be combined using logical operations. B-tree indexes organize values of a column into a balanced tree
structure, where each node has a range of values and a pointer to its children nodes. B-tree indexes are
suitable for columns with high cardinality and can support range queries.
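As a rough, hedged illustration (not tied to any particular warehouse product), the following Python sketch uses the built-in sqlite3 module to create a small fact table and an explicit B-tree index on a foreign-key-style column; the table name, column names, and query are hypothetical, and SQLite only supports B-tree indexes, so bitmap indexes are not shown.

    import sqlite3

    conn = sqlite3.connect(":memory:")  # throwaway in-memory database for the demo
    cur = conn.cursor()

    # Hypothetical fact table; the INTEGER PRIMARY KEY doubles as SQLite's rowid,
    # so lookups by key are already fast.
    cur.execute("""
        CREATE TABLE sales (
            sale_id     INTEGER PRIMARY KEY,
            product_id  INTEGER NOT NULL,
            region      TEXT,
            amount      REAL
        )
    """)

    # Explicit B-tree index on the foreign-key-like column to speed up filters and joins.
    cur.execute("CREATE INDEX idx_sales_product ON sales(product_id)")

    cur.executemany(
        "INSERT INTO sales (product_id, region, amount) VALUES (?, ?, ?)",
        [(1, "East", 120.0), (2, "West", 75.5), (1, "East", 60.0)],
    )

    # EXPLAIN QUERY PLAN shows whether the index is used instead of a full table scan.
    for row in cur.execute(
        "EXPLAIN QUERY PLAN SELECT SUM(amount) FROM sales WHERE product_id = 1"
    ):
        print(row)

    conn.close()

In a real warehouse the same idea applies, but index types and syntax depend on the database system being used.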
Benefits of indexing
Indexing can provide several benefits for data warehousing, such as faster query execution, reduced
storage space, and easier maintenance. Indexes can help the database system to locate the relevant rows
for a query without scanning the entire table, which can save time and resources. Additionally, they can
help to compress the data and eliminate duplicates, reducing the storage space required for the data
warehouse. Bitmap indexes and B-tree indexes are two examples of how this can be achieved.
Furthermore, indexes can help to maintain the consistency and quality of the data by enforcing
constraints and rules of the data model. They can also assist in updating and deleting the data by
cascading changes to related tables and indexes.
Personal Experience: In a large-scale data warehousing project, we had performance issues with real-
time analytics. After implementing selective indexing on frequently queried columns, the
performance improved dramatically. Query times were reduced, and the overall efficiency of the
system increased. Proper indexing can significantly speed up data retrieval and improve user
satisfaction.
Aggregation tasks benefit from the organized structure of indexes, resulting in efficient computation
of summary statistics. Without indexes, the database might need to perform full table scans to locate
desired data. Indexes reduce the number of input/output (I/O) operations needed to retrieve data.
Challenges of indexing
Indexing can also present some challenges when it comes to data warehousing, such as increased
complexity and the need for careful planning, testing, and tuning. Furthermore, indexes can affect the
performance of other operations like data loading, transformation, and cleansing. Trade-offs between the benefits and costs of different types and configurations of indexes must also be taken into consideration, depending on the data characteristics, query patterns, and business objectives. It's important to note that indexing needs may change over time as the data and requirements evolve, thus requiring periodic reevaluation and adjustment.
Insight: While indexing can offer significant benefits, it also comes with challenges. Over-indexing
can lead to increased storage requirements and slower write operations. In one of my experiences,
excessive indexes on a transactional table significantly slowed down the data ingestion process. It's
crucial to strike a balance and continuously monitor the impact of indexes on both read and write
operations.
Transformations that involve significant changes to indexed columns may require careful
consideration and potential adjustments to indexes.
Consolidation process stages
Warehouse consolidation tasks are carried out in specific areas of the facility and can be broken down
into the following phases:
• Product receipt and sorting. At the warehouse docking areas, products and raw materials are
received from different suppliers or customers, documentation is checked, and items are sorted
in a temporary storage zone.
• Stock storage. Once the goods are entered in the warehouse management software, the
operators store the products according to the criteria and rules established in advance by the
logistics manager.
• Product handling. To facilitate its distribution, the stock has to go through the weighing and
packing processes, among others.
• Order grouping. The goods are combined in a single batch so that they can be dispatched as a
grouped shipment. This phase also includes the preparation of the pertinent documentation
detailing, e.g., the orders and unit loads that make up the consolidated load.
• Goods dispatch. Once the shipment is consolidated, the stock is loaded onto the truck, mechanically or automatically, to be delivered to the next recipient in the supply chain or to the end customer.
Materialization in a data warehouse is the process of storing the results of a query in a database
object, called a materialized view, to improve performance and efficiency. Materialized views are often
used to precompute and store aggregated data, such as the sum of sales, or to precompute joins.
Here are some benefits of materialized views:
• Zero maintenance: The system automatically synchronizes data refreshes with data changes in
base tables.
• Always fresh: Materialized views are always consistent with the base table.
• Eliminates overhead: Materialized views eliminate the overhead associated with expensive joins
or aggregations for a large or important class of queries.
Materialized views can be especially beneficial when you need to regularly read a subset of data from a
large dataset. Retrieving the data afresh each time could be time-consuming because you would have to
run the query on the full set of data every time.
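Because syntax differs by database, the hedged Python/sqlite3 sketch below only simulates a materialized view: it stores the result of an aggregate query in a summary table and refreshes it on demand (SQLite has no native CREATE MATERIALIZED VIEW; systems such as Oracle or PostgreSQL provide one). All table and column names are invented for illustration.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    cur.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    cur.executemany("INSERT INTO sales VALUES (?, ?)",
                    [("East", 100.0), ("East", 50.0), ("West", 80.0)])

    def refresh_sales_summary():
        # Recompute the precomputed aggregate, much like refreshing a materialized view.
        cur.execute("DROP TABLE IF EXISTS sales_summary")
        cur.execute("""
            CREATE TABLE sales_summary AS
            SELECT region, SUM(amount) AS total_amount
            FROM sales
            GROUP BY region
        """)

    refresh_sales_summary()

    # Queries now read the small summary table instead of re-aggregating the fact table.
    print(cur.execute("SELECT * FROM sales_summary ORDER BY region").fetchall())
    # expected: [('East', 150.0), ('West', 80.0)]

    conn.close()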
OLAP software locates the intersection of dimensions, such as all products sold in the Eastern region
above a certain price during a certain time period, and displays them. The result is the measure; each
OLAP cube has at least one to perhaps hundreds of measures, which derive from information stored
in fact tables in the data warehouse.
UNIT 3
Data Mining Tutorial
The data mining tutorial provides basic and advanced concepts of data mining. Our data mining tutorial
is designed for learners and experts.
Data mining is the act of automatically searching large stores of information to find trends and patterns that go beyond simple analysis procedures. Data mining utilizes complex mathematical algorithms to segment the data and evaluate the probability of future events. Data mining is also called Knowledge Discovery in Databases (KDD).
Data Mining is a process used by organizations to extract specific data from huge databases to solve business
problems. It primarily turns raw data into useful information.
Data mining is similar to data science: it is carried out by a person, in a specific situation, on a particular data set, with an objective. The process includes various types of services such as text mining, web mining, audio and video mining, pictorial data mining, and social media mining. It is done through software that may be simple or highly specialized. By outsourcing data mining, all the work can be done faster with lower operating costs.
Specialized firms can also use new technologies to collect data that is impossible to locate manually. There
are tonnes of information available on various platforms, but very little knowledge is accessible. The biggest
challenge is to analyze the data to extract important information that can be used to solve a problem or for
company development. There are many powerful instruments and techniques available to mine data and find
better insight from it.
Applications of Data Mining
Data mining is widely used in the following areas:
Data Mining in Healthcare:
Data mining in healthcare has excellent potential to improve the health system. It uses data and analytics for
better insights and to identify best practices that will enhance health care services and reduce costs. Analysts
use data mining approaches such as Machine learning, Multi-dimensional database, Data visualization, Soft
computing, and statistics. Data mining can be used to forecast the number of patients in each category. The procedures ensure that patients get intensive care at the right place and at the right time. Data mining also enables healthcare insurers to detect fraud and abuse.
Data Mining in Market Basket Analysis:
Market basket analysis is a modeling method based on the hypothesis that if you buy a specific group of products, you are more likely to buy another related group of products. This technique may enable the retailer to understand the purchase behavior of a buyer. It may assist the retailer in understanding the requirements of the buyer and altering the store's layout accordingly. Analytical comparisons of results can also be made between various stores and between customers in different demographic groups.
Data mining in Education:
Educational data mining (EDM) is a newly emerging field concerned with developing techniques that discover knowledge from data generated in educational environments. EDM objectives include predicting students' future learning behavior, studying the impact of educational support, and advancing learning science. An institution can use data mining to make precise decisions and to predict students' results. With these results, the institution can concentrate on what to teach and how to teach it.
Data Mining in Manufacturing Engineering:
Knowledge is the best asset possessed by a manufacturing company. Data mining tools can be beneficial to
find patterns in a complex manufacturing process. Data mining can be used in system-level designing to
obtain the relationships between product architecture, product portfolio, and data needs of the customers. It
can also be used to forecast the product development period, cost, and expectations among the other tasks.
Data Mining in CRM (Customer Relationship Management):
Customer Relationship Management (CRM) is about acquiring and retaining customers, enhancing customer loyalty, and implementing customer-oriented strategies. To maintain a good relationship with customers, a business organization needs to collect data and analyze it. With data mining technologies, the collected data can be used for analytics.
Data Mining in Fraud detection:
Billions of dollars are lost to fraud every year. Traditional methods of fraud detection are time-consuming and complex. Data mining finds meaningful patterns and turns data into information.
An ideal fraud detection system should protect the data of all the users. Supervised methods consist of a
collection of sample records, and these records are classified as fraudulent or non-fraudulent. A model is
constructed using this data, and the technique is made to identify whether the document is fraudulent or not.
Data Mining in Lie Detection:
Apprehending a criminal is relatively straightforward, but bringing out the truth from a suspect is a very challenging task.
Law enforcement may use data mining techniques to investigate offenses, monitor suspected terrorist
communications, etc. This technique includes text mining also, and it seeks meaningful patterns in data,
which is usually unstructured text. The information collected from the previous investigations is compared,
and a model for lie detection is constructed.
Data Mining Financial Banking:
The digitalization of the banking system generates an enormous amount of data with every new transaction. Data mining techniques can help bankers solve business-related problems in banking and finance by identifying trends, causalities, and correlations in business information and market prices that are not instantly evident to managers or executives because the data volume is too large or is produced too rapidly. Managers may use these findings for better targeting, acquiring, retaining, and segmenting profitable customers.
Challenges of Data Mining
Incomplete and noisy data:
Data mining is the process of extracting useful information from large volumes of data. Real-world data is heterogeneous, incomplete, and noisy, and data in huge quantities will usually be inaccurate or unreliable. These problems may occur due to errors in measuring instruments or because of human errors. Suppose a retail chain collects the phone numbers of customers who spend more than $500, and accounting employees enter the information into their system. An employee may make a digit mistake when entering a phone number, which results in incorrect data. Some customers may not be willing to disclose their phone numbers, which results in incomplete data. The data could also be changed due to human or system error. All of these issues (noisy and incomplete data) make data mining challenging.
Data Distribution:
Real-world data is usually stored on various platforms in a distributed computing environment. It might be in databases, individual systems, or even on the internet. Practically, it is quite a tough task to bring all the data into a centralized data repository, mainly due to organizational and technical concerns. For example, various regional offices may have their own servers to store their data, and it may not be feasible to store all the data from all the offices on a central server. Therefore, data mining requires the development of tools and algorithms that allow the mining of distributed data.
Complex Data:
Real-world data is heterogeneous; it could be multimedia data (including audio, video, and images), complex structured data, spatial data, time series, and so on. Managing these various types of data and extracting useful information from them is a tough task. Most of the time, new technologies, tools, and methodologies have to be developed or refined to obtain specific information.
Performance:
The data mining system's performance relies primarily on the efficiency of algorithms and techniques used.
If the designed algorithm and techniques are not up to the mark, then the efficiency of the data mining
process will be affected adversely.
Data Privacy and Security:
Data mining usually leads to serious issues in terms of data security, governance, and privacy. For example,
if a retailer analyzes the details of the purchased items, then it reveals data about buying habits and
preferences of the customers without their permission.
Data Visualization:
In data mining, data visualization is a very important process because it is the primary method that shows the
output to the user in a presentable way. The extracted data should convey the exact meaning of what it
intends to express. But many times, representing the information to the end-user in a precise and easy way is
difficult. Because the input data and the output information are often complicated, very efficient and well-designed data visualization processes need to be implemented to present the results successfully.
There are many more challenges in data mining in addition to the problems mentioned above. More problems are revealed as the actual data mining process begins, and the success of data mining relies on overcoming all of these difficulties.
Prerequisites
Before learning the concepts of data mining, you should have a basic understanding of statistics, databases, and a programming language.
Audience
Our Data Mining Tutorial is prepared for all beginners or computer science graduates to help them learn the
basics to advanced techniques related to data mining.
Data Mining Task Primitives
Data mining task primitives refer to the basic building blocks or components that are used to construct a
data mining process. These primitives are used to represent the most common and fundamental tasks
that are performed during the data mining process. The use of data mining task primitives can provide a
modular and reusable approach, which can improve the performance, efficiency, and understandability
of the data mining process.
The availability and abundance of data today make knowledge discovery and data mining a matter of great significance and need. Given the recent development of the field, it is not surprising that a wide variety of techniques is now accessible to specialists and experts.
What is KDD?
KDD is a computer science field specializing in extracting previously unknown and interesting
information from raw data. KDD is the whole process of trying to make sense of data by developing
appropriate methods or techniques. This process deals with mapping low-level data into other forms that are more compact, abstract, and useful. This is achieved by creating short reports, modeling the process
of generating data, and developing predictive models that can predict future cases.
Due to the exponential growth of data, especially in areas such as business, KDD has become a very
important process to convert this large wealth of data into business intelligence, as manual extraction of
patterns has become seemingly impossible in the past few decades.
For example, it is currently used for various applications such as social network analysis, fraud detection,
science, investment, manufacturing, telecommunications, data cleaning, sports, information retrieval, and
marketing. KDD is usually used to answer questions such as: what are the main products that might help to obtain high profit next year in V-Mart? The KDD process typically involves the following steps:
1. Goal identification: Develop and understand the application domain and the relevant prior
knowledge and identify the KDD process's goal from the customer perspective.
2. Creating a target data set: Selecting the data set or focusing on a set of variables or data samples
on which the discovery was made.
3. Data cleaning and preprocessing: Basic operations include removing noise if appropriate,
collecting the necessary information to model or account for noise, deciding on strategies for
handling missing data fields, and accounting for time sequence information and known changes.
4. Data reduction and projection: Finding useful features to represent the data depending on the
purpose of the task. The effective number of variables under consideration may be reduced through
dimensionality reduction methods or conversion, or invariant representations for the data can be
found.
5. Matching process objectives: Matching the goals of the KDD process (from step 1) to a particular data mining method, for example, summarization, classification, regression, clustering, and others.
6. Modeling, exploratory analysis, and hypothesis selection: Choosing the data mining algorithm(s) and selecting the method or methods to search for data patterns. This process includes deciding which models and parameters may be appropriate (e.g., models for categorical data differ from models for real-valued vectors) and matching a particular data mining method with the overall approach of the KDD process (for example, the end user might be more interested in understanding the model than in its predictive capabilities).
7. Data Mining: The search for patterns of interest in a particular representational form or a set of
these representations, including classification rules or trees, regression, and clustering. The user can significantly aid the data mining method by correctly performing the preceding steps.
8. Presentation and evaluation: Interpreting mined patterns, possibly returning to some of the steps
between steps 1 and 7 for additional iterations. This step may also involve the visualization of the
extracted patterns and models or visualization of the data given the models drawn.
9. Taking action on the discovered knowledge: Using the knowledge directly, incorporating the
knowledge in another system for further action, or simply documenting and reporting to
stakeholders. This process also includes checking for and resolving potential conflicts with previously believed (or previously extracted) knowledge.
Now we will discuss the main difference covering data mining vs KDD.
For example, data mining involves clustering groups of data elements based on how similar they are, while KDD covers the broader data analysis needed to find patterns and links.
In recent data mining projects, various major data mining techniques have been
developed and used, including association, classification, clustering, prediction,
sequential patterns, and regression.
1. Classification:
This technique is used to obtain important and relevant information about data and metadata. This data mining
technique helps to classify data in different classes.
i. Classification of Data mining frameworks as per the type of data sources mined:
This classification is as per the type of data handled. For example, multimedia, spatial data, text data,
time-series data, World Wide Web, and so on.
ii. Classification of data mining frameworks as per the database involved:
This classification is based on the data model involved, for example, object-oriented databases, transactional databases, relational databases, and so on.
iii. Classification of data mining frameworks as per the kind of knowledge discovered:
This classification depends on the types of knowledge discovered or data mining functionalities, for example, discrimination, classification, clustering, characterization, etc. Some frameworks tend to be comprehensive, offering several data mining functionalities together.
iv. Classification of data mining frameworks according to data mining techniques used:
This classification is as per the data analysis approach utilized, such as neural networks, machine
learning, genetic algorithms, visualization, statistics, data warehouse-oriented or database-oriented,
etc.
The classification can also take into account the level of user interaction involved in the data mining procedure, such as query-driven systems, autonomous systems, or interactive exploratory systems.
2. Clustering:
Clustering is the division of information into groups of connected objects. Describing the data by a few clusters loses certain fine details but achieves simplification: the data is modeled by its clusters. From a historical point of view, data modeling places clustering in a framework rooted in statistics, mathematics, and numerical analysis. From a machine learning point of view, clusters relate to hidden patterns, the search for clusters is
unsupervised learning, and the subsequent framework represents a data concept. From a practical point of
view, clustering plays an extraordinary job in data mining applications. For example, scientific data
exploration, text mining, information retrieval, spatial database applications, CRM, Web analysis,
computational biology, medical diagnostics, and much more.
In other words, we can say that Clustering analysis is a data mining technique to identify similar data. This
technique helps to recognize the differences and similarities in the data. Clustering is very similar to classification, but it involves grouping chunks of data together based on their similarities.
3. Regression:
Regression analysis is a data mining process used to identify and analyze the relationship between variables in the presence of other factors. It is used to estimate the value or likelihood of a specific variable. Regression is primarily a form of planning and modeling. For example, we might use it to project certain costs depending on other factors such as availability, consumer demand, and competition. Primarily, it gives the relationship between two or more variables in the given data set.
4. Association Rules:
This data mining technique helps to discover a link between two or more items. It finds a hidden pattern in the
data set.
Association rules are if-then statements that help show the probability of interactions between data items within large data sets in different types of databases. Association rule mining has several applications and is commonly used to discover sales correlations in transactional or medical data sets.
The algorithm works on collections of data, for example, a list of grocery items that you have been buying for the last six months, and it calculates the percentage of items being purchased together.
There are three major measurement techniques:
o Support:
This measures how often items A and B are purchased together, relative to the entire dataset.
Support = (Transactions containing both Item A and Item B) / (Entire dataset)
o Confidence:
This measures how often item B is purchased when item A is purchased as well.
Confidence = (Transactions containing both Item A and Item B) / (Transactions containing Item A)
o Lift:
This measures how much the confidence exceeds what would be expected from how often item B is purchased overall.
Lift = Confidence / ((Transactions containing Item B) / (Entire dataset))
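To make the three measures concrete, here is a small, hedged Python sketch that computes support, confidence, and lift for a hypothetical rule {bread} -> {butter} over an invented list of transactions; the items and numbers are illustrative only.

    # Hypothetical market-basket transactions (each set is one purchase).
    transactions = [
        {"bread", "butter", "milk"},
        {"bread", "butter"},
        {"bread", "jam"},
        {"milk", "butter"},
        {"bread", "butter", "jam"},
    ]

    n = len(transactions)
    a, b = {"bread"}, {"butter"}

    count_a = sum(1 for t in transactions if a <= t)          # transactions containing A
    count_b = sum(1 for t in transactions if b <= t)          # transactions containing B
    count_ab = sum(1 for t in transactions if (a | b) <= t)   # transactions containing A and B

    support = count_ab / n                   # (A and B) / entire dataset
    confidence = count_ab / count_a          # (A and B) / transactions with A
    lift = confidence / (count_b / n)        # confidence / support of B

    print(f"support={support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
    # expected: support=0.60, confidence=0.75, lift=0.94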
5. Outlier detection:
This data mining technique relates to observing data items in the data set that do not match an expected pattern or expected behavior. It may be used in various domains such as intrusion detection, fraud detection, etc. It is also known as outlier analysis or outlier mining. An outlier is a data point that diverges too much from the rest of the dataset, and the majority of real-world datasets contain outliers. Outlier detection plays a significant role in data mining and is valuable in numerous fields such as network intrusion identification, credit or debit card fraud detection, and detecting outlying readings in wireless sensor network data.
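As a minimal, hedged sketch of one common approach (many others exist), the Python snippet below flags values whose z-score exceeds a chosen threshold; the data and the threshold of 2 are invented for illustration.

    import statistics

    values = [10, 12, 11, 13, 12, 95, 11, 10]   # 95 is an injected outlier
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)

    threshold = 2.0   # how many standard deviations away counts as an outlier
    outliers = [v for v in values if abs(v - mean) / stdev > threshold]
    print(outliers)   # expected to flag 95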
6. Sequential Patterns:
Sequential pattern mining is a data mining technique specialized for evaluating sequential data to discover sequential patterns. It comprises finding interesting subsequences in a set of sequences, where the interestingness of a sequence can be measured in terms of different criteria such as length, occurrence frequency, etc.
Data Mining Tools
Data mining tools are software applications that process and analyze large data sets to discover patterns, trends,
and relationships that might not be immediately apparent. These tools enable organizations and researchers to
make informed decisions by extracting useful information. Some popular data mining tools include:
Altair RapidMiner: Known for its flexibility and wide range of functionality, it covers the entire data mining
process, from data preparation to modeling and evaluation.
WEKA: A collection of machine learning algorithms for data mining tasks that are easily applicable to real data
with a user-friendly interface.
KNIME: Combines data access, transformation, initial investigation, powerful predictive analytics, and
visualization within an open-source platform.
Python (with libraries like scikit-learn, pandas, and NumPy): While Python is a programming language, its
libraries are extensively used in data mining for sophisticated data analysis and machine learning.
Tableau: A visualization tool with powerful data mining capabilities due to its ability to interactively handle
large data sets.
These tools cater to a variety of users, from those who prefer graphical interfaces to those who are more
comfortable coding their own analyses.
• Background Knowledge: Prior knowledge of datasets and their relationships in a database helps in mining the data. Knowing the relationships or any other useful information can ease the process of extraction and aggregation. For instance, a conceptual hierarchy over a number of datasets can increase the efficiency and accuracy of the process by making it easy to collect the desired data; knowing the hierarchy, the data can be generalized with ease.
• Generalization: When the data in the datasets of a data warehouse is not generalized, it is often in the form of unprocessed primitive integrity constraints and loosely associated multi-valued datasets and their dependencies. Applying the generalization concept through a query language can help process this raw data into a precise abstraction. It also supports multi-level collection of data with quality aggregation. When larger databases come into the picture, generalization plays a major role in giving desirable results at a conceptual level of data collection.
• Flexibility and Interaction: To avoid collecting less desirable or unwanted data from databases, efficient threshold values must be specified to keep data mining flexible and to provide compelling interaction that makes the user experience interesting. Such threshold values can be provided with data mining queries.
The four parameters of data mining:
• The first parameter is to fetch the relevant dataset from the database in the form of a relational
query. By specifying this primitive, relevant data are retrieved.
• The second parameter is the type of resource/information extracted. This primitive includes
generalization, association, classification, characterization, and discrimination rules.
• The third parameter is the hierarchy of datasets or generalization relation or background
knowledge as said earlier in the designing of DMQL.
• The final parameter is the interestingness of the patterns to be discovered, which can be represented by specific threshold values that in turn depend on the type of rules used in data mining.
5. Data Discretization: This involves dividing continuous data into discrete categories or
intervals. Discretization is often used in data mining and machine learning algorithms that
require categorical data. Discretization can be achieved through techniques such as equal
width binning, equal frequency binning, and clustering.
6. Data Normalization: This involves scaling the data to a common range, such as between 0
and 1 or -1 and 1. Normalization is often used to handle data with different units and scales.
Common normalization techniques include min-max normalization, z-score normalization,
and decimal scaling.
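Items 5 and 6 above can be illustrated with a short, hedged Python sketch using NumPy; the values, the three equal-width bins, and the scaling choices are invented for illustration.

    import numpy as np

    ages = np.array([18, 22, 25, 31, 40, 47, 52, 60], dtype=float)

    # Equal-width discretization into 3 bins between the minimum and maximum.
    edges = np.linspace(ages.min(), ages.max(), num=4)      # 4 edges -> 3 bins
    bins = np.digitize(ages, edges[1:-1])                   # bin index 0, 1, or 2
    print("bin labels:", bins)

    # Min-max normalization to the range [0, 1].
    min_max = (ages - ages.min()) / (ages.max() - ages.min())
    print("min-max:", np.round(min_max, 2))

    # Z-score normalization (zero mean, unit standard deviation).
    z_scores = (ages - ages.mean()) / ages.std()
    print("z-score:", np.round(z_scores, 2))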
Data preprocessing plays a crucial role in ensuring the quality of data and the accuracy of the analysis
results. The specific steps involved in data preprocessing may vary depending on the nature of the data
and the analysis goals.
By performing these steps, the data mining process becomes more efficient and the results become more
accurate.
Preprocessing in Data Mining
Data preprocessing is a data mining technique used to transform raw data into a useful and efficient format.
1. Data Cleaning: This step involves handling missing data and noisy data so that the data becomes consistent and usable.
• Noisy Data: Noisy data is meaningless data that can’t be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc. It can be handled in the following ways:
o Binning Method: This method works on sorted data in order to smooth it. The whole data is divided into segments (bins) of equal size, and then various methods are applied to each bin. Each segment is handled separately: one can replace all the data in a segment by its mean, or boundary values can be used to complete the task (a small sketch of bin-mean smoothing appears after this list).
o Regression: Here, data can be made smooth by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).
o Clustering: This approach groups similar data into clusters. Outliers may go undetected, or they will fall outside the clusters.
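As mentioned in the binning bullet above, here is a minimal, hedged sketch of smoothing by bin means in plain Python; the sorted values and the bin size of 3 are invented for illustration.

    # Smoothing noisy, sorted data by replacing each bin with its mean.
    data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26])
    bin_size = 3

    smoothed = []
    for i in range(0, len(data), bin_size):
        bin_values = data[i:i + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([round(bin_mean, 2)] * len(bin_values))

    print(smoothed)   # expected: [7.0, 7.0, 7.0, 19.0, 19.0, 19.0, 25.0, 25.0, 25.0]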
2. Data Transformation: This step is taken in order to transform the data into forms appropriate for the mining process. This involves the following ways:
• Normalization: It is done in order to scale the data values in a specified range (-1.0 to 1.0 or
0.0 to 1.0)
• Attribute Selection: In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.
• Discretization: This is done to replace the raw values of a numeric attribute with interval levels or conceptual levels.
• Concept Hierarchy Generation: Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute “city” can be converted to “country”.
3. Data Reduction: Data reduction is a crucial step in the data mining process that involves reducing
the size of the dataset while preserving the important information. This is done to improve the efficiency
of data analysis and to avoid overfitting of the model. Some common steps involved in data reduction
are:
• Feature Selection: This involves selecting a subset of relevant features from the dataset.
Feature selection is often performed to remove irrelevant or redundant features from the
dataset. It can be done using various techniques such as correlation analysis, mutual
information, and principal component analysis (PCA).
• Feature Extraction: This involves transforming the data into a lower-dimensional space while preserving the important information. Feature extraction is often used when the original features are high-dimensional and complex. It can be done using techniques such as PCA, linear discriminant analysis (LDA), and non-negative matrix factorization (NMF); a small PCA sketch follows this list.
• Sampling: This involves selecting a subset of data points from the dataset. Sampling is often
used to reduce the size of the dataset while preserving the important information. It can be
done using techniques such as random sampling, stratified sampling, and systematic
sampling.
• Clustering: This involves grouping similar data points together into clusters. Clustering is
often used to reduce the size of the dataset by replacing similar data points with a
representative centroid. It can be done using techniques such as k-means, hierarchical
clustering, and density-based clustering.
• Compression: This involves compressing the dataset while preserving the important
information. Compression is often used to reduce the size of the dataset for storage and
transmission purposes. It can be done using techniques such as wavelet compression, JPEG
compression, and gif compression.
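As referenced in the feature extraction bullet above, the following hedged sketch uses scikit-learn's PCA on an invented 4-feature dataset to reduce it to 2 components; it assumes scikit-learn and NumPy are installed.

    import numpy as np
    from sklearn.decomposition import PCA

    # Invented dataset: 6 samples, 4 correlated features.
    X = np.array([
        [2.5, 2.4, 1.2, 0.5],
        [0.5, 0.7, 0.3, 0.1],
        [2.2, 2.9, 1.1, 0.6],
        [1.9, 2.2, 1.0, 0.4],
        [3.1, 3.0, 1.5, 0.7],
        [2.3, 2.7, 1.2, 0.5],
    ])

    pca = PCA(n_components=2)          # keep the 2 strongest components
    X_reduced = pca.fit_transform(X)   # shape (6, 2)

    print("reduced shape:", X_reduced.shape)
    print("explained variance ratio:", np.round(pca.explained_variance_ratio_, 3))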
Data visualization presents data in visual form so that users can easily understand and absorb the information rather than scanning through entire datasheets.
Importance of Data Visualization
Data visualization represents the data in visual form. It is important since it enables information to be
more easily seen. Machine learning technique plays an important role in conducting predictive analysis
which helps in data visualization. Data visualization is not only helpful for business analysts, data analysts, and data scientists; the ability to understand and use data visualization is valuable in almost any career. Whether we work in design, operations, tech, marketing, sales, or any other field, we need to visualize data.
Data visualization has broad uses. Some important benefits of data visualization are given below.
o Finding errors and outliers
o Understand the whole procedure
o Grasping the data quickly
o Exploring business insights
o Quick action.
Difference between Data Mining and Data Visualization
• Data mining is a process of extracting useful information, patterns, and trends from raw data, whereas data visualization is the visual representation of data with the help of charts, images, lists, and other visual objects.
• Both come under the area of data science.
• Data mining relies on many algorithms, while data visualization does not require any particular algorithm.
• Data mining is operated with web software systems, whereas data visualization supports and works better in advanced data analysis.
• Data mining has broad applications and is primarily preferred for web search engines, while data visualization is preferred for data forecasting and predictions.
• Data mining is a newer, still-developing technology, while data visualization is more useful in real-time data forecasting.
Regression
Regression refers to a type of supervised machine learning technique that is used to predict any
continuous-valued attribute. Regression helps any business organization to analyze the target variable
and predictor variable relationships. It is a significant tool for analyzing data and can be used for financial forecasting and time series modeling.
Regression involves the technique of fitting a straight line or a curve to numerous data points. It happens in such a way that the distance between the data points and the curve comes out to be the lowest.
The most popular types of regression are linear and logistic regressions. Other than that, many other
types of regression can be performed depending on their performance on an individual data set.
Regression can predict the dependent variable when it is expressed in terms of the independent variables and the trend is available for a finite period. Regression provides a good way to predict variables, but there are certain restrictions and assumptions, such as the independence of the variables and inherent normal distributions of the variables. For example, suppose one considers two variables, A and B, whose joint distribution is bivariate: these two variables might appear independent, yet they may also be correlated, so the marginal distributions of A and B need to be derived and used. Before applying regression analysis, the data needs to be studied carefully and certain preliminary tests performed to ensure that regression is applicable. Non-parametric tests are available for such cases.
Types of Regression
Regression is divided into five different types
1. Linear Regression
2. Logistic Regression
3. Lasso Regression
4. Ridge Regression
5. Polynomial Regression
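To make the idea of fitting a line to data points concrete, here is a hedged sketch of simple linear regression with scikit-learn on invented data; the feature (years of experience) and target (salary in thousands) are purely illustrative, and scikit-learn is assumed to be installed.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Invented data: years of experience vs. salary (in thousands).
    X = np.array([[1], [2], [3], [4], [5], [6]])
    y = np.array([30, 35, 42, 48, 55, 60])

    model = LinearRegression().fit(X, y)   # fit the best straight line

    print("slope:", round(model.coef_[0], 2))
    print("intercept:", round(model.intercept_, 2))
    print("prediction for 7 years:", round(model.predict([[7]])[0], 2))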
Difference between Regression and Classification in data mining
Regression Classification
Regression refers to a type of supervised machine Classification refers to a process of assigning
learning technique that is used to predict any predefined class labels to instances based on their
continuous-valued attribute. attributes.
In regression, the nature of the predicted data is In classification, the nature of the predicated data
ordered. is unordered.
The regression can be further divided into linear Classification is divided into two categories:
regression and non-linear regression. binary classifier and multi-class classifier.
In the regression process, the calculations are In the classification process, the calculations are
basically done by utilizing the root mean square basically done by measuring the efficiency.
error.
Examples of regressions are regression tree, The examples of classifications are the decision
linear regression, etc. tree.
Regression analysis usually enables us to compare the effects of various kinds of feature variables measured on different scales, such as predicting land prices based on locality, total area, surroundings, etc. These results help market researchers or data analysts remove useless features and evaluate the best features for building efficient models.
Statistical Methods in Data Mining
Data mining refers to extracting or mining knowledge from large amounts of data. In other words, data
mining is the science, art, and technology of exploring large and complex bodies of data in order to discover useful patterns. Theoreticians and practitioners are continually seeking improved techniques to
make the process more efficient, cost-effective, and accurate. Any situation can be analyzed in two ways
in data mining:
• Statistical Analysis: In statistics, data is collected, analyzed, explored, and presented to identify
patterns and trends. Alternatively, it is referred to as quantitative analysis.
• Non-statistical Analysis: This analysis provides generalized information and includes sound,
still images, and moving images.
In statistics, there are two main categories:
• Descriptive Statistics: The purpose of descriptive statistics is to organize data and identify the
main characteristics of that data. Graphs or numbers summarize the data. Average, Mode,
SD(Standard Deviation), and Correlation are some of the commonly used descriptive statistical
methods.
• Inferential Statistics: The process of drawing conclusions based on probability theory and
generalizing the data. By analyzing sample statistics, you can infer parameters about populations
and make models of relationships within data.
There are various statistical terms that one should be aware of while dealing with statistics. Some of
these are:
• Population
• Sample
• Variable
• Quantitative Variable
• Qualitative Variable
• Discrete Variable
• Continuous Variable
Bayes’ Theorem
Bayes’ Theorem is used to determine the conditional probability of an event. It is named after the English statistician Thomas Bayes, whose formula was published in 1763. Bayes’ Theorem is a very important theorem in mathematics that laid the foundation of a statistical inference approach called Bayesian inference. It is used to find the probability of an event based on prior knowledge of conditions that might be related to that event.
Bayes theorem (also known as the Bayes Rule or Bayes Law) is used to determine the conditional
probability of event A when event B has already occurred.
The general statement of Bayes’ theorem is: “The conditional probability of an event A, given the occurrence of another event B, is equal to the product of the probability of B given A and the probability of A, divided by the probability of event B,” i.e.
P(A|B) = P(B|A)P(A) / P(B)
where,
• P(A) and P(B) are the probabilities of events A and B
• P(A|B) is the probability of event A when event B happens
• P(B|A) is the probability of event B when A happens
Bayes’ Theorem Statement
Bayes’ Theorem for n set of events is defined as,
Let E1, E2,…, En be a set of events associated with the sample space S, in which all the events E1, E2,…, En have a non-zero probability of occurrence. All the events E1, E2,…, En form a partition of S.
Let A be an event from space S for which we have to find probability, then according to Bayes’
theorem,
P(Ei|A) = P(Ei)P(A|Ei) / ∑ P(Ek)P(A|Ek)
for k = 1, 2, 3, …., n
Bayes Theorem Formula
For any two events A and B, the formula for Bayes’ theorem is given by:
P(A|B) = P(B|A)P(A) / P(B), provided P(B) ≠ 0.
UNIT 4
Basic Concept of Classification (Data Mining)
Data Mining: Data mining in general terms means mining or digging deep into data that is in different
forms to gain patterns, and to gain knowledge on that pattern. In the process of data mining, large data
sets are first sorted, then patterns are identified and relationships are established to perform data analysis
and solve problems.
Classification is a task in data mining that involves assigning a class label to each instance in a dataset
based on its features. The goal of classification is to build a model that accurately predicts the class
labels of new instances based on their features.
There are two main types of classification: binary classification and multi-class classification. Binary
classification involves classifying instances into two classes, such as “spam” or “not spam”, while
multi-class classification involves classifying instances into more than two classes.
The process of building a classification model typically involves the following steps:
Data Collection:
The first step in building a classification model is data collection. In this step, the data relevant to the
problem at hand is collected. The data should be representative of the problem and should contain all the
necessary attributes and labels needed for classification. The data can be collected from various sources,
such as surveys, questionnaires, websites, and databases.
Data Preprocessing:
The second step in building a classification model is data preprocessing. The collected data needs to be
preprocessed to ensure its quality. This involves handling missing values, dealing with outliers, and
transforming the data into a format suitable for analysis. Data preprocessing also involves converting
the data into numerical form, as most classification algorithms require numerical input.
Handling Missing Values: Missing values in the dataset can be handled by replacing them with the
mean, median, or mode of the corresponding feature or by removing the entire record.
Dealing with Outliers: Outliers in the dataset can be detected using various statistical techniques such as
z-score analysis, boxplots, and scatterplots. Outliers can be removed from the dataset or replaced with
the mean, median, or mode of the corresponding feature.
Data Transformation: Data transformation involves scaling or normalizing the data to bring it into a
common scale. This is done to ensure that all features have the same level of importance in the analysis.
Feature Selection:
The third step in building a classification model is feature selection. Feature selection involves
identifying the most relevant attributes in the dataset for classification. This can be done using various
techniques, such as correlation analysis, information gain, and principal component analysis.
Correlation Analysis: Correlation analysis involves identifying the correlation between the features in
the dataset. Features that are highly correlated with each other can be removed as they do not provide
additional information for classification.
Information Gain: Information gain is a measure of the amount of information that a feature provides for
classification. Features with high information gain are selected for classification.
Principal Component Analysis:
Principal Component Analysis (PCA) is a technique used to reduce the dimensionality of the dataset.
PCA identifies the most important features in the dataset and removes the redundant ones.
Model Selection:
The fourth step in building a classification model is model selection. Model selection involves selecting
the appropriate classification algorithm for the problem at hand. There are several algorithms available,
such as decision trees, support vector machines, and neural networks.
Decision Trees: Decision trees are a simple yet powerful classification algorithm. They divide the
dataset into smaller subsets based on the values of the features and construct a tree-like model that can
be used for classification.
Support Vector Machines: Support Vector Machines (SVMs) are a popular classification algorithm used
for both linear and nonlinear classification problems. SVMs are based on the concept of maximum
margin, which involves finding the hyperplane that maximizes the distance between the two classes.
Neural Networks:
Neural Networks are a powerful classification algorithm that can learn complex patterns in the data.
They are inspired by the structure of the human brain and consist of multiple layers of interconnected
nodes.
Model Training:
The fifth step in building a classification model is model training. Model training involves using the
selected classification algorithm to learn the patterns in the data. The data is divided into a training set
and a validation set. The model is trained using the training set, and its performance is evaluated on the
validation set.
Model Evaluation:
The sixth step in building a classification model is model evaluation. Model evaluation involves assessing the performance of the trained model on a test set. This is done to ensure that the model generalizes well to new, unseen data.
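The steps above can be strung together in a short, hedged sketch using scikit-learn's bundled Iris dataset and a decision tree; the 80/20 split and the choice of classifier are illustrative, not prescriptive, and scikit-learn is assumed to be installed.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # Data collection: a small, already-clean benchmark dataset.
    X, y = load_iris(return_X_y=True)

    # Model training: hold out 20% of the data for evaluation.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

    # Model evaluation: accuracy on the held-out test set.
    predictions = model.predict(X_test)
    print("test accuracy:", round(accuracy_score(y_test, predictions), 3))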
Classification is a widely used technique in data mining and is applied in a variety of domains, such as
email filtering, sentiment analysis, and medical diagnosis.
Classification: It is a data analysis task, i.e. the process of finding a model that describes and
distinguishes data classes and concepts. Classification is the problem of identifying to which of a set of
categories (subpopulations) a new observation belongs, on the basis of a training set of data containing observations whose category membership is known.
Example: Before starting any project, we need to check its feasibility. In this case, a classifier is
required to predict class labels such as ‘Safe’ and ‘Risky’ for adopting the Project and to further approve
it. It is a two-step process such as:
1. Learning Step (Training Phase): Construction of Classification Model
Different Algorithms are used to build a classifier by making the model learn using the
training set available. The model has to be trained for the prediction of accurate results.
2. Classification Step: The model is used to predict class labels, and the constructed model is tested on test data to estimate the accuracy of the classification rules.
and if the person does not move aside then the system is negatively tested.
The same is the case with data: a model should be trained on it in order to get accurate and reliable results. There are certain data types associated with data mining that tell us the format of the file (whether it is in text format or in numerical format).
Attributes – Represents different features of an object. Different types of attributes are:
1. Binary: Possesses only two values i.e. True or False
Example: Suppose there is a survey evaluating some products. We need to check whether it’s
useful or not. So, the Customer has to answer it in Yes or No.
Product usefulness: Yes / No
• Symmetric: Both values are equally important in all aspects
• Asymmetric: The two values are not equally important.
2. Nominal: When more than two outcomes are possible. It is in Alphabet form rather than
being in Integer form.
Example: One needs to choose some material but of different colors. So, the color might be
Yellow, Green, Black, Red.
Different Colors: Red, Green, Black, Yellow
• Ordinal: Values that must have some meaningful order.
Example: Suppose there are grade sheets of few students which might contain
different grades as per their performance such as A, B, C, D
Grades: A, B, C, D
• Continuous: May have an infinite number of values, it is in float type
Example: Measuring the weight of few Students in a sequence or orderly manner
i.e. 50, 51, 52, 53
Weight: 50, 51, 52, 53
• Discrete: Finite number of values.
Example: Marks of a Student in a few subjects: 65, 70, 75, 80, 90
Marks: 65, 70, 75, 80, 90
Syntax:
• Mathematical Notation: Classification is based on building a function taking input feature
vector “X” and predicting its outcome “Y” (Qualitative response taking values in set C)
• Here a classifier (or model) is used, which is a supervised function; it can be designed manually based on expert knowledge. It is constructed to predict class labels (example: the label “Yes” or “No” for the approval of some event).
Classifiers can be categorized into two major types:
1. Discriminative: It is a very basic classifier that determines just one class for each row of data. It models directly from the observed data and depends heavily on the quality of the data rather than on distributions.
Example: Logistic Regression
2. Generative: It models the distribution of individual classes and tries to learn the model that
generates the data behind the scenes by estimating assumptions and distributions of the
model. Used to predict the unseen data.
Example: Naive Bayes Classifier
Detecting spam emails by looking at previous data: suppose there are 100 emails, divided so that Class A is 25% (spam emails) and Class B is 75% (non-spam emails). A user wants to check whether an email containing the word "cheap" should be termed spam.
In Class A (the 25 spam emails), 20 out of 25 contain the word "cheap" and the rest do not. In Class B (the 75 non-spam emails), 70 out of 75 do not contain the word and the remaining 5 do.
So, if an email contains the word "cheap", what is the probability of it being spam? (= 80%)
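Reading the example as "20 of the 25 spam emails contain the word cheap, and 5 of the 75 non-spam emails contain it", the 80% figure follows directly from Bayes' theorem; the short Python check below reproduces it.

    # Priors from the 25/75 split of 100 emails.
    p_spam, p_not_spam = 0.25, 0.75

    # Likelihood of seeing the word "cheap" in each class (from the example).
    p_cheap_given_spam = 20 / 25        # 0.80
    p_cheap_given_not_spam = 5 / 75     # about 0.067

    # Bayes' theorem: P(spam | cheap).
    p_cheap = p_cheap_given_spam * p_spam + p_cheap_given_not_spam * p_not_spam
    p_spam_given_cheap = (p_cheap_given_spam * p_spam) / p_cheap

    print(round(p_spam_given_cheap, 2))   # expected: 0.8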
Classifiers Of Machine Learning:
1. Decision Trees
2. Bayesian Classifiers
3. Neural Networks
4. K-Nearest Neighbour
5. Support Vector Machines
6. Linear Regression
7. Logistic Regression
Associated Tools and Languages: Used to mine/ extract useful information from raw data.
• Main Languages used: R, SAS, Python, SQL
• Major Tools used: RapidMiner, Orange, KNIME, Spark, Weka
• Libraries used: Jupyter, NumPy, Matplotlib, Pandas, ScikitLearn, NLTK, TensorFlow,
Seaborn, Basemap, etc.
Real–Life Examples :
• Market Basket Analysis:
It is a modeling technique that has been associated with frequent transactions of buying some
combination of items.
Example: Amazon and many other Retailers use this technique. While viewing some
products, certain suggestions for the commodities are shown that some people have bought
in the past.
• Weather Forecasting:
Changing Patterns in weather conditions needs to be observed based on parameters such as
temperature, humidity, wind direction. This keen observation also requires the use of
previous records in order to predict it accurately.
Advantages:
• Mining Based Methods are cost-effective and efficient
• Helps in identifying criminal suspects
• Helps in predicting the risk of diseases
• Helps Banks and Financial Institutions to identify defaulters so that they may approve Cards,
Loan, etc.
Disadvantages:
Privacy: There are chances that a company may give some information about its customers to other vendors or use this information for its own profit.
Accuracy Problem: An accurate model must be selected in order to get the best accuracy and results.
Clustering in Data Mining
Clustering:
The process of making a group of abstract objects into classes of similar objects is known as clustering.
Points to Remember:
• One group is treated as a cluster of data objects.
• In the process of cluster analysis, the first step is to partition the set of data into groups with
the help of data similarity, and then groups are assigned to their respective labels.
• The biggest advantage of clustering over classification is that it can adapt to changes and helps single out useful features that differentiate different groups.
Applications of cluster analysis :
• It is widely used in many applications such as image processing, data analysis, and pattern
recognition.
• It helps marketers to find the distinct groups in their customer base and they can characterize
their customer groups by using purchasing patterns.
• It can be used in the field of biology, by deriving animal and plant taxonomies and
identifying genes with the same capabilities.
• It also helps in information discovery by classifying documents on the web.
Clustering Methods:
It can be classified based on the following categories.
1. Model-Based Method
2. Hierarchical Method
3. Constraint-Based Method
4. Grid-Based Method
5. Partitioning Method
6. Density-Based Method
Requirements of clustering in data mining:
The following points explain why clustering is important in data mining.
• Scalability – We require highly scalable clustering algorithms to work with large databases.
• Ability to deal with different kinds of attributes – Algorithms should be able to work with different types of data, such as categorical, numerical, and binary data.
• Discovery of clusters with arbitrary shape – The algorithm should be able to detect clusters of arbitrary shape and should not be bounded to distance measures.
• Interpretability – The results should be comprehensive, usable, and interpretable.
• High dimensionality – The algorithm should be able to handle high dimensional space
instead of only handling low dimensional data.
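As a hedged illustration of partitioning-based clustering, the sketch below runs scikit-learn's k-means on a few invented 2-D points; the choice of k=2 and the data are purely illustrative, and scikit-learn is assumed to be installed.

    import numpy as np
    from sklearn.cluster import KMeans

    # Two loose groups of invented 2-D points.
    X = np.array([
        [1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
        [8.0, 8.2], [8.1, 7.9], [7.9, 8.0],
    ])

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

    print("labels:", kmeans.labels_)                       # cluster id for each point
    print("centroids:", np.round(kmeans.cluster_centers_, 2))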
Difference between Classification and Clustering
• Classification is a supervised learning approach in which a specific label is provided to the machine to classify new observations; the machine needs proper training and testing for label verification. Clustering is an unsupervised learning approach in which grouping is done on the basis of similarities.
• Classification uses a training dataset; clustering does not use a training dataset.
• Classification uses algorithms to categorize new data as per the observations of the training set, whereas clustering uses statistical concepts in which the data set is divided into subsets with similar features.
• In classification, there are labels for the training data; in clustering, there are no labels for the training data.
• The objective of classification is to find which class a new object belongs to from the set of predefined classes, whereas the objective of clustering is to group a set of objects and find whether there is any relationship between them.
• Classification is more complex as compared to clustering, while clustering is less complex as compared to classification.
Here are some issues in classification in data mining:
• Discrimination
Data mining can lead to discrimination if it focuses on characteristics like ethnicity, gender, religion, or
sexual preference. In some cases, this can be considered unethical or illegal.
• Performance issues
Data mining algorithms can have performance issues with speed, reliability, and time. Parallelization
can help by breaking down jobs into smaller tasks that multiple computers can work on.
• Outlier analysis
Outlier detection is a major issue in data mining. An outlier is a pattern that is different from all the
other patterns in a data set.
• Anomaly detection
Anomaly detection is the process of finding patterns in data that don't conform to expected behavior. It
has applications in many fields, including finance, medicine, industry, and the internet.
Other issues in data mining include:
• Data quality
• Data privacy and security
• Handling diverse data types
• Scalability
• Integration with heterogeneous data sources
• Interpretation of results
• Dynamic data
• Legal and ethical concerns
Why use Decision Trees?
There are various algorithms in Machine learning, so choosing the best algorithm for the given dataset
and problem is the main point to remember while creating a machine learning model. Below are the two
reasons for using the Decision tree:
o Decision Trees usually mimic human thinking ability while making a decision, so it is easy to
understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like
structure.
Decision Tree Terminologies
Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which
further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output nodes; the tree cannot be segregated further after
reaching a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according
to the given conditions.
Branch/Sub-Tree: A sub-tree formed by splitting the main tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes are called
the child nodes.
How does the Decision Tree algorithm Work?
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root node of
the tree. This algorithm compares the values of root attribute with the record (real dataset) attribute and,
based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves
further. It continues the process until it reaches a leaf node of the tree. The complete process can be
better understood using the below algorithm:
o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values of the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where you cannot classify the nodes further; such
final nodes are called leaf nodes.
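As a concrete illustration, the sketch below grows a small decision tree with scikit-learn (assumed to be installed); the Iris dataset and the parameter values are illustrative and are not taken from these notes. Setting criterion="entropy" corresponds to using information gain as the Attribute Selection Measure.

```python
# A minimal sketch, assuming scikit-learn is available; dataset and parameters are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# criterion="entropy" uses information gain as the attribute selection measure (Step-2);
# fit() performs the recursive splitting of Steps 1-5.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print(export_text(tree))                       # the learned tree structure
print("Test accuracy:", tree.score(X_test, y_test))
```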
Example: Suppose there is a candidate who has a job offer and wants to decide whether he should
accept the offer or not. To solve this problem, the decision tree starts with the root node (the Salary
attribute, chosen by ASM). The root node splits further into the next decision node (distance from
the office) and one leaf node based on the corresponding labels. The next decision node further
gets split into one decision node (Cab facility) and one leaf node. Finally, the decision node splits
into two leaf nodes (Accepted offer and Declined offer). Consider the below diagram:
Some widely used decision tree algorithms are:
• C4.5
• CART (Classification and Regression Trees)
• CHAID (Chi-Square Automatic Interaction Detection)
• MARS (Multivariate Adaptive Regression Splines)
Disadvantages
• The space of possible decision trees is exponentially large. Greedy approaches are often
unable to find the best tree.
• Does not take into account interactions between attributes.
• Each decision boundary involves only a single attribute.
• Can lead to overfitting.
• May not be effective with data with many attributes.
C4.5
As an enhancement to the ID3 algorithm, Ross Quinlan created the decision tree algorithm C4.5. In
machine learning and data mining applications, it is a well-liked approach for creating decision trees.
Certain drawbacks of the ID3 algorithm are addressed in C4.5, including its incapacity to deal with
continuous characteristics and propensity to overfit the training set.
A modification of information gain known as the gain ratio is used to address the bias towards attributes
with many values. It is computed by dividing the information gain by the intrinsic (split) information,
which measures the amount of information required to describe an attribute's values:
Gain Ratio = Information Gain / Split Information
Where Split Information represents the entropy of the feature itself. The feature with the highest gain
ratio is chosen for splitting.
When dealing with continuous attributes, C4.5 first sorts the attribute's values and then considers the
midpoint between each pair of adjacent values as a potential split point. It then selects the split point
that yields the highest information gain or gain ratio.
By turning every path from the root to a leaf into a rule, C4.5 can also produce rules from the decision
tree. These rules can then be used to make predictions on new data.
C4.5 is an effective technique for creating decision trees that can handle both discrete and continuous
attributes and produce rules from the tree. Its use of the gain ratio and of pruning increases the model's
accuracy and helps prevent overfitting. Nevertheless, it may still be susceptible to noisy data and may
not perform well on datasets with a large number of features.
The C4.5 decision tree is a modification of the ID3 decision tree. C4.5 uses the Gain Ratio as
the goodness function to split the dataset, unlike ID3, which used the Information Gain.
The Information Gain function tends to prefer features with more categories, as they tend to have
lower entropy. This results in overfitting of the training data. Gain Ratio mitigates this issue by
penalising features for having more categories, using a quantity called Split Information (or Intrinsic
Information).
Calculation of Split Information
For an attribute that splits the dataset S into subsets S1, ..., Sc, the Split Information is the entropy of
the attribute's own value distribution:
Split Information = − Σ ( |Si| / |S| ) · log2( |Si| / |S| )
If an attribute takes (almost) the same value for every record, the Split Information approaches zero,
resulting in a very large (in the limit, infinite) Gain Ratio. Because the Gain Ratio favours features with
fewer categories, the branching is reduced, and this helps prevent overfitting.
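The small sketch below computes the gain ratio for one categorical attribute directly from the definitions above (information gain divided by split information). The toy Outlook/Play data and the function names are hypothetical, chosen only to illustrate the calculation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain_ratio(feature_values, labels):
    """Information gain divided by split information (intrinsic information)."""
    total = len(labels)
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    # Expected entropy after splitting on the feature
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    info_gain = entropy(labels) - remainder
    # Split information: entropy of the feature's own value distribution
    split_info = -sum((len(s) / total) * math.log2(len(s) / total) for s in subsets.values())
    return info_gain / split_info if split_info > 0 else float("inf")

# Hypothetical toy data: does "Outlook" help predict "Play"?
outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Overcast"]
play    = ["No",    "No",    "Yes",      "Yes",  "No",   "Yes"]
print(round(gain_ratio(outlook, play), 3))
```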
Node Impurity and the Gini Index
The goal of CART's splitting process is to achieve pure nodes, meaning nodes that have data points
belonging to a single class or with very similar values. To quantify the purity or impurity of a node, the
CART algorithm often employs measures like the Gini Index. A lower Gini Index suggests that a node is
pure.
For regression problems, other measures like mean squared error can be used to evaluate splits, aiming
to minimize the variability within nodes.
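The helper below computes the Gini Index of a node from the standard definition (one minus the sum of squared class proportions); it is a small illustration written for these notes, not code from any particular CART implementation.

```python
def gini_index(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions.
    0 means a perfectly pure node; higher values mean more class mixing."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

print(gini_index(["A", "A", "A", "A"]))   # 0.0  -> pure node
print(gini_index(["A", "A", "B", "B"]))   # 0.5  -> maximally mixed for two classes
```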
Pruning Techniques used in Cart Algorithm
While a deep tree with many nodes might fit the training data exceptionally well, it can often lead to
overfitting, where the model performs poorly on unseen data. Pruning addresses this by trimming down
the tree, removing branches that add little predictive power. Two common approaches in CART pruning
are:
• Reduced Error Pruning: Removing a node and checking if it improves model accuracy.
• Cost Complexity Pruning: Using a complexity parameter to weigh the trade-off between tree
size and its fit to the data.
By acquainting oneself with these foundational concepts, one can better appreciate the sophistication of
the CART Algorithm and its application in varied data scenarios.
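As one possible illustration of cost complexity pruning, scikit-learn exposes a complexity parameter ccp_alpha on its tree estimators: larger values prune more aggressively. The sketch below is hedged — the dataset and the step of sampling every tenth alpha are arbitrary choices made for demonstration only.

```python
# A hedged sketch of cost-complexity pruning, assuming scikit-learn is installed.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate alpha values along the pruning path of the unpruned tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

for alpha in path.ccp_alphas[::10]:                     # sample a few alphas
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}  "
          f"test accuracy={pruned.score(X_test, y_test):.3f}")
```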
How the CART Algorithm Works
To harness the full power of the CART Algorithm, it's essential to comprehend its inner workings. This
section provides an in-depth look into the step-by-step process that CART follows, unraveling the logic
behind each decision and split.
What is CHAID?
CHAID is a predictive model used to forecast scenarios and draw conclusions. It’s based
on significance testing and came into use during the 1980s after Gordon Kass published An
Exploratory Technique for Investigating Large Quantities of Categorical Data .
It involves:
• Regression: A statistical analysis to estimate the relationships between a dependent
response variable and other independent ones.
• Machine learning: Artificial intelligence that leverages data to absorb information,
then makes logical predictions or decisions.
• Decision trees: Branching models of decisions or attributes, followed by their event
outcomes.
Chi-Square Formula
• To find the most dominant feature, CHAID uses the chi-square test, whereas ID3 uses information
gain, C4.5 uses the gain ratio, and CART uses the Gini index.
• Today, most programming libraries (e.g. Pandas for Python) use the Pearson metric for
correlation by default.
• The formula of chi-square:
• √( (y − y′)² / y′ )
• where y is the actual (observed) value and y′ is the expected value.
• Practical Implementation of Chi-Square Test
• Let’s consider a sample dataset and calculate chi-square for various features.
• Sample Dataset
• We are going to build decision rules for the following data set. The decision column is the
target we would like to find based on some features.
• By the way, we will ignore the day column because it is just the row number.
• We need to find the most important feature with respect to the target column to choose the node to split on.
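Since the sample dataset table is not reproduced here, the sketch below uses a small hypothetical dataset (the "Outlook" and "Decision" columns are made up) and follows the per-cell formula quoted above, summing √((observed − expected)² / expected) over every (feature value, class) cell; the feature with the largest total would be chosen as the split node.

```python
import math

def chi_square_of_feature(rows, feature, target="Decision"):
    """Sum sqrt((observed - expected)^2 / expected) over every (feature value, class)
    cell; the expected count assumes the overall class split inside every feature value."""
    total = len(rows)
    classes = sorted({r[target] for r in rows})
    class_share = {c: sum(r[target] == c for r in rows) / total for c in classes}
    score = 0.0
    for value in sorted({r[feature] for r in rows}):
        subset = [r for r in rows if r[feature] == value]
        for c in classes:
            expected = len(subset) * class_share[c]
            observed = sum(r[target] == c for r in subset)
            score += math.sqrt((observed - expected) ** 2 / expected)
    return score

# Hypothetical mini dataset (the original sample table is not reproduced here)
data = [
    {"Outlook": "Sunny", "Decision": "No"},     {"Outlook": "Sunny", "Decision": "No"},
    {"Outlook": "Rain", "Decision": "Yes"},     {"Outlook": "Rain", "Decision": "Yes"},
    {"Outlook": "Overcast", "Decision": "Yes"}, {"Outlook": "Overcast", "Decision": "Yes"},
]
print(round(chi_square_of_feature(data, "Outlook"), 3))   # higher = more dominant feature
```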
Most commonly, the two objects are rows of data that describe a subject (such as a person, car, or
house) or an event (such as a purchase, a claim, or a diagnosis).
Perhaps the most likely way we encounter distance measures is when we are using a machine learning
algorithm that has a distance measure at its core. The most famous such algorithm is the
K-Nearest Neighbours (KNN) algorithm.
KNN
A classification or regression prediction is made for a new example by calculating the distance between
the new example and all existing examples in the training dataset.
The K examples in the training dataset with the smallest distance are then selected, and a prediction is
made by aggregating their outcomes (the mode of the class labels for classification, or the mean of the
real values for regression).
KNN belongs to a broader field of algorithms called case-based or instance-based learning, most of
which use distance measures in a similar manner. Another popular instance-based algorithm that uses
distance measures is Learning Vector Quantization (LVQ), an algorithm that may also be considered
a type of neural network.
Next, we have the Self-Organizing Map (SOM) algorithm, which also uses distance measures and can
be used for supervised and unsupervised learning. Another algorithm that uses distance measures at its
core is the K-means clustering algorithm.
In instance-based learning, the training examples are stored verbatim, and a distance function is used to
determine which member of the training set is closest to an unknown test instance. Once the nearest
training instance has been located, its class is predicted for the test instance.
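A minimal KNN sketch, assuming scikit-learn is available; the Iris dataset, k = 5 and the Euclidean metric are illustrative choices, not prescriptions from these notes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)          # "training" here is simply storing the instances
print("Test accuracy:", knn.score(X_test, y_test))
print("Prediction for the first test point:", knn.predict(X_test[:1]))
```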
Let's understand this with an example, suppose we are a market manager, and we have a new tempting
product to sell. We are sure that the product would bring enormous profit, as long as it is sold to the
right people. So, how can we tell who is best suited for the product from our company's huge customer
base?
Clustering, falling under the category of unsupervised machine learning, is one of the problems that
machine learning algorithms solve.
Clustering only utilizes input data, to determine patterns, anomalies, or similarities in its input data.
A good clustering algorithm aims to obtain clusters in which:
o Intra-cluster similarity is high, meaning the data points within a cluster are similar to one another.
o Inter-cluster similarity is low, meaning each cluster holds data that is not similar to the data in
other clusters.
What is a Cluster?
o A cluster is a subset of similar objects
o A subset of objects such that the distance between any of the two objects in the cluster is less
than the distance between any object in the cluster and any object that is not located inside it.
o A connected region of a multidimensional space with a comparatively high density of objects.
What is clustering in Data Mining?
o Clustering is the method of converting a group of abstract objects into classes of similar objects.
o Clustering is a method of partitioning a set of data or objects into a set of significant subclasses
called clusters.
o It helps users to understand the structure or natural grouping in a data set and is used either as a
stand-alone instrument to get better insight into the data distribution or as a pre-processing step for
other algorithms.
Important points:
o Data objects of a cluster can be considered as one group.
o While doing cluster analysis, we first partition the data set into groups based on data similarity
and then assign labels to the groups.
o The main advantage of clustering over classification is that it is adaptable to changes and helps
single out useful features that distinguish different groups.
Applications of cluster analysis in data mining:
o In many applications, clustering analysis is widely used, such as data analysis, market research,
pattern recognition, and image processing.
o It assists marketers in finding distinct groups in their customer base and characterizing those
groups based on purchasing patterns.
o It helps in classifying documents on the web for information discovery.
o Clustering is also used in tracking applications such as detection of credit card fraud.
o As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of
data to analyze the characteristics of each cluster.
o In terms of biology, It can be used to determine plant and animal taxonomies, categorization of
genes with the same functionalities and gain insight into structure inherent to populations.
o It helps in the identification of areas of similar land that are used in an earth observation
database and the identification of house groups in a city according to house type, value, and
geographical location.
1. Scalability:
Clustering algorithms should be highly scalable; otherwise, on large databases, they may lead to the
wrong result.
2. Interpretability:
The outcomes of clustering should be interpretable, comprehensible, and usable.
3. Discovery of clusters with attribute shape:
The clustering algorithm should be able to find clusters of arbitrary shape. It should not be limited to
distance measures that tend to discover only small spherical clusters.
4. Ability to deal with different types of attributes:
Algorithms should be capable of being applied to any data such as data based on intervals (numeric),
binary data, and categorical data.
5. Ability to deal with noisy data:
Databases contain data that is noisy, missing, or incorrect. Few algorithms are sensitive to such data and
may result in poor quality clusters.
6. High dimensionality:
The clustering tools should be able to handle not only low-dimensional data but also high-dimensional
space.
Partitioning Algorithms
A typical algorithmic strategy is partitioning, which involves breaking a big problem into smaller
subproblems that may be solved individually, and then combining their solutions to solve the original
problem.
The fundamental concept underlying partitioning is to separate the input into subsets, solve each subset
independently, and then combine the results to get the whole answer. From sorting algorithms to parallel
computing, partitioning has a wide range of uses.
The quicksort algorithm is a well-known illustration of partitioning. The effective sorting algorithm
QuickSort uses partitioning to arrange an array of items. The algorithm divides the array into two
subarrays: one having elements smaller than the pivot and the other holding entries bigger than the
pivot. The pivot element is often the first or final member of the array. The pivot element is then
positioned correctly, and the procedure is then performed recursively on the two subarrays to sort the
full array.
The quicksort method is demonstrated using an array of numbers in the following example −
Input: [5, 3, 8, 4, 2, 7, 1, 6]
Step 1 − Select a pivot element (in this case, 5)
[5, 3, 8, 4, 2, 7, 1, 6]
Step 2 − Partition the array into two subarrays
[3, 4, 2, 1] [5] [8, 7, 6]
Step 3 − Recursively apply quicksort to the two subarrays
[3, 4, 2, 1] [5] [8, 7, 6]
[1, 2, 3, 4] [5] [6, 7, 8]
[1, 2, 3, 4, 5, 6, 7, 8]
Step 4 − The array is now sorted
[1, 2, 3, 4, 5, 6, 7, 8]
In this example, the quicksort method divides the input array into three parts, applies the quicksort
algorithm recursively to the two smaller subarrays, and then combines the sorted subarrays
to get the final sorted array.
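A short Python version of the same idea is shown below; it uses the first element as the pivot, as in the worked example, and is only one of many possible quicksort implementations.

```python
def quicksort(arr):
    """Partition around a pivot, recursively sort the two subarrays, then combine."""
    if len(arr) <= 1:
        return arr
    pivot = arr[0]                                   # Step 1: select a pivot element
    smaller = [x for x in arr[1:] if x <= pivot]     # Step 2: partition the array
    larger = [x for x in arr[1:] if x > pivot]
    return quicksort(smaller) + [pivot] + quicksort(larger)   # Step 3: recurse, then combine

print(quicksort([5, 3, 8, 4, 2, 7, 1, 6]))           # [1, 2, 3, 4, 5, 6, 7, 8]
```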
Advantages
Performance gains
By allowing distinct sections of a problem to be addressed in parallel or on multiple processors,
partitioning can result in considerable performance gains. Faster processing speeds and improved
resource use may follow from this.
Scalability
Scalability refers to the capacity of algorithms to handle bigger data sets and problem sizes. Partitioning
can aid in this process.
Simplified problem solving
Partitioning can simplify complicated problems by dividing them into smaller, easier-to-manage
sub-problems. This can make the problem easier to understand and solve.
Disadvantages
Added complexity
Partitioning can make an algorithm more complicated since it calls for extra logic to maintain the
various partitions and coordinate their operations.
Communication costs
When using partitioning to parallelize a task, communication costs might be a major obstacle. The
communication burden may outweigh the performance benefits of parallelization if the partitions must
communicate often.
Load imbalance
If the sub-problems are not equally large or challenging, partitioning may result in load imbalance.
Overall performance may suffer as a result of certain processors sitting idle while others are
overworked.
Example (agglomerative hierarchical clustering): cluster the six points P, Q, R, S, T, V.
Step 1:
Consider each letter (P, Q, R, S, T, V) as an individual cluster and find the distance of each individual
cluster from all the other clusters.
Step 2:
Now, merge the comparable clusters into a single cluster. Let's say cluster Q and cluster R are similar to
each other, and likewise S and T, so we merge them in the second step. We then get the clusters [(P), (QR),
(ST), (V)].
Step 3:
Here, we recalculate the proximity as per the algorithm and combine the two closest clusters [(ST), (V)]
together to form new clusters as [(P), (QR), (STV)]
Step 4:
Repeat the same process. The clusters (QR) and (STV) are comparable and are combined to form a
new cluster. Now we have [(P), (QRSTV)].
Step 5:
Finally, the remaining two clusters are merged together to form a single cluster [(PQRSTV)]
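The same bottom-up merging can be reproduced with SciPy's hierarchical clustering (assumed to be installed). The 2-D coordinates assigned to the six points P, Q, R, S, T, V below are hypothetical; only the merging behaviour is the point of the sketch.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[0.0, 0.0],   # P
                   [5.0, 5.0],   # Q
                   [5.2, 5.1],   # R
                   [9.0, 1.0],   # S
                   [9.1, 1.2],   # T
                   [8.5, 1.5]])  # V

Z = linkage(points, method="single")            # pairwise merges, closest clusters first
print(Z)                                        # each row: the two clusters merged and their distance
print(fcluster(Z, t=2, criterion="maxclust"))   # cut the hierarchy into 2 flat clusters
```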
Density-based clustering refers to a method that is based on a local cluster criterion, such as
density-connected points. In this section, we will discuss density-based clustering with examples.
What is Density-based clustering?
Density-Based Clustering refers to one of the most popular unsupervised learning methodologies used
in model building and machine learning algorithms. The data points in the region separated by two
clusters of low point density are considered as noise. The surroundings with a radius ε of a given object
are known as the ε neighborhood of the object. If the ε neighborhood of the object comprises at least a
minimum number, MinPts of objects, then it is called a core object.
Density-Based Clustering - Background
Density-based clustering uses two parameters:
Eps: the maximum radius of the neighborhood.
MinPts: the minimum number of points in an Eps-neighborhood of a point.
The Eps-neighborhood of a point i is defined as:
N_Eps(i) = { k ∈ D | dist(i, k) ≤ Eps }
Directly density reachable:
A point i is directly density reachable from a point k with respect to Eps and MinPts if i belongs to
N_Eps(k) and k satisfies the core point condition:
|N_Eps(k)| ≥ MinPts
Density reachable:
A point i is density reachable from a point j with respect to Eps and MinPts if there is a chain of points
p1, ..., pn with p1 = j and pn = i such that each p(m+1) is directly density reachable from p(m).
Density connected:
A point i is density connected to a point j with respect to Eps and MinPts if there is a point o such that
both i and j are density reachable from o with respect to Eps and MinPts.
Suppose a set of objects is denoted by D. An object i is directly density reachable from an object j only
if it is located within the ε-neighborhood of j and j is a core object.
An object i is density reachable from an object j with respect to ε and MinPts in a given set of objects D
only if there is a chain of objects p1, ..., pn with p1 = j and pn = i such that each p(m+1) is directly
density reachable from p(m) with respect to ε and MinPts.
An object i is density connected to an object j with respect to ε and MinPts in a given set of objects D
only if there is an object o belonging to D such that both i and j are density reachable from o with
respect to ε and MinPts.
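A minimal DBSCAN sketch, assuming scikit-learn is available: eps plays the role of ε and min_samples the role of MinPts; points labelled −1 are treated as noise. The coordinates below are made up to form two dense regions plus one isolated point.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 1.1],   # dense region 1
              [8.0, 8.0], [8.1, 8.2], [7.9, 8.1],   # dense region 2
              [4.5, 4.5]])                           # isolated point -> noise

db = DBSCAN(eps=0.5, min_samples=3).fit(X)           # eps = ε, min_samples = MinPts
print(db.labels_)    # e.g. [0 0 0 1 1 1 -1]: two clusters plus one noise point
```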
Major Features of Density-Based Clustering
The primary features of Density-based clustering are given below.
o It needs only a single scan of the database.
o It requires density parameters as a termination condition.
o It is used to manage noise in data clusters.
o Density-based clustering is used to identify clusters of arbitrary shape and size.
OPTICS
OPTICS stands for Ordering Points To Identify the Clustering Structure. It produces an ordering of the
database with respect to its density-based clustering structure. The cluster ordering contains
information equivalent to density-based clusterings obtained over a broad range of parameter settings.
OPTICS is useful for both automatic and interactive cluster analysis, including
determining an intrinsic clustering structure.
DENCLUE
DENCLUE is a density-based clustering method by Hinneburg and Keim. It enables a compact
mathematical description of arbitrarily shaped clusters in high-dimensional data sets, and it works well
for data sets with a large amount of noise.
Its inventors claim BIRCH to be the "first clustering algorithm proposed in the database area to handle
'noise' (data points that are not part of the underlying pattern) effectively", beating DBSCAN by two
months. The BIRCH algorithm received the SIGMOD 10 year test of time award in 2006.
Basic clustering algorithms like K means and agglomerative clustering are the most commonly used
clustering algorithms. But when performing clustering on very large datasets, BIRCH and DBSCAN are
the advanced clustering algorithms useful for performing precise clustering on large datasets. Moreover,
BIRCH is very useful because of its easy implementation. BIRCH first condenses the dataset into small
summaries and then clusters those summaries; it does not cluster the dataset directly. That is why
BIRCH is often used with other clustering algorithms: after making the summary, the summary can
also be clustered by other clustering algorithms.
It is provided as an alternative to MinibatchKMeans. It converts data to a tree data structure with the
centroids being read off the leaf. And these centroids can be the final cluster centroid or the input for
other cluster algorithms like Agglomerative Clustering.
In the context of the CF tree, the algorithm compresses the data into sets of CF nodes. Nodes that
have several sub-clusters can be called CF sub-clusters, and these CF sub-clusters are situated in
non-terminal CF nodes.
The CF tree is a height-balanced tree that gathers and manages clustering features and holds necessary
information of given data for further hierarchical clustering. This prevents the need to work with whole
data given as input. In the tree, a cluster of data points is represented by a CF, a triple of three numbers (N, LS, SS).
o N = number of items in subclusters
o LS = vector sum of the data points
o SS = sum of the squared data points
There are mainly four phases which are followed by the algorithm of BIRCH.
o Scanning data into memory.
o Condense data (resize data).
o Global clustering.
o Refining clusters.
Two of them (resize data and refining clusters) are optional in these four phases. They come in the
process when more clarity is required. But scanning data is just like loading data into a model. After
loading the data, the algorithm scans the whole data and fits them into the CF trees.
In condensing, it resets and resizes the data for better fitting into the CF tree. In global clustering, it
sends CF trees for clustering using existing clustering algorithms. Finally, refining fixes the problem of
CF trees where the same valued points are assigned to different leaf nodes.
Cluster Features
BIRCH clustering achieves its high efficiency by clever use of a small set of summary statistics to
represent a larger set of data points. These summary statistics constitute a CF and represent a sufficient
substitute for the actual data for clustering purposes.
A CF is a set of three summary statistics representing a set of data points in a single cluster. These
statistics are as follows:
o Count [The number of data values in the cluster]
o Linear Sum [The sum of the individual coordinates. This is a measure of the location of the
cluster]
o Squared Sum [The sum of the squared coordinates. This is a measure of the spread of the cluster]
NOTE: The cluster's centroid and spread (variance) can be derived from the count, the linear sum,
and the squared sum, so the CF carries equivalent information about its member data points.
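The tiny sketch below shows how the centroid and radius of a sub-cluster can be recovered from its CF alone; the three 2-D points are hypothetical, and SS is taken here as the sum of squared coordinates over all members.

```python
import numpy as np

# Clustering Feature of a hypothetical sub-cluster of 2-D points
points = np.array([[1.0, 2.0], [2.0, 2.0], [3.0, 4.0]])
N = len(points)                 # count
LS = points.sum(axis=0)         # linear sum (location)
SS = (points ** 2).sum()        # squared sum (spread)

centroid = LS / N
# Radius: root-mean-square distance of the members from the centroid, from the CF alone
radius = np.sqrt(SS / N - np.sum(centroid ** 2))
print(N, LS, SS, centroid, round(float(radius), 3))

# Merging two sub-clusters needs only their CFs: (N1 + N2, LS1 + LS2, SS1 + SS2)
```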
CF Tree
The building process of the CF Tree can be summarized in the following steps, such as:
Step 1: For each given record, BIRCH compares the location of that record with the location of each CF
in the root node, using either the linear sum or the mean of the CF. BIRCH passes the incoming record
to the root node CF closest to the incoming record.
Step 2: The record then descends down to the non-leaf child nodes of the root node CF selected in step
1. BIRCH compares the location of the record with the location of each non-leaf CF. BIRCH passes the
incoming record to the non-leaf node CF closest to the incoming record.
Step 3: The record then descends down to the leaf child nodes of the non-leaf node CF selected in step
2. BIRCH compares the location of the record with the location of each leaf. BIRCH tentatively passes
the incoming record to the leaf closest to the incoming record.
Step 4: Perform one of the below points (i) or (ii):
1. If the radius of the chosen leaf, including the new record, does not exceed the threshold T, then
the incoming record is assigned to that leaf. The leaf and its parent CFs are updated to account
for the new data point.
2. If the radius of the chosen leaf, including the new record, exceeds the threshold T, then a new
leaf is formed, consisting of the incoming record only. The parent CFs are updated to account for
the new data point.
If step 4(ii) is executed, and the maximum L leaves are already in the leaf node, the leaf node is split
into two leaf nodes. If the parent node is full, split the parent node, and so on. The most distant leaf node
CFs are used as leaf node seeds, with the remaining CFs being assigned to whichever leaf node is closer.
Note that the radius of a cluster may be calculated even without knowing the data points, as long as we
have the count n, the linear sum LS, and the squared sum SS. This allows BIRCH to evaluate whether a
given data point belongs to a particular sub-cluster without scanning the original data set.
Clustering the Sub-Clusters
Once the CF tree is built, any existing clustering algorithm may be applied to the sub-clusters (the CF
leaf nodes) to combine these sub-clusters into clusters. The task of clustering becomes much easier as
the number of sub-clusters is much less than the number of data points. When a new data value is added,
these statistics may be easily updated, thus making the computation more efficient.
Parameters of BIRCH
There are three parameters in this algorithm that need to be tuned. Unlike K-means, the sub-clusters are
determined by the algorithm itself rather than being fully dictated by a user-supplied k.
o Threshold: The maximum radius that a sub-cluster in a leaf node of the CF tree may have; a new
point is absorbed into a sub-cluster only if the merged sub-cluster's radius stays within this threshold.
o branching_factor: This parameter specifies the maximum number of CF sub-clusters in each
node (internal node).
o n_clusters: The number of clusters to be returned after the entire BIRCH algorithm is complete,
i.e., the number of clusters after the final clustering step. The final clustering step is not
performed if set to none, and intermediate clusters are returned.
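A minimal sketch with scikit-learn's Birch estimator (assumed to be installed), whose threshold, branching_factor and n_clusters arguments correspond to the three parameters listed above; the two Gaussian blobs are synthetic demonstration data.

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=0.0, scale=0.5, size=(100, 2)),
               rng.normal(loc=5.0, scale=0.5, size=(100, 2))])

model = Birch(threshold=0.5, branching_factor=50, n_clusters=2)
labels = model.fit_predict(X)                # builds the CF tree, then global clustering
print(len(model.subcluster_centers_))        # number of CF sub-clusters (leaf entries)
print(np.bincount(labels))                   # points per final cluster
```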
Advantages of BIRCH
It is local in that each clustering decision is made without scanning all data points and existing clusters.
It exploits the observation that the data space is not usually uniformly occupied, and not every data point
is equally important.
It uses available memory to derive the finest possible sub-clusters while minimizing I/O costs. It is also
an incremental method that does not require the whole data set in advance.
CURE Architecture
• Idea: Random sample, say ‘s’ is drawn out of a given data. This random sample is partitioned,
say ‘p’ partitions with size s/p. The partitioned sample is partially clustered, into say ‘s/pq’
clusters. Outliers are discarded/eliminated from this partially clustered partition. The partially
clustered partitions need to be clustered again. Label the data in the disk.
• Procedure:
1. Select the number of representative points, say c.
2. Choose c well-scattered points in a cluster.
3. These scattered points are shrunk towards the centroid.
4. These points are used as representatives of the clusters in the Dmin (minimum distance)
cluster-merging approach. In the Dmin approach, the minimum distance between the scattered
points inside the sample and the points outside the sample is calculated. The point having the
least distance to a scattered point inside the sample, when compared to other points, is
considered and merged into the sample.
5. After every such merging, new representative points are selected to represent the new
cluster.
6. Cluster merging continues until the target number of clusters, say k, is reached.
Here are some comparisons of clustering with categorical attributes:
• Categorical data clustering
This is a machine learning task that involves grouping objects with similar categorical attributes into
clusters. Categorical attributes are discrete values that lack a natural order or distance function. Most
classical clustering algorithms can't be directly applied to categorical data because of this.
• Clustering mixed data
Most clustering algorithms can only work with data that's either entirely categorical or
numerical. However, real-world datasets often contain both types of data.
• K-means vs K-modes
A study compared the k-means and k-modes algorithms for clustering numerical and qualitative
data. The study found that the k-modes algorithm produced better results and was faster than the k-
means algorithm.
• Graph-based representation
A framework for clustering categorical data uses a graph-based representation method to learn how to
represent categorical values. The framework compares the proposed method with other representation
methods on benchmark datasets.
• Object-cluster similarity
A framework for clustering mixed data uses the concept of object-cluster similarity to create a unified
similarity metric. This metric can be applied to data with categorical, numerical, and mixed attributes.
Unit 5
Association Rule Learning
Association rule learning is a type of unsupervised learning technique that checks for the dependency of
one data item on another data item and maps them accordingly so that the relationship can be used
profitably. It tries to find interesting relations or associations among the variables of a dataset, based on
different rules for discovering interesting relations between variables in the database.
Association rule learning is one of the most important concepts of machine learning, and it is
employed in market basket analysis, web usage mining, continuous production, etc. Market basket
analysis is a technique used by various big retailers to discover the associations between items.
We can understand it by taking the example of a supermarket, where all products that are
purchased together are put together.
For example, if a customer buys bread, he most likely can also buy butter, eggs, or milk, so these
products are stored within a shelf or mostly nearby. Consider the below diagram:
Here, the If element is called the antecedent, and the then statement is called the consequent.
Relationships in which an association is found between two single items are said to have single
cardinality. Association rule learning is all about creating rules, and as the number of items increases,
the cardinality also increases. So, to measure the associations between thousands of data items, there
are several metrics. These metrics are given below:
o Support
o Confidence
o Lift
Let's understand each of them:
Support
Support tells how frequently an itemset appears in the dataset. It is defined as the fraction of the
transactions T that contain the itemset X:
Support(X) = Frequency(X) / T
Confidence
Confidence indicates how often the rule has been found to be true, i.e., how often the items X and Y
occur together in the dataset given that X occurs. It is the ratio of the transactions that contain both X
and Y to the number of transactions that contain X:
Confidence(X → Y) = Frequency(X ∪ Y) / Frequency(X)
Lift
Lift measures the strength of a rule, and is defined by the formula below. It is the ratio of the observed
support to the support expected if X and Y were independent of each other:
Lift(X → Y) = Support(X ∪ Y) / ( Support(X) × Support(Y) )
It has three possible values:
o If Lift= 1: The probability of occurrence of antecedent and consequent is independent of each
other.
o Lift > 1: It determines the degree to which the two itemsets are dependent on each other.
o Lift < 1: It tells us that one item is a substitute for the other, which means one item has a
negative effect on the other.
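A small worked computation of the three metrics on hypothetical transactions (the item names and values below are invented for illustration):

```python
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]
T = len(transactions)

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / T

X, Y = {"bread"}, {"butter"}
supp_xy = support(X | Y)
confidence = supp_xy / support(X)
lift = supp_xy / (support(X) * support(Y))
print(f"support={supp_xy:.2f} confidence={confidence:.2f} lift={lift:.2f}")
# support=0.60 confidence=0.75 lift=1.25 -> bread and butter are positively associated
```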
Types of Association Rule Learning
Association rule learning can be divided into three algorithms:
Apriori Algorithm
This algorithm uses frequent datasets to generate association rules. It is designed to work on the
databases that contain transactions. This algorithm uses a breadth-first search and Hash Tree to calculate
the itemset efficiently.
It is mainly used for market basket analysis and helps to understand the products that can be bought
together. It can also be used in the healthcare field to find drug reactions for patients.
Eclat Algorithm
Eclat algorithm stands for Equivalence Class Transformation. This algorithm uses a depth-first search
technique to find frequent itemsets in a transaction database. It performs faster execution than Apriori
Algorithm.
F-P Growth Algorithm
The F-P growth algorithm stands for Frequent Pattern growth, and it is an improved version of the
Apriori algorithm. It represents the database in the form of a tree structure known as a frequent pattern
tree (FP-tree). The purpose of this frequent-pattern tree is to extract the most frequent patterns.
Applications of Association Rule Learning
It has various applications in machine learning and data mining. Below are some popular applications of
association rule learning:
o Market Basket Analysis: It is one of the popular examples and applications of association rule
mining. This technique is commonly used by big retailers to determine the association between
items.
o Medical Diagnosis: With the help of association rules, patients can be cured easily, as it helps in
identifying the probability of illness for a particular disease.
o Protein Sequence: The association rules help in determining the synthesis of artificial Proteins.
o It is also used for the Catalog Design and Loss-leader Analysis and many more other
applications.
2. The support count of an item set is the number of transactions in which it appears; for example, if
the item set {milk, bread} appears in 20 of the transactions, the support count for {milk, bread} is 20.
3. Association rule mining algorithms, such as Apriori or FP-Growth, are used to find frequent item
sets and generate association rules. These algorithms work by iteratively generating candidate
item sets and pruning those that do not meet the minimum support threshold. Once the frequent
item sets are found, association rules can be generated by using the concept of confidence, which
is the ratio of the number of transactions that contain the item set and the number of transactions
that contain the antecedent (left-hand side) of the rule.
4. Frequent item sets and association rules can be used for a variety of tasks such as market basket
analysis, cross-selling and recommendation systems. However, it should be noted that
association rule mining can generate a large number of rules, many of which may be irrelevant
or uninteresting. Therefore, it is important to use appropriate measures such as lift and
conviction to evaluate the interestingness of the generated rules.
Association Mining searches for frequent items in the data set. In frequent mining usually, interesting
associations and correlations between item sets in transactional and relational databases are found. In
short, Frequent Mining shows which items appear together in a transaction or relationship.
Need of Association Mining: Frequent mining is the generation of association rules from a
Transactional Dataset. If there are 2 items X and Y purchased frequently then it’s good to put them
together in stores or provide some discount offer on one item on purchase of another item. This can
really increase sales. For example, it is likely to find that if a customer buys milk and bread, he/she also
buys butter. So the association rule is {milk, bread} => {butter}, and the seller can suggest that the
customer buy butter if he/she buys milk and bread.
Important Definitions :
• Support: It is one of the measures of interestingness that tells about the usefulness and
certainty of rules. A support of 5% means that 5% of all transactions in the database follow the rule.
Support(A -> B) = Support_count(A ∪ B) / (Total number of transactions)
• Confidence: A confidence of 60% means that 60% of the customers who purchased milk and
bread also bought butter.
Confidence(A -> B) = Support_count(A ∪ B) / Support_count(A)
If a rule satisfies both minimum support and minimum confidence, it is a strong rule.
• Support_count(X): Number of transactions in which X appears. If X is A union B then it is the
number of transactions in which A and B both are present.
• Maximal Itemset: An itemset is maximal frequent if none of its supersets are frequent.
• Closed Itemset: An itemset is closed if none of its immediate supersets have the same support
count as the itemset.
• K- Itemset: Itemset which contains K items is a K-itemset. So it can be said that an itemset is
frequent if the corresponding support count is greater than the minimum support count.
Example On finding Frequent Itemsets – Consider the given dataset with given transactions.
Apriori Algorithm
Apriori algorithm refers to the algorithm which is used to calculate the association rules between
objects, i.e., how two or more objects are related to one another. In other words, we can say that the
apriori algorithm is an association rule learning technique that analyzes whether people who bought
product A also bought product B.
The primary objective of the apriori algorithm is to create the association rule between different objects.
The association rule describes how two or more objects are related to one another. Apriori algorithm is
also called frequent pattern mining. Generally, you operate the Apriori algorithm on a database that
consists of a huge number of transactions. Let's understand the apriori algorithm with the help of an
example; suppose you go to Big Bazar and buy different products. It helps the customers buy their
products with ease and increases the sales performance of the Big Bazar. In this section, we will discuss
the apriori algorithm with examples.
Introduction
We take an example to understand the concept better. You must have noticed that the Pizza shop seller
makes a pizza, soft drink, and breadstick combo together. He also offers a discount to their customers
who buy these combos. Have you ever wondered why he does so? He thinks that customers who buy pizza
also buy soft drinks and breadsticks. However, by making combos, he makes it easy for the customers.
At the same time, he also increases his sales performance.
Similarly, you go to Big Bazar, and you will find biscuits, chips, and Chocolate bundled together. It
shows that the shopkeeper makes it comfortable for the customers to buy these products in the same
place.
The above two examples are the best examples of Association Rules in Data Mining. It helps us to learn
the concept of apriori algorithms.
What is Apriori Algorithm?
Apriori algorithm refers to an algorithm that is used for mining frequent itemsets and relevant
association rules. Generally, the apriori algorithm operates on a database containing a huge number of
transactions, for example, the items customers buy at a Big Bazar.
Apriori algorithm helps the customers to buy their products with ease and increases the sales
performance of the particular store.
Components of Apriori algorithm
The given three components comprise the apriori algorithm.
1. Support
2. Confidence
3. Lift
Let's take an example to understand this concept.
As already discussed above, you need a huge database containing a large number of transactions.
Suppose you have 4000 customers transactions in a Big Bazar. You have to calculate the Support,
Confidence, and Lift for two products, and you may say Biscuits and Chocolate. This is because
customers frequently buy these two items together.
Out of the 4000 transactions, 400 contain Biscuits and 600 contain Chocolate, and 200 transactions
contain both Biscuits and Chocolates. Using this data, we will find out the support, confidence,
and lift.
Support
Support refers to the default popularity of any product. You find the support as a quotient of the division
of the number of transactions comprising that product by the total number of transactions. Hence, we get
Support (Biscuits) = (Transactions relating biscuits) / (Total transactions)
= 400/4000 = 10 percent.
Confidence
Confidence refers to the likelihood that customers who bought biscuits also bought chocolates. To get
the confidence, you need to divide the number of transactions that comprise both biscuits and
chocolates by the number of transactions that comprise biscuits.
Hence,
Confidence = (Transactions relating both biscuits and Chocolate) / (Total transactions involving
Biscuits)
= 200/400
= 50 percent.
It means that 50 percent of customers who bought biscuits bought chocolates also.
Lift
Consider the above example; lift refers to the increase in the likelihood of selling chocolates when
biscuits are sold, relative to how often chocolates are sold on their own. The mathematical equation of
lift is given below. Here Support(Chocolates) = 600 / 4000 = 15 percent.
Lift = Confidence(Biscuits → Chocolates) / Support(Chocolates)
= 50 / 15 ≈ 3.33
It means that the probability of people buying both biscuits and chocolates together is about 3.3 times
higher than would be expected if the two purchases were independent. If the lift value is below one, it
means that people are unlikely to buy both items together; the larger the value, the better the combination.
How does the Apriori Algorithm work in Data Mining?
We will understand this algorithm with the help of an example
Consider a Big Bazar scenario where the product set is P = {Rice, Pulse, Oil, Milk, Apple}. The
database comprises six transactions where 1 represents the presence of the product and 0 represents the
absence of the product.
Transaction ID Rice Pulse Oil Milk Apple
t1 1 1 1 0 0
t2 0 1 1 1 0
t3 0 0 0 1 1
t4 1 1 0 1 0
t5 1 1 1 0 1
t6 1 1 1 1 1
The Apriori Algorithm makes the given assumptions
o All subsets of a frequent itemset must be frequent.
o All supersets of an infrequent itemset must be infrequent.
o Fix a threshold support level. In our case, we have fixed it at 50 percent.
Step 1
Make a frequency table of all the products that appear in the transactions. Now, shortlist the frequency
table to include only those products with a support level of over 50 percent. We find the given
frequency table.
Product Frequency (Number of transactions)
Rice (R) 4
Pulse(P) 5
Oil(O) 4
Milk(M) 4
The above table indicates the products frequently bought by the customers.
Step 2
Create pairs of products such as RP, RO, RM, PO, PM, OM. You will get the given frequency table.
Itemset Frequency (Number of transactions)
RP 4
RO 3
RM 2
PO 4
PM 3
OM 2
Step 3
Apply the same threshold support of 50 percent and keep the itemsets that meet it; in our case, that
means a frequency of at least 3 of the 6 transactions.
Thus, we get RP, RO, PO, and PM.
Step 4
Now, look for a set of three products that the customers buy together. We get the given combination.
1. RP and RO give RPO
2. PO and PM give POM
Step 5
Calculate the frequency of these two itemsets, and you will get the given frequency table.
Itemset Frequency (Number of transactions)
RPO 3
POM 2
If you apply the threshold (at least 3 transactions), you can figure out that the customers' set of three
products is RPO.
We have considered an easy example to discuss the apriori algorithm in data mining. In reality, you find
thousands of such combinations.
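The example above can be re-computed with a few lines of plain Python. The sketch below enumerates candidate itemsets level by level and keeps those appearing in at least 50 percent of the six transactions; note that with full Apriori pruning the candidate POM is discarded before counting (its subset OM is infrequent), and the only frequent 3-itemset that remains is RPO.

```python
from itertools import combinations

transactions = [
    {"Rice", "Pulse", "Oil"},                    # t1
    {"Pulse", "Oil", "Milk"},                    # t2
    {"Milk", "Apple"},                           # t3
    {"Rice", "Pulse", "Milk"},                   # t4
    {"Rice", "Pulse", "Oil", "Apple"},           # t5
    {"Rice", "Pulse", "Oil", "Milk", "Apple"},   # t6
]
min_count = len(transactions) * 0.5              # 50% threshold -> 3 transactions

def frequency(itemset):
    return sum(itemset <= t for t in transactions)

items = sorted(set().union(*transactions))
frequent = [frozenset([i]) for i in items if frequency({i}) >= min_count]

for k in (2, 3):
    # Apriori candidate generation: every (k-1)-subset must already be frequent
    candidates = {frozenset(c) for c in combinations(items, k)
                  if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
    level = [c for c in candidates if frequency(c) >= min_count]
    frequent += level
    print(k, sorted(tuple(sorted(c)) for c in level))
```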
How to improve the efficiency of the Apriori Algorithm?
There are various methods used for the efficiency of the Apriori algorithm
Hash-based itemset counting
In hash-based itemset counting, a k-itemset whose corresponding hashing bucket count is below the
threshold cannot be frequent and is excluded.
Transaction Reduction
In transaction reduction, a transaction that does not contain any frequent k-itemset is not valuable in
subsequent scans and can be removed.
Apriori Algorithm in data mining
We have already discussed an example of the apriori algorithm related to the frequent itemset
generation. Apriori algorithm has many applications in data mining.
The primary requirements to find the association rules in data mining are given below.
Use Brute Force
Analyze all the rules and find the support and confidence levels for the individual rule. Afterward,
eliminate the values which are less than the threshold support and confidence levels.
The two-step approaches
The two-step approach is a better option to find the associations rules than the Brute Force method.
Step 1
In this article, we have already discussed how to create the frequency table and calculate itemsets
having a greater support value than that of the threshold support.
Step 2
To create association rules, you need to use a binary partition of the frequent itemsets. You need to
choose the ones having the highest confidence levels.
In the above example, you can see that the RPO combination was the frequent itemset. Now, we find
out all the rules using RPO.
RP → O, RO → P, PO → R, O → RP, P → RO, R → PO
You can see that there are six different combinations. Therefore, if you have n elements, there will be
2^n − 2 candidate association rules.
Advantages of Apriori Algorithm
o It is used to calculate large itemsets.
o Simple to understand and apply.
Disadvantages of Apriori Algorithms
o Apriori algorithm is an expensive method to find support since the calculation has to pass
through the whole database.
o Sometimes it requires a huge number of candidate rules, so it becomes computationally more
expensive.
Feature Extraction:
Feature extraction is the process of transforming the space containing many dimensions into space with
fewer dimensions. This approach is useful when we want to keep the whole information but use fewer
resources while processing the information.
Some common feature extraction techniques are:
1. Principal Component Analysis
2. Linear Discriminant Analysis
3. Kernel PCA
4. Quadratic Discriminant Analysis
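A minimal Principal Component Analysis sketch, assuming scikit-learn is available; the 30-feature breast-cancer dataset and the choice of 2 components are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)     # PCA is sensitive to feature scale

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
print(X.shape, "->", X_reduced.shape)            # (569, 30) -> (569, 2)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```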
In factor analysis, variables that are highly correlated with each other are put into a group, and that
group is known as a factor. The number of these factors is reduced compared to the original dimension
of the dataset.
Auto-encoders
One of the popular methods of dimensionality reduction is the auto-encoder, which is a type of ANN
or artificial neural network whose main aim is to copy the inputs to the outputs. The input is
compressed into a latent-space representation, and the output is reconstructed from this representation.
It has mainly two parts:
o Encoder: The function of the encoder is to compress the input to form the latent-space
representation.
o Decoder: The function of the decoder is to recreate the output from the latent-space
representation.
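A hedged sketch of a tiny dense auto-encoder, assuming TensorFlow/Keras is installed; the 30-dimensional random input, the 8-unit latent layer and the training settings are placeholders chosen only to show the encoder/decoder structure.

```python
import numpy as np
from tensorflow.keras import layers, Model

inputs = layers.Input(shape=(30,))
latent = layers.Dense(8, activation="relu")(inputs)       # encoder -> latent-space representation
outputs = layers.Dense(30, activation="linear")(latent)   # decoder -> reconstruction of the input

autoencoder = Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")

X = np.random.rand(256, 30).astype("float32")              # placeholder data
autoencoder.fit(X, X, epochs=5, batch_size=32, verbose=0)  # learn to copy inputs to outputs

encoder = Model(inputs, latent)                            # reuse the trained encoder alone
print(encoder.predict(X[:3], verbose=0).shape)             # (3, 8) compressed representation
```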
Classification:
Classification determines a set of rules which find the class of the specified object as per its attributes.
Association rules:
Association rules are determined from the data sets, and they describe patterns that frequently occur in
the database.
Characteristic rules:
Characteristic rules describe some parts of the data set.
Discriminate rules:
As the name suggests, discriminate rules describe the differences between two parts of the database,
such as calculating the difference between two cities as per employment rate.
Spatial Data Mining vs. Temporal Data Mining
• Spatial data mining refers to the extraction of knowledge, spatial relationships, and interesting
patterns that are not explicitly stored in a spatial database. Temporal data mining refers to the
extraction of knowledge about the occurrence of events, such as whether they follow a random,
cyclic, or seasonal variation.
• Spatial data mining needs space; temporal data mining needs time.
• Spatial data mining primarily deals with spatial data such as location and geo-referenced data.
Temporal data mining primarily deals with implicit and explicit temporal content, drawn from a
huge set of data.
• Spatial data mining involves characteristic rules, discriminant rules, evaluation rules, and
association rules. Temporal data mining targets mining new patterns and unknown knowledge
that takes the temporal aspects of the data into account.
• Examples of spatial data mining: finding hotspots and unusual locations. Example of temporal
data mining: an association rule such as "Any person who buys a motorcycle also buys a helmet"
becomes, with the temporal aspect, "Any person who buys a motorcycle also buys a helmet after that."
Over the last few years, the World Wide Web has become a significant source of information and
simultaneously a popular platform for business. Web mining can be defined as the method of utilizing data
mining techniques and algorithms to extract useful information directly from the web, such as Web
documents and services, hyperlinks, Web content, and server logs. The World Wide Web contains a
large amount of data that provides a rich source to data mining. The objective of Web mining is to look
for patterns in Web data by collecting and examining data in order to gain insights.
1. Web Content Mining:
Web content mining can be used to extract useful data, information, knowledge from the web page
content. In web content mining, each web page is considered as an individual document. The individual
can take advantage of the semi-structured nature of web pages, as HTML provides information that
concerns not only the layout but also logical structure. The primary task of content mining is data
extraction, where structured data is extracted from unstructured websites. The objective is to facilitate
data aggregation over various web sites by using the extracted structured data. Web content mining can
be utilized to distinguish topics on the web. For Example, if any user searches for a specific task on the
search engine, then the user will get a list of suggestions.
2. Web Structured Mining:
Web structure mining can be used to discover the link structure of hyperlinks. It is used to identify the
relationships between web pages given by hyperlinks or a direct link network. In Web Structure Mining, an individual considers
the web as a directed graph, with the web pages being the vertices that are associated with hyperlinks.
The most important application in this regard is the Google search engine, which estimates the ranking
of its outcomes primarily with the PageRank algorithm. It characterizes a page to be exceptionally
relevant when frequently connected by other highly related pages. Structure and content mining
methodologies are usually combined. For example, web structured mining can be beneficial to
organizations to regulate the network between two commercial sites.
3. Web Usage Mining:
Web usage mining is used to extract useful data, information, knowledge from the weblog records, and
assists in recognizing the user access patterns for web pages. In mining the usage of web resources, one
analyzes the records of requests made by visitors to a website, which are often collected as web
server logs. While the content and structure of the collection of web pages reflect the intentions of the
authors of the pages, the individual requests demonstrate how the consumers see these pages. Web
usage mining may disclose relationships that were not intended by the creator of the pages.
Some of the methods to identify and analyze the web usage patterns are given below:
I. Session and visitor analysis:
The analysis of preprocessed data can be accomplished in session analysis, which incorporates the guest
records, days, time, sessions, etc. This data can be utilized to analyze the visitor's behavior.
The document is created after this analysis, which contains the details of repeatedly visited web pages,
common entry, and exit.
II. OLAP (Online Analytical Processing):
OLAP accomplishes a multidimensional analysis of advanced data.
OLAP can be accomplished on various parts of log related data in a specific period.
OLAP tools can be used to infer important business intelligence metrics
The web presents great challenges for resource and knowledge discovery, based on the following
observations: