Data Warehousing & Mining BCA V SEM


Data Warehousing & Mining

Unit 1
Data Warehouse
A Data Warehouse (DW) is a relational database that is designed for query and analysis rather than
transaction processing. It includes historical data derived from transaction data from single or multiple
sources.
A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on providing support
for decision-makers for data modeling and analysis.
Data Warehouse is a group of data specific to the entire organization, not only to a particular group of
users.
It is not used for daily operations and transaction processing but used for making decisions.
A Data Warehouse can be viewed as a data system with the following attributes:
o It is a database designed for investigative tasks, using data from various applications.
o It supports a relatively small number of clients with relatively long interactions.
o It includes current and historical data to provide a historical perspective of information.
o Its usage is read-intensive.
o It contains a few large tables.
"Data Warehouse is a subject-oriented, integrated, and time-variant store of information in support of
management's decisions.
Characteristics of Data Warehouse

Subject-Oriented
A data warehouse targets the modeling and analysis of data for decision-makers. Therefore, data
warehouses typically provide a concise and straightforward view around a particular subject, such as
customer, product, or sales, instead of the organization's global ongoing operations. This is done by
excluding data that are not useful for the subject and including all data needed by the users to
understand the subject.

Integrated
A data warehouse integrates various heterogeneous data sources such as RDBMSs, flat files, and online
transaction records. This requires performing data cleaning and data integration during warehousing to
ensure consistency in naming conventions, attribute types, etc., among the different data sources.

Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve data from 3 months, 6
months, 12 months, or even older from a data warehouse. This contrasts with a transaction
system, where often only the most current data is kept.

Non-Volatile
The data warehouse is a physically separate data store, transformed from the source operational
RDBMS. Operational updates of data do not occur in the data warehouse, i.e., update, insert, and
delete operations are not performed. It usually requires only two procedures for data access: the initial
loading of data and read access to the data. Therefore, the DW does not require transaction processing, recovery,
and concurrency capabilities, which allows for substantial speedup of data retrieval. Non-volatile means
that, once entered into the warehouse, the data should not change.

History of Data Warehouse


The idea of data warehousing dates back to the late 1980s, when IBM researchers Barry Devlin and Paul
Murphy established the "Business Data Warehouse."
In essence, the data warehousing idea was intended to support an architectural model for the flow of
information from operational systems to decision support environments. The concept attempted to
address the various problems associated with this flow, mainly the high costs associated with it.
In the absence of a data warehousing architecture, a vast amount of space was required to support multiple
decision support environments. In large corporations, it was common for various decision support
environments to operate independently.

Goals of Data Warehousing
o To help reporting as well as analysis
o Maintain the organization's historical information
o Be the foundation for decision making.
Need for Data Warehouse
Data Warehouse is needed for the following reasons:

1) Business users: Business users require a data warehouse to view summarized data from the past.
Since these people are non-technical, the data may be presented to them in an elementary form.
2) Store historical data: A data warehouse is required to store time-variant data from the past.
This input is used for various purposes.
3) Make strategic decisions: Some strategies may depend upon the data in the data warehouse.
So, the data warehouse contributes to making strategic decisions.
4) For data consistency and quality: By bringing the data from different sources to a common place, the
user can effectively ensure uniformity and consistency in the data.
5) Fast response time: The data warehouse has to be ready for somewhat unexpected loads and types of
queries, which demands a significant degree of flexibility and quick response time.
Benefits of Data Warehouse
1. Understand business trends and make better forecasting decisions.
2. Data Warehouses are designed to perform well with enormous amounts of data.
3. The structure of data warehouses is more accessible for end-users to navigate, understand, and
query.
4. Queries that would be complex in many normalized databases could be easier to build and
maintain in data warehouses.
5. Data warehousing is an efficient method to manage demand for lots of information from lots of
users.
6. Data warehousing provides the capability to analyze large amounts of historical data.

Difference between Database and Data Warehouse


1. Database: It is used for Online Transactional Processing (OLTP), but can be used for other objectives such as data warehousing. It records the data from the clients for history.
   Data Warehouse: It is used for Online Analytical Processing (OLAP). It reads the historical information of the customers for business decisions.
2. Database: The tables and joins are complicated since they are normalized for the RDBMS. This is done to reduce redundant data and to save storage space.
   Data Warehouse: The tables and joins are simple since they are de-normalized. This is done to minimize the response time for analytical queries.
3. Database: Data is dynamic.
   Data Warehouse: Data is largely static.
4. Database: Entity-Relationship modeling techniques are used for database design.
   Data Warehouse: Data modeling techniques are used for the data warehouse design.
5. Database: Optimized for write operations.
   Data Warehouse: Optimized for read operations.
6. Database: Performance is low for analysis queries.
   Data Warehouse: Performance is high for analytical queries.
7. Database: The database is the place where the data is taken as a base and managed to provide fast and efficient access.
   Data Warehouse: The data warehouse is the place where the application data is handled for analysis and reporting objectives.

Statistical Database
A statistical database (SDB) system is a database system that enables its users to retrieve only aggregate
statistics (e.g., sample mean and count) for a subset of the entities represented in the database.
As a statistical database may contain sensitive individual information, such as salary and health records,
generally, users are only allowed to retrieve aggregate statistics for a subset of the entities represented in
the databases. Common aggregate query operators in SQL include SUM, COUNT, MAX, MIN, and
AVERAGE, though more sophisticated statistical measures may also be supported by some database
systems.
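For illustration, here is a minimal sketch of the kind of query a statistical database would permit, assuming a hypothetical employee table with department and salary columns; only aggregates are returned, never individual rows:

    -- Aggregate-only query: returns counts and averages, never individual salaries
    SELECT department,
           COUNT(*)    AS num_employees,
           AVG(salary) AS avg_salary
    FROM employee
    GROUP BY department;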
Statistical databases pose unique security concerns, which have been the focus of much research.
The key security challenge is ensuring that no user is able to infer private information about any
individual from the aggregate statistics they are allowed to retrieve. The main differences between a data
warehouse and a statistical database are their purpose, how they store data, and how users access the data:
Purpose
A data warehouse's main purpose is to store and analyze data to help with decision-making, while a
statistical database's main purpose is to store data for statistical analysis and reporting.
Data storage
A data warehouse stores data from various sources, including relational databases and transactional
systems, and is designed to handle large amounts of data. A statistical database stores data that is
organized for easy access and manipulation.
Data access
A data warehouse is accessed by a smaller number of people for specific reasons, while a database is
accessed by a larger number of people with broader needs. A statistical database allows users to retrieve
aggregate statistics, such as sample mean and count, for a subset of the database's entities.

Data Mart
A data mart is a data storage system that contains information specific to an organization's business unit.
It contains a small and selected part of the data that the company stores in a larger storage system.
Companies use a data mart to analyze department-specific information more efficiently. It provides
summarized data that key stakeholders can use to quickly make informed decisions.
For example, a company might store data from various sources, such as supplier information, orders,
sensor data, employee information, and financial records in their data warehouse or data lake. However,
the company stores information relevant to, for instance, the marketing department, such as social media
reviews and customer records, in a data mart.

A data mart is a simple form of data warehouse focused on a single subject or line of business. With a
data mart, teams can access data and gain insights faster, because they don’t have to spend time searching
within a more complex data warehouse or manually aggregating data from different sources.

Characteristics of data marts
Typically built and managed by the enterprise data team, although they can be built and maintained by
business unit SMEs organically as well.
Business group data stewards maintain the data mart, and end users have read-only access — they can
query and view tables, but cannot modify them, in order to prevent less technically-savvy users from
accidentally deleting or modifying critical business data.
Typically uses a dimensional model and star schema.
Contains a curated subset of data from the larger data warehouse. The data is highly structured, having
been cleansed and conformed by the enterprise data team to make it easy to understand and query.
Designed around the unique needs of a particular line of business or use case.
Users typically query the data using SQL commands.
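For example, a hedged sketch of such a read-only query against a hypothetical marketing data mart (table and column names are assumptions):

    -- Summarized, department-specific view: campaign revenue by region
    SELECT c.region,
           SUM(f.revenue) AS total_revenue
    FROM campaign_fact f
    JOIN customer_dim c ON f.customer_key = c.customer_key
    WHERE f.campaign_year = 2023
    GROUP BY c.region
    ORDER BY total_revenue DESC;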

Benefits of data marts


Single source of truth — the data mart can serve as a single source of truth for a particular line of business,
so everyone is working off of the same facts and data.
Simplicity — business users looking for data can visit the curated data mart for easy access to the data
they care about, instead of having to wade through the entire data warehouse and join tables together to
get the data they need.

Types of Data Marts


There are mainly two approaches to designing data marts. These approaches are
o Dependent Data Marts
o Independent Data Marts
Dependent Data Marts
A dependent data mart is a logical subset or a physical subset of a larger data warehouse. According to
this technique, the data marts are treated as subsets of a data warehouse. In this technique, a
data warehouse is created first, from which various data marts can then be created. These data marts are
dependent on the data warehouse and extract the essential records from it. In this technique, since the data
warehouse creates the data marts, there is no need for data mart integration. It is also known as
a top-down approach.

Independent Data Marts
The second approach is independent data marts (IDM). Here, independent data marts are created first,
and then a data warehouse is designed using these multiple independent data marts. In this approach, since
all the data marts are designed independently, integration of the data marts is required. It is also
termed a bottom-up approach, as the data marts are integrated to develop a data warehouse.

Other than these two categories, one more type exists that is called "Hybrid Data Marts."
Hybrid Data Marts
It allows us to combine input from sources other than a data warehouse. This can be helpful in many
situations, especially when ad hoc integrations are needed, such as after a new group or product is added
to the organization.

Data Warehouse: A Data Warehouse is a vast repository of information collected from various organizations or departments within a corporation.
Data Mart: A data mart is only a subtype of a data warehouse. It is architected to meet the requirements of a specific user group.

Data Warehouse: It may hold multiple subject areas.
Data Mart: It holds only one subject area, for example, Finance or Sales.

Data Warehouse: It holds very detailed information.
Data Mart: It may hold more summarized data.

Data Warehouse: It works to integrate all data sources.
Data Mart: It concentrates on integrating data from a given subject area or set of source systems.

Data Warehouse: In data warehousing, a fact constellation schema is used.
Data Mart: In a data mart, star and snowflake schemas are used.

Data Warehouse: It is a centralized system.
Data Mart: It is a decentralized system.

Data Warehouse: Data warehousing is data-oriented.
Data Mart: A data mart is project-oriented.

What is Meta Data?
Metadata is data about the data or documentation about the information which is required by the users. In
data warehousing, metadata is one of the essential aspects.
Metadata includes the following:
1. The location and descriptions of warehouse systems and components.
2. Names, definitions, structures, and content of data-warehouse and end-users views.
3. Identification of authoritative data sources.
4. Integration and transformation rules used to populate data.
5. Integration and transformation rules used to deliver information to end-user analytical tools.
6. Subscription information for information delivery to analysis subscribers.
7. Metrics used to analyze warehouse usage and performance.
8. Security authorizations, access control list, etc.
Metadata is used for building, maintaining, managing, and using the data warehouse. Metadata helps
users understand the content and find the data they need.
Several examples of metadata are:
1. A library catalog may be considered metadata. The catalog metadata consists of several
predefined components representing specific attributes of a resource, and each item can have one
or more values. These components could be the name of the author, the name of the document,
the publisher's name, the publication date, and the categories to which it belongs.
2. The table of contents and the index in a book may be treated as metadata for the book.
3. Suppose we say that a data item about a person is 80. This must be defined by noting that it is the
person's weight and that the unit is kilograms. Therefore, (weight, kilograms) is the metadata about the
data value 80.
4. Other examples of metadata are data about the tables and figures in a report like this book. A
table has a name (e.g., the table title), and the column names of the table may be treated as
metadata. The figures also have titles or names.
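In a data warehouse, such descriptions are recorded in metadata tables. A minimal, purely illustrative sketch (the table and column names below are assumptions, not a standard):

    -- Hypothetical metadata table describing warehouse columns
    CREATE TABLE column_metadata (
        table_name    VARCHAR(64),
        column_name   VARCHAR(64),
        description   VARCHAR(255),  -- what the value means, e.g. 'person weight'
        unit          VARCHAR(32),   -- e.g. 'kilograms'
        source_system VARCHAR(64),   -- authoritative data source
        load_rule     VARCHAR(255)   -- transformation rule used to populate the column
    );

    -- The (weight, kilograms) example from above, recorded as a metadata row
    INSERT INTO column_metadata
    VALUES ('person', 'weight', 'person weight', 'kilograms', 'HR_SYSTEM', 'copied as-is');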

Types of Metadata
Metadata in a data warehouse fall into three major parts:
o Operational Metadata
o Extraction and Transformation Metadata
o End-User Metadata

Operational Metadata
As we know, data for the data warehouse comes from various operational systems of the enterprise. These
source systems use different data structures. The data elements selected for the data warehouse have
various field lengths and data types.
In selecting information from the source systems for the data warehouse, we split records, combine
parts of records from different source files, and deal with multiple coding schemes and field lengths.
When we deliver information to the end-users, we must be able to tie it back to the source data sets.
Operational metadata contains all of this information about the operational data sources.
Extraction and Transformation Metadata
Extraction and transformation metadata include data about the extraction of data from the source systems,
namely, the extraction frequencies, extraction methods, and business rules for the data extraction. Also,
this category of metadata contains information about all the data transformations that take place in the
data staging area.
End-User Metadata
The end-user metadata is the navigational map of the data warehouses. It enables the end-users to find
data from the data warehouses. The end-user metadata allows the end-users to use their business
terminology and look for the information in those ways in which they usually think of the business.
Metadata Repository
The metadata itself is housed in and controlled by the metadata repository. Metadata repository
management software can be used to map the source data to the target database, integrate and transform
the data, generate code for data transformation, and move data to the warehouse.
Benefits of Metadata Repository
1. It provides a set of tools for enterprise-wide metadata management.
2. It eliminates and reduces inconsistency, redundancy, and underutilization.
3. It improves organization control, simplifies management, and accounting of information assets.
4. It increases coordination, understanding, identification, and utilization of information assets.
5. It enforces CASE development standards with the ability to share and reuse metadata.
6. It leverages investment in legacy systems and utilizes existing applications.
7. It provides a relational model for heterogeneous RDBMS to share information.
8. It provides a useful data administration tool to manage corporate information assets with the data
dictionary.
9. It increases the reliability, control, and flexibility of the application development process.

What is Data Cube?


When data is grouped or combined into multidimensional matrices, the result is called a data cube. The data cube method
has a few alternative names or variants, such as "multidimensional databases," "materialized views,"
and "OLAP (On-Line Analytical Processing)."
The general idea of this approach is to materialize certain expensive computations that are frequently
queried.
For example, a relation with the schema sales (part, supplier, customer, and sale-price) can be
materialized into a set of eight views as shown in fig, where psc indicates a view consisting of aggregate
function value (such as total-sales) computed by grouping three attributes part, supplier, and
customer, p indicates a view composed of the corresponding aggregate function values calculated by
grouping part alone, etc.
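A hedged sketch of two of these eight views, assuming the sales relation has columns part, supplier, customer, and sale_price:

    -- The 'psc' view: total sales grouped by part, supplier, and customer
    CREATE VIEW sales_psc AS
    SELECT part, supplier, customer, SUM(sale_price) AS total_sales
    FROM sales
    GROUP BY part, supplier, customer;

    -- The 'p' view: total sales grouped by part alone
    CREATE VIEW sales_p AS
    SELECT part, SUM(sale_price) AS total_sales
    FROM sales
    GROUP BY part;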

A data cube is created from a subset of attributes in the database. Specific attributes are chosen to be
measure attributes, i.e., the attributes whose values are of interest. Other attributes are selected as
dimensions or functional attributes. The measure attributes are aggregated according to the dimensions.
For example, XYZ may create a sales data warehouse to keep records of the store's sales for the
dimensions time, item, branch, and location. These dimensions enable the store to keep track of things
like monthly sales of items, and the branches and locations at which the items were sold. Each dimension
may have a table associated with it, known as a dimension table, which describes the dimension. For
example, a dimension table for items may contain the attributes item_name, brand, and type.
Data cube method is an interesting technique with many applications. Data cubes could be sparse in many
cases because not every cell in each dimension may have corresponding data in the database.
Techniques should be developed to handle sparse cubes efficiently.
If a query contains constants at even lower levels than those provided in a data cube, it is not clear how
to make the best use of the precomputed results stored in the data cube.
The multidimensional data model views data in the form of a data cube. OLAP tools are based on the multidimensional data
model. Data cubes usually model n-dimensional data.
A data cube enables data to be modeled and viewed in multiple dimensions. A multidimensional data
model is organized around a central theme, like sales or transactions. A fact table represents this theme.
Facts are numerical measures. Thus, the fact table contains measures (such as Rs_sold) and keys to each
of the related dimension tables.
Dimensions are the entities with respect to which a data cube is defined. Facts are generally quantities, which are used for analyzing
the relationships between dimensions.

Example: In the 2-D representation, we look at the All Electronics sales data for items sold per
quarter in the city of Vancouver. The measure displayed is dollars sold (in thousands).

3-Dimensional Cuboids
Suppose we would like to view the sales data with a third dimension. For example, suppose we would
like to view the data according to time and item, as well as the location, for the cities Chicago, New York,
Toronto, and Vancouver. The measure displayed is dollars sold (in thousands). These 3-D data are shown
in the table. The 3-D data of the table are represented as a series of 2-D tables.

Conceptually, we may represent the same data in the form of 3-D data cubes, as shown in fig:

Let us suppose that we would like to view our sales data with an additional fourth dimension, such as a
supplier.
In data warehousing, the data cubes are n-dimensional. The cuboid which holds the lowest level of
summarization is called a base cuboid.
For example, the 4-D cuboid in the figure is the base cuboid for the given time, item, location, and
supplier dimensions.

The figure shows a 4-D data cube representation of sales data, according to the dimensions time, item,
location, and supplier. The measure displayed is dollars sold (in thousands).
The topmost 0-D cuboid, which holds the highest level of summarization, is known as the apex cuboid.
In this example, this is the total sales, or dollars sold, summarized over all four dimensions.
The lattice of cuboids forms a data cube. The figure shows the lattice of cuboids forming a 4-D data cube
for the dimensions time, item, location, and supplier. Each cuboid represents a different degree of
summarization.

Multidimensional Data Model
The Multidimensional Data Model is a method of organizing data in the database with a good
arrangement and assembly of the contents of the database.
Unlike relational databases, which allow customers to access data in the form of queries, the
Multidimensional Data Model allows customers to ask analytical questions associated with
market or business trends. It allows users to receive answers to their requests rapidly by creating and
examining the data comparatively fast.
OLAP (online analytical processing) and data warehousing use multidimensional databases, which are used
to show multiple dimensions of the data to users.
The model represents data in the form of data cubes. Data cubes allow us to model and view the data from many
dimensions and perspectives. A cube is defined by dimensions and facts and is represented by a fact table. Facts
are numerical measures, and fact tables contain the measures of the related dimension tables or names of the
facts.

Multidimensional Data Representation

Working on a Multidimensional Data Model


The Multidimensional Data Model is built on the basis of pre-decided stages.
The following stages should be followed by every project for building a Multidimensional Data Model:
Stage 1: Assembling data from the client: In the first stage, correct data is collected from the client.
Mostly, software professionals clarify to the client the range of data which can be obtained with the
selected technology and collect the complete data in detail.
Stage 2: Grouping different segments of the system: In the second stage, all the data is recognized and
classified into the respective sections it belongs to, which also makes the model problem-free to apply
step by step.
Stage 3: Identifying the different dimensions: The third stage is the basis on which the design of
the system rests. In this stage, the main factors are recognized according to the user's point of view.
These factors are also known as "dimensions".
Stage 4: Preparing the identified factors and their respective qualities: In the fourth stage, the
factors which were recognized in the previous step are used for identifying their related qualities.
These qualities are also known as "attributes" in the database.
Stage 5: Finding the facts for the factors listed previously and their qualities: In the fifth
stage, the facts are separated and differentiated from the factors collected earlier. These facts play a
significant role in the arrangement of a Multidimensional Data Model.
Stage 6: Building the schema to place the data, with respect to the information collected from the
steps above: In the sixth stage, a schema is built on the basis of the data collected previously.
For example:
1. Consider the example of a firm. The revenue of a firm can be analyzed on the basis of different
factors such as the geographical location of the firm's workplaces, the products of the firm, the
advertisements done, the time utilized to develop a product, etc.
Features of multidimensional data models:
Measures: Measures are numerical data that can be analyzed and compared, such as sales or revenue.
They are typically stored in fact tables in a multidimensional data model.
Dimensions: Dimensions are attributes that describe the measures, such as time, location, or product.
They are typically stored in dimension tables in a multidimensional data model.
Cubes: Cubes are structures that represent the multidimensional relationships between measures and
dimensions in a data model. They provide a fast and efficient way to retrieve and analyze data.
Aggregation: Aggregation is the process of summarizing data across dimensions and levels of detail.
This is a key feature of multidimensional data models, as it enables users to quickly analyze data at
different levels of granularity.
Drill-down and roll-up: Drill-down is the process of moving from a higher-level summary of data to a
lower level of detail, while roll-up is the opposite process of moving from a lower-level detail to a
higher-level summary. These features enable users to explore data in greater detail and gain insights into
the underlying patterns.
Hierarchies: Hierarchies are a way of organizing dimensions into levels of detail. For example, a time
dimension might be organized into years, quarters, months, and days. Hierarchies provide a way to
navigate the data and perform drill-down and roll-up operations.
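As an illustrative sketch (assuming a sales fact table joined to a time dimension with year, quarter, and month columns), rolling up along the time hierarchy simply drops a level from the grouping:

    -- Drill-down level: monthly sales
    SELECT t.year, t.quarter, t.month, SUM(s.dollars_sold) AS sales
    FROM sales s
    JOIN time_dim t ON s.time_key = t.time_key
    GROUP BY t.year, t.quarter, t.month;

    -- Roll-up level: quarterly sales (the month level is dropped)
    SELECT t.year, t.quarter, SUM(s.dollars_sold) AS sales
    FROM sales s
    JOIN time_dim t ON s.time_key = t.time_key
    GROUP BY t.year, t.quarter;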

What is Star Schema?


A star schema is the most elementary form of a dimensional model, in which data are organized
into facts and dimensions. A fact is an event that is counted or measured, such as a sale or a login. A
dimension contains reference data about the fact, such as date, item, or customer.
A star schema is a relational schema whose design represents a
multidimensional data model. The star schema is the simplest data warehouse schema. It is known as a star
schema because the entity-relationship diagram of this schema resembles a star, with points diverging
from a central table. The center of the schema consists of a large fact table, and the points of the star are
the dimension tables.

Fact Tables
A fact table is a table in a star schema that contains facts and is connected to the dimensions. A fact table has two types of
columns: those that contain facts and those that are foreign keys to the dimension tables. The primary key
of the fact table is generally a composite key made up of all of its foreign keys.
A fact table may contain either detail-level facts or facts that have been aggregated (fact tables that contain
aggregated facts are often instead called summary tables). A fact table generally contains facts at the
same level of aggregation.
Dimension Tables
A dimension is a structure, usually composed of one or more hierarchies, that categorizes data. If a
dimension has no hierarchies and levels, it is called a flat dimension or list. The primary key of
each dimension table is part of the composite primary key of the fact table. Dimensional
attributes help to define the dimensional values. They are generally descriptive, textual values.
Dimension tables are usually smaller in size than fact tables.
Fact tables store data about sales, while dimension tables store data about the geographic region (markets,
cities), clients, products, times, and channels.
Characteristics of Star Schema
The star schema is highly suitable for data warehouse database design because of the following
features:
o It creates a de-normalized database that can quickly provide query responses.
o It provides a flexible design that can be changed easily or added to throughout the development
cycle, and as the database grows.
o It parallels in design how end-users typically think of and use the data.
o It reduces the complexity of metadata for both developers and end-users.

Advantages of Star Schema


Star schemas are easy for end-users and applications to understand and navigate. With a well-designed
schema, users can instantly analyze large, multidimensional data sets.
The main advantages of star schemas in a decision-support environment are:

Query Performance
Because a star schema database has a limited number of tables and clear join paths, queries run faster than they
do against OLTP systems. Small single-table queries, frequently against a dimension table, are almost
instantaneous. Large join queries that involve multiple tables take only seconds or minutes to run.
In a star schema database design, the dimensions are connected only through the central fact table. When
two dimension tables are used in a query, only one join path, intersecting the fact table, exists between
those two tables. This design feature enforces accurate and consistent query results.
Load performance and administration
Structural simplicity also reduces the time required to load large batches of records into a star schema
database. By defining facts and dimensions and separating them into different tables, the impact of a
load operation is reduced. Dimension tables can be populated once and occasionally refreshed. We can add
new facts regularly and selectively by appending records to the fact table.
Built-in referential integrity
A star schema has referential integrity built in when information is loaded. Referential integrity is
enforced because each record in a dimension table has a unique primary key, and all keys in the fact table
are legitimate foreign keys drawn from the dimension tables. A record in the fact table which is not
correctly related to a dimension cannot be given the correct key value to be retrieved.

Easily Understood
A star schema is simple to understand and navigate, with dimensions joined only through the fact table.
These joins are more meaningful to the end-user because they represent the fundamental relationships
between parts of the underlying business. Users can also browse dimension table attributes before
constructing a query.

Disadvantage of Star Schema


There are some conditions that cannot be met by star schemas; for example, the relationship between a user and a
bank account cannot be described as a star schema, as the relationship between them is many-to-many.
Example: Suppose a star schema is composed of a fact table, SALES, and several dimension tables
connected to it for time, branch, item, and geographic locations.
The TIME table has columns for day, month, quarter, and year. The ITEM table has columns
item_key, item_name, brand, type, and supplier_type. The BRANCH table has columns
branch_key, branch_name, and branch_type. The LOCATION table has columns of geographic data,
including street, city, state, and country.

In this scenario, the SALES table contains only four columns with IDs from the dimension tables, TIME,
ITEM, BRANCH, and LOCATION, instead of four columns for time data, four columns for ITEM data,
three columns for BRANCH data, and four columns for LOCATION data. Thus, the size of the fact table
is significantly reduced. When we need to change an item, we need only make a single change in the
dimension table, instead of making many changes in the fact table.
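A minimal SQL sketch of this star schema (data types, and the inclusion of the two measures in the fact table, are assumptions based on the earlier cube discussion):

    -- Dimension tables
    CREATE TABLE time_dim     (time_key INT PRIMARY KEY, day INT, month INT, quarter INT, year INT);
    CREATE TABLE item_dim     (item_key INT PRIMARY KEY, item_name VARCHAR(50), brand VARCHAR(50),
                               type VARCHAR(50), supplier_type VARCHAR(50));
    CREATE TABLE branch_dim   (branch_key INT PRIMARY KEY, branch_name VARCHAR(50), branch_type VARCHAR(50));
    CREATE TABLE location_dim (location_key INT PRIMARY KEY, street VARCHAR(50), city VARCHAR(50),
                               state VARCHAR(50), country VARCHAR(50));

    -- Fact table: one foreign key per dimension plus the numerical measures
    CREATE TABLE sales (
        time_key     INT REFERENCES time_dim(time_key),
        item_key     INT REFERENCES item_dim(item_key),
        branch_key   INT REFERENCES branch_dim(branch_key),
        location_key INT REFERENCES location_dim(location_key),
        rupees_sold  DECIMAL(12,2),
        units_sold   INT
    );

    -- A typical star-join query: total sales by brand and city
    SELECT i.brand, l.city, SUM(s.rupees_sold) AS total_sales
    FROM sales s
    JOIN item_dim i     ON s.item_key = i.item_key
    JOIN location_dim l ON s.location_key = l.location_key
    GROUP BY i.brand, l.city;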
We can create even more complex star schemas by normalizing a dimension table into several tables. The
normalized dimension table is called a Snowflake.

What is Snowflake Schema?


A snowflake schema is a variant of the star schema. "A schema is known as a snowflake if one or more
dimension tables do not connect directly to the fact table but must join through other dimension tables."
The snowflake schema is an expansion of the star schema where each point of the star explodes into more
points. It is called snowflake schema because the diagram of snowflake schema resembles a
snowflake. Snowflaking is a method of normalizing the dimension tables in a star schema. When we
normalize all the dimension tables entirely, the resultant structure resembles a snowflake with the fact
table in the middle.
Snowflaking is used to develop the performance of specific queries. The schema is diagramed with each
fact surrounded by its associated dimensions, and those dimensions are related to other dimensions,
branching out into a snowflake pattern.
The snowflake schema consists of one fact table which is linked to many dimension tables, which can be
linked to other dimension tables through a many-to-one relationship. Tables in a snowflake schema are
generally normalized to the third normal form. Each dimension table represents exactly one level in a
hierarchy, and a hierarchy can have any number of levels.

Example: Figure shows a snowflake schema with a Sales fact table, with Store, Location, Time, Product,
Line, and Family dimension tables. The Market dimension has two dimension tables with Store as the
primary dimension table, and Location as the outrigger dimension table. The product dimension has three
dimension tables with Product as the primary dimension table, and the Line and Family table are the
outrigger dimension tables.

A star schema stores all the attributes for a dimension in one de-normalized table. This needs more disk
space than a more normalized snowflake schema. Snowflaking normalizes the dimension by moving
attributes with low cardinality into separate dimension tables that relate to the core dimension table by
using foreign keys. Snowflaking for the sole purpose of minimizing disk space is not recommended,
because it can adversely impact query performance.
In a snowflake schema, tables are normalized to remove redundancy; dimension tables are decomposed
into multiple dimension tables.
The figure shows a simple star schema for sales in a manufacturing company. The sales fact table includes
quantity, price, and other relevant metrics. SALESREP, CUSTOMER, PRODUCT, and TIME are the
dimension tables.

The star schema for sales, as shown above, contains only five tables, whereas the normalized (snowflake)
version extends to eleven tables. Notice that in the snowflake schema, the attributes with low
cardinality in each original dimension table are moved to separate tables. These new tables are
connected back to the original dimension table through artificial keys.

A snowflake schema is designed for flexible querying across more complex dimensions and relationships.
It is suitable for many-to-many and one-to-many relationships between dimension levels.
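A hedged sketch of snowflaking the item dimension from the star schema sketch above: the low-cardinality supplier attribute moves into its own table, referenced through a foreign key (names are illustrative):

    -- Outrigger table holding the low-cardinality supplier attributes
    CREATE TABLE supplier_dim (
        supplier_key  INT PRIMARY KEY,
        supplier_type VARCHAR(50)
    );

    -- Snowflaked item dimension: supplier_type is replaced by a foreign key
    CREATE TABLE item_dim_sf (
        item_key     INT PRIMARY KEY,
        item_name    VARCHAR(50),
        brand        VARCHAR(50),
        type         VARCHAR(50),
        supplier_key INT REFERENCES supplier_dim(supplier_key)
    );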
Advantage of Snowflake Schema
1. The primary advantage of the snowflake schema is the improvement in query performance due to
minimized disk storage requirements and the joining of smaller lookup tables.
2. It provides greater scalability in the interrelationship between dimension levels and components.
3. No redundancy, so it is easier to maintain.
Disadvantage of Snowflake Schema
1. The primary disadvantage of the snowflake schema is the additional maintenance effort required
due to the increased number of lookup tables.
2. Queries are more complex and hence more difficult to understand.
3. More tables mean more joins, and therefore longer query execution time.
Difference between Star and Snowflake Schemas
Star Schema
o In a star schema, the fact table is at the center and is connected to the dimension tables.
o The tables are completely de-normalized in structure.
o SQL query performance is good, as fewer joins are involved.
o Data redundancy is high, and it occupies more disk space.

Snowflake Schema
o A snowflake schema is an extension of the star schema where the dimension tables are connected to
one or more further dimension tables.
o The tables are partially de-normalized in structure.
o The performance of SQL queries is somewhat lower compared to a star schema, as more joins
are involved.
o Data redundancy is low, and it occupies less disk space compared to a star schema.

Let's look at the differences between the Star and Snowflake Schemas.

Ease of maintenance/change: The star schema has redundant data and hence is less easy to maintain or change. The snowflake schema has no redundancy and is therefore easier to maintain and change.
Ease of use: The star schema has less complex queries that are simple to understand. The snowflake schema has more complex queries that are therefore less easy to understand.
Parent table: In a star schema, a dimension table will not have any parent table. In a snowflake schema, a dimension table will have one or more parent tables.
Query performance: The star schema has fewer foreign keys and hence less query execution time. The snowflake schema has more foreign keys and thus more query execution time.
Normalization: The star schema has de-normalized tables. The snowflake schema has normalized tables.
Type of data warehouse: The star schema is good for data marts with simple relationships (one-to-one or one-to-many). The snowflake schema is good for a data warehouse core, to simplify complex relationships (many-to-many).
Joins: The star schema has fewer joins. The snowflake schema has a higher number of joins.
Dimension tables: The star schema contains only a single dimension table for each dimension. The snowflake schema may have more than one dimension table for each dimension.
Hierarchies: In a star schema, hierarchies for a dimension are stored in the dimension table itself. In a snowflake schema, hierarchies are broken into separate tables; these hierarchies help to drill down the information from the topmost level to the lowermost level.
When to use: When the dimension tables contain fewer rows, we can go for the star schema. When the dimension tables store a huge number of rows with redundant information, and space is an issue, we can choose the snowflake schema to save space.
Data warehouse system: The star schema works best in any data warehouse/data mart. The snowflake schema is better for small data warehouses/data marts.

What is Fact Constellation Schema?


A Fact constellation means two or more fact tables sharing one or more dimensions. It is also
called Galaxy schema.
Fact Constellation Schema describes a logical structure of a data warehouse or data mart. It can be
designed with a collection of de-normalized fact tables and shared, conformed dimension tables.

Fact Constellation Schema is a sophisticated database design in which it is difficult to summarize information.
It can be implemented between aggregate fact tables or by decomposing a complex fact
table into independent simple fact tables.
Example: A fact constellation schema is shown in the figure below.

This schema defines two fact tables, sales, and shipping. Sales are treated along four dimensions,
namely, time, item, branch, and location. The schema contains a fact table for sales that includes keys to
each of the four dimensions, along with two measures: Rupee_sold and units_sold. The shipping table
has five dimensions, or keys: item_key, time_key, shipper_key, from_location, and to_location, and two
measures: Rupee_cost and units_shipped.
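A minimal sketch of the two fact tables, reusing the dimension tables from the star schema sketch in the previous section; key and measure names follow the description above, while data types and the shipper dimension are assumptions:

    -- Sales fact table
    CREATE TABLE sales_fact (
        time_key      INT REFERENCES time_dim(time_key),
        item_key      INT REFERENCES item_dim(item_key),
        branch_key    INT REFERENCES branch_dim(branch_key),
        location_key  INT REFERENCES location_dim(location_key),
        rupee_sold    DECIMAL(12,2),
        units_sold    INT
    );

    -- Shipping fact table: shares the time, item, and location dimensions with sales
    CREATE TABLE shipping_fact (
        item_key      INT REFERENCES item_dim(item_key),
        time_key      INT REFERENCES time_dim(time_key),
        shipper_key   INT,   -- would reference a separate shipper dimension table
        from_location INT REFERENCES location_dim(location_key),
        to_location   INT REFERENCES location_dim(location_key),
        rupee_cost    DECIMAL(12,2),
        units_shipped INT
    );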
The primary disadvantage of the fact constellation schema is that it is a more challenging design because
many variants for specific kinds of aggregation must be considered and selected.

Data Warehouse Applications


The application areas of the data warehouse are:
Information Processing
It deals with querying, statistical analysis, and reporting via tables, charts, or graphs. Nowadays,
information processing in a data warehouse is moving toward building low-cost, web-based access tools,
typically integrated with web browsers.
Analytical Processing
It supports various online analytical processing operations such as drill-down, roll-up, and pivoting. The historical
data is processed in both summarized and detailed format.
OLAP is implemented on data warehouses or data marts. The primary objective of OLAP is to support
the ad-hoc querying needed for decision support systems (DSS). The multidimensional view of data is fundamental to OLAP
applications. OLAP is an operational view, not a data structure or schema. The complex nature of OLAP
applications requires a multidimensional view of the data.
Data Mining
It helps in the analysis of hidden patterns and associations, constructing analytical models, performing
classification and prediction, and presenting the mining results using visualization tools.
Data mining is the technique of discovering essential new correlations, patterns, and trends by sifting
through large amounts of data stored in repositories, using pattern recognition technologies as well as
statistical and mathematical techniques.
It is the process of selection, exploration, and modeling of huge quantities of data to determine
regularities or relations that are at first unknown, in order to obtain precise and useful results for the owner of the
database.
It is the process of inspection and analysis, by automatic or semi-automatic means, of large quantities of
records to discover meaningful patterns and rules.

UNIT 2
Data Warehouse Architecture
A data warehouse architecture is a method of defining the overall architecture of data communication,
processing, and presentation that exists for end-client computing within the enterprise. Each data
warehouse is different, but all are characterized by standard vital components.
Production applications such as payroll, accounts payable, product purchasing, and inventory control are
designed for online transaction processing (OLTP). Such applications gather detailed data from day-to-day
operations.
Data warehouse applications are designed to support the user's ad-hoc data requirements, an activity
recently dubbed online analytical processing (OLAP). These include applications such as forecasting,
profiling, summary reporting, and trend analysis.
Production databases are updated continuously, either by hand or via OLTP applications. In contrast, a
warehouse database is updated from operational systems periodically, usually during off-hours. As OLTP
data accumulates in production databases, it is regularly extracted, filtered, and then loaded into a
dedicated warehouse server that is accessible to users. As the warehouse is populated, it must be
restructured: tables de-normalized, data cleansed of errors and redundancies, and new fields and keys added
to reflect the needs of users for sorting, combining, and summarizing data.
Data warehouses and their architectures vary depending upon the elements of an organization's situation.
Three common architectures are:
o Data Warehouse Architecture: Basic
o Data Warehouse Architecture: With Staging Area
o Data Warehouse Architecture: With Staging Area and Data Marts
Data Warehouse Architecture: Basic

Operational System
An operational system is a method used in data warehousing to refer to a system that is used to process
the day-to-day transactions of an organization.
Flat Files
A Flat file system is a system of files in which transactional data is stored, and every file in the system
must have a different name.
Meta Data
A set of data that defines and gives information about other data.
Metadata is used in a data warehouse for a variety of purposes, including:
Metadata summarizes necessary information about data, which can make finding and working with
particular instances of data easier. For example, author, date created, date modified, and file
size are examples of very basic document metadata.
Metadata is used to direct a query to the most appropriate data source.
Lightly and highly summarized data

This area of the data warehouse stores all the predefined lightly and highly summarized (aggregated) data
generated by the warehouse manager.
The goal of the summarized information is to speed up query performance. The summarized data is
updated continuously as new information is loaded into the warehouse.
End-User access Tools
The principal purpose of a data warehouse is to provide information to business managers for strategic
decision-making. These users interact with the warehouse using end-client access tools.
The examples of some of the end-user access tools can be:
o Reporting and Query Tools
o Application Development Tools
o Executive Information Systems Tools
o Online Analytical Processing Tools
o Data Mining Tools
Data Warehouse Architecture: With Staging Area
We must clean and process the operational data before putting it into the warehouse.
We can do this programmatically, although most data warehouses use a staging area (a place where data is
processed before entering the warehouse).
A staging area simplifies data cleansing and consolidation for operational data coming from multiple
source systems, especially for enterprise data warehouses where all relevant data of an enterprise is
consolidated.

Data Warehouse Staging Area is a temporary location where a record from source systems is copied.

Data Warehouse Architecture: With Staging Area and Data Marts


We may want to customize our warehouse's architecture for multiple groups within our organization.
We can do this by adding data marts. A data mart is a segment of a data warehouse that provides
information for reporting and analysis on a section, unit, department, or operation in the company, e.g.,
sales, payroll, production, etc.
The figure illustrates an example where purchasing, sales, and stocks are separated. In this example, a
financial analyst wants to analyze historical data for purchases and sales or mine historical information to
make predictions about customer behavior.

Properties of Data Warehouse Architectures
The following architecture properties are necessary for a data warehouse system:

1. Separation: Analytical and transactional processing should be kept apart as much as possible.
2. Scalability: Hardware and software architectures should be simple to upgrade as the data volume, which
has to be managed and processed, and the number of users' requirements, which have to be met,
progressively increase.
3. Extensibility: The architecture should be able to host new applications and technologies without
redesigning the whole system.
4. Security: Monitoring access is necessary because of the strategic data stored in the data warehouse.
5. Administerability: Data warehouse management should not be complicated.
Types of Data Warehouse Architectures

Single-Tier Architecture
Single-tier architecture is not frequently used in practice. Its purpose is to minimize the amount of data
stored; to reach this goal, it removes data redundancies.
The figure shows that the only layer physically available is the source layer. In this method, data warehouses
are virtual. This means that the data warehouse is implemented as a multidimensional view of operational
data created by specific middleware, or an intermediate processing layer.

The weakness of this architecture lies in its failure to meet the requirement for separation between
analytical and transactional processing. Analysis queries are submitted to operational data after the
middleware interprets them. In this way, queries affect transactional workloads.
Two-Tier Architecture
The requirement for separation plays an essential role in defining the two-tier architecture for a data
warehouse system, as shown in fig:

Although it is typically called a two-layer architecture to highlight the separation between physically available
sources and the data warehouse, it in fact consists of four subsequent data flow stages:
1. Source layer: A data warehouse system uses heterogeneous sources of data. That data is stored
initially in corporate relational databases or legacy databases, or it may come from information
systems outside the corporate walls.
2. Data staging: The data stored in the sources should be extracted, cleansed to remove
inconsistencies and fill gaps, and integrated to merge heterogeneous sources into one standard
schema. The so-called Extraction, Transformation, and Loading (ETL) tools can combine
heterogeneous schemata, extract, transform, cleanse, validate, filter, and load source data into a
data warehouse (a small SQL sketch of this step follows this list).
3. Data Warehouse layer: Information is saved to one logically centralized individual repository: a
data warehouse. The data warehouses can be directly accessed, but it can also be used as a source
for creating data marts, which partially replicate data warehouse contents and are designed for
specific enterprise departments. Meta-data repositories store information on sources, access
procedures, data staging, users, data mart schema, and so on.

4. Analysis: In this layer, integrated data is efficiently and flexibly accessed to issue reports,
dynamically analyze information, and simulate hypothetical business scenarios. It should feature
aggregate information navigators, complex query optimizers, and customer-friendly GUIs.
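As a rough illustration of the data staging step above (the staging table name, cleansing rules, and target columns are assumptions, not a fixed recipe), an ETL load into the warehouse fact table might look like:

    -- Load cleansed, validated rows from the staging area into the warehouse fact table
    INSERT INTO sales (time_key, item_key, branch_key, location_key, rupees_sold, units_sold)
    SELECT time_key,
           item_key,
           branch_key,
           location_key,
           COALESCE(rupees_sold, 0),      -- fill gaps in the measure
           units_sold
    FROM staging_sales
    WHERE units_sold IS NOT NULL           -- filter out incomplete records
      AND units_sold >= 0;                 -- simple validation rule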
Three-Tier Architecture
The three-tier architecture consists of the source layer (containing multiple source systems), the reconciled
layer, and the data warehouse layer (containing both data warehouses and data marts). The reconciled layer
sits between the source data and the data warehouse.
The main advantage of the reconciled layer is that it creates a standard reference data model for a whole
enterprise. At the same time, it separates the problems of source data extraction and integration from those
of data warehouse population. In some cases, the reconciled layer is also used directly to better
accomplish some operational tasks, such as producing daily reports that cannot be satisfactorily prepared using
the corporate applications, or generating data flows to feed external processes periodically in order to benefit from
cleaning and integration.
This architecture is especially useful for extensive, enterprise-wide systems. A disadvantage of this
structure is the extra storage space used by the redundant reconciled layer. It also puts
the analytical tools a little further away from being real-time.

Data Warehouse Process Architecture


The process architecture defines an architecture in which the data from the data warehouse is processed
for a particular computation.
Following are the two fundamental process architectures:

Centralized Process Architecture


In this architecture, the data is collected into single centralized storage and processed upon completion by
a single machine with a huge structure in terms of memory, processor, and storage.
Centralized process architecture evolved with transaction processing and is well suited for small
organizations with one location of service.
It requires minimal resources both from people and system perspectives.
It is very successful when the collection and consumption of data occur at the same location.

Distributed Process Architecture


In this architecture, information and its processing are distributed across data centers; the processing of
data is localized, and the results are gathered into centralized storage. Distributed architectures are used
to overcome the limitations of the centralized process architecture, where all the information needs to be
collected in one central location and results are available in one central location.
There are several architectures of the distributed process:
Client-Server
In this architecture, the user does all the information collecting and presentation, while the server does the
processing and management of data.
Three-tier Architecture
With client-server architecture, the client machines need to be connected to a server machine, thus
mandating finite states and introducing latencies and overhead in terms of the records to be carried between
clients and servers.
N-tier Architecture
The n-tier or multi-tier architecture is where clients, middleware, applications, and servers are isolated
into tiers.
Cluster Architecture
In this architecture, machines are connected in a network architecture (software or hardware) to
work together to process information or compute requirements in parallel. Each device in
a cluster is assigned a function that is processed locally, and the result sets are collected by a master
server that returns them to the user.
Peer-to-Peer Architecture
This is a type of architecture where there are no dedicated servers and clients. Instead, all the processing
responsibilities are allocated among all machines, called peers. Each machine can perform the function of
a client or server or just process data.

What is OLAP (Online Analytical Processing)?


OLAP stands for On-Line Analytical Processing. OLAP is a category of software technology
which authorizes analysts, managers, and executives to gain insight into information through fast,
consistent, interactive access to a wide variety of possible views of data that has been transformed from
raw information to reflect the real dimensionality of the enterprise as understood by the clients.
OLAP implements multidimensional analysis of business information and supports the capability for
complex estimations, trend analysis, and sophisticated data modeling. It is rapidly becoming the essential
foundation for intelligent solutions including Business Performance Management, Planning, Budgeting,
Forecasting, Financial Reporting, Analysis, Simulation Models, Knowledge Discovery, and Data
Warehouse Reporting. OLAP enables end-clients to perform ad hoc analysis of data in multiple
dimensions, providing the insight and understanding they require for better decision making.

OLTP?
OLTP stands for Online Transaction Processing, and its primary objective is the processing of data. An
OLTP system administers the day-to-day transactions of data under a 3-tier architecture (usually 3NF).
Each of these transactions involves individual records made up of multiple fields. The main emphasis of
OLTP is fast query processing and data integrity in multi-access environments. Some OLTP examples
are credit card activity, order entry, and ATM transactions.
OLTP Example
The ATM centre is an example of an OLTP system. Assume that a couple has a joint bank account. One
day, they arrive at different ATMs simultaneously and want to withdraw the whole amount from their
bank accounts.
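A hedged sketch of what each ATM request might look like as a short OLTP transaction (table and column names are hypothetical); the database's concurrency control guarantees that the balance cannot be withdrawn twice:

    -- One ATM withdrawal as a short, atomic OLTP transaction
    BEGIN;
    UPDATE account
    SET balance = balance - 5000
    WHERE account_no = 'JOINT-001'
      AND balance >= 5000;   -- withdraw only if sufficient funds remain
    COMMIT;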
OLTP vs OLAP: Differences
The main difference between OLTP vs OLAP is that OLTP is operational, whereas OLAP is
informational.

Fig: OLTP vs OLAP

Here is a list of OLTP vs OLAP's top 15 key features that illustrate both their differences and
how they need to work together.

Main characteristics: OLTP handles a large number of small transactions on a day-to-day basis. OLAP handles large volumes of data in multiple databases.
Data source: OLTP works on transactions. OLAP works on OLTP databases and other sources.
Purpose: OLTP supports essential business operations in real-time. OLAP discovers hidden insights and supports business intelligence decisions.
Response time: OLTP responds in milliseconds. OLAP responds in seconds to hours (depending on the amount of data to be processed).
Query type: OLTP uses simple queries. OLAP uses complex queries.
Database design: OLTP uses a normalized database for efficiency. OLAP uses a denormalized database for analysis.
Audience: OLTP is customer-oriented. OLAP is market-oriented.
Domain: OLTP is industry-specific (manufacturing, finance, etc.). OLAP is subject-specific (sales, marketing, etc.).
Performance metric: OLTP is measured by transaction throughput. OLAP is measured by query throughput.
Challenge: For OLTP, data warehouses can be expensive to build. For OLAP, strong technical knowledge and experience are required.
Design: OLTP is designed for fast processing and low redundancy. OLAP is designed to integrate different data sources into a consolidated database.
Operations: OLTP uses INSERT, DELETE, and UPDATE commands. OLAP mainly uses the SELECT command.
Updates: OLTP has short and fast updates. OLAP updates are scheduled and done periodically.
No. of users: OLTP allows thousands of users at a time. OLAP allows only a few users at a time.
Space requirements: OLTP needs very little space (if data is archived periodically). OLAP needs very large space.

Types of OLAP
There are three main types of OLAP servers, as follows:

ROLAP stands for Relational OLAP, an application based on relational DBMSs.


MOLAP stands for Multidimensional OLAP, an application based on multidimensional DBMSs.
HOLAP stands for Hybrid OLAP, an application using both relational and multidimensional techniques.
Relational OLAP (ROLAP) Server
These are intermediate servers which stand in between a relational back-end server and user frontend
tools.
They use a relational or extended-relational DBMS to store and manage warehouse data, and OLAP
middleware to provide the missing pieces.
ROLAP servers contain optimization for each DBMS back end, implementation of aggregation navigation
logic, and additional tools and services.
ROLAP technology tends to have higher scalability than MOLAP technology.
ROLAP systems work primarily from the data that resides in a relational database, where the base data
and dimension tables are stored as relational tables. This model permits the multidimensional analysis of
data.
This technique relies on manipulating the data stored in the relational database to give the appearance of
traditional OLAP slicing and dicing functionality. In essence, each slicing and dicing operation is
equivalent to adding a "WHERE" clause to the SQL statement.
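As a hypothetical illustration (the sales_fact table, its columns, and the helper function are invented, and
no particular ROLAP product is implied), a slice or dice request can be translated into SQL simply by
appending WHERE predicates:

```python
def slice_dice_sql(measure: str, group_by_dims: list, filters: dict) -> str:
    """Build the kind of SQL a ROLAP engine might generate for a slice/dice request.

    Every dimension value the user slices or dices on becomes one more
    predicate in the WHERE clause of the generated statement.
    """
    predicates = " AND ".join(f"{column} = '{value}'" for column, value in filters.items())
    dims = ", ".join(group_by_dims)
    return (
        f"SELECT {dims}, SUM({measure}) AS total "
        f"FROM sales_fact WHERE {predicates} GROUP BY {dims}"
    )

# Slice: total sales in 2017, broken down by region.
print(slice_dice_sql("amount", ["region"], {"year": "2017"}))

# Dice: sales of blue beach balls in Iowa in 2017.
print(slice_dice_sql("amount", ["product"],
                     {"year": "2017", "state": "Iowa", "color": "blue"}))
```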
Relational OLAP Architecture
ROLAP Architecture includes the following components

o Database server.
o ROLAP server.
o Front-end tool.

Relational OLAP (ROLAP) has been one of the fastest-growing OLAP technology segments in the market.
This method allows multiple multidimensional views to be created over two-dimensional relational tables,
without requiring the data to be physically restructured around each desired view.
Some products in this segment include sophisticated SQL generation engines to handle the complexity of
multidimensional analysis. This includes creating multiple SQL statements to handle a single user request,
being RDBMS-aware, and generating SQL statements tuned to the optimizer of the underlying DBMS
engine.
Advantages
Can handle large amounts of information: The data size limitation of ROLAP technology depends
on the data size of the underlying RDBMS. So, ROLAP itself does not restrict the data amount.
Leverages existing RDBMS features: The RDBMS already comes with a lot of features, so
ROLAP technologies (which work on top of the RDBMS) can exploit these functionalities.

Disadvantages
Performance can be slow: Each ROLAP report is a SQL query (or multiple SQL queries) in the relational
database, the query time can be prolonged if the underlying data size is large.
Limited by SQL functionalities: ROLAP technology relies upon generating SQL statements to query
the relational database, and SQL statements do not suit all needs.

Multidimensional OLAP (MOLAP) Server


A MOLAP system is based on a native logical model that directly supports multidimensional data and
operations. Data are stored physically into multidimensional arrays, and positional techniques are used to
access them.
One of the significant distinctions of MOLAP from ROLAP is that data are summarized and stored
in an optimized format in a multidimensional cube, instead of in a relational database. In the MOLAP
model, data are structured into proprietary cube formats according to the client's reporting requirements,
with the calculations pre-generated on the cubes.

MOLAP Architecture
MOLAP Architecture includes the following components

o Database server.
o MOLAP server.
o Front-end tool.

MOLAP structure primarily reads the precompiled data. MOLAP structure has limited capabilities to
dynamically create aggregations or to evaluate results which have not been pre-calculated and stored.
Applications requiring iterative and comprehensive time-series analysis of trends are well suited for
MOLAP technology (e.g., financial analysis and budgeting).
Examples include Arbor Software's Essbase, Oracle's Express Server, Pilot Software's Lightship Server,
Sniper's TM/1, Planning Science's Gentium, and Kenan Technology's Multiway.
Some of the problems faced by clients relate to maintaining support for multiple subject areas in an
RDBMS. Some vendors address these problems by providing access from MOLAP tools to detailed
data in an RDBMS.
This can be very useful for organizations with performance-sensitive multidimensional analysis
requirements and that have built or are in the process of building a data warehouse architecture that
contains multiple subject areas.
An example would be the creation of sales data measured by several dimensions (e.g., product and sales
region) to be stored and maintained in a persistent structure. This structure would be provided to reduce
the application overhead of performing calculations and building aggregation during initialization. These
structures can be automatically refreshed at predetermined intervals established by an administrator.
Advantages
Excellent Performance: A MOLAP cube is built for fast information retrieval, and is optimal for slicing
and dicing operations.
Can perform complex calculations: All calculations are pre-generated when the cube is created.
Hence, complex calculations are not only possible, but they return quickly.
Disadvantages
Limited in the amount of information it can handle: Because all calculations are performed when the
cube is built, it is not possible to contain a large amount of data in the cube itself.
Requires additional investment: Cube technology is generally proprietary and does not already exist in
the organization. Therefore, to adopt MOLAP technology, chances are other investments in human and
capital resources are needed.
Hybrid OLAP (HOLAP) Server
HOLAP incorporates the best features of MOLAP and ROLAP into a single architecture. HOLAP
systems store the more substantial quantities of detailed data in relational tables, while the aggregations
are stored in pre-calculated cubes. HOLAP can also drill through from the cube down to the relational
tables for detailed data. Microsoft SQL Server 2000, for example, provides a hybrid OLAP server.

Advantages of HOLAP
1. HOLAP provides the benefits of both MOLAP and ROLAP.
2. It provides fast access at all levels of aggregation.
3. HOLAP balances the disk space requirement, as it only stores the aggregate information on the
OLAP server and the detail record remains in the relational database. So no duplicate copy of the
detail record is maintained.
Disadvantages of HOLAP
1. HOLAP architecture is very complicated because it supports both MOLAP and ROLAP servers.

Other Types
There are also some less common OLAP styles that one may encounter from time to time. A few of them
are listed below.
Web-Enabled OLAP (WOLAP) Server
WOLAP pertains to OLAP application which is accessible via the web browser. Unlike traditional
client/server OLAP applications, WOLAP is considered to have a three-tiered architecture which consists
of three components: a client, a middleware, and a database server.
Desktop OLAP (DOLAP) Server
DOLAP permits a user to download a section of the data from the database or source, and work with that
dataset locally, or on their desktop.
Mobile OLAP Server
Mobile OLAP enables users to access and work on OLAP data and applications remotely through the use
of their mobile devices.
Spatial OLAP (SOLAP) Server
SOLAP includes the capabilities of both Geographic Information Systems (GIS) and OLAP into a single
user interface. It facilitates the management of both spatial and non-spatial data.

Three-Tier Data Warehouse Architecture


Data Warehouses usually have a three-level (tier) architecture that includes:
1. Bottom Tier (Data Warehouse Server)
2. Middle Tier (OLAP Server)
3. Top Tier (Front end Tools).
A bottom-tier that consists of the Data Warehouse server, which is almost always an RDBMS. It may
include several specialized data marts and a metadata repository.
Data from operational databases and external sources (such as user profile data provided by external
consultants) are extracted using application program interfaces called gateways. A gateway is provided
by the underlying DBMS and allows client programs to generate SQL code to be executed at a server.
Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking and
Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity).

A middle-tier which consists of an OLAP server for fast querying of the data warehouse.
The OLAP server is implemented using either
(1) A Relational OLAP (ROLAP) model, i.e., an extended relational DBMS that maps functions on
multidimensional data to standard relational operations.
(2) A Multidimensional OLAP (MOLAP) model, i.e., a particular purpose server that directly
implements multidimensional information and operations.
A top-tier that contains front-end tools for displaying results provided by OLAP, as well as additional
tools for data mining of the OLAP-generated data.
The overall Data Warehouse Architecture is shown in fig:
The metadata repository stores information that defines DW objects. It includes the following
parameters and information for the middle and the top-tier applications:

1. A description of the DW structure, including the warehouse schema, dimension, hierarchies, data
mart locations, and contents, etc.
2. Operational metadata, which usually describes the currency level of the stored data, i.e., active,
archived or purged, and warehouse monitoring information, i.e., usage statistics, error reports,
audit, etc.
3. System performance data, which includes indices, used to improve data access and retrieval
performance.
4. Information about the mapping from operational databases, which provides source RDBMSs and
their contents, cleaning and transformation rules, etc.
5. Algorithms used for summarization, predefined queries and reports, and business metadata, which
include business terms and definitions, data ownership information, etc.
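
As a simple illustration (all names and field values below are invented), one entry of a metadata repository
describing a single warehouse table could be recorded along these lines:

```python
# Illustrative metadata-repository entry for one warehouse object.
sales_fact_metadata = {
    "warehouse_object": "sales_fact",
    "schema": {
        "dimensions": ["date_key", "product_key", "store_key"],
        "measures": ["units_sold", "sales_amount"],
    },
    "source_mapping": {                       # mapping from operational databases
        "source_system": "pos_oltp",
        "source_tables": ["pos_transactions", "pos_items"],
        "transformation_rules": ["convert currency to USD",
                                 "drop voided transactions"],
    },
    "operational": {                          # currency level and monitoring info
        "currency": "active",
        "last_refresh": "2024-01-31T02:00:00",
    },
    "performance": {"indexes": ["idx_sales_date", "idx_sales_product"]},
    "business": {                             # business terms and ownership
        "owner": "Sales Analytics team",
        "definition": "One row per product, per store, per day",
    },
}
print(sales_fact_metadata["source_mapping"]["source_system"])
```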

Principles of Data Warehousing

Load Performance
Data warehouses require the incremental loading of new data on a periodic basis within narrow time
windows; performance of the load process should be measured in hundreds of millions of rows and
gigabytes per hour, and must not artificially constrain the volume of data the business requires.
Load Processing
Many steps are required to load new or updated data into the data warehouse, including data conversion,
filtering, reformatting, indexing, and metadata update.
Data Quality Management
Fact-based management demands the highest data quality. The warehouse ensures local consistency,
global consistency, and referential integrity despite "dirty" sources and massive database size.
Query Performance
Fact-based management must not be slowed by the performance of the data warehouse RDBMS; large,
complex queries must complete in seconds, not days.
Terabyte Scalability
Data warehouse sizes are growing at astonishing rates. Today they range from a few gigabytes to
hundreds of gigabytes, and terabyte-sized data warehouses are becoming common.

Distributed Data Warehouses


The concept of a distributed data warehouse involves two kinds of components: local data warehouses,
which are distributed throughout the enterprise, and a global data warehouse, as shown in the figure.

Characteristics of Local data warehouses


o Activity appears at the local level
o Bulk of the operational processing
o Local site is autonomous
o Each local data warehouse has its unique architecture and contents of data
o The data is unique and of prime importance to that locality only
o The majority of the data is local and not replicated
o Any intersection of data between local data warehouses is circumstantial
o Local warehouse serves different technical communities
o The scope of the local data warehouses is finite to the local site
o Local warehouses also include historical data and are integrated only within the local site.
Virtual Data Warehouses
A Virtual Data Warehouse is created in the following stages:
1. Installing a set of data access, data dictionary, and process management facilities.
2. Training end-clients.
3. Monitoring how the DW facilities are used.
4. Based upon actual usage, physically creating a Data Warehouse to provide the high-frequency
results.
This strategy means that end users are allowed to access operational databases directly, using whatever
tools are enabled for the data access network. This method provides ultimate flexibility as well as
the minimum amount of redundant information that must be loaded and maintained. The data warehouse
is a great idea, but it is difficult to build and requires investment. A cheaper and faster alternative is to
eliminate the transformation phase and the separate repositories for metadata and data.
This method is termed the 'virtual data warehouse.'
To accomplish this, there is a need to define four kinds of data:
1. A data dictionary including the definitions of the various databases.
2. A description of the relationship between the data components.
3. The description of the method user will interface with the system.
4. The algorithms and business rules that describe what to do and how to do it.
Disadvantages
1. Since queries compete with production transactions for resources, performance can be degraded.
2. There is no metadata, no summary data, and no individual DSS (Decision Support System)
integration or history. All queries must be repeated, causing an additional burden on the system.
3. There is no refreshing process, which can cause the queries to be very complex.

Data Warehouse Managers
System management is mandatory for the successful implementation of a data warehouse. The most
important system managers are −
• System configuration manager
• System scheduling manager
• System event manager
• System database manager
• System backup recovery manager
System Configuration Manager
• The system configuration manager is responsible for the management of the setup and
configuration of data warehouse.
• The structure of the configuration manager varies from one operating system to another.
• In Unix, the structure of the configuration manager varies from vendor to vendor.
• Configuration managers have a single user interface.
• The interface of configuration manager allows us to control all aspects of the system.
Note − The most important configuration tool is the I/O manager.
System Scheduling Manager
System Scheduling Manager is responsible for the successful implementation of the data warehouse. Its
purpose is to schedule ad hoc queries. Every operating system has its own scheduler with some form of
batch control mechanism. The list of features a system scheduling manager must have is as follows −
• Work across cluster or MPP boundaries
• Deal with international time differences
• Handle job failure
• Handle multiple queries
• Support job priorities
• Restart or re-queue the failed jobs
• Notify the user or a process when job is completed
• Maintain the job schedules across system outages
• Re-queue jobs to other queues
• Support the stopping and starting of queues
• Log Queued jobs
• Deal with inter-queue processing
Note − The above list can be used as evaluation parameters for the evaluation of a good scheduler.
Some important jobs that a scheduler must be able to handle are as follows −
• Daily and ad hoc query scheduling
• Execution of regular report requirements
• Data load
• Data processing
• Index creation
• Backup
• Aggregation creation

• Data transformation
Note − If the data warehouse is running on a cluster or MPP architecture, then the system scheduling
manager must be capable of running across the architecture.
System Event Manager
The event manager is a kind of software that manages the events defined on
the data warehouse system. We cannot manage the data warehouse manually because the structure of a
data warehouse is very complex. Therefore we need a tool that automatically handles all the events
without any intervention from the user.
Note − The Event manager monitors the events occurrences and deals with them. The event manager
also tracks the myriad of things that can go wrong on this complex data warehouse system.
Events
Events are the actions that are generated by the user or the system itself. It may be noted that an event is
a measurable, observable occurrence of a defined action.
Given below is a list of common events that are required to be tracked.
• Hardware failure
• Running out of space on certain key disks
• A process dying
• A process returning an error
• CPU usage exceeding an 80% threshold
• Internal contention on database serialization points
• Buffer cache hit ratios exceeding or falling below a threshold
• A table reaching its maximum size
• Excessive memory swapping
• A table failing to extend due to lack of space
• Disk exhibiting I/O bottlenecks
• Usage of temporary or sort area reaching a certain threshold
• Any other database shared memory usage
The most important thing about events is that they should be capable of being handled on their own. Event
packages define the procedures for the predefined events. The code associated with each event is known
as the event handler. This code is executed whenever an event occurs.
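
A minimal sketch of an event handler in Python (the threshold, path, and corrective action are assumptions,
not part of the original text) for the "running out of space on a key disk" event:

```python
import shutil

DISK_USAGE_THRESHOLD = 0.90   # assumed threshold: fire the event at 90% full

def handle_disk_space_event(path: str = ".") -> None:
    """Event handler: runs automatically when disk usage is monitored."""
    usage = shutil.disk_usage(path)          # total, used, free (in bytes)
    used_fraction = usage.used / usage.total
    if used_fraction >= DISK_USAGE_THRESHOLD:
        # A real event manager would page an operator, purge old temporary
        # tables, or trigger an archive job; here we only report the event.
        print(f"EVENT: {path} is {used_fraction:.0%} full - corrective action needed")

handle_disk_space_event()                    # point `path` at the warehouse data directory
```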
System and Database Manager

System and database manager may be two separate pieces of software, but they do the same job. The
objective of these tools is to automate certain processes and to simplify the execution of others. The
criteria for choosing a system and the database manager are as follows −

• increase user's quota.


• assign and de-assign roles to the users
• assign and de-assign the profiles to the users
• perform database space management
• monitor and report on space usage
• tidy up fragmented and unused space
• add and expand the space
• add and remove users
• manage user password
• manage summary or temporary tables
• assign or deassign temporary space to and from the user
• reclaim the space from old or out-of-date temporary tables
• manage error and trace logs
• to browse log and trace files
• redirect error or trace information
• switch on and off error and trace logging
• perform system space management

• monitor and report on space usage
• clean up old and unused file directories
• add or expand space.
System Backup Recovery Manager
The backup and recovery tool makes it easy for operations and management staff to back-up the data.
Note that the system backup manager must be integrated with the schedule manager software being
used. The important features that are required for the management of backups are as follows −
• Scheduling
• Backup data tracking
• Database awareness
Backups are taken only to protect against data loss. Following are the important points to remember −
• The backup software will keep some form of database of where and when the piece of data was
backed up.
• The backup recovery manager must have a good front-end to that database.
• The backup recovery software should be database aware.
• Being aware of the database, the software then can be addressed in database terms, and will not
perform backups that would not be viable.
Process managers are responsible for maintaining the flow of data both into and out of the data
warehouse. There are three different types of process managers −
• Load manager
• Warehouse manager
• Query manager
Data Warehouse Load Manager
Load manager performs the operations required to extract and load the data into the database. The size
and complexity of a load manager varies between specific solutions from one data warehouse to
another.
Load Manager Architecture
The load manager performs the following functions −
• Extract data from the source system.
• Fast load the extracted data into a temporary data store.
• Perform simple transformations into structure similar to the one in the data warehouse.

Extract Data from Source


The data is extracted from the operational databases or the external information providers. Gateways are
the application programs that are used to extract data. A gateway is supported by the underlying DBMS
and allows a client program to generate SQL to be executed at a server. Open Database Connectivity
(ODBC) and Java Database Connectivity (JDBC) are examples of gateways.
Fast Load
• In order to minimize the total load window, the data needs to be loaded into the warehouse in the
fastest possible time.
• Transformations affect the speed of data processing.
• It is more effective to load the data into a relational database prior to applying transformations and
checks.
• Gateway technology is not suitable for this step, since it is inefficient when large data volumes are
involved.
Simple Transformations

While loading, it may be required to perform simple transformations. After completing the simple
transformations, we can do complex checks. Suppose we are loading EPOS sales transactions; we
need to perform the following checks −

• Strip out all the columns that are not required within the warehouse.
• Convert all the values to required data types.
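
A minimal sketch of these two simple transformations using pandas (the library choice and the EPOS
column names are assumptions made only for illustration):

```python
import pandas as pd

# A hypothetical EPOS extract sitting in the temporary data store.
epos = pd.DataFrame({
    "till_id":      ["T1", "T2"],
    "sale_date":    ["2024-01-05", "2024-01-06"],
    "amount":       ["19.99", "5.50"],          # loaded as text by the extract
    "cashier_note": ["none", "refund queried"], # not required in the warehouse
})

# 1. Strip out the columns that are not required within the warehouse.
epos = epos.drop(columns=["cashier_note"])

# 2. Convert all the values to the required data types.
epos["sale_date"] = pd.to_datetime(epos["sale_date"])
epos["amount"] = epos["amount"].astype(float)

print(epos.dtypes)
```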
Warehouse Manager

The warehouse manager is responsible for the warehouse management process. It consists of third-party
system software, C programs, and shell scripts. The size and complexity of a warehouse manager varies
between specific solutions.

Warehouse Manager Architecture


A warehouse manager includes the following −
• The controlling process
• Stored procedures or C with SQL
• Backup/Recovery tool
• SQL scripts

Functions of Warehouse Manager


A warehouse manager performs the following functions −
• Analyzes the data to perform consistency and referential integrity checks.
• Creates indexes, business views, partition views against the base data.
• Generates new aggregations and updates the existing aggregations.
• Generates normalizations.
• Transforms and merges the source data of the temporary store into the published data warehouse.
• Backs up the data in the data warehouse.
• Archives the data that has reached the end of its captured life.

Note − A warehouse Manager analyzes query profiles to determine whether the index and aggregations
are appropriate.

Query Manager

The query manager is responsible for directing the queries to suitable tables. By directing the queries to
appropriate tables, it speeds up the query request and response process. In addition, the query manager
is responsible for scheduling the execution of the queries posted by the user.

Query Manager Architecture


A query manager includes the following components −
• Query redirection via C tool or RDBMS
• Stored procedures
• Query management tool
• Query scheduling via C tool or RDBMS

• Query scheduling via third-party software

Functions of Query Manager


• It presents the data to the user in a form they understand.
• It schedules the execution of the queries posted by the end-user.
• It stores query profiles to allow the warehouse manager to determine which indexes and
aggregations are appropriate.

Consolidation Warehouse Definition
A consolidation warehouse is a logistics facility that receives individual orders from other
centers, suppliers, or even multiple customers (as in the case of third-party logistics (3PL)
providers) and groups them into larger orders to facilitate their transportation.
The consolidation process carried out in these logistics centers consists of organizing and
sorting orders before shipping them. This operation can be simple when orders comprise full
pallets, for example consolidating 33 pallets on a single truck. Alternatively, when boxes are the
unit load and order volumes are high, the process can be more complex.
One of the main objectives of load consolidation is to lower the transportation costs for each
order. Combining the orders of multiple customers in a single shipment makes it possible
to divide the delivery costs and dispatch full truckloads. Reducing transportation costs is
especially opportune in organizations with ecommerce sales channels, where users are
accustomed to free or low-cost shipping.
The rise of ecommerce and B2C channels has led to the spread of consolidation warehouses,
which are very common in B2B sectors. For example, more and more 3PL providers have
consolidation warehouses near their distribution centres to efficiently group their customers'
orders by dispatch process or shipping route.

Data storage and indexing


Data is typically stored in a data warehouse through an extract, transform and load (ETL) process. The
information is extracted from the source, transformed into high-quality data and then loaded into the
warehouse. Businesses perform this process on a regular basis to keep data updated and prepared for the
next step.
When an organization is ready to use its data for analytics or reporting, the focus shifts from data
warehousing to business intelligence (BI) tools. BI technologies like visual analytics and data
exploration help organizations glean important insights from their business data. On the back end, it’s
important to understand how the data warehouse architecture organizes data and how the database
execution model optimizes queries – so developers can write data applications with reasonably high
performance.
In addition to a traditional data warehouse and ETL process, many organizations use a variety of other
methods, tools and techniques for their workloads. For example:
• Data pipelines can be used to populate cloud data warehouses, which can be fully managed by
the organization or by the cloud provider.
• Continuously streaming data can be stored in a cloud data warehouse.

• A centralized data catalog is helpful in uniting metadata, making it easier to find data and track
its lineage.
• Data warehouse automation tools get new data into warehouses faster.
• Data virtualization solutions create a logical data warehouse so users can view the data from
their choice of tools.
• Online analytical processing (OLAP) is a way of representing data that has been summarized
into multidimensional views and hierarchies. When used with an integrated ETL process, it
allows business users to get reports without IT assistance.
• An operational data store (ODS) holds a subset of near-real-time data that's used for operational
reporting or notifications.

Types of Data Stored in a Data Warehouse

The term data warehouse is used to distinguish a database that is used for business analysis
(OLAP) rather than transaction processing (OLTP). While an OLTP database contains current
low-level data and is typically optimized for the selection and retrieval of records, a data
warehouse typically contains aggregated historical data and is optimized for particular types of
analyses, depending upon the client applications.

The contents of your data warehouse depend on the requirements of your users. They should be
able to tell you what type of data they want to view and at what levels of aggregation they want
to be able to view it.

Your data warehouse will store these types of data:

• Historical data
• Derived data
• Metadata

These types of data are discussed individually.

Historical Data--A data warehouse typically contains several years of historical data. The
amount of data that you decide to make available depends on available disk space and the types
of analysis that you want to support. This data can come from your transactional database
archives or other sources. Some applications might perform analyses that require data at lower
levels than users typically view it. You will need to check with the application builder or the
application's documentation for those types of data requirements.

Derived Data--Derived data is generated from existing data using a mathematical operation or a
data transformation. It can be created as part of a database maintenance operation or generated at
run-time in response to a query.

Metadata--Metadata is data that describes the data and schema objects, and is used by
applications to fetch and compute the data correctly. OLAP Catalog metadata is designed
specifically for use with Oracle OLAP. It is required by the Java-based Oracle OLAP API, and
can also be used by SQL-based applications to query the database.

Data warehousing is the process of collecting, storing, and analyzing data from various sources
for business intelligence and decision making. One of the key aspects of data warehousing is
ensuring that the data is properly indexed, which means creating and maintaining structures that
allow fast and efficient access to the data. Indexing can improve the performance of queries,
reports, and analytics, as well as reduce the storage space and maintenance costs of the data
warehouse. In this section, you will learn how to properly index data during data warehousing,
including the types, benefits, and challenges of indexing, and some best practices and tips to
follow.

Types of indexes
When it comes to data warehousing, there are various types of indexes that can be used, depending on
the database system, the data model, and the query requirements. Primary key indexes ensure
uniqueness and identity of each row in a table, based on one or more columns that form the primary key.
These indexes are essential for data integrity and referential integrity, as well as for quickly looking up
individual rows by their key values. Foreign key indexes support relationships between tables, based on
one or more columns that reference the primary key of another table. These indexes are important for
enforcing referential integrity and cascading updates and deletes, as well as facilitating join operations
between tables. Bitmap indexes store values of a column as a series of bits, with each bit representing
the presence or absence of a value in a row. They are efficient for columns with low cardinality and can
be combined using logical operations. B-tree indexes organize values of a column into a balanced tree
structure, where each node has a range of values and a pointer to its children nodes. B-tree indexes are
suitable for columns with high cardinality and can support range queries.
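
The B-tree case can be sketched with Python's built-in sqlite3 module, whose ordinary indexes are
B-trees; the table and column names below are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_fact ("
    "  sale_id INTEGER PRIMARY KEY,"      # primary key: uniqueness plus fast key lookup
    "  customer_id INTEGER,"
    "  sale_date TEXT,"
    "  amount REAL)"
)

# A B-tree index on a high-cardinality column that is used in range queries.
conn.execute("CREATE INDEX idx_sales_date ON sales_fact (sale_date)")

# EXPLAIN QUERY PLAN shows whether the index is used instead of a full table scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT SUM(amount) FROM sales_fact "
    "WHERE sale_date BETWEEN '2024-01-01' AND '2024-01-31'"
).fetchall()
print(plan)   # the plan should mention idx_sales_date rather than a full scan
conn.close()
```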

Benefits of indexing

Indexing can provide several benefits for data warehousing, such as faster query execution, reduced
storage space, and easier maintenance. Indexes can help the database system to locate the relevant rows
for a query without scanning the entire table, which can save time and resources. Additionally, they can
help to compress the data and eliminate duplicates, reducing the storage space required for the data
warehouse. Bitmap indexes and B-tree indexes are two examples of how this can be achieved.
Furthermore, indexes can help to maintain the consistency and quality of the data by enforcing
constraints and rules of the data model. They can also assist in updating and deleting the data by
cascading changes to related tables and indexes.

Personal Experience: In a large-scale data warehousing project, we had performance issues with real-
time analytics. After implementing selective indexing on frequently queried columns, the
performance improved dramatically. Query times were reduced, and the overall efficiency of the
system increased. Proper indexing can significantly speed up data retrieval and improve user
satisfaction.

Aggregation tasks benefit from the organized structure of indexes, resulting in efficient computation
of summary statistics. Without indexes, the database might need to perform full table scans to locate
desired data. Indexes reduce the number of input/output (I/O) operations needed to retrieve data.

Challenges of indexing
Indexing can also present some challenges when it comes to data warehousing, such as increased
complexity and the need for careful planning, testing, and tuning. Furthermore, indexes can affect the
performance of other operations like data loading, transformation, and cleansing. Trade-offs
between the benefits and costs of different types and configurations of indexes must also be taken
into consideration, depending on the data characteristics, query patterns, and business objectives. It's
important to note that indexes may change over time as the data and requirements evolve, thus requiring
periodic reevaluation and adjustment.

Insight: While indexing can offer significant benefits, it also comes with challenges. Over-indexing
can lead to increased storage requirements and slower write operations. In one of my experiences,
excessive indexes on a transactional table significantly slowed down the data ingestion process. It's
crucial to strike a balance and continuously monitor the impact of indexes on both read and write
operations.

Transformations that involve significant changes to indexed columns may require careful
consideration and potential adjustments to indexes.

Best practices and tips


To properly index data during data warehousing, follow a few steps. First, analyze the data and the
queries to understand the structure, volume, distribution, and quality of the data, as well as the frequency,
complexity, and purpose of the queries. This will help you decide on the types and number of indexes to create.
Additionally, consider the trade-offs between query performance and data loading performance, as well
as storage space and maintenance costs. Then choose the appropriate types of indexes that match the
data characteristics and query requirements. Avoid creating unnecessary or redundant indexes that may
waste space and resources. Utilize features and options of your database system to optimize the indexes
such as compression, partitioning, clustering, and parallelism. Lastly, test and tune the indexes to ensure
they are effective and efficient for your data warehouse. Use tools and metrics provided by your
database system to measure and compare performance of the indexes such as execution plans, statistics,
and histograms. Monitor and update the indexes regularly to reflect changes in data and queries.

Consolidation process stages
Warehouse consolidation tasks are carried out in specific areas of the facility and can be broken down
into the following phases:

• Product receipt and sorting. At the warehouse docking areas, products and raw materials are
received from different suppliers or customers, documentation is checked, and items are sorted
in a temporary storage zone.
• Stock storage. Once the goods are entered in the warehouse management software, the
operators store the products according to the criteria and rules established in advance by the
logistics manager.
• Product handling. To facilitate its distribution, the stock has to go through the weighing and
packing processes, among others.
• Order grouping. The goods are combined in a single batch so that they can be dispatched as a
grouped shipment. This phase also includes the preparation of the pertinent documentation
detailing, e.g., the orders and unit loads that make up the consolidated load.
• Goods dispatch. Once the shipment is consolidated, the stock is loaded onto the truck
(mechanically or automatically) to be delivered to the next recipient in the supply chain or to
the end customer.

Materialization in a data warehouse is the process of storing the results of a query in a database
object, called a materialized view, to improve performance and efficiency. Materialized views are often
used to precompute and store aggregated data, such as the sum of sales, or to precompute joins.
Here are some benefits of materialized views:
• Low maintenance: In many systems, data refreshes are synchronized automatically with data changes
in the base tables.
• Freshness: Depending on the refresh strategy, materialized views can be kept consistent, or nearly
consistent, with the base tables.
• Reduced overhead: Materialized views eliminate the overhead associated with expensive joins
or aggregations for a large or important class of queries.
Materialized views can be especially beneficial when you need to regularly read a subset of data from a
large dataset. Retrieving the data afresh each time could be time-consuming because you would have to
run the query on the full set of data every time.
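
For a runnable illustration, SQLite (which has no CREATE MATERIALIZED VIEW statement, unlike
engines such as Oracle or PostgreSQL) can approximate a materialized view with a precomputed summary
table plus an explicit refresh step; the table names and data below are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("East", 100.0), ("East", 250.0), ("West", 75.0)])

def refresh_sales_summary(conn: sqlite3.Connection) -> None:
    """Rebuild the precomputed aggregate, playing the role of a materialized-view refresh."""
    conn.execute("DROP TABLE IF EXISTS sales_summary")
    conn.execute(
        "CREATE TABLE sales_summary AS "
        "SELECT region, SUM(amount) AS total_sales FROM sales GROUP BY region"
    )
    conn.commit()

refresh_sales_summary(conn)
# Queries now read the small precomputed table instead of re-aggregating the fact table.
print(conn.execute("SELECT * FROM sales_summary ORDER BY region").fetchall())
conn.close()
```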

What is OLAP (Online Analytical Processing)?


OLAP stands for On-Line Analytical Processing. As defined earlier, OLAP is a category of software
technology that enables analysts, managers, and executives to gain insight into data through fast,
consistent, interactive access to multidimensional views of enterprise information, and it supports
complex calculations, trend analysis, and sophisticated data modeling for better decision making.

How OLAP systems work


To facilitate this kind of analysis, data is collected from multiple sources and stored in data warehouses,
then cleansed and organized into data cubes. Each OLAP cube contains data categorized by dimensions
(such as customers, geographic sales region, and time period) derived from dimension tables in the data
warehouse. Dimensions are then populated by members (such as customer names, countries, and
months) that are organized hierarchically. OLAP cubes are often pre-summarized across dimensions to
drastically improve query time over relational databases.
Analysts can then perform five types of OLAP analytical operations against these multidimensional
databases:
• Roll-up. Also known as consolidation or drill-up, this operation summarizes the data along
a dimension, for example by climbing up a concept hierarchy from months to years.
• Drill-down. This allows analysts to navigate deeper among the dimensions of data. For
example, drilling down from "time period" to "years" and "months" to chart sales growth for
a product.
• Slice. This enables an analyst to take one level of information for display, such as "sales in
2017."
• Dice. This allows an analyst to select data from multiple dimensions to analyze, such as
"sales of blue beach balls in Iowa in 2017."
• Pivot. Analysts can gain a new view of data by rotating the data axes of the cube.

OLAP software locates the intersection of dimensions, such as all products sold in the Eastern region
above a certain price during a certain time period, and displays them. The result is the measure; each
OLAP cube has at least one to perhaps hundreds of measures, which derive from information stored
in fact tables in the data warehouse.
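
The following sketch illustrates roll-up, slice, dice, and pivot on a tiny in-memory fact table using pandas
(an assumed tool chosen only for illustration; real OLAP servers pre-summarize these aggregations inside
the cube):

```python
import pandas as pd

# A tiny fact table: one measure (sales) categorized by three dimensions.
facts = pd.DataFrame({
    "year":    [2017, 2017, 2017, 2018],
    "state":   ["Iowa", "Iowa", "Ohio", "Iowa"],
    "product": ["beach ball", "kite", "beach ball", "beach ball"],
    "sales":   [120.0, 80.0, 200.0, 150.0],
})

# Roll-up: summarize the measure along the product dimension (one row per year/state).
rollup = facts.groupby(["year", "state"], as_index=False)["sales"].sum()

# Slice: fix one dimension value ("sales in 2017").
slice_2017 = facts[facts["year"] == 2017]

# Dice: select a sub-cube across several dimensions
# ("sales of beach balls in Iowa in 2017").
dice = facts[(facts["year"] == 2017) &
             (facts["state"] == "Iowa") &
             (facts["product"] == "beach ball")]

# Pivot: rotate the axes to view states as columns.
pivot = facts.pivot_table(index="year", columns="state", values="sales", aggfunc="sum")

print(rollup, slice_2017, dice, pivot, sep="\n\n")
```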
UNIT 3
Data Mining Tutorial

The data mining tutorial provides basic and advanced concepts of data mining. Our data mining tutorial
is designed for learners and experts.

What is Data Mining?


The process of extracting information from huge sets of data to identify patterns, trends, and useful
insights that allow a business to take data-driven decisions is called Data Mining. In other words, Data
Mining is the process of investigating hidden patterns of information from various perspectives and
categorizing it into useful data. This data is collected and assembled in particular areas such as data
warehouses, analyzed efficiently with data mining algorithms, and used to help decision making and meet
other data requirements, eventually cutting costs and generating revenue.

Data mining is the act of automatically searching large stores of information to find trends and patterns
that go beyond simple analysis procedures. Data mining utilizes complex mathematical algorithms to
segment the data and evaluate the probability of future events. Data Mining is also called Knowledge
Discovery in Databases (KDD).

Data Mining is a process used by organizations to extract specific data from huge databases to solve business
problems. It primarily turns raw data into useful information.

Data Mining is similar to Data Science: it is carried out by a person, in a specific situation, on a particular
data set, with an objective. This process includes various types of services such as text mining, web mining,
audio and video mining, pictorial data mining, and social media mining. It is done through software that may
be general-purpose or highly specialized. By outsourcing data mining, all the work can be done faster with low operating costs.
Specialized firms can also use new technologies to collect data that is impossible to locate manually. There
are tonnes of information available on various platforms, but very little knowledge is accessible. The biggest
challenge is to analyze the data to extract important information that can be used to solve a problem or for
company development. There are many powerful instruments and techniques available to mine data and find
better insight from it.


Types of Data Mining


Data mining can be performed on the following types of data:
Relational Database:
A relational database is a collection of multiple data sets formally organized into tables, records, and columns,
from which data can be accessed in various ways without having to reorganize the database tables. Tables
convey and share information, which facilitates data searchability, reporting, and organization.
Data warehouses:
A Data Warehouse is the technology that collects the data from various sources within the organization to
provide meaningful business insights. The huge amount of data comes from multiple places such as
Marketing and Finance. The extracted data is utilized for analytical purposes and helps in decision-making
for a business organization. The data warehouse is designed for the analysis of data rather than transaction
processing.
Data Repositories:
The Data Repository generally refers to a destination for data storage. However, many IT professionals
utilize the term more clearly to refer to a specific kind of setup within an IT structure. For example, a group
of databases, where an organization has kept various kinds of information.
Object-Relational Database:
A combination of an object-oriented database model and relational database model is called an object-
relational model. It supports Classes, Objects, Inheritance, etc.
One of the primary objectives of the Object-relational data model is to close the gap between the Relational
database and the object-oriented model practices frequently utilized in many programming languages, for
example, C++, Java, C#, and so on.
Transactional Database:
A transactional database refers to a database management system (DBMS) that can undo (roll back) a
database transaction if it is not performed appropriately. Although this was a unique capability a long
time ago, today most relational database systems support transactional database activities.

Advantages of Data Mining

o The Data Mining technique enables organizations to obtain knowledge-based data.


o Data mining enables organizations to make lucrative modifications in operation and production.
o Compared with other statistical data applications, data mining is cost-efficient.
o Data Mining helps the decision-making process of an organization.
o It facilitates the automated discovery of hidden patterns as well as the prediction of trends and
behaviors.
o It can be induced in the new system as well as the existing platforms.
o It is a quick process that makes it easy for new users to analyze enormous amounts of data in a short
time.

Disadvantages of Data Mining


o There is a risk that organizations may sell useful customer data to other organizations for
money. According to some reports, American Express has sold credit card purchase data of its customers
to other organizations.
o Many data mining analytics software packages are difficult to operate and need advanced training to work with.
o Different data mining instruments operate in distinct ways due to the different algorithms used in their
design. Therefore, the selection of the right data mining tool is a very challenging task.
o Data mining techniques are not 100% accurate, which may lead to serious consequences in certain
conditions.

Data Mining Applications


Data Mining is primarily used by organizations with intense consumer demands, such as retail, communication,
financial, and marketing companies, to determine prices, consumer preferences, product positioning, and the impact
on sales, customer satisfaction, and corporate profits. Data mining enables a retailer to use point-of-sale records
of customer purchases to develop products and promotions that help the organization to attract customers.

These are the following areas where data mining is widely used:
Data Mining in Healthcare:
Data mining in healthcare has excellent potential to improve the health system. It uses data and analytics for
better insights and to identify best practices that will enhance health care services and reduce costs. Analysts
use data mining approaches such as Machine learning, Multi-dimensional database, Data visualization, Soft
computing, and statistics. Data Mining can be used to forecast the number of patients in each category. The
procedures ensure that the patients get intensive care at the right place and at the right time. Data mining also
enables healthcare insurers to detect fraud and abuse.
Data Mining in Market Basket Analysis:
Market basket analysis is a modeling method based on the hypothesis that if you buy a specific group of
products, then you are more likely to buy another group of products. This technique may enable the retailer to
understand the purchase behavior of a buyer. This data may assist the retailer in understanding the
requirements of the buyer and altering the store's layout accordingly. A comparative analysis of results
between various stores, or between customers in different demographic groups, can also be done.
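
A minimal sketch of the underlying idea in pure Python (the baskets are invented): count how often two
product groups appear together to estimate the support and confidence of the rule "if a customer buys A,
they also buy B":

```python
# Each transaction is the set of products in one shopping basket (invented data).
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

def rule_strength(baskets, antecedent, consequent):
    """Return (support, confidence) of the association rule antecedent -> consequent."""
    n = len(baskets)
    both = sum(1 for b in baskets if antecedent <= b and consequent <= b)
    ante = sum(1 for b in baskets if antecedent <= b)
    support = both / n
    confidence = both / ante if ante else 0.0
    return support, confidence

# "Customers who buy bread also tend to buy butter."
print(rule_strength(baskets, {"bread"}, {"butter"}))   # (0.6, 0.75)
```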
Data mining in Education:
Educational data mining is a newly emerging field concerned with developing techniques that explore
knowledge from the data generated in educational environments. EDM objectives include
predicting students' future learning behavior, studying the impact of educational support, and promoting
learning science. An organization can use data mining to make precise decisions and also to predict the
results of the students. With the results, the institution can concentrate on what to teach and how to teach it.
Data Mining in Manufacturing Engineering:
Knowledge is the best asset possessed by a manufacturing company. Data mining tools can be beneficial to
find patterns in a complex manufacturing process. Data mining can be used in system-level designing to
obtain the relationships between product architecture, product portfolio, and data needs of the customers. It
can also be used to forecast the product development period, cost, and expectations among the other tasks.
Data Mining in CRM (Customer Relationship Management):
Customer Relationship Management (CRM) is all about obtaining and holding Customers, also enhancing
customer loyalty and implementing customer-oriented strategies. To get a decent relationship with the
customer, a business organization needs to collect data and analyze it. With data mining technologies,
the collected data can be used for analytics.
Data Mining in Fraud detection:
Billions of dollars are lost to fraud every year. Traditional methods of fraud detection are time-consuming
and complex. Data mining provides meaningful patterns and helps turn data into information.
An ideal fraud detection system should protect the data of all the users. Supervised methods consist of a
collection of sample records, and these records are classified as fraudulent or non-fraudulent. A model is
constructed using this data, and the technique is made to identify whether the document is fraudulent or not.
Data Mining in Lie Detection:
Apprehending a criminal is not a big deal, but bringing out the truth from him is a very challenging task.
Law enforcement may use data mining techniques to investigate offenses, monitor suspected terrorist
communications, etc. This technique includes text mining also, and it seeks meaningful patterns in data,
which is usually unstructured text. The information collected from the previous investigations is compared,
and a model for lie detection is constructed.
Data Mining Financial Banking:
The Digitalization of the banking system is supposed to generate an enormous amount of data with every
new transaction. The data mining technique can help bankers solve business-related problems in
banking and finance by identifying trends, causalities, and correlations in business information and market
costs that are not instantly evident to managers or executives, because the data volume is too large or the
data is produced too rapidly for experts to analyze. The manager may use these data for better targeting,
acquiring, retaining, segmenting, and maintaining profitable customers.

Challenges of Implementation in Data mining


Although data mining is very powerful, it faces many challenges during its execution. Various challenges
could be related to performance, data, methods, and techniques, etc. The process of data mining becomes
effective when the challenges or problems are correctly recognized and adequately resolved.

Incomplete and noisy data:
The process of extracting useful data from large volumes of data is data mining. The data in the real world is
heterogeneous, incomplete, and noisy. Data in huge quantities is often inaccurate or unreliable. These
problems may occur due to errors in the data measuring instruments or because of human errors. Suppose a
retail chain collects the phone numbers of customers who spend more than $500, and the accounting employees
enter the information into their system. The person may make a digit mistake when entering the phone number,
which results in incorrect data. Some customers may not be willing to disclose their phone numbers, which
results in incomplete data. The data could also get changed due to human or system errors. All these issues
(noisy and incomplete data) make data mining challenging.
Data Distribution:
Real-world data is usually stored on various platforms in a distributed computing environment. It might be
in a database, in individual systems, or even on the internet. Practically, it is quite a tough task to bring all the
data into a centralized data repository, mainly due to organizational and technical concerns. For example,
various regional offices may have their own servers to store their data. It is not feasible to store all the data
from all the offices on a central server. Therefore, data mining requires the development of tools and algorithms
that allow the mining of distributed data.
Complex Data:
Real-world data is heterogeneous, and it could be multimedia data, including audio and video, images,
complex data, spatial data, time series, and so on. Managing these various types of data and extracting useful
information is a tough task. Most of the time, new technologies, new tools, and methodologies would have to
be refined to obtain specific information.
Performance:
The data mining system's performance relies primarily on the efficiency of algorithms and techniques used.
If the designed algorithm and techniques are not up to the mark, then the efficiency of the data mining
process will be affected adversely.
Data Privacy and Security:
Data mining usually leads to serious issues in terms of data security, governance, and privacy. For example,
if a retailer analyzes the details of the purchased items, then it reveals data about buying habits and
preferences of the customers without their permission.

Data Visualization:
In data mining, data visualization is a very important process because it is the primary method that shows the
output to the user in a presentable way. The extracted data should convey the exact meaning of what it
intends to express. But many times, representing the information to the end user in a precise and easy way is
difficult. Because the input data and the output information are complicated, very efficient and successful data
visualization processes need to be implemented.
There are many more challenges in data mining in addition to the problems mentioned above. More
problems are disclosed as the actual data mining process begins, and the success of data mining relies on
getting rid of all these difficulties.
Prerequisites
Before learning the concepts of Data Mining, you should have a basic understanding of Statistics, Database
Knowledge, and Basic programming language.

Audience
Our Data Mining Tutorial is prepared for all beginners or computer science graduates to help them learn the
basics to advanced techniques related to data mining.


Data Mining Task Primitives
Data mining task primitives refer to the basic building blocks or components that are used to construct a
data mining process. These primitives are used to represent the most common and fundamental tasks
that are performed during the data mining process. The use of data mining task primitives can provide a
modular and reusable approach, which can improve the performance, efficiency, and understandability
of the data mining process.

The Data Mining Task Primitives are as follows:


1. The set of task relevant data to be mined: It refers to the specific data that is relevant and
necessary for a particular task or analysis being conducted using data mining techniques. This
data may include specific attributes, variables, or characteristics that are relevant to the task at
hand, such as customer demographics, sales data, or website usage statistics. The data selected
for mining is typically a subset of the overall data available, as not all data may be necessary or
relevant for the task. For example: Extracting the database name, database tables, and relevant
required attributes from the dataset from the provided input database.
2. Kind of knowledge to be mined: It refers to the type of information or insights that are being
sought through the use of data mining techniques. This describes the data mining tasks that must
be carried out. It includes various tasks such as classification, clustering, discrimination,
characterization, association, and evolution analysis. For example, It determines the task to be
performed on the relevant data in order to mine useful information such as classification,
clustering, prediction, discrimination, outlier detection, and correlation analysis.
3. Background knowledge to be used in the discovery process: It refers to any prior information
or understanding that is used to guide the data mining process. This can include domain-specific
knowledge, such as industry-specific terminology, trends, or best practices, as well as knowledge
about the data itself. The use of background knowledge can help to improve the accuracy and
relevance of the insights obtained from the data mining process. For example, the use of
concept hierarchies and user beliefs about relationships in the data can guide the discovery
process and make it more efficient.
4. Interestingness measures and thresholds for pattern evaluation: It refers to the methods and
criteria used to evaluate the quality and relevance of the patterns or insights discovered through
data mining. Interestingness measures are used to quantify the degree to which a pattern is
considered to be interesting or relevant based on certain criteria, such as its frequency,
confidence, or lift. These measures are used to identify patterns that are meaningful or relevant
to the task. Thresholds for pattern evaluation, on the other hand, are used to set a minimum level
of interestingness that a pattern must meet in order to be considered for further analysis or
action. For example: Evaluating the interestingness and interestingness measures such as utility,
certainty, and novelty for the data and setting an appropriate threshold value for the pattern
evaluation.
5. Representation for visualizing the discovered pattern: It refers to the methods used to
represent the patterns or insights discovered through data mining in a way that is easy to
understand and interpret. Visualization techniques such as charts, graphs, and maps are
commonly used to represent the data and can help to highlight important trends, patterns, or
relationships within the data. Visualizing the discovered pattern helps to make the insights
obtained from the data mining process more accessible and understandable to a wider audience,
including non-technical stakeholders. For example: presenting the discovered patterns using
visualization techniques such as bar plots, charts, graphs, and tables.
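
As a hedged illustration (not part of the original material), the five primitives can be collected into a single task specification. The sketch below expresses them as a plain Python dictionary; every database, table, attribute, and threshold name is hypothetical.

# A minimal sketch of a data mining task specification built from the five
# task primitives. All names (tables, attributes, thresholds) are illustrative.
task_specification = {
    # 1. Task-relevant data to be mined
    "relevant_data": {
        "database": "sales_db",
        "tables": ["customers", "transactions"],
        "attributes": ["age", "income", "item", "amount"],
    },
    # 2. Kind of knowledge to be mined
    "knowledge_type": "association_rules",
    # 3. Background knowledge (e.g., a concept hierarchy over locations)
    "background_knowledge": {
        "concept_hierarchy": {"street": "city", "city": "state", "state": "country"},
    },
    # 4. Interestingness measures and thresholds for pattern evaluation
    "interestingness": {"min_support": 0.05, "min_confidence": 0.7},
    # 5. Representation for visualizing the discovered patterns
    "presentation": ["rules_table", "bar_chart"],
}

print(task_specification["knowledge_type"])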
Advantages of Data Mining Task Primitives
The use of data mining task primitives has several advantages, including:
1. Modularity: Data mining task primitives provide a modular approach to data mining, which
allows for flexibility and the ability to easily modify or replace specific steps in the process.
2. Reusability: Data mining task primitives can be reused across different data mining projects,
which can save time and effort.
3. Standardization: Data mining task primitives provide a standardized approach to data mining,
which can improve the consistency and quality of the data mining process.
4. Understandability: Data mining task primitives are easy to understand and communicate, which
can improve collaboration and communication among team members.
5. Improved Performance: Data mining task primitives can improve the performance of the data
mining process by reducing the amount of data that needs to be processed, and by optimizing the
data for specific data mining algorithms.
6. Flexibility: Data mining task primitives can be combined and repeated in various ways to
achieve the goals of the data mining process, making it more adaptable to the specific needs of
the project.
7. Efficient use of resources: Data mining task primitives can help to make more efficient use of
resources, as they allow specific tasks to be performed with the right tools, avoiding unnecessary steps
and reducing the time and computational power needed.

KDD- Knowledge Discovery in Databases


The term KDD stands for Knowledge Discovery in Databases. It refers to the broad procedure of discovering
knowledge in data and emphasizes the high-level applications of specific Data Mining techniques. It is a field
of interest to researchers in various fields, including artificial intelligence, machine learning, pattern
recognition, databases, statistics, knowledge acquisition for expert systems, and data visualization.
The main objective of the KDD process is to extract information from data in the context of large databases.
It does this by using Data Mining algorithms to identify what is deemed knowledge.
Knowledge Discovery in Databases can be viewed as an automated, exploratory analysis and modeling of vast
data repositories. KDD is the organized procedure of identifying valid, useful, and understandable patterns in
huge and complex data sets. Data Mining is the core of the KDD process and involves the algorithms that
explore the data, develop the model, and find previously unknown patterns. The model is then used to extract
knowledge from the data, analyze it, and make predictions.

The availability and abundance of data today make knowledge discovery and Data Mining a matter of
considerable significance and necessity. Given the recent development of the field, it is not surprising that a
wide variety of techniques is now accessible to specialists and experts.
What is KDD?
KDD is a computer science field specializing in extracting previously unknown and interesting
information from raw data. KDD is the whole process of trying to make sense of data by developing
appropriate methods or techniques. This process deals with mapping low-level data into other forms that
are more compact, abstract, and useful. This is achieved by creating short reports, modeling the process
that generates the data, and developing predictive models that can predict future cases.

Due to the exponential growth of data, especially in areas such as business, KDD has become a very
important process for converting this large wealth of data into business intelligence, as manual extraction of
patterns has become practically impossible.

For example, it is currently used for various applications such as social network analysis, fraud detection,
science, investment, manufacturing, telecommunications, data cleaning, sports, information retrieval, and
marketing. KDD is typically used to answer questions such as: which main products might help to
obtain high profit next year in V-Mart?

The KDD Process


The knowledge discovery process (illustrated in the figure) is iterative and interactive and comprises nine
steps. The process is iterative at each stage, meaning that moving back to previous steps may be required. The
process also has many creative aspects, in the sense that no single formula or complete scientific
categorization can prescribe the correct decisions for each step and application type. Thus, it is necessary to
understand the process and the different requirements and possibilities at each stage.
The process begins with determining the KDD objectives and ends with the implementation of the discovered
knowledge. At that point, the loop is closed and active data mining starts. Subsequently, changes may need
to be made in the application domain, for example, offering new features to cell phone users in order to
reduce churn. This closes the loop: the impact is then measured on the new data repositories, and the KDD
process is run again. The following is a concise description of the nine-step KDD process, beginning with a
managerial step:

KDD Process Steps


Knowledge discovery in the database process includes the following steps, such as:

1. Goal identification: Develop and understand the application domain and the relevant prior
knowledge and identify the KDD process's goal from the customer perspective.
2. Creating a target data set: Selecting a data set, or focusing on a subset of variables or data samples,
on which discovery is to be performed.
3. Data cleaning and preprocessing: Basic operations include removing noise if appropriate,
collecting the necessary information to model or account for noise, deciding on strategies for
handling missing data fields, and accounting for time sequence information and known changes.
4. Data reduction and projection: Finding useful features to represent the data depending on the
purpose of the task. The effective number of variables under consideration may be reduced through
dimensionality reduction methods or conversion, or invariant representations for the data can be
found.
5. Matching process objectives to a mining method: Matching the goals of the KDD process defined in
step 1 to a particular data mining method. For example,
summarization, classification, regression, clustering, and others.
6. Exploratory analysis and model/hypothesis selection: Choosing the data mining algorithms and
selecting the method or methods to be used for searching for data patterns. This includes deciding
which models and parameters may be appropriate (for instance, models for categorical data differ
from models over real-valued vectors) and matching a particular data mining method with the
overall criteria of the KDD process (for example, the end user might be more interested in
understanding the model than in its predictive capabilities).
7. Data Mining: The search for patterns of interest in a particular representational form or a set of
these representations, including classification rules or trees, regression, and clustering. The user
can significantly aid the data mining method to carry out the preceding steps properly.
8. Presentation and evaluation: Interpreting mined patterns, possibly returning to some of the steps
between steps 1 and 7 for additional iterations. This step may also involve the visualization of the
extracted patterns and models or visualization of the data given the models drawn.
9. Taking action on the discovered knowledge: Using the knowledge directly, incorporating the
knowledge in another system for further action, or simply documenting and reporting to
stakeholders. This step also includes checking for and resolving potential conflicts with previously
believed or previously extracted knowledge.
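
The nine steps can be loosely mirrored in code. The following sketch, offered only as an illustration and assuming scikit-learn is available, compresses data selection, preprocessing, reduction, mining, and evaluation into a few lines on a synthetic dataset; all parameter choices are arbitrary.

# A compressed, illustrative walk through the core KDD stages using
# scikit-learn on a synthetic dataset (all parameter choices are arbitrary).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Steps 2-3: create/select a target data set and clean/preprocess it
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Step 4: data reduction and projection (dimensionality reduction)
pca = PCA(n_components=5).fit(X_train)
X_train, X_test = pca.transform(X_train), pca.transform(X_test)

# Steps 5-7: choose the mining task/algorithm and search for patterns
model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

# Step 8: presentation and evaluation of the mined model
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))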

Data Mining vs KDD

Now we will discuss the main differences between data mining and KDD.

Key Features | Data Mining | KDD
Basic Definition | Data mining is the process of identifying patterns and extracting details about big data sets using intelligent methods. | The KDD method is a complex and iterative approach to knowledge extraction from big data.
Goal | To extract patterns from datasets. | To discover knowledge from datasets.
Scope | In the KDD method, the fourth phase is called "data mining." | KDD is a broad method that includes data mining as one of its steps.
Used Techniques | Classification, Clustering, Decision Trees, Dimensionality Reduction, Neural Networks, Regression | Data cleaning, Data integration, Data selection, Data transformation, Data mining, Pattern evaluation, Knowledge presentation
Example | Clustering groups of data elements based on how similar they are. | Data analysis to find patterns and links.

In recent data mining projects, various major data mining techniques have been
developed and used, including association, classification, clustering, prediction,
sequential patterns, and regression.

1. Classification:
This technique is used to obtain important and relevant information about data and metadata. This data mining
technique helps to classify data in different classes.

Data mining techniques can be classified by different criteria, as follows:

i. Classification of Data mining frameworks as per the type of data sources mined:
This classification is as per the type of data handled. For example, multimedia, spatial data, text data,
time-series data, World Wide Web, and so on.
ii. Classification of data mining frameworks as per the database involved:
This classification is based on the data model involved, for example, object-oriented,
transactional, or relational databases.
iii. Classification of data mining frameworks as per the kind of knowledge discovered:
This classification depends on the types of knowledge discovered or the data mining functionalities, for
example, discrimination, classification, clustering, characterization, etc. Some frameworks are
extensive and offer several data mining functionalities together.
iv. Classification of data mining frameworks according to data mining techniques used:
This classification is as per the data analysis approach utilized, such as neural networks, machine
learning, genetic algorithms, visualization, statistics, data warehouse-oriented or database-oriented,
etc.
The classification can also take into account the level of user interaction involved in the data mining
procedure, such as query-driven systems, autonomous systems, or interactive exploratory systems.

2. Clustering:
Clustering is the division of information into groups of related objects. Describing the data by a few clusters
inevitably loses certain fine details but achieves simplification; the data are modeled by their clusters.
Historically, data modeling by clusters is rooted in statistics, mathematics, and numerical analysis. From a
machine learning point of view, clusters correspond to hidden patterns, the search for clusters is unsupervised
learning, and the resulting framework represents a data concept. From a practical point of view, clustering
plays an extraordinary role in data mining applications, for example, scientific data
exploration, text mining, information retrieval, spatial database applications, CRM, Web analysis,
computational biology, medical diagnostics, and much more.

In other words, we can say that Clustering analysis is a data mining technique to identify similar data. This
technique helps to recognize the differences and similarities between the data. Clustering is very similar to the
classification, but it involves grouping chunks of data together based on their similarities.

3. Regression:
Regression analysis is a data mining process used to identify and analyze the relationship between variables
in the presence of other factors. It is used to estimate the value of a specific dependent variable. Regression is
primarily a form of planning and modeling. For example, we might use it to project certain costs, depending
on other factors such as availability, consumer demand, and competition. Primarily, it gives the relationship
between two or more variables in the given data set.

4. Association Rules:
This data mining technique helps to discover a link between two or more items. It finds a hidden pattern in the
data set.
Association rules are if-then statements that help show the probability of interactions between data items
within large data sets in different types of databases. Association rule mining has several applications and is
commonly used to discover sales correlations in transactional data or in medical data sets.
The algorithm works on a collection of data, for example, a list of grocery items bought over the last six
months, and calculates the percentage of items being purchased together.
These are the three major measurement techniques:

o Support:
This measures how often items A and B are purchased together, relative to the entire dataset.
Support = (Transactions containing Item A and Item B) / (Entire dataset)
o Confidence:
This measures how often item B is purchased when item A is purchased as well.
Confidence = (Transactions containing Item A and Item B) / (Transactions containing Item A)
o Lift:
This measures the strength of the confidence relative to how often item B is purchased on its own.
Lift = Confidence / ((Transactions containing Item B) / (Entire dataset))
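
To make the three measures concrete, here is a small hand-rolled computation on a toy list of transactions for the rule {bread} -> {butter}; the items and numbers are made up, and no association rule library is assumed.

# Toy illustration of support, confidence, and lift for the rule {bread} -> {butter}.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "milk"},
]
n = len(transactions)

count_A = sum(1 for t in transactions if "bread" in t)               # item A alone
count_B = sum(1 for t in transactions if "butter" in t)              # item B alone
count_AB = sum(1 for t in transactions if {"bread", "butter"} <= t)  # A and B together

support = count_AB / n            # 2/5 = 0.40
confidence = count_AB / count_A   # 2/4 = 0.50
lift = confidence / (count_B / n) # 0.50 / 0.60 ≈ 0.83 (below 1: a weak rule)

print(support, confidence, lift)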

5. Outlier detection:
This data mining technique relates to the observation of data items in the data set that do not match an
expected pattern or expected behavior. It is also known as outlier analysis or outlier mining. An outlier is a
data point that diverges too much from the rest of the dataset, and the majority of real-world datasets contain
outliers. Outlier detection plays a significant role in data mining and is valuable in numerous fields, such as
network intrusion identification, credit or debit card fraud detection, and detecting outliers in wireless sensor
network data.

6. Sequential Patterns:
The sequential pattern is a data mining technique specialized for evaluating sequential data to discover
sequential patterns. It comprises finding interesting subsequences in a set of sequences, where the value of
a sequence can be measured in terms of different criteria such as length and occurrence frequency.

Data Mining Tools
Data mining tools are software applications that process and analyze large data sets to discover patterns, trends,
and relationships that might not be immediately apparent. These tools enable organizations and researchers to
make informed decisions by extracting useful information. Some popular data mining tools include:
Altair RapidMiner: Known for its flexibility and wide range of functionality, it covers the entire data mining
process, from data preparation to modeling and evaluation.
WEKA: A collection of machine learning algorithms for data mining tasks that are easily applicable to real data
with a user-friendly interface.
KNIME: Combines data access, transformation, initial investigation, powerful predictive analytics, and
visualization within an open-source platform.
Python (with libraries like scikit-learn, pandas, and NumPy): While Python is a programming language, its
libraries are extensively used in data mining for sophisticated data analysis and machine learning.
Tableau: A visualization tool with powerful data mining capabilities due to its ability to interactively handle
large data sets.
These tools cater to a variety of users, from those who prefer graphical interfaces to those who are more
comfortable coding their own analyses.

Data Mining Query Language


Data Mining is a process in which useful data are extracted and processed from a heap of unprocessed
raw data. By aggregating these datasets into a summarized format, many problems arising in finance,
marketing, and many other fields can be solved. In the modern world of enormous data, Data Mining is one
of the growing fields of technology that serves as an application in many industries we depend on in our
lives. Many developments and research efforts have taken place in this field, and many systems have been
built. Since there are numerous processes and functions to be performed in Data Mining, a well-developed
user interface is needed. Even though there are many well-developed user interfaces for relational systems,
Han, Fu, Wang, et al. proposed the Data Mining Query Language (DMQL) to further build more advanced
systems and to enable new research in this field. Although DMQL cannot be considered a standard language,
it is a derived language that stands as a general query language for performing data mining techniques.
DMQL is implemented in the DBMiner system for collecting data from several layers of databases.
Ideas in designing DMQL:
DMQL is designed based on Structured Query Language(SQL) which in turn is a relational query
language.
• Data Mining request: For a given data mining task, the corresponding datasets must be
defined in the form of a data mining request. For example, since the user can request any
specific part of a dataset in the database, the data miner can use a database query to retrieve
the suitable datasets before the mining process begins. If that specific data cannot be
aggregated directly, the data miner collects supersets from which the required data can be
derived. This shows the need for a query language in data mining, acting as a subtask of the
process. Since extracting relevant data from huge datasets cannot be done manually, many
automated methods exist in data mining; even so, the task of collecting exactly the data the
user requested can sometimes fail. With DMQL, a command can retrieve specific datasets or
data from the database, returning the desired result to the user and better fulfilling user
expectations.

• Background Knowledge: Prior knowledge of the datasets and their relationships in a database
helps in mining the data. Knowing the relationships, or any other useful information, can ease
the process of extraction and aggregation. For instance, a concept hierarchy over the datasets
can increase the efficiency of the process and the accuracy of collecting the desired data. By
knowing the hierarchy, the data can be generalized with ease.
• Generalization: When the data in a data warehouse is not generalized, it is often in the form of
unprocessed primitive integrity constraints and loosely associated multi-valued datasets and
their dependencies. Using the generalization concept through the query language helps in
processing the raw data into a precise abstraction. It also supports multi-level collection of data
with quality aggregation. When larger databases come into play, generalization plays a major
role in giving desirable results at a conceptual level of data collection.
• Flexibility and Interaction: To avoid collecting less desirable or unwanted data from
databases, suitable threshold values must be specified so that data mining remains flexible and
interactive, which makes the user experience engaging. Such threshold values can be provided
with data mining queries.
The four parameters of data mining:
• The first parameter is to fetch the relevant dataset from the database in the form of a relational
query. By specifying this primitive, relevant data are retrieved.
• The second parameter is the type of resource/information extracted. This primitive includes
generalization, association, classification, characterization, and discrimination rules.
• The third parameter is the hierarchy of datasets or generalization relation or background
knowledge as said earlier in the designing of DMQL.
• The final parameter is the significance (interestingness) of the patterns discovered, which can be
represented by specific threshold values that in turn depend on the type of rules used in data mining.
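
As a rough sketch of how these four parameters might come together, the snippet below assembles a DMQL-style mining request as a plain string in Python. The statement syntax is only indicative of the language proposed by Han et al., and every database, hierarchy, attribute, and threshold name is invented.

# Assembling an illustrative DMQL-style mining request from the four parameters.
# Syntax is indicative only; all names and thresholds below are made up.
lines = [
    "use database sales_db",                        # parameter 1: relevant data set
    "use hierarchy location_hierarchy for branch",  # parameter 3: background knowledge
    "mine association rules as buying_patterns",    # parameter 2: kind of knowledge
    "in relevance to age, income, item",            # parameter 1: relevant attributes
    "with support threshold = 5%",                  # parameter 4: interestingness threshold
    "with confidence threshold = 70%",              # parameter 4: interestingness threshold
]
dmql_request = "\n".join(lines)
print(dmql_request)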

Data Preprocessing in Data Mining



Data preprocessing is an important step in the data mining process. It refers to the cleaning,
transforming, and integrating of data in order to make it ready for analysis. The goal of data
preprocessing is to improve the quality of the data and to make it more suitable for the specific data
mining task.
Steps of Data Preprocessing
Data preprocessing is an important step in the data mining process that involves cleaning and
transforming raw data to make it suitable for analysis. Some common steps in data preprocessing
include:
1. Data Cleaning: This involves identifying and correcting errors or inconsistencies in the
data, such as missing values, outliers, and duplicates. Various techniques can be used for
data cleaning, such as imputation, removal, and transformation.
2. Data Integration: This involves combining data from multiple sources to create a unified
dataset. Data integration can be challenging as it requires handling data with different
formats, structures, and semantics. Techniques such as record linkage and data fusion can be
used for data integration.
3. Data Transformation: This involves converting the data into a suitable format for analysis.
Common techniques used in data transformation include normalization, standardization, and
discretization. Normalization is used to scale the data to a common range, while
standardization is used to transform the data to have zero mean and unit variance.
Discretization is used to convert continuous data into discrete categories.
4. Data Reduction: This involves reducing the size of the dataset while preserving the
important information. Data reduction can be achieved through techniques such as feature
selection and feature extraction. Feature selection involves selecting a subset of relevant
features from the dataset, while feature extraction involves transforming the data into a
lower-dimensional space while preserving the important information.

5. Data Discretization: This involves dividing continuous data into discrete categories or
intervals. Discretization is often used in data mining and machine learning algorithms that
require categorical data. Discretization can be achieved through techniques such as equal
width binning, equal frequency binning, and clustering.
6. Data Normalization: This involves scaling the data to a common range, such as between 0
and 1 or -1 and 1. Normalization is often used to handle data with different units and scales.
Common normalization techniques include min-max normalization, z-score normalization,
and decimal scaling.
Data preprocessing plays a crucial role in ensuring the quality of data and the accuracy of the analysis
results. The specific steps involved in data preprocessing may vary depending on the nature of the data
and the analysis goals.
By performing these steps, the data mining process becomes more efficient and the results become more
accurate.
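
A minimal sketch of three of the steps above (normalization, standardization, and equal-width discretization), assuming NumPy and using arbitrary sample values:

# Min-max normalization, z-score standardization, and equal-width binning
# on a tiny made-up sample (values are arbitrary).
import numpy as np

values = np.array([2.0, 5.0, 9.0, 14.0, 20.0])

# Min-max normalization to the range [0, 1]
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score standardization (zero mean, unit variance)
z_score = (values - values.mean()) / values.std()

# Equal-width discretization into 3 categories
bins = np.linspace(values.min(), values.max(), 4)  # 3 equal-width intervals
categories = np.digitize(values, bins[1:-1])       # bin index 0, 1, or 2 per value

print(min_max, z_score, categories)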
Preprocessing in Data Mining
Data preprocessing is a data mining technique that is used to transform raw data into a useful and
efficient format.

Steps Involved in Data Preprocessing


1. Data Cleaning: The data can have many irrelevant and missing parts. To handle this part, data
cleaning is done. It involves handling of missing data, noisy data etc.
• Missing Data: This situation arises when some values are missing in the data. It can be handled
in various ways.
Some of them are:
o Ignore the tuples: This approach is suitable only when the dataset we
have is quite large and multiple values are missing within a tuple.
o Fill the Missing values: There are various ways to do this task. You can
choose to fill the missing values manually, by attribute mean or the most
probable value.

• Noisy Data: Noisy data is meaningless data that cannot be interpreted by machines. It can be
generated due to faulty data collection, data entry errors, etc. It can be handled in the following
ways:

o Binning Method: This method works on sorted data in order to smooth
it. The whole data is divided into segments of equal size, and then various
methods are applied to each segment. Each segment is handled
separately: one can replace all the data in a segment by its mean, or
boundary values can be used to complete the task.
o Regression: Here data can be smoothed by fitting it to a regression
function. The regression used may be linear (having one independent
variable) or multiple (having multiple independent variables).
o Clustering: This approach groups similar data into clusters. The
outliers may go undetected, or they will fall outside the clusters.
2. Data Transformation: This step is taken in order to transform the data in appropriate forms suitable
for mining process. This involves following ways:
• Normalization: It is done in order to scale the data values in a specified range (-1.0 to 1.0 or
0.0 to 1.0)
• Attribute Selection: In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.
• Discretization: This is done to replace the raw values of numeric attribute by interval levels
or conceptual levels.
• Concept Hierarchy Generation: Here attributes are converted from a lower level to a higher
level in the hierarchy. For example, the attribute “city” can be generalized to “country”.
3. Data Reduction: Data reduction is a crucial step in the data mining process that involves reducing
the size of the dataset while preserving the important information. This is done to improve the efficiency
of data analysis and to avoid overfitting of the model. Some common steps involved in data reduction
are:
• Feature Selection: This involves selecting a subset of relevant features from the dataset.
Feature selection is often performed to remove irrelevant or redundant features from the
dataset. It can be done using various techniques such as correlation analysis, mutual
information, and principal component analysis (PCA).
• Feature Extraction: This involves transforming the data into a lower-dimensional space
while preserving the important information. Feature extraction is often used when the
original features are high-dimensional and complex. It can be done using techniques such
as PCA, linear discriminant analysis (LDA), and non-negative matrix factorization (NMF).
• Sampling: This involves selecting a subset of data points from the dataset. Sampling is often
used to reduce the size of the dataset while preserving the important information. It can be
done using techniques such as random sampling, stratified sampling, and systematic
sampling.
• Clustering: This involves grouping similar data points together into clusters. Clustering is
often used to reduce the size of the dataset by replacing similar data points with a
representative centroid. It can be done using techniques such as k-means, hierarchical
clustering, and density-based clustering.
• Compression: This involves compressing the dataset while preserving the important
information. Compression is often used to reduce the size of the dataset for storage and
transmission purposes. It can be done using techniques such as wavelet compression, JPEG
compression, and gif compression.
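
A short, illustrative sketch of data reduction, assuming NumPy and scikit-learn: PCA-based feature extraction and simple random sampling on made-up data.

# Illustrative data reduction: PCA feature extraction and random sampling
# on a randomly generated 100 x 6 feature matrix (all data made up).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))

# Feature extraction: project 6 original features down to 2 principal components
X_reduced = PCA(n_components=2).fit_transform(X)

# Random sampling: keep 30% of the rows while preserving the feature columns
sample = X[rng.choice(len(X), size=30, replace=False)]

print(X_reduced.shape, sample.shape)  # (100, 2) and (30, 6)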

What is Data Visualization?


Data Visualization refers to the visual representation of data with the help of comprehensive charts,
images, lists, and other visual objects. It enables users to easily understand the information within a
fraction of the time and to extract useful information, patterns, and trends. Moreover, it makes the
information easy to understand.
In other words, representing data in graphical form so that users can easily comprehend the trends in the
data is called data visualization. There are many tools involved in data visualization, such as charts, maps,
graphs, etc. The tools used for data visualization help

the users easily understand and digest the data provided by the visual representation, rather than
scanning through the whole datasheet.
Importance of Data Visualization?
Data visualization represents the data in visual form. It is important since it enables information to be
more easily seen. Machine learning technique plays an important role in conducting predictive analysis
which supports data visualization. Data visualization is not only helpful for business analysts, data
analysts, and data scientists; it plays a vital role in comprehending data in any career. Whether we work
in design, operations, tech, marketing, sales, or any other field, we need to visualize data.
Data visualization has broad uses. Some important benefits of data visualization are given below.
o Finding errors and outliers
o Understand the whole procedure
o Grasping the data quickly
o Exploring business insights
o Quick action.

Types of Data Visualization charts?


There are various tools available that help the data visualization process. Some are automated, and some
are manual, but with the help of these tools, you can make any type of visualization.
Line chart: A line chart illustrates changes over time. The x-axis represents the period, whereas the y-axis
represents the quantity.
Bar Chart: Bar charts also illustrate changes over time. When there is more than one variable, a bar chart
can make it simpler to distinguish the data for each variable at each moment in time.
Pie chart: A pie chart is one of the best options for illustrating proportions (percentages) because it depicts
each item as part of a whole. So, if your data are expressed in percentages, a pie chart will help present
the pieces in the proper proportions.
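
A minimal matplotlib sketch of the three chart types, plotted from made-up monthly sales figures:

# Line, bar, and pie charts for the same made-up monthly sales figures.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 150, 90, 180]

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].plot(months, sales)                           # line chart: change over time
axes[0].set_title("Line chart")
axes[1].bar(months, sales)                            # bar chart: compare quantities per period
axes[1].set_title("Bar chart")
axes[2].pie(sales, labels=months, autopct="%1.0f%%")  # pie chart: share of the whole
axes[2].set_title("Pie chart")
plt.tight_layout()
plt.show()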

Difference between Data Mining and Data Visualization

Data Mining | Data Visualization
Data mining is a process of extracting useful information, patterns, and trends from raw data. | Data visualization refers to the visual representation of data with the help of comprehensive charts, images, lists, and other visual objects.
It comes under the area of data science. | It also comes under the area of data science.
In data mining, many algorithms exist. | There is no need to use any algorithm.
It is operated with web software systems. | It supports and works better in advanced data analyses.
It has broad applications and is primarily preferred for web search engines. | It is preferred for data forecasting and predictions.
It is a newer technology but still underdeveloped. | It is more useful in real-time data forecasting.
It works on any web-enabled platform or with any application. | It provides visual representation, irrespective of hardware and software.

Regression
Regression refers to a type of supervised machine learning technique that is used to predict any
continuous-valued attribute. Regression helps any business organization to analyze the target variable
and predictor variable relationships. It is a highly significant tool for analyzing data and can be used for
financial forecasting and time series modeling.
Regression involves fitting a straight line or a curve to numerous data points in such a way that the distance
between the data points and the curve is as small as possible.
The most popular types of regression are linear and logistic regression. Beyond these, many other types of
regression can be used, depending on their performance on an individual data set.
Regression can predict the dependent variable expressed in terms of the independent variables, and the trend
is available for a finite period. Regression provides a good way to predict variables, but there are certain
restrictions and assumptions, such as the independence of the variables and the inherent normal distribution
of the variables. For example, suppose one considers two variables, A and B, whose joint distribution is
bivariate; the two variables might be treated as independent even though they are correlated, and the marginal
distributions of A and B then need to be derived and used. Before applying regression analysis, the data need
to be studied carefully and certain preliminary tests performed to ensure that regression is applicable.
Non-parametric tests are available for such cases.

Types of Regression
Regression is divided into five different types
1. Linear Regression
2. Logistic Regression
3. Lasso Regression
4. Ridge Regression
5. Polynomial Regression
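
As a hedged illustration of the first two types, the sketch below fits a linear regression to a continuous target and a logistic regression to a binary target using scikit-learn; the data points are made up.

# Linear regression on a numeric target and logistic regression on a binary
# target, both on tiny made-up data.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Linear regression: predict a continuous value from one feature
X = np.array([[1], [2], [3], [4], [5]])
y_continuous = np.array([3.1, 4.9, 7.2, 9.1, 10.8])
lin = LinearRegression().fit(X, y_continuous)
print("slope:", lin.coef_[0], "prediction at x=6:", lin.predict([[6]])[0])

# Logistic regression: predict a binary class label from the same feature
y_binary = np.array([0, 0, 0, 1, 1])
log = LogisticRegression().fit(X, y_binary)
print("P(class=1 | x=3.5):", log.predict_proba([[3.5]])[0, 1])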
Difference between Regression and Classification in data mining

Regression | Classification
Regression refers to a type of supervised machine learning technique that is used to predict any continuous-valued attribute. | Classification refers to a process of assigning predefined class labels to instances based on their attributes.
In regression, the nature of the predicted data is ordered. | In classification, the nature of the predicted data is unordered.
Regression can be further divided into linear regression and non-linear regression. | Classification is divided into two categories: binary classification and multi-class classification.
In the regression process, the model is typically evaluated using the root mean square error. | In the classification process, the model is typically evaluated by measuring accuracy.
Examples of regression models are regression trees, linear regression, etc. | Examples of classification models are decision trees, etc.

Regression analysis usually enables us to compare the effects of various kinds of feature variables measured
on different scales, such as predicting land prices based on locality, total area, surroundings, etc. These
results help market researchers or data analysts to remove useless features and select the best features for
building efficient models.
Statistical Methods in Data Mining
Data mining refers to extracting or mining knowledge from large amounts of data. In other words, data
mining is the science, art, and technology of exploring large and complex bodies of data in order to
discover useful patterns. Theoreticians and practitioners are continually seeking improved techniques to
make the process more efficient, cost-effective, and accurate. Any situation can be analyzed in two ways
in data mining:
• Statistical Analysis: In statistics, data is collected, analyzed, explored, and presented to identify
patterns and trends. Alternatively, it is referred to as quantitative analysis.

• Non-statistical Analysis: This analysis provides generalized information and includes sound,
still images, and moving images.
In statistics, there are two main categories:
• Descriptive Statistics: The purpose of descriptive statistics is to organize data and identify the
main characteristics of that data. Graphs or numbers summarize the data. Average, Mode,
SD(Standard Deviation), and Correlation are some of the commonly used descriptive statistical
methods.
• Inferential Statistics: The process of drawing conclusions based on probability theory and
generalizing the data. By analyzing sample statistics, you can infer parameters about populations
and make models of relationships within data.
There are various statistical terms that one should be aware of while dealing with statistics. Some of
these are:
• Population
• Sample
• Variable
• Quantitative Variable
• Qualitative Variable
• Discrete Variable
• Continuous Variable
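
A small illustration of the descriptive measures mentioned above (average, mode, standard deviation, and correlation), computed with Python's statistics module and NumPy on made-up sample data:

# Descriptive statistics on a small made-up sample.
import statistics
import numpy as np

heights = [160, 165, 165, 170, 172, 175, 180]
weights = [55, 58, 60, 63, 66, 70, 75]

print("mean:", statistics.mean(heights))
print("mode:", statistics.mode(heights))
print("standard deviation:", round(statistics.stdev(heights), 2))
print("correlation:", round(float(np.corrcoef(heights, weights)[0, 1]), 3))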
Bayes’ Theorem
Bayes’ Theorem is used to determine the conditional probability of an event. It is named after the English
statistician Thomas Bayes, whose work containing the formula was published in 1763. Bayes’ Theorem is a
very important theorem in mathematics that laid the foundation of a unique statistical inference approach
called Bayesian inference. It is used to find the probability of an event based on prior knowledge of
conditions that might be related to that event.

Bayes’ Theorem Illustration


For example, if we want to find the probability that a white marble drawn at random came from the first
bag, given that a white marble has already been drawn, and there are three bags each containing some
white and black marbles, then we can use Bayes’ Theorem.
This section explores Bayes’ theorem, including its statement, proof, derivation, and formula, as well as its
applications with various examples.
What is Bayes’ Theorem?

Bayes theorem (also known as the Bayes Rule or Bayes Law) is used to determine the conditional
probability of event A when event B has already occurred.
The general statement of Bayes’ theorem is “The conditional probability of an event A, given the
occurrence of another event B, is equal to the product of the event of B, given A and the probability of
A divided by the probability of event B.” i.e.
P(A|B) = P(B|A)P(A) / P(B)

where,
• P(A) and P(B) are the probabilities of events A and B
• P(A|B) is the probability of event A when event B happens
• P(B|A) is the probability of event B when A happens
Bayes’s Theorem for Conditional Probability
Bayes Theorem Statement
Bayes’ Theorem for n set of events is defined as,
Let E1, E2,…, En be a set of events associated with the sample space S, in which all the events E1,
E2,…, En have a non-zero probability of occurrence and together form a partition of S.
Let A be an event from space S for which we have to find probability, then according to Bayes’
theorem,
P(Ei|A) = P(Ei)P(A|Ei) / ∑ P(Ek)P(A|Ek)
for k = 1, 2, 3, …., n
Bayes Theorem Formula
For any two events A and B, then the formula for the Bayes theorem is given by: (the image given
below gives the Bayes’ theorem formula)

Bayes’ Theorem Formula


where,
• P(A) and P(B) are the probabilities of events A and B also P(B) is never equal to zero.
• P(A|B) is the probability of event A when event B happens
• P(B|A) is the probability of event B when A happens
Bayes Theorem Derivation
The proof of Bayes’ Theorem is given as, according to the conditional probability formula,
P(Ei|A) = P(Ei∩A) / P(A)…..(i)
Then, by using the multiplication rule of probability, we get
P(Ei∩A) = P(Ei)P(A|Ei)……(ii)
Now, by the total probability theorem,
P(A) = ∑ P(Ek)P(A|Ek)…..(iii)
Substituting the value of P(Ei∩A) and P(A) from eq (ii) and eq(iii) in eq(i) we get,
P(Ei|A) = P(Ei)P(A|Ei) / ∑ P(Ek)P(A|Ek)
Bayes’ theorem is also known as the formula for the probability of “causes”. As we know, the Ei‘s are a
partition of the sample space S, and at any given time only one of the events Ei occurs. Thus we
conclude that the Bayes’ theorem formula gives the probability of a particular Ei, given the event A has
occurred.
Terms Related to Bayes Theorem
After learning about Bayes theorem in detail, let us understand some important terms related to the
concepts we covered in formula and derivation.
• Hypotheses: Events happening in the sample space E1, E2,… En is called the hypotheses
• Priori Probability: Priori Probability is the initial probability of an event occurring before any
new data is taken into account. P(Ei) is the priori probability of hypothesis Ei.
• Posterior Probability: Posterior Probability is the updated probability of an event after
considering new information. Probability P(Ei|A) is considered as the posterior probability of
hypothesis Ei.
Bayes Theorem Examples
Example: There are three urns containing 3 white and 2 black balls; 2 white and 3 black balls; and 1
black and 4 white balls, respectively. There is an equal probability of each urn being chosen, and one
ball is chosen at random from the selected urn. What is the probability that a white ball is drawn?
Solution:
Let E1, E2, and E3 be the events of choosing the first, second, and third urn respectively. Then,
P(E1) = P(E2) = P(E3) =1/3
Let E be the event that a white ball is drawn. Then,
P(E/E1) = 3/5, P(E/E2) = 2/5, P(E/E3) = 4/5
By the theorem of total probability, we have
P(E) = P(E/E1) . P(E1) + P(E/E2) . P(E2) + P(E/E3) . P(E3)
⇒ P(E) = (3/5 × 1/3) + (2/5 × 1/3) + (4/5 × 1/3)
⇒ P(E) = 9/15 = 3/5
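As an added step (not part of the original question), Bayes’ theorem now gives the posterior probability that the white ball came from the first urn:
P(E1|E) = P(E1) P(E|E1) / P(E) = (1/3 × 3/5) / (3/5) = 1/3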
What is Hypothesis Testing?
A hypothesis is an assumption or idea, specifically a statistical claim about an unknown
population parameter. For example, a judge assumes a person is innocent and verifies this by reviewing
evidence and hearing testimony before reaching a verdict.
Hypothesis testing is a statistical method that is used to make a statistical decision using experimental
data. Hypothesis testing is basically an assumption that we make about a population parameter. It
evaluates two mutually exclusive statements about a population to determine which statement is best
supported by the sample data.
To test the validity of the claim or assumption about the population parameter:
• A sample is drawn from the population and analyzed.
• The results of the analysis are used to decide whether the claim is true or not.
Example: You claim that the average height in the class is some particular value, or that a boy is taller
than a girl. These are assumptions we are making, and we need some statistical way to test them; we
need a mathematical conclusion about whether what we are assuming is true.
Defining Hypotheses
• Null hypothesis (H0): In statistics, the null hypothesis is a general statement or default position
that there is no relationship between two measured cases or no relationship among groups. In
other words, it is a basic assumption or made based on the problem knowledge.
Example: A company’s mean production is 50 units per day, i.e., H0: μ = 50.
• Alternative hypothesis (H1): The alternative hypothesis is the hypothesis used in hypothesis
testing that is contrary to the null hypothesis.
Example: The company’s mean production is not equal to 50 units per day, i.e., H1: μ ≠ 50.
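
A hedged sketch of testing H0: μ = 50 against H1: μ ≠ 50 with a one-sample t-test, assuming SciPy is available; the daily production figures are made up.

# One-sample t-test of H0: mu = 50 against H1: mu != 50 on made-up data.
from scipy import stats

daily_production = [48, 52, 51, 47, 50, 53, 49, 46, 51, 52]
t_stat, p_value = stats.ttest_1samp(daily_production, popmean=50)

alpha = 0.05
print("t =", t_stat, "p =", p_value)
if p_value < alpha:
    print("Reject H0: mean production differs from 50 units/day")
else:
    print("Fail to reject H0 at the 5% significance level")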

UNIT 4
Basic Concept of Classification (Data Mining)

Data Mining: Data mining in general terms means mining or digging deep into data that is in different
forms to gain patterns, and to gain knowledge on that pattern. In the process of data mining, large data
sets are first sorted, then patterns are identified and relationships are established to perform data analysis
and solve problems.
Classification is a task in data mining that involves assigning a class label to each instance in a dataset
based on its features. The goal of classification is to build a model that accurately predicts the class
labels of new instances based on their features.
There are two main types of classification: binary classification and multi-class classification. Binary
classification involves classifying instances into two classes, such as “spam” or “not spam”, while
multi-class classification involves classifying instances into more than two classes.

The process of building a classification model typically involves the following steps:

Data Collection:
The first step in building a classification model is data collection. In this step, the data relevant to the
problem at hand is collected. The data should be representative of the problem and should contain all the
necessary attributes and labels needed for classification. The data can be collected from various sources,
such as surveys, questionnaires, websites, and databases.
Data Preprocessing:
The second step in building a classification model is data preprocessing. The collected data needs to be
preprocessed to ensure its quality. This involves handling missing values, dealing with outliers, and
transforming the data into a format suitable for analysis. Data preprocessing also involves converting
the data into numerical form, as most classification algorithms require numerical input.
Handling Missing Values: Missing values in the dataset can be handled by replacing them with the
mean, median, or mode of the corresponding feature or by removing the entire record.
Dealing with Outliers: Outliers in the dataset can be detected using various statistical techniques such as
z-score analysis, boxplots, and scatterplots. Outliers can be removed from the dataset or replaced with
the mean, median, or mode of the corresponding feature.
Data Transformation: Data transformation involves scaling or normalizing the data to bring it into a
common scale. This is done to ensure that all features have the same level of importance in the analysis.
Feature Selection:
The third step in building a classification model is feature selection. Feature selection involves
identifying the most relevant attributes in the dataset for classification. This can be done using various
techniques, such as correlation analysis, information gain, and principal component analysis.
Correlation Analysis: Correlation analysis involves identifying the correlation between the features in
the dataset. Features that are highly correlated with each other can be removed as they do not provide
additional information for classification.
Information Gain: Information gain is a measure of the amount of information that a feature provides for
classification. Features with high information gain are selected for classification.
Principal Component Analysis:
Principal Component Analysis (PCA) is a technique used to reduce the dimensionality of the dataset.
PCA identifies the most important features in the dataset and removes the redundant ones.
Model Selection:
The fourth step in building a classification model is model selection. Model selection involves selecting
the appropriate classification algorithm for the problem at hand. There are several algorithms available,
such as decision trees, support vector machines, and neural networks.
Decision Trees: Decision trees are a simple yet powerful classification algorithm. They divide the
dataset into smaller subsets based on the values of the features and construct a tree-like model that can
be used for classification.
Support Vector Machines: Support Vector Machines (SVMs) are a popular classification algorithm used
for both linear and nonlinear classification problems. SVMs are based on the concept of maximum
margin, which involves finding the hyperplane that maximizes the distance between the two classes.
Neural Networks:
Neural Networks are a powerful classification algorithm that can learn complex patterns in the data.
They are inspired by the structure of the human brain and consist of multiple layers of interconnected
nodes.
Model Training:
The fifth step in building a classification model is model training. Model training involves using the
selected classification algorithm to learn the patterns in the data. The data is divided into a training set
and a validation set. The model is trained using the training set, and its performance is evaluated on the
validation set.
Model Evaluation:
The sixth step in building a classification model is model evaluation. Model evaluation involves
assessing the performance of the trained model on a test set. This is done to ensure that the model
generalizes well
Classification is a widely used technique in data mining and is applied in a variety of domains, such as
email filtering, sentiment analysis, and medical diagnosis.
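
Tying these steps together, here is a hedged end-to-end sketch using scikit-learn and its built-in iris dataset as a stand-in for collected data; a real project would substitute its own data collection, preprocessing, and feature selection.

# End-to-end sketch: load data, split, train a decision tree, evaluate.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Data collection (here: a built-in dataset stands in for collected data)
X, y = load_iris(return_X_y=True)

# Split into training and test sets (model training / model evaluation steps)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Model selection and training: a decision tree classifier
clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

# Model evaluation on unseen data
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))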

Classification: It is a data analysis task, i.e. the process of finding a model that describes and
distinguishes data classes and concepts. Classification is the problem of identifying to which of a set of
categories (subpopulations) a new observation belongs, on the basis of a training set of data containing
observations whose category membership is known.
Example: Before starting any project, we need to check its feasibility. In this case, a classifier is
required to predict class labels such as ‘Safe’ and ‘Risky’ for adopting the Project and to further approve
it. It is a two-step process such as:
1. Learning Step (Training Phase): Construction of Classification Model
Different Algorithms are used to build a classifier by making the model learn using the
training set available. The model has to be trained for the prediction of accurate results.
2. Classification Step: Model used to predict class labels and testing the constructed model on
test data and hence estimate the accuracy of the classification rules.

Test data are used to estimate the accuracy of the classification rule

Training and Testing:


Suppose a person is sitting under a fan and the fan starts falling on him; he should move aside in order not to
get hurt. That is the training part: learning to move away. During testing, if the person sees any heavy object
coming towards him or falling on him and moves aside, the system is tested positively; if the person does not
move aside, the system is tested negatively.
The same is the case with data: a model must be trained in order to obtain accurate and reliable results.
There are certain data types associated with data mining that tell us the format of the data
(whether it is in text format or in numerical format).
Attributes – Represents different features of an object. Different types of attributes are:
1. Binary: Possesses only two values i.e. True or False
Example: Suppose there is a survey evaluating some products. We need to check whether it’s
useful or not. So, the Customer has to answer it in Yes or No.
Product usefulness: Yes / No
• Symmetric: Both values are equally important in all aspects
• Asymmetric: When both the values may not be important.
2. Nominal: When more than two outcomes are possible. It is in Alphabet form rather than
being in Integer form.
Example: One needs to choose some material but of different colors. So, the color might be
Yellow, Green, Black, Red.
Different Colors: Red, Green, Black, Yellow
• Ordinal: Values that must have some meaningful order.
Example: Suppose there are grade sheets of few students which might contain
different grades as per their performance such as A, B, C, D
Grades: A, B, C, D
• Continuous: May have an infinite number of values; it is of float type.
Example: Measuring the weight of a few students in an orderly manner,
i.e. 50, 51, 52, 53
Weight: 50, 51, 52, 53
• Discrete: Finite number of values.
Example: Marks of a Student in a few subjects: 65, 70, 75, 80, 90
Marks: 65, 70, 75, 80, 90
Syntax:
• Mathematical Notation: Classification is based on building a function taking input feature
vector “X” and predicting its outcome “Y” (Qualitative response taking values in set C)
• Here a classifier (or model) is used, which is a supervised function; it can be designed manually
based on expert knowledge. It is constructed to predict class labels (for example, the label
“Yes” or “No” for the approval of some event).
Classifiers can be categorized into two major types:

1. Discriminative: It is a relatively basic type of classifier that determines just one class for each
row of data. It models the decision directly from the observed data and therefore depends
heavily on the quality of the data rather than on the underlying distributions.
Example: Logistic Regression
2. Generative: It models the distribution of the individual classes and tries to learn the process
that generates the data by estimating the assumptions and distributions of the model. It can
then be used to predict unseen data.
Example: Naive Bayes Classifier
Detecting spam emails by looking at previous data: suppose there are 100 emails, split so that
Class A contains 25% (spam emails) and Class B contains 75% (non-spam emails). Now a user
wants to check whether an email containing the word "cheap" should be termed spam.
In Class A (the 25 spam emails), 20 out of 25 contain the word "cheap"; in Class B (the 75
non-spam emails), only 5 out of 75 contain the word "cheap".
So, if an email contains the word "cheap", what is the probability of it being spam? (= 80%)
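Working this out with Bayes’ theorem (an added calculation consistent with the numbers above):
P(spam | "cheap") = P("cheap" | spam) P(spam) / [P("cheap" | spam) P(spam) + P("cheap" | not spam) P(not spam)]
= (20/25 × 0.25) / (20/25 × 0.25 + 5/75 × 0.75)
= 0.20 / (0.20 + 0.05) = 0.80, i.e. 80%.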

Classifiers Of Machine Learning:
1. Decision Trees
2. Bayesian Classifiers
3. Neural Networks
4. K-Nearest Neighbour
5. Support Vector Machines
6. Linear Regression
7. Logistic Regression
Associated Tools and Languages: Used to mine/ extract useful information from raw data.
• Main Languages used: R, SAS, Python, SQL
• Major Tools used: RapidMiner, Orange, KNIME, Spark, Weka
• Libraries used: Jupyter, NumPy, Matplotlib, Pandas, ScikitLearn, NLTK, TensorFlow,
Seaborn, Basemap, etc.
Real–Life Examples :
• Market Basket Analysis:
It is a modeling technique that has been associated with frequent transactions of buying some
combination of items.
Example: Amazon and many other Retailers use this technique. While viewing some
products, certain suggestions for the commodities are shown that some people have bought
in the past.
• Weather Forecasting:
Changing Patterns in weather conditions needs to be observed based on parameters such as
temperature, humidity, wind direction. This keen observation also requires the use of
previous records in order to predict it accurately.
Advantages:
• Mining Based Methods are cost-effective and efficient
• Helps in identifying criminal suspects
• Helps in predicting the risk of diseases
• Helps Banks and Financial Institutions to identify defaulters so that they may approve Cards,
Loan, etc.
Disadvantages:
Privacy: When data is collected, there are chances that a company may give some information about its
customers to other vendors or use this information for its own profit.
Accuracy Problem: An accurate model must be selected in order to get the best accuracy and results.
APPLICATIONS:

• Marketing and Retailing
• Manufacturing
• Telecommunication Industry
• Intrusion Detection
• Education System
• Fraud Detection
GIST OF DATA MINING :
1. Choosing the correct classification method, like decision trees, Bayesian networks, or neural
networks.
2. Need a sample of data, where all class values are known. Then the data will be divided into
two parts, a training set, and a test set.

Clustering in Data Mining

Clustering:
The process of making a group of abstract objects into classes of similar objects is known as clustering.
Points to Remember:
• One group is treated as a cluster of data objects.
• In the process of cluster analysis, the first step is to partition the set of data into groups with
the help of data similarity, and then groups are assigned to their respective labels.
• The biggest advantage of clustering over classification is that it can adapt to changes and
helps single out useful features that differentiate different groups.
Applications of cluster analysis :
• It is widely used in many applications such as image processing, data analysis, and pattern
recognition.
• It helps marketers to find the distinct groups in their customer base and they can characterize
their customer groups by using purchasing patterns.
• It can be used in the field of biology, by deriving animal and plant taxonomies and
identifying genes with the same capabilities.
• It also helps in information discovery by classifying documents on the web.
Clustering Methods:
It can be classified based on the following categories.
1. Model-Based Method
2. Hierarchical Method
3. Constraint-Based Method
4. Grid-Based Method
5. Partitioning Method
6. Density-Based Method
Requirements of clustering in data mining:
The following are some points why clustering is important in data mining.
• Scalability – we require highly scalable clustering algorithms to work with large databases.
• Ability to deal with different kinds of attributes – Algorithms should be able to work with
the type of data such as categorical, numerical, and binary data.
• Discovery of clusters with arbitrary shape – The algorithm should be able to detect clusters
of arbitrary shape and should not be bounded to distance measures that favor spherical clusters.
• Interpretability – The results should be comprehensible, usable, and interpretable.
• High dimensionality – The algorithm should be able to handle high dimensional space
instead of only handling low dimensional data.
Difference between Classification and Clustering
• Classification is a supervised learning approach where a specific label is provided to the machine to
classify new observations; the machine needs proper training and testing for label verification.
Clustering is an unsupervised learning approach where grouping is done on the basis of similarities.
• Classification is a supervised learning approach, whereas clustering is an unsupervised learning approach.
• Classification uses a training dataset. Clustering does not use a training dataset.
• Classification uses algorithms to categorize new data as per the observations of the training set.
Clustering uses statistical concepts in which the data set is divided into subsets with similar features.
• In classification, there are labels for the training data. In clustering, there are no labels for the training data.
• The objective of classification is to find which class a new object belongs to from the set of predefined
classes. The objective of clustering is to group a set of objects and find whether there is any relationship
between them.
• Classification is more complex as compared to clustering, while clustering is less complex as compared to
classification.
Here are some issues in classification in data mining:
• Discrimination
Data mining can lead to discrimination if it focuses on characteristics like ethnicity, gender, religion, or
sexual preference. In some cases, this can be considered unethical or illegal.
• Performance issues
Data mining algorithms can have performance issues with speed, reliability, and time. Parallelization
can help by breaking down jobs into smaller tasks that multiple computers can work on.
• Outlier analysis
Outlier detection is a major issue in data mining. An outlier is a pattern that is different from all the
other patterns in a data set.
• Anomaly detection
Anomaly detection is the process of finding patterns in data that don't conform to expected behavior. It
has applications in many fields, including finance, medicine, industry, and the internet.
Other issues in data mining include:
• Data quality
• Data privacy and security
• Handling diverse data types
• Scalability
• Integration with heterogeneous data sources
• Interpretation of results
• Dynamic data
• Legal and ethical concerns

Decision Tree Algorithms


Decision Tree Classification Algorithm
o Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches
represent the decision rules and each leaf node represents the outcome.
o In a Decision tree, there are two types of nodes, the Decision Node and the Leaf Node. Decision
nodes are used to make decisions and have multiple branches, whereas leaf nodes are the
outputs of those decisions and do not contain any further branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which expands
on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification and
Regression Tree algorithm.
o A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree
into subtrees.
o Below diagram explains the general structure of a decision tree:
Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.

Why use Decision Trees?
There are various algorithms in Machine learning, so choosing the best algorithm for the given dataset
and problem is the main point to remember while creating a machine learning model. Below are the two
reasons for using the Decision tree:
o Decision Trees usually mimic human thinking ability while making a decision, so it is easy to
understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like
structure.
Decision Tree Terminologies
Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which
further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according
to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes are called
the child nodes.
How does the Decision Tree algorithm Work?
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root node of
the tree. This algorithm compares the values of root attribute with the record (real dataset) attribute and,
based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves
further. It continues the process until it reaches the leaf node of the tree. The complete process can be
better understood using the below algorithm:
o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values of the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
Continue this process until a stage is reached where the nodes cannot be classified further; the
final node is then called a leaf node.
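As an illustration, the whole workflow above maps onto scikit-learn's DecisionTreeClassifier, which builds such a tree internally. A minimal sketch, assuming made-up feature values and class labels purely for demonstration:

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical records: [age, income] with an invented "buys" label
X = [[25, 30000], [45, 80000], [35, 60000], [50, 40000], [23, 20000], [40, 90000]]
y = ["yes", "no", "yes", "no", "yes", "no"]

# criterion="entropy" makes the classifier use information gain as its ASM
model = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
model.fit(X, y)

print(export_text(model, feature_names=["age", "income"]))  # inspect the learned splits
print(model.predict([[30, 50000]]))                         # classify a new record by walking the tree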

Example: Suppose there is a candidate who has a job offer and wants to decide whether he should
accept the offer or Not. So, to solve this problem, the decision tree starts with the root node
(Salary attribute by ASM). The root node splits further into the next decision node (distance from
the office) and one leaf node based on the corresponding labels. The next decision node further
gets split into one decision node (Cab facility) and one leaf node. Finally, the decision node splits
into two leaf nodes (Accepted offers and Declined offer). Consider the below diagram:

Attribute Selection Measures


While implementing a Decision tree, the main issue that arises is how to select the best attribute for the
root node and for the sub-nodes. To solve such problems, there is a technique called the Attribute
Selection Measure, or ASM. Using this measurement, we can easily select the best attribute for the nodes of
the tree. There are two popular techniques for ASM, which are:
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a dataset
based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using the
below formula:
Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies the randomness in the
data. Entropy can be calculated as:
Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)
Where,
o S= Total number of samples
o P(yes)= probability of yes
o P(no)= probability of no
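To make these formulas concrete, here is a small self-contained sketch in Python that computes entropy and the information gain of one attribute (the toy weather-style values and labels are invented for illustration):

from collections import Counter
from math import log2

def entropy(labels):
    # Entropy(S) = -sum(p * log2(p)) over the class proportions
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    # Entropy of the whole set minus the weighted average entropy of each subset
    total = len(labels)
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    weighted = sum((len(sub) / total) * entropy(sub) for sub in subsets.values())
    return entropy(labels) - weighted

outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain"]
play = ["no", "no", "yes", "yes", "no"]
print(information_gain(outlook, play))  # ~0.571 for this toy data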
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create binary splits.
o Gini index can be calculated using the below formula:
Gini Index = 1 − Σj (Pj)²
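A short sketch of the same idea in code (the class labels are invented for illustration):

from collections import Counter

def gini(labels):
    # Gini = 1 - sum(p_j^2) over the class proportions
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in Counter(labels).values())

print(gini(["yes", "yes", "no", "no"]))    # 0.5 -> maximally impure for two classes
print(gini(["yes", "yes", "yes", "yes"]))  # 0.0 -> a pure node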

Types of Decision Tree Algorithms


The different decision tree algorithms are listed below:
• ID3(Iterative Dichotomiser 3)

• C4.5
• CART(Classification and Regression Trees)
• CHAID (Chi-Square Automatic Interaction Detection)
• MARS(Multivariate Adaptive Regression Splines)

What is Iterative Dichotomiser3 Algorithm?


ID3 [Iterative Dichotomiser3]
(It is one of the most popular algorithms used to construct decision trees.)
ID3 stands for Iterative Dichotomiser 3 and is named so because the algorithm
iteratively (repeatedly) dichotomises (divides) the features into two or more groups at each step. ID3 was
invented by Ross Quinlan to generate a decision tree from a dataset and is one of the most
popular tree-construction algorithms.
ID3 is the core algorithm for building a decision tree. It employs a top-down greedy search through
the space of all possible branches with no backtracking. This algorithm uses information gain and
entropy to construct a classification decision tree.
Characteristics of ID3 Algorithm
Major Characteristics of the ID3 Algorithm are listed below:
ID3 can overfit the training data (to avoid overfitting, smaller decision trees should be preferred over
larger ones).
This algorithm usually produces small trees, but it does not always produce the smallest possible tree.
ID3 is harder to use on continuous data (if the values of a given attribute are continuous, then there
are many more places to split the data on that attribute, and searching for the best split value can
be time-consuming).
Advantages and Disadvantages of ID3 Algorithm
Advantages
• Inexpensive to construct
• Extremely fast at classifying unknown records.
• Easy to interpret for small-sized trees.
• Robust to noise (especially when methods to avoid over-fitting are employed).
• Can easily handle redundant or irrelevant attributes (unless the attributes are interacting).
• Simple and easy to understand.
• Requires little training data.
• Can work well with discrete attributes, and with continuous attributes after discretisation.

Disadvantages
• The space of possible decision trees is exponentially large. Greedy approaches are often
unable to find the best tree.
• Does not take into account interactions between attributes.
• Each decision boundary involves only a single attribute.
• Can lead to overfitting.
• May not be effective with data with many attributes.

What are the steps in ID3 algorithm?


1. Determine the entropy of the overall dataset using the class distribution.
2. For each feature:
• Calculate the entropy for its categorical values.
• Assess the information gain for each unique categorical value of the feature.
3. Choose the feature that generates the highest information gain.
4. Apply the above steps recursively to build the decision tree structure.
Pseudocode of ID3
def ID3(D, A):
    if D is pure or A is empty:
        return a leaf node with the majority class in D
    else:
        A_best = argmax(InformationGain(D, A))    # attribute with the highest information gain
        root = Node(A_best)
        for v in values(A_best):
            D_v = subset(D, A_best, v)            # examples in D where A_best takes the value v
            child = ID3(D_v, A - {A_best})
            root.add_child(v, child)
        return root
Applications of ID3
1. Fraud detection: ID3 can be used to develop models that can detect fraudulent
transactions or activities.
2. Medical diagnosis: ID3 can be used to develop models that can diagnose diseases or
medical conditions.
3. Customer segmentation: ID3 can be used to segment customers into different groups
based on their demographics, purchase history, or other factors.
4. Risk assessment: ID3 can be used to assess risk in a variety of different areas, such as
insurance, finance, and healthcare.
5. Recommendation systems: ID3 can be used to develop recommendation systems that can
recommend products, services, or content to users based on their past behavior or
preferences.
• E-commerce platforms such as Amazon use decision-tree-based models of this kind to
recommend products to their customers.
• Streaming services such as Netflix and Spotify use them to recommend movies, shows,
songs, and playlists to their users.
• Lenders such as LendingClub use them to assess the risk of approving loans to borrowers.
• Healthcare organizations use them to diagnose diseases, predict patient
outcomes, and develop personalized treatment plans.
Example:
Forecast whether a match will be played or not according to the weather conditions, using a table of
past weather observations.

C4.5
As an enhancement to the ID3 algorithm, Ross Quinlan created the decision tree algorithm C4.5. In
machine learning and data mining applications, it is a well-liked approach for creating decision trees.
Certain drawbacks of the ID3 algorithm are addressed in C4.5, including its incapacity to deal with
continuous characteristics and propensity to overfit the training set.
A modification of information gain known as the gain ratio is used to address the bias towards qualities
with many values. It is computed by dividing the information gain by the intrinsic information, which is
a measurement of the quantity of data required to characterize an attribute’s values.

Gain Ratio = Information Gain / Split Information
Where Split Information represents the entropy of the feature itself. The feature with the highest gain
ratio is chosen for splitting.
When dealing with continuous attributes, C4.5 sorts the attribute’s values first, and then chooses the
midpoint between each pair of adjacent values as a potential split point. Next, it determines which split
point gives the largest gain by calculating the information gain or gain ratio for each.
By turning every path from the root to a leaf into a rule, C4.5 can also produce rules from the decision
tree. Predictions based on fresh data can be generated using the rules.
C4.5 is an effective technique for creating decision trees that can produce rules from the tree and handle
both discrete and continuous attributes. The model's accuracy is increased and overfitting is reduced
by its use of the gain ratio and reduced-error pruning. Nevertheless, it might still be susceptible to
noisy data and might not function effectively on datasets with a lot of features.

C4.5 decision tree is a modification over the ID3 Decision Tree. C4.5 uses the Gain Ratio as
the goodness function to split the dataset, unlike ID3 which used the Information Gain.
The Information Gain function tends to prefer the features with more categories as they tend to have
lower entropy. This results in overfitting of the training data. Gain Ratio mitigates this issue by
penalising features that have more categories, using a quantity called Split Information or Intrinsic
Information.

Consider the calculation of Split Information for the Outlook feature: it is simply the entropy of the
Outlook values themselves, computed from the proportion of records falling into each category. The Gain
Ratio for Outlook is then obtained by dividing its Information Gain by this Split Information.
Had the feature had just one category, for instance had Outlook contained only the value Sunny, the
Split Information would have been 0 (since log2(1) = 0), resulting in an infinite Gain Ratio. Because the
Gain Ratio favours features with fewer categories, the branching is reduced, and this helps prevent
overfitting.
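A compact sketch of the split-information and gain-ratio calculation (the Outlook values and the information-gain figure are invented for illustration):

from collections import Counter
from math import log2

def split_information(values):
    # Entropy of the feature's own value distribution (its intrinsic information)
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())

def gain_ratio(info_gain, values):
    si = split_information(values)
    return float("inf") if si == 0 else info_gain / si

outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain"]
print(gain_ratio(0.25, outlook))  # gain ratio = information gain / split information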

CART (Classification and Regression Trees)


Introduction to the CART Algorithm
Decision trees have become one of the most popular and versatile algorithms in the realm of data
science and machine learning. Among the array of techniques used to construct decision trees, the
CART (Classification and Regression Trees) Algorithm stands out, known for its simplicity and
efficiency.
Brief Overview of the CART Algorithm
The CART Algorithm, an acronym for Classification and Regression Trees, is a foundational
technique used to construct decision trees. The beauty of CART lies in its binary tree structure, where
each node represents a decision based on attribute values, eventually leading to an outcome or class
label at the terminal nodes or leaves.

The algorithm can be used for both classification and regression


• Classification: categorizing data into predefined classes.
• Regression: predicting continuous values.
The CART method operates by recursively partitioning the dataset, ensuring that each partition or
subset is as pure as possible.

Definition of CART Algorithm


At its core, the CART (Classification and Regression Trees) Algorithm is a tree-building method used
to predict a target variable based on one or several input variables. The algorithm derives its name from
the two main problems it addresses: Classification, where the goal is to categorize data points into classes,
and Regression, where the aim is to predict continuous numeric values.
Understanding Binary Trees and Splits
Binary trees are a hallmark of the CART methodology. In the context of this algorithm:
• Nodes: Represent decisions based on attribute values. Each node tests a specific attribute and
splits the data accordingly.
• Edges/Branches: Symbolize the outcome of a decision, leading to another node or a terminal
node.
• Terminal Nodes/Leaves: Indicate the final decision or prediction.
The process of deciding where and how to split the data is central to CART. The algorithm evaluates each
attribute's potential as a split point, selecting the one that results in the most homogeneous subsets of data.

Node Impurity and the Gini Index
The goal of CART's splitting process is to achieve pure nodes, meaning nodes that have data points
belonging to a single class or with very similar values. To quantify the purity or impurity of a node, the
CART algorithm often employs measures like the Gini Index. A lower Gini Index suggests that a node is
pure.
For regression problems, other measures like mean squared error can be used to evaluate splits, aiming
to minimize the variability within nodes.
Pruning Techniques used in Cart Algorithm
While a deep tree with many nodes might fit the training data exceptionally well, it can often lead to
overfitting, where the model performs poorly on unseen data. Pruning addresses this by trimming down
the tree, removing branches that add little predictive power. Two common approaches in CART pruning
are:
• Reduced Error Pruning: Removing a node and checking if it improves model accuracy.
• Cost Complexity Pruning: Using a complexity parameter to weigh the trade-off between tree
size and its fit to the data.
By acquainting oneself with these foundational concepts, one can better appreciate the sophistication of
the CART Algorithm and its application in varied data scenarios.
How the CART Algorithm Works
To harness the full power of the CART Algorithm, it's essential to comprehend its inner workings. This
section provides an in-depth look into the step-by-step process that CART follows, unraveling the logic
behind each decision and split.

Step-by-Step Process of the CART Algorithm


The CART algorithm's magic lies in its systematic approach to building decision trees. Here's a detailed
walk-through:
1. Feature Selection:
• Start by evaluating each feature's ability to split the data effectively.
• Measure the impurity of potential splits using metrics like the Gini Index for classification or
mean squared error for regression.
• Choose the feature and the split point that results in the most significant reduction in impurity.
2. Binary Splitting:
• Once the best feature is identified, create a binary split in the data.
• This creates two child nodes, each representing a subset of the data based on the chosen
feature's value.
3. Tree Building:
• Recursively apply the above two steps for each child node, considering only the subset of data
within that node.
• Continue this process until a stopping criterion is met, such as a maximum tree depth or a
minimum number of samples in a node.
4. Tree Pruning:
• With the full tree built, the pruning process begins.
• Examine the tree's sections to identify branches that can be removed without a significant loss
in prediction accuracy.
• Pruning helps prevent overfitting, ensuring the model generalizes well to new data.
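This pipeline maps naturally onto scikit-learn's CART-based DecisionTreeClassifier. A minimal sketch, with invented data, showing Gini-based splitting and cost-complexity pruning:

from sklearn.tree import DecisionTreeClassifier

# Invented toy data: [age, income] -> will buy (1) / will not buy (0)
X = [[22, 25000], [28, 40000], [35, 60000], [45, 80000], [52, 30000], [60, 75000]]
y = [1, 1, 1, 0, 0, 0]

# criterion="gini" selects splits by Gini impurity; ccp_alpha enables cost-complexity pruning
tree = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.01, random_state=0)
tree.fit(X, y)
print(tree.predict([[30, 50000]]))  # predict for a new, unseen record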
Illustrative Examples for Clarity
Example 1: Imagine a dataset predicting whether a person will buy a product based on age and income.
The CART algorithm might determine that splitting the data at an age of 30 results in the purest nodes.
Younger individuals might predominantly fall into the "will buy" category, while older ones might be in
the "will not buy" group.
Example 2: In predicting house prices based on various features, the CART algorithm could decide that
the number of bedrooms is the most critical feature for the initial split. Houses with more than 3 bedrooms
might generally have higher prices, leading to one node, while those with fewer bedrooms lead to another.
These examples offer a glimpse into how CART evaluates data, choosing the most discriminative features
to make decisions and predictions. The beauty of the CART Algorithm lies in its simplicity combined
with its depth. While the basics are easy to grasp, the intricate details, when unfolded, showcase the
algorithm's power and adaptability.
Applications and Use Cases of the CART Algorithm
The versatility of the CART Algorithm has made it a favorite in diverse domains, from healthcare and
finance to e-commerce and energy. Its ability to handle both classification and regression problems,
combined with the transparent nature of decision trees, offers valuable insights and predictions.

Healthcare: Disease Diagnosis and Risk Assessment


In the healthcare domain, timely and accurate diagnosis is critical. CART can help medical professionals:
• Predict the likelihood of a patient having a particular disease based on symptoms and test
results.
• Assess the risk factors contributing to certain health conditions, enabling preventative
measures.
• Example: A hospital could employ the CART Algorithm to determine the risk of patients
developing post-operative complications, considering factors like age, surgery type, and pre-
existing conditions.
Finance: Credit Scoring and Fraud Detection
Financial institutions are continuously seeking efficient ways to mitigate risks. With CART, they can:
• Predict the creditworthiness of customers based on their financial behaviors and histories.
• Detect potentially fraudulent transactions by analyzing patterns and outliers.
• Example: A bank might use CART to segment customers based on their likelihood to default
on loans, considering variables like income, employment status, and debt ratios.
E-commerce: Customer Segmentation and Product Recommendations
In the digital marketplace, understanding customer behavior is paramount. E-commerce platforms
leverage CART to:
• Segment customers based on purchasing behaviors, optimizing marketing campaigns.
• Recommend products based on past browsing and purchase histories.
• Example: An online retailer could apply the CART Algorithm to suggest products that a user
is likely to buy next, based on their past interactions and similar customer profiles.
Energy: Consumption Forecasting and Equipment Maintenance
The energy sector, with its vast infrastructures, benefits from predictive analytics. With CART's help,
organizations can:
• Forecast energy consumption patterns, aiding in efficient grid management.
• Predict when equipment is likely to fail or require maintenance, ensuring uninterrupted service.
• Example: An electricity provider could utilize CART to anticipate spikes in consumption
during specific events or times of the year, allowing them to manage resources more
effectively.
The myriad applications of the CART Algorithm underscore its adaptability and the broad value it offers
across industries. Its potential goes beyond these examples, permeating any sector that relies on data-
driven decision-making.
Advantages and Limitations of the CART Algorithm
Like all algorithms, the CART Algorithm has its set of strengths and challenges. Acknowledging both
sides of the coin is essential for informed application and to harness its full potential. This section offers
a balanced perspective on what makes CART shine and where it may need supplementary assistance.
Advantages of the CART Algorithm
1. Versatility: The dual nature of CART to handle both classification and regression tasks sets it apart,
allowing it to tackle a wide variety of problems.
2. Interpretability: Decision trees, the outcome of the CART algorithm, are visually intuitive and easy
to understand. This transparency is invaluable in sectors like finance and healthcare, where interpretability
is crucial.
3. Non-parametric: CART doesn't make any underlying assumptions about the distribution of the data,
making it adaptable to diverse datasets.
4. Handles Mixed Data Types: The algorithm can easily manage datasets containing both categorical
and numerical variables.
5. Automatic Feature Selection: Inherent in its design, CART will naturally give importance to the most
informative features, somewhat negating the need for manual feature selection.
Limitations of the CART Algorithm
1. Overfitting: Without proper pruning, CART can create complex trees that fit the training data too
closely, leading to poor generalization on unseen data.
2. Sensitivity to Data Changes: Small variations in the data can result in vastly different trees. This can
be addressed by using techniques like bagging and boosting.
3. Binary Splits: CART produces binary trees, meaning each node splits into exactly two child nodes.
This might not always be the most efficient representation, especially with categorical data that has
multiple levels.
4. Local Optima: The greedy nature of CART, which makes the best split at the current step without
considering future splits, can sometimes lead to suboptimal trees.
5. Difficulty with XOR Problems: Problems like XOR, where data isn't linearly separable, can be
challenging for decision trees, requiring deeper trees and potentially leading to overfitting.
Understanding these strengths and limitations is pivotal. While CART offers robust capabilities, in some
scenarios, it might be beneficial to consider it as part of an ensemble or in tandem with other algorithms
to counteract its limitations.

CHAID (Chi-Square Automatic Interaction Detection)

What is CHAID?
CHAID is a predictive model used to forecast scenarios and draw conclusions. It’s based
on significance testing and came into use during the 1980s after Gordon Kass published An
Exploratory Technique for Investigating Large Quantities of Categorical Data .
It involves:
• Regression: A statistical analysis to estimate the relationships between a dependent
response variable and other independent ones.
• Machine learning: Artificial intelligence that leverages data to absorb information,
then makes logical predictions or decisions.
• Decision trees: Branching models of decisions or attributes, followed by their event
outcomes.

Chi-Square Formula

• To find the most dominant feature, CHAID uses chi-square tests, whereas ID3 uses information gain,
C4.5 uses the gain ratio, and CART uses the Gini index.
• Today, most programming libraries (e.g. Pandas for Python) use the Pearson metric for
correlation by default.
• The chi-square value is computed as:
• √((y − y′)² / y′)
• where y is the actual value and y′ is the expected value.
• Practical Implementation of the Chi-Square Test
• Let's consider a sample dataset and calculate the chi-square value for its features.
• We are going to build decision rules for a sample data set in which the Decision column is the
target we would like to predict based on some features; the Day column is ignored because it is
just the row number.
• We need to find the most important feature with respect to the target column in order to choose
the node on which to split the data in this data set.
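A small sketch of how such a chi-square score could be computed for one feature value against the target, using the square-root form given above (the observed/expected counts are invented):

from math import sqrt

def chi_square_score(observed, expected):
    # Sum of sqrt((y - y')^2 / y') over the observed/expected count pairs
    return sum(sqrt((y - y_exp) ** 2 / y_exp) for y, y_exp in zip(observed, expected))

observed = [9, 5]   # hypothetical [yes, no] decision counts for one feature value
expected = [7, 7]   # expected counts if the feature had no effect on the decision
print(chi_square_score(observed, expected))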

CHAID use cases


Now that we’ve established CHAID’s principles, what are some of the potential use cases
in B2B market research?
• Market segmentation: Statistical analysis techniques such as CHAID can help you
identify segments in your customer base. Understanding the variety of traits, wants,
needs, and behaviors in your target audience should inform a tailored approach to
sales and marketing.
• Perception/brand tracking: Understanding the different segments for specific
brand perceptions, plus these groups’ notable characteristics. For example, you
could use CHAID to analyze different groups’ customer satisfaction or likelihood to
recommend.
• New product development: Testing a detailed concept with fleshed-out features
using CHAID can support product development. A decision tree can show how
different groups would respond to a product concept and reveal the impact of
different characteristics. Additionally, you could use CHAID to evaluate how
customers would choose between multiple concepts.
• Marcomms testing: As with product testing, you could test how customer appeal
for different messages varies, or how different segments would respond to a single
marketing message.
Interpreting CHAID decision trees
Resembling a flow chart with multiple paths, CHAID decision trees are a highly visual way
to display data and are simple to interpret (once you know how to, that is).
A CHAID diagram typically includes:
• Root: This is the starting point for the decision tree, with lines – a.k.a. branches –
stemming from it. In market research, the root shows the overall factor used to find
contrasting segments.
• Branches: Branches connect the nodes, also known as leaves. In research, where the
prediction model identifies several customer segments, there will be many branches
– each with its own leaves.
• Leaves: Leaves show the criteria included in the final prediction model. In market
research analysis, leaves are the variables or traits found in a specific customer
segment.
Here is a simple decision tree root with branches and leaves. This example explores
hypothetical segments around likelihood to purchase a new product from Brand X after
launch, to help the company understand the customer journey.
The root shows the average likelihood to purchase among customers based on an online
survey, in this case, 6/10. From there, the CHAID algorithm creates branches to separate
those who are above or below the average.
This produces two contrasting leaves – customers who are 8/10 likely to buy versus those
who are only 4/10 likely.
CHAID then splits these nodes again looking for segments with common traits. It finds
that common ground for customer groups is based on criteria including CSAT scores,
previous purchases, spend value, membership type, having an account manager, and so on.
Advantages of using CHAID
CHAID’s benefits include:
• Reading between the lines: By scientifically evaluating the relationship between variables,
you could get more accurate results – compared to asking respondents directly. Sometimes,
respondents do not know exactly why they make certain decisions and cannot pinpoint the
driving factors themselves.
• Detailed outputs: CHAID’s decision trees can be broad e.g. with several long branches and
lots of leaves. In contrast, some other decision tree techniques are binary i.e. they only produce
two long branches showing a couple of outcomes.
• Simple to understand and share: Decision trees are more visual than, for example, data
tables. They have a logical flow and show results in a more self-explanatory way than many other
statistical techniques. Therefore, the results are easier to share with different teams across your business.
CHAID isn’t right for every market research project. Sampling is a key factor, for example. Our best
practices explain how to get the most out of it and see if it works for your research objectives.

The distance-based algorithms in data mining


These algorithms are used to measure the distance between data objects and to calculate a similarity score.

Distance measures play an important role in machine learning


They provide the foundations for many popular and effective machine learning algorithms like KNN
(K-Nearest Neighbours) for supervised learning and K-Means clustering for unsupervised learning.
Different distance measures must be chosen and used depending on the type of data. As such, it is
important to know how to implement and calculate a range of popular distance measures and to have an
intuition for the resulting scores.
In this section, we'll discuss the main distance measures used in machine learning.
Overview:
1. Role of Distance Measures
2. Hamming Distance
3. Euclidean Distance
4. Manhattan Distance (Taxicab or City Block)
5. Minkowski Distance
6. Mahalanobis Distance
7. Cosine Similarity
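As an illustration, several of the measures listed above can be written in a few lines of plain Python (a sketch; the vectors and strings are invented examples):

from math import sqrt

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p):
    # p = 1 gives the Manhattan distance, p = 2 gives the Euclidean distance
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def hamming(a, b):
    # Number of positions at which the two sequences differ
    return sum(x != y for x, y in zip(a, b))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

print(euclidean([1, 2], [4, 6]))       # 5.0
print(manhattan([1, 2], [4, 6]))       # 7
print(hamming("karolin", "kathrin"))   # 3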

Role of Distance Measures


Distance measures play an important role in machine learning
A distance measure is an objective score that summarizes the relative difference between two objects in
a problem domain.

Most commonly, the two objects are rows of data that describe a subject (such as a person, car, or
house) or an event (such as a purchase, a claim, or a diagnosis).
Perhaps, the most likely way we can encounter distance measures is when we are using a specific
machine learning algorithm that uses distance measures at its core. The most famous algorithm is KNN
— [K-Nearest Neighbours Algorithm]
KNN
A classification or regression prediction is made for a new example by calculating the distance between
the new example and all existing examples in the training dataset.

The K examples in the training dataset with the smallest distances are then selected, and a prediction is
made by aggregating their outcomes (the mode of the class labels for classification, or the mean of the
real values for regression).

KNN belongs to a broader field of algorithms called case-based or instance-based learning, most of
which use distance measures in a similar manner. Another popular instance-based algorithm that uses
distance measures is Learning Vector Quantization, or LVQ, which may also be considered
a type of neural network.
Next, we have the Self-Organizing Map algorithm, or SOM, which also uses distance measures and can be
used for supervised and unsupervised learning. Another algorithm that uses distance measures at its core
is the K-means clustering algorithm.
In Instance-Based Learning, the training examples are stored verbatim and a distance function is used to
determine which member of the training set is closest to an unknown test instance. Once the nearest
training instance has been located its class is predicted for the test instance.
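A minimal sketch of the KNN prediction rule described above, using the Euclidean distance (the training points and labels are invented):

from collections import Counter
from math import sqrt

def knn_predict(train_X, train_y, query, k=3):
    # Rank every training example by its distance to the query point
    dists = sorted(
        (sqrt(sum((a - b) ** 2 for a, b in zip(x, query))), label)
        for x, label in zip(train_X, train_y)
    )
    # Majority vote (mode) among the k nearest labels
    k_labels = [label for _, label in dists[:k]]
    return Counter(k_labels).most_common(1)[0][0]

train_X = [[1, 1], [1, 2], [5, 5], [6, 5], [2, 1]]
train_y = ["A", "A", "B", "B", "A"]
print(knn_predict(train_X, train_y, [1.5, 1.5]))  # -> "A"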

Clustering in Data Mining


Clustering is an unsupervised machine learning technique that groups data points into clusters so that
similar objects belong to the same group.
Clustering helps to split data into several subsets. Each of these subsets contains data similar to each
other, and these subsets are called clusters. Once the data from our customer base is divided into
clusters, we can make an informed decision about who we think is best suited for this product.

Let's understand this with an example, suppose we are a market manager, and we have a new tempting
product to sell. We are sure that the product would bring enormous profit, as long as it is sold to the
right people. So, how can we tell who is best suited for the product from our company's huge customer
base?

Clustering, falling under the category of unsupervised machine learning, is one of the problems that
machine learning algorithms solve.
Clustering only utilizes input data to determine patterns, anomalies, or similarities within it.
A good clustering algorithm aims to obtain clusters whose:
o The intra-cluster similarity is high, which implies that the data present inside a cluster are similar
to one another.
o The inter-cluster similarity is low, which means each cluster holds data that are not similar to the
data in other clusters.
What is a Cluster?
o A cluster is a subset of similar objects
o A subset of objects such that the distance between any of the two objects in the cluster is less
than the distance between any object in the cluster and any object that is not located inside it.
o A connected region of a multidimensional space with a comparatively high density of objects.
What is clustering in Data Mining?
o Clustering is the method of converting a group of abstract objects into classes of similar objects.
o Clustering is a method of partitioning a set of data or objects into a set of significant subclasses
called clusters.
o It helps users to understand the structure or natural grouping in a data set and is used either as a
stand-alone instrument to get better insight into the data distribution or as a pre-processing step for
other algorithms.
Important points:
o Data objects of a cluster can be considered as one group.
o While doing cluster analysis, we first partition the data set into groups based on data
similarities and then assign labels to the groups.
o The main advantage of clustering over classification is that it is adaptable to modifications and helps
single out the important characteristics that differentiate distinct groups.
Applications of cluster analysis in data mining:
o In many applications, clustering analysis is widely used, such as data analysis, market research,
pattern recognition, and image processing.
o It assists marketers in finding different groups in their client base; based on the purchasing
patterns, they can characterize their customer groups.
o It helps in allocating documents on the internet for data discovery.
o Clustering is also used in tracking applications such as detection of credit card fraud.
o As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of
data to analyze the characteristics of each cluster.
o In terms of biology, It can be used to determine plant and animal taxonomies, categorization of
genes with the same functionalities and gain insight into structure inherent to populations.
o It helps in the identification of areas of similar land that are used in an earth observation
database and the identification of house groups in a city according to house type, value, and
geographical location.

Why is clustering used in data mining?


Clustering analysis has been an evolving problem in data mining due to its variety of applications. The
advent of various data clustering tools in the last few years and their comprehensive use in a broad range
of applications, including image processing, computational biology, mobile communication, medicine,
and economics, has contributed to the popularity of these algorithms. The main issue with data
clustering algorithms is that they cannot be standardized. An advanced algorithm may give the best results
with one type of data set but may fail or perform poorly with other kinds of data sets. Although many
efforts have been made to standardize algorithms that can perform well in all situations, no
significant success has been achieved so far. Many clustering tools have been proposed, but
each algorithm has its own advantages and disadvantages and cannot work in all real situations.
1. Scalability:
Scalability in clustering implies that as we increase the number of data objects, the time to perform
clustering should scale approximately with the complexity order of the algorithm. For example, if we
perform K-means clustering, we know it is O(n), where n is the number of objects in the data. If we
raise the number of data objects 10-fold, then the time taken to cluster them should also approximately
increase 10 times, i.e., there should be a linear relationship. If that is not the case, then there is
some error with our implementation process.

The algorithm should be scalable; if it is not, we cannot get appropriate results on large data sets.
2. Interpretability:
The outcomes of clustering should be interpretable, comprehensible, and usable.
3. Discovery of clusters with attribute shape:
The clustering algorithm should be able to find arbitrary shape clusters. They should not be limited to
only distance measurements that tend to discover a spherical cluster of small sizes.
4. Ability to deal with different types of attributes:
Algorithms should be capable of being applied to any data such as data based on intervals (numeric),
binary data, and categorical data.
5. Ability to deal with noisy data:
Databases contain data that is noisy, missing, or incorrect. Some algorithms are sensitive to such data and
may result in poor-quality clusters.
6. High dimensionality:
The clustering tools should be able to handle not only low-dimensional data but also high-dimensional
data space.

Partitioning Algorithms
A typical algorithmic strategy is partitioning, which involves breaking a big issue into smaller
subproblems that may be solved individually and then combining their solutions to solve the original
problem.
The fundamental concept underlying partitioning is to separate the input into subsets, solve each subset
independently, and then combine the results to get the whole answer. From sorting algorithms to parallel
computing, partitioning has a wide range of uses.
The quicksort algorithm is a well-known illustration of partitioning. The effective sorting algorithm
QuickSort uses partitioning to arrange an array of items. The algorithm divides the array into two
subarrays: one having elements smaller than the pivot and the other holding entries bigger than the
pivot. The pivot element is often the first or final member of the array. The pivot element is then
positioned correctly, and the procedure is then performed recursively on the two subarrays to sort the
full array.
The quicksort method is demonstrated using an array of numbers in the following example −
Input: [5, 3, 8, 4, 2, 7, 1, 6]
Step 1 − Select a pivot element (in this case, 5)
[5, 3, 8, 4, 2, 7, 1, 6]
Step 2 − Partition the array into two subarrays
[3, 4, 2, 1] [5] [8, 7, 6]
Step 3 − Recursively apply quicksort to the two subarrays
[3, 4, 2, 1] [5] [8, 7, 6]
[1, 2, 3, 4] [5] [6, 7, 8]
[1, 2, 3, 4, 5, 6, 7, 8]
Step 4 − The array is now sorted
[1, 2, 3, 4, 5, 6, 7, 8]
The quicksort method in this example divides the input array into three smaller subarrays, applies the
quicksort algorithm recursively to the two unsorted subarrays, and then combines the sorted subarrays
to get the final sorted array.
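A compact sketch of the partitioning step and the recursion described above, written in plain Python and using the first element as the pivot, as in the walkthrough:

def quicksort(arr):
    # Base case: arrays of length 0 or 1 are already sorted
    if len(arr) <= 1:
        return arr
    pivot = arr[0]                                # choose the first element as the pivot
    smaller = [x for x in arr[1:] if x <= pivot]  # elements not greater than the pivot
    larger = [x for x in arr[1:] if x > pivot]    # elements greater than the pivot
    return quicksort(smaller) + [pivot] + quicksort(larger)

print(quicksort([5, 3, 8, 4, 2, 7, 1, 6]))  # [1, 2, 3, 4, 5, 6, 7, 8]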
Advantages
Performance gains
By allowing distinct sections of a problem to be addressed in parallel or on multiple processors,
partitioning can result in considerable performance gains. Faster processing speeds and improved
resource use may follow from this.
Scalability
The capacity of algorithms to handle bigger data sets and problem sizes is referred to as scalability.
Partitioning can aid in this process.
Simplicity
Partitioning can simplify complicated problems by dividing them into smaller, easier-to-manage sub-problems.
This can make the problem easier to understand and solve.

Disadvantages
Added complexity
Partitioning can make an algorithm more complicated since it calls for extra logic to maintain the
various partitions and coordinate their operations.
Communication costs
When using partitioning to parallelize a task, communication costs might be a major obstacle. The
communication burden may outweigh the performance benefits of parallelization if the partitions must
communicate often.
Load imbalance
If the sub-problems are not equally large or challenging, partitioning may result in load imbalance.
Overall performance may suffer as a result of certain processors sitting idle while others are
overworked.

Types of Partitioning Algorithms


Operating systems utilise partitioning algorithms to manage how much memory is allocated to each
process. According to the processes' sizes and other requirements, these algorithms divide or segment
the memory.
Several partitioning algorithms are used by operating systems, each with its own advantages and
disadvantages.
Fixed Partitioning
Each block or partition of memory that is divided into fixed-size units is assigned to a particular process.
The number of divisions is fixed and cannot be changed dynamically. This implies that the memory that
is accessible is split up into a set number of partitions that are all the same size and can only hold one
process at a time.
Dynamic Partitioning
Memory is divided into blocks or divisions of variable sizes via dynamic partitioning, and each block or
partition is assigned to a process as needed. Because of this, the available memory is divided into
chunks of varying sizes, and each chunk may be assigned to a job based on how much memory it
requires.
Dynamic partitioning allows for efficient memory utilisation since processes are only given the memory
they require. It also reduces internal fragmentation since memory portions are allocated dynamically.
Best-Fit Partitioning
The smallest partition that can accommodate the process is selected by the best-fit partitioning method.
This improves memory utilisation and reduces internal fragmentation. Nevertheless, because the operating
system must locate the smallest partition that can accommodate the process, this approach could be
slower and less effective.
Worst-Fit Partitioning
With the worst-fit partitioning method, the process is allocated the biggest partition that can contain it.
Fragmentation could rise as a result of the decreased utilisation of partitions. In contrast to best-fit
partitioning, the operating system can quickly assign the biggest available partition to the process.
First-Fit Partitioning
With the first-fit partitioning method, the process is allocated the first available partition that can fit it.
This method is simple and efficient, however it could create greater fragmentation because certain small
partitions are left unused.
Next-Fit Partitioning
Instead of starting at the beginning while looking for a free partition, the next-fit partitioning method
starts from the most recently allocated partition. This can reduce fragmentation and improve efficiency
since fewer tiny divisions are likely to be left idle. An unequal distribution of memory may be the result
of long gaps between allocated partitions.
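A toy sketch contrasting the first-fit and best-fit placement strategies described above (the free partition sizes and the process size are invented):

def first_fit(partitions, size):
    # Return the index of the first free partition large enough for the process
    for i, free in enumerate(partitions):
        if free >= size:
            return i
    return None

def best_fit(partitions, size):
    # Return the index of the smallest free partition large enough for the process
    candidates = [(free, i) for i, free in enumerate(partitions) if free >= size]
    return min(candidates)[1] if candidates else None

free_partitions = [100, 500, 200, 300, 600]   # hypothetical free block sizes (in KB)
print(first_fit(free_partitions, 212))        # 1 -> the 500 KB block
print(best_fit(free_partitions, 212))         # 3 -> the 300 KB block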

Limitation and Challenges


While partitioning algorithms have numerous advantages, they can have serious drawbacks and
difficulties. The overhead of using partitioning algorithms in practice is one of the major obstacles.
Performance may suffer if a larger resource or problem requires more processing power and memory.
Another challenge is the complexity of managing several divisions. When the number of partitions
increases, it could be harder to manage and keep track of each operating system component. Moreover,
some partitioning techniques can require extra maintenance and configuration, which would add to the
total workload of the system administrators.
Finally, partitioning algorithms may involve performance and security trade-offs. As an illustration,
network partitioning, which isolates different network components from one another to improve
security, can also cause increased latency and lower network performance.

Hierarchical clustering in data mining


Hierarchical clustering refers to an unsupervised learning procedure that determines successive clusters
based on previously defined clusters. It works by grouping data into a tree of clusters. Hierarchical
clustering starts by treating each data point as an individual cluster. The endpoint is a set
of clusters, where each cluster is distinct from the others, and the objects within each cluster are
broadly similar to one another.

There are two types of hierarchical clustering


o Agglomerative Hierarchical Clustering
o Divisive Clustering
Agglomerative hierarchical clustering
Agglomerative clustering is one of the most common types of hierarchical clustering used to group
similar objects in clusters. Agglomerative clustering is also known as AGNES (Agglomerative Nesting).
In agglomerative clustering, each data point acts as an individual cluster, and at each step, data objects are
grouped in a bottom-up manner. Initially, each data object is in its own cluster. At each iteration, the clusters
are combined with other clusters until one cluster is formed.
Agglomerative hierarchical clustering algorithm
1. Determine the similarity between individuals and all other clusters. (Find proximity matrix).
2. Consider each data point as an individual cluster.
3. Combine similar clusters.
4. Recalculate the proximity matrix for each cluster.
5. Repeat step 3 and step 4 until you get a single cluster.
Let’s understand this concept with the help of graphical representation using a dendrogram.
With the help of the given demonstration, we can understand how the actual algorithm works. No
calculation is done below; all the proximities among the clusters are assumed.
Let's suppose we have six different data points P, Q, R, S, T, V.

Step 1:
Consider each alphabet (P, Q, R, S, T, V) as an individual cluster and find the distance between the
individual cluster from all other clusters.
Step 2:
Now, merge the comparable clusters into single clusters. Let's say cluster Q and cluster R are similar to
each other, and so are clusters S and T, so we merge them in the second step. We get the clusters [(P),
(QR), (ST), (V)].
Step 3:
Here, we recalculate the proximity as per the algorithm and combine the two closest clusters [(ST), (V)]
together to form new clusters as [(P), (QR), (STV)]
Step 4:
Repeat the same process. The clusters (QR) and (STV) are comparable and are combined together to form a
new cluster. Now we have [(P), (QRSTV)].
Step 5:
Finally, the remaining two clusters are merged together to form a single cluster [(PQRSTV)]
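A minimal sketch of agglomerative clustering with scikit-learn (the 2-D points are invented stand-ins for P, Q, R, S, T, V; in practice a dendrogram would be drawn with scipy):

from sklearn.cluster import AgglomerativeClustering

# Six invented 2-D points standing in for P, Q, R, S, T, V
points = [[1, 1], [1.2, 1.1], [1.1, 1.3], [5, 5], [5.2, 5.1], [9, 9]]

# Bottom-up merging continues until the requested number of clusters remains
model = AgglomerativeClustering(n_clusters=2, linkage="single")
labels = model.fit_predict(points)
print(labels)  # cluster label assigned to each point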

Divisive Hierarchical Clustering


Divisive hierarchical clustering is exactly the opposite of Agglomerative Hierarchical clustering. In
Divisive Hierarchical clustering, all the data points are initially considered as a single cluster, and in every
iteration, the data points that are not similar are separated from the cluster. The separated data points are
treated as individual clusters. Finally, we are left with N clusters.
Advantages of Hierarchical clustering
o It is simple to implement and gives the best output in some cases.
o It is easy and results in a hierarchy, a structure that contains more information.
o It does not need us to pre-specify the number of clusters.
Disadvantages of hierarchical clustering
o It breaks large clusters.
o It is difficult to handle different-sized clusters and convex shapes.
o It is sensitive to noise and outliers.
o Once a merge or split step has been performed, it can never be undone.

Density-based clustering in data mining

Density-based clustering refers to a method that is based on a local cluster criterion, such as density-
connected points. In this section, we will discuss density-based clustering with examples.
What is Density-based clustering?
Density-Based Clustering refers to one of the most popular unsupervised learning methodologies used
in model building and machine learning algorithms. The data points in the low-density region separating
two clusters are considered noise. The surroundings within a radius ε of a given object
are known as the ε-neighborhood of the object. If the ε-neighborhood of an object contains at least a
minimum number of objects, MinPts, then the object is called a core object.
Density-Based Clustering - Background
There are two parameters used in density-based clustering:
EPS: It is considered as the maximum radius of the neighborhood.
MinPts: MinPts refers to the minimum number of points in an Eps neighborhood of that point.
NEps(i) = { k ∈ D | dist(i, k) ≤ Eps }
Directly density reachable:
A point i is directly density-reachable from a point k with respect to Eps and MinPts if
i belongs to NEps(k).
Core point condition:
|NEps(k)| >= MinPts

Density reachable:
A point i is density-reachable from a point j with respect to Eps and MinPts if there is a
chain of points p1, …, pn with p1 = j and pn = i such that each pk+1 is directly density-reachable from pk.

Density connected:
A point i refers to density connected to a point j with respect to Eps, MinPts if there is a point o such
that both i and j are considered as density reachable from o with respect to Eps and MinPts.

Working of Density-Based Clustering

Suppose a set of objects is denoted by D'. We can say that an object i is directly density-reachable from
the object j only if it is located within the ε-neighborhood of j and j is a core object.
An object i is density-reachable from the object j with respect to ε and MinPts in a given set of objects
D' only if there is a chain of objects p1, …, pn with p1 = j and pn = i such that each pk+1 is directly
density-reachable from pk with respect to ε and MinPts.
An object i is density-connected to an object j with respect to ε and MinPts in a given set of objects D' only if
there is an object o belonging to D' such that both i and j are density-reachable from o with respect to
ε and MinPts.
Major Features of Density-Based Clustering
The primary features of Density-based clustering are given below.
o It is a scan method.
o It requires density parameters as a termination condition.
o It is used to manage noise in data clusters.
o Density-based clustering is used to identify clusters of arbitrary size.

Density-Based Clustering Methods


DBSCAN
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It depends on a
density-based notion of cluster. It also identifies clusters of arbitrary size in the spatial database with
outliers.
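A minimal sketch of running DBSCAN with scikit-learn, where eps and min_samples correspond to the Eps and MinPts parameters described above (the points are invented):

from sklearn.cluster import DBSCAN

points = [[1, 1], [1.1, 1.2], [0.9, 1.0], [5, 5], [5.1, 5.2], [12, 12]]

# eps = neighborhood radius, min_samples = MinPts; the label -1 marks noise points
model = DBSCAN(eps=0.5, min_samples=2)
labels = model.fit_predict(points)
print(labels)  # e.g. [0 0 0 1 1 -1]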

OPTICS
OPTICS stands for Ordering Points To Identify the Clustering Structure. It gives a significant order of
database with respect to its density-based clustering structure. The order of the cluster comprises
information equivalent to the density-based clustering related to a long range of parameter settings.
OPTICS methods are beneficial for both automatic and interactive cluster analysis, including
determining an intrinsic clustering structure.
DENCLUE
Density-based clustering proposed by Hinneburg and Keim. It enables a compact mathematical description of
arbitrarily shaped clusters in high-dimensional data, and it is good for data sets with a huge
amount of noise.

BIRCH in Data Mining


BIRCH (balanced iterative reducing and clustering using hierarchies) is an unsupervised data mining
algorithm that performs hierarchical clustering over large data sets. With modifications, it can also be
used to accelerate k-means clustering and Gaussian mixture modeling with the expectation-
maximization algorithm. An advantage of BIRCH is its ability to incrementally and dynamically cluster
incoming, multi-dimensional metric data points to produce the best quality clustering for a given set of
resources (memory and time constraints). In most cases, BIRCH only requires a single scan of the
database.

Its inventors claim BIRCH to be the "first clustering algorithm proposed in the database area to handle
'noise' (data points that are not part of the underlying pattern) effectively", beating DBSCAN by two
months. The BIRCH algorithm received the SIGMOD 10 year test of time award in 2006.
Basic clustering algorithms like K-means and agglomerative clustering are the most commonly used
clustering algorithms. But when performing clustering on very large datasets, BIRCH and DBSCAN are
the advanced clustering algorithms useful for performing precise clustering on large datasets. Moreover,
BIRCH is very useful because of its easy implementation. BIRCH first condenses the dataset into
small summaries, and it is these summaries that are then clustered; it does not directly cluster
the dataset. That is why BIRCH is often used with other clustering algorithms: after making the
summary, the summary can also be clustered by other clustering algorithms.
It is provided as an alternative to MiniBatchKMeans. It converts the data into a tree data structure with the
centroids being read off the leaves. These centroids can be the final cluster centroids or the input for
other clustering algorithms such as Agglomerative Clustering.
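A minimal sketch of this usage with scikit-learn's Birch estimator (the points are invented; n_clusters hands the CF-tree leaf sub-clusters to a final global clustering step):

from sklearn.cluster import Birch

points = [[1, 1], [1.2, 0.9], [0.8, 1.1], [8, 8], [8.1, 7.9], [7.9, 8.2]]

# threshold bounds the radius of each CF sub-cluster; n_clusters drives the global clustering step
model = Birch(threshold=0.5, n_clusters=2)
labels = model.fit_predict(points)
print(labels)                      # cluster label for each point
print(model.subcluster_centers_)   # centroids read off the CF-tree leaves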

The BIRCH clustering algorithm consists of two stages:


1. Building the CF Tree: BIRCH summarizes large datasets into smaller, dense regions called
Clustering Feature (CF) entries. Formally, a Clustering Feature entry is defined as an ordered
triple (N, LS, SS) where 'N' is the number of data points in the cluster, 'LS' is the linear sum of
the data points, and 'SS' is the squared sum of the data points in the cluster. A CF entry can be
composed of other CF entries. Optionally, we can condense this initial CF tree into a smaller CF tree.
2. Global Clustering: Applies an existing clustering algorithm on the leaves of the CF tree. A CF
tree is a tree where each leaf node contains a sub-cluster. Every entry in a CF tree contains a
pointer to a child node, and a CF entry made up of the sum of CF entries in the child nodes.
Optionally, we can refine these clusters.
Due to this two-step process, BIRCH is also called Two-Step Clustering.
Algorithm
The BIRCH algorithm builds a tree structure over the given data called the Clustering Feature tree (CF tree). The algorithm is based on CF (clustering feature) entries and uses this tree-structured summary to create clusters.

Within the CF tree, the algorithm compresses the data into sets of CF nodes. Nodes that contain several sub-clusters are called CF sub-clusters, and these CF sub-clusters sit in non-terminal (non-leaf) CF nodes.
The CF tree is a height-balanced tree that gathers and manages clustering features and holds the information about the given data needed for further hierarchical clustering. This avoids the need to work with the whole input data. Each cluster of data points is represented as a CF by three numbers (N, LS, SS).
o N = number of items in subclusters
o LS = vector sum of the data points
o SS = sum of the squared data points
There are mainly four phases which are followed by the algorithm of BIRCH.
o Scanning data into memory.
o Condense data (resize data).
o Global clustering.
o Refining clusters.
Two of these four phases (condensing the data and refining the clusters) are optional; they come into play when more accuracy is required. Scanning the data is essentially loading the data into the model: the algorithm scans the whole dataset and fits it into the CF trees.
In condensing, it resets and resizes the data so that it fits better into the CF tree. In global clustering, it sends the CF trees to an existing clustering algorithm. Finally, refining fixes the problem of CF trees in which points with the same value are assigned to different leaf nodes.
Cluster Features
BIRCH clustering achieves its high efficiency by clever use of a small set of summary statistics to
represent a larger set of data points. These summary statistics constitute a CF and represent a sufficient
substitute for the actual data for clustering purposes.
A CF is a set of three summary statistics representing a set of data points in a single cluster. These
statistics are as follows:
o Count [The number of data values in the cluster]
o Linear Sum [The sum of the individual coordinates. This is a measure of the location of the
cluster]
o Squared Sum [The sum of the squared coordinates. This is a measure of the spread of the cluster]
NOTE: The mean and variance of the cluster can be derived from the count, the linear sum, and the squared sum.
CF Tree
The building process of the CF Tree can be summarized in the following steps, such as:
Step 1: For each given record, BIRCH compares the location of that record with the location of each CF
in the root node, using either the linear sum or the mean of the CF. BIRCH passes the incoming record
to the root node CF closest to the incoming record.
Step 2: The record then descends down to the non-leaf child nodes of the root node CF selected in step
1. BIRCH compares the location of the record with the location of each non-leaf CF. BIRCH passes the
incoming record to the non-leaf node CF closest to the incoming record.
Step 3: The record then descends down to the leaf child nodes of the non-leaf node CF selected in step
2. BIRCH compares the location of the record with the location of each leaf. BIRCH tentatively passes
the incoming record to the leaf closest to the incoming record.
Step 4: Perform one of the following two actions:
1. If the radius of the chosen leaf, including the new record, does not exceed the threshold T, then the incoming record is assigned to that leaf. The leaf and its parent CFs are updated to account for the new data point.
2. If the radius of the chosen leaf, including the new record, exceeds the threshold T, then a new leaf is formed, consisting of the incoming record only. The parent CFs are updated to account for the new data point.
If step 4(2) is executed and the leaf node already holds the maximum of L leaf entries, the leaf node is split into two leaf nodes. If the parent node is full, the parent node is split as well, and so on. The two most distant leaf CFs are used as seeds for the new leaf nodes, with the remaining CFs assigned to whichever node is closer.
Note that the radius of a cluster may be calculated even without knowing the data points, as long as we
have the count n, the linear sum LS, and the squared sum SS. This allows BIRCH to evaluate whether a
given data point belongs to a particular sub-cluster without scanning the original data set.
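To make this concrete, the following small sketch (assuming SS is stored as a per-dimension sum of squares; some formulations keep a single scalar) shows how the centroid and radius of a sub-cluster can be derived and updated from (N, LS, SS) alone, without revisiting the original points.

# Sketch: deriving centroid and radius from a Clustering Feature (N, LS, SS).
import numpy as np

def cf_add(cf, x):
    """Absorb point x into the CF (N, LS, SS) without storing x itself."""
    n, ls, ss = cf
    return (n + 1, ls + x, ss + x**2)

def cf_centroid(cf):
    n, ls, _ = cf
    return ls / n

def cf_radius(cf):
    """Average distance of member points from the centroid, from summaries only."""
    n, ls, ss = cf
    centroid = ls / n
    # mean squared distance = E[x^2] - centroid^2, summed over dimensions
    msd = np.sum(ss / n - centroid**2)
    return np.sqrt(max(msd, 0.0))

points = np.array([[1.0, 2.0], [1.5, 2.2], [0.8, 1.9]])
cf = (0, np.zeros(2), np.zeros(2))
for p in points:
    cf = cf_add(cf, p)

print("centroid:", cf_centroid(cf))
print("radius  :", cf_radius(cf))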
Clustering the Sub-Clusters
Once the CF tree is built, any existing clustering algorithm may be applied to the sub-clusters (the CF
leaf nodes) to combine these sub-clusters into clusters. The task of clustering becomes much easier as
the number of sub-clusters is much less than the number of data points. When a new data value is added,
these statistics may be easily updated, thus making the computation more efficient.
Parameters of BIRCH
There are three parameters in this algorithm that need to be tuned. Unlike k-means, the number of clusters (k) does not have to be fixed up front if the final clustering step is skipped. A short scikit-learn sketch follows this list.
o Threshold: the maximum radius that a sub-cluster in a leaf node of the CF tree may have; a new point is absorbed into a sub-cluster only if the merged sub-cluster stays within this radius.
o branching_factor: This parameter specifies the maximum number of CF sub-clusters in each
node (internal node).
o n_clusters: The number of clusters to be returned after the entire BIRCH algorithm is complete, i.e., the number of clusters after the final clustering step. If set to None, the final clustering step is not performed and the intermediate sub-clusters are returned.
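A minimal usage sketch with scikit-learn's Birch estimator, whose threshold, branching_factor and n_clusters arguments correspond to the three parameters above (the dataset and parameter values are illustrative assumptions).

# BIRCH via scikit-learn's Birch estimator on synthetic blob data.
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=5, random_state=0)

# n_clusters=None would return the raw CF-tree sub-clusters (no global step);
# an integer triggers the final global clustering on the leaf sub-clusters.
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=5)
labels = birch.fit_predict(X)

print("sub-cluster centroids:", birch.subcluster_centers_.shape)
print("label counts:", {int(l): int((labels == l).sum()) for l in set(labels)})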
Advantages of BIRCH
It is local in that each clustering decision is made without scanning all data points and existing clusters.
It exploits the observation that the data space is not usually uniformly occupied, and not every data point
is equally important.
It uses available memory to derive the finest possible sub-clusters while minimizing I/O costs. It is also
an incremental method that does not require the whole data set in advance.

Basic Understanding of CURE Algorithm


CURE (Clustering Using Representatives)
• It is a hierarchical clustering technique that adopts a middle ground between the centroid-based and the all-point extremes. Hierarchical clustering starts with single-point clusters and keeps merging clusters until the desired number of clusters is formed.
• It can identify both spherical and non-spherical clusters.
• It is useful for discovering groups and identifying interesting distributions in the underlying data.
• Instead of using a single centroid point, as most data mining algorithms do, CURE uses a set of well-scattered representative points per cluster, which lets it handle clusters efficiently and eliminate outliers.

Representation of Clusters and Outliers


Six steps in CURE algorithm:

CURE Architecture

• Idea: A random sample, say 's', is drawn from the given data. This random sample is divided into, say, 'p' partitions of size s/p. Each partition is partially clustered into, say, s/(pq) clusters. Outliers are discarded/eliminated from these partially clustered partitions. The partially clustered partitions are then clustered again, and finally the data on disk is labelled with the resulting clusters.

Representation of partitioning and clustering

• Procedure:
1. Select a target number of representative points, say 'gfg'.
2. Choose 'gfg' well-scattered points in a cluster.
3. These scattered points are shrunk towards the centroid by a fixed fraction.
4. These points are used as representatives of the clusters in the 'Dmin' (minimum distance) cluster-merging approach: the minimum distance between the scattered points inside the sample 'gfg' and the points outside the 'gfg' sample is calculated, and the point with the smallest distance to a scattered point inside the sample is merged into the sample.
5. After every such merge, new representative points are selected for the new cluster.
6. Cluster merging continues until the target number of clusters, say 'k', is reached.

Here are some comparisons of clustering with categorical attributes:
• Categorical data clustering
This is a machine learning task that involves grouping objects with similar categorical attributes into
clusters. Categorical attributes are discrete values that lack a natural order or distance function. Most
classical clustering algorithms can't be directly applied to categorical data because of this.
• Clustering mixed data
Most clustering algorithms can only work with data that's either entirely categorical or
numerical. However, real-world datasets often contain both types of data.
• K-means vs K-modes
A study compared the k-means and k-modes algorithms for clustering numerical and qualitative
data. The study found that the k-modes algorithm produced better results and was faster than the k-
means algorithm.
• Graph-based representation
A framework for clustering categorical data uses a graph-based representation method to learn how to
represent categorical values. The framework compares the proposed method with other representation
methods on benchmark datasets.
• Object-cluster similarity
A framework for clustering mixed data uses the concept of object-cluster similarity to create a unified
similarity metric. This metric can be applied to data with categorical, numerical, and mixed attributes.

Unit 5
Association Rule Learning
Association rule learning is a type of unsupervised learning technique that checks how one data item depends on another data item and maps those dependencies so that they can be exploited, for example to increase sales. It tries to find interesting relations or associations among the variables of a dataset, using rule-based measures to discover the relations between variables in the database.
Association rule learning is one of the important concepts of machine learning, and it is employed in market basket analysis, web usage mining, continuous production, etc. Market basket analysis is a technique used by large retailers to discover associations between items. We can understand it with the example of a supermarket, where products that are frequently purchased together are placed together.
For example, if a customer buys bread, he will most likely also buy butter, eggs, or milk, so these products are stored on the same shelf or nearby. Consider the below diagram:

Association rule learning can be divided into three types of algorithms:


1. Apriori
2. Eclat
3. F-P Growth Algorithm
We will understand these algorithms in later chapters.
How does Association Rule Learning work?
Association rule learning works on the concept of if/then rules, such as "if A then B".

Here the 'if' element is called the antecedent, and the 'then' element is called the consequent. Relationships in which an association is found between two single items are said to have single cardinality. Rule mining is all about creating such rules, and as the number of items increases, the cardinality increases accordingly. So, to measure the associations between thousands of data items, several metrics are used. These metrics are given below:
o Support
o Confidence
o Lift
Let's understand each of them:
Support
Support is the frequency of A, or how frequently an itemset appears in the dataset. It is defined as the fraction of the transactions T that contain the itemset X. For an itemset X and a total of T transactions, it can be written as:
Support(X) = Frequency(X) / T
Confidence
Confidence indicates how often the rule has been found to be true, i.e., how often the items X and Y occur together in the dataset given that X occurs. It is the ratio of the number of transactions that contain both X and Y to the number of transactions that contain X:
Confidence(X -> Y) = Frequency(X ∪ Y) / Frequency(X)
Lift
It is the strength of a rule, defined by the formula below:
Lift(X -> Y) = Support(X ∪ Y) / (Support(X) × Support(Y))
It is the ratio of the observed support measure and expected support if X and Y are independent of each
other. It has three possible values:
o If Lift = 1: the occurrence of the antecedent and the consequent are independent of each other.
o Lift > 1: the two itemsets are positively dependent on each other.
o Lift < 1: one item is a substitute for the other, which means one item has a negative effect on the other.
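As a concrete illustration, the three metrics can be computed directly from a list of transactions; the following is a minimal sketch with hypothetical transaction data, not tied to any particular library.

# Sketch: computing support, confidence and lift for a rule X -> Y directly
# from a list of transactions.
def rule_metrics(transactions, X, Y):
    X, Y = set(X), set(Y)
    n = len(transactions)
    count_x  = sum(1 for t in transactions if X <= set(t))
    count_y  = sum(1 for t in transactions if Y <= set(t))
    count_xy = sum(1 for t in transactions if (X | Y) <= set(t))
    support    = count_xy / n                    # freq(X ∪ Y) / total transactions
    confidence = count_xy / count_x              # freq(X ∪ Y) / freq(X)
    lift       = confidence / (count_y / n)      # confidence / support(Y)
    return support, confidence, lift

transactions = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "eggs"],
    ["butter", "milk"],
    ["bread", "butter", "eggs"],
]
print(rule_metrics(transactions, X=["bread"], Y=["butter"]))
# bread appears in 4 of 5 transactions, bread+butter in 3 -> confidence 0.75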
Types of Association Rule Learning
Association rule learning can be divided into three algorithms:
Apriori Algorithm
This algorithm uses frequent itemsets to generate association rules. It is designed to work on databases that contain transactions. It uses a breadth-first search and a hash tree to compute the itemsets efficiently.
It is mainly used for market basket analysis and helps to understand the products that can be bought
together. It can also be used in the healthcare field to find drug reactions for patients.
Eclat Algorithm
Eclat algorithm stands for Equivalence Class Transformation. This algorithm uses a depth-first search
technique to find frequent itemsets in a transaction database. It performs faster execution than Apriori
Algorithm.
F-P Growth Algorithm
The F-P growth algorithm stands for Frequent Pattern growth, and it is an improved version of the Apriori algorithm. It represents the database in the form of a tree structure known as a frequent pattern tree (FP-tree). The purpose of this tree is to extract the most frequent patterns.
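A hedged sketch of how these algorithms are commonly run in practice, here using the third-party mlxtend library (an assumption; installed with pip install mlxtend, and its API may vary slightly across versions). apriori and fpgrowth return frequent itemsets in the same format, so either can feed association_rules.

# Frequent itemsets and rules with mlxtend (library assumed installed).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpgrowth, association_rules

transactions = [
    ["milk", "bread", "butter"],
    ["bread", "butter"],
    ["milk", "bread"],
    ["milk", "butter", "eggs"],
    ["bread", "butter", "eggs"],
]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

frequent = apriori(onehot, min_support=0.4, use_colnames=True)
# frequent = fpgrowth(onehot, min_support=0.4, use_colnames=True)  # same result, usually faster

rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])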
Applications of Association Rule Learning
It has various applications in machine learning and data mining. Below are some popular applications of
association rule learning:
o Market Basket Analysis: It is one of the popular examples and applications of association rule
mining. This technique is commonly used by big retailers to determine the association between
items.
o Medical Diagnosis: With the help of association rules, patients can be cured easily, as it helps in
identifying the probability of illness for a particular disease.
o Protein Sequence: The association rules help in determining the synthesis of artificial Proteins.
o It is also used for the Catalog Design and Loss-leader Analysis and many more other
applications.

Frequent Item set in Data set (Association Rule Mining)


INTRODUCTION:
1. Frequent item sets are a fundamental concept in association rule mining, which is a technique used in data mining to discover relationships between items in a dataset. The goal of association rule mining is to identify sets of items that frequently occur together and the relationships between them.
2. A frequent item set is a set of items that occur together frequently in a dataset. The frequency of
an item set is measured by the support count, which is the number of transactions or records in
the dataset that contain the item set. For example, if a dataset contains 100 transactions and the

item set {milk, bread} appears in 20 of those transactions, the support count for {milk, bread} is
20.
3. Association rule mining algorithms, such as Apriori or FP-Growth, are used to find frequent item
sets and generate association rules. These algorithms work by iteratively generating candidate
item sets and pruning those that do not meet the minimum support threshold. Once the frequent
item sets are found, association rules can be generated by using the concept of confidence, which
is the ratio of the number of transactions that contain the item set and the number of transactions
that contain the antecedent (left-hand side) of the rule.
4. Frequent item sets and association rules can be used for a variety of tasks such as market basket
analysis, cross-selling and recommendation systems. However, it should be noted that
association rule mining can generate a large number of rules, many of which may be irrelevant
or uninteresting. Therefore, it is important to use appropriate measures such as lift and
conviction to evaluate the interestingness of the generated rules.
Association Mining searches for frequent items in the data set. In frequent mining usually, interesting
associations and correlations between item sets in transactional and relational databases are found. In
short, Frequent Mining shows which items appear together in a transaction or relationship.
Need of Association Mining: Frequent mining is the generation of association rules from a
Transactional Dataset. If two items X and Y are purchased frequently together, then it is good to put them together in stores or to provide a discount on one item with the purchase of the other. This can really increase sales. For example, it is likely that if a customer buys milk and bread he/she also buys butter. So the association rule is {milk, bread} => {butter}, and the seller can suggest that the customer buy butter if he/she buys milk and bread.
Important Definitions :
• Support: It is one of the measures of interestingness. It tells about the usefulness and certainty of rules. A 5% support means that 5% of the total transactions in the database follow the rule.
Support(A -> B) = Support_count(A ∪ B) / Total number of transactions
• Confidence: A confidence of 60% means that 60% of the customers who purchased milk and bread also bought butter.
Confidence(A -> B) = Support_count(A ∪ B) / Support_count(A)
If a rule satisfies both minimum support and minimum confidence, it is a strong rule.
• Support_count(X): Number of transactions in which X appears. If X is A union B then it is the
number of transactions in which A and B both are present.
• Maximal Itemset: An itemset is maximal frequent if none of its supersets are frequent.
• Closed Itemset: An itemset is closed if none of its immediate supersets have the same support count as the itemset.
• K- Itemset: Itemset which contains K items is a K-itemset. So it can be said that an itemset is
frequent if the corresponding support count is greater than the minimum support count.
Example on finding frequent itemsets – Consider the given dataset with its transactions.

• Let's say the minimum support count is 3.
• The relation that holds is: maximal frequent => closed => frequent.

Apriori Algorithm
The Apriori algorithm is used to calculate association rules between objects, i.e., how two or more objects are related to one another. In other words, the Apriori algorithm is an association rule learning algorithm that analyzes, for example, whether people who bought product A also bought product B.
The primary objective of the apriori algorithm is to create the association rule between different objects.
The association rule describes how two or more objects are related to one another. Apriori algorithm is
also called frequent pattern mining. Generally, you operate the Apriori algorithm on a database that
consists of a huge number of transactions. Let's understand the apriori algorithm with the help of an
example; suppose you go to Big Bazar and buy different products. It helps the customers buy their
products with ease and increases the sales performance of the Big Bazar. In this tutorial, we will discuss
the apriori algorithm with examples.
Introduction
We take an example to understand the concept better. You must have noticed that the Pizza shop seller
makes a pizza, soft drink, and breadstick combo together. He also offers a discount to their customers
who buy these combos. Do you ever think why does he do so? He thinks that customers who buy pizza
also buy soft drinks and breadsticks. However, by making combos, he makes it easy for the customers.
At the same time, he also increases his sales performance.
Similarly, you go to Big Bazar, and you will find biscuits, chips, and Chocolate bundled together. It
shows that the shopkeeper makes it comfortable for the customers to buy these products in the same
place.
The above two examples are the best examples of Association Rules in Data Mining. It helps us to learn
the concept of apriori algorithms.
What is Apriori Algorithm?
The Apriori algorithm is an algorithm used to mine frequent itemsets and the relevant association rules. Generally, the Apriori algorithm operates on a database containing a huge number of transactions, for example the items customers buy at a Big Bazar.
Apriori algorithm helps the customers to buy their products with ease and increases the sales
performance of the particular store.
Components of Apriori algorithm
The given three components comprise the apriori algorithm.
1. Support
2. Confidence
3. Lift
Let's take an example to understand this concept.
We have already discussed above that you need a huge database containing a large number of transactions. Suppose you have 4000 customer transactions in a Big Bazar. You have to calculate the Support, Confidence, and Lift for two products, say Biscuits and Chocolate, because customers frequently buy these two items together.
Out of 4000 transactions, 400 contain Biscuits, whereas 600 contain Chocolate, and 200 transactions contain both Biscuits and Chocolate. Using this data, we will find out the support, confidence, and lift.
Support
Support refers to the default popularity of any product. You find the support as a quotient of the division
of the number of transactions comprising that product by the total number of transactions. Hence, we get
Support (Biscuits) = (Transactions relating biscuits) / (Total transactions)
= 400/4000 = 10 percent.
Confidence
Confidence refers to the likelihood that customers who bought biscuits also bought chocolates. So, you need to divide the number of transactions that contain both biscuits and chocolates by the number of transactions that contain biscuits to get the confidence.
Hence,

Confidence = (Transactions relating both biscuits and Chocolate) / (Total transactions involving
Biscuits)
= 200/400
= 50 percent.
It means that 50 percent of customers who bought biscuits bought chocolates also.
Lift
Considering the above example, lift refers to the increase in the ratio of the sale of chocolates when you sell biscuits. The lift of the rule Biscuits -> Chocolates is:
Lift = Confidence(Biscuits -> Chocolates) / Support(Chocolates)
= 50 / 15 ≈ 3.3
(Support(Chocolates) = 600/4000 = 15 percent.)
It means that people are about 3.3 times more likely to buy biscuits and chocolates together than would be expected if the two purchases were independent. If the lift value is below one, it means that people are unlikely to buy both items together; the larger the value, the better the combination.
How does the Apriori Algorithm work in Data Mining?
We will understand this algorithm with the help of an example
Consider a Big Bazar scenario where the product set is P = {Rice, Pulse, Oil, Milk, Apple}. The
database comprises six transactions where 1 represents the presence of the product and 0 represents the
absence of the product.
Transaction ID Rice Pulse Oil Milk Apple
t1 1 1 1 0 0
t2 0 1 1 1 0
t3 0 0 0 1 1
t4 1 1 0 1 0
t5 1 1 1 0 1
t6 1 1 1 1 1
The Apriori Algorithm makes the given assumptions
o All subsets of a frequent itemset must be frequent.
o All supersets of an infrequent itemset must be infrequent.
o Fix a threshold support level. In our case, we have fixed it at 50 percent.
Step 1
Make a frequency table of all the products that appear in the transactions. Now, shortlist the frequency table to include only those products with a support level above the 50 percent threshold. We get the following frequency table.
Product Frequency (Number of transactions)
Rice (R) 4
Pulse(P) 5
Oil(O) 4
Milk(M) 4
The above table indicates the products frequently bought by the customers.
Step 2
Create pairs of products such as RP, RO, RM, PO, PM, OM. You will get the given frequency table.
Itemset Frequency (Number of transactions)
RP 4
RO 3
RM 2
PO 4
PM 3
OM 2
Step 3
Apply the same threshold support of 50 percent and consider the itemsets that meet it, i.e., those with a frequency of at least 3.
Thus, we get RP, RO, PO, and PM.
Step 4
Now, look for a set of three products that the customers buy together. We get the given combination.
1. RP and RO give RPO
2. PO and PM give POM
Step 5
Calculate the frequency of these two itemsets, and you will get the following frequency table.
Itemset Frequency (Number of transactions)
RPO 3
POM 2
If you implement the threshold assumption, you can figure out that the customers' set of three products
is RPO.
We have considered an easy example to discuss the apriori algorithm in data mining. In reality, you find
thousands of such combinations.
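The counts in this worked example can be re-derived programmatically from the 0/1 table; the following is a minimal pandas sketch, assuming a minimum support count of 3 (50 percent of the 6 transactions). It enumerates every itemset of size 1 to 3 that meets the count.

# Recompute itemset frequencies from the 0/1 transaction table above.
import pandas as pd
from itertools import combinations

data = pd.DataFrame(
    [[1, 1, 1, 0, 0],
     [0, 1, 1, 1, 0],
     [0, 0, 0, 1, 1],
     [1, 1, 0, 1, 0],
     [1, 1, 1, 0, 1],
     [1, 1, 1, 1, 1]],
    columns=["Rice", "Pulse", "Oil", "Milk", "Apple"],
    index=["t1", "t2", "t3", "t4", "t5", "t6"],
)

min_count = 3  # 50 percent of 6 transactions
for k in (1, 2, 3):
    for items in combinations(data.columns, k):
        # a transaction supports the itemset if all k products are present
        count = int((data[list(items)].sum(axis=1) == k).sum())
        if count >= min_count:
            print(items, count)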
How to improve the efficiency of the Apriori Algorithm?
There are various methods used for the efficiency of the Apriori algorithm
Hash-based itemset counting
In hash-based itemset counting, a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent and is excluded as an infrequent itemset.
Transaction Reduction
In transaction reduction, a transaction that does not contain any frequent k-itemset is of no value in subsequent scans and can be removed.
Apriori Algorithm in data mining
We have already discussed an example of the apriori algorithm related to the frequent itemset
generation. Apriori algorithm has many applications in data mining.
The primary requirements to find the association rules in data mining are given below.
Use Brute Force
Analyze all the rules and find the support and confidence levels for the individual rule. Afterward,
eliminate the values which are less than the threshold support and confidence levels.
The two-step approaches
The two-step approach is a better option to find the associations rules than the Brute Force method.
Step 1
In this article, we have already discussed how to create the frequency table and calculate itemsets
having a greater support value than that of the threshold support.
Step 2
To create association rules, you need to use a binary partition of the frequent itemsets. You need to
choose the ones having the highest confidence levels.
In the above example, you can see that the RPO combination was the frequent itemset. Now, we find
out all the rules using RPO.
RP -> O, RO -> P, PO -> R, O -> RP, P -> RO, R -> PO
You can see that there are six different combinations. Therefore, if a frequent itemset has n elements, there will be 2^n - 2 candidate association rules.
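A small sketch that enumerates these candidate rules by taking every non-empty proper subset of a frequent itemset as the antecedent; for the 3-itemset RPO it yields exactly the six rules listed above.

# Enumerate the 2**n - 2 candidate rules of one frequent itemset.
from itertools import combinations

def candidate_rules(itemset):
    items = set(itemset)
    rules = []
    for r in range(1, len(items)):                 # antecedent sizes 1 .. n-1
        for antecedent in combinations(sorted(items), r):
            consequent = items - set(antecedent)
            rules.append((set(antecedent), consequent))
    return rules

for lhs, rhs in candidate_rules({"R", "P", "O"}):
    print(sorted(lhs), "->", sorted(rhs))
# 2**3 - 2 = 6 rules, matching the list above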
Advantages of Apriori Algorithm
o It is used to calculate large itemsets.
o Simple to understand and apply.
Disadvantages of Apriori Algorithms
o The Apriori algorithm is an expensive method to find support since the calculation has to pass through the whole database.
o Sometimes it generates a huge number of candidate itemsets and rules, so it becomes computationally more expensive.

Introduction to Dimensionality Reduction Technique


What is Dimensionality Reduction?
The number of input features, variables, or columns present in a given dataset is known as
dimensionality, and the process to reduce these features is called dimensionality reduction.
In many cases a dataset contains a huge number of input features, which makes the predictive modeling task more complicated. Because it is very difficult to visualize or make predictions for a training dataset with a high number of features, dimensionality reduction techniques are required in such cases.
Dimensionality reduction technique can be defined as, "It is a way of converting the higher dimensions
dataset into lesser dimensions dataset ensuring that it provides similar information." These
techniques are widely used in machine learning for obtaining a better fit predictive model while solving
the classification and regression problems.
It is commonly used in the fields that deal with high-dimensional data, such as speech recognition,
signal processing, bioinformatics, etc. It can also be used for data visualization, noise reduction,
cluster analysis, etc.

The Curse of Dimensionality


Handling high-dimensional data is very difficult in practice, a problem commonly known as the curse of dimensionality. As the dimensionality of the input dataset increases, any machine learning algorithm and model becomes more complex. As the number of features increases, the number of samples needed to generalize well also increases, and the chance of overfitting increases. If a machine learning model is trained on high-dimensional data, it can become overfitted and perform poorly.
Hence, it is often required to reduce the number of features, which can be done with dimensionality
reduction.
Benefits of applying Dimensionality Reduction
Some benefits of applying dimensionality reduction technique to the given dataset are given below:
o By reducing the dimensions of the features, the space required to store the dataset also gets
reduced.
o Less Computation training time is required for reduced dimensions of features.
o Reduced dimensions of features of the dataset help in visualizing the data quickly.
o It removes the redundant features (if present) by taking care of multicollinearity.
Disadvantages of dimensionality Reduction
There are also some disadvantages of applying the dimensionality reduction, which are given below:
o Some data may be lost due to dimensionality reduction.
o In the PCA dimensionality reduction technique, sometimes the principal components required to
consider are unknown.
Approaches of Dimension Reduction
There are two ways to apply the dimension reduction technique, which are given below:
Feature Selection
Feature selection is the process of selecting the subset of the relevant features and leaving out the
irrelevant features present in a dataset to build a model of high accuracy. In other words, it is a way of
selecting the optimal features from the input dataset.
Three methods are used for the feature selection:
1. Filters Methods
In this method, the dataset is filtered, and a subset that contains only the relevant features is taken. Some
common techniques of filters method are:
o Correlation
o Chi-Square Test
o ANOVA
o Information Gain, etc.
2. Wrappers Methods
The wrapper method has the same goal as the filter method, but it uses a machine learning model for its evaluation. In this method, some features are fed to the ML model and its performance is evaluated. The performance decides whether to add or remove those features to increase the accuracy of the model. This method is more accurate than filtering but more complex to work with. Some common techniques of wrapper methods are:
o Forward Selection
o Backward Selection
o Bi-directional Elimination
3. Embedded Methods: Embedded methods check the different training iterations of the machine
learning model and evaluate the importance of each feature. Some common techniques of Embedded
methods are:
o LASSO
o Elastic Net
o Ridge Regression, etc.

Feature Extraction:
Feature extraction is the process of transforming the space containing many dimensions into space with
fewer dimensions. This approach is useful when we want to keep the whole information but use fewer
resources while processing the information.
Some common feature extraction techniques are:
1. Principal Component Analysis
2. Linear Discriminant Analysis
3. Kernel PCA
4. Quadratic Discriminant Analysis

Common techniques of Dimensionality Reduction


1. Principal Component Analysis
2. Backward Elimination
3. Forward Selection
4. Score comparison
5. Missing Value Ratio
6. Low Variance Filter
7. High Correlation Filter
8. Random Forest
9. Factor Analysis
10. Auto-Encoder

Principal Component Analysis (PCA)


Principal Component Analysis is a statistical process that converts the observations of correlated
features into a set of linearly uncorrelated features with the help of orthogonal transformation. These
new transformed features are called the Principal Components. It is one of the popular tools that is
used for exploratory data analysis and predictive modeling.
PCA works by considering the variance of each attribute, because attributes with high variance indicate a good split between the classes, and hence it reduces the dimensionality. Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing the power allocation in various communication channels.
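A minimal PCA sketch with scikit-learn, projecting the 4-dimensional Iris data onto its first two principal components (the dataset and component count are illustrative choices, not prescribed by the text).

# Project Iris onto its first two principal components.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print("reduced shape:", X_reduced.shape)                       # (150, 2)
print("explained variance ratio:", pca.explained_variance_ratio_)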
Backward Feature Elimination
The backward feature elimination technique is mainly used while developing Linear Regression or
Logistic Regression model. Below steps are performed in this technique to reduce the dimensionality or
in feature selection:
o In this technique, firstly, all the n variables of the given dataset are taken to train the model.
o The performance of the model is checked.
o Now we will remove one feature each time and train the model on n-1 features for n times, and
will compute the performance of the model.
o We will check the variable that has made the smallest or no change in the performance of the
model, and then we will drop that variable or features; after that, we will be left with n-1
features.
o Repeat the complete process until no feature can be dropped.
In this technique, by selecting the optimum performance of the model and the maximum tolerable error rate, we can define the optimal number of features required for the machine learning algorithm.
Forward Feature Selection
Forward feature selection follows the inverse process of the backward elimination process. It means, in
this technique, we don't eliminate the feature; instead, we will find the best features that can produce the
highest increase in the performance of the model. Below steps are performed in this technique:
o We start with a single feature only, and progressively we will add each feature at a time.
o Here we will train the model on each feature separately.
o The feature with the best performance is selected.
o The process is repeated until adding features no longer gives a significant increase in the performance of the model. A combined sketch of forward and backward selection follows this list.
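Both procedures can be sketched with scikit-learn's SequentialFeatureSelector (available in recent scikit-learn versions); the wrapped estimator, dataset, and number of selected features below are illustrative assumptions.

# Forward and backward feature selection with a wrapped estimator.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

forward = SequentialFeatureSelector(model, n_features_to_select=5,
                                    direction="forward", cv=3).fit(X, y)
backward = SequentialFeatureSelector(model, n_features_to_select=5,
                                     direction="backward", cv=3).fit(X, y)

print("forward picks :", forward.get_support(indices=True))
print("backward picks:", backward.get_support(indices=True))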
Missing Value Ratio
If a dataset has too many missing values, then we drop those variables as they do not carry much useful
information. To perform this, we can set a threshold level, and if a variable has more missing values than that threshold, we will drop that variable. The lower the threshold value, the more aggressive the reduction.
Low Variance Filter
As with the missing value ratio technique, data columns with very little variation carry little information. Therefore, we calculate the variance of each variable, and all data columns with a variance lower than a given threshold are dropped, because low-variance features will not affect the target variable.
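A minimal sketch of the low variance filter using scikit-learn's VarianceThreshold; the toy matrix and the threshold value are illustrative assumptions.

# Drop columns whose variance falls below the threshold.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0.0, 2.1, 1.0],
              [0.0, 2.0, 3.5],
              [0.1, 1.9, 0.2],
              [0.0, 2.2, 4.8]])   # first two columns are nearly constant

selector = VarianceThreshold(threshold=0.05)
X_reduced = selector.fit_transform(X)

print("kept columns:", selector.get_support(indices=True))   # [2]
print(X_reduced)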
High Correlation Filter
High correlation refers to the case when two variables carry approximately the same information; this can degrade the performance of the model. The correlation between independent numerical variables is measured with the correlation coefficient, and if this value is higher than a chosen threshold, one of the two variables can be removed from the dataset. We should prefer to keep the variable that shows the higher correlation with the target variable.
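A minimal pandas sketch of a high correlation filter: for every pair of variables whose absolute correlation exceeds a chosen threshold, one of the two is dropped (the synthetic data and the 0.9 threshold are assumptions).

# Drop one variable from every highly correlated pair.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"income": rng.normal(50, 10, 200)})
df["spending"] = df["income"] * 0.8 + rng.normal(0, 2, 200)   # highly correlated with income
df["age"] = rng.integers(18, 65, 200)

threshold = 0.9
corr = df.corr().abs()
# keep only the upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]

print("dropping:", to_drop)          # likely ['spending']
reduced = df.drop(columns=to_drop)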
Random Forest
Random Forest is a popular and very useful feature selection algorithm in machine learning. This
algorithm contains an in-built feature importance package, so we do not need to program it separately.
In this technique, we need to generate a large set of trees against the target variable, and with the help of
usage statistics of each attribute, we need to find the subset of features.
The random forest algorithm takes only numerical variables, so the input data needs to be converted into numeric form using one-hot encoding.
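A minimal sketch of random-forest-based feature selection with scikit-learn, where the built-in feature_importances_ attribute provides the importance scores mentioned above (the dataset and number of trees are illustrative).

# Rank features by random-forest importance scores.
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

data = load_wine()
X, y = data.data, data.target

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

importances = pd.Series(forest.feature_importances_, index=data.feature_names)
print(importances.sort_values(ascending=False).head(5))   # top-5 features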
Factor Analysis
Factor analysis is a technique in which each variable is kept within a group according to the correlation
with other variables, it means variables within a group can have a high correlation between themselves,
but they have a low correlation with variables of other groups.
We can understand it with an example: suppose we have two variables, Income and Spending. These two variables have a high correlation, which means people with high income spend more, and vice versa.

So, such variables are put into a group, and that group is known as the factor. The number of these
factors will be reduced as compared to the original dimension of the dataset.
Auto-encoders
One of the popular methods of dimensionality reduction is the auto-encoder, which is a type of ANN or artificial neural network whose main aim is to copy its inputs to its outputs. The input is compressed into a latent-space representation, and the output is reconstructed from this representation. It has two main parts (a minimal sketch follows the list below):
o Encoder: The function of the encoder is to compress the input to form the latent-space
representation.
o Decoder: The function of the decoder is to recreate the output from the latent-space
representation.
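A minimal auto-encoder sketch using Keras (assuming TensorFlow is installed); the layer sizes and the random stand-in data are illustrative assumptions, not a prescribed architecture.

# A single-hidden-layer autoencoder: 784 -> 32 -> 784.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

input_dim, latent_dim = 784, 32

inputs = keras.Input(shape=(input_dim,))
encoded = layers.Dense(latent_dim, activation="relu")(inputs)      # encoder
decoded = layers.Dense(input_dim, activation="sigmoid")(encoded)   # decoder

autoencoder = keras.Model(inputs, decoded)
encoder = keras.Model(inputs, encoded)      # reusable for dimensionality reduction
autoencoder.compile(optimizer="adam", loss="mse")

# Random stand-in data; in practice this would be e.g. flattened images scaled to [0, 1].
X = np.random.rand(1024, input_dim).astype("float32")
autoencoder.fit(X, X, epochs=5, batch_size=64, verbose=0)

X_reduced = encoder.predict(X, verbose=0)   # 32-dimensional representation
print(X_reduced.shape)                      # (1024, 32)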

Spatial data mining


Spatial data mining refers to the process of extracting knowledge, spatial relationships, and interesting patterns that are not explicitly stored in a spatial database; on the other hand, temporal data mining refers to the process of extracting knowledge about the occurrence of events, i.e., whether they follow random, cyclic, or seasonal variation, etc. Spatial refers to space, whereas temporal refers to time.
What is Spatial Data Mining?
The emergence of spatial data and extensive usage of spatial databases has led to spatial knowledge
discovery. Spatial data mining can be understood as a process that discovers interesting and potentially valuable patterns from spatial databases.
Several tools are there that assist in extracting information from geospatial data. These tools play a vital
role for organizations like NASA, the National Imagery and Mapping Agency (NIMA), the National
Cancer Institute (NCI), and the United States Department of Transportation (USDOT), which tend to make big decisions based on large spatial datasets.
Earlier, general-purpose data mining tools such as Clementine, See5/C5.0, and Enterprise Miner were used. These tools were utilized to analyze large commercial databases and were mainly designed for understanding the buying patterns of customers from the database.
Besides, the general-purpose tools were preferably used to analyze scientific and engineering data,
astronomical data, multimedia data, genomic data, and web data.
The following specific features of geographical data prevent the use of general-purpose data mining algorithms:
1. spatial relationships among the variables,
2. spatial structure of errors
3. observations that are not independent
4. spatial autocorrelation among the features
5. non-linear interaction in feature space.
Spatial data must have latitude or longitude, UTM easting or northing, or some other coordinates
denoting a point's location in space. Beyond that, spatial data can contain any number of attributes
pertaining to a place. You can choose the types of attributes you want to describe a place. Government
websites provide a resource by offering spatial data, but you need not be limited to what they have
produced. You can produce your own.
Say, for example, you wanted to log information about every location you've visited in the past week.
This might be useful to provide insight into your daily habits. You could capture your destination's
coordinates and list a number of attributes such as place name, the purpose of visit, duration of visit, and
more. You can then create a shapefile in Quantum GIS or similar software with this information and use
the software to query and visualize the data. For example, you could generate a heatmap of the most
visited places or select all places you've visited within a radius of 8 miles from home.
Any data can be made spatial if it can be linked to a location, and one can even have spatiotemporal data
linked to locations in both space and time. For example, when geolocating tweets from Twitter in the
aftermath of a disaster, an animation might be generated that shows the spread of tweets from the
epicentre of the event.
Spatial data mining tasks
These are the primary tasks of spatial data mining.

Classification:
Classification determines a set of rules which find the class of the specified object as per its attributes.
Association rules:
Association rules determine rules from the data set and describe patterns that frequently occur in the database.
Characteristic rules:
Characteristic rules describe some parts of the data set.
Discriminant rules:
As the name suggests, discriminant rules describe the differences between two parts of the database, such as calculating the difference between two cities with respect to employment rate.

What is temporal data mining?


Temporal data mining refers to the process of extraction of non-trivial, implicit, and potentially
important data from huge sets of temporal data. Temporal data are sequences of a primary data type,
usually numerical values, and it deals with gathering useful knowledge from temporal data.
With the increase in stored data, interest in finding hidden patterns has grown enormously in the last decade. The discovery of hidden data has primarily been focused on classifying data, finding relationships, and clustering data. A major difficulty in the discovery process is treating data with temporal dependencies: the attributes related to the temporal data present in this type of dataset must be treated differently from other types of attributes. Nevertheless, most data mining techniques treat temporal data as an unordered collection of events, ignoring its temporal ordering.

Temporal data mining tasks


o Data characterization and comparison
o Cluster Analysis
o Classification
o Association rules
o Prediction and trend analysis
o Pattern Analysis
Difference between spatial and Temporal data mining

o Spatial data mining refers to the extraction of knowledge, spatial relationships, and interesting patterns that are not explicitly stored in a spatial database; temporal data mining refers to the extraction of knowledge about the occurrence of events, whether they follow random, cyclic, or seasonal variation, etc.
o Spatial data mining is concerned with space; temporal data mining is concerned with time.
o Spatial data mining primarily deals with spatial data such as location and geo-referenced data; temporal data mining primarily deals with implicit and explicit temporal content from huge sets of data.
o Spatial data mining involves characteristic rules, discriminant rules, evaluation rules, and association rules; temporal data mining targets mining new patterns and unknown knowledge that takes the temporal aspects of the data into account.
o Examples: in spatial data mining, finding hotspots or unusual locations; in temporal data mining, an association rule such as "any person who buys a motorcycle also buys a helmet" becomes, with the temporal aspect, "any person who buys a motorcycle also buys a helmet after that."

Data Mining- World Wide Web

Over the last few years, the World Wide Web has become a significant source of information and, at the same time, a popular platform for business. Web mining can be defined as the method of utilizing data mining techniques and algorithms to extract useful information directly from the web, such as Web documents and services, hyperlinks, Web content, and server logs. The World Wide Web contains a huge amount of data that provides a rich source for data mining. The objective of Web mining is to look for patterns in Web data by collecting and examining data in order to gain insights.

What is Web Mining?


Web mining can widely be seen as the application of adapted data mining techniques to the web,
whereas data mining is defined as the application of the algorithm to discover patterns on mostly
structured data embedded into a knowledge discovery process. Web mining has the distinctive property of providing a set of various data types. The web has multiple aspects that yield different approaches for the mining process: web pages consist of text, web pages are linked via hyperlinks, and user activity can be monitored via web server logs. These three features lead to the differentiation between three areas: web content mining, web structure mining, and web usage mining.
There are three types of web mining:

1. Web Content Mining:
Web content mining can be used to extract useful data, information, knowledge from the web page
content. In web content mining, each web page is considered as an individual document. The individual
can take advantage of the semi-structured nature of web pages, as HTML provides information that
concerns not only the layout but also logical structure. The primary task of content mining is data
extraction, where structured data is extracted from unstructured websites. The objective is to facilitate
data aggregation over various web sites by using the extracted structured data. Web content mining can
be utilized to distinguish topics on the web. For Example, if any user searches for a specific task on the
search engine, then the user will get a list of suggestions.
2. Web Structured Mining:
Web structure mining is used to discover the link structure of hyperlinks, i.e., to identify how data and web pages are linked within the web's link network. In web structure mining, one considers the web as a directed graph, with the web pages as vertices connected by hyperlinks as edges. The most important application in this regard is the Google search engine, which estimates the ranking of its results primarily with the PageRank algorithm: a page is rated as highly relevant when it is frequently linked to by other highly relevant pages. Structure and content mining methodologies are usually combined. For example, web structure mining can help organizations to examine the network between two commercial sites.
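As an illustration of the idea only (not Google's actual implementation), a simplified PageRank can be computed by power iteration with a damping factor over a toy link graph; the graph and constants below are assumptions.

# Simplified PageRank via power iteration on a small hand-made graph.
import numpy as np

# Toy link graph: page index -> list of pages it links to.
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n = 4
damping = 0.85

# Column-stochastic transition matrix M[j, i] = probability of moving i -> j.
M = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        M[j, i] = 1.0 / len(outs)

rank = np.full(n, 1.0 / n)
for _ in range(100):
    rank = (1 - damping) / n + damping * (M @ rank)

print(np.round(rank, 3))   # page 2, linked by the most pages, gets the highest score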
3. Web Usage Mining:
Web usage mining is used to extract useful data, information, and knowledge from web log records, and it assists in recognizing user access patterns for web pages. In mining the usage of web resources, one considers records of requests made by visitors of a website, which are often collected as web server logs. While the content and structure of the collection of web pages follow the intentions of the authors of the pages, the individual requests demonstrate how the consumers see these pages. Web usage mining may disclose relationships that were not intended by the creator of the pages.
Some of the methods to identify and analyze the web usage patterns are given below:
I. Session and visitor analysis:
The analysis of preprocessed data can be accomplished in session analysis, which incorporates the guest
records, days, time, sessions, etc. This data can be utilized to analyze the visitor's behavior.
The document created after this analysis contains the details of repeatedly visited web pages and common entry and exit points.
II. OLAP (Online Analytical Processing):
OLAP accomplishes a multidimensional analysis of complex data.
OLAP can be performed on various parts of log-related data over a specific period.
OLAP tools can be used to derive important business intelligence metrics.

Challenges in Web Mining:

The web presents incredible challenges for resource and knowledge discovery, based on the following observations:

o The complexity of web pages:


The site pages do not have a unifying structure. They are extremely complicated compared to traditional text documents. There are enormous numbers of documents in the digital library of the web, and these libraries are not organized according to any particular order.
o The web is a dynamic data source:
The data on the internet is quickly updated. For example, news, climate, shopping, financial news,
sports, and so on.
o Diversity of client networks:
The client network on the web is quickly expanding. These clients have different interests, backgrounds,
and usage purposes. There are over a hundred million workstations that are associated with the internet
and still increasing tremendously.
o Relevancy of data:
It is considered that a specific person is generally concerned with only a small portion of the web, while the rest of the web contains data that is not familiar to the user and may lead to unwanted results.
o The web is too broad:
The size of the web is tremendous and rapidly increasing. It appears that the web is too huge for data
warehousing and data mining.

Mining the Web's Link Structures to recognize Authoritative Web Pages:


The web consists of pages as well as hyperlinks pointing from one page to another. When the creator of a web page places a hyperlink pointing to another web page, this can be considered as the creator's endorsement of the other page. The collective endorsement of a given page by many creators on the web may indicate the significance of the page and may naturally lead to the discovery of authoritative web pages. The web linkage data thus provide rich information about the relevance, quality, and structure of the web's content, and are therefore a rich source for web mining.

Application of Web Mining:


Web mining has an extensive application because of various uses of the web. The list of some
applications of web mining is given below.
o Marketing and conversion tools
o Data analysis of website and application performance
o Audience behaviour analysis
o Advertising and campaign performance analysis
o Testing and analysis of a site
