
DATA MINING

Introduction

Data mining is one of the most useful techniques that help entrepreneurs, researchers, and
individuals to extract valuable information from huge sets of data. Data mining is also
called Knowledge Discovery in Databases (KDD). The knowledge discovery process includes Data
cleaning, Data integration, Data selection, Data transformation, Data mining, Pattern evaluation,
and Knowledge presentation.

This data mining tutorial covers the major topics of data mining, such as applications, Data
mining vs Machine learning, Data mining tools, Social Media Data mining, Data mining techniques,
Clustering in data mining, Challenges in Data mining, and more.

What is Data Mining?

Data Mining is the process of extracting information from huge sets of data to identify patterns,
trends, and useful insights that allow a business to make data-driven decisions.

In other words, Data Mining is the process of investigating hidden patterns of information from
various perspectives and categorizing it into useful data. This data is collected and assembled in
particular areas such as data warehouses, where it supports efficient analysis and data mining
algorithms, helps decision making, and ultimately aids cost-cutting and revenue generation.

Data mining is the act of automatically searching large stores of information to find trends and
patterns that go beyond simple analysis procedures. Data mining utilizes complex mathematical
algorithms to segment the data and evaluate the probability of future events. Data Mining is also
called Knowledge Discovery in Databases (KDD).

Data Mining is a process used by organizations to extract specific data from huge databases to solve
business problems. It primarily turns raw data into useful information.

Data Mining is similar to Data Science: it is carried out by a person, in a specific situation, on a
particular data set, with an objective. This process includes various types of services such as text
mining, web mining, audio and video mining, pictorial data mining, and social media mining. It is done
through software that may be simple or highly specialized. By outsourcing data mining, all the work
can be done faster at lower operating costs. Specialized firms can also use new technologies to
collect data that would be impossible to locate manually. There is an enormous amount of information
available on various platforms, but very little usable knowledge. The biggest challenge is to analyze
the data and extract the important information that can be used to solve a problem or support company
development. Many powerful tools and techniques are available to mine data and find better insights
from it.

Types of Data Mining


Data mining can be performed on the following types of data:

Relational Database
A relational database is a collection of multiple data sets formally organized into tables, records,
and columns, from which data can be accessed in various ways without having to reorganize the
database tables. Tables convey and share information, which facilitates data searchability,
reporting, and organization.

Data warehouses
A Data Warehouse is the technology that collects data from various sources within the
organization to provide meaningful business insights. Large amounts of data come from
multiple places such as Marketing and Finance. The extracted data is used for analytical
purposes and supports decision-making in a business organization. The data warehouse is
designed for the analysis of data rather than for transaction processing.
Data Repositories
A data repository generally refers to a destination for data storage. However, many IT
professionals use the term more specifically to refer to a particular kind of setup within an IT
structure, for example, a group of databases where an organization keeps various kinds of information.

Object-Relational Database
A combination of an object-oriented database model and relational database model is called an
object-relational model. It supports Classes, Objects, Inheritance, etc.

One of the primary objectives of the Object-relational data model is to close the gap between the
Relational database and the object-oriented model practices frequently utilized in many
programming languages, for example, C++, Java, C#, and so on.

Transactional Database
A transactional database refers to a database management system (DBMS) that can undo a database
transaction if it is not completed appropriately. Even though this was a unique capability long ago,
today most relational database systems support transactional database activities.

Advantages of Data Mining

1. The Data Mining technique enables organizations to obtain knowledge-based data.


2. Data mining enables organizations to make lucrative modifications in operation and
production.
3. Compared with other statistical data applications, data mining is cost-efficient.
4. Data Mining helps the decision-making process of an organization.
5. It facilitates the automated discovery of hidden patterns as well as the prediction of
trends and behaviors.
6. It can be introduced into new systems as well as existing platforms.
7. It is a quick process that makes it easy for new users to analyze enormous amounts
of data in a short time.
Disadvantages of Data Mining
1. There is a probability that organizations may sell useful data about their customers to
other organizations for money. As per reports, American Express has sold credit
card purchase data of its customers to other organizations.
2. Much data mining analytics software is difficult to operate and requires advanced
training to work with.
3. Different data mining tools operate in distinct ways due to the different
algorithms used in their design. Therefore, the selection of the right data mining tool
is a very challenging task.
4. Data mining techniques are not 100 percent accurate, and may lead to serious
consequences in certain conditions.

Applications of Data Mining


Data Mining is primarily used by organizations with intense consumer demands, such as retail,
communications, financial services, and marketing companies, to determine prices, consumer
preferences, product positioning, and the impact on sales, customer satisfaction, and corporate
profits. Data mining enables a retailer to use point-of-sale records of customer purchases to develop
products and promotions that help the organization attract customers.

These are the following areas where data mining is widely used:

Data Mining in Healthcare


Data mining in healthcare has excellent potential to improve the health system. It uses data and
analytics to gain better insights and to identify best practices that will enhance health care services
and reduce costs. Analysts use data mining approaches such as machine learning, multi-dimensional
databases, data visualization, soft computing, and statistics. Data mining can be used to forecast
the number of patients in each category. The procedures ensure that patients get intensive care at
the right place and at the right time. Data mining also enables healthcare insurers to recognize
fraud and abuse.

Data Mining in Market Basket Analysis


Market basket analysis is a modeling method based on the hypothesis that if you buy a specific group
of products, you are more likely to buy another group of products. This technique may enable the
retailer to understand the purchase behavior of a buyer. The insight may assist the retailer in
understanding the requirements of the buyer and altering the store's layout accordingly. Analytical
comparison of results between different stores, or between customers in different demographic
groups, can also be carried out.

Data mining in Education


Educational data mining (EDM) is a newly emerging field concerned with developing techniques that
extract knowledge from data generated in educational environments. EDM objectives include
predicting students' future learning behavior, studying the impact of educational support, and
advancing learning science. An institution can use data mining to make precise decisions and also to
predict students' results. With these results, the institution can concentrate on what to teach and
how to teach it.

Data Mining in Manufacturing Engineering


Knowledge is the best asset a manufacturing company possesses. Data mining tools can help find
patterns in complex manufacturing processes. Data mining can be used in system-level design to
discover the relationships between product architecture, the product portfolio, and customer data
needs. It can also be used to forecast the product development period, cost, and expectations,
among other tasks.

Data Mining in CRM (Customer Relationship Management)


Customer Relationship Management (CRM) is all about acquiring and retaining customers, enhancing
customer loyalty, and implementing customer-oriented strategies. To maintain a good relationship
with customers, a business organization needs to collect and analyze data. With data mining
technologies, the collected data can be used for analytics.

Data Mining in Fraud detection


Billions of dollars are lost to fraud every year. Traditional methods of fraud detection are somewhat
time-consuming and complex. Data mining helps by revealing meaningful patterns and turning data
into information. An ideal fraud detection system should protect the data of all users.
Supervised methods use a collection of sample records that are classified as fraudulent or
non-fraudulent. A model is constructed from this data, and the technique is then used to
identify whether a new record is fraudulent or not.
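
For illustration, a minimal Python sketch of the supervised approach described above, assuming a
hypothetical file transactions.csv with illustrative columns amount, merchant_code, hour, and a
binary is_fraud label (names are assumptions, not from the original text):

# Supervised fraud detection sketch: learn from labeled records, flag new ones.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

records = pd.read_csv("transactions.csv")          # hypothetical labeled sample file
X = records[["amount", "merchant_code", "hour"]]   # illustrative feature columns
y = records["is_fraud"]                            # 1 = fraudulent, 0 = legitimate

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))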

Data Mining in Lie Detection


Apprehending a criminal is not the hard part; bringing out the truth from a suspect is the real
challenge. Law enforcement may use data mining techniques to investigate offenses, monitor suspected
terrorist communications, and more. This also includes text mining, which seeks meaningful patterns
in data that is usually unstructured text. Information collected from previous investigations is
compared, and a model for lie detection is constructed.

Data Mining in Financial Banking


The digitalization of the banking system generates an enormous amount of data with every new
transaction. Data mining techniques can help bankers solve business problems in banking and finance
by identifying trends, causalities, and correlations in business information and market prices that
are not immediately evident to managers or executives because the volume of data is too large or it
is produced too rapidly for experts to review. The manager can use these findings for better
targeting, acquiring, retaining, segmenting, and maintaining profitable customers.

Challenges of Implementation in Data mining


Although data mining is very powerful, it faces many challenges during its execution. Various
challenges could be related to performance, data, methods, and techniques, etc. The process of
data mining becomes effective when the challenges or problems are correctly recognized and
adequately resolved.
Incomplete and Noisy Data
Data mining is the process of extracting useful data from large volumes of data. Data in the
real world is heterogeneous, incomplete, and noisy. Data in huge quantities will usually be
inaccurate or unreliable. These problems may occur due to errors in the measuring instruments or
because of human errors. Suppose a retail chain collects the phone numbers of customers who spend
more than $500, and the accounting employees enter this information into their system. A person may
mistype a digit when entering a phone number, which results in incorrect data. Some customers may not
be willing to disclose their phone numbers, which results in incomplete data. The data could even get
changed due to human or system error. All of these issues (noisy and incomplete data) make data
mining challenging.

Data Distribution
Real-world data is usually stored on various platforms in a distributed computing environment. It
might be in databases, individual systems, or even on the internet. Practically, it is quite a tough
task to bring all the data into a centralized data repository, mainly due to organizational and
technical concerns. For example, various regional offices may have their own servers to store their
data, and it is not feasible to store all the data from all the offices on one central server.
Therefore, data mining requires the development of tools and algorithms that allow the mining of
distributed data.

Complex Data
Real-world data is heterogeneous, and it could be multimedia data (including audio, video, and
images), complex data, spatial data, time series, and so on. Managing these various types of data and
extracting useful information from them is a tough task. Most of the time, new technologies, tools,
and methodologies have to be developed to obtain specific information.
Performance
The data mining system's performance relies primarily on the efficiency of algorithms and
techniques used. If the designed algorithm and techniques are not up to the mark, then the
efficiency of the data mining process will be affected adversely.

Data Privacy and Security


Data mining usually leads to serious issues in terms of data security, governance, and privacy. For
example, if a retailer analyzes the details of the purchased items, then it reveals data about buying
habits and preferences of the customers without their permission.

Data Visualization
In data mining, data visualization is a very important process because it is the primary method for
showing the output to the user in a presentable way. The extracted data should convey the exact
meaning of what it intends to express. However, it is often difficult to represent the information to
the end user in a precise and easy way. Because the input data and the output information can be
complicated, very efficient and successful data visualization processes need to be implemented.

There are many more challenges in data mining in addition to the problems mentioned above. More
problems are revealed as the actual data mining process begins, and the success of data mining relies
on overcoming all these difficulties.

Data Mining Techniques


Data mining includes the utilization of refined data analysis tools to find previously unknown, valid
patterns and relationships in huge data sets. These tools can incorporate statistical models,
machine learning techniques, and mathematical algorithms, such as neural networks or decision
trees. Thus, data mining incorporates analysis and prediction.

Drawing on various methods and technologies from the intersection of machine learning,
database management, and statistics, professionals in data mining have devoted their careers to
better understanding how to process and draw conclusions from huge amounts of data. But what
methods do they use to make it happen?
In recent data mining projects, several major data mining techniques have been developed and
used, including association, classification, clustering, prediction, sequential patterns, and
regression.

1. Classification

This technique is used to obtain important and relevant information about data and metadata. It
helps to classify data into different classes.
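
For illustration, a minimal Python sketch of a classification model using scikit-learn's decision
tree and its built-in Iris sample data, which stands in here for any labeled business data:

# Classification sketch: learn class rules from labeled examples, score unseen data.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)  # learn the class rules
print("accuracy on unseen data:", clf.score(X_test, y_test))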

Data mining techniques can be classified by different criteria, as follows:

i. Classification of data mining frameworks as per the type of data sources mined:
This classification is based on the type of data handled, for example, multimedia, spatial data,
text data, time-series data, World Wide Web data, and so on.
ii. Classification of data mining frameworks as per the database involved:
This classification is based on the data model involved, for example, object-oriented database,
transactional database, relational database, and so on.
iii. Classification of data mining frameworks as per the kind of knowledge discovered:
This classification depends on the types of knowledge discovered or the data mining
functionalities, for example, discrimination, classification, clustering, characterization, etc.
Some frameworks are comprehensive frameworks offering several data mining
functionalities together.
iv. Classification of data mining frameworks according to the data mining techniques used:
This classification is based on the data analysis approach utilized, such as neural networks,
machine learning, genetic algorithms, visualization, statistics, or a data warehouse-oriented or
database-oriented approach.
The classification can also take into account the level of user interaction involved in the
data mining procedure, such as query-driven systems, autonomous systems, or interactive
exploratory systems.

2. Clustering
Clustering is the division of information into groups of connected objects. Describing the data by a
few clusters loses certain fine details but achieves simplification: the data is modeled by its
clusters. Historically, data modeling places clustering in a framework rooted in statistics,
mathematics, and numerical analysis. From a machine learning point of view, clusters correspond to
hidden patterns, the search for clusters is unsupervised learning, and the resulting framework
represents a data concept. From a practical point of view, clustering plays an extraordinary role in
data mining applications such as scientific data exploration, text mining, information retrieval,
spatial database applications, CRM, web analysis, computational biology, medical diagnostics, and
much more.

In other words, we can say that clustering analysis is a data mining technique used to identify
similar data. This technique helps to recognize the differences and similarities between data items.
Clustering is very similar to classification, but it involves grouping chunks of data together
based on their similarities.
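
For illustration, a minimal Python sketch of clustering with k-means (scikit-learn); the customer
figures below are hypothetical:

# Clustering sketch: group similar records without any labels.
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer data: each row is (annual spend, visits per month).
customers = np.array([[200, 2], [220, 3], [1500, 12], [1600, 15], [800, 7], [780, 6]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print("cluster assigned to each customer:", kmeans.labels_)
print("cluster centers:", kmeans.cluster_centers_)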

3. Regression
Regression analysis is the data mining process used to identify and analyze the relationship
between variables in the presence of other factors. It is used to define the probability of a
specific variable. Regression is primarily a form of planning and modeling. For example, we might
use it to project certain costs, depending on other factors such as availability, consumer demand,
and competition. Primarily, it gives the exact relationship between two or more variables in the
given data set.
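
For illustration, a minimal Python sketch that fits a linear regression relating cost to
hypothetical demand and availability figures:

# Regression sketch: model the relationship between variables and project a value.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical historical data: [consumer_demand, availability] -> observed cost.
X = np.array([[100, 80], [150, 60], [200, 50], [250, 30], [300, 20]])
cost = np.array([10.0, 14.5, 18.0, 24.0, 29.5])

model = LinearRegression().fit(X, cost)
print("coefficients:", model.coef_, "intercept:", model.intercept_)
print("projected cost at demand=220, availability=40:", model.predict([[220, 40]])[0])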

4. Association Rules
This data mining technique helps to discover links between two or more items. It finds hidden
patterns in the data set.
Association rules are if-then statements that help show the probability of interactions
between data items within large data sets in different types of databases. Association rule mining
has several applications and is commonly used to find sales correlations in transactional data or in
medical data sets.

The way the algorithm works is that you start with some data, for example, a list of grocery items
that you have been buying for the last six months. It calculates the percentage of items being
purchased together.

There are three major measurement techniques:

o Lift:
This measures the strength of the confidence relative to how often item B is purchased overall.
Lift = Confidence / ((Item B) / (Entire dataset))
o Support:
This measures how often the items are purchased together, compared to the overall dataset.
Support = (Item A + Item B) / (Entire dataset)
o Confidence:
This measures how often item B is purchased when item A is purchased as well.
Confidence = (Item A + Item B) / (Item A)
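
For illustration, a minimal Python sketch that computes the three measures over a small,
hypothetical list of grocery transactions (item names are made up):

# Association-rule measures sketch for the rule {bread} -> {milk}.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]

def support(items):
    # Fraction of all transactions that contain every item in `items`.
    return sum(items <= t for t in transactions) / len(transactions)

a, b = {"bread"}, {"milk"}
supp = support(a | b)                 # (Item A + Item B) / (Entire dataset)
conf = support(a | b) / support(a)    # (Item A + Item B) / (Item A)
lift = conf / support(b)              # Confidence / Support(Item B)
print(f"support={supp:.2f} confidence={conf:.2f} lift={lift:.2f}")
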
5. Outlier Detection
This data mining technique relates to observing data items in the data set that do not match an
expected pattern or expected behavior. The technique can be used in various domains such as
intrusion detection and fraud detection. It is also known as outlier analysis or outlier mining. An
outlier is a data point that diverges too much from the rest of the dataset. The majority of
real-world datasets contain outliers. Outlier detection plays a significant role in the data mining
field and is valuable in numerous areas such as network intrusion identification, credit or debit
card fraud detection, and detecting outlying values in wireless sensor network data.
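
For illustration, a minimal Python sketch of outlier detection using scikit-learn's Isolation
Forest on hypothetical sensor readings (a simple z-score test would also work for one-dimensional
data):

# Outlier detection sketch: flag points that diverge from the rest of the data.
import numpy as np
from sklearn.ensemble import IsolationForest

readings = np.array([[10.1], [9.8], [10.3], [10.0], [9.9], [42.0]])  # last value diverges

detector = IsolationForest(contamination=0.2, random_state=0).fit(readings)
print(detector.predict(readings))   # -1 marks points flagged as outliers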

6. Sequential Patterns

The sequential pattern is a data mining technique specialized for evaluating sequential data to
discover sequential patterns. It involves finding interesting subsequences in a set of sequences,
where the value of a sequence can be measured in terms of different criteria such as length,
occurrence frequency, etc.

In other words, this data mining technique helps to discover or recognize similar patterns in
transaction data over a period of time.

7. Prediction
Prediction uses a combination of other data mining techniques such as trend analysis, clustering,
classification, etc. It analyzes past events or instances in the right sequence to predict a future
event.

Data Mining Implementation Process


Many different sectors are taking advantage of data mining to boost their business efficiency,
including manufacturing, chemicals, marketing, aerospace, etc. Therefore, the need for a standard
data mining process has grown considerably. A data mining process must be reliable and repeatable by
business people with little or no knowledge of the data mining background. As a result, the
Cross-Industry Standard Process for Data Mining (CRISP-DM) was first introduced in 1996, after going
through many workshops and with contributions from more than 300 organizations.

Data mining is described as a process of finding hidden, valuable knowledge by evaluating the huge
quantities of information stored in data warehouses, using multiple data mining techniques such as
artificial intelligence (AI), machine learning, and statistics.

Let's examine the implementation process for data mining in detail:


The Cross-Industry Standard Process for Data Mining (CRISP-DM)

The Cross-Industry Standard Process for Data Mining (CRISP-DM) comprises six phases arranged in a
cyclical process:

1. Business understanding
This phase focuses on understanding the project goals and requirements from a business point of view,
then converting this knowledge into a data mining problem definition, and afterward designing a
preliminary plan to accomplish the objectives.

Tasks:
o Determine business objectives
o Assess situation
o Determine data mining goals
o Produce a project plan

Determine Business Objectives


o Understand the project targets and prerequisites from a business point of view.

o Thoroughly understand what the customer wants to achieve.

o Reveal significant factors, at the start, that can impact the outcome of the project.

Assess situation
o This requires a more detailed analysis of facts about all the resources, constraints,
assumptions, and other factors that ought to be considered.
Determine data mining goals
o A business goal states the target in business terminology, for example, increase catalog sales
to existing customers.
o A data mining goal states the project objective in technical terms, for example, predict how many
items a customer will buy, given their demographic details (age, salary, and city) and the price of
the item over the past three years.

Produce a project plan:

o It states the intended plan for achieving the business and data mining goals.
o The project plan should define the expected set of steps to be performed during the rest of the
project, including the selection of techniques and tools.

2. Data Understanding
Data understanding starts with initial data collection and proceeds with activities to get
familiar with the data, identify data quality issues, discover first insights into the data, or
detect interesting subsets to form hypotheses about hidden information.

Tasks:
o Collect initial data
o Describe data
o Explore data
o Verify data quality

Collect initial data


o Acquire the data listed in the project resources.
o This includes data loading, if necessary for data understanding.
o It may lead to initial data preparation steps.
o If various data sources are acquired, then integration is an additional issue, either here or at
the subsequent data preparation stage.

Describe data
o It examines the "gross" or "surface" characteristics of the information obtained.
o It reports on the outcomes.

Explore data
o Address data mining questions that can be resolved by querying, visualizing, and
reporting, including:
o the distribution of important attributes and the results of simple aggregations,
o relationships between small numbers of attributes, and
o characteristics of important sub-populations, with simple statistical analysis.
o It may refine the data mining objectives.
o It may contribute to or refine the data description and quality reports.
o It may feed into the transformation and other necessary data preparation steps.

Verify data quality


o Examine the quality of the data and address any data quality questions.

3. Data Preparation
o Data preparation usually takes more than 90 percent of the project time.
o It covers all activities needed to construct the final data set from the initial raw data.
o Data preparation tasks are likely to be performed several times and not in any prescribed order.

Tasks
o Select data
o Clean data
o Construct data
o Integrate data
o Format data

Select data

o Decide which data will be used for the analysis.

o Data selection criteria include relevance to the data mining objectives, quality, and technical
constraints such as limits on data volume or data types.
o It covers the selection of attributes (columns) as well as records (rows) in a table.

Clean data
o Cleaning may involve selecting clean subsets of the data, inserting suitable defaults, or more
ambitious methods such as estimating missing values by modeling (a brief sketch follows).
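
For illustration, a minimal Python sketch of the default-filling approach, using pandas and
scikit-learn's SimpleImputer on hypothetical columns:

# Data cleaning sketch: fill missing values with a sensible default (the median).
import pandas as pd
from sklearn.impute import SimpleImputer

data = pd.DataFrame({"age": [25, None, 47, 35, None], "income": [30, 45, 80, 52, 38]})

imputer = SimpleImputer(strategy="median")            # insert appropriate defaults
cleaned = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)
print(cleaned)
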
Construct data
o This comprises constructive data preparation operations such as generating derived attributes,
creating entirely new records, or producing transformed values of existing attributes.

Integrate data
o Data integration refers to the methods whereby data is combined from multiple tables or
records to create new records or values.

Format data
o Formatting data refers mainly to syntactic modifications made to the data that do not
change its meaning but may be required by the modeling tool.

4. Modeling
In this phase, various modeling techniques are selected and applied, and their parameters are
calibrated to optimal values. Some techniques have specific requirements on the form of the data, so
stepping back to the data preparation phase may be necessary.

Tasks
o Select modeling technique
o Generate test design
o Build model
o Assess model

Select modeling technique


o Select the actual modeling technique to be used, for example, a decision tree or a neural
network.
o If multiple techniques are applied, perform this task separately for each technique.

Generate test Design


o Generate a procedure or mechanism for testing the model's quality and validity before the
model is built. For example, in classification, error rates are commonly used as quality
measures for data mining models. Therefore, the data set is typically separated into a train
set and a test set: the model is built on the train set and its quality is estimated on the
separate test set (a brief sketch follows).
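
For illustration, a minimal Python sketch of such a test design, splitting a sample data set into
train and test sets and using the error rate as the quality measure:

# Test design sketch: hold out a test set, build on the train set, report the error rate.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)   # build on train set
error_rate = 1 - model.score(X_test, y_test)                           # assess on test set
print(f"test error rate: {error_rate:.3f}")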

Build model
o To create one or more models, we need to run the modeling tool on the prepared data set.
Assess model
o Interpret the models according to domain knowledge, the data mining success criteria,
and the desired test design.
o Assess the success of the modeling and discovery techniques from a technical
standpoint.
o Contact business analysts and domain experts afterward to discuss the data mining
results in the business context.

5. Evaluation

o It evaluates the model thoroughly and reviews the steps executed to build the model, to
ensure that the business objectives are properly achieved.

o The main objective of the evaluation is to determine whether some significant business issue
has not been considered adequately.

o At the end of this phase, a decision on the use of the data mining results should be
reached.

Tasks

o Evaluate results

o Review process

o Determine next steps

Evaluate results

o Assess the degree to which the model meets the organization's business objectives.

o Test the model on test applications in the actual deployment, if time and budget
limitations permit, and also assess the other data mining results produced.

o This unveils additional challenges, suggestions, or information for future directions.

Review process

o The review process makes a more thorough evaluation of the data mining engagement to
determine whether any significant factor or task has somehow been overlooked.

o It also reviews quality assurance issues.

Determine next steps


o Decide how to proceed at this stage.

o Decide whether to finish the project and move on to deployment if appropriate, whether to
initiate further iterations, or whether to set up new data mining projects. This includes an
analysis of the remaining resources and budget, which influences the decisions.

6. Deployment

Determine:
o Deployment refers to how the outcomes need to be utilized.

Deploy data mining results by:

o Scoring a database, utilizing the results as company guidelines, or interactive internet scoring.
o The knowledge acquired will need to be organized and presented in a way that can be used by the
client. Depending on the demands, the deployment phase may be as simple as generating a report or as
complicated as applying a repeatable data mining process across the organization.

Tasks
o Plan deployment
o Plan monitoring and maintenance
o Produce final report
o Review project

Plan deployment:
o To deploy the data mining results into the business, take the assessment results and
develop a strategy for deployment.
o This includes documenting the procedure for later deployment.

Plan monitoring and maintenance


o Monitoring and maintenance are important if the data mining results become part of the
day-to-day business and its environment.
o They help to avoid unnecessarily long periods of incorrect use of the data mining results.
o This task needs a detailed analysis of the monitoring process.

Produce final report


o A final report can be drawn up by the project leader and his team.
o It may only be a summary of the project and its experience.
o It may be a final and comprehensive presentation of data mining.

Review project
o The project review assesses what went right and what went wrong, what was done well, and
what needs to be improved.

Data Mining Architecture

Introduction
Data mining is a significant process in which previously unknown and potentially useful information
is extracted from vast amounts of data. The data mining process involves several components,
and these components constitute the data mining system architecture.

The significant components of a data mining system are the data sources, the database or data
warehouse server, the data mining engine, the pattern evaluation module, the graphical user
interface, and the knowledge base.

Data Source
The actual sources of data are databases, data warehouses, the World Wide Web (WWW), text files, and
other documents. You need a huge amount of historical data for data mining to be successful.
Organizations typically store data in databases or data warehouses. Data warehouses may
comprise one or more databases, text files, spreadsheets, or other repositories of data. Sometimes,
even plain text files or spreadsheets may contain information. Another primary source of data is
the World Wide Web, or the internet.

Different Processes
Before passing the data to the database or data warehouse server, it must be cleaned,
integrated, and selected. As the information comes from various sources and in different formats, it
cannot be used directly for the data mining procedure because the data may not be complete and
accurate. So, the data first needs to be cleaned and unified. More information than needed will be
collected from the various data sources, and only the data of interest has to be selected and
passed to the server. These procedures are not as easy as they sound. Several methods may be
applied to the data as part of selection, integration, and cleaning.

Database or Data Warehouse Server


The database or data warehouse server contains the actual data that is ready to be processed.
Hence, the server is responsible for retrieving the relevant data, based on the user's data mining
request.

Data Mining Engine


The data mining engine is a major component of any data mining system. It contains several
modules for performing data mining tasks, including association, characterization, classification,
clustering, prediction, time-series analysis, etc.

In other words, we can say the data mining engine is the core of the data mining architecture. It
comprises the instruments and software used to obtain insights and knowledge from data collected
from various data sources and stored within the data warehouse.

Pattern Evaluation Module


The pattern evaluation module is primarily responsible for measuring how interesting a discovered
pattern is, using a threshold value. It collaborates with the data mining engine to focus the search
on interesting patterns.
This segment commonly employs interestingness measures that cooperate with the data mining modules
to focus the search toward interesting patterns. It may use an interestingness threshold to filter
out discovered patterns. Alternatively, the pattern evaluation module may be integrated with the
mining module, depending on the implementation of the data mining technique used. For efficient data
mining, it is highly recommended to push the evaluation of pattern interestingness as deep as
possible into the mining procedure so as to confine the search to only the interesting patterns.
Graphical User Interface
The graphical user interface (GUI) module communicates between the data mining system and the
user. This module helps the user to easily and efficiently use the system without knowing the
complexity of the process. This module cooperates with the data mining system when the user
specifies a query or a task and displays the results.

Knowledge Base
The knowledge base is helpful in the entire process of data mining. It may be used to guide the
search or to evaluate the interestingness of the resulting patterns. The knowledge base may even
contain user views and data from user experiences that can be helpful in the data mining process. The
data mining engine may receive inputs from the knowledge base to make the results more accurate and
reliable. The pattern evaluation module regularly interacts with the knowledge base to get inputs
and also to update it.

KDD- Knowledge Discovery in Databases


The term KDD stands for Knowledge Discovery in Databases. It refers to the broad procedure of
discovering knowledge in data and emphasizes the high-level applications of specific Data Mining
techniques. It is a field of interest to researchers in various fields, including artificial intelligence,
machine learning, pattern recognition, databases, statistics, knowledge acquisition for expert
systems, and data visualization.

The main objective of the KDD process is to extract information from data in the context of large
databases. It does this by using Data Mining algorithms to identify what is deemed knowledge.

Knowledge Discovery in Databases can be considered an automated, exploratory analysis and
modeling of vast data repositories. KDD is the organized procedure of recognizing valid, useful, and
understandable patterns from huge and complex data sets. Data mining is the core of the KDD
procedure, involving the algorithms that explore the data, develop the model, and
find previously unknown patterns. The model is used for extracting knowledge from the data,
analyzing the data, and making predictions.
The availability and abundance of data today make knowledge discovery and data mining a matter
of considerable significance and necessity. Given the recent development of the field, it is not
surprising that a wide variety of techniques is now accessible to specialists and practitioners.

The KDD Process


The knowledge discovery process is iterative and interactive and comprises nine steps. The process is
iterative at each stage, meaning that moving back to previous steps might be required. The process
has many imaginative aspects in the sense that one cannot present a single formula or a complete
scientific categorization of the correct decisions for each step and application type. Thus, it is
necessary to understand the process and the different requirements and possibilities at each stage.

The process begins with determining the KDD objectives and ends with the implementation of the
discovered knowledge. At that point, the loop is closed, and active data mining starts.
Subsequently, changes may need to be made in the application domain, for example, offering
different features to cell phone users in order to reduce churn. This closes the loop, the effects
are then measured on the new data repositories, and the KDD process begins again. The following is a
concise description of the nine-step KDD process, beginning with a managerial step:

1. Building up an understanding of the application domain


This is the initial preliminary step. It sets the scene for understanding what should be done
with the many decisions to follow (transformation, algorithms, representation, etc.). The
individuals in charge of a KDD project need to understand and define the objectives of the end user
and the environment in which the knowledge discovery process will take place (including relevant
prior knowledge).
2. Choosing and creating a data set on which discovery will be performed
Once the objectives are defined, the data that will be used for the knowledge discovery process
should be determined. This involves finding out what data is accessible, obtaining additional
necessary data, and then integrating all the data for knowledge discovery into one data set,
including the attributes that will be considered in the process. This step is important because data
mining learns and discovers from the available data; it is the evidence base for building the models.
If some significant attributes are missing, the entire study may fail, so the more attributes are
considered, the better. On the other hand, organizing, collecting, and operating advanced data
repositories is expensive, so there is a trade-off against the opportunity for best understanding
the phenomena. This trade-off is one aspect where the interactive and iterative nature of KDD takes
place: one begins with the best available data sets and later expands them and observes the effect in
terms of knowledge discovery and modeling.

3. Preprocessing and cleansing


In this step, data reliability is improved. It includes data cleaning, for example, handling
missing values and removing noise or outliers. It may involve complex statistical techniques or the
use of a data mining algorithm in this context. For example, if one suspects that a specific
attribute is of insufficient reliability or has many missing values, then this attribute could become
the target of a supervised data mining algorithm: a prediction model for the attribute is built, and
the missing values can then be predicted. The extent to which one pays attention to this step depends
on many factors. In any case, studying these aspects is important and is often revealing in itself
with respect to enterprise data systems.

4. Data Transformation
In this stage, appropriate data for data mining is prepared and developed. Techniques here include
dimension reduction (for example, feature selection and extraction, and record sampling) and
attribute transformation (for example, discretization of numerical attributes and functional
transformations). This step can be crucial for the success of the entire KDD project, and it is
usually very project-specific. For example, in medical examinations, the ratio between attributes may
often be the most significant factor rather than each attribute by itself. In business, we may need
to consider effects beyond our control as well as efforts and transient issues, for example, studying
the effect of accumulated advertising. However, even if we do not use the right transformation at
the start, we may obtain a surprising effect that hints at the transformation needed in the next
iteration. Thus, the KDD process feeds back on itself and leads to an understanding of the
transformation required.
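
For illustration, a minimal Python sketch of two such transformations, discretizing a numerical
attribute and reducing dimensionality by feature selection (the data below is synthetic):

# Data transformation sketch: discretization and simple feature selection.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.feature_selection import SelectKBest, f_classif

ages = np.array([[22], [35], [47], [58], [63], [71]])        # hypothetical numeric attribute
bins = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
print("discretized ages:", bins.fit_transform(ages).ravel())

# Keep only the 2 attributes most related to the target (dimension reduction).
X = np.random.rand(20, 6)                                    # 20 records, 6 attributes
y = np.random.randint(0, 2, size=20)                         # binary target
X_reduced = SelectKBest(f_classif, k=2).fit_transform(X, y)
print("reduced shape:", X_reduced.shape)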

5. Prediction and description


We are now prepared to decide which kind of data mining to use, for example, classification,
regression, or clustering. This choice mainly depends on the KDD objectives and on the previous
steps. There are two major goals in data mining: the first is prediction, and the second is
description. Prediction is often referred to as supervised data mining, while descriptive data mining
includes the unsupervised and visualization aspects of data mining. Most data mining techniques rely
on inductive learning, where a model is built explicitly or implicitly by generalizing from an
adequate number of training examples. The fundamental assumption of the inductive approach is that
the trained model is applicable to future cases. The strategy also takes into account the level of
meta-learning for the particular set of available data.

6. Selecting the Data Mining algorithm


Having chosen the technique, we now decide on the specific algorithms. This stage includes selecting
a particular method to be used for searching for patterns, possibly involving multiple inducers. For
example, considering precision versus understandability, the former is better achieved with neural
networks, while the latter is better achieved with decision trees. For each strategy of
meta-learning, there are several possibilities for how it can be applied. Meta-learning focuses on
explaining what causes a data mining algorithm to succeed or fail on a particular problem. Thus, this
methodology attempts to understand the conditions under which a data mining algorithm is most
appropriate. Each algorithm has parameters and strategies of learning, such as ten-fold
cross-validation or another division into training and testing sets.

7. Utilizing the Data Mining Algorithm

Finally, the data mining algorithm is actually applied. In this stage, we may need to run the
algorithm several times until a satisfying outcome is obtained, for example, by tuning the
algorithm's control parameters, such as the minimum number of instances in a single leaf of a
decision tree.
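
For illustration, a minimal Python sketch that re-runs a decision tree with different values of the
minimum-instances-per-leaf parameter and compares ten-fold cross-validation scores (the sample data
set is only a stand-in):

# Parameter tuning sketch: rerun the algorithm until a satisfying setting is found.
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
for min_leaf in (1, 5, 20):                             # try several settings
    tree = DecisionTreeClassifier(min_samples_leaf=min_leaf, random_state=0)
    score = cross_val_score(tree, X, y, cv=10).mean()   # ten-fold cross-validation
    print(f"min_samples_leaf={min_leaf}: accuracy={score:.3f}")
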
8. Evaluation
In this step, we assess and interpret the mined patterns and rules with respect to the objectives
defined in the first step. We also consider the preprocessing steps with respect to their impact on
the data mining algorithm's results, for example, adding a feature in step 4 and repeating from
there. This step focuses on the comprehensibility and usefulness of the induced model. The discovered
knowledge is also documented for further use. The final step is then the usage of, and overall
feedback on, the discovery results obtained by data mining.

9. Using the discovered knowledge

Now we are ready to incorporate the knowledge into another system for further action. The
knowledge becomes active in the sense that we may make changes to the system and measure
the effects. The success of this step determines the effectiveness of the entire KDD process.
There are numerous challenges in this step, such as losing the "laboratory conditions" under which
we have worked. For example, the knowledge was discovered from a certain static snapshot, usually a
set of data, but now the data becomes dynamic. Data structures may change, certain quantities may
become unavailable, and the data domain may be modified, such as an attribute taking a value that was
not anticipated previously.
Data Mining vs Machine Learning

Data Mining relates to extracting information from a large quantity of data. Data mining is a
technique for discovering various kinds of patterns that are inherent in the data set and that are
precise, new, and useful. Data Mining works as a subset of business analytics and is similar to
experimental studies. Data Mining has its origins in databases and statistics.

Machine learning includes algorithms that automatically improve through data-based
experience. Machine learning is a way to discover new behavior from experience. Machine learning
includes the study of algorithms that can automatically extract information from data. Machine
learning utilizes data mining techniques and other learning algorithms to construct models of what is
happening behind the data so that it can predict future outcomes.

Data Mining and Machine learning are areas that have influenced each other; although they have much
in common, they have different ends.

Data Mining is performed by humans on certain data sets to find interesting patterns among the
items in the data set. Data Mining uses techniques created by machine learning for predicting
results, while machine learning is the capability of a computer to learn from a mined data set.

Machine learning algorithms take the information that represents the relationships between items
in data sets and create models in order to predict future results. These models are nothing more
than actions that the machine will take to achieve a result.

What is Data Mining?

Data Mining is the method of extracting data, or previously unknown data patterns, from huge
sets of data. Hence, as the phrase suggests, we 'mine for specific data' from a large data set. Data
mining, also called the knowledge discovery process, is a field of science used to determine the
properties of data sets. Gregory Piatetsky-Shapiro coined the term "Knowledge Discovery in
Databases" (KDD) in 1989. The term "data mining" appeared in the database community around 1990.
Huge sets of data collected from data warehouses, or complex data sets such as time series, spatial
data, etc., are mined in order to extract interesting correlations and patterns between the data
items. The output of a data mining algorithm is often used as input for machine learning algorithms.
What is Machine learning?

Machine learning is concerned with the development and design of a machine that can learn by itself
from a specified set of data to obtain a desirable result without being explicitly coded. Hence,
machine learning implies 'a machine that learns on its own'. Arthur Samuel, an American pioneer in
the fields of computer gaming and artificial intelligence, coined the term machine learning
in 1959. He said that it "gives computers the ability to learn without being explicitly
programmed."

Machine learning is a technique that creates complex algorithms for processing large amounts of data
and provides outcomes to its users. It utilizes complex programs that can learn through experience
and make predictions.

The algorithms improve themselves through the frequent input of training data. The aim of machine
learning is to understand data and to build models from it that can be understood and used
by humans.

Machine learning algorithms are divided into two types:

1. Unsupervised Learning
2. Supervised Learning

1. Unsupervised Machine Learning:


Unsupervised learning does not rely on labeled (trained) data sets to predict results; instead, it
uses techniques such as clustering and association to derive results directly. Trained data
sets are defined as input data for which the output is already known.

2. Supervised Machine Learning:


As the name implies, supervised learning involves the presence of a supervisor acting as a teacher.
Supervised learning is a learning process in which we teach or train the machine using data that
is well labeled, meaning that some data is already tagged with the correct answers. After that, the
machine is provided with new sets of data so that the supervised learning algorithm analyzes
the training data and produces a correct result from the labeled data.
Major Difference between Data mining and Machine learning
1. Two components are used to introduce data mining techniques: the first is the database, and the
second is machine learning. The database provides data management techniques, while
machine learning provides methods for data analysis. But to introduce machine learning methods,
algorithms are used.

2. Data Mining uses large volumes of data to obtain helpful information, and that specific data helps
to predict future results. For example, a marketing company uses last year's data to predict sales,
whereas machine learning does not depend as much on data; it uses algorithms. Many transportation
companies, such as OLA and UBER, use machine learning techniques to calculate the ETA
(Estimated Time of Arrival) for rides.

3. Data mining is not capable of self-learning. It follows predefined guidelines and provides the
answer to a specific problem. Machine learning algorithms, by contrast, are self-adjusting and can
alter their rules according to the situation, finding the solution to a specific problem and
resolving it in their own way.

4. The main and most important difference between data mining and machine learning is that
data mining cannot work without the involvement of humans, whereas in machine learning
human effort is only involved when the algorithm is defined; after that, it works everything out on
its own. Once implemented, it can be used continuously, which is not possible in the case of
data mining.

5. Because machine learning is an automated process, the results produced by machine learning will be
more precise than those of data mining.

6. Data mining utilizes the database, data warehouse server, data mining engine, and pattern
assessment techniques to obtain useful information, whereas machine learning utilizes neural
networks, predictive models, and automated algorithms to make the decisions.
Data Mining Vs Machine Learning

Factors | Data Mining | Machine Learning

Origin | Traditional databases with unstructured data. | It has an existing algorithm and data.

Meaning | Extracting information from a huge amount of data. | Introducing new information from data
as well as previous experience.

History | Known since 1989 as knowledge discovery in databases (KDD). | The first program, i.e.,
Samuel's checker-playing program, was established in the 1950s.

Responsibility | Data Mining is used to obtain the rules from the existing data. | Machine learning
teaches the computer how to learn and comprehend the rules.

Abstraction | Data mining abstracts from the data warehouse. | Machine learning reads machines.

Applications | Compared to machine learning, data mining can produce outcomes on a smaller volume of
data. It is also used in cluster analysis. | It needs a large amount of data to obtain accurate
results. It has various applications, used in web search, spam filtering, credit scoring, computer
design, etc.

Nature | It involves more human interference and is more manual. | It is automated; once designed and
implemented, there is no need for human effort.

Techniques involved | Data mining is more of a research activity using techniques like machine
learning. | It is a self-learned and trained system to do the task precisely.

Scope | Applied in limited fields. | It can be used in a vast area.

Facebook Data Mining

In this digital era, social platforms have become inevitable. Whether we like these platforms or
not, there is no escape. Facebook allows us to interact with friends and family and to stay up to
date about the latest things happening around the world. Facebook has made the world seem
much smaller. Facebook is one of the most important channels for online business
communication, and business owners make the most of this platform. One important
reason the platform is so heavily accessed is that it is one of the oldest photo and video
sharing social media tools.

A Facebook page helps people become aware of a brand through the media content shared. The platform
supports businesses in reaching out to their audience and then building their business on Facebook
usage itself.

The platform is useful not only for users with business accounts but also for accounts with personal
blogs. Bloggers and influencers who post content that attracts customers give users yet another
reason to access Facebook.

As far as usage by ordinary users is concerned, many people nowadays cannot live without
Facebook. It has become such a habit that people check the site every half hour.

Facebook is one of the most popular social media platforms. Created in 2004, it now has almost
two billion monthly active users, with five new profiles created every second. Anyone over the
age of 13 can use the site. Users create a free account, which is a profile in which they
share as much or as little information about themselves as they wish.
Some Facts about Facebook:

o Headquarters: California, US
o Established: February 2004
o Founded by: Mark Zuckerberg
o There are approximately 52 percent female users and 48 percent male users on Facebook.
o Facebook stories are viewed by 0.6 billion viewers on a daily basis.
o In 2019, in every 60 seconds on the internet, 1 million people logged in to Facebook.
o More than 5 billion messages are posted on Facebook pages collectively on a monthly basis.

On a Facebook page, a user can include many different kinds of personal data, including
date of birth, hobbies and interests, education, sexual preferences, political party and
religious affiliations, and current employment. Users can also post photos of themselves as well
as of other people, and they can offer other Facebook users the opportunity to search for and
communicate with them via the website. Researchers have realized that the plentiful personal data
on Facebook, as well as on other social networking platforms, can easily be collected, or mined, to
search for patterns in people's behavior. For example, social researchers at various universities
around the world have collected data from Facebook pages to become familiar with the lives
and social networks of college students. They have also mined data on MySpace to find out
how people express feelings on the web and to assess, based on data posted on MySpace,
what youths think about appropriate internet conduct.

Because academic specialists, particularly those in the social sciences, are collecting data from
Facebook and other internet websites and publishing their findings, numerous university
Institutional Review Boards (IRBs), councils charged under government regulations with reviewing
research with human subjects, have built up policies and procedures that govern research on
the internet. Some have created policies specifically relating to data mining on social
media platforms like Facebook. These policies serve as institution-specific supplements to
the Department of Health and Human Services (HHS) guidelines governing the conduct of
research with human subjects. The creation of these institution-specific policies indicates that at
least some university IRBs view data mining on Facebook as research with human subjects. Thus, at
the universities where this is the case, research involving data mining on Facebook
must undergo IRB review before the research may start.

According to the HHS guidelines, all research with human subjects must undergo IRB review
and receive IRB approval before the research may start. The regulatory requirement aims to
ensure that human subjects research is conducted as ethically as possible, specifically requiring
that subject participation in research is voluntary, that the risks to subjects are proportionate to
the benefits, and that no subject population is unfairly excluded from or included in the research.
Social Media Data Mining Methods
Applying data mining techniques to social media is relatively new compared to other fields of
research related to social network analysis, especially when we acknowledge that research on social
networks dates back to the 1930s. Applications that use data mining techniques
developed by industry and academia are already being used commercially. For example,
"social media analytics" organizations offer services that track social media and provide
customers with data about how their goods and services are perceived and discussed across social
media networks. Analysts in these organizations have applied text mining algorithms and
propagation models to blogs to create techniques that better explain how information moves
through the blogosphere.

Data mining techniques can be applied to social media sites to better comprehend the
information and to make use of the data for analytics, research, and business purposes. Representative
fields include community or group detection, information diffusion, influence propagation,
topic detection and tracking, individual behavior analysis, group behavior analysis, and market
research for organizations.

Representation of Data
As with other social media data, it is common to use a graph representation to study social
media data sets. A graph comprises a set of vertices (nodes) and a set of edges (links). Users are
usually shown as the nodes in the graph, and relationships or cooperation between individuals
(nodes) are shown as the links in the graph.

The graph depiction is common for information extracted from social networking sites where
people interact with friends, family, and business associates. It helps to create a social network
of friends, family, or business associates. Less apparent is how the graph structure is applied to
blogs, wikis, opinion mining, and similar types of online social media platforms.

If we consider blogs, one graph representation has blogs as the nodes and can be regarded as a "blog
network," while another graph representation has blog posts as the nodes and can be regarded as
a "post network." Edges are created in a post network when one blog post references
another blog post. Other techniques used to represent blog networks account for
individuals, relationships, content, and time simultaneously and are called Internet Online Analytical
Processing (iOLAP). Wikis can be considered by depicting authors as nodes,
with edges created when the authors contribute to a common object.

The graph representation allows the application of classic mathematical graph theory,
traditional techniques for analyzing social networks, and work on mining graph data. The
potentially large size of the graph used to depict social media platforms can present difficulties for
automated processing, as limits on computer memory and processing speed are often reached or
exceeded when trying to cope with huge social media data sets. Other challenges to
implementing automated procedures for social media data mining include identifying and
dealing with spam, the variety of formats used within the same subcategory of social media, and
continuously changing content and structure.
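As a small illustration of this graph representation, the Python sketch below uses the networkx library to build a tiny "friend network" and report simple structural measures of the kind discussed above. The users and links are made-up examples, not data from any real platform.

# A minimal sketch of representing social media data as a graph,
# using the networkx library. The users and relationships below are
# hypothetical examples, not data from any real platform.
import networkx as nx

g = nx.Graph()
# Nodes represent users; edges represent friendship/interaction links.
friendships = [("alice", "bob"), ("alice", "carol"),
               ("bob", "carol"), ("carol", "dave")]
g.add_edges_from(friendships)

print(g.number_of_nodes(), "users,", g.number_of_edges(), "links")
# Degree centrality: a simple measure of how connected each user is.
print(nx.degree_centrality(g))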

Data Mining- A Process


No matter what sort of social media is being studied, some fundamentals are essential to
consider so that the most meaningful outcomes are possible. Every kind of social media and every data
mining purpose applied to social media may involve distinctive methods and algorithms to
produce an advantage from data mining. Different data sets and data problems call for different
kinds of tools. If it is known how to organize the data, a classification tool might be appropriate.
If we understand what the data is about but cannot determine trends and patterns in it,
a clustering tool may be the best choice.

The problem itself often determines the best approach. There is no substitute for understanding
the data as much as possible before applying data mining techniques, as well as understanding the
various data mining tools that are available. A subject-matter analyst might be required to help better
understand the data set. To better understand the various tools available for data mining, there
are a host of data mining and machine learning texts and other resources that
provide more detailed information about a variety of particular data mining techniques and
algorithms.

Once you understand the issues and select an appropriate data mining approach, consider any
preprocessing that needs to be done. A systematic process may also be required to develop an
adequate set of data that allows reasonable processing times. Pre-processing should include
suitable privacy protection mechanisms. Although social media platforms incorporate huge
amounts of openly accessible data, it is important to guarantee that individual rights and social
media platform copyrights are protected. The effect of spam should be considered, along with the
temporal representation of the data.

In addition to preprocessing, it is essential to think about the effect of time. Depending upon
the inquiry and the research, we may get different outcomes at one time compared to another.
Although the time dimension is an obvious consideration for specific areas such as
topic detection, influence propagation, and network development, less evident is the effect of
time on network identification, group behavior, and marketing. What defines a network at one
point in time can be significantly different at another point in time. Group behavior and
interests will change over time, and what appeals to individuals or groups today
may not be trendy tomorrow.

With data depicted as a graph, the task starts with a selected number of nodes, known as seeds.
Graphs are traversed starting from the set of seeds; as the link structure from the
seed nodes is followed, data is collected, and the structure itself is also recorded. Utilizing the link
structure to stretch out from the seed set and gather new information is known as crawling the
network. The applications and algorithms that are executed as a crawler should effectively
manage the challenges present in dynamic social media platforms, such as restricted sites,
format changes, and structure errors (invalid links). As the crawler finds new data, it stores
that data in a repository for further analysis, and as link data is found, the crawler updates the
data about the network structure.

Some social media platforms such as Facebook, Twitter, and Technorati provide Application
Programmer Interfaces (APIs) that allow crawler applications to interact with the data sources
directly. However, these platforms usually restrict the number of API transactions per day,
depending on the affiliation the API user has with the platform. For some platforms, it is possible to
collect data (crawl) without utilizing APIs. Given the huge size of the social media data
available, it might be necessary to restrict the amount of data that the crawler collects. When
the crawler has collected the data, some postprocessing may be needed to validate and clean
up the data. Traditional social network analysis methods can then be applied, for
example, centrality measures and group structure studies. In many cases, additional data will be
associated with a node or a link, opening opportunities for more complex methods that consider the
deeper semantics that can be exposed with text and data mining techniques.
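The sketch below illustrates the crawling idea described above: starting from a set of seed nodes and expanding along the link structure while recording what is found. It assumes a hypothetical helper get_neighbors(node) that returns the links of a node; in practice this would wrap a platform API or a page scraper and would have to respect the rate limits mentioned above.

from collections import deque

def get_neighbors(node):
    # Hypothetical helper: in a real crawler this would call a platform
    # API or parse a page to return the nodes linked from `node`.
    raise NotImplementedError

def crawl(seeds, max_nodes=1000):
    # Breadth-first crawl of the network starting from the seed set.
    visited, queue, edges = set(seeds), deque(seeds), []
    while queue and len(visited) < max_nodes:
        node = queue.popleft()
        for neighbor in get_neighbors(node):
            edges.append((node, neighbor))   # record the link structure
            if neighbor not in visited:      # store newly found nodes
                visited.add(neighbor)
                queue.append(neighbor)
    return visited, edges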

We now focus on two particular kinds of social media data to further illustrate how data
mining techniques are applied to social media sites. The two major areas are social networking
platforms and blogs; both are powerful and rich data sources. These two
areas offer potential value to the broader scientific community as well as to business
organizations.

Social media platforms: Illustrative Examples


Social media platforms like Facebook or LinkedIn comprise connected users with unique
profiles. Users can interact with their friends and family and can share news, photos, stories,
videos, favorite links, etc. Users have the option to customize their profiles depending on individual
preferences, but some common data may include relationship status, birthday, email
address, and hometown. Users can choose how much data they include in their
profile and who has access to it. The amount of data accessible via social media platforms has
raised privacy concerns and is a related societal issue.

The figure illustrates a hypothetical graph structure for a typical social media platform; arrows indicate links to a larger part of the graph.

It is important to secure personal identity when working with social media platform data.
Recent reports highlight the need to secure privacy, as it has been demonstrated that even
anonymizing this sort of data can still reveal personal information when advanced data analysis
techniques are utilized. Privacy settings can also restrict the ability of data mining applications to
consider all of the data on social media platforms; however, malicious techniques can be
utilized to circumvent those settings.

Clustering in Data Mining


Clustering is an unsupervised machine learning technique that groups data points into clusters
so that similar objects end up in the same group.

Clustering helps to split data into several subsets. Each of these subsets contains data similar to
each other, and these subsets are called clusters.
Let's understand this with an example: suppose we are a market manager, and we have a new
tempting product to sell. We are sure that the product would bring enormous profit, as long as
it is sold to the right people. So, how can we tell who is best suited for the product from our
company's huge customer base? Once the data from our customer base is divided into clusters,
we can make an informed decision about who we think is best suited for this product.
Clustering, falling under the category of unsupervised machine learning, is one of the
problems that machine learning algorithms solve.

Clustering utilizes only the input data to determine patterns, anomalies, or similarities within that
data.

A good clustering algorithm aims to obtain clusters whose:

o The intra-cluster similarity is high, which implies that the data present inside a cluster is similar
to one another.
o The inter-cluster similarity is low, which means each cluster holds data that is not similar to the
data in other clusters (a small numerical sketch of these two properties follows this list).
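As a rough numerical illustration of these two properties, the snippet below (using made-up two-dimensional points, not a formal validity measure) computes the average pairwise distance within each cluster and the distance between the cluster centers.

import numpy as np

cluster_a = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]])
cluster_b = np.array([[8.0, 8.0], [8.3, 7.9], [7.8, 8.2]])

def avg_pairwise_distance(points):
    # Mean distance between all pairs of points in one cluster
    # (a small value corresponds to high intra-cluster similarity).
    dists = [np.linalg.norm(p - q)
             for i, p in enumerate(points) for q in points[i + 1:]]
    return sum(dists) / len(dists)

print("intra-cluster distance A:", avg_pairwise_distance(cluster_a))
print("intra-cluster distance B:", avg_pairwise_distance(cluster_b))
# Distance between cluster centers (a large value corresponds to low inter-cluster similarity).
print("inter-cluster distance A-B:",
      np.linalg.norm(cluster_a.mean(axis=0) - cluster_b.mean(axis=0)))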

What is a Cluster?
o A cluster is a subset of similar objects.
o A subset of objects such that the distance between any two objects in the cluster is less
than the distance between any object in the cluster and any object not located inside it.
o A connected region of a multidimensional space with a comparatively high density of objects.

What is clustering in Data Mining?


o Clustering is the method of converting a group of abstract objects into classes of similar objects.
o Clustering is a method of partitioning a set of data or objects into a set of significant subclasses
called clusters.
o It helps users to understand the structure or natural grouping in a data set and is used either as a
stand-alone instrument to get a better insight into data distribution or as a pre-processing step
for other algorithms.

Important points:

o Data objects of a cluster can be considered as one group.


o While doing cluster analysis, we first partition the data set into groups based on data
similarities and then assign labels to the groups.
o The main advantage of clustering over classification is that it is adaptable to modifications and
helps single out important characteristics that differentiate distinct groups.

Applications of cluster analysis in data mining:


o Clustering analysis is widely used in many applications, such as data analysis, market research,
pattern recognition, and image processing.
o It assists marketers in finding distinct groups in their client base, and based on purchasing
patterns, they can characterize their customer groups.
o It helps in classifying documents on the internet for information discovery.
o Clustering is also used in outlier detection applications such as the detection of credit card fraud.
o As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of
data and to analyze the characteristics of each cluster.
o In terms of biology, it can be used to determine plant and animal taxonomies, categorize
genes with the same functionalities, and gain insight into structures inherent in populations.
o It helps in the identification of areas of similar land use in an earth observation
database and in the identification of house groups in a city according to house type, value, and
geographical location.

Why is clustering used in data mining?


Clustering analysis has been an evolving problem in data mining due to its variety of
applications. The advent of various data clustering tools in the last few years and their
comprehensive use in a broad range of applications, including image processing, computational
biology, mobile communication, medicine, and economics, has contributed to the popularity of
these algorithms. The main issue with data clustering algorithms is that they cannot be
standardized. An advanced algorithm may give the best results with one type of data set but
may fail or perform poorly with other kinds of data sets. Although many efforts have been made
to standardize algorithms that can perform well in all situations, no significant breakthrough
has been achieved so far. Many clustering tools have been proposed, but each
algorithm has its own advantages and disadvantages and cannot work in all real situations.

1. Scalability:

Scalability in clustering implies that as we boost the number of data objects, the time to
perform clustering should grow approximately in line with the complexity order of the algorithm. For
example, if we perform K-means clustering, we know it is O(n), where n is the number of
objects in the data. If we raise the number of data objects 10-fold, then the time taken to
cluster them should also increase approximately 10 times; that is, there should be a linear
relationship. If that is not the case, then there is some error in our implementation.
The algorithm should be scalable; if it is not, we cannot get appropriate results on large data sets in
reasonable time. The figure illustrates a graphical example where poor scalability may lead to the wrong result.
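A minimal sketch of this linear-scaling point, using scikit-learn's KMeans on synthetic data. The exact timings are illustrative only and depend on hardware and implementation.

import time
import numpy as np
from sklearn.cluster import KMeans

def time_kmeans(n_objects, k=5):
    # Generate n_objects random 2-D points and time a K-means run on them.
    x = np.random.rand(n_objects, 2)
    start = time.time()
    KMeans(n_clusters=k, n_init=10).fit(x)
    return time.time() - start

# If clustering scales roughly linearly in n, the second run should take
# on the order of ten times as long as the first.
print("10,000 objects:", time_kmeans(10_000), "seconds")
print("100,000 objects:", time_kmeans(100_000), "seconds")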

2. Interpretability:

The outcomes of clustering should be interpretable, comprehensible, and usable.

3. Discovery of clusters with attribute shape:

The clustering algorithm should be able to find arbitrarily shaped clusters. It should not be
limited to distance measures that tend to discover only small spherical clusters.

4. Ability to deal with different types of attributes:

Algorithms should be capable of being applied to any data such as data based on intervals
(numeric), binary data, and categorical data.

5. Ability to deal with noisy data:

Databases contain data that is noisy, missing, or incorrect. Some algorithms are sensitive to such
data and may produce poor-quality clusters.

6. High dimensionality:

The clustering tools should be able to handle not only high-dimensional data but also
low-dimensional data.
Text Data Mining
Text data mining can be described as the process of extracting essential data from natural
language text. All the data that we generate via text messages, documents, emails, and files is
written in natural language text. Text mining is primarily used to draw useful insights or
patterns from such data.

The text mining market has experienced exponential growth and adoption over the last few
years and is also expected to see significant growth and adoption in the coming years. One of
the primary reasons behind the adoption of text mining is higher competition in the business
market, with many organizations seeking value-added solutions to compete with other
organizations. With increasing competition in business and changing customer perspectives,
organizations are making huge investments to find solutions that are capable of analyzing
customer and competitor data to improve competitiveness. The primary sources of data are e-
commerce websites, social media platforms, published articles, surveys, and many more. The
larger part of the generated data is unstructured, which makes it challenging and expensive for
organizations to analyze with human effort alone. This challenge, combined with the
exponential growth in data generation, has led to the growth of analytical tools that are not only
able to handle large volumes of text data but also help in decision-making. Text
mining software empowers a user to draw useful information from a huge set of available data
sources.

Areas of text mining in data mining:


The following are the areas of text mining:

o Information Extraction:
The automatic extraction of structured data such as entities, entities relationships, and attributes
describing entities from an unstructured source is called information extraction.
o Natural Language Processing:
NLP stands for Natural Language Processing. It enables computer software to understand human
language as it is spoken. NLP is primarily a component of artificial intelligence (AI). The
development of NLP applications is difficult because computers generally expect humans to
"speak" to them in a programming language that is precise, unambiguous, and highly structured.
Human speech, however, is usually not precise, as it depends on many complex variables,
including slang, social context, and regional dialects.
o Data Mining:
Data mining refers to the extraction of useful data and hidden patterns from large data sets. Data
mining tools can predict behaviors and future trends that allow businesses to make better data-
driven decisions. Data mining tools can be used to resolve many business problems that have
traditionally been too time-consuming to solve manually.
o Information Retrieval:
Information retrieval deals with retrieving useful data from data that is stored in our systems.
As an analogy, we can view the search engines found on websites such as e-commerce sites
(or any other sites) as part of information retrieval.
Text Mining Process:
The text mining process incorporates the following steps to extract the data from the
document.

o Text transformation
A text transformation is a technique used to control the capitalization of the text and convert the
document into a standard representation. The two major ways of document representation are:

a. Bag of words

b. Vector space

o Text Pre-processing
Pre-processing is a significant task and a critical step in Text Mining, Natural Language Processing
(NLP), and information retrieval(IR). In the field of text mining, data pre-processing is used for
extracting useful information and knowledge from unstructured text data. Information Retrieval
(IR) is a matter of choosing which documents in a collection should be retrieved to fulfill the
user's need.
o Feature selection:
Feature selection is a significant part of data mining. Feature selection can be defined as the
process of reducing the input of processing or finding the essential information sources. The
feature selection is also called variable selection.
o Data Mining:
Now, in this step, the text mining procedure merges with the conventional process. Classic Data
Mining procedures are used in the structural database.
o Evaluate:
Afterward, the results are evaluated. Once the result has been evaluated, it is either put to use or
discarded.
o Applications:
The following are the main text mining applications:
o Risk Management:
Risk management is a systematic and logical procedure of identifying, analyzing, treating, and
monitoring the risks involved in any action or process in an organization. Insufficient risk analysis is
usually a leading cause of failure. This is particularly true in financial organizations,
where the adoption of risk management software based on text mining technology can effectively
enhance the ability to diminish risk. It enables the administration of millions of sources and
petabytes of text documents and gives the ability to connect the data, helping to access the
appropriate information at the right time.
o Customer Care Service

Text mining methods, particularly NLP, are finding increasing significance in the field of customer
care. Organizations are investing in text analytics software to improve their overall customer
experience by accessing textual data from different sources such as customer feedback,
surveys, customer calls, etc. The primary objective of text analysis is to reduce the response time
of the organization and help address customer complaints rapidly and productively.

o Business Intelligence

Companies and business firms have started to use text mining strategies as a major aspect of
their business intelligence. Besides providing significant insights into customer behavior and
trends, text mining strategies also help organizations analyze the strengths and weaknesses
of their competitors, giving them a competitive advantage in the market.

o Social Media Analysis:

Social media analysis helps to track online data, and there are numerous text mining tools
designed particularly for performance analysis of social media sites. These tools help to monitor
and interpret the text generated on the internet from news, emails, blogs, etc. Text mining
tools can precisely analyze the total number of posts, followers, and likes of your brand on
a social media platform, which enables you to understand the response of the individuals who are
interacting with your brand and content.
Text Mining Approaches in Data Mining

The following text mining approaches are used in data mining.

1. Keyword-based Association Analysis

This approach collects sets of keywords or terms that often occur together and then discovers the
association relationships among them. First, it preprocesses the text data by parsing, stemming,
removing stop words, etc. Once the data has been pre-processed, it applies association mining
algorithms. Here, human effort is not required, so the number of unwanted results and the
execution time are reduced.
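A minimal sketch of the co-occurrence idea behind keyword-based association analysis. The documents are made up, and a real system would add stemming, stop-word removal, and a proper association-rule miner (for example, an Apriori-style algorithm) on top of this counting step.

from collections import Counter
from itertools import combinations

docs = ["data mining finds patterns",
        "text mining finds patterns in text",
        "data warehouses store data"]

pair_counts = Counter()
for doc in docs:
    terms = sorted(set(doc.split()))
    # Count every unordered pair of terms appearing in the same document.
    pair_counts.update(combinations(terms, 2))

# Frequently co-occurring pairs are candidates for association relationships.
print(pair_counts.most_common(3))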

2. Document Classification Analysis

Automatic document classification:

This analysis is used for the automatic classification of huge numbers of online text
documents like web pages, emails, etc. Text document classification differs from the classification
of relational data, as document databases are not organized according to attribute-value pairs.

Numericizing text

o Stemming algorithms

A significant pre-processing step before the indexing of input documents is the stemming of
words. The term "stemming" can be defined as a reduction of words to their root forms so that, for
example, different grammatical forms of a word are treated as the same. The primary purpose of
stemming is to ensure that similar words are recognized as such by the text mining program.

o Support for different languages

There are some highly language-dependent operations, such as stemming, handling synonyms, and
determining the letters that are allowed in words. Therefore, support for various languages is important.

o Exclude certain characters

Excluding numbers, specific characters, series of characters, or words that are shorter or longer
than a specific number of letters can be done before the indexing of the input documents.

o Include lists, exclude lists (stop-words)

A particular list of words to be indexed can be defined, which is useful when we want to search
for specific words and classify the input documents based on the frequencies with which
those words occur. Additionally, "stop words," meaning terms that are to be excluded from the
indexing, can be defined. Normally, a default list of English stop words includes "the,"
"a," "since," etc. These words are used very often in the respective language but communicate
very little information in the document. (A small pre-processing sketch that combines stemming,
stop-word removal, and a bag-of-words count follows this list.)
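The sketch below ties these pre-processing steps together: lower-casing, stop-word removal, a crude suffix-stripping "stemmer," and a bag-of-words count. The stop-word list and the suffix rules are illustrative only; a real system would use a full stop-word list and a proper stemmer such as Porter's algorithm.

from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "are", "since", "and", "of"}

def crude_stem(word):
    # A very rough stand-in for a real stemmer: strip a few common
    # suffixes so related word forms map to the same root.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def bag_of_words(text):
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    tokens = [crude_stem(t) for t in tokens if t and t not in STOP_WORDS]
    return Counter(tokens)

print(bag_of_words("The miners are mining the mined data since mining is useful."))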

Bagging Vs Boosting

We all use decision-tree-like reasoning in day-to-day life to make decisions.
Organizations use supervised machine learning techniques
like decision trees to make better decisions and to generate more surplus
and profit.

Ensemble methods combine different decision trees to deliver better
predictive results than utilizing a single decision tree. The primary
principle behind the ensemble model is that a group of weak learners come
together to form a strong learner.

The two techniques given below are used to build an ensemble of
decision trees.

Bagging

Bagging is used when our objective is to reduce the variance of a decision
tree. Here the concept is to create several subsets of data from the training
sample, chosen randomly with replacement. Each subset of
data is then used to train its own decision tree; thus, we end up with an
ensemble of various models. The average of all the predictions from the
numerous trees is used, which is more robust than a single decision tree.

Random Forest is an extension of bagging. It takes one additional step: in
addition to taking a random subset of the data, it also makes a random selection of
features rather than using all features to develop the trees. When we have
numerous random trees, the result is called a Random Forest.

The following steps are taken to implement a Random Forest:

o Let us consider X observations and Y features in the training data set.
First, a sample from the training data set is taken randomly with
replacement.
o A tree is developed to its largest extent.
o The given steps are repeated n times, and the final prediction is based
on the collection of predictions from the n trees.

Advantages of using Random Forest technique:

o It manages higher-dimensional data sets very well.
o It handles missing values and maintains accuracy for missing data.

Disadvantages of using Random Forest technique

Since the final prediction depends on the mean of the predictions from the subset trees,
it won't give precise continuous values for regression problems.
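A minimal sketch of bagging via a random forest, using scikit-learn on a toy data set. The parameter values are illustrative, not recommendations.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

# Each of the 100 trees is trained on a bootstrap sample (a random draw with
# replacement) and considers a random subset of features at each split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(x_train, y_train)
print("test accuracy:", forest.score(x_test, y_test))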

Boosting

Boosting is another ensemble procedure for making a collection of predictors. In other words, we


fit consecutive trees, usually on random samples, and at each step the objective is to reduce the net
error from the prior trees.

If a given input is misclassified by a hypothesis, its weight is increased so that the next
hypothesis is more likely to classify it correctly; combining the entire set at the end converts
weak learners into a better-performing model.

Gradient Boosting is an expansion of the boosting procedure.

1. Gradient Boosting = Gradient Descent + Boosting

It utilizes a gradient descent algorithm that can optimize any differentiable loss function. An
ensemble of trees is built one at a time, and the individual trees are summed successively. Each
new tree tries to recover the loss (the difference between actual and predicted values) left by the
previous trees.

Advantages of using Gradient Boosting methods:

o It supports different loss functions.


o It works well with interactions.

Disadvantages of using Gradient Boosting methods:

o It requires cautious tuning of different hyper-parameters.
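A comparable sketch for boosting, again with scikit-learn on toy regression data. The learning_rate and n_estimators values are exactly the kind of hyper-parameters that need the careful tuning mentioned above.

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

x, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

# Trees are added one at a time; each new tree is fit to the residual error
# (the loss) left by the trees built so far, following the gradient of the loss.
model = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05, max_depth=3)
model.fit(x_train, y_train)
print("R^2 on test data:", model.score(x_test, y_test))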


Difference between Bagging and Boosting

Bagging: Various training data subsets are randomly drawn with replacement from the whole training dataset.
Boosting: Each new subset contains the components that were misclassified by previous models.

Bagging: Bagging attempts to tackle the over-fitting issue.
Boosting: Boosting tries to reduce bias.

Bagging: If the classifier is unstable (high variance), then we need to apply bagging.
Boosting: If the classifier is steady and straightforward (high bias), then we need to apply boosting.

Bagging: Every model receives an equal weight.
Boosting: Models are weighted by their performance.

Bagging: The objective is to decrease variance, not bias.
Boosting: The objective is to decrease bias, not variance.

Bagging: It is the easiest way of combining predictions that belong to the same type.
Boosting: It is a way of combining predictions that belong to different types.

Bagging: Every model is constructed independently.
Boosting: New models are affected by the performance of previously developed models.

Data Mining Vs Data Warehousing

Data warehousing refers to the process of compiling and organizing data into
one common database, whereas data mining refers to the process of
extracting useful data from the databases. The data mining process depends
on the data compiled in the data warehousing phase to recognize meaningful
patterns. A data warehouse is created to support management systems.

Data Warehouse

A data warehouse refers to a place where data can be stored for useful
mining. It is like a quick computer system with exceptionally huge data
storage capacity. Data from the organization's various systems is copied to
the warehouse, where it can be fetched and cleaned to remove errors.
Here, advanced queries can be made against the warehouse's store of
data.

A data warehouse combines data from numerous sources, which ensures data quality, accuracy,
and consistency. A data warehouse boosts system performance by separating analytics processing
from transactional databases. Data flows into a data warehouse from different databases. A
data warehouse works by organizing data into a schema that describes the format and types of
data. Query tools then examine the data tables using this schema.

Data warehouses and databases are both relational data systems, but they are built to serve
different purposes. A data warehouse is built to store a huge amount of historical data and
empowers fast queries over all the data, typically using Online Analytical Processing (OLAP).
A database is made to store current transactions and allow quick access to specific transactions
for ongoing business processes, commonly known as Online Transaction Processing (OLTP).

Important Features of Data Warehouse

The Important features of Data Warehouse are given below:

1. Subject Oriented

A data warehouse is subject-oriented. It provides useful data about a subject instead of the
company's ongoing operations, and these subjects can be customers, suppliers, marketing,
product, promotion, etc. A data warehouse usually focuses on modeling and analysis of data
that helps the business organization to make data-driven decisions.

2. Time-Variant:

The different data present in the data warehouse provides information for a specific period.

3. Integrated

A data warehouse is built by integrating data from heterogeneous sources, such as relational
databases, flat files, etc.

4. Non- Volatile

It means that once data is entered into the warehouse, it cannot be changed.

Advantages of Data Warehouse:


o More accurate data access
o Improved productivity and performance
o Cost-efficient
o Consistent and quality data

Data Mining:
Data mining refers to the analysis of data. It is the computer-supported process of analyzing
huge sets of data that have either been compiled by computer systems or have been
downloaded into the computer. In the data mining process, the computer analyzes the data and
extracts useful information from it. It looks for hidden patterns within the data set and tries to
predict future behavior. Data mining is primarily used to discover and indicate relationships
among the data sets.
Data mining aims to enable business organizations to view business behaviors, trends, and
relationships that allow the business to make data-driven decisions. It is also known as
Knowledge Discovery in Databases (KDD). Data mining tools utilize AI, statistics, databases, and
machine learning systems to discover relationships between the data. Data mining tools can
help answer business questions that have traditionally been too time-consuming to resolve.

Important features of Data Mining:

The important features of Data Mining are given below:

o It utilizes the Automated discovery of patterns.


o It predicts the expected results.
o It focuses on large data sets and databases
o It creates actionable information.

Advantages of Data Mining:

i. Market Analysis:

Data mining can predict market trends, which helps the business make decisions. For example,
it predicts who is keen to purchase what type of product.

ii. Fraud detection:

Data Mining methods can help to find which cellular phone calls, insurance claims, credit, or
debit card purchases are going to be fraudulent.

iii. Financial Market Analysis:

Data mining techniques are widely used to help model financial markets.

iv. Trend Analysis:

Analyzing current trends in the marketplace is a strategic benefit because it helps in
cost reduction and in aligning the manufacturing process with market demand.

Differences between Data Mining and Data Warehousing:
Data Mining: Data mining is the process of determining data patterns.
Data Warehousing: A data warehouse is a database system designed for analytics.

Data Mining: Data mining is generally considered the process of extracting useful data from a large set of data.
Data Warehousing: Data warehousing is the process of combining all the relevant data.

Data Mining: Business entrepreneurs carry out data mining with the help of engineers.
Data Warehousing: Data warehousing is entirely carried out by the engineers.

Data Mining: In data mining, data is analyzed repeatedly.
Data Warehousing: In data warehousing, data is stored periodically.

Data Mining: Data mining uses pattern recognition techniques to identify patterns.
Data Warehousing: Data warehousing is the process of extracting and storing data to allow easier reporting.

Data Mining: One of the most amazing data mining techniques is the detection and identification of unwanted errors that occur in the system.
Data Warehousing: One of the advantages of the data warehouse is its ability to update frequently. That is the reason why it is ideal for business entrepreneurs who want to stay up to date with the latest developments.

Data Mining: The data mining techniques are cost-efficient as compared to other statistical data applications.
Data Warehousing: The responsibility of the data warehouse is to simplify every type of business data.

Data Mining: The data mining techniques are not 100 percent accurate and may lead to serious consequences in certain conditions.
Data Warehousing: In the data warehouse, there is a high possibility that the data required for analysis by the company may not be integrated into the warehouse. This can simply lead to loss of data.

Data Mining: Companies can benefit from this analytical tool by equipping suitable and accessible knowledge-based data.
Data Warehousing: A data warehouse stores a huge amount of historical data that helps users analyze different periods and trends to make future predictions.

Data Mining in Social Media

Social media is a great source of information and a perfect platform for communication.
Businesses and individuals can make the best of it, beyond only sharing their photos and
videos on the platform. The platform gives its users the freedom to connect with their target
group easily and effectively. Whether a small group or an established business, both face difficulties
in keeping up with the competitive social media industry; but through social media platforms,
users can market and develop their brand or content and share it with others.

Social media mining combines social media platforms, social network analysis, and data mining
to provide a convenient and consistent platform for learners, professionals, scientists, and
project managers to understand the fundamentals and potential of social media mining. It
addresses various problems arising from social media data and presents fundamental concepts,
emerging issues, and effective algorithms for data mining and network analysis. It includes
multiple degrees of difficulty that enhance knowledge and help in applying ideas, principles,
and techniques in distinct social media mining situations.

As per the "Global Digital Report," the total number of active users on social media platforms
worldwide in 2019 is 2.41 billion and increases up to 9 % year-on-year. With the universal use
of Social media platforms via the internet, a huge amount of data is accessible. Social media
platforms include many fields of study, such as sociology, business, psychology, entertainment,
politics, news, and other cultural aspects of societies. Applying data mining to social media can
provide exciting views on human behavior and human interaction. Data mining can be used in
combination with social media to understand user's opinions about a subject, identifying a
group of individuals among the masses of a population, to study group modifications over time,
find influential people, or even suggest a product or activity to an individual.
For example, the 2008 presidential election marked an unprecedented use of social
media platforms in the United States. Social media platforms, including Facebook and YouTube,
played a vital role in raising funds and getting candidates' messages to voters. Researchers
extracted blog data to demonstrate correlations between the amount of social media
used by candidates and the outcome of the 2008 presidential campaign.

This example emphasizes the potential of mining social media data to forecast
results at the national level. Data mining social media can also produce personal and corporate
benefits.

Social media mining is closely related to social computing. Social computing is defined as "any computing
application where software is used as an intermediary or centre for a social relationship." Social
computing involves applications used for interpersonal communication as well as applications and
research activities related to "computational social studies" or social behavior.

Social media platforms refer to various kinds of information services used collaboratively by
many people and can be placed into the subcategories shown below.

Category: Examples

Blogs: Blogger, LiveJournal, WordPress

Social news: Digg, Slashdot

Social bookmarking: Delicious, StumbleUpon

Social networking platforms: Facebook, LinkedIn, Myspace, Orkut

Microblogs: Twitter, GoogleBuzz

Opinion mining: Epinions, Yelp

Photo and video sharing: Flickr, YouTube

Wikis: Scholarpedia, Wikihow, Wikipedia, Event

With popular traditional media such as radio, newspapers, and television, communication is
entirely one-way, flowing from the media source or advertiser to the mass of media
consumers. Web 2.0 technologies and modern social media platforms have changed the scene,
moving from one-way media communication driven by media providers to an environment where
almost anyone can publish written, audio, video, or image content to the masses.

This media environment is significantly changing the way businesses communicate with their
clients. It provides unprecedented opportunities for individuals to interact with a
huge number of people at a very low cost. The relationships present online and expressed through
social media platforms form digitized data sets of social relationships on an unprecedented scale. The
resulting data offers rich opportunities for sociology and insights into consumer behavior and
marketing, among a host of applications in related fields.

The growth and number of users on social media platforms are incredible. For example,
consider the most tempting social networking site, Facebook. Facebook reached over
400 million active users during its first six years of operation, and it has been growing
exponentially. The given figure illustrates the exponential growth of Facebook over its first six
years. As per the report, Facebook is ranked 2nd in the world among websites based on the daily
traffic and engagement of users on the site.

The broad use of social media platforms is not limited to one geographical region of the world.
Orkut, a popular social networking platform operated by Google, has most of its users outside
the United States, and the use of social media among internet users is now mainstream
in many parts of the globe, including Asia, Africa, Europe, South America, and the
Middle East. Social media also drives significant changes in companies, and businesses need to
decide on their policies to keep pace with this new medium.

Motivations for Data Mining in Social Media:


The data accessible through social media platforms can give us insights into social networks and
societies on a scale and to an extent that had not been feasible previously. This digital media can
overcome the limitations of the physical world in studying human relationships and help measure
popular social and political beliefs in regional communities without dedicated studies. Social
media records viral marketing trends efficiently and is an ideal source for better understanding and
leveraging influence mechanisms. However, it is quite difficult to gain valuable information
from social networking site data without implementing data mining techniques, due to specific
challenges.

Data mining techniques can assist effectively in dealing with the three primary challenges of
social media data. First, social media data sets are large. Consider the example of the most
popular social media platform, Facebook, with 2.41 billion active users. Without automated data
processing, analyzing social media data becomes infeasible in any
reasonable time frame.

Second, social media data sets can be noisy. For example, spam blogs are abundant
in the blogosphere, as are unimportant tweets on Twitter.
Third, data from online social media platforms is dynamic; frequent modifications and updates
over short periods are not only common but also a significant aspect to consider in dealing with
social media data.

Applying data mining methods to these huge data sets can improve search results for everyday
search engines, enable targeted marketing for businesses, help psychologists study
behavior, personalize consumer web services, provide new insights into social structure for
sociologists, and help identify and prevent spam for all of us.

Moreover, open access to data offers an unprecedented amount of information for researchers to
improve efficiency and optimize data mining techniques. The progress of data mining depends
on huge data sets, and social media is an ideal data source at the cutting edge of data mining for
developing and testing new data mining techniques for academic and industrial data mining
analysts.

Data Mining Bayesian Classifiers


In numerous applications, the connection between the attribute set and the class variable is
non-deterministic. In other words, the class label of a test record cannot be predicted
with certainty even though its attribute set is the same as that of some of the training examples. These
circumstances may emerge due to noisy data or the presence of certain confounding factors
that influence classification but are not included in the analysis. For example, consider the task
of predicting whether an individual is at risk of liver illness based on the
individual's eating habits and exercise. Although most people who eat healthily and
exercise consistently have a lower probability of developing liver disease, they may still do so
due to other factors, for example, consumption of high-calorie street food or
alcohol abuse. Determining whether an individual's eating routine is healthy or their workout
is sufficient is also subject to interpretation, which in turn may introduce uncertainties into
the learning problem.

Bayesian classification uses Bayes' theorem to predict the probability of an event. Bayesian
classifiers are statistical classifiers built on the Bayesian understanding of probability. The theorem
expresses how a degree of belief, expressed as a probability, should change to account for new evidence.

Bayes' theorem is named after Thomas Bayes, who first utilized conditional probability
to provide an algorithm that uses evidence to calculate limits on an unknown parameter.

Bayes' theorem is expressed mathematically by the following equation:

P(X/Y) = P(Y/X) P(X) / P(Y)

where X and Y are events and P(Y) ≠ 0.

P(X/Y) is a conditional probability that describes the occurrence of event X is given that Y is
true.

P(Y/X) is a conditional probability that describes the occurrence of event Y is given that X is
true.

P(X) and P(Y) are the probabilities of observing X and Y independently of each other. This is
known as the marginal probability.

Bayesian interpretation:

In the Bayesian interpretation, probability measures a "degree of belief." Bayes' theorem


connects the degree of belief in a hypothesis before and after accounting for evidence. For
example, let us consider the example of a coin. If we toss a coin, then we get either heads or
tails, and the probability of occurrence of either heads or tails is 50 percent. If the coin is flipped a
number of times and the outcomes are observed, the degree of belief may rise, fall, or remain
the same depending on the outcomes.

For proposition X and evidence Y,

o P(X), the prior, is the initial degree of belief in X.

o P(X/Y), the posterior, is the degree of belief having accounted for Y.

o The quotient P(Y/X)/P(Y) represents the support Y provides for X.

Bayes' theorem can be derived from the definition of conditional probability:

P(X/Y) = P(X⋂Y) / P(Y) and P(Y/X) = P(X⋂Y) / P(X)

where P(X⋂Y) is the joint probability of both X and Y being true. Because both expressions contain
the same joint probability P(X⋂Y), equating them and rearranging yields Bayes' theorem.
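As a small numerical sketch of the theorem, the snippet below plugs in invented probabilities (they are illustrative only, not taken from any data set).

# Hypothetical numbers: 1% of records belong to class X (the prior),
# evidence Y is seen in 80% of class-X records and in 10% of all records.
p_x = 0.01          # P(X), the prior
p_y_given_x = 0.80  # P(Y/X)
p_y = 0.10          # P(Y), the marginal probability of the evidence

# Bayes' theorem: P(X/Y) = P(Y/X) * P(X) / P(Y)
p_x_given_y = p_y_given_x * p_x / p_y
print("posterior P(X/Y) =", p_x_given_y)   # prints 0.08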
Bayesian network:

A Bayesian network falls under the category of Probabilistic Graphical Models (PGMs), which are
used to compute uncertainties using the concept of probability.
Generally known as belief networks, Bayesian networks model uncertainties
using Directed Acyclic Graphs (DAGs).

A Directed Acyclic Graph is used to show a Bayesian Network, and like some other statistical
graph, a DAG consists of a set of nodes and links, where the links signify the connection
between the nodes.

The nodes here represent random variables, and the edges define the relationship between
these variables.

A DAG models the uncertainty of an event taking place based on the Conditional Probability
Distribution (CPD) of each random variable. A Conditional Probability Table (CPT) is used to
represent the CPD of each variable in the network.
Data Mining- World Wide Web

Over the last few years, the World Wide Web has become a significant source of information
and simultaneously a popular platform for business. Web mining can be defined as the method of
utilizing data mining techniques and algorithms to extract useful information directly from the
web, such as web documents and services, hyperlinks, web content, and server logs. The World
Wide Web contains a large amount of data that provides a rich source for data mining. The
objective of web mining is to look for patterns in web data by collecting and examining data in
order to gain insights.

What is Web Mining?


Web mining can broadly be seen as the application of adapted data mining techniques to the
web, whereas data mining is defined as the application of algorithms to discover patterns in
mostly structured data embedded in a knowledge discovery process. Web mining has the
distinctive property of providing a set of various data types. The web has multiple aspects that
yield different approaches for the mining process: web pages consist of text, web pages
are linked via hyperlinks, and user activity can be monitored via web server logs. These three
features lead to the differentiation of three areas: web content mining, web
structure mining, and web usage mining.
There are three types of web mining:

1. Web Content Mining:

Web content mining can be used to extract useful data, information, and knowledge from web
page content. In web content mining, each web page is considered as an individual document.
The miner can take advantage of the semi-structured nature of web pages, as HTML
provides information that concerns not only the layout but also the logical structure. The primary
task of content mining is data extraction, where structured data is extracted from unstructured
websites. The objective is to facilitate data aggregation over various web sites by using the
extracted structured data. Web content mining can be utilized to distinguish topics on the web.
For example, if a user searches for a specific topic on a search engine, the user will get
a list of suggestions.

2. Web Structured Mining:

Web structure mining can be used to discover the link structure of hyperlinks. It is used to
identify how web pages are linked to one another within the direct link network. In web structure
mining, the web is considered as a directed graph, with the web pages being the vertices that are
connected by hyperlinks. The most important application in this regard is the Google search
engine, which estimates the ranking of its results primarily with the PageRank algorithm. It
characterizes a page as exceptionally relevant when it is frequently linked to by other highly
relevant pages. Structure and content mining methodologies are usually combined. For example,
web structure mining can be beneficial to organizations wanting to examine the link network between two
commercial sites.
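A simplified sketch of the PageRank idea described above: power iteration over a tiny made-up link graph. Real search engines use far more elaborate versions, and the graph, damping factor, and iteration count here are illustrative assumptions only.

# Tiny hypothetical web graph: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}
damping = 0.85

for _ in range(50):  # power iteration until the ranks roughly stabilize
    new_rank = {p: (1 - damping) / len(pages) for p in pages}
    for page, outlinks in links.items():
        share = damping * rank[page] / len(outlinks)
        for target in outlinks:
            new_rank[target] += share  # a page gains rank from pages linking to it
    rank = new_rank

print(rank)  # pages linked to by important pages end up with higher scores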

3. Web Usage Mining:

Web usage mining is used to extract useful data, information, and knowledge from weblog
records, and it assists in recognizing user access patterns for web pages. In mining the usage
of web resources, one considers records of requests from visitors to a website,
which are often collected in web server logs. While the content and structure of the collection of
web pages follow the intentions of the authors of the pages, the individual requests
demonstrate how the consumers see these pages. Web usage mining may disclose relationships
that were not intended by the creator of the pages.

Some of the methods to identify and analyze the web usage patterns are given below:

I. Session and visitor analysis:

The analysis of preprocessed data can be accomplished in session analysis, which incorporates
the visitor records, days, times, sessions, etc. This data can be utilized to analyze visitor
behavior.

A report is created after this analysis, which contains the details of repeatedly visited web
pages and common entry and exit points.

II. OLAP (Online Analytical Processing):

OLAP performs a multidimensional analysis of complex data.

OLAP can be performed on various parts of log-related data over a specific period.

OLAP tools can be used to infer important business intelligence metrics.

Challenges in Web Mining:


The web poses great challenges for resource and knowledge discovery, based on the
following observations:

o The complexity of web pages:

Web pages do not have a unifying structure. They are extremely complicated compared to
traditional text documents. There are enormous numbers of documents in the digital library of
the web, and these libraries are not organized according to any particular order.

o The web is a dynamic data source:

The data on the internet is quickly updated. For example, news, climate, shopping, financial
news, sports, and so on.

o Diversity of client networks:


The client network on the web is quickly expanding. These clients have different interests,
backgrounds, and usage purposes. There are over a hundred million workstations
connected to the internet, and the number is still increasing tremendously.

o Relevancy of data:

It is considered that a specific person is generally concerned with only a small portion of the web,
while the rest of the web contains data that is not relevant to the user and
may lead to unwanted results.

o The web is too broad:

The size of the web is tremendous and rapidly increasing. It appears that the web is too huge
for data warehousing and data mining.

Mining the Web's Link Structure to Recognize Authoritative Web Pages:
The web comprises pages as well as hyperlinks pointing from one page to another. When the
creator of a web page creates a hyperlink pointing to another web page, this can be considered
as the creator's endorsement of the other page. The collective endorsement of a given page by
various creators on the web may indicate the significance of the page and may naturally lead to
the discovery of authoritative web pages. The web linkage data thus provides rich information about the
relevance, quality, and structure of the web's content, and it is therefore a rich source for web
mining.

Application of Web Mining:


Web mining has extensive applications because of the various uses of the web. A list of some
applications of web mining is given below.

o Marketing and conversion tools

o Data analysis of website and application performance
o Audience behavior analysis
o Advertising and campaign performance analysis
o Testing and analysis of a site
Different types of Clustering

Cluster analysis separates data into groups, usually known as clusters. If meaningful groups are
the objective, then the clusters should capture the natural structure of the data. Sometimes cluster
analysis is only a useful initial stage for other purposes, such as data summarization. Whether
for understanding or utility, cluster analysis has long played a significant role in a wide range of
areas such as biology, psychology, statistics, pattern recognition, machine learning, and data mining.

What is Cluster Analysis?


Cluster analysis groups data objects based primarily on information found in the
data that describes the objects and their relationships. The objective is that the objects within a group
be similar to one another and different from the objects in other groups.

Figure 1 illustrates different ways of clustering the same set of points.

In various applications, the concept of a cluster is not precisely defined. To better understand the
challenge of deciding what constitutes a group, figure 1 illustrates twenty points and three
different ways to separate them into clusters. The shapes of the markers show the cluster
membership. The figures divide the data into two and six parts, respectively. The division of
each of the two larger clusters into three subclusters may simply be a product of the human
visual system. It may also not be unreasonable to say that the points form four clusters. The figure
shows that the definition of a cluster is imprecise, and the best definition relies upon
the nature of the data and the desired outcomes.


Cluster analysis is related to other methods that are used to divide data objects into groups. For
example, clustering can be viewed as a form of classification in that it creates a labeling of objects
with class (cluster) labels; however, it derives these labels only from the data. In contrast, in
classification, new unlabeled objects are assigned a class label using a model developed
from objects with known class labels. For this reason, cluster analysis is sometimes referred to as
unsupervised classification. If the term classification is used without any qualification within data
mining, it typically refers to supervised classification.

The terms segmentation and partitioning are sometimes used as synonyms for clustering.
These terms are commonly used for techniques outside the traditional bounds of cluster
analysis. For example, the term partitioning is usually used in connection with techniques
that separate graphs into subgraphs and that are not strongly connected to
clustering. Segmentation often refers to the division of data into groups using simple
methods. For example, an image can be broken into various segments depending on pixel
intensity and color, or people can be divided into different groups based on their annual
income. However, some work in graph partitioning and market segmentation is connected to cluster
analysis.

Different types of Clustering


An entire collection of clusters is usually referred to as a clustering. Here, we distinguish
different kinds of clusterings: hierarchical (nested)
vs. partitional (unnested), exclusive vs. overlapping vs. fuzzy, and complete vs. partial.

o Hierarchical versus Partitional

The most frequently discussed different features among various types of Clustering is whether
the clusters sets are nested or unnested, or in more conventional terminology, partitional or
hierarchical. A partitional Clustering is usually a distribution of the set of data objects into non-
overlapping subsets (clusters) so that each data object is in precisely one subset.

If we allow clusters to have subclusters, then we get a hierarchical Clustering, which is a set
of nested clusters organized as a tree. Each node (cluster) in the tree (except the leaf
nodes) is the union of its subclusters, and the root of the tree is the cluster containing all the
objects. Usually, the leaves of the tree are singleton clusters of individual data objects. If we
allow clusters to be nested, then one interpretation of figure 1(a) is that it has two
subclusters (figure 1(b)), each of which in turn has three subclusters (figure 1(d)).
The clusters shown in figures 1(a-d), when taken in that order, also form a
hierarchical (nested) Clustering with 1, 2, 4, and 6 clusters on each level. Finally, a hierarchical
Clustering can be viewed as a sequence of partitional Clusterings, and a partitional Clustering
can be obtained by taking any member of that sequence, that is, by cutting the hierarchical
tree at a particular level.
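
As a rough sketch of cutting a hierarchical tree at a particular level, the following Python example uses SciPy's agglomerative clustering on a handful of made-up two-dimensional points; the point values and the choice of average linkage are assumptions made only for this illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# A few made-up two-dimensional points (illustrative only)
points = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
                   [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])

# Build the hierarchical (nested) clustering as a linkage tree
tree = linkage(points, method='average')

# Cutting the tree at different levels yields different partitional clusterings
two_clusters = fcluster(tree, t=2, criterion='maxclust')
four_clusters = fcluster(tree, t=4, criterion='maxclust')

print(two_clusters)    # e.g. [1 1 1 2 2 2]
print(four_clusters)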

o Exclusive versus Overlapping versus Fuzzy

The Clusterings shown in the figure are all exclusive, as they assign each object to a single
cluster. There are many situations in which a point could reasonably be placed in more than one
cluster, and these situations are better addressed by non-exclusive
Clustering. In general terms, an overlapping or non-exclusive Clustering is used to reflect the
fact that an object can simultaneously belong to more than one group (class). For example, a person
at a company can be both a trainee student and an employee of the company. A non-exclusive
Clustering is also often used when an object is "between" two or more clusters and could
reasonably be assigned to any of them. Consider a point halfway between two of the
clusters: rather than making an arbitrary assignment of the object to a single cluster, it is placed
in all of the "equally good" clusters.

In fuzzy Clustering, each object belongs to every cluster with a membership weight that is
between 0 and 1. In other words, clusters are treated as fuzzy sets. Mathematically, a fuzzy
set is one in which an object belongs to the set with a weight that ranges
between 0 and 1. In fuzzy Clustering, we usually impose the additional constraint that the sum of
the weights for each object must equal 1. Similarly, probabilistic Clustering techniques compute
the probability with which each point belongs to each cluster, and these probabilities must also sum to 1.
Since the membership weights or probabilities for any object sum to 1, a fuzzy or probabilistic
Clustering does not address truly multiclass situations.
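
As a small sketch of the membership-weight idea (not any particular fuzzy clustering algorithm), the snippet below gives each point a weight for every cluster centroid based on inverse distance and normalizes the weights so they sum to 1; the points and centroids are made up for the illustration.

import numpy as np

# Made-up points and cluster centroids (illustrative only)
points = np.array([[1.0, 1.0], [2.0, 2.0], [4.5, 4.5]])
centroids = np.array([[1.0, 1.0], [5.0, 5.0]])

for p in points:
    # Inverse-distance "closeness" to each centroid (small constant avoids division by zero)
    closeness = 1.0 / (np.linalg.norm(centroids - p, axis=1) + 1e-9)
    # Normalize so the membership weights for this point sum to 1
    weights = closeness / closeness.sum()
    print(p, weights.round(3))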

o Complete versus Partial


A complete Clustering assigns every object to a cluster, whereas a partial Clustering does not.
The motivation for a partial Clustering is that some objects in a data set may not belong to
well-defined groups. Many times, objects in the data set represent noise, outliers, or
"uninteresting background." For example, some news stories may share a common
subject, such as "Industrial production shrinks globally by 1.1 percent," while other stories
are more generic or one-of-a-kind. Consequently, to find the important topics in last
month's stories, we may want to search only for clusters of documents that are tightly related
by a common subject. In other cases, a complete Clustering of objects is desired. For example,
an application that uses Clustering to organize documents for browsing needs to guarantee that
all documents can be browsed.

Different types of Clusters


Clustering aims to find useful groups of objects (clusters), where usefulness is defined by the
objectives of the data analysis. Of course, there are several notions of a cluster that
prove useful in practice. In order to illustrate the differences between these kinds of
clusters visually, we use two-dimensional points, as shown in the figure, although the types of
clusters described here are equally valid for other sorts of data.

o Well-separated cluster

A cluster is a set of objects in which each object is closer (or more similar) to every other object in
the cluster than to any object not in the cluster. Sometimes a threshold is used to specify that all
the objects in a cluster must be sufficiently close or similar to one another. This definition of a
cluster is satisfied only when the data contains natural clusters that are quite far from one another.
The figure gives an example of well-separated clusters that consists of two groups of points in a
two-dimensional space. Well-separated clusters do not need to be globular but can have any shape.

o Prototype-Based cluster
A cluster is a set of objects in which each object is closer (more similar) to the prototype that
defines the cluster than to the prototype of any other cluster. For data with continuous
attributes, the prototype of a cluster is usually a centroid, i.e., the average (mean) of
all the points in the cluster. When a centroid is not meaningful, for example, when the data has
categorical attributes, the prototype is usually a medoid, i.e., the most representative point
of a cluster. For many sorts of data, the prototype can be regarded as the most central point, and in
such cases, we commonly refer to prototype-based clusters as center-based clusters. As
one might expect, such clusters tend to be globular. The figure illustrates an example of
center-based clusters.
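
A quick way to see centroid (prototype) based clusters in practice is k-means; the sketch below uses scikit-learn on made-up two-dimensional points and prints the learned centroids and cluster labels. The data and parameter values are assumptions for the example only.

import numpy as np
from sklearn.cluster import KMeans

# Made-up two-dimensional points forming two rough groups (illustrative only)
X = np.array([[1.0, 1.2], [1.1, 0.9], [0.9, 1.0],
              [6.0, 6.1], [5.9, 5.8], [6.2, 6.0]])

# Each cluster is represented by its centroid (the mean of its points)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print('Centroids:', kmeans.cluster_centers_)
print('Labels:', kmeans.labels_)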

o Graph-Based cluster

If the data is represented as a graph, where the nodes are the objects, then a cluster can be
defined as a connected component, i.e., a group of objects that are connected to one
another but have no connection to objects outside the group. An important example
of graph-based clusters is contiguity-based clusters, where two objects are connected only if
they are within a specified distance of each other. This implies that every object in
a contiguity-based cluster is closer to some other object in the cluster than to any point in a
different cluster. The figures demonstrate an example of such clusters for two-dimensional points.
This definition of a cluster is useful when clusters are irregular or intertwined, but it can have
trouble when noise is present, as shown by the two circular clusters in the figure; a small bridge
of points can merge two distinct clusters.

Other kinds of graph-based clusters are also possible. One such approach defines a cluster as
a clique, i.e., a set of nodes in a graph that are completely connected to one another.
Specifically, if we add connections between objects in order of their distance from one
another, a cluster is formed when a set of objects forms a clique. Like prototype-based
clusters, such clusters tend to be globular.
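
A contiguity-based (connected-component) cluster can be sketched by linking any two points that lie within a distance threshold and then finding the connected components of the resulting graph; the points and the threshold below are made up for the illustration.

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Made-up two-dimensional points (illustrative only)
X = np.array([[0.0, 0.0], [0.5, 0.4], [1.0, 0.9],
              [5.0, 5.0], [5.4, 5.2]])

# Connect two objects if they lie within a chosen distance threshold
threshold = 1.0
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
adjacency = csr_matrix(dists <= threshold)

# Each connected component of the graph is one contiguity-based cluster
n_clusters, labels = connected_components(adjacency, directed=False)
print(n_clusters, labels)    # e.g. 2 [0 0 0 1 1]
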
o Density-Based Cluster

A cluster is a dense region of objects that is surrounded by a region of low density. The
two spherical clusters in the figure are not merged, because the bridge between them fades
into the noise. Similarly, the curve present in the figure fades into the noise and does not
form a cluster. A density-based definition of a cluster is often employed when the clusters
are irregular or intertwined and when noise and outliers are present. A contiguity-based
definition of a cluster, on the other hand, would not work well for such data, since the noise
would tend to form bridges between clusters.
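
The density-based notion of a cluster is what algorithms such as DBSCAN implement; in the scikit-learn sketch below, points in low-density regions are labeled -1 (noise) rather than being forced into a cluster. The data, eps, and min_samples values are made up for the illustration.

import numpy as np
from sklearn.cluster import DBSCAN

# Two made-up dense groups plus one isolated noise point (illustrative only)
X = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 0.8], [1.2, 1.0],
              [8.0, 8.0], [8.1, 8.2], [7.9, 7.8],
              [4.5, 0.5]])    # isolated point far from both dense regions

db = DBSCAN(eps=0.6, min_samples=3).fit(X)
print(db.labels_)    # e.g. [0 0 0 0 1 1 1 -1]; -1 marks noise/outliers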

o Shared-property or Conceptual Clusters


We can define a cluster as a set of objects that share some property. The objects in a
center-based cluster share the property that they are all closest to the same centroid or
medoid. However, the shared-property approach also includes new types of clusters.
Consider the clusters shown in the figure: a triangular area (cluster) is adjacent to a
rectangular one, and there are two intertwined circles (clusters). In both cases, a Clustering
algorithm would need a very specific concept of a cluster to recognize these clusters
effectively. The process of finding such clusters is called conceptual Clustering.

Bitcoin Data Mining

Bitcoin mining refers to the process of authenticating and adding transactional records to the
public ledger. The public ledger is known as the blockchain because it comprises a chain of
blocks.
Before we look at the Bitcoin mining concept, we should understand what Bitcoin
is. Bitcoin is virtual money that has some value, and its value is not static; it varies over
time. There is no regulatory body that governs Bitcoin transactions.

Let's understand the Bitcoin concept with an example. Suppose a company manager takes a
dummy item and announces that whoever obtains it will be the happiest employee of the
organization and will receive an international holiday ticket. Everyone then tries to buy that
item, which initially has no value, and in this way the dummy item acquires some value,
perhaps somewhere between $10 and $20. We can relate this to Bitcoin: if the number of
purchasers of Bitcoin increases, the value of Bitcoin also increases until it reaches a
saturation point, after which it stops rising.

Bitcoin was created under the pseudonym (false name) Satoshi Nakamoto,
who announced the invention, and it was later released as open-source
code. A purely peer-to-peer version of electronic cash enables online
payments to be sent directly from one person to another without the
involvement of a financial institution. Bitcoin is a network protocol that
allows people to transfer ownership rights of account units called bitcoins,
which are created in limited quantity. When an individual sends a few bitcoins to
another individual, this information is broadcast to the peer-to-peer Bitcoin
network.

This technology is similar to purchasing something with virtual
currency. However, one advantage of Bitcoin is that the arrangement
remains anonymous: the personal identities of the sender and the beneficiary
(receiver) remain hidden. This is the primary reason why it has
become a trusted form of money transaction on the web. Traditionally, the
difficulty in creating decentralized money is the need for a mechanism to
prevent double-spending: one individual might simultaneously transmit two
transactions, sending the same coins to two distinct parties on the network.
Bitcoin solves this problem and ensures agreement about ownership by keeping
a community ledger of all transactions, called the blockchain. New
transactions are grouped together and checked against the existing
record to make sure all new transactions are valid. Bitcoin's integrity is
ensured by individuals, known as miners, who contribute computing power to
the network to validate and append transactions to the public ledger.

Bitcoins do not exist physically; they are only digital data. They can be exchanged
for real money and are broadly accepted in many countries around the globe. There is no
central authority for Bitcoin, such as a central bank (the RBI in India) that controls monetary
policy. Instead, miners solve complex puzzles to validate Bitcoin transactions. This
process is called Bitcoin mining.

How to Mine Bitcoins


It is quite a complex process, but here is an outline of how it works. You need a
CPU (Central Processing Unit) with excellent processing power and a fast internet connection.
Next, there are numerous online networks that list the latest Bitcoin transactions taking place
in real time. Afterward, sign in with a Bitcoin client and attempt to verify those transactions
by evaluating blocks of data, called hashes. The work passes through several systems, called
nodes, and since the data is encrypted, the miner must check whether his answers are correct.

How the Bitcoin Mining Works

Bitcoin mining requires a task that is exceptionally hard to perform but simple to verify. It uses
cryptography, with a hash function called double SHA-256 (a one-way function that converts
text of any length into a 256-bit string). A hash accepts a portion of data as input and
reduces it to a smaller hash value (256 bits). With a cryptographic hash, there is no
way to obtain a desired hash value without trying a huge number of inputs, but once we find an
input that gives the value we want, it is a simple task for anybody to validate the hash. So,
cryptographic hashing is a good way to implement Bitcoin's
"proof-of-work" (data that is hard to produce but easy
for others to verify).
To mine a block, we first collect the new transactions into a block, and then we hash the
block to form a 256-bit block hash value. If the hash starts with enough zeros, the block has
been successfully mined; it is sent to the Bitcoin network, and the hash becomes the
identifier for the block. Most of the time the hash does not meet the target, so we alter the
block slightly and try again and again.
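
A toy sketch of this "hash until it starts with enough zeros" loop is shown below using Python's hashlib and double SHA-256. The block contents, the nonce loop, and the difficulty (a number of leading hex zeros) are simplified, made-up stand-ins for the real Bitcoin block format and target.

import hashlib

def double_sha256(data):
    # Apply SHA-256 twice and return the hex digest
    return hashlib.sha256(hashlib.sha256(data).digest()).hexdigest()

# Made-up block contents and a toy difficulty (real Bitcoin uses a numeric target)
block_data = 'prev_hash|tx1;tx2;tx3'
difficulty = 4    # require 4 leading hex zeros

nonce = 0
while True:
    h = double_sha256((block_data + '|' + str(nonce)).encode())
    if h.startswith('0' * difficulty):
        break    # mined: the hash meets the toy target
    nonce += 1   # alter the block slightly and try again

print('nonce:', nonce, 'hash:', h)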

Bitcoin Transaction

A Bitcoin transaction is a piece of data that is broadcast to the network and, if valid, ends up
in a block in the blockchain. The purpose of a Bitcoin transaction is to transfer ownership of
an amount of Bitcoin from one Bitcoin address to another.

When we send Bitcoin, a data structure, namely a Bitcoin transaction, is created by our wallet
client and then broadcast to the network, whose nodes rebroadcast the transaction. If the
transaction is valid, nodes will include it in the block they are mining, and within 10-20 minutes
the transaction will be included, along with other transactions, in a block in the blockchain.
Finally, the receiver can see the transaction amount in their wallet.

Some facts about transactions

o The Bitcoin amount that we send is always sent to a particular address.


o The Bitcoin amount we get is locked to the receiving address, which is associated with our wallet.
o Every time we spend Bitcoin, the amount we spend will consistently come from funds received
earlier and currently present in our wallet.
o Addresses receive Bitcoin, but they don't send Bitcoin; it is sent from a wallet.

Bitcoin Wallets

Bitcoin wallets store the private keys with which we access a Bitcoin address and spend
our funds. They come in different forms, designed for different types of devices. We can even
use paper to store the keys and avoid having them on a computer at all. It is important to secure
and back up our Bitcoin wallet. Bitcoins are a new form of cash, and more and more
merchants are starting to accept them as payment.

We know how the Bitcoin transaction mechanism works and how bitcoins are created, but how
are they stored? We store money in a physical wallet, and Bitcoin works similarly, except that
the wallet is digital. In brief, we do not store bitcoins anywhere; what we store are the
secured digital keys used to access our public Bitcoin address and sign transactions.

There are mainly five types of wallets that are given below:

Desktop Wallets

First, we need to install the original Bitcoin client (Bitcoin Core). If we have already installed it,
then we are running a wallet, but may not know it. In addition to relaying transactions on the
network, this software also enables us to create a Bitcoin address for sending and receiving the
virtual currency. MultiBit is a Bitcoin wallet that runs on Mac OS X, Windows, and Linux. Hive is an
OS X-based wallet with some special features, including an app store that connects directly
to Bitcoin services.
Mobile Wallets

Running as an app on our cell phone, the wallet can store the private keys for our Bitcoin
addresses and enable us to pay for things directly with the phone. In many cases, a
Bitcoin wallet will even take advantage of a cell phone's near-field communication (NFC) feature,
allowing us to tap the phone against a reader and pay with bitcoins without entering any
data at all. A full Bitcoin client has to download the entire Bitcoin blockchain, which keeps
growing and is multiple gigabytes in size. Many mobile phones would not be able to hold
the blockchain in their memory, so mobile clients are usually designed with simplified payment
verification (SPV) in mind. They download a limited subset of the blockchain and rely on other
trusted nodes in the Bitcoin network to ensure that they have the correct data. Mycelium is an
example of an Android-based mobile Bitcoin wallet.

Online Wallets

Online wallets store our private keys on the web, on a computer controlled by someone else
and connected to the Internet. Various online services are available, and some link to mobile
and desktop wallets, replicating our addresses among the different devices that we own. One
significant advantage of online wallets is that we can access them from anywhere, regardless of
which device we are using.

Hardware Wallets
Hardware wallets are currently limited in number. These are dedicated devices that hold private
keys electronically and facilitate payments. The compact Ledger USB Bitcoin wallet uses
smartcard security and is available at a reasonable cost.

Paper Wallets

The cheapest option for keeping our bitcoins safe and sound is a so-called paper
wallet. Various sites offer paper Bitcoin wallet services. They generate a Bitcoin
address for us and create an image containing two QR codes: one is the public
address that we can use to receive bitcoins, and the other is the private key that we use to
spend the bitcoins stored at that address. The primary advantage of a paper wallet is that the
private keys are not stored digitally anywhere, which protects the wallet from cyber attacks.
Orange Data Mining

Orange is a library of C++ core objects and routines that incorporates a huge
variety of standard and non-standard machine learning and data mining
algorithms. It is an open-source data visualization, data mining, and
machine learning tool, and a scriptable environment for quick
prototyping of new algorithms and testing patterns. On top of the core library sits a set of
Python-based modules that implement functionality for which execution time is not
critical, and that part is written in Python.

It covers a variety of tasks such as pretty-printing of decision trees, bagging and boosting,
attribute subset selection, and many more. Orange also provides a set of graphical widgets that
use methods from the core library and Orange modules and offer a pleasant user interface. The
widgets support signal-based communication and can be assembled into an application by a
visual programming tool called Orange Canvas.

All of these together make Orange a distinctive component-based framework for data mining
and machine learning. Orange is intended both for experienced users and analysts in data
mining and machine learning who want to develop and test their own algorithms while reusing as
much of the code as possible, and for those just entering the field who can write short
Python scripts for data analysis.

The objective of Orange is to provide a platform for experiment-based algorithm selection,
predictive modeling, and recommendation systems. It is primarily used in bioinformatics, genomic
research, biomedicine, and teaching. In education, it is used to provide better teaching methods
for data mining and machine learning to students of biology, biomedicine, and informatics.

Orange Data Mining

Orange supports a flexible environment for developers, analysts, and data mining
specialists. It builds on Python, a modern scripting language and programming
environment, in which data mining scripts can be simple yet powerful.
Orange employs a component-based approach for fast prototyping: we can
implement an analysis technique by putting components together like LEGO bricks, or
simply use an existing algorithm. What are components in Orange scripting
are widgets in Orange visual programming. Widgets use a specially
designed communication mechanism for passing objects like classifiers,
regressors, attribute lists, and data sets, which makes it easy to build rather
complex data mining schemes that use modern approaches and techniques.

Orange core objects and Python modules cover numerous data mining
tasks that range from data preprocessing to evaluation and modeling. The
guiding principle of Orange is to cover the major techniques and approaches in data
mining and machine learning. For example, Orange's top-down induction of
decision trees is built from numerous components, any of which
can be prototyped in Python and used in place of the original one. Orange
widgets are not simply graphical objects that provide a graphical interface for a
particular method in Orange; they also include a flexible signaling
mechanism for communication and the exchange of objects like data
sets, classification models, learners, and objects that store evaluation
results. All of these ideas are important and together distinguish Orange
from other data mining frameworks.

Orange Widgets

Orange widgets provide a graphical user interface to Orange's data mining and machine
learning methods. They include widgets for data entry and preprocessing, classification,
regression, association rules, and clustering, a set of widgets for model evaluation and
visualization of evaluation results, and widgets for exporting models to PMML.

Widgets pass data by means of tokens that travel from the sender to the
receiver widget. For example, a file widget outputs data objects that
can be received by a classification tree learner widget. The
classification tree learner builds a classification model and sends it to a
widget that graphically displays the tree. An evaluation widget may receive a data
set from the file widget and trained models from learner widgets.

Orange Scripting

If we want to access Orange objects directly, write our own components, and design our own
test schemes and machine learning applications, we can do so through scripting. Orange
interfaces to Python, an easy-to-use scripting language with clear and powerful syntax and a
broad set of additional libraries. Like any scripting language, Python can be used to try out a
few ideas interactively or to develop more elaborate scripts and programs.

To see how Python and Orange work together, consider a simple script that reads a data set
and prints the number of instances and attributes. We will use a classification data set called
"voting" from the UCI Machine Learning Repository that records sixteen key votes of each
Member of Parliament (MP) of the Parliament of India and labels each MP with a party
membership:

import orange

data1 = orange.ExampleTable('voting.tab')

print('Instances:', len(data1))

print('Attributes:', len(data1.domain.attributes))

Here, we can see that the script first loads the orange library, reads the data file, and prints
out the information we were interested in. If we store this script in script.py and run it with the
shell command "python script.py" (making sure the data file is in the same directory), we get:
Instances: 543

Attributes: 16

Let us extend our script to build a naïve Bayesian classifier from the same data and print the
classification of the first five instances:

model = orange.BayesLearner(data1)

for i in range(5):
    print(model(data1[i]))

Producing the classification model is easy; we have called Orange's object (BayesLearner) and given it
the data set. It returned another object (a naïve Bayesian classifier) that, when given an instance,
returns the label of the most probable class. Here is the output of this part of the script:

inc

inc

inc

bjp

bjp

To find out what the correct classifications were, we can also print the original labels of our
five instances:

for i in range(5):
    print(model(data1[i]), 'originally', data1[i].getclass())

What we discover is that the naïve Bayesian classifier has misclassified the third instance:

inc originally inc

inc originally inc

inc originally bjp

bjp originally bjp

bjp originally bjp


All classifiers implemented in Orange are probabilistic, i.e., they estimate class probabilities.
This is true for the naïve Bayesian classifier as well, so we may wonder by how much we
missed in the third case:

n = model(data1[2], orange.GetProbabilities)

print(data1.domain.classVar.values[0], ':', n[0])

Recall that Python's indices start with 0 and that the classifier returns a probability vector
when called with the argument orange.GetProbabilities. Our model estimated a very high
probability for inc:

Inc : 0.878529638542
Data Mining Vs Big Data

Data Mining uses tools such as statistical models, machine learning, and visualization to "mine"
(extract) useful data and patterns from Big Data, whereas Big Data concerns processing high-
volume and high-velocity data, which is challenging to do with older databases and analysis
programs.

Big Data

Big Data refers to vast amounts of structured, semi-structured, and unstructured data, often
measured in terabytes. It is challenging to process such a huge amount of data on a single
system, because the computer's RAM has to hold the interim calculations during processing and
analysis. Processing that much data on one machine takes a very long time, and the system
may also fail to work correctly due to overload.

Here we will understand how much data is produced with a live example. We all know about Big
Bazaar. As customers, we go to Big Bazaar at least once a month. These stores track every
product that customers purchase from them, along with the store location where the purchase
was made. They have a live information feed that stores all the data on huge central servers.
Consider that the number of Big Bazaar stores in India alone is around 250. Tracking every
single item purchased by every customer, along with the item description, makes the data grow
to around 1 TB in a month.
What does Big Bazaar do with that data

We know that promotions run in Big Bazaar on certain items. Do we genuinely believe Big Bazaar
would run those promotions without any analytical backing to confirm that they would increase
sales and generate a surplus? That is where Big Data analysis plays a vital role. Using data
analysis techniques, Big Bazaar targets new customers as well as existing customers to purchase
more from its stores.

Big Data is characterized by the 5 Vs: Volume, Variety, Velocity, Veracity, and Value.

Volume: In Big Data, volume refers to the sheer amount of data, which can be enormous.

Variety: In Big Data, variety refers to the various types of data, such as web server logs, social
media data, and company data.

Velocity: In Big Data, velocity refers to how fast data is growing with respect to time. In general,
data is increasing exponentially at a very fast rate.

Veracity: In Big Data, veracity refers to the uncertainty of the data.

Value: In Big Data, value refers to whether the data we are storing and processing is actually
valuable and how we derive benefit from these huge data sets.

How to Process Big Data

A very efficient framework, known as Hadoop, is primarily used for Big Data processing. It is
open-source software that works on a distributed parallel processing model.

The Apache Hadoop framework is composed of the following modules:

Hadoop Common

It contains the libraries and utilities required by other Hadoop modules.

Hadoop Distributed File System (HDFS)

A distributed file system that stores data on commodity machines, providing very high aggregate
bandwidth across the cluster.

Hadoop YARN

It is a resource-management platform responsible for managing compute resources in clusters
and using them for scheduling users' applications.

Hadoop MapReduce

It is a programming model for large-scale data processing.
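
To give a feel for the MapReduce model, here is a tiny pure-Python word-count sketch that imitates the map, shuffle, and reduce phases in memory; a real Hadoop job would distribute these phases across the machines of a cluster, and the input lines here are made up for the illustration.

from collections import defaultdict

# Made-up input records (in Hadoop these would be lines of a file stored in HDFS)
lines = ['big data needs hadoop', 'data mining mines data']

# Map phase: emit (key, value) pairs, here (word, 1)
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group the values by key
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate the values for each key
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)    # e.g. {'big': 1, 'data': 3, ...}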

Data Mining

As the name suggests, Data Mining refers to mining huge data sets to identify trends and
patterns and to extract useful information.

In Data Mining, we look for hidden information without knowing exactly what kind of information
we are looking for or what we plan to use it for once we find it. When we discover something
interesting, we start thinking about how to make use of it to boost the business.
We will understand the data mining concept with an example:

A data miner starts exploring the call records of a mobile network operator without any
specific target from his manager. The manager probably gives him a broad objective, such as
discovering at least a few new patterns in a month. As he begins mining the data, he discovers a
pattern: there are more international calls on Fridays (for example) than on all other days.
He shares this finding with management, and they come up with a plan to reduce international
call rates on Fridays and launch a campaign. Call durations go up, customers are happy with the
low call rates, more customers join, and the organization makes more profit because the
utilization percentage has increased.

There are various steps involved in Data Mining:

Data Integration

In the first step, data is collected and integrated from various sources.
Data Selection

We may not need all the data we have collected, so in this step, we select only the data we
think is useful for data mining.

Data Cleaning

The information we have collected is not clean and may contain errors, noisy or inconsistent
data, or missing values, so we need to apply various strategies to get rid of such problems.

Data Transformation

Even after cleaning, the data is not ready for mining, so we need to transform it into forms
suitable for mining. The methods used to achieve this include aggregation, normalization,
smoothing, etc., as in the sketch below.
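
As a small illustration of the normalization part of this step, the following sketch rescales a made-up numeric column to the [0, 1] range with scikit-learn's MinMaxScaler; the column values are assumptions for the example only.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up monthly call durations in minutes (illustrative only)
durations = np.array([[120.0], [450.0], [30.0], [300.0]])

# Min-max normalization rescales each value to the [0, 1] range
scaler = MinMaxScaler()
normalized = scaler.fit_transform(durations)
print(normalized.ravel())    # e.g. [0.214 1.    0.    0.643]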

Data Mining

Once the data has been transformed, we are ready to apply data mining methods to extract
useful information and patterns from the data sets. Clustering and association rule mining are
among the many techniques used for data mining.

Pattern Evaluation

Pattern evaluation includes visualizing the patterns we have generated, removing random
(uninteresting) patterns, transforming them, and so on.

Decision

It is the last step in data mining. It helps users make use of the acquired knowledge to take
better data-driven decisions.

Difference Between Data Mining and Big Data


o Data Mining primarily targets the analysis of data to extract useful information, whereas Big
Data primarily targets the relationships within the data.

o Data Mining can be used for large as well as small volumes of data, whereas Big Data by
definition involves huge volumes of data.

o Data Mining is a method primarily used for data analysis, whereas Big Data is a broad concept
rather than a specific technique.

o Data Mining is primarily based on statistical analysis, generally targeting prediction and finding
business factors on a small scale, whereas Big Data is primarily based on data analysis, generally
targeting prediction and finding business factors on a large scale.

o Data Mining mainly uses structured, relational, and dimensional data, whereas Big Data uses
structured, semi-structured, and unstructured data.

o Data Mining expresses the "what" of the data, whereas Big Data addresses the "why" of the data.

o Data Mining gives a close-up view of the data, whereas Big Data gives a broad view of the data.

o Data Mining is primarily used for strategic decision-making purposes, whereas Big Data is
primarily used for dashboards and predictive measures.
