Data Mining
Introduction
Data mining is one of the most useful techniques that helps entrepreneurs, researchers, and
individuals extract valuable information from huge sets of data. Data mining is also
called Knowledge Discovery in Databases (KDD). The knowledge discovery process includes data
cleaning, data integration, data selection, data transformation, data mining, pattern evaluation,
and knowledge presentation.
This note on data mining covers the main topics of the field, such as applications, data
mining vs. machine learning, data mining tools, social media data mining, data mining techniques,
clustering in data mining, challenges in data mining, and more.
Data mining is the process of extracting information from huge sets of data to identify patterns,
trends, and useful insights that allow a business to make data-driven decisions.
In other words, data mining is the process of investigating hidden patterns of information from
various perspectives and categorizing it into useful data, which is collected and assembled in
particular areas such as data warehouses. This supports efficient analysis and decision-making,
and ultimately helps cut costs and generate revenue.
Data mining is the act of automatically searching large stores of information for trends and
patterns that go beyond simple analysis procedures. It uses complex mathematical algorithms to
segment the data and evaluate the probability of future events. Data mining is also
called Knowledge Discovery of Data (KDD).
Data Mining is a process used by organizations to extract specific data from huge databases to solve
business problems. It primarily turns raw data into useful information.
Data mining is similar to data science: it is carried out by a person, in a specific situation, on a
particular data set, with an objective. The process includes various types of services such as text
mining, web mining, audio and video mining, pictorial data mining, and social media mining. It is
done through software that may be simple or highly specialized. By outsourcing data mining, all the
work can be done faster and with lower operating costs. Specialized firms can also use new
technologies to collect data that would be impossible to locate manually. Tons of information is
available on various platforms, but very little of it is usable knowledge. The biggest challenge is
to analyze the data and extract the important information that can be used to solve a problem or
develop the company. Many powerful tools and techniques are available to mine data and find better
insights from it.
Relational Database
A relational database is a collection of multiple data sets formally organized into tables, records,
and columns, from which data can be accessed in various ways without having to reorganize the
database tables. Tables convey and share information, which facilitates data searchability,
reporting, and organization.
Data warehouses
A data warehouse is the technology that collects data from various sources within the
organization to provide meaningful business insights. The huge amount of data comes from
multiple places, such as Marketing and Finance. The extracted data is used for analytical
purposes and helps in decision-making for a business organization. The data warehouse is
designed for the analysis of data rather than for transaction processing.
Data Repositories
A data repository generally refers to a destination for data storage. However, many IT
professionals use the term more specifically to refer to a particular kind of setup within an IT
structure, for example, a group of databases where an organization has kept various kinds of
information.
Object-Relational Database
A combination of an object-oriented database model and relational database model is called an
object-relational model. It supports Classes, Objects, Inheritance, etc.
One of the primary objectives of the Object-relational data model is to close the gap between the
Relational database and the object-oriented model practices frequently utilized in many
programming languages, for example, C++, Java, C#, and so on.
Transactional Database
A transactional database refers to a database management system (DBMS) that has the potential to
undo a database transaction if it is not performed appropriately. Even though this was a unique
capability a very long while back, today, most of the relational database systems support
transactional database activities.
The following are some of the major challenges faced in data mining:
Data Distribution
Real-world data is usually stored on various platforms in distributed computing environments. It
might be in databases, individual systems, or even on the internet. Practically, it is quite a tough
task to bring all the data into a centralized data repository, mainly due to organizational and
technical concerns. For example, various regional offices may have their own servers to store their
data, and it is not feasible to store all the data from all the offices on a central server.
Therefore, data mining requires the development of tools and algorithms that allow the mining of
distributed data.
Complex Data
Real-world data is heterogeneous; it could be multimedia data (including audio, video, and images),
complex data, spatial data, time series, and so on. Managing these various types of data and
extracting useful information from them is a tough task. Most of the time, new technologies, tools,
and methodologies have to be developed to obtain specific information.
Performance
The data mining system's performance relies primarily on the efficiency of the algorithms and
techniques used. If the designed algorithms and techniques are not up to the mark, the
efficiency of the data mining process will be adversely affected.
Data Visualization
In data mining, data visualization is a very important process because it is the primary method for
showing the output to the user in a presentable way. The extracted data should convey the exact
meaning of what it intends to express. However, it is often difficult to represent the information
to the end user in a precise and easy way. Because the input data and the output information are
complicated, very efficient and successful data visualization processes need to be implemented to
make it work.
There are many more challenges in data mining in addition to the problems mentioned above. More
problems are revealed as the actual data mining process begins, and the success of data mining
relies on overcoming all these difficulties.
Drawing on various methods and technologies from the intersection of machine learning,
database management, and statistics, professionals in data mining have devoted their careers to
better understanding how to process and draw conclusions from huge amounts of data. But
what are the methods they use to make it happen?
In recent data mining projects, various major data mining techniques have been developed and
used, including association, classification, clustering, prediction, sequential patterns, and
regression.
1. Classification
This technique is used to obtain important and relevant information about data and metadata. This
data mining technique helps to classify data in different classes.
i. Classification of data mining frameworks as per the type of data sources mined:
This classification is based on the type of data handled, for example, multimedia, spatial data,
text data, time-series data, World Wide Web data, and so on.
ii. Classification of data mining frameworks as per the database involved:
This classification is based on the data model involved, for example, object-oriented database,
transactional database, relational database, and so on.
iii. Classification of data mining frameworks as per the kind of knowledge discovered:
This classification depends on the types of knowledge discovered or the data mining
functionalities, for example, discrimination, classification, clustering, characterization, etc.
Some frameworks tend to be comprehensive, offering several data mining functionalities
together.
iv. Classification of data mining frameworks according to the data mining techniques used:
This classification is based on the data analysis approach utilized, such as neural networks,
machine learning, genetic algorithms, visualization, statistics, data warehouse-oriented or
database-oriented approaches, etc.
The classification can also take into account the level of user interaction involved in the
data mining procedure, such as query-driven systems, autonomous systems, or interactive
exploratory systems.
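As an illustration of the classification technique itself, here is a minimal Python sketch (assuming scikit-learn is available; the customer records and class labels are invented purely for illustration) that trains a decision tree on a small labeled data set and classifies new records:

```python
# Minimal classification sketch using a decision tree (scikit-learn assumed available).
# The customer records and labels below are invented for illustration only.
from sklearn.tree import DecisionTreeClassifier

# Each record: [age, annual_income_in_thousands]; label: 1 = "likely buyer", 0 = "unlikely buyer"
X_train = [[25, 30], [47, 70], [52, 90], [23, 20], [40, 60], [60, 95]]
y_train = [0, 1, 1, 0, 1, 1]

# Fit the classifier on the labeled training data
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

# Classify previously unseen records into the learned classes
print(clf.predict([[30, 40], [55, 85]]))
```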
2. Clustering
Clustering is the division of information into groups of connected objects. Describing the data by a
few clusters loses certain fine details but achieves simplification; the data is modeled by its
clusters. From a historical point of view, clustering is rooted in statistics, mathematics, and
numerical analysis. From a machine learning point of view, clusters correspond to hidden patterns,
the search for clusters is unsupervised learning, and the resulting framework represents a data
concept. From a practical point of view, clustering plays an extraordinary role in data mining
applications such as scientific data exploration, text mining, information retrieval, spatial
database applications, CRM, web analysis, computational biology, medical diagnostics, and much more.
In other words, clustering analysis is a data mining technique for identifying similar data. This
technique helps to recognize the differences and similarities between data items. Clustering is very
similar to classification, but it involves grouping chunks of data together based on their
similarities.
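A minimal clustering sketch (scikit-learn assumed available; the two-feature customer data is invented for illustration) groups similar customers with k-means:

```python
# Minimal clustering sketch with k-means (scikit-learn assumed available).
# The [age, spending_score] values are invented for illustration only.
from sklearn.cluster import KMeans

customers = [[23, 80], [25, 75], [45, 30], [48, 25], [33, 55], [35, 60]]

# Group the customers into 3 clusters based on similarity of their features
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)

print(labels)                   # cluster index assigned to each customer
print(kmeans.cluster_centers_)  # centroid of each cluster
```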
3. Regression
Regression analysis is the data mining process used to identify and analyze the relationship
between variables in the presence of other factors. It is used to define the probability of a
specific variable. Regression is primarily a form of planning and modeling. For example, we might
use it to project certain costs, depending on other factors such as availability, consumer demand,
and competition. Primarily, it gives the exact relationship between two or more variables in the
given data set.
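The cost-projection example above can be sketched with a simple linear regression (scikit-learn assumed; the demand and competition figures are made up):

```python
# Minimal regression sketch (scikit-learn assumed available).
# Features: [consumer_demand_index, number_of_competitors]; target: projected cost.
# All numbers are invented for illustration.
from sklearn.linear_model import LinearRegression

X = [[100, 2], [120, 3], [90, 1], [150, 4], [110, 2]]
y = [200.0, 240.0, 170.0, 300.0, 215.0]

model = LinearRegression()
model.fit(X, y)

# The fitted coefficients describe the relationship between the variables,
# and predict() projects the cost for a new combination of demand and competition.
print(model.coef_, model.intercept_)
print(model.predict([[130, 3]]))
```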
4. Association Rules
This data mining technique helps to discover a link between two or more items. It finds hidden
patterns in the data set.
Association rules are if-then statements that help to show the probability of interactions between
data items within large data sets in different types of databases. Association rule mining has
several applications and is commonly used to help discover sales correlations in transactional data
or in medical data sets.
The way the algorithm works is that you have various data, for example, a list of grocery items
that you have been buying for the last six months. It calculates a percentage of items being
purchased together. Three measures are commonly used:
o Support:
  This measure shows how often items A and B are purchased together, relative to the entire
  dataset.
  Support = (Transactions containing both A and B) / (Total transactions)
o Confidence:
  This measure shows how often item B is purchased when item A is purchased as well.
  Confidence = (Transactions containing both A and B) / (Transactions containing A)
o Lift:
  This measure compares the confidence with how often item B is purchased overall; it tells
  how much more likely A and B are bought together than would be expected if they were
  independent.
  Lift = Confidence / ((Transactions containing B) / (Total transactions))
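A small pure-Python sketch (the grocery transactions are invented) shows how these three measures can be computed directly from a transaction list:

```python
# Compute support, confidence and lift for the rule "bread -> butter"
# from a small invented list of grocery transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

total = len(transactions)
count_a = sum(1 for t in transactions if "bread" in t)
count_b = sum(1 for t in transactions if "butter" in t)
count_ab = sum(1 for t in transactions if {"bread", "butter"} <= t)

support = count_ab / total              # (A and B) / entire dataset
confidence = count_ab / count_a         # (A and B) / A
lift = confidence / (count_b / total)   # confidence / support of B

print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
# support=0.60 confidence=0.75 lift=0.94 for this toy data
```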
5. Outlier Detection
This data mining technique relates to the observation of data items in the data set that do not
match an expected pattern or behavior. It can be used in various domains such as intrusion
detection, fraud detection, etc. It is also known as outlier analysis or outlier mining. An outlier
is a data point that diverges too much from the rest of the dataset, and the majority of real-world
datasets contain outliers. Outlier detection plays a significant role in the data mining field and
is valuable in numerous areas such as network intrusion identification, credit or debit card fraud
detection, and detecting outlying values in wireless sensor network data.
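A minimal outlier-detection sketch using a simple z-score rule (the transaction amounts are invented; real systems typically use more robust methods):

```python
# Flag data points that diverge too much from the rest of the dataset
# using a simple z-score threshold. The amounts are invented for illustration.
import statistics

amounts = [120, 130, 125, 118, 122, 127, 950, 124, 121, 126]

mean = statistics.mean(amounts)
stdev = statistics.stdev(amounts)

# Any value more than two standard deviations from the mean is treated as an outlier
outliers = [x for x in amounts if abs(x - mean) > 2 * stdev]
print(outliers)   # [950] -- the point far from the rest of the data
```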
6. Sequential Patterns
Sequential pattern mining is a data mining technique specialized for evaluating sequential data to
discover sequential patterns. It comprises finding interesting subsequences in a set of sequences,
where the value of a sequence can be measured in terms of different criteria such as length,
occurrence frequency, etc.
In other words, this technique of data mining helps to discover or recognize similar patterns in
transaction data over time.
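A minimal sketch (the purchase sequences and candidate pattern are invented) of checking how many transaction sequences contain a given ordered pattern:

```python
# Count in how many customer purchase sequences an ordered pattern occurs
# (items may be interleaved with others, but must appear in the given order).

def contains_pattern(sequence, pattern):
    it = iter(sequence)
    return all(item in it for item in pattern)   # consumes the iterator in order

sequences = [
    ["phone", "case", "charger"],
    ["phone", "headphones", "charger"],
    ["laptop", "mouse"],
    ["phone", "charger"],
]
pattern = ["phone", "charger"]

support = sum(contains_pattern(s, pattern) for s in sequences)
print(f"{support} of {len(sequences)} sequences contain the pattern")  # 3 of 4
```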
7. Prediction
Prediction uses a combination of other data mining techniques such as trend analysis, clustering,
classification, etc. It analyzes past events or instances in the right sequence to predict a future
event.
Data mining is described as a process of finding hidden, precious data by evaluating the huge
quantity of information stored in data warehouses, using multiple data mining techniques such as
Artificial Intelligence (AI), machine learning, and statistics.
The Cross-Industry Standard Process for Data Mining (CRISP-DM) comprises six phases arranged as a
cyclical process:
1. Business understanding
It focuses on understanding the project goals and requirements from a business point of view, then
converting this knowledge into a data mining problem definition and a preliminary plan designed to
accomplish the targets.
Tasks:
o Determine business objectives
o Assess situation
o Determine data mining goals
o Produce a project plan
Determine business objectives
o It uncovers significant factors that, at the start, can impact the outcome of the project.
Assess situation
o It requires a more detailed analysis of facts about all the resources, constraints,
  assumptions, and other factors that ought to be considered.
Determine data mining goals
o A business goal states the target in business terminology, for example, increase catalog sales
  to existing customers.
o A data mining goal describes the project objective in technical terms, for example, predict how
  many items a customer will buy, given their demographic details (age, salary, and city) and the
  price of the item, based on data from the past three years.
Produce a project plan
o It states the intended plan to accomplish the business and data mining goals.
o The project plan should describe the expected set of steps to be performed during the rest of the
  project, including the initial selection of techniques and tools.
2. Data Understanding
Data understanding starts with an initial data collection and proceeds with activities to get
familiar with the data, identify data quality issues, discover first insights into the data, or
detect interesting subsets that suggest hypotheses about hidden information.
Tasks:
o Collect initial data
o Describe data
o Explore data
o Verify data quality
Describe data
o It examines the "gross" or "surface" characteristics of the information obtained.
o It reports on the outcomes.
Explore data
o It addresses data mining questions that can be resolved by querying, visualizing, and
  reporting, including:
o the distribution of important attributes and the results of simple aggregations;
o relationships between small numbers of attributes;
o properties of important sub-populations and simple statistical analyses.
o It may refine the data mining objectives.
o It may contribute to or refine the data description and quality reports.
o It may feed into the transformation and other necessary data preparation steps.
3. Data Preparation
o Data preparation usually takes more than 90 percent of the project time.
o It covers all activities needed to build the final data set from the initial raw data.
o Data preparation is likely to be performed several times and not in any prescribed order.
Tasks
o Select data
o Clean data
o Construct data
o Integrate data
o Format data
Select data
o It decides on the data to be used for the analysis, based on relevance to the data mining goals,
  quality, and technical constraints.
Clean data
o It may involve the selection of clean subsets of the data, the insertion of suitable defaults, or
  more ambitious methods such as estimating missing data by modeling.
Construct data
o It comprises constructive data preparation operations, such as generating derived attributes,
  creating complete new records, or producing transformed values for existing attributes.
Integrate data
o Integrating data refers to the methods whereby data is combined from multiple tables or
  records to create new records or values.
Format data
o Formatting data refers mainly to syntactic modifications made to the data that do not change
  its meaning but may be required by the modeling tool.
4. Modeling
In this phase, various modeling techniques are selected and applied, and their parameters are
calibrated to optimal values. Some techniques have particular requirements on the form of the data,
so stepping back to the data preparation phase may be necessary.
Tasks
o Select modeling technique
o Generate test design
o Build model
o Assess model
Build model
o To create one or more models, we need to run the modeling tool on the prepared data set.
Assess model
o It interprets the models according to domain expertise, the data mining success criteria,
  and the desired test design.
o It assesses the success of the application of the modeling and discovery methods from a
  technical standpoint.
o It contacts business analysts and domain specialists later to discuss the outcomes of data
  mining in the business context.
5. Evaluation
o It evaluates the model's efficiency and reviews the steps executed to build the model, to
  ensure that the business objectives are properly achieved.
o The main objective of the evaluation is to determine whether some significant business issue has
  not been adequately considered.
o At the end of this phase, a decision on the use of the data mining results should be reached.
Tasks
o Evaluate results
o Review process
Evaluate results
o It assesses the degree to which the model meets the organization's business objectives.
o It tests the model on test applications in the actual implementation, when time and budget
  limitations permit, and also assesses the other data mining results produced.
Review process
o The review process makes a more detailed assessment of the data mining engagement to
  determine whether there is any significant factor or task that has somehow been overlooked.
o It decides whether to complete the project and move on to deployment, to initiate further
  iterations, or to set up new data mining projects; it includes an analysis of the remaining
  resources and budget, which influences the decision.
6. Deployment
o Deployment refers to how the outcomes of the data mining effort are to be utilized.
o It includes scoring a database, utilizing results as business rules, or interactive internet
  scoring.
o The knowledge acquired will need to be organized and presented in a way that the client can use.
  Depending on the requirements, the deployment phase may be as simple as generating a report or as
  complicated as implementing a repeatable data mining process across the organization.
Tasks
o Plan deployment
o Plan monitoring and maintenance
o Produce final report
o Review project
Plan deployment:
o To deploy the data mining outcomes into the business, it takes the evaluation results and
  works out a strategy for deployment.
o It refers to documenting the procedure for later deployment.
Review project
o The project review assesses what went right and what went wrong, what was done well, and
  what needs to be improved.
Introduction
Data mining is a significant method where previously unknown and potentially useful information
is extracted from the vast amount of data. The data mining process involves several components,
and these components constitute a data mining system architecture.
The significant components of data mining systems are a data source, data mining engine, data
warehouse server, the pattern evaluation module, graphical user interface, and knowledge base.
Data Source
The actual sources of data are databases, data warehouses, the World Wide Web (WWW), text files, and
other documents. You need a huge amount of historical data for data mining to be successful.
Organizations typically store data in databases or data warehouses. Data warehouses may
comprise one or more databases, text files, spreadsheets, or other repositories of data. Sometimes,
even plain text files or spreadsheets may contain information. Another primary source of data is
the World Wide Web, or the internet.
Different Processes
Before passing the data to the database or data warehouse server, the data must be cleaned,
integrated, and selected. As the information comes from various sources and in different formats, it
cannot be used directly for the data mining procedure, because the data may not be complete or
accurate. So, the data first needs to be cleaned and unified. More information than needed will be
collected from the various data sources, and only the data of interest has to be selected and passed
to the server. These procedures are not as easy as they sound; several methods may be applied to the
data as part of selection, integration, and cleaning.
In other words, we can say that the data mining engine is the core of the data mining architecture.
It comprises the instruments and software used to obtain insights and knowledge from the data
collected from various data sources and stored within the data warehouse.
Knowledge Base
The knowledge base is helpful in the entire data mining process. It might be helpful for guiding the
search or for evaluating the interestingness of the resulting patterns. The knowledge base may even
contain user views and data from user experiences that might be helpful in the data mining process.
The data mining engine may receive inputs from the knowledge base to make the results more accurate
and reliable. The pattern evaluation module regularly interacts with the knowledge base to get
inputs and also to update it.
The main objective of the KDD process is to extract information from data in the context of large
databases. It does this by using data mining algorithms to identify what is deemed knowledge.
The process begins with determining the KDD objectives and ends with the implementation of the
discovered knowledge. At that point, the loop is closed, and active data mining starts.
Subsequently, changes would need to be made in the application domain, for example, offering
various features to cell phone users in order to reduce churn. This closes the loop: the impacts
are then measured on the new data repositories, and the KDD process is started again. The following
is a concise description of the nine-step KDD process, beginning with a managerial step:
4. Data Transformation
In this stage, appropriate data for data mining is prepared and developed. Techniques here include
dimension reduction (for example, feature selection and extraction, and record sampling) and
attribute transformation (for example, discretization of numerical attributes and functional
transformations). This step can be essential for the success of the entire KDD project, and it is
typically very project-specific. For example, in medical examinations, the quotient of attributes
may often be the most significant factor, and not each one by itself. In business, we may need to
consider effects beyond our control as well as efforts and transient issues, for example, studying
the effect of advertising accumulation. However, if we do not use the right transformation at the
start, then we may obtain a surprising effect that hints at the transformation required in the next
iteration. Thus, the KDD process feeds back into itself and prompts an understanding of the
transformation required.
At last, the implementation of the data mining algorithm is reached. In this stage, we may need to
run the algorithm several times until a satisfying outcome is obtained, for example, by tuning the
algorithm's control parameters, such as the minimum number of instances in a single leaf of a
decision tree.
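For example, a small sketch (scikit-learn assumed available; the tiny data set is invented) of re-running a decision tree while tuning the minimum number of instances allowed in a single leaf:

```python
# Re-run the mining algorithm with different control parameters until a
# satisfying outcome is obtained; here we vary min_samples_leaf of a decision tree.
from sklearn.tree import DecisionTreeClassifier

X = [[1], [2], [3], [4], [5], [6], [7], [8]]
y = [0, 0, 0, 1, 0, 1, 1, 1]

for min_leaf in (1, 2, 4):
    tree = DecisionTreeClassifier(min_samples_leaf=min_leaf, random_state=0)
    tree.fit(X, y)
    print(min_leaf, tree.score(X, y))   # training accuracy for each parameter setting
```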
8. Evaluation
In this step, we assess and interpret the mined patterns, rules, and their reliability with respect
to the objectives characterized in the first step. Here we consider the preprocessing steps with
respect to their impact on the data mining algorithm's results; for example, we might add a feature
in step 4 and repeat from there. This step focuses on the comprehensibility and utility of the
induced model. In this step, the identified knowledge is also recorded for further use. The last
step is the use of, and overall feedback on, the discovery results acquired by data mining.
Now we are ready to incorporate the knowledge into another system for further action. The knowledge
becomes effective in the sense that we may make changes to the system and measure the impact. The
accomplishment of this step determines the effectiveness of the whole KDD process. There are
numerous challenges in this step, such as losing the "laboratory conditions" under which we have
worked. For example, the knowledge was discovered from a certain static snapshot (usually a set of
data), but now the data becomes dynamic. Data structures may change, certain quantities may become
unavailable, and the data domain might be modified, for example, an attribute may take a value that
was not anticipated previously.
Data Mining vs Machine Learning
Data mining relates to extracting information from a large quantity of data. It is a technique for
discovering different kinds of patterns that are inherent in the data set and that are precise, new,
and useful. Data mining works as a subset of business analytics and is similar to experimental
studies. Data mining's origins are databases and statistics.
Data mining and machine learning are areas that have influenced each other; although they have many
things in common, they serve different ends.
Data mining is performed on certain data sets by humans to find interesting patterns between the
items in the data set. Data mining uses techniques created by machine learning for predicting
results, while machine learning is the capability of a computer to learn from a provided data set.
Machine learning algorithms take the information that represents the relationships between items in
data sets and create models in order to predict future results. These models are nothing more than
actions that will be taken by the machine to achieve a result.
Data mining is the method of extracting data or previously unknown data patterns from huge sets of
data. Hence, as the phrase suggests, we "mine for specific data" from the large data set. Data
mining, also called the Knowledge Discovery Process, is a field of science used to determine the
properties of data sets. Gregory Piatetsky-Shapiro coined the term "Knowledge Discovery in
Databases" (KDD) in 1989, and the term "data mining" appeared in the database community around 1990.
Huge sets of data collected from data warehouses or complex data sets such as time series, spatial
data, etc. are processed in order to extract interesting correlations and patterns between the data
items. The output of a data mining algorithm is often used as input for machine learning algorithms.
What is Machine learning?
Machine learning is related to the development and design of a machine that can learn by itself from
a specified set of data to obtain a desirable result without being explicitly coded. Hence machine
learning implies "a machine that learns on its own." The term machine learning was coined in 1959 by
Arthur Samuel, an American pioneer in the areas of computer gaming and artificial intelligence, who
said that it "gives computers the ability to learn without being explicitly programmed."
Machine learning is a technique that creates complex algorithms for large data processing and
provides outcomes to its users. It utilizes complex programs that can learn through experience and
make predictions.
The algorithms improve themselves through the frequent input of training data. The aim of machine
learning is to understand the data and to build models from it that humans can understand and use.
Machine learning is broadly divided into two types:
1. Unsupervised Learning
2. Supervised Learning
2. Data mining uses more data to obtain helpful information, and that specific data helps to
predict future results. For example, a marketing company may use last year's data to predict sales,
but machine learning does not depend as much on data; it uses algorithms. Many transportation
companies such as OLA and UBER use machine learning techniques to calculate the ETA (Estimated Time
of Arrival) for rides.
3. Data mining is not capable of self-learning. It follows predefined rules and provides the answer
to a specific problem, whereas machine learning algorithms are self-adjusting and can alter their
rules according to the situation, finding and resolving the solution for a specific problem in their
own way.
4. The main and most important difference between data mining and machine learning is that data
mining cannot work without human involvement, whereas in machine learning human effort is only
involved when the algorithm is defined; after that, the algorithm concludes everything on its own.
Once implemented, it can be used indefinitely, but this is not possible in the case of data mining.
5. Because machine learning is an automated process, the results produced by machine learning are
more precise compared to data mining.
6. Data mining utilizes the database, data warehouse server, data mining engine, and pattern
assessment techniques to obtain useful information, whereas machine learning utilizes neural
networks, predictive models, and automated algorithms to make the decisions.
Data Mining Vs Machine Learning
o History — Data Mining: In 1930, it was known as knowledge discovery in databases (KDD).
  Machine Learning: The first program, Samuel's checker-playing program, was established in 1950.
o Responsibility — Data Mining: It is used to obtain the rules from the existing data.
  Machine Learning: It teaches the computer how to learn and comprehend the rules.
o Abstraction — Data Mining: It abstracts patterns from the data warehouse.
  Machine Learning: It reads from the machine (the learned model).
In this digital era, social platforms have become inevitable. Whether we like these platforms or
not, there is no escape. Facebook allows us to interact with friends and family and to stay up to
date about the latest happenings around the world. Facebook has made the world seem much smaller,
and it is one of the most important channels of online business communication. Business owners make
the most of this platform. The most important reason this platform is so widely accessed is that it
is one of the oldest video and photo sharing social media tools.
A Facebook page helps people become aware of a brand through the media content shared there. The
platform supports businesses in reaching out to their audience and establishing their business
presence through Facebook usage itself.
This platform is useful not only for users with business accounts but also for accounts with
personal blogs. Bloggers and influencers who post content that attracts customers give users yet
another reason to access Facebook.
As far as usage by ordinary users is concerned, many people nowadays cannot live without Facebook.
It has become a habit to such an extent that people check the site every half an hour.
Facebook is one of the most popular social media platforms. Created in 2004, it now has almost two
billion monthly active users, with five new profiles created every second. Anyone over the age of 13
can use the site. Users create a free account, which is a profile in which they share as much or as
little information about themselves as they wish.
Some Facts about Facebook:
o Headquarters: California, US
o Established: February 2004
o Founded by: Mark Zuckerberg
o There are approximately 52 percent Female users and 48 percent Male users on Facebook.
o Facebook stories are viewed by 0.6 Billion viewers on a daily basis.
o In 2019, in 60 seconds on the internet, 1 million people Log In to Facebook.
o More than 5 billion messages are posted on Facebook pages collectively, on a monthly basis.
On a Facebook page, a user can include many different kinds of personal data, including the user's
date of birth, hobbies and interests, education, sexual preferences, political and religious
affiliations, and current employment. Users can also post photos of themselves as well as of other
people, and they can offer other Facebook users the opportunity to search for and communicate with
them via the website. Researchers have realized that the abundance of personal data on Facebook, as
well as on other social networking platforms, can easily be collected, or mined, to search for
patterns in people's behavior. For example, social researchers at various universities around the
world have collected data from Facebook pages to become familiar with the lives and social networks
of college students. They have also mined data from MySpace to find out how people express feelings
on the web and to assess, based on data posted on MySpace, what youths think about appropriate
internet conduct.
Because academic specialists, particularly those in the social sciences, are collecting data from
Facebook and other internet websites and publishing their findings, numerous university
Institutional Review Boards (IRBs), councils charged by government guidelines with reviewing
research involving human subjects, have built up policies and procedures that govern research on the
internet. Some have created policies specifically relating to data mining on social media platforms
like Facebook. These policies serve as institution-specific supplements to the Department of Health
and Human Services (HHS) guidelines governing the conduct of research with human subjects. The
formation of these institution-specific policies indicates that at least some university IRBs view
data mining on Facebook as research with human subjects. Thus, at the universities where this is the
case, research involving data mining on Facebook must undergo IRB review before the research may
start.
According to the HHS guidelines, all research with human subjects must undergo IRB review and
receive IRB approval before the research may start. This administrative requirement tries to ensure
that human subjects research is conducted as ethically as possible, in particular requiring that
subjects' participation in research is voluntary, that the risks to subjects are proportionate to
the benefits, and that no subject population is unfairly excluded from or included in the research.
Social Media Data Mining Methods
Applying data mining techniques to social media is relatively new compared to other fields of
research related to social network analytics, especially when we consider that research in social
network analysis dates back to the 1930s. Applications that use data mining techniques developed by
industry and academia are already being used commercially. For example, "social media analytics"
organizations offer services that track social media to provide customers with data about how goods
and services are perceived and discussed on social media networks. Analysts in such organizations
have applied text mining algorithms and information propagation models to blogs to create techniques
to better understand how data moves through the blogosphere.
Data mining techniques can be applied to social media sites to understand the data better and to
make use of it for analytics, research, and business purposes. Representative areas include
community or group detection, information diffusion, influence propagation, topic detection and
tracking, individual behavior analysis, group behavior analysis, and market research for
organizations.
Representation of Data
As with other social media data, it is common to use a graph representation to study social media
data sets. A graph comprises a set of vertices (nodes) and edges (links). Users are usually shown as
the nodes in the graph, and relationships or collaborations between individuals (nodes) are shown as
the links in the graph.
The graph depiction is natural for information extracted from social networking sites, where people
interact with friends, family, and business associates; it helps to create a social network of
friends, family, or business associates. Less apparent is how the graph structure applies to blogs,
wikis, opinion mining, and similar types of online social media platforms.
If we consider blogs, one graph representation has blogs as the nodes and can be regarded as a "blog
network," while another representation has blog posts as the nodes and can be regarded as a "post
network." In a post network, edges are created when one blog post references another blog post.
Other techniques used to represent blog networks account for individuals, relationships, content,
and time simultaneously, called internet Online Analytical Processing (iOLAP). Wikis can be
considered in the context of depicting authors as nodes, with edges created when authors contribute
to the same object.
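A minimal sketch (the user names and friendships are invented) of representing such social media data as a graph using a plain adjacency list:

```python
# Represent users as nodes and relationships as edges using an adjacency list.
# The users and friendships below are invented for illustration.
from collections import defaultdict

graph = defaultdict(set)

friendships = [("alice", "bob"), ("bob", "carol"), ("carol", "dave"), ("alice", "carol")]
for a, b in friendships:
    graph[a].add(b)   # add an undirected edge between the two users
    graph[b].add(a)

print(dict(graph))
print(len(graph["carol"]))   # degree of "carol": number of direct connections
```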
The graphical representation allows the application of classical mathematical graph theory,
traditional techniques for analyzing social media platforms, and work on mining graph data. The
potentially large size of the graphs used to depict social media platforms can present difficulties
for automated processing, as limits on computer memory and processing speed are quickly reached when
trying to cope with huge social media data sets. Other challenges to implementing automated
procedures for social media data mining include identifying and dealing with spam, the variety of
formats used within the same subcategory of social media, and continuously changing content and
structure.
The problem itself determines the best approach. There is no substitute for understanding the data
as much as possible before applying data mining techniques, as well as understanding the various
data mining tools that are available. A subject matter analyst might be required to help better
understand the data set. To better understand the various tools available for data mining, there are
a host of data mining and machine learning texts and other resources that provide detailed
information about particular data mining techniques and algorithms.
Once you understand the issues and select an appropriate data mining approach, consider any
preprocessing that needs to be done. A sampling process may also be required to develop a data set
small enough to allow reasonable processing times. Preprocessing should include suitable privacy
protection mechanisms: although social media platforms incorporate huge amounts of openly accessible
data, it is important to guarantee that individual rights and social media platform copyrights are
protected. The effect of spam should be considered along with the temporal representation.
In addition to preprocessing, it is essential to think about the effect of time. Depending upon the
inquiry and the research, we may get different outcomes at one time compared to another. Although
the time component is an obvious consideration for specific areas such as topic detection, influence
propagation, and network development, less evident is the effect of time on network identification,
group behavior, and marketing. What defines a network at one point in time can be significantly
different at another point in time. Group behavior and interests change over time, and what appeals
to individuals or groups today may not be trendy tomorrow.
With the data depicted as a graph, the task starts with a selected number of nodes, known as seeds.
Graphs are traversed starting from the set of seeds, and as the link structure from the seed nodes
is followed, data is collected and the structure itself is also recorded. Utilizing the link
structure to expand out from the seed set and gather new information is known as crawling the
network. The applications and algorithms that are executed as a crawler should effectively manage
the challenges present in dynamic social media platforms, such as restricted sites, format changes,
and structural errors (invalid links). As the crawler finds new data, it stores the data in a
repository for further analysis, and as link data is found, the crawler updates the information
about the network structure.
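A minimal sketch of crawling outward from a set of seed nodes; note that `fetch_links` is a hypothetical stand-in for whatever API call or page scrape would actually return a node's neighbors:

```python
# Breadth-first crawl starting from a set of seed nodes.
# fetch_links() is a hypothetical placeholder: in a real crawler it would call a
# platform API or scrape a page to return the nodes linked from `node`.
from collections import deque

def fetch_links(node):
    # Invented, static link structure used only so the sketch runs end to end.
    sample_network = {"seed1": ["a", "b"], "a": ["c"], "b": [], "c": ["a"]}
    return sample_network.get(node, [])

def crawl(seeds, max_nodes=100):
    visited, repository = set(), {}          # repository stores the collected link data
    queue = deque(seeds)
    while queue and len(visited) < max_nodes:
        node = queue.popleft()
        if node in visited:
            continue
        visited.add(node)
        neighbors = fetch_links(node)        # collect data about this node
        repository[node] = neighbors         # update the stored network structure
        queue.extend(n for n in neighbors if n not in visited)
    return repository

print(crawl(["seed1"]))
```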
Some social media platforms such as Facebook, Twitter, and Technorati provide Application
Programming Interfaces (APIs) that allow crawler applications to interact with the data sources
directly. However, these platforms usually restrict the number of API transactions per day,
depending on the affiliation the API user has with the platform. For some platforms, it is possible
to collect data (crawl) without using APIs. Given the huge size of the data available on social
media platforms, it might be necessary to restrict the amount of data that the crawler collects.
When the crawler has collected the data, some postprocessing may be needed to validate and clean up
the data. Traditional social network analysis methods can then be applied, for example, centrality
measures and group structure studies. In many cases, additional data will be related to a node or a
link, opening opportunities for more complex methods to consider the deeper semantics that can be
exposed with text and data mining techniques.
We now focus on two particular kinds of social media platform data to further illustrate how data
mining techniques are applied to social media sites. The two major areas, social networking
platforms and blogs, are dynamic and rich data sources, and both offer potential value to the wider
scientific community as well as to business organizations.
Here, the figure illustrates the hypothetical graph structure diagram for typical social
media platforms, and Arrows indicate links to a larger part of the graph.
It is important to protect personal identity when working with social media platform data. Recent
reports highlight the need to protect privacy, as it has been demonstrated that even anonymizing
this sort of data can still reveal personal information when advanced data analysis techniques are
used. Privacy settings can also restrict the ability of data mining applications to consider every
piece of data on social media platforms; however, some malicious techniques can be used to
circumvent the privacy settings.
Let's understand this with an example. Suppose we are a marketing manager, and we have a tempting
new product to sell. We are sure that the product would bring enormous profit, as long as it is sold
to the right people. So, how can we tell who is best suited for the product from our company's huge
customer base?
Clustering, which falls under the category of unsupervised machine learning, is one of the problems
that machine learning algorithms solve. Clustering only utilizes input data to determine patterns,
anomalies, or similarities in that data. It helps to split data into several subsets, each
containing data similar to each other; these subsets are called clusters. Once the data from our
customer base is divided into clusters, we can make an informed decision about who we think is best
suited for this product.
A good clustering result is one in which:
o The intra-cluster similarity is high, which implies that the data present inside a cluster is
  similar to one another.
o The inter-cluster similarity is low, which means each cluster holds data that is not similar to
  the data in other clusters.
What is a Cluster?
o A cluster is a subset of similar objects
o A subset of objects such that the distance between any of the two objects in the cluster is less
than the distance between any object in the cluster and any object that is not located inside it.
o A connected region of a multidimensional space with a comparatively high density of objects.
Important points:
1. Scalability:
Scalability in clustering implies that as we increase the number of data objects, the time to
perform clustering should scale approximately with the complexity order of the algorithm. For
example, if we perform k-means clustering, we know it is roughly O(n), where n is the number of
objects in the data. If we raise the number of data objects tenfold, then the time taken to cluster
them should also increase approximately tenfold; that is, there should be a linear relationship. If
that is not the case, then there is some error in our implementation. The algorithm should scale to
the data; if it does not, we may not get an appropriate result.
2. Interpretability:
The clustering results should be interpretable, comprehensible, and usable. The clustering algorithm
should also be able to find clusters of arbitrary shape, and should not be limited to distance
measures that tend to discover only small spherical clusters. Algorithms should be capable of being
applied to any kind of data, such as interval-based (numeric) data, binary data, and categorical
data. Databases contain data that is noisy, missing, or erroneous; some algorithms are sensitive to
such data and may produce poor-quality clusters.
6. High dimensionality:
The clustering tools should be able to handle not only low-dimensional data but also
high-dimensional data spaces.
Text Data Mining
Text data mining can be described as the process of extracting essential data from natural language
text. All the data that we generate via text messages, documents, emails, and files is written in
common natural language text. Text mining is primarily used to draw useful insights or patterns from
such data.
The text mining market has experienced exponential growth and adoption over the last few years and
is expected to see significant growth and adoption in the coming years as well. One of the primary
reasons behind the adoption of text mining is the higher competition in the business market, with
many organizations seeking value-added solutions to compete with other organizations. With
increasing competition in business and changing customer perspectives, organizations are making huge
investments to find solutions that are capable of analyzing customer and competitor data to improve
competitiveness. The primary sources of data are e-commerce websites, social media platforms,
published articles, surveys, and many more. The larger part of the generated data is unstructured,
which makes it challenging and expensive for organizations to analyze with human effort alone. This
challenge, combined with the exponential growth in data generation, has led to the growth of
analytical tools that are not only able to handle large volumes of text data but also help in
decision-making. Text mining software empowers a user to draw useful information from a huge set of
available data sources.
o Information Extraction:
The automatic extraction of structured data such as entities, relationships between entities, and
attributes describing entities from an unstructured source is called information extraction.
o Natural Language Processing:
NLP stands for Natural Language Processing. It helps computer software understand human language as
it is spoken or written. NLP is primarily a component of artificial intelligence (AI). Developing
NLP applications is difficult because computers generally expect humans to "speak" to them in a
programming language that is accurate, clear, and exceptionally structured. Human speech, however,
is often ambiguous and depends on many complex variables, including slang, social context, and
regional dialects.
o Data Mining:
Data mining refers to the extraction of useful data and hidden patterns from large data sets. Data
mining tools can predict behaviors and future trends, allowing businesses to make better data-driven
decisions. Data mining tools can be used to resolve many business problems that have traditionally
been too time-consuming to solve manually.
o Information Retrieval:
Information retrieval deals with retrieving useful data from the data that is stored in our systems.
As an analogy, we can view the search engines on websites such as e-commerce sites, or any other
sites, as part of information retrieval.
Text Mining Process:
The text mining process incorporates the following steps to extract the data from the
document.
o Text transformation
A text transformation is a technique used to control the capitalization of the text and the way a
document is represented. The two major ways of representing a document are given here (a small code
sketch of both follows this list):
a. Bag of words
b. Vector Space
o Text Pre-processing
Pre-processing is a significant task and a critical step in Text Mining, Natural Language Processing
(NLP), and information retrieval(IR). In the field of text mining, data pre-processing is used for
extracting useful information and knowledge from unstructured text data. Information Retrieval
(IR) is a matter of choosing which documents in a collection should be retrieved to fulfill the
user's need.
o Feature selection:
Feature selection is a significant part of data mining. Feature selection can be defined as the
process of reducing the input of processing or finding the essential information sources. The
feature selection is also called variable selection.
o Data Mining:
Now, in this step, the text mining procedure merges with the conventional data mining process.
Classic data mining techniques are applied to the resulting structured database.
o Evaluate:
Afterward, the results are evaluated. Once a result has been evaluated, it is either put to use or
discarded.
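As mentioned under text transformation above, here is a minimal sketch of building a bag-of-words representation and a simple vector space (term-count vectors) for two invented documents:

```python
# Build a bag-of-words vocabulary and term-count vectors for a tiny invented corpus.
from collections import Counter

docs = [
    "data mining extracts useful patterns from data",
    "text mining extracts information from text",
]

# Bag of words: each document reduced to a multiset of its terms
bags = [Counter(doc.lower().split()) for doc in docs]

# Vector space: one dimension per vocabulary term, values are term counts
vocabulary = sorted(set(term for bag in bags for term in bag))
vectors = [[bag[term] for term in vocabulary] for bag in bags]

print(vocabulary)
for vec in vectors:
    print(vec)
```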
Text Mining Applications:
The following are common text mining applications:
o Risk Management:
Risk management is a systematic and logical procedure of identifying, analyzing, treating, and
monitoring the risks involved in any action or process in an organization. Insufficient risk
analysis is usually a leading cause of failure. This is particularly true in financial
organizations, where the adoption of risk management software based on text mining technology can
effectively enhance the ability to reduce risk. It enables the administration of millions of sources
and petabytes of text documents, and it gives the ability to connect the data and access the
appropriate information at the right time.
o Customer Care Service
Text mining methods, particularly NLP, are finding increasing significance in the field of customer
care. Organizations are investing in text analytics software to improve the overall customer
experience by accessing textual data from different sources such as customer feedback, surveys,
customer calls, etc. The primary objective of text analysis is to reduce the response time of the
organization and to help address customer complaints rapidly and productively.
o Business Intelligence
Companies and business firms have started to use text mining strategies as a major aspect of their
business intelligence. Besides providing significant insights into customer behavior and trends,
text mining strategies also help organizations analyze the strengths and weaknesses of their
competitors, giving them a competitive advantage in the market.
o Social Media Analysis
Social media analysis helps to track online data, and there are numerous text mining tools designed
specifically for performance analysis of social media sites. These tools help to monitor and
interpret text generated on the internet from news, emails, blogs, etc. Text mining tools can
precisely analyze the total number of posts, followers, and likes of your brand on a social media
platform, enabling you to understand the responses of the individuals who are interacting with your
brand and content.
Text Mining Approaches in Data Mining
The following text mining approaches are used in data mining.
Keyword-based association analysis
This approach collects sets of keywords or terms that frequently occur together and then discovers
the association relationships among them. First, it preprocesses the text data by parsing, stemming,
removing stop words, etc. Once the data has been pre-processed, it runs association mining
algorithms. Here, human effort is not required, so the number of unwanted results and the execution
time are reduced.
Document classification analysis
This analysis is used for the automatic classification of huge numbers of online text documents like
web pages, emails, etc. Text document classification differs from the classification of relational
data, as document databases are not organized according to attribute-value pairs.
Numericizing text
o Stemming algorithms
A significant pre-processing step before the indexing of input documents is the stemming of words.
The term "stemming" can be defined as a reduction of words to their root forms, so that, for
example, different grammatical forms of a word such as "ordered" and "ordering" are treated as the
same. The primary purpose of stemming is to ensure that similar words are recognized as one term by
the text mining program.
o Support for different languages
There are some highly language-dependent operations such as stemming, handling synonyms, and
deciding which letters are allowed in words; therefore, support for various languages is important.
o Excluding certain characters
Excluding numbers, specific characters, sequences of characters, or words that are shorter or longer
than a specific number of letters can be done before the indexing of the input documents.
o Include lists and exclude lists (stop words)
A particular list of words to be indexed can be defined, which is useful when we want to search for
specific words and classify the input documents based on the frequencies with which those words
occur. Additionally, "stop words," meaning terms that are to be excluded from the indexing, can be
defined. Typically, a default list of English stop words includes "the," "a," "since," etc. These
words are used very often in the respective language but communicate very little information in the
document.
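A minimal pre-processing sketch combining stop-word removal with crude suffix stripping (the suffix rules are deliberately simplistic; production systems would use a real stemmer such as the Porter algorithm and a fuller stop-word list):

```python
# Crude pre-processing sketch: lowercase, remove stop words, strip common suffixes.
# The stop-word list and suffix rules are simplified stand-ins for real tools.
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "of", "to", "in", "since"}
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

text = "The miners are mining and ordering the mined documents"
tokens = [w for w in text.lower().split() if w not in STOP_WORDS]
stems = [crude_stem(w) for w in tokens]
print(stems)   # ['miner', 'min', 'order', 'min', 'document']
```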
Bagging Vs Boosting
We all use the decision tree technique in day-to-day life to make decisions. Organizations use
supervised machine learning techniques like decision trees to make better decisions and to
generate more surplus and profit.
There are two techniques, given below, that are used to build an ensemble of decision
trees: bagging and boosting.
Bagging
The following steps are taken to implement a random forest:
1. Multiple subsets are created from the original dataset by selecting observations with
   replacement (bootstrap samples).
2. A decision tree is built on each of these subsets.
3. Each tree produces a prediction for every new data point.
4. The final prediction is the average of the predictions (for regression) or the majority vote
   (for classification) across all the trees.
Since the final prediction depends on the mean of the predictions from the subset trees, it won't
give a precise value for the regression model.
Boosting
In boosting, models are built sequentially, and each model focuses on the examples that earlier
models handled poorly. If a given input is misclassified by a hypothesis, its weight is increased so
that the next hypothesis is more likely to classify it correctly; combining the entire set of
hypotheses at the end converts weak learners into better-performing models.
Gradient boosting utilizes a gradient descent algorithm that can optimize any differentiable loss
function. An ensemble of trees is built one at a time, and the individual trees are summed
sequentially. The next tree tries to recover the loss (the difference between the actual and
predicted values).
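A minimal sketch contrasting the two ensemble approaches (scikit-learn assumed available; the regression data is invented for illustration):

```python
# Bagging vs. boosting on the same invented regression data (scikit-learn assumed).
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor

X = [[1], [2], [3], [4], [5], [6], [7], [8]]
y = [1.2, 1.9, 3.1, 3.9, 5.2, 6.1, 6.8, 8.1]

# Bagging: trees are built independently on bootstrap samples, predictions are averaged
bagging = BaggingRegressor(n_estimators=25, random_state=0).fit(X, y)

# Boosting: trees are added one after another, each fitting the remaining error
boosting = GradientBoostingRegressor(n_estimators=50, learning_rate=0.1,
                                     random_state=0).fit(X, y)

print(bagging.predict([[4.5]]), boosting.predict([[4.5]]))
```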
The key differences between bagging and boosting:
o Bagging: Various training data subsets are randomly drawn with replacement from the whole
  training dataset. Boosting: Each new subset contains the components that were misclassified by
  previous models.
o Bagging: It attempts to tackle the over-fitting issue. Boosting: It tries to reduce bias.
o Bagging: If the classifier is unstable (high variance), then we need to apply bagging.
  Boosting: If the classifier is steady and straightforward (high bias), then we need to apply
  boosting.
o Bagging: Every model receives an equal weight. Boosting: Models are weighted by their performance.
o Bagging: The objective is to decrease variance, not bias. Boosting: The objective is to decrease
  bias, not variance.
o Bagging: It is the easiest way of combining predictions that belong to the same type.
  Boosting: It is a way of combining predictions that belong to different types.
o Bagging: Every model is constructed independently. Boosting: New models are affected by the
  performance of previously developed models.
Data warehousing refers to the process of compiling and organizing data into one common database,
whereas data mining refers to the process of extracting useful data from that database. The data
mining process depends on the data compiled in the data warehousing phase to recognize meaningful
patterns. A data warehouse is created to support management systems.
Data Warehouse
A data warehouse refers to a place where data can be stored for useful mining. It is like a quick
computer system with an exceptionally huge data storage capacity. Data from the organization's
various systems is copied to the warehouse, where it can be fetched and conformed to remove errors.
Here, advanced queries can be made against the warehouse's store of data.
A data warehouse combines data from numerous sources, which ensures data quality, accuracy, and
consistency. A data warehouse boosts system performance by separating analytics processing from
transactional databases. Data flows into a data warehouse from different databases. A data warehouse
works by organizing data into a schema that describes the format and types of data. Query tools
examine the data tables using this schema.
Data warehouses and databases both are relative data systems, but both are made to serve
different purposes. A data warehouse is built to store a huge amount of historical data and
empowers fast requests over all the data, typically using Online Analytical Processing (OLAP).
A database is made to store current transactions and allow quick access to specific transactions
for ongoing business processes, commonly known as Online Transaction Processing (OLTP).
1. Subject Oriented
A data warehouse is subject-oriented. It provides useful data about a subject instead of the
company's ongoing operations, and these subjects can be customers, suppliers, marketing,
product, promotion, etc. A data warehouse usually focuses on modeling and analysis of data
that helps the business organization to make data-driven decisions.
2. Time-Variant:
The data present in the data warehouse is identified with a particular time period and provides information from a historical perspective.
3. Integrated
A data warehouse is built by integrating data from heterogeneous sources, such as relational databases,
flat files, etc.
4. Non-Volatile
The data warehouse is non-volatile, meaning that previously stored data is not erased or changed when new data is added; data is typically read-only and refreshed at scheduled intervals.
Data Mining:
Data mining refers to the analysis of data. It is the computer-supported process of analyzing
huge sets of data that have either been compiled by computer systems or downloaded into the
computer. In the data mining process, the computer analyzes the data and extracts useful
information from it. It looks for hidden patterns within the data set and tries to predict future
behavior. Data mining is primarily used to discover and indicate relationships among the data sets.
Data mining aims to enable business organizations to view business behaviors, trends, and
relationships that allow the business to make data-driven decisions. It is also known as
Knowledge Discovery in Databases (KDD). Data mining tools utilize AI, statistics, databases, and
machine learning systems to discover the relationships between the data. Data mining tools can
answer business questions that were traditionally too time-consuming to resolve.
i. Market Analysis:
Data mining can predict the market, which helps the business make decisions. For example,
it predicts who is likely to purchase what type of products.
Data mining methods can help find which cellular phone calls, insurance claims, or credit and
debit card purchases are likely to be fraudulent.
Data mining techniques are widely used to help model financial markets.
Analyzing the current trends in the marketplace is a strategic benefit because it helps in
cost reduction and in tailoring the manufacturing process to market demand.
Data Mining vs Data Warehousing:
o Data mining is the process of determining data patterns, whereas a data warehouse is a database system designed for analytics.
o Data mining is generally considered the process of extracting useful data from a large set of data, whereas data warehousing is the process of combining all the relevant data.
o Business entrepreneurs carry out data mining with the help of engineers, whereas data warehousing is entirely carried out by engineers.
o In data mining, data is analyzed repeatedly, whereas in data warehousing, data is stored periodically.
o Data mining uses pattern recognition techniques to identify patterns, whereas data warehousing is the process of extracting and storing data in a way that allows easier reporting.
o One of the most notable data mining capabilities is the detection and identification of unwanted errors that occur in a system, whereas one of the advantages of the data warehouse is its ability to be updated frequently, which makes it ideal for business entrepreneurs who want to stay up to date.
o Data mining techniques are cost-efficient compared to other statistical data applications, whereas the role of the data warehouse is to simplify every type of business data.
o Data mining techniques are not 100 percent accurate and may lead to serious consequences in certain conditions, whereas in the data warehouse there is a high possibility that the data required for analysis may not have been integrated into the warehouse, which can lead to loss of data.
o Companies can benefit from data mining by gaining suitable and accessible knowledge-based data, whereas a data warehouse stores a huge amount of historical data that helps users analyze different periods and trends to make future predictions.
Social media is a great source of information and a perfect platform for communication.
Businesses and individuals can make the most of it instead of only sharing their photos and
videos on the platform. The platform gives its users the freedom to connect with their target
group easily. Both small groups and established businesses face difficulties in keeping up with
the competitive social media industry, but through social media platforms, users can market and
develop their brand or content.
Social media mining includes social media platforms, social network analysis, and data mining
to provide a convenient and consistent platform for learners, professionals, scientists, and
project managers to understand the fundamentals and potentials of social media mining. It
suggests various problems arising from social media data and presents fundamental concepts,
emerging issues, and effective algorithms for data mining, and network analysis. It includes
multiple degrees of difficulty that enhance knowledge and help in applying ideas, principles,
and techniques in distinct social media mining situations.
As per the "Global Digital Report," the total number of active users on social media platforms
worldwide in 2019 is 2.41 billion and increases up to 9 % year-on-year. With the universal use
of Social media platforms via the internet, a huge amount of data is accessible. Social media
platforms include many fields of study, such as sociology, business, psychology, entertainment,
politics, news, and other cultural aspects of societies. Applying data mining to social media can
provide exciting views on human behavior and human interaction. Data mining can be used in
combination with social media to understand users' opinions about a subject, to identify groups
of individuals among the masses of a population, to study how groups change over time,
to find influential people, or even to recommend a product or activity to an individual.
For example, the 2008 presidential election marked an unprecedented use of social
media platforms in the United States. Social media platforms, including Facebook and YouTube,
played a vital role in raising funds and getting candidates' messages to voters. Researchers
extracted blog data to demonstrate correlations between the amount of social media
used by candidates and the winner of the 2008 presidential campaign.
This effective example emphasizes the potential for data mining social media data to forecast
results at the national level. Data mining social media can also produce personal and corporate
benefits.
Social media mining is related to social computing. Social computing is defined as "any computing
application where software is used as an intermediary or a focus for a social relationship." Social
computing involves applications used for interpersonal communication as well as applications and
research activities related to computational social studies and social behavior.
Social media platform refers to various kinds of information services used collaboratively by
many people placed into the subcategories shown below.
(Table of social media categories with example services.)
With popular traditional media such as radio, newspaper, and television, communication is
entirely one-way that comes from the media source or advertiser to the mass of media
consumers. Web 2.0 technologies and modern social media platforms have changed the scene
moving from one-way media communication driven by media providers to where almost
anyone can publish written, audio, video, or image content to the masses.
This media environment is significantly changing the way businesses communicate with their
clients. It provides unprecedented opportunities for individuals to interact with a
huge number of people at a very low cost. The relationships formed online and expressed through
social media platforms are digitized records of social relationships on an enormous scale. The
resulting data offers rich opportunities for sociology, and insights into consumer behavior and
marketing, among a host of applications in related fields.
The growth and number of users on social media platforms are incredible. For example,
consider the most tempting social media networking site, Facebook. Facebook reached over
400 million active users during the first six years of operation, and it has been growing
exponentially. The given figure illustrates the exponential growth of Facebook over the first six
years. As per the report, Facebook is ranked 2nd in the world among websites based on daily
user traffic and engagement on the site.
The broad use of social media platforms is not limited to one geographical region of the world.
Orkut, a popular social networking platform operated by Google, had most of its users outside
the United States, and the use of social media among Internet users is now mainstream
in many parts of the globe, including countries in Asia, Africa, Europe, South America, and the
Middle East. Social media also drives significant change, and companies and businesses need to
decide on their policies to keep pace with this new media.
Data Mining techniques can assist effectively in dealing with the three primary challenges with
social media data. First, social media data sets are large. Consider the example of the most
popular social media platform Facebook with 2.41 billion active users. Without automated data
processing to analyze social media, social media data analytics becomes inaccessible in any
reasonable time frame.
Second, social media data sets can be noisy. For example, spam blogs are abundant in the
blogosphere, as are unimportant tweets on Twitter.
Third, data from online social media platforms are dynamic; frequent modifications and updates
over short periods are not only common but also a significant aspect to consider when dealing with
social media data.
Applying data mining methods to huge data sets can improve search results for everyday
search engines, realize specified target marketing for business, help psychologists study
behavior, personalize consumer web services, provide new insights into the social structure for
sociologists, and help to identify and prevent spam for all of us.
Moreover, open access to data offers an unprecedented amount of data for researchers to
improve efficiency and optimize data mining techniques. The progress of data mining is based
on huge data sets. Social media is an ideal data source at the frontier of data mining for
developing and testing new data mining techniques, both for academic and applied data mining
analysts.
Bayesian classification uses Bayes' theorem to predict the occurrence of an event. Bayesian
classifiers are statistical classifiers grounded in the Bayesian understanding of probability. The theorem
expresses how a degree of belief, expressed as a probability, should be updated to account for evidence.
Bayes' theorem is named after Thomas Bayes, who first used conditional probability
to provide an algorithm that uses evidence to calculate limits on an unknown parameter.
Bayes' theorem is expressed mathematically by the following equation:
P(X|Y) = P(Y|X) · P(X) / P(Y), where X and Y are events and P(Y) ≠ 0.
P(X|Y) is the conditional probability of event X occurring given that Y is true.
P(Y|X) is the conditional probability of event Y occurring given that X is true.
P(X) and P(Y) are the probabilities of observing X and Y independently of each other; these are
known as the marginal probabilities.
Bayesian interpretation:
In the Bayesian interpretation, the conditional probability can also be written as
P(X|Y) = P(X⋂Y) / P(Y), where P(X⋂Y) is the joint probability of both X and Y being true. A small
numerical sketch is given below.
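As a tiny numerical sketch of the theorem in Python; all the probability values are hypothetical and chosen only to illustrate the arithmetic.

def bayes(p_y_given_x, p_x, p_y):
    """Bayes' theorem: P(X|Y) = P(Y|X) * P(X) / P(Y)."""
    return p_y_given_x * p_x / p_y

# Hypothetical numbers: X = "event of interest", Y = "observed evidence".
p_x = 0.01                    # prior P(X)
p_y_given_x = 0.9             # likelihood P(Y|X)
p_y_given_not_x = 0.05        # P(Y | not X)
p_y = p_y_given_x * p_x + p_y_given_not_x * (1 - p_x)   # marginal P(Y)

print(round(bayes(p_y_given_x, p_x, p_y), 4))  # posterior P(X|Y) ≈ 0.1538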
Bayesian network:
A Bayesian network is a Probabilistic Graphical Model (PGM) used to compute uncertainties
by means of probability. Also known as belief networks, Bayesian networks represent
uncertainty using Directed Acyclic Graphs (DAGs).
A Directed Acyclic Graph is used to show a Bayesian Network, and like some other statistical
graph, a DAG consists of a set of nodes and links, where the links signify the connection
between the nodes.
The nodes here represent random variables, and the edges define the relationship between
these variables.
A DAG models the uncertainty of an event taking place based on the Conditional Probability
Distribution (CPD) of each random variable. A Conditional Probability Table (CPT) is used to
represent the CPD of each variable in the network. A tiny example of using a CPT in code is given below.
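A minimal sketch of a two-node Bayesian network in plain Python; the variables (Cloudy, Rain) and all probability values are hypothetical and only illustrate how CPTs combine via the chain rule.

# A tiny Bayesian network sketch with two nodes: Cloudy -> Rain.
p_cloudy = {True: 0.4, False: 0.6}                # P(Cloudy)
p_rain_given_cloudy = {True: 0.8, False: 0.1}     # CPT: P(Rain=True | Cloudy)

def joint(cloudy, rain):
    """P(Cloudy=cloudy, Rain=rain) = P(Cloudy) * P(Rain | Cloudy)."""
    p_rain_true = p_rain_given_cloudy[cloudy]
    p_rain = p_rain_true if rain else 1 - p_rain_true
    return p_cloudy[cloudy] * p_rain

# Marginal P(Rain=True), obtained by summing out Cloudy.
p_rain_marginal = joint(True, True) + joint(False, True)
print(p_rain_marginal)   # 0.4*0.8 + 0.6*0.1 = 0.38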
Data Mining- World Wide Web
Over the last few years, the World Wide Web has become a significant source of information
and, simultaneously, a popular platform for business. Web mining can be defined as the method of
applying data mining techniques and algorithms to extract useful information directly from the
web, such as web documents and services, hyperlinks, web content, and server logs. The World
Wide Web contains a large amount of data that provides a rich source for data mining. The
objective of web mining is to look for patterns in web data by collecting and examining data in
order to gain insights.
Web content mining can be used to extract useful data, information, knowledge from the web
page content. In web content mining, each web page is considered as an individual document.
The individual can take advantage of the semi-structured nature of web pages, as HTML
provides information that concerns not only the layout but also logical structure. The primary
task of content mining is data extraction, where structured data is extracted from unstructured
websites. The objective is to facilitate data aggregation over various web sites by using the
extracted structured data. Web content mining can be utilized to distinguish topics on the web.
For Example, if any user searches for a specific task on the search engine, then the user will get
a list of suggestions.
Web structure mining can be used to discover the link structure of hyperlinks. It is used to
identify how web pages are linked to each other and how they form a network of links. In web
structure mining, one considers the web as a directed graph, with the web pages being the
vertices connected by hyperlinks. The most important application in this regard is the Google search
engine, which estimates the ranking of its results primarily with the PageRank algorithm. It
characterizes a page as highly relevant when it is frequently linked to by other highly
relevant pages. Structure and content mining methodologies are usually combined. For example,
web structure mining can help organizations determine the network between two
commercial sites. A minimal sketch of PageRank is given below.
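A minimal power-iteration sketch of the PageRank idea in Python; the toy link graph, the damping factor of 0.85, and the iteration count are illustrative assumptions, not Google's actual implementation.

import numpy as np

# A toy web graph: page -> list of pages it links to (hypothetical).
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
pages = sorted(links)
n = len(pages)
idx = {p: i for i, p in enumerate(pages)}

# Column-stochastic transition matrix: M[j, i] = 1/outdegree(i) if i links to j.
M = np.zeros((n, n))
for src, targets in links.items():
    for dst in targets:
        M[idx[dst], idx[src]] = 1.0 / len(targets)

d = 0.85                       # damping factor
rank = np.full(n, 1.0 / n)     # start with a uniform rank vector
for _ in range(50):            # power iteration
    rank = (1 - d) / n + d * M.dot(rank)

print(dict(zip(pages, rank.round(3))))   # pages linked by relevant pages score higher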
Web usage mining is used to extract useful data, information, and knowledge from weblog
records, and it assists in recognizing user access patterns for web pages. In mining the usage
of web resources, one analyzes records of the requests made by visitors to a website,
which are often collected as web server logs. While the content and structure of the collection of
web pages follow the intentions of the authors of the pages, the individual requests
demonstrate how the consumers actually use these pages. Web usage mining may disclose relationships
that were not intended by the creator of the pages.
Some of the methods to identify and analyze the web usage patterns are given below:
The analysis of preprocessed data can be accomplished through session analysis, which incorporates
visitor records, days, times, sessions, etc. This data can be used to analyze visitors'
behavior.
A report is created after this analysis, which contains the details of repeatedly visited web
pages and common entry and exit points.
OLAP can be performed on various parts of the log-related data over a specific period. A small
sketch of this kind of log analysis is given below.
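A minimal sketch of preprocessing web-server log records and summarizing visits per page and per visitor; the simplified log format, IP addresses, timestamps, and page names are all made up for illustration.

from collections import Counter, defaultdict

# Hypothetical, simplified log records: (visitor_ip, timestamp, requested_page).
log = [
    ("10.0.0.1", "2019-05-01 10:00", "/index.html"),
    ("10.0.0.1", "2019-05-01 10:02", "/products.html"),
    ("10.0.0.2", "2019-05-01 10:05", "/index.html"),
    ("10.0.0.2", "2019-05-01 10:07", "/index.html"),
]

page_counts = Counter(page for _, _, page in log)    # most frequently visited pages
visits_by_user = defaultdict(list)                   # simple per-visitor access history
for ip, ts, page in log:
    visits_by_user[ip].append((ts, page))

print(page_counts.most_common(1))    # [('/index.html', 3)]
print(dict(visits_by_user))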
o Complexity of web pages:
Web pages do not have a unifying structure. They are extremely complicated compared to
traditional text documents. There are enormous amounts of documents in the digital library of
the web, and these libraries are not organized according to any particular order.
o Dynamic data:
The data on the internet is updated quickly. For example, news, climate, shopping, financial
news, sports, and so on.
o Relevancy of data:
It is considered that a specific person is generally concerned with only a small portion of the web,
while the rest of the web contains data that is not relevant to the user and may lead to
unwanted results.
o Size of the web:
The size of the web is tremendous and rapidly increasing. It appears that the web is too huge
for data warehousing and data mining.
Cluster analysis separates data into groups, usually known as clusters. If meaningful groups are
the objective, then the clusters capture the natural structure of the data. Sometimes cluster
analysis is only a useful starting point for other purposes, such as data summarization. Whether
for understanding or utility, cluster analysis has long played a significant role in a wide range of
areas such as biology, psychology, statistics, pattern recognition, machine learning, and data mining.
Figure 1 illustrates different ways of clustering the same set of points.
In various applications, the concept of a cluster is not precisely defined. To better understand the
challenge of deciding what constitutes a cluster, Figure 1 illustrates twenty points and three
different ways to separate them into clusters. The shape of the markers indicates cluster
membership. The figures divide the data into two and six clusters, respectively. The division of
each of the two larger clusters into three subclusters may simply be an artifact of the human
visual system, and it may not be reasonable to state that the points form four clusters. The figure
demonstrates that the definition of a cluster is imprecise, and the best definition depends on
the nature of the data and the desired outcomes.
Cluster analysis is related to other methods that are used to divide data objects into groups. For
example, clustering can be viewed as a form of classification in that it creates a labeling of objects
with cluster labels. However, it derives these labels only from the data, whereas classification
assigns new, unlabeled objects a class label using a model developed from objects with known
class labels. For this reason, cluster analysis is sometimes referred to as unsupervised
classification. When the term classification is used without any qualification within data
mining, it typically refers to supervised classification.
The terms segmentation and partitioning are sometimes used as synonyms for clustering, but
these terms are more commonly used for techniques outside the traditional bounds of cluster
analysis. For example, the term partitioning is often used in connection with techniques
that separate graphs into subgraphs and that are not strongly connected to
clustering. Segmentation often refers to the division of data into groups using simple
methods. For example, an image can be split into segments depending on pixel
intensity and color, or people can be divided into different groups based on their annual
income. However, some work in graph partitioning and market segmentation is related to cluster
analysis.
The most commonly discussed distinction among different types of clusterings is whether
the set of clusters is nested or unnested, or, in more traditional terminology, hierarchical or
partitional. A partitional clustering is simply a division of the set of data objects into non-
overlapping subsets (clusters) such that each data object is in exactly one subset.
If we allow clusters to have subclusters, then we obtain a hierarchical clustering, which is a set
of nested clusters organized as a tree. Each node (cluster) in the tree (except for the leaf
nodes) is the union of its subclusters, and the root of the tree is the cluster containing all the
objects. Usually, the leaves of the tree are singleton clusters of individual data objects. If we
allow clusters to be nested, then one interpretation of Figure 1(a) is that it has two
subclusters, as Figure 1(b) illustrates, each of which in turn has three subclusters, as shown in Figure 1(d).
The clusters shown in Figures 1(a-d), when taken in that order, also form a
hierarchical (nested) clustering with 1, 2, 4, and 6 clusters on each level. Finally, a hierarchical
clustering can be viewed as a sequence of partitional clusterings, and a partitional clustering
can be obtained by taking any member of that sequence, that is, by cutting the hierarchical
tree at a particular level. A sketch of both approaches is given below.
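A minimal sketch comparing a partitional clustering (k-means) with a hierarchical (agglomerative) clustering on the same points, assuming scikit-learn is installed; the 2-D data is synthetic and only for illustration.

import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

# Synthetic 2-D points forming two loose groups (illustrative only).
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(4, 0.5, (20, 2))])

# Partitional clustering: every point ends up in exactly one of k clusters.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Hierarchical (agglomerative) clustering: a tree of nested clusters,
# cut here at the level that yields two clusters.
hier_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

print(kmeans_labels)
print(hier_labels)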
The clusterings shown in the figure are all exclusive, as they assign each object to a single
cluster. There are many situations in which a point could reasonably be placed in more than one
cluster, and these situations are better addressed by non-exclusive clustering. In general terms,
an overlapping or non-exclusive clustering is used to reflect the fact that an object can
simultaneously belong to more than one group (class). For example, a person at a company can
be both a trainee student and an employee of the company. A non-exclusive clustering is also
often used when an object is "between" two or more clusters and could reasonably be assigned
to any of them. Rather than making an arbitrary assignment of such a point to a single cluster,
it is assigned to all of the "equally good" clusters.
In fuzzy clustering, every object belongs to every cluster with a membership weight that is
between 0 and 1. In other words, clusters are treated as fuzzy sets. Mathematically, a fuzzy
set is one in which an object belongs to the set with a weight that ranges
between 0 and 1. In fuzzy clustering, we usually impose the additional constraint that the sum of
the weights for each object must equal 1. Similarly, probabilistic clustering techniques compute
the probability with which each point belongs to each cluster, and these probabilities must also
sum to 1. Since the membership weights or probabilities for any object sum to 1, a fuzzy or
probabilistic clustering does not address truly multiclass situations.
o Well-separated cluster
A cluster is a set of objects in which each object is closer (or more similar) to every other object in
the cluster than to any object not in the cluster. Sometimes a threshold is used to specify that all
the objects in a cluster must be sufficiently close or similar to one another. This definition of a
cluster is satisfied only when the data contains natural clusters that are quite far from one
another. The figure illustrates an example of well-separated clusters consisting of two groups of
points in a two-dimensional space. Well-separated clusters do not need to be spherical and can
have any shape.
o Prototype-Based cluster
A cluster is a set of objects in which each object is closer (more similar) to the prototype that
characterizes the cluster than to the prototype of any other cluster. For data with continuous
attributes, the prototype of a cluster is usually a centroid, i.e., the average (mean) of
all the points in the cluster. When a centroid is not meaningful, for example when the data has
categorical attributes, the prototype is usually a medoid, i.e., the most representative point
of a cluster. For many types of data, the prototype can be regarded as the most central point, and in
such instances, we commonly refer to prototype-based clusters as center-based clusters. As
one might expect, such clusters tend to be spherical. The figure illustrates an example of
center-based clusters.
o Graph-Based cluster
If the data is represented as a graph, where the nodes are objects and the links represent
connections among objects, then a cluster can be defined as a connected component: a group of
objects that are connected to one another but have no connection to objects outside the group.
An important example of graph-based clusters is contiguity-based clusters, where two objects are
connected only if they are within a specified distance of each other. This implies that each object in
a contiguity-based cluster is closer to some other object in the cluster than to any point in a
different cluster. The figures demonstrate an example of such clusters for two-dimensional points.
This definition of a cluster is useful when clusters are irregular or intertwined, but it can run into
trouble when noise is present since, as shown by the two circular clusters in the figure, a small
bridge of points can merge two distinct clusters.
Other types of graph-based clusters are also possible. One such approach defines a cluster as
a clique, i.e., a set of nodes in a graph that are completely connected to each other.
Specifically, if we add connections between objects in order of their distance from one
another, a cluster is formed when a set of objects forms a clique. Like prototype-based
clusters, such clusters tend to be spherical.
o Density-Based Cluster
A cluster is a dense region of objects that is surrounded by a region of low density. The
two spherical clusters are not merged, as in the figure, because the bridge between them fades
into the noise. Similarly, the curve present in the figure fades into the noise and does not form
a cluster. A density-based definition of a cluster is often employed when the clusters
are irregular or intertwined, and when noise and outliers are present. By contrast, a contiguity-
based definition of a cluster would not work well for such data, since the noise
would tend to form bridges between clusters. A density-based sketch is given below.
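A minimal density-based clustering sketch using DBSCAN, assuming scikit-learn is installed; the two dense groups, the scattered noise points, and the eps/min_samples values are illustrative assumptions.

import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus a few scattered noise points (synthetic data).
rng = np.random.RandomState(0)
dense1 = rng.normal(0, 0.3, (30, 2))
dense2 = rng.normal(5, 0.3, (30, 2))
noise = rng.uniform(-2, 7, (5, 2))
X = np.vstack([dense1, dense2, noise])

# eps is the neighborhood radius, min_samples the density threshold.
labels = DBSCAN(eps=0.6, min_samples=5).fit_predict(X)
print(set(labels))   # cluster ids plus -1 for points treated as noise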
Bitcoin mining refers to the process of authenticating and adding transactional records to the
public ledger. The public ledger is known as the blockchain because it comprises a chain of
blocks.
Before we understand the Bitcoin mining concept, we should understand what Bitcoin
is. Bitcoin is virtual money having some value, and its value is not static; it varies over
time. There is no regulatory body that controls Bitcoin transactions.
Bitcoin was created under the pseudonym Satoshi Nakamoto, who announced the invention
and later released it as open-source code. A purely peer-to-peer version of electronic money
enables online payments to be sent directly from one person to another without the
interference of a financial institution. Bitcoin is a network protocol that empowers people to
transfer ownership rights of units of account called bitcoins, created in limited quantity. When
an individual sends some bitcoins to another individual, this data is broadcast to the
peer-to-peer Bitcoin network.
Bitcoins do not exist physically and are only an arrangement of virtual data. They can be exchanged
for conventional money and are broadly acceptable in many countries around the globe. There is no
central authority for bitcoins, such as a central bank (the RBI in India) that controls the monetary
policy. Instead, miners solve complex puzzles to validate Bitcoin transactions. This
process is called Bitcoin mining.
It is quite a complex process, but described briefly, it works as follows. You need a CPU (Central
Processing Unit) with excellent processing power and a fast internet connection. Next, numerous
online networks list the latest Bitcoin transactions taking place in real time. You then sign in with
a Bitcoin client and attempt to approve those transactions by assessing blocks of data and
computing hashes. The communication passes through several systems, called nodes, and since
the data is encoded, a miner is needed to check whether the answers are accurate.
Bitcoin mining requires a task that is exceptionally hard to perform but simple to verify. It uses
cryptography, with a hash function called double SHA-256 (a one-way function that converts a
text of any size into a string of 256 bits). A hash function accepts a portion of data as input and
reduces it down to a smaller hash value (256 bits). With a cryptographic hash, there is no
way to obtain a desired hash value without trying an enormous number of inputs, but once an
input that gives the desired value is found, it is a simple task for anybody to validate the hash. So,
cryptographic hashing becomes a good way to implement the Bitcoin
"proof-of-work" (data that is complex to produce but easy
for others to verify).
To mine a block, we first collect the new transactions into a block, and then we hash the block
to form a 256-bit block hash value. If the hash starts with a sufficient number of zeros, the block
has been successfully mined, is broadcast to the Bitcoin network, and the hash becomes the
identifier for the block. In most cases, the hash is not successful, so we need to alter the block
slightly (for example, by changing a nonce) and try again and again. A toy sketch of this loop is
given below.
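A toy proof-of-work sketch using double SHA-256 from Python's hashlib; the block contents and the difficulty of 4 leading zeros are made-up illustrative values, far easier than real Bitcoin mining.

import hashlib

def double_sha256(data: bytes) -> str:
    """Apply SHA-256 twice, as Bitcoin's proof-of-work does."""
    return hashlib.sha256(hashlib.sha256(data).digest()).hexdigest()

block_data = "previous-hash|transactions|timestamp"   # hypothetical block contents
difficulty = 4                                         # require 4 leading zeros (toy value)

nonce = 0
while True:
    candidate = double_sha256(f"{block_data}|{nonce}".encode())
    if candidate.startswith("0" * difficulty):
        break
    nonce += 1   # alter the block slightly and try again

print("nonce:", nonce, "hash:", candidate)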
Bitcoin Transaction
When we send bitcoin, an individual data structure, namely a Bitcoin transaction, is created by
our wallet client and then broadcast to the network, where nodes rebroadcast the transaction. If the
transaction is valid, nodes will incorporate it in the block they are mining, and within 10-20 minutes
the transaction will be included, along with other transactions, in a block in the blockchain.
Finally, the receiver can see the transaction amount in their wallet.
Bitcoin Wallets
Bitcoin wallets hold the private keys through which we access a bitcoin address and spend
our funds. They come in different forms, designed for specific types of devices. We can even
use hard copies to store the keys and avoid having them on a computer. It is important to secure
and back up our Bitcoin wallet. Bitcoins are a new technology of cash, and more merchants may
soon start accepting them as payment.
We know how a Bitcoin transaction works and how bitcoins are created, but how are they
stored? We store money in a physical wallet, and bitcoin works similarly, except it is entirely
digital. In brief, we do not store bitcoins anywhere; what we store are the secured digital keys
used to access our public bitcoin addresses and sign transactions.
There are mainly five types of wallets that are given below:
Desktop Wallets
First, we need to install the original bitcoin client (Bitcoin Core). If we have already installed it,
then we are running a wallet but may not know it. In addition to relaying transactions on the
network, this software also empowers us to create a bitcoin address for sending and receiving the
virtual currency. MultiBit is a Bitcoin wallet that runs on Mac OS X, Windows, and Linux. Hive is an OS X-
based wallet with some particular features, including an application store that connects directly
to bitcoin services.
Mobile Wallets
As an application on our cell phone, the wallet can store the private keys for our bitcoin
addresses and enable us to pay for things directly with our phone. In many cases, a
bitcoin wallet will even take advantage of a cell phone's near-field communication (NFC) feature,
enabling us to tap the phone against a reader and pay with bitcoins without entering any
data at all. A full bitcoin client has to download the whole bitcoin blockchain, which is always
growing and is multiple gigabytes in size; many mobile phones cannot hold
the blockchain in their memory. In such cases, they can use alternative options: mobile
wallets are frequently designed with simplified payment verification (SPV) in mind. They
download a limited subset of the blockchain and depend on other trusted nodes in the
bitcoin network to ensure that they have the correct data. Mycelium is an example of a mobile
wallet, an Android-based Bitcoin wallet.
Online Wallets
Online wallets store our private keys on the web, on a computer controlled by someone else
and connected to the Internet. Various online services are available, and some link to mobile
and desktop wallets, replicating our addresses across the devices we own. One significant
advantage of online wallets is that we can access them from anywhere, regardless of which device
we are using.
Hardware Wallets
Hardware wallets are currently limited in number. These are dedicated devices that can hold private keys
electronically and facilitate payments. The compact Ledger USB bitcoin wallet uses
smartcard security and is available at a reasonable cost.
Paper Wallets
The cheapest option for keeping our bitcoins safe and sound is a so-called paper
wallet. There are various sites offering paper bitcoin wallet services. They generate a bitcoin
address for us and produce an image containing two QR codes: the first one is the public
address that we can use to receive bitcoins, and the other is the private key that we use to spend
the bitcoins stored at that address. The primary advantage of a paper wallet is that the private
keys are not stored digitally anywhere, which protects our wallet from cyber attacks.
Orange Data Mining
Orange is built around a C++ core of objects and routines that incorporates a large
variety of standard and non-standard machine learning and data mining
algorithms. It is an open-source data visualization, data mining, and
machine learning tool. Orange is a scriptable environment for quick
prototyping of new algorithms and testing patterns. It is a group of
Python-based modules built around the core library; functionality for
which execution time is not essential is implemented in Python.
It incorporates a variety of tasks such as pretty-printing of decision trees, bagging and boosting,
attribute subset selection, and many more. Orange also provides a set of graphical widgets that use
methods from the core library and Orange modules and provide a convenient user interface. Widgets
communicate through signals and can be assembled into an application with a
visual programming tool called Orange Canvas.
All these together make Orange a distinctive component-based framework for data mining
and machine learning. Orange is intended both for experienced users and analysts in data
mining and machine learning who want to create and test their own algorithms while reusing as
much of the code as possible, and for those just entering the field who can write short
Python scripts for data analysis.
Orange supports a flexible environment for developers, analysts, and data mining
specialists. It builds on Python, a modern scripting language and programming
environment, in which our data mining scripts can be simple but powerful.
Orange employs a component-based approach for fast prototyping: we can
implement our analysis technique much like assembling LEGO bricks, or
even use an existing algorithm. Orange provides components for
scripting and Orange widgets for visual programming. Widgets use a specially
designed communication mechanism for passing objects such as classifiers,
regressors, attribute lists, and data sets, permitting the user to build rather
complex data mining schemes that use modern approaches and techniques.
Orange core objects and Python modules cover numerous data mining
tasks, ranging from data preprocessing to evaluation and modeling. The
operating principle of Orange is to cover the techniques and perspectives of data
mining and machine learning with components. For example, Orange's top-down induction of
decision trees is a technique built from numerous components, any of which
can be prototyped in Python and used in place of the original one. Orange
widgets are not simply graphical objects that provide a graphical interface for a
particular method in Orange; they also include an adaptable signaling
mechanism for the communication and exchange of objects such as data
sets, classification models, learners, and objects that store the results of
evaluation. All these ideas are significant and together distinguish Orange
from other data mining frameworks.
Orange Widgets
Orange Scripting
We can see how Python and Orange are used with an example. Consider a
simple script that reads a data set and prints the number of attributes used.
We will utilize a classification data set called "voting" from the UCI Machine
Learning Repository that records sixteen key votes of each MP (Member of
Parliament) of the Parliament of India and labels each MP with a party
membership:
import orange

# Load the tab-delimited data set and report its size.
data1 = orange.ExampleTable('voting.tab')
print('Instances:', len(data1))
print('Attributes:', len(data1.domain.attributes))
Here, we can see that the script first loads the Orange library, reads the data file, and prints
out what we were interested in. If we store this script in script.py and run it with the shell
command "python script.py" (ensuring that the data file is in the same directory), then we get:
Instances: 543
Attributes: 16
Let us extend our script so that it uses the same data to build a naïve Bayesian classifier and
print the classification of the first five instances:
model = orange.BayesLearner(data1)
for i in range(5):
    print(model(data1[i]))
It is easy to produce the classification model; we have called Orange's object (BayesLearner) and given it
the data set. It returned another object (a naïve Bayesian classifier) that, when given an instance, returns the
label of the most probable class. Here we can see the output of this part of the script:
inc
inc
inc
bjp
bjp
Here, we want to discover what the correct classifications were; we can print the original labels
of our five instances:
for i in range(5):
    print(data1[i].getclass())
What we discover is that the naïve Bayesian classifier has misclassified the third instance. We can
inspect the predicted class probabilities for it:
p = model(data1[2], orange.GetProbabilities)
print(p)
Here we should recognize that Python's indices start with 0, and that the classification model returns a
probability vector when the classifier is called with the argument orange.GetProbabilities. Our
model was estimating a very high probability for INC:
Inc : 0.878529638542
Data Mining Vs Big Data
Data mining uses tools such as statistical models, machine learning, and visualization to "mine"
(extract) useful data and patterns from big data, whereas big data refers to processing high-
volume and high-velocity data, which is challenging to do with older databases and analysis
programs.
Big Data
Big data refers to vast amounts of structured, semi-structured, and unstructured data,
ranging into terabytes. It is challenging to process such a huge amount of data on a single
system, because the RAM of a single computer has to hold the interim calculations during
processing and analysis. Processing such a huge amount of data on a single system takes a
great deal of time, and the system may not work correctly due to overload.
Here we will understand the concept (how much data is produced) with a
live example. We all know about Big Bazaar; as customers, we go to Big
Bazaar at least once a month. These stores monitor each product that a
customer purchases from them, and from which store location in the world.
They have a live information-feeding system that stores all the data in
huge central servers. The number of Big Bazaar stores in India alone
is around 250; monitoring every single item purchased by every customer,
along with the item description, makes the data grow to around 1 TB in a
month.
What does Big Bazaar do with that data?
We know some promotions run in Big Bazaar on some items. Do we genuinely believe
Big Bazaar would run those promotions without any evidence that they would increase
sales and generate a surplus? That is where big data analysis plays a vital
role. Using data analysis techniques, Big Bazaar targets new customers as well as existing
customers to purchase more from its stores.
Big data comprises the 5 Vs: Volume, Variety, Velocity, Veracity, and Value.
Volume: In big data, volume refers to the sheer amount of data, which is typically huge.
Variety: In big data, variety refers to the various types of data, such as web server logs, social media
data, and company data.
Velocity: In big data, velocity refers to how fast data is growing with respect to time. In general,
data is increasing exponentially at a very fast rate.
Veracity: In big data, veracity refers to the uncertainty and trustworthiness of the data being collected.
Value: In big data, value refers to whether the data we are storing and processing is valuable and
how we gain advantage from these huge data sets.
A very efficient framework, known as Hadoop, is primarily used for big data processing. It is
open-source software that works on a distributed parallel processing model. Its main modules are:
Hadoop Common: the common libraries and utilities used by the other Hadoop modules.
Hadoop Distributed File System (HDFS): a distributed file system that stores data on commodity
machines, providing very high aggregate bandwidth across the cluster.
Hadoop YARN: a resource-management platform responsible for managing computing resources
in the cluster and scheduling jobs.
Hadoop MapReduce: a programming model for large-scale parallel data processing. A small sketch
of the MapReduce idea is given below.
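A minimal sketch of the MapReduce idea expressed in plain Python rather than actual Hadoop code; the input records are made up, and on a real cluster the map, shuffle, and reduce phases would run in parallel across many machines.

from collections import defaultdict

# Illustrative input records; on Hadoop these would be lines of files stored in HDFS.
records = ["data mining finds patterns", "big data needs hadoop", "data data data"]

# Map phase: emit (word, 1) pairs.
mapped = [(word, 1) for line in records for word in line.split()]

# Shuffle phase: group values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: sum the counts for each word.
reduced = {word: sum(counts) for word, counts in grouped.items()}
print(reduced)   # {'data': 5, 'mining': 1, ...}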
Data Mining
As the name suggests, data mining refers to mining huge data sets to identify trends and patterns
and to extract useful information.
In data mining, we are looking for hidden information, but without any firm idea of exactly what
type of data we are looking for or what we plan to use it for once we find it. When we discover
interesting information, we start thinking about how to make use of it to boost the business.
We will understand the data mining concept with an example:
A data miner starts exploring the call records of a mobile network operator without any
specific target from his manager. The manager probably gives him a broad objective of
discovering at least a few new patterns in a month. As he begins extracting the data, he discovers
a pattern: there are more international calls on Fridays (for example) than on all other days.
He shares this finding with management, and they come up with a plan to reduce international
call rates on Fridays and start a campaign. Call duration goes up, customers are happy with
the low call rates, more customers join, and the organization makes more profit since the
utilization percentage has increased.
Data Integration
In the first step, data is integrated and collected from various sources.
Data Selection
We may not use all the data collected in the first step, so in this step we select only the data
that we think is useful for data mining.
Data Cleaning
In this step, the information we have collected is not clean; it may contain errors, noisy or
inconsistent data, or missing values, so we need to apply various strategies to get rid of such
problems.
Data Transformation
Even after cleaning, the data is not ready for mining, so we need to transform it into forms
suitable for mining. The methods used to achieve this include aggregation, normalization,
smoothing, etc., as in the sketch below.
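A small sketch of two common transformation steps, aggregation and min-max normalization, assuming pandas is installed; the column names ("store", "day", "sales") and the values are made up for illustration.

import pandas as pd

# Hypothetical cleaned data: daily sales per store.
df = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "day": ["Mon", "Tue", "Mon", "Tue"],
    "sales": [120.0, 80.0, 400.0, 360.0],
})

# Aggregation: total sales per store.
totals = df.groupby("store")["sales"].sum()

# Min-max normalization: rescale sales into the [0, 1] range.
df["sales_norm"] = (df["sales"] - df["sales"].min()) / (df["sales"].max() - df["sales"].min())

print(totals)
print(df)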
Data Mining
Once the data has been transformed, we are ready to apply data mining methods to the data to
extract useful information and patterns. Techniques such as clustering and association rule mining
are among the many techniques used for data mining.
Pattern Evaluation
Pattern evaluation includes visualizing the discovered patterns, removing random or uninteresting
patterns, and transforming the remaining patterns into a useful form.
Decision
It is the last step in data mining. It helps users make use of the acquired knowledge to make
better data-driven decisions.
Data Mining vs Big Data:
o Data mining primarily targets the analysis of data to extract useful information, whereas big data primarily targets the relationships within the data.
o Data mining can be used for large volumes as well as low volumes of data, whereas big data by definition involves huge volumes of data.
o Data mining is a method primarily used for data analysis, whereas big data is a whole concept rather than a precise term.
o Data mining is primarily based on statistical analysis, generally targeting prediction and finding business factors on a small scale, whereas big data is primarily based on data analysis, generally targeting prediction and finding business factors on a large scale.
o Data mining typically uses structured, relational, and dimensional data, whereas big data uses structured, semi-structured, and unstructured data.
o Data mining is primarily used for strategic decision-making purposes, whereas big data is primarily used for dashboards and predictive measures.