DWDM Unit3


UNIT III DATA MINING

What is Data Mining?

Data mining is the process of extracting knowledge or insights from large amounts of data using
various statistical and computational techniques. The data can be structured, semi-structured or
unstructured, and can be stored in various forms such as databases, data warehouses, and data
lakes.

The primary goal of data mining is to discover hidden patterns and relationships in the data that can
be used to make informed decisions or predictions. This involves exploring the data using various
techniques such as clustering, classification, regression analysis, association rule mining, and
anomaly detection.

Data mining has a wide range of applications across various industries, including marketing, finance,
healthcare, and telecommunications. For example, in marketing, data mining can be used to identify
customer segments and target marketing campaigns, while in healthcare, it can be used to identify
risk factors for diseases and develop personalized treatment plans.

However, data mining also raises ethical and privacy concerns, particularly when it involves personal
or sensitive data. It’s important to ensure that data mining is conducted ethically and with
appropriate safeguards in place to protect the privacy of individuals and prevent misuse of their data.

Introduction to Data Mining

Introduction to data

Data is a word we hear everywhere nowadays. In general, data is a collection of facts, information,
and statistics and this can be in various forms such as numbers, text, sound, images, or any other
format.

Data mining is the process of discovering patterns and relationships in large datasets using
techniques such as machine learning and statistical analysis. The goal of data mining is to extract
useful information from large datasets and use it to make predictions or inform decision-making.
Data mining is important because it allows organizations to uncover insights and trends in their data
that would be difficult or impossible to discover manually.

Types and Parts of Data Mining Architecture

Data mining refers to the detection and extraction of new patterns from already collected data.
It combines the fields of statistics and computer science to discover patterns in very large
datasets and then transform them into a comprehensible structure for later use.

The architecture of Data Mining:


Basic Working:

1. The process starts when the user submits a data mining request, which is sent to the data
mining engine for pattern evaluation.

2. The engine tries to answer the query using the data already present in the database.

3. The extracted metadata is then sent to the data mining engine for analysis; the engine
sometimes interacts with the pattern evaluation modules to determine the result.

4. The result is then sent to the front end and presented in an easily understandable manner
through a suitable interface.

A detailed description of parts of data mining architecture is shown:

1. Data Sources: Database, World Wide Web(WWW), and data warehouse are parts of data
sources. The data in these sources may be in the form of plain text, spreadsheets, or other
forms of media like photos or videos. WWW is one of the biggest sources of data.

2. Database Server: The database server contains the actual data ready to be processed. It
performs the task of handling data retrieval as per the request of the user.

3. Data Mining Engine: It is one of the core components of the data mining architecture that
performs all kinds of data mining techniques like association, classification, characterization,
clustering, prediction, etc.

4. Pattern Evaluation Modules: They are responsible for finding interesting patterns in the data,
and sometimes they also interact with the database servers to produce the result of the
user requests.

5. Graphical User Interface: Because the user cannot fully understand the complexity of the data
mining process, a graphical user interface helps the user communicate effectively with the
data mining system.

6. Knowledge Base: Knowledge Base is an important part of the data mining engine that is quite
beneficial in guiding the search for the result patterns. Data mining engines may also
sometimes get inputs from the knowledge base. This knowledge base may contain data from
user experiences. The objective of the knowledge base is to make the result more accurate
and reliable.

Types of Data Mining architecture:

1. No Coupling: The no coupling data mining architecture retrieves data from particular data
sources. It does not use the database for retrieving the data which is otherwise quite an
efficient and accurate way to do the same. The no coupling architecture for data mining is poor
and only used for performing very simple data mining processes.

2. Loose Coupling: In the loose coupling architecture, the data mining system retrieves data from
the database, mines it, and stores the results back in that system. This architecture is mainly
suited to memory-based data mining.

3. Semi-Tight Coupling: It uses several advantageous features of data warehouse systems, such as
sorting, indexing, and aggregation. In this architecture, an intermediate result can be stored
in the database for better performance.

4. Tight coupling: In this architecture, a data warehouse is considered one of its most important
components whose features are employed for performing data mining tasks. This architecture
provides scalability, performance, and integrated information.

Advantages of Data Mining:

• Assists in preventing future adversities by accurately predicting future trends.

• Contributes to the making of important decisions.

• Compresses data into valuable information.

• Provides new trends and unexpected patterns.

• Helps to analyze huge data sets.

• Aids companies to find, attract and retain customers.

• Helps the company to improve its relationship with the customers.

• Helps companies optimize their production according to the popularity of a product,
thus saving costs.

Disadvantages of Data Mining:

• Excessive work intensity requires high-performance teams and staff training.

• Large investments are required, because data collection can consume many resources and
therefore carries a high cost.

• Lack of security could also put the data at huge risk, as the data may contain private customer
details.

• Inaccurate data may lead to the wrong output.

• Huge databases are quite difficult to manage.

There are several different types of data mining, including:

1. Association Rule Learning: This type of data mining involves identifying patterns of
association between items in large datasets, such as market basket analysis, where the items
that are frequently bought together are identified.
Three types of association rules are:
I. Multilevel Association Rule
II. Quantitative Association Rule
III. Multidimensional Association Rule

2. Clustering: This type of data mining involves grouping similar data points together into
clusters based on certain characteristics or features. Clustering is used to identify patterns in
data and to discover hidden structures or groups in data.
Different types of clustering methods are:
I. Density-Based Methods
II. Model-Based Methods
III. Partitioning Methods
IV. Hierarchical Agglomerative methods
V. Grid-Based Methods

3. Classification: This type of data mining involves using a set of labeled data to train a model
that can then be used to classify new, unlabeled data into predefined categories or classes.

4. Anomaly detection: This type of data mining is used to identify data points that deviate
significantly from the norm, such as detecting fraud or identifying outliers in a dataset.

5. Regression: This type of data mining is used to model and predict numerical values, such as
stock prices or weather patterns.

6. Sequential pattern mining: This type of data mining is used to identify patterns in data that
occur in a specific order, such as identifying patterns in customer buying behavior.

7. Time series analysis: This type of data mining is used to analyze data that is collected over
time, such as stock prices or weather patterns, to identify trends or patterns that change
over time.

8. Text mining: This type of data mining is used to extract meaningful information from
unstructured text data, such as customer feedback or social media posts.

9. Graph mining: This type of data mining is used to extract insights from graph-structured
data, such as social networks or the internet.
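
The market basket idea in item 1 can be made concrete with a small sketch. It is an illustration
only, not a full association rule miner: the baskets are invented, and the helper names support
and confidence are assumptions introduced here. Support is the fraction of baskets that contain an
itemset; the confidence of a rule A -> B is support(A and B) divided by support(A).

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical baskets, invented for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(baskets, {"bread", "milk"}))       # 2 of the 4 baskets
print(confidence(baskets, {"bread"}, {"milk"}))  # 2 of the 3 baskets with bread
```

A rule is usually reported only when both values exceed thresholds chosen by the analyst.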

Data Preprocessing

Steps of Data Preprocessing

Data preprocessing is an important step in the data mining process that involves cleaning and
transforming raw data to make it suitable for analysis. Some common steps in data preprocessing
include:

1. Data Cleaning: This involves identifying and correcting errors or inconsistencies in the data,
such as missing values, outliers, and duplicates. Various techniques can be used for data
cleaning, such as imputation, removal, and transformation.

2. Data Integration: This involves combining data from multiple sources to create a unified
dataset. Data integration can be challenging as it requires handling data with different
formats, structures, and semantics. Techniques such as record linkage and data fusion can be
used for data integration.

3. Data Transformation: This involves converting the data into a suitable format for analysis.
Common techniques used in data transformation include normalization, standardization, and
discretization. Normalization is used to scale the data to a common range, while
standardization is used to transform the data to have zero mean and unit variance.
Discretization is used to convert continuous data into discrete categories.

4. Data Reduction: This involves reducing the size of the dataset while preserving the
important information. Data reduction can be achieved through techniques such as feature
selection and feature extraction. Feature selection involves selecting a subset of relevant
features from the dataset, while feature extraction involves transforming the data into a
lower-dimensional space while preserving the important information.

5. Data Discretization: This involves dividing continuous data into discrete categories or
intervals. Discretization is often used in data mining and machine learning algorithms that
require categorical data. Discretization can be achieved through techniques such as equal
width binning, equal frequency binning, and clustering.

6. Data Normalization: This involves scaling the data to a common range, such as between 0
and 1 or -1 and 1. Normalization is often used to handle data with different units and scales.
Common normalization techniques include min-max normalization, z-score normalization,
and decimal scaling.
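
The three normalization techniques named in item 6 can be sketched as follows. This is a minimal
illustration using only the Python standard library; the function names and the sample values are
assumptions made for this example, and the code assumes the values are not all identical.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: map [min, max] linearly onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]


def z_score(values):
    """Z-score normalization: subtract the mean, divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """Decimal scaling: divide by 10**j, with j the smallest integer giving max |v| < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


incomes = [200, 300, 400, 600, 1000]  # illustrative values
print(min_max(incomes))
print(z_score(incomes))
print(decimal_scaling(incomes))
```

Min-max keeps the shape of the distribution but is sensitive to extreme values, since the minimum
and maximum define the scale; z-score is often preferred when the range is not known in advance.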

Data preprocessing plays a crucial role in ensuring the quality of data and the accuracy of the analysis
results. The specific steps involved in data preprocessing may vary depending on the nature of the
data and the analysis goals.

By performing these steps, the data mining process becomes more efficient and the results become
more accurate.

Steps Involved in Data Preprocessing

1. Data Cleaning: The data can have many irrelevant and missing parts. Data cleaning deals with
these problems; it involves handling missing data, noisy data, etc.

• Missing Data: This situation arises when some values are missing from the dataset. It can be
handled in various ways.
Some of them are:

o Ignore the tuples: This approach is suitable only when the dataset we have is quite
large and multiple values are missing within a tuple.

o Fill the Missing values: There are various ways to do this task. You can choose to fill
the missing values manually, by attribute mean or the most probable value.

• Noisy Data: Noisy data is meaningless data that cannot be interpreted by machines. It can be
generated by faulty data collection, data entry errors, etc. It can be handled in the following
ways:

o Binning Method: This method works on sorted data in order to smooth it. The data is
divided into segments (bins) of equal size, and each segment is handled separately.
All values in a segment can be replaced by the segment mean, or each value can be
replaced by the nearest segment boundary.

o Regression: Here data is smoothed by fitting it to a regression function. The
regression used may be linear (having one independent variable) or multiple (having
multiple independent variables).

o Clustering: This approach groups similar data into clusters. Outliers can then be
detected as the values that fall outside the clusters.
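
The binning method described above can be sketched as follows. This is a minimal illustration,
assuming equal-size bins and illustrative price values; the function names are introduced here
and are not part of any library. When a value is equally close to both boundaries, this sketch
picks the lower one, which is an arbitrary choice.

```python
def equal_size_bins(values, size):
    """Sort the values and cut them into consecutive bins of `size` items."""
    ordered = sorted(values)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def smooth_by_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative values
bins = equal_size_bins(prices, 3)
print(bins)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```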

2. Data Transformation: This step transforms the data into forms suitable for the mining
process. It involves the following:

• Normalization: It is done in order to scale the data values into a specified range (-1.0 to 1.0
or 0.0 to 1.0).

• Attribute Selection (Attribute Construction): In this strategy, new attributes are constructed
from the given set of attributes to help the mining process.

• Discretization: This is done to replace the raw values of a numeric attribute by interval labels
or conceptual labels.

• Concept Hierarchy Generation: Here attributes are generalized from a lower level to a higher
level in the hierarchy. For example, the attribute “city” can be generalized to “country”.
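
Discretization and concept hierarchy generation can both be viewed as simple lookups from a
detailed value to a more general one. The sketch below assumes a hypothetical city-to-country
table and hypothetical age cut points; none of these names or values come from a real dataset.

```python
# Hypothetical concept hierarchy: city -> country (illustrative entries only).
CITY_TO_COUNTRY = {"Mumbai": "India", "Chennai": "India", "Lyon": "France"}


def generalize_city(city):
    """Climb one level of the hierarchy; unknown cities are flagged, not guessed."""
    return CITY_TO_COUNTRY.get(city, "unknown")


def age_to_label(age):
    """Discretization: replace a numeric age by a conceptual label.

    The cut points 20 and 60 are assumptions chosen for this example.
    """
    if age < 20:
        return "youth"
    if age < 60:
        return "adult"
    return "senior"


records = [{"city": "Mumbai", "age": 17}, {"city": "Lyon", "age": 64}]
generalized = [
    {"country": generalize_city(r["city"]), "age_group": age_to_label(r["age"])}
    for r in records
]
print(generalized)
```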

3. Data Reduction: Data reduction is a crucial step in the data mining process that involves reducing
the size of the dataset while preserving the important information. This is done to improve the
efficiency of data analysis and to avoid overfitting of the model. Some common steps involved in data
reduction are:

• Feature Selection: This involves selecting a subset of relevant features from the dataset.
Feature selection is often performed to remove irrelevant or redundant features from the
dataset. It can be done using techniques such as correlation analysis and mutual
information. Principal component analysis (PCA) is sometimes listed here as well, but it builds
new features rather than selecting existing ones, so it belongs under feature extraction.

• Feature Extraction: This involves transforming the data into a lower-dimensional space while
preserving the important information. Feature extraction is often used when the original
features are high-dimensional and complex. It can be done using techniques such as PCA,
linear discriminant analysis (LDA), and non-negative matrix factorization (NMF).

• Sampling: This involves selecting a subset of data points from the dataset. Sampling is often
used to reduce the size of the dataset while preserving the important information. It can be
done using techniques such as random sampling, stratified sampling, and systematic
sampling.

• Clustering: This involves grouping similar data points together into clusters. Clustering is
often used to reduce the size of the dataset by replacing similar data points with a
representative centroid. It can be done using techniques such as k-means, hierarchical
clustering, and density-based clustering.

• Compression: This involves compressing the dataset while preserving the important
information. Compression is often used to reduce the size of the dataset for storage and
transmission purposes. It can be done using techniques such as wavelet compression, JPEG
compression, and GIF compression.
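
The three sampling techniques named under Sampling above can be sketched with the standard
library. This is an illustration under assumptions: the function names, the fixed seeds, and the
rule of keeping at least one row per stratum are choices made for this example.

```python
import random
from collections import defaultdict


def simple_random_sample(rows, n, seed=0):
    """Pick n rows uniformly at random, without replacement."""
    return random.Random(seed).sample(rows, n)


def systematic_sample(rows, step, start=0):
    """Pick every step-th row, beginning at index start."""
    return rows[start::step]


def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction from every group (stratum) defined by key."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))  # keep at least one row per stratum
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical rows: 8 customers of segment "a" and 2 of segment "b".
rows = [("a", i) for i in range(8)] + [("b", i) for i in range(2)]
print(systematic_sample(rows, 3))
print(stratified_sample(rows, key=lambda r: r[0], fraction=0.5))
```

Stratified sampling keeps small groups represented, which simple random sampling does not
guarantee.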

Data Cleaning

What is Data Scrubbing?

Scrubbing is also known as data cleaning. The data cleaning process detects and removes errors and
anomalies and improves data quality. Data quality problems arise due to misspelling during data
entry, missing values, or any other invalid data.

In basic terms, data scrubbing is the process of ensuring that the collected information is
accurate and correct. This process is especially important for companies that rely on electronic
data in the operation of their business. During the process, several tools are used to check the
consistency and accuracy of documents.

Data cleansing software frees your system of unnecessary material that slows it down.

Reasons for ‘Dirty’ Data

• Dummy values

• Absence of data

• Multipurpose fields

• Cryptic data

• Contradicting data

• Inappropriate use of address lines

• Violation of business rules

• Reused primary keys

• Non-unique identifiers

• Data integration problems

Why is data cleaning or cleansing required?

• Source system data is not clean; it contains errors and inconsistencies.

• Specialized tools are available that can be used for cleaning the data.

• Some leading data cleansing vendors include Vality (Integrity), Harte-Hanks (Trillium), and
Firstlogic.

Data Scrubbing as a Process

1. The first step in data scrubbing as a process is discrepancy detection. Discrepancies can be
caused by a number of factors, including human errors in data entry, intentional errors, and data
decay (values that have become outdated). Discrepancies can also arise from inconsistent data
representations and inconsistent use of codes.

To detect discrepancies, we use the knowledge we already have about the properties of the data
to find the noise, outliers, and unusual values that need to be investigated.

The data should also be examined against unique rules, consecutive rules, and null rules.

• A unique rule states that each value of a given attribute must be different from all other
values for that attribute.

• A consecutive rule states that there can be no missing value between the lowest and highest
value for an attribute and all values must be unique.

• A null rule specifies the use of a blank, question mark, special character, or other string that
represents null conditions and how such values should be handled.

• The null rule should specify how to record the null condition.
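
The three rules above translate directly into small checks. The sketch below is illustrative: the
function names and the set of null markers are assumptions made for this example, and the
consecutive check assumes integer-valued attributes.

```python
# Assumed null rule: these markers all stand for "no value recorded".
NULL_MARKERS = {None, "", "?", "NA"}


def unique_violations(values):
    """Unique rule: return the values that occur more than once."""
    seen, repeated = set(), set()
    for v in values:
        if v in seen:
            repeated.add(v)
        seen.add(v)
    return repeated


def consecutive_gaps(values):
    """Consecutive rule: return the integers missing between min and max.

    Uniqueness, which the rule also requires, is checked by unique_violations.
    """
    present = set(values)
    return [v for v in range(min(present), max(present) + 1) if v not in present]


def null_positions(values):
    """Null rule: return the positions holding any of the agreed null markers."""
    return [i for i, v in enumerate(values) if v in NULL_MARKERS]


print(unique_violations([101, 102, 102, 103]))
print(consecutive_gaps([1, 2, 4, 7]))
print(null_positions(["a", "?", None, "NA", "b"]))
```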

2. Once we find discrepancies, we typically need to define and apply transformations to correct
them. This two-step process of discrepancy detection and data transformation is iterative,
because some transformations may introduce more discrepancies.

Newer approaches to data scrubbing emphasize increased interactivity. In such a tool, the user
specifies a change interactively, and the results are immediately shown on the records appearing
on the screen. The user can choose to undo the change, so that a change that introduces
additional errors can be erased.

Steps in Data Cleansing/Scrubbing

1. Parsing: Parsing is a process in which individual data elements are located and identified in source
systems and then these elements are separated into target files. For example, parsing of name into
the First name, Middle name, and Last name or parsing the address into a street name, city, state,
and country.

2. Correcting: This is the next step after parsing, in which individual data elements are fixed using
data algorithms and secondary data sources. For example, in the address attribute replacing a vanity
address and adding a zip code.

3. Standardizing: In standardization, conversion routines are used to transform the data into a
consistent format using both standard and custom business rules. For example, the addition of a
prename, replacing a nickname, and using a preferred name.

4. Matching: The matching process involves eliminating duplication by searching for records with
parsed, corrected, and standardized data using certain standard business rules. For example,
identifying similar names and addresses.

5. Consolidating: Consolidation involves merging the records into one representation by analyzing
and identifying the relationship between the recorded records.

6. Data scrubbing must deal with many types of possible errors:

• A single source may contain errors such as missing or incorrect data.

• When more than one source is involved, there is a possibility of inconsistent and conflicting
data. Data scrubbing must deal with all these types of errors.

7. Data Staging:

• Data staging is an interim step between data extraction and the remaining steps.

• Data is accumulated from asynchronous sources, using various processes such as native
interfaces, flat files, and FTP sessions.

• After a certain predefined interval, data is loaded into the warehouse after the
transformation process.

• No end-user access is available to the staging file.

• For data staging, the operational data store may be used.
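
The parsing, standardizing, matching, and consolidating steps listed above can be sketched on a
toy list of names. This is a simplified illustration, not a production cleansing routine: the
nickname table is hypothetical, names are assumed to have at least a first and a last part, and
the correcting step is omitted because it needs secondary data sources.

```python
# Hypothetical nickname table used by the standardizing step.
NICKNAMES = {"bob": "robert", "bill": "william"}


def parse_name(full_name):
    """Parsing: split a full name into first, middle, and last elements."""
    parts = full_name.split()
    return {"first": parts[0], "middle": " ".join(parts[1:-1]), "last": parts[-1]}


def standardize(record):
    """Standardizing: replace nicknames and use consistent capitalization."""
    first = record["first"].lower()
    first = NICKNAMES.get(first, first)
    return {**record, "first": first.title(), "last": record["last"].title()}


def match_key(record):
    """Matching: records with the same key are treated as the same person."""
    return (record["first"], record["last"])


def consolidate(records):
    """Consolidating: keep one representative record per match key."""
    merged = {}
    for r in records:
        merged.setdefault(match_key(r), r)  # the first record seen wins
    return list(merged.values())


raw = ["Bob A. Smith", "robert smith", "Alice Jones"]
cleaned = consolidate([standardize(parse_name(n)) for n in raw])
print(cleaned)
```

Real matching rules are usually fuzzier than an exact key comparison, for example tolerating
small spelling differences.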

What is a Missing Value?

Missing values are data points that are absent for a specific variable in a dataset. They can be
represented in various ways, such as blank cells, null values, or special symbols like “NA” or
“unknown.”

Missing values pose a significant challenge in data analysis, as they can:

• Reduce the sample size: This can decrease the accuracy and reliability of your analysis.

• Introduce bias: If the missing data is not handled properly, it can bias the results of your
analysis.

• Make it difficult to perform certain analyses: Some statistical techniques require complete
data for all variables, making them inapplicable when missing values are present.
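
The effect on sample size can be seen in a small sketch. The rows, the set of missing-value
markers, and the function names below are assumptions made for illustration; mean imputation is
shown only as one simple alternative to dropping rows.

```python
# Assumed markers that represent a missing value in this toy dataset.
MISSING = {None, "", "NA", "unknown"}


def is_missing(value):
    return value in MISSING


def complete_cases(rows):
    """Keep only the rows in which no field is missing (listwise deletion)."""
    return [r for r in rows if not any(is_missing(v) for v in r.values())]


def impute_mean(rows, column):
    """Fill missing entries of a numeric column with the mean of the observed ones."""
    observed = [r[column] for r in rows if not is_missing(r[column])]
    fill = sum(observed) / len(observed)
    return [{**r, column: fill if is_missing(r[column]) else r[column]} for r in rows]


# Hypothetical survey rows.
rows = [
    {"age": 25, "income": 50000},
    {"age": None, "income": 62000},
    {"age": 31, "income": "NA"},
    {"age": 40, "income": 58000},
]

print(len(rows), len(complete_cases(rows)))  # deletion halves the sample here
print(impute_mean(rows, "age"))
```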

Why Is Data Missing From the Dataset?

Data can be missing for many reasons like technical issues, human errors, privacy concerns, data
processing issues, or the nature of the variable itself. Understanding the cause of missing data helps
choose appropriate handling strategies and ensure the quality of your analysis.

Understanding the reasons behind missing data helps with:

• Identifying the type of missing data: Is it Missing Completely at Random (MCAR), Missing at
Random (MAR), or Missing Not at Random (MNAR)?

• Evaluating the impact of missing data: Is the missingness causing bias or affecting the
analysis?

• Choosing appropriate handling strategies: Different techniques are suitable for different
types of missing data.

Types of Missing Values

There are three main types of missing values:

1. Missing Completely at Random (MCAR): MCAR is a specific type of missing data in which the
probability of a data point being missing is entirely random and independent of any other
variable in the dataset. In simpler terms, whether a value is missing or not has nothing to do with
the values of other variables or the characteristics of the data point itself.

2. Missing at Random (MAR): MAR is a type of missing data where the probability of a data point
being missing depends on the values of other variables in the dataset, but not on the missing
variable itself. This means that the missingness mechanism is not entirely random, but it can be
predicted based on the available information.

3. Missing Not at Random (MNAR): MNAR is the most challenging type of missing data to deal with.
It occurs when the probability of a data point being missing is related to the missing value itself.
This means that the reason for the missing data is informative and directly associated with the
variable that is missing.
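
The difference between the three mechanisms lies in what the probability of missingness depends
on. The sketch below simulates each one on a hypothetical survey with an age and an income field;
the field names, the age limit of 60, and the income threshold are all assumptions chosen for
illustration.

```python
import random


def make_mcar(rows, p, rng):
    """MCAR: every income is dropped with the same probability p."""
    return [{**r, "income": None} if rng.random() < p else r for r in rows]


def make_mar(rows, p, rng):
    """MAR: income is dropped with probability p, but only when age > 60.

    Missingness depends on age (which is observed), not on income itself.
    """
    return [
        {**r, "income": None} if r["age"] > 60 and rng.random() < p else r
        for r in rows
    ]


def make_mnar(rows, threshold):
    """MNAR: income is dropped exactly when the income itself is high."""
    return [{**r, "income": None} if r["income"] > threshold else r for r in rows]


# Hypothetical survey rows.
survey = [
    {"age": 30, "income": 40000},
    {"age": 45, "income": 90000},
    {"age": 65, "income": 35000},
    {"age": 70, "income": 120000},
]

rng = random.Random(0)
print(make_mcar(survey, 0.5, rng))
print(make_mar(survey, 0.5, rng))
print(make_mnar(survey, 80000))
```

In the MNAR case the observed incomes are all at or below the threshold, so any average computed
from them is biased downward, and nothing in the observed data reveals this.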

Noisy data
Noisy data in data mining is data that contains extra, meaningless information that can make
it difficult to find patterns:

• Definition
Noisy data is a data set that contains extra, meaningless data, errors, outliers,
inconsistencies, or duplicates. It can also be referred to as corrupt data.

• Causes
Noisy data can be caused by a variety of factors, including measurement errors, human
errors, data entry errors, or data transmission errors.

• Effects
Noisy data can negatively impact the quality and accuracy of data mining results. It can also
make it harder to find patterns in the data and can lead to misleading conclusions.

• Handling
Data cleaning is a key step in data mining that involves identifying and correcting noisy
data. Some techniques for handling noisy data include:

• Outlier detection and smoothing: Can be used to identify and mitigate
irregularities

• Imputation: Can be used to fill in missing values with estimated ones

• Ensemble methods: Can be used to enhance the overall resilience of the
analysis to noisy data

• Fuzzy matching techniques: Can be used to identify and remove duplicates
that have slight variations or typos
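
Fuzzy matching for duplicates can be sketched with the standard library's difflib module. The
similarity threshold of 0.85 and the sample names are assumptions chosen for illustration; a
suitable threshold depends on the data.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio between two strings, ignoring case (1.0 means identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def near_duplicates(names, threshold=0.85):
    """Return every pair of names whose similarity reaches the threshold."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if similarity(names[i], names[j]) >= threshold:
                pairs.append((names[i], names[j]))
    return pairs


names = ["John Smith", "Jon Smith", "Maria Garcia"]  # hypothetical records
print(near_duplicates(names))
```

The pairwise comparison is quadratic in the number of records, so larger datasets usually group
records first and compare only within each group.
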
Binning
Binning, also known as discretization, involves grouping continuous numerical data into
discrete intervals or categories. It simplifies the data and reduces its complexity, making it
easier to analyze. Binning can be done using various techniques:
1. Equal Width Binning:
• Divides the data range into equal-width intervals.
• Suitable for data with a uniform distribution but may not capture variations
effectively.
2. Equal Frequency Binning:
• Divides the data into intervals with approximately equal numbers of
observations.
• Ensures each bin contains a similar number of data points but may result in
uneven bin widths.
3. Custom Binning:
• Divides the data based on domain knowledge or specific requirements.
• Allows for more flexibility in defining bin boundaries based on data
characteristics.
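
The contrast between equal width and equal frequency binning shows up clearly on skewed data.
The sketch below uses illustrative values and function names introduced for this example; in the
equal-width version the last bin is treated as closed on the right so that the maximum value is
included.

```python
def equal_width_edges(values, k):
    """Return the k + 1 edges that split [min, max] into k equally wide bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(k + 1)]


def bin_index(value, edges):
    """Index of the bin containing value; the maximum falls in the last bin."""
    k = len(edges) - 1
    for i in range(k):
        if value < edges[i + 1]:
            return i
    return k - 1


def equal_frequency_bins(values, k):
    """Split the sorted values into k bins holding (almost) equal counts."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]  # illustrative values
edges = equal_width_edges(data, 3)
counts = [0, 0, 0]
for v in data:
    counts[bin_index(v, edges)] += 1

print(edges, counts)                  # most values crowd into the first bin
print(equal_frequency_bins(data, 3))  # four values per bin, uneven widths
```
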
Clustering
Clustering is a data analysis technique that groups similar data points together based on
their characteristics or features. It helps identify patterns or natural groupings within the
data. Common clustering algorithms include:
1. K-means Clustering:
• Divides the data into k clusters by iteratively assigning data points to the
nearest cluster centroid and updating centroids based on the mean of data
points in each cluster.
2. Hierarchical Clustering:
• Builds a hierarchy of clusters by recursively merging or splitting clusters
based on proximity measures such as Euclidean distance or linkage criteria.

3. Density-based Clustering (e.g., DBSCAN):
• Identifies clusters based on regions of high density in the data space,
ignoring regions with low density.
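
The assign-and-update loop of k-means can be sketched in a few lines. To keep the example short
it works on one-dimensional data, and the initial centroids are passed in by the caller instead
of being chosen at random; both are simplifying assumptions, and the function name is introduced
here for illustration.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Cluster numbers around the given starting centroids.

    Repeats two steps until the centroids stop changing:
    assign each point to its nearest centroid, then move each
    centroid to the mean of the points assigned to it.
    """
    centroids = list(centroids)
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        updated = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters


points = [1, 2, 3, 10, 11, 12]  # illustrative values
centers, groups = kmeans_1d(points, centroids=[1, 10])
print(centers, groups)
```

The result depends on the starting centroids, which is why k-means is often run several times
with different initializations.
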
Regression
Regression analysis is a statistical technique used to model the relationship between one
or more independent variables (predictors) and a dependent variable (response). It aims
to predict the value of the dependent variable based on the values of the independent
variables. Types of regression include:
1. Linear Regression:
• Models the relationship between the dependent variable and one or more
independent variables using a linear equation.
• Suitable for predicting continuous numeric outcomes.
2. Logistic Regression:
• Models the relationship between a binary dependent variable and one or
more independent variables using a logistic function.
• Suitable for binary classification problems.
3. Polynomial Regression:
• Extends linear regression by fitting a polynomial equation to the data,
allowing for more flexible modeling of nonlinear relationships.
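
For a single independent variable, the linear equation can be fitted with the ordinary least
squares formulas: slope = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2) and
intercept = mean_y - slope * mean_x. The sketch below is a minimal illustration with made-up
points; the function names are assumptions introduced for this example.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    return slope * x + intercept


xs = [1, 2, 3, 4]  # illustrative points lying exactly on y = 2x + 1
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
print(predict(5, slope, intercept))
```

When regression is used for smoothing noisy data, each observed value is replaced by the value
the fitted line predicts for it.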

Computer and Human Inspection
Computer and human inspection involve reviewing and validating the data after
preprocessing to ensure its quality and reliability. This may include:
1. Automated Data Quality Checks:
• Using software tools or scripts to perform automated checks for errors,
inconsistencies, or anomalies in the data.

2. Visual Inspection:
• Visualizing the data using charts, graphs, or dashboards to identify patterns,
trends, or outliers that may require further investigation.
3. Manual Review:
• Reviewing the data manually to verify its accuracy, completeness, and
consistency, especially for critical or sensitive datasets.
4. Domain Expert Review:
• Involving domain experts or stakeholders to validate the data preprocessing
steps and ensure that the data is fit for the intended purpose.

Inconsistent Data, Data Integration and Transformation in Data Mining

Data mining is a powerful tool for uncovering valuable insights, but its
effectiveness relies heavily on data quality. Inconsistent data is a common
challenge in data mining. Inconsistent data refers to data that is inaccurate,
incomplete, or contradictory.

The Challenges of Inconsistent Data

Distorted Insights: Inconsistent data can lead to inaccurate and misleading
results, jeopardizing decision-making.

Model Bias: Biased models can be created when data is not representative
of the real world.

Reduced Efficiency: Inconsistent data can necessitate additional time and
effort to identify and resolve discrepancies.

Understanding the Sources of Data Inconsistencies

Human Errors: Misspellings, incorrect data entries, and flawed data
collection methods contribute to inconsistencies.

Data Integration: Merging data from various sources can lead to
inconsistencies due to differing formats and definitions.

System Limitations: Data storage systems and software can have limitations
that contribute to data inconsistencies.

Data Standardization and Normalization Techniques

Data Standardization: Transforming data to a common format, ensuring
uniformity across different datasets.

Data Normalization: Scaling data values to a specific range, often between 0 and 1,
so that attributes measured on different scales become comparable.

Data Cleaning: Removing or correcting inconsistent data points through techniques
like imputation and outlier detection.

Data Transformation Methodologies

Data Aggregation: Combining data into larger units, such as averaging or summing
values.

Data Discretization: Dividing continuous data into discrete intervals, simplifying
analysis and visualization.

Data Encoding: Converting categorical data into numerical values for machine
learning algorithms.
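
Aggregation and encoding can be sketched as follows. The rows, field names, and function names
are hypothetical, and one-hot encoding is shown as just one common way to turn categories into
numbers.

```python
from collections import defaultdict


def aggregate_sum(rows, key, value):
    """Data aggregation: total of `value` for each distinct `key`."""
    totals = defaultdict(float)
    for r in rows:
        totals[r[key]] += r[value]
    return dict(totals)


def one_hot(values):
    """Data encoding: one 0/1 column per category, in sorted category order."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


# Hypothetical sales rows.
sales = [
    {"region": "north", "amount": 10},
    {"region": "south", "amount": 5},
    {"region": "north", "amount": 7},
]

print(aggregate_sum(sales, "region", "amount"))
print(one_hot(["red", "green", "red"]))  # columns: green, red
```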

Handling Missing and Erroneous Data

Imputation: Replacing missing values with estimated values based on available
data.

Error Detection: Identifying and correcting erroneous data points using data
validation techniques.

Data Exclusion: Removing data points that are too inconsistent or unreliable from
the analysis.

Integrating Data from Multiple Sources

Data Matching: Identifying and linking corresponding records from different
sources based on common keys.

Data Reconciliation: Resolving discrepancies between data values from different
sources using rules or heuristics.

Data Transformation: Converting data into a consistent format that can be easily
integrated into the target system.
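
Matching and reconciliation can be sketched on two small record lists. This is an illustration
under assumptions: the source names, fields, and the rule of preferring the first source on a
conflict are all hypothetical, and records that appear only in the second source are ignored to
keep the example short.

```python
def integrate(source_a, source_b, key, prefer="a"):
    """Match records on `key` and reconcile conflicting field values.

    Returns the merged records and a list of the conflicts found, each as
    (key value, field, value in a, value in b).
    """
    by_key = {r[key]: r for r in source_b}
    merged, conflicts = [], []
    for a in source_a:
        b = by_key.get(a[key])
        if b is None:  # no matching record in the second source
            merged.append(dict(a))
            continue
        record = {}
        for field in sorted(set(a) | set(b)):
            va, vb = a.get(field), b.get(field)
            if va is not None and vb is not None and va != vb:
                conflicts.append((a[key], field, va, vb))
                record[field] = va if prefer == "a" else vb
            else:
                record[field] = va if va is not None else vb
        merged.append(record)
    return merged, conflicts


# Hypothetical customer records from two systems.
crm = [
    {"id": 1, "email": "a@example.com", "city": "Pune"},
    {"id": 2, "email": "b@example.com", "city": None},
]
billing = [
    {"id": 1, "email": "a@example.com", "city": "Mumbai"},
    {"id": 2, "email": "b@example.com", "city": "Delhi"},
]

merged, conflicts = integrate(crm, billing, key="id")
print(merged)
print(conflicts)
```

Logging the conflicts instead of silently overwriting values makes the reconciliation rule easy
to review later.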

Ensuring Data Quality and Integrity

Data Validation: Verifying the accuracy and consistency of data through predefined
rules.

Data Governance: Establishing policies and procedures for managing and
controlling data quality.

Data Monitoring: Continuously tracking data quality metrics and identifying
potential issues.
