
Data Mining

 The process of structuring, analyzing, and formulating massive amounts of raw data in order to find patterns and trends through mathematical and computational algorithms is called Data Mining.

 Every data scientist who wants to advance further in their career and obtain a powerful skill set needs to at least know the basics of data mining.

 Through learning the techniques of data mining, one can use this knowledge to generate new insights and find new trends.

 The process of mining data can be divided into three main parts: gathering and cleaning the data, applying a data mining technique on the data, and validating the results of the technique.


Data Mining Architecture

The main components of a data mining system are a data source, a data mining engine, a data warehouse server, a pattern evaluation module, a graphical user interface, and a knowledge base.

 There are many techniques out there that one can use to perform data mining. I will focus on the top 5 data mining techniques used right now by individuals and big companies.


 The techniques we will cover are:

MapReduce.

Clustering.

Link Analysis.

Recommendation Systems.

Frequent Itemset Analysis.

 MapReduce is a programming model and implementation for collecting and processing big data sets in parallel. MapReduce takes a chunk of data, divides it up to be processed on different hardware, and then gathers the results from all of that hardware.

A MapReduce program is composed of three steps:

1. map step: Performs filtering and sorting. The results of this step

are a collection of (key, value) pairs that represent the mapping of

the data we are attempting to mine.


2. shuffle step: The shuffle step acts as an intermediate stage between the map and the reduce steps. Its only job is to sort the (key, value) collection so that the reduce step receives all identical keys together.

3. reduce step: Performs a summary operation (such as counting

the different values for the same key).
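
To make the three steps concrete, here is a minimal single-machine sketch in Python (an illustration only, not the Hadoop implementation); the function names map_step, shuffle_step, and reduce_step are just labels chosen for this example. It counts word occurrences, a classic MapReduce exercise.

```python
from collections import defaultdict

def map_step(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_step(pairs):
    # Shuffle: group values so that each key reaches the reducer together.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_step(grouped):
    # Reduce: summarize (here, count the occurrences of each word).
    return {key: sum(values) for key, values in grouped.items()}

docs = ["data mining finds patterns", "data mining uses MapReduce"]
print(reduce_step(shuffle_step(map_step(docs))))
# {'data': 2, 'mining': 2, 'finds': 1, 'patterns': 1, 'uses': 1, 'mapreduce': 1}
```

In a real MapReduce deployment the map and reduce calls would run on many machines in parallel, with the framework handling the shuffle between them.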

 Clustering is the task of grouping a set of items so that the items in one group are connected to one another in some way. Each group is then called a cluster. Clustering is often used in data mining and data analysis. You can find clustering in many applications such as pattern recognition, computer vision, data compression, and bioinformatics.


Clustering can be done using one of two strategies:

1. Hierarchical Clustering: Here, each data point starts as its own cluster. Then the algorithm starts to join clusters that are close in distance to each other until it reaches a specified limit. This limit can either be a set number of clusters or a set of rules on the different clusters.

2. Point Assignment: Each data point is assigned to the pre-defined cluster that it fits best. Some variations of these algorithms allow for cluster-splitting or cluster-joining. There are some popular point assignment algorithms out there, such as k-means and BFR (a short k-means sketch follows this list).
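
As a small illustration of point assignment clustering, the sketch below runs k-means on a few 2-D points, assuming scikit-learn is available; the data points and parameters are made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D points forming two loose groups.
points = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
                   [5.0, 5.2], [5.1, 4.8], [4.9, 5.0]])

# Point assignment clustering with k = 2.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(kmeans.labels_)           # cluster index assigned to each point
print(kmeans.cluster_centers_)  # coordinates of the two cluster centers
```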

 Link analysis is a data mining technique based on a branch of mathematics called graph theory. Graph theory represents different objects (nodes) and the relationships between them (edges) as a graph. Link analysis can be performed on both directed and undirected graphs.
Link analysis is often performed in 4 steps (a small example follows the list):

1. Data Processing: Collecting and manipulating data using different algorithms, such as sorting, aggregation, classification, and validation.

2. Transforming: Converting data from one format or structure into another format or structure in order to ease the process of analyzing that data.

3. Analysis: Once the data has been transformed, different analysis

strategies can be used to extract useful, desirable information.

4. Visualization: The best way to communicate information is to use a visualization approach.
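
As one hedged example of the analysis step, the sketch below computes PageRank scores on a tiny directed graph, assuming the networkx library is available; the nodes and edges are invented for illustration.

```python
import networkx as nx

# A tiny directed graph: nodes are pages, edges are links between them.
G = nx.DiGraph()
G.add_edges_from([("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")])

# Analysis step: PageRank scores how "important" each node is
# based purely on the link structure of the graph.
scores = nx.pagerank(G, alpha=0.85)
print(scores)  # e.g. {'A': ..., 'B': ..., 'C': ...}
```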

 Recommendation Systems are a class of applications that involve using machine learning and mathematical models to predict the user’s responses to different sets of options.


There are different approaches to implementing a recommendation system; the 4 most used approaches are:

1. Collaborative systems: This approach combines information from different users and objects, and it is the main approach used by Amazon (a small sketch of the idea appears after this list).

2. Content-based systems: This approach focuses mainly on the content of the items from your previous experiences.

3. Risk-aware systems: This approach uses content and collaborative techniques but adds another layer on top. This new layer calculates the risk of recommending specific content based on the location or the age of the user.

4. Hybrid systems: Hybrid systems are those that make use of

different recommendation techniques to increase the accuracy of

their recommendation and ensure a higher user satisfaction rate.
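
As a rough sketch of the collaborative idea (not Amazon's actual system), the example below builds an item-item similarity matrix from a made-up ratings table with NumPy and scores the unrated items for one user.

```python
import numpy as np

# Rows are users, columns are items; entries are ratings (0 = not rated).
ratings = np.array([[5, 3, 0, 1],
                    [4, 0, 0, 1],
                    [1, 1, 0, 5],
                    [0, 1, 5, 4]], dtype=float)

def cosine_similarity(a, b):
    # Cosine similarity between two rating vectors (0 if either is all zeros).
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Item-item similarity: how alike two items are based on who rated them.
n_items = ratings.shape[1]
sim = np.array([[cosine_similarity(ratings[:, i], ratings[:, j])
                 for j in range(n_items)] for i in range(n_items)])

# Recommend for user 0: score items by similarity-weighted ratings.
user = ratings[0]
scores = sim @ user
scores[user > 0] = -np.inf        # do not re-recommend items already rated
print(int(np.argmax(scores)))     # index of the top recommended item
```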


 Frequent Itemset Analysis is the analysis approach used with market-basket model data. The market-basket is a data model that is used to describe a common form of many-to-many relationship. This data model is used to connect two kinds of data points: items and baskets. Each basket has a set of items.

Frequent itemset analysis can be used to categorize and analyze different kinds of applications, for example:

1. Related concepts: If we want to look for some words that appear

in many documents, the sets will be dominated by the most

common words in documents, such as stop words or connecting

words. We can ignore these words to see the most frequent words

in the documents.
2. Plagiarism: The items will be the documents and the baskets will be the sentences within the documents. An item is part of a basket if the sentence is in the document. If we want to detect plagiarism, then we look for pairs of items that appear together in several baskets. If we find such a pair, then we have two different documents that share several sentences in common, which suggests that plagiarism exists.
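
A minimal sketch of frequent itemset analysis on market-basket data, in Python: it counts how often pairs of items occur together and keeps the pairs that meet a support threshold (the baskets and the threshold are invented for the example).

```python
from collections import Counter
from itertools import combinations

# Each basket is a set of items (market-basket model).
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

support_threshold = 3  # a pair is "frequent" if it appears in >= 3 baskets

# Count how many baskets contain each pair of items.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

frequent_pairs = {pair: count for pair, count in pair_counts.items()
                  if count >= support_threshold}
print(frequent_pairs)
```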


Data Warehouse

 Introduction

 A Data Warehouse is built by combining data from multiple diverse sources; it supports analytical reporting, structured and unstructured queries, and decision making for the organization. Data Warehousing is a step-by-step approach for constructing and using a Data Warehouse.

 Many data scientists get their data in raw formats from various sources of data and information.

 For many data scientists, as well as business decision-makers, particularly in big enterprises, the main sources of data and information are corporate data warehouses.

 A data warehouse holds data from multiple sources, including internal databases and software platforms. After the data is loaded, it is often cleansed, transformed, and checked for quality.


 It is used for analytics reporting, data science, machine learning, and much more.

 What is Data Warehouse?

 A Data Warehouse is a collection of software tools that facilitates

analysis of a large set of business data used to help an organization

make decisions.

 A large amount of the data in data warehouses comes from numerous sources, such as internal applications like marketing, sales, and finance, and customer-facing apps.

 A data warehouse is mainly a data management system that’s designed to enable and support business intelligence (BI) activities, particularly analytics. Data warehouses are expected to support querying, cleaning, manipulating, transforming, and analyzing data.
 Need of Data Warehousing
 Data Warehousing is an essential tool for business intelligence. It allows organizations to make quality business decisions.
 The data warehouse benefits an organization by improving data analytics.

 Basic Data Warehouse Architecture


 Data warehouses help organizations discover more practical business strategies.

 Business User: Business users or customers need a data

warehouse to look at summarized data from the past.

 Maintains consistency: Data warehouses are programmed in such a way that a consistent format can be applied to all data collected from different sources.
 Standardizing the data reduces the risk of error in interpretation and improves overall accuracy.

 Store historical data: Data Warehouses are also used to store historical data, that is, time-variant data from the past, and this input can be used for various purposes.

 Make strategic decisions: Data warehouses contribute to making better strategic decisions. Some business strategies may depend upon the data stored within the data warehouses.

 High response time: A data warehouse has to be prepared for loads and types of queries that demand a major degree of flexibility and fast response times.

 Characteristics of Data warehouse:

 Subject Oriented: A data warehouse is subject-oriented because it delivers information organized around a particular theme, which means the data warehousing process is designed to handle a specific, well-defined theme. These themes are often sales, distribution, and so on.

 Time-Variant: The data is maintained over different intervals of time, such as weekly, monthly, or annual periods.

 Non-volatile: The data residing in the data warehouse is permanent, which means that the data in the data warehouse cannot be erased or deleted when new data is inserted into it. In the data warehouse, data is read-only and can only be refreshed at a particular interval of time. Operations such as delete, update, and insert that are done on data in an operational software application are not performed in the data warehouse environment.

 There are only two types of data operations that can be done in

the data warehouse

 Data Loading

 Data Access

 Integrated: A data warehouse is created by integrating data from different sources, such as mainframe computers and relational databases.
 It also has reliable naming conventions, formats, and codes. Integration of the data warehouse benefits the successful analysis of data.
 This includes dependability in naming conventions, column scaling, encoding structures, etc.
Basic Statistics Concepts for Data Science
1. Descriptive Statistics

It is used to describe the basic features of data that provide a summary of the given

data set which can either represent the entire population or a sample of the

population.

It is derived from calculations that include:

 Mean: It is the central value which is commonly known as arithmetic average.

 Mode: It refers to the value that appears most often in a data set.

 Median: It is the middle value of the ordered set that divides it in exactly half.
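
A quick illustration of these three measures using Python's standard statistics module (the data values are made up):

```python
import statistics

data = [2, 3, 3, 5, 7, 10]

print(statistics.mean(data))    # arithmetic average -> 5
print(statistics.median(data))  # middle value of the ordered set -> 4.0
print(statistics.mode(data))    # most frequent value -> 3
```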

2. Variability

Variability includes the following parameters:


 Standard Deviation: It is a statistic that calculates the dispersion of a data set as compared to its mean.

 Variance: It refers to a statistical measure of the spread between the numbers in a data set. In general terms, it measures how far the numbers are from the mean. A large variance indicates that the numbers are far from the average value, a small variance indicates that the numbers are close to the average value, and zero variance indicates that all the values in the set are identical.

 Range: This is defined as the difference between the largest and smallest value of

a dataset.
 Percentile: It refers to the measure used in statistics that indicates the value below which the given percentage of observations in the dataset falls.

 Quartile: It is defined as a value that divides the data points into quarters.

 Interquartile Range: It measures the middle half of your data; in general terms, it is the middle 50% of the dataset.
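
A short NumPy sketch of the same variability measures on a made-up data set:

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

print(np.std(data))                     # standard deviation -> 2.0
print(np.var(data))                     # variance -> 4.0
print(data.max() - data.min())          # range -> 7
print(np.percentile(data, 90))          # 90th percentile
q1, q3 = np.percentile(data, [25, 75])  # first and third quartiles
print(q3 - q1)                          # interquartile range -> 1.5
```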

3. Correlation

It is one of the major statistical techniques that measure the relationship between two

variables. The correlation coefficient indicates the strength of the linear relationship

between two variables.

 A correlation coefficient that is more than zero indicates a positive relationship.

 A correlation coefficient that is less than zero indicates a negative relationship.

 A correlation coefficient of zero indicates that there is no linear relationship between the two variables.
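
For example, the Pearson correlation coefficient can be computed with NumPy (toy data):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])   # perfectly linear in x

# Pearson correlation: +1 strong positive, -1 strong negative, 0 none.
r = np.corrcoef(x, y)[0, 1]
print(r)  # ~1.0
```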

4. Probability Distribution

It specifies the probabilities of all possible events. In simple terms, an event refers to the result of an experiment. Events are of two types: dependent and independent.

 Independent event: An event is said to be independent when it is not affected by earlier events.

 Dependent event: An event is said to be dependent when its occurrence depends on earlier events.


The probability of independent events occurring together is calculated by simply multiplying the probability of each event; for dependent events, it is calculated using conditional probability.
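
A tiny worked example of both cases in Python (coin flips for independence, card draws without replacement for dependence):

```python
# Independent events: two fair coin flips both landing heads.
p_heads = 0.5
p_two_heads = p_heads * p_heads          # P(A and B) = P(A) * P(B)
print(p_two_heads)                       # 0.25

# Dependent events: drawing two aces from a 52-card deck without replacement.
p_first_ace = 4 / 52
p_second_ace_given_first = 3 / 51        # conditional probability P(B | A)
p_two_aces = p_first_ace * p_second_ace_given_first
print(round(p_two_aces, 4))              # ~0.0045
```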

5. Regression

It is a method that is used to determine the relationship between one or more

independent variables and a dependent variable. Regression is mainly of two types:

 Linear regression: It is used to fit a regression model that explains the relationship between a numeric response variable and one or more predictor variables.

 Logistic regression: It is used to fit a regression model that explains the

relationship between the binary response variable and one or more predictor

variables.
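
A hedged sketch of both regression types, assuming scikit-learn is available (the data is invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Linear regression: numeric response modeled from a numeric predictor.
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])
lin = LinearRegression().fit(X, y)
print(lin.coef_, lin.intercept_)         # slope close to 2

# Logistic regression: binary response modeled from the same predictor.
labels = np.array([0, 0, 0, 1, 1])
log = LogisticRegression().fit(X, labels)
print(log.predict([[2.5], [4.5]]))       # predicted classes, e.g. [0 1]
```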

6. Normal Distribution

The normal distribution is used to define the probability density function for a continuous random variable in a system. The normal distribution has two parameters: the mean and the standard deviation. When the distribution of random variables is unknown, the normal distribution is used. The central limit theorem justifies why the normal distribution is used in such cases.
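
A small NumPy sketch of the central limit theorem at work: averages of samples drawn from a non-normal (uniform) distribution are themselves approximately normally distributed (the sample sizes are chosen arbitrarily for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw many samples from a uniform distribution, then average each sample:
# the sample means are approximately normal (central limit theorem).
sample_means = rng.uniform(0, 1, size=(10_000, 50)).mean(axis=1)

print(round(sample_means.mean(), 3))  # close to the uniform mean 0.5
print(round(sample_means.std(), 3))   # close to 1 / sqrt(12 * 50), about 0.041
```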


7. Bias

In statistical terms, bias means that a model or sample is not representative of the complete population. It needs to be minimized to get the desired outcome.

The three most common types of bias are:

 Selection bias: It is the phenomenon of selecting a group of data for statistical analysis in such a way that the data is not randomized, resulting in the data being unrepresentative of the whole population.

 Confirmation bias: It occurs when the person performing the statistical analysis

has some predefined assumption.

 Time interval bias: It is caused intentionally by specifying a certain time range to

favor a particular outcome.
