Data Mining Display
Mining.
and obtain powerful skill set needs to at least know the basics of
data mining
Through learning the techniques of data mining, one can use this
The process of mining data can be divided into three main parts:
gathering,
collecting,
cleaning
The components of a data mining system are a data source, a data mining
engine, a data warehouse server, a pattern evaluation module, a graphical
user interface, and a knowledge base.
There are many techniques that one can use to perform
data mining. I will focus on the top 5 data mining techniques used
MapReduce.
Clustering.
Link Analysis.
Recommendation Systems.
1. map step: Performs filtering and sorting. The results of this step
between the map and the reduce stages. Its only job is to sort the
(key, value) collection so that the reduce stage gets all identical
keys.
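The map, sort, and reduce stages described above can be sketched in plain Python. This is a minimal single-machine illustration, not a distributed implementation; the word-count task and the function names (map_step, shuffle_step, reduce_step, word_count) are assumptions chosen for the example.

```python
from collections import defaultdict


def map_step(document):
    """Map: emit a (word, 1) pair for every word in the document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_step(pairs):
    """Shuffle: sort the (key, value) pairs and group values by identical key."""
    groups = defaultdict(list)
    for key, value in sorted(pairs):
        groups[key].append(value)
    return groups


def reduce_step(key, values):
    """Reduce: combine all values that share one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Run the three stages in order: map, shuffle, reduce.
    pairs = [pair for doc in documents for pair in map_step(doc)]
    grouped = shuffle_step(pairs)
    return dict(reduce_step(key, values) for key, values in grouped.items())


counts = word_count(["data mining", "mining data streams"])
```

Because the shuffle stage groups identical keys, each call to reduce_step sees every value for its key at once.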
one group are connected to one another. Every group is then called a
cluster. Clustering is often used in data mining and data analysis. Can
cluster. Then the algorithm starts to join clusters that are close in
different clusters.
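The joining of nearby clusters mentioned above resembles bottom-up (agglomerative) clustering. The sketch below assumes that every point starts as its own cluster, that points are one-dimensional, and that the distance between two clusters is the distance between their closest members (single linkage). The function name agglomerative and the sample data are hypothetical.

```python
def agglomerative(points, num_clusters):
    """Single-linkage agglomerative clustering of 1-D points (illustrative)."""
    clusters = [[p] for p in points]  # every point starts as its own cluster

    def distance(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(abs(x - y) for x in a for y in b)

    while len(clusters) > num_clusters:
        # Find the indices of the two closest clusters.
        i, j = min(
            ((i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]),
        )
        # Join them; j > i, so deleting j leaves index i valid.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


result = agglomerative([1.0, 1.2, 5.0, 5.1, 9.0], num_clusters=3)
```

The loop stops once the requested number of clusters remains; other stopping rules, such as a distance threshold, are equally possible.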
analysis can be used for both directed and undirected data mining.
Link analysis is often performed in 4 steps:
and validation.
a visualization approach.
This data model is used to connect two kinds of data points, items,
words. We can ignore these words to see the most frequent words
in the documents.
2. Plagiarism: the items will be the documents and the baskets will
Introduction
Many data scientists get their data in raw formats from various
anything.
make decisions.
analyzing.
Need for Data Warehousing
Data Warehousing is an essential tool for business intelligence. It
data from the past and this input can be used for various
purposes.
data warehouses.
or annually, etc.
deleted or also when new data is inserted into it. In the data
There are only two types of data operations that can be done in
Data Loading
Data Access
a relational database.
It also has reliable naming conventions, formats, and codes.
of data
structure, etc
Basic Statistics Concepts for Data Science
1. Descriptive Statistics
It is used to describe the basic features of data that provide a summary of the given
data set which can either represent the entire population or a sample of the
population.
Mode: It refers to the value that appears most often in a data set.
Median: It is the middle value of the ordered set, dividing it exactly in half.
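As a small illustration of these summary measures, the sketch below uses Python's standard statistics module on a made-up sample. The data values and variable names are assumptions for the example only.

```python
import statistics

# Hypothetical sample; any list of numbers would do.
values = [2, 3, 3, 5, 7, 10]

mean_value = statistics.mean(values)      # arithmetic average of the values
median_value = statistics.median(values)  # middle of the ordered values
mode_value = statistics.mode(values)      # value that appears most often
```

With an even number of values, statistics.median averages the two middle values of the ordered set.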
2. Variability
compared.
data set. In general terms, it measures how far the values spread from the mean. A large variance
indicates that the numbers are far from the average value. A small variance indicates
that the numbers are close to the average value. Zero variance indicates that the
Range: This is defined as the difference between the largest and smallest value of
a dataset.
Percentile: It refers to the measure used in statistics that indicates the value
Quartile: It is defined as the value that divides the data points into quarters.
Interquartile Range: It measures the middle half of your data. In general terms, it
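The variability measures above can be computed with the standard statistics module, as in the following sketch. The sample data is made up, and the quartile values depend on the convention used: statistics.quantiles defaults to its "exclusive" method, and other tools may give slightly different cut points.

```python
import statistics

# Hypothetical sample of nine ordered values.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Population variance: average squared difference from the mean.
variance = statistics.pvariance(values)

# Range: difference between the largest and smallest value.
value_range = max(values) - min(values)

# Quartiles: three cut points that divide the data into quarters.
q1, q2, q3 = statistics.quantiles(values, n=4)

# Interquartile range: spread of the middle half of the data.
iqr = q3 - q1
```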
3. Correlation
It is one of the major statistical techniques that measure the relationship between two
variables. The correlation coefficient indicates the strength of the linear relationship
A correlation coefficient of zero indicates that there is no linear relationship between the two
variables.
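A minimal sketch of the Pearson correlation coefficient, written out by hand so each term is visible. The function name pearson and the sample data are assumptions for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the two means.
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Root of the summed squared deviations of each variable.
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_deviation / (spread_x * spread_y)


r_positive = pearson([1, 2, 3, 4], [2, 4, 6, 8])        # y rises with x
r_negative = pearson([1, 2, 3, 4], [8, 6, 4, 2])        # y falls as x rises
r_zero = pearson([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])    # y = x squared
```

The last case gives a coefficient of zero even though y depends on x, which shows that the coefficient captures only linear relationships.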
4. Probability Distribution
It specifies the likelihood of all possible events. In simple terms, an event refers to the result of an
Dependent event: The event is said to be dependent when the occurrence of the
probability.
5. Regression
Linear regression: It is used to fit the regression model that explains the
variables.
relationship between the binary response variable and one or more predictor
variables.
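For the linear case, a least-squares fit with a single predictor can be sketched as follows. This is an illustration under stated assumptions (one predictor, ordinary least squares); the function name fit_line and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-deviation of x and y divided by the squared deviation of x.
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    slope = numerator / denominator
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Points that lie exactly on y = 2x + 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```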
6. Normal Distribution
The normal distribution is used to define the probability density function for a continuous random
variable in a system. It has two parameters: mean
and standard deviation. The standard normal distribution is the special case with mean 0 and standard deviation 1. When the distribution of random variables is unknown, the
normal distribution is often assumed. The central limit theorem justifies why normal distribution
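To make the two parameters concrete, the sketch below builds a standard normal distribution with statistics.NormalDist from the Python standard library and evaluates its density and cumulative probability. The variable names are assumptions for illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard = NormalDist(mu=0.0, sigma=1.0)

# Probability density at the mean, the peak of the bell curve.
density_at_mean = standard.pdf(0.0)

# Probability of falling within one standard deviation of the mean.
within_one_sd = standard.cdf(1.0) - standard.cdf(-1.0)
```

Here within_one_sd comes out at roughly 0.68, the familiar share of values within one standard deviation of the mean.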
analysis, the selection in such a way that data is not randomized resulting in the
Confirmation bias: It occurs when the person performing the statistical analysis