Data Mining Cheat Sheet
The KDD Process
1. Data Cleaning: removal of noise and inconsistent records
2. Data Integration: combining multiple sources
3. Data Selection: only data relevant for the task are retrieved from the database
4. Data Transformation: converting data into a form more appropriate for mining
5. Data Mining: application of intelligent methods to extract data patterns

Attribute Types
Nominal: e.g., ID numbers, eye color, zip codes
Ordinal: e.g., rankings, grades, height
Interval: e.g., calendar dates, temperatures
Ratio: e.g., length, time, counts

Distance Measures
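Distance measures quantify how dissimilar two records are. As a minimal sketch, the Minkowski family covers the common cases: r=1 is Manhattan (city block) distance and r=2 is Euclidean distance.

```python
def minkowski(p, q, r=2):
    """Minkowski distance between two equal-length numeric vectors.
    r=1 gives Manhattan (city block) distance, r=2 gives Euclidean."""
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

p, q = (0, 0), (3, 4)
minkowski(p, q, r=1)  # 7.0 (Manhattan)
minkowski(p, q, r=2)  # 5.0 (Euclidean)
```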
K-Nearest Neighbor
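A minimal sketch of a K-Nearest Neighbor classifier, assuming Euclidean distance and unweighted majority voting (distance-weighted voting is a common variant):

```python
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points
    (squared Euclidean distance; ranking is unaffected by the square root)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(p, x)), label)
        for p, label in zip(X_train, y_train)
    )
    top_k = [label for _, label in dists[:k]]
    return Counter(top_k).most_common(1)[0][0]

X = [(0, 0), (0, 1), (5, 5), (6, 5)]
y = ["a", "a", "b", "b"]
knn_predict(X, y, (1, 0), k=3)  # "a"
```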
Rule-based Classification
Rule Evaluation
Terms
Apriori Algorithm (frequent itemset generation)
Let k = 1
Generate frequent itemsets of length 1
Repeat until no new frequent itemsets are identified
    Generate length (k+1) candidate itemsets from length-k frequent itemsets
    Prune candidate itemsets containing subsets of length k that are infrequent
    Count the support of each candidate by scanning the DB
    Eliminate candidates that are infrequent, leaving only those that are frequent

Ensemble Methods
Manipulate the training data: bagging and boosting build an ensemble of "experts", each specializing on different portions of the instance space
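The generate-prune-count loop above can be sketched in Python. This is a deliberately simplified version: production implementations use an F(k-1) x F(k-1) candidate join and hash trees for support counting rather than brute-force scans.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return every itemset (as a frozenset) with support count >= minsup."""
    items = {i for t in transactions for i in t}
    # Generate frequent itemsets of length 1
    freq = {frozenset([i]) for i in items
            if sum(i in t for t in transactions) >= minsup}
    result, k = set(freq), 1
    # Repeat until no new frequent itemsets are identified
    while freq:
        # Generate length (k+1) candidates from length-k frequent itemsets
        candidates = {a | b for a in freq for b in freq if len(a | b) == k + 1}
        # Prune candidates containing an infrequent length-k subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq for s in combinations(c, k))}
        # Count support by scanning the DB; keep only the frequent candidates
        freq = {c for c in candidates
                if sum(c <= t for t in transactions) >= minsup}
        result |= freq
        k += 1
    return result

txns = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
apriori(txns, minsup=2)  # 6 itemsets: {a}, {b}, {c}, {a,b}, {a,c}, {b,c}
```

{a,b,c} is generated as a candidate but eliminated, since its support count is only 1.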
Single-Link or MIN
Similarity of two clusters is based on the two most similar (closest / minimum) points in the different clusters. Determined by one pair of points, i.e., by one link in the proximity graph.

Complete-Link or MAX
Similarity of two clusters is based on the two least similar (most distant / maximum) points in the different clusters. Determined by all pairs of points in the two clusters.

Group Average
Similarity of two clusters is the average of the pairwise similarities between all points in the two clusters.

Agglomerative clustering starts with points as individual clusters and merges the closest clusters until only one cluster is left.
Divisive clustering starts with one, all-inclusive cluster and splits a cluster until each cluster has only one point.

Density-Based Clustering (DBSCAN cluster labeling)
current_cluster_label <-- 1
for all core points do
    if the core point has no cluster label then
        current_cluster_label <-- current_cluster_label + 1
        Label the current core point with cluster label current_cluster_label
    end if
    for all points in the Eps-neighborhood, except the point itself do
        if the point does not have a cluster label then
            Label the point with cluster label current_cluster_label
        end if
    end for
end for
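The three linkage criteria differ only in which pairwise distance they report between two clusters; a minimal sketch, assuming Euclidean distance and clusters given as lists of points:

```python
def cluster_distance(A, B, linkage="single"):
    """Inter-cluster distance between point lists A and B under
    single (MIN), complete (MAX), or group-average linkage."""
    def d(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    pairs = [d(p, q) for p in A for q in B]
    if linkage == "single":    # MIN: the closest pair of points
        return min(pairs)
    if linkage == "complete":  # MAX: the most distant pair of points
        return max(pairs)
    return sum(pairs) / len(pairs)  # group average over all pairs

A, B = [(0, 0), (0, 1)], [(3, 0), (4, 0)]
cluster_distance(A, B, "single")    # 3.0
cluster_distance(A, B, "complete")  # sqrt(17), about 4.123
```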
DBSCAN Point Types
Density = number of points within a specified radius (Eps)
A point is a core point if it has more than a specified number of points (MinPts) within Eps. These are points that are at the interior of a cluster.
A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.
A noise point is any point that is not a core point or a border point.

Anomaly Detection
An anomaly is a pattern in the data that does not conform to the expected behavior (e.g., outliers, exceptions, peculiarities, surprises).
Types of Anomaly
Point: An individual data instance is anomalous w.r.t. the data
Contextual: An individual data instance is anomalous within a context
Collective: A collection of related data instances is anomalous
Approaches
* Graphical (e.g., boxplots, scatter plots)
* Statistical (e.g., normal distribution, likelihood)
  | Parametric Techniques
  | Non-parametric Techniques
* Distance (e.g., nearest-neighbor, density, clustering)
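The DBSCAN core / border / noise definitions can be sketched in Python. Two assumptions to note: the Eps-neighborhood count here includes the point itself, and the core test uses >= MinPts (some statements use a strict inequality).

```python
def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' per the DBSCAN
    definitions; points are tuples, Euclidean distance is assumed."""
    def d(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    # Eps-neighborhood of each point (including the point itself)
    neighbors = {p: [q for q in points if d(p, q) <= eps] for p in points}
    core = {p for p in points if len(neighbors[p]) >= min_pts}
    labels = {}
    for p in points:
        if p in core:
            labels[p] = "core"
        elif any(q in core for q in neighbors[p]):
            labels[p] = "border"   # not dense, but near a core point
        else:
            labels[p] = "noise"    # neither core nor border
    return labels

classify_points([(0, 0), (0, 1), (1, 0), (2, 0), (5, 5)], eps=1.5, min_pts=3)
# (0,0), (0,1), (1,0) are core; (2,0) is border; (5,5) is noise
```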
Local outlier factor (LOF) is a density-based distance approach
Mahalanobis Distance is a clustering-based distance approach

Other Clustering Methods
Fuzzy clustering (also referred to as soft clustering) is a partitional clustering method in which each data point can belong to more than one cluster.
Graph-based methods: Jarvis-Patrick, Shared-Near Neighbor (SNN, Density), Chameleon
Model-based methods: Expectation-Maximization
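For the fuzzy clustering entry, the standard fuzzy c-means membership update can serve as a concrete sketch: the weight of point x in cluster i is 1 / sum_k (d_i / d_k)^(2/(m-1)), where d_i is the distance to centroid i and m > 1 is the fuzzifier (the centroids and m here are illustrative inputs, not part of the sheet).

```python
def fuzzy_memberships(x, centroids, m=2.0):
    """Fuzzy c-means membership of point x in each cluster:
    u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1)).
    Assumes x does not coincide with any centroid (all d_i > 0)."""
    d = [sum((a - b) ** 2 for a, b in zip(x, c)) ** 0.5 for c in centroids]
    return [1.0 / sum((d[i] / d[k]) ** (2 / (m - 1)) for k in range(len(d)))
            for i in range(len(d))]

u = fuzzy_memberships((1.0, 0.0), [(0.0, 0.0), (3.0, 0.0)], m=2.0)
# u == [0.8, 0.2]: memberships sum to 1, the closer centroid gets more weight
```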
Regression Analysis
* Linear Regression
| Least squares
* Subset selection
* Stepwise selection
* Regularized regression
| Ridge
| Lasso
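A minimal sketch of least squares and ridge shrinkage, reduced to a single feature with no intercept so the closed form fits in one line (real regularized regression solves the same problem over the full design matrix):

```python
def ridge_slope(xs, ys, lam=0.0):
    """Slope w of y ~ w*x minimizing sum (y - w*x)^2 + lam * w^2.
    lam=0 gives ordinary least squares; larger lam shrinks w toward 0."""
    sxx = sum(x * x for x in xs)            # sum of x^2
    sxy = sum(x * y for x, y in zip(xs, ys))  # sum of x*y
    return sxy / (sxx + lam)

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
ridge_slope(xs, ys)           # 2.0 (exact fit)
ridge_slope(xs, ys, lam=14)   # 1.0 (shrunk toward zero by the penalty)
```

Lasso replaces the squared penalty lam * w^2 with lam * |w|, which can drive coefficients exactly to zero and therefore performs subset selection implicitly.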