Data Mining Cheat Sheet
The KDD Process
1. Data Cleaning: removal of noise and inconsistent records
2. Data Integration: combining multiple sources
3. Data Selection: only data relevant for the task are retrieved from the database
4. Data Transformation: converting data into a form more appropriate for mining
5. Data Mining: application of intelligent methods to extract data patterns

Attribute Types
Nominal: e.g., ID numbers, eye color, zip codes
Ordinal: e.g., rankings, grades, height
Interval: e.g., calendar dates, temperatures
Ratio: e.g., length, time, counts

Distance Measures
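Distance measures quantify how dissimilar two records are. As a minimal sketch, the Minkowski family covers the common cases: r=1 is Manhattan (city block) distance and r=2 is Euclidean distance.

```python
def minkowski(p, q, r=2):
    """Minkowski distance between two equal-length numeric vectors.
    r=1 gives Manhattan (city block) distance, r=2 gives Euclidean."""
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

p, q = (0, 0), (3, 4)
minkowski(p, q, r=1)  # 7.0 (Manhattan)
minkowski(p, q, r=2)  # 5.0 (Euclidean)
```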
K-Nearest Neighbor
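A minimal sketch of a K-Nearest Neighbor classifier, assuming Euclidean distance and unweighted majority voting (distance-weighted voting is a common variant):

```python
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points
    (squared Euclidean distance; ranking is unaffected by the square root)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(p, x)), label)
        for p, label in zip(X_train, y_train)
    )
    top_k = [label for _, label in dists[:k]]
    return Counter(top_k).most_common(1)[0][0]

X = [(0, 0), (0, 1), (5, 5), (6, 5)]
y = ["a", "a", "b", "b"]
knn_predict(X, y, (1, 0), k=3)  # "a"
```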
Rule-based Classification
Rule Evaluation
Terms
Apriori Algorithm (frequent itemset generation)
Let k = 1
Generate frequent itemsets of length 1
Repeat until no new frequent itemsets are identified
    Generate length (k+1) candidate itemsets from length-k frequent itemsets
    Prune candidate itemsets containing subsets of length k that are infrequent
    Count the support of each candidate by scanning the DB
    Eliminate candidates that are infrequent, leaving only those that are frequent

Ensemble Methods
Manipulate the training data: bagging and boosting build an ensemble of "experts", each specializing on different portions of the instance space
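The generate-prune-count loop above can be sketched in Python. This is a deliberately simplified version: production implementations use an F(k-1) x F(k-1) candidate join and hash trees for support counting rather than brute-force scans.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return every itemset (as a frozenset) with support count >= minsup."""
    items = {i for t in transactions for i in t}
    # Generate frequent itemsets of length 1
    freq = {frozenset([i]) for i in items
            if sum(i in t for t in transactions) >= minsup}
    result, k = set(freq), 1
    # Repeat until no new frequent itemsets are identified
    while freq:
        # Generate length (k+1) candidates from length-k frequent itemsets
        candidates = {a | b for a in freq for b in freq if len(a | b) == k + 1}
        # Prune candidates containing an infrequent length-k subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq for s in combinations(c, k))}
        # Count support by scanning the DB; keep only the frequent candidates
        freq = {c for c in candidates
                if sum(c <= t for t in transactions) >= minsup}
        result |= freq
        k += 1
    return result

txns = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
apriori(txns, minsup=2)  # 6 itemsets: {a}, {b}, {c}, {a,b}, {a,c}, {b,c}
```

{a,b,c} is generated as a candidate but eliminated, since its support count is only 1.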
Single-Link or MIN
Similarity of two clusters is based on the two most similar (closest / minimum) points in the different clusters. Determined by one pair of points, i.e., by one link in the proximity graph.

Complete-Link or MAX
Similarity of two clusters is based on the two least similar (most distant / maximum) points in the different clusters. Determined by all pairs of points in the two clusters.

Group Average
Similarity of two clusters is the average of the pairwise similarities between all points in the two clusters.

Agglomerative clustering starts with points as individual clusters and merges the closest clusters until only one cluster is left.
Divisive clustering starts with one, all-inclusive cluster and splits a cluster until each cluster has only one point.

Density-Based Clustering (DBSCAN cluster labeling)
current_cluster_label <-- 1
for all core points do
    if the core point has no cluster label then
        current_cluster_label <-- current_cluster_label + 1
        Label the current core point with cluster label current_cluster_label
    end if
    for all points in the Eps-neighborhood, except the point itself do
        if the point does not have a cluster label then
            Label the point with cluster label current_cluster_label
        end if
    end for
end for
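The three linkage criteria differ only in which pairwise distance they report between two clusters; a minimal sketch, assuming Euclidean distance and clusters given as lists of points:

```python
def cluster_distance(A, B, linkage="single"):
    """Inter-cluster distance between point lists A and B under
    single (MIN), complete (MAX), or group-average linkage."""
    def d(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    pairs = [d(p, q) for p in A for q in B]
    if linkage == "single":    # MIN: the closest pair of points
        return min(pairs)
    if linkage == "complete":  # MAX: the most distant pair of points
        return max(pairs)
    return sum(pairs) / len(pairs)  # group average over all pairs

A, B = [(0, 0), (0, 1)], [(3, 0), (4, 0)]
cluster_distance(A, B, "single")    # 3.0
cluster_distance(A, B, "complete")  # sqrt(17), about 4.123
```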
DBSCAN Point Types
Density = number of points within a specified radius (Eps)
A point is a core point if it has more than a specified number of points (MinPts) within Eps. These are points that are at the interior of a cluster.
A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.
A noise point is any point that is not a core point or a border point.

Anomaly Detection
An anomaly is a pattern in the data that does not conform to the expected behavior (e.g., outliers, exceptions, peculiarities, surprises).
Types of Anomaly
Point: An individual data instance is anomalous w.r.t. the data
Contextual: An individual data instance is anomalous within a context
Collective: A collection of related data instances is anomalous
Approaches
* Graphical (e.g., boxplots, scatter plots)
* Statistical (e.g., normal distribution, likelihood)
  | Parametric Techniques
  | Non-parametric Techniques
* Distance (e.g., nearest-neighbor, density, clustering)
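The DBSCAN core / border / noise definitions can be sketched in Python. Two assumptions to note: the Eps-neighborhood count here includes the point itself, and the core test uses >= MinPts (some statements use a strict inequality).

```python
def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' per the DBSCAN
    definitions; points are tuples, Euclidean distance is assumed."""
    def d(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    # Eps-neighborhood of each point (including the point itself)
    neighbors = {p: [q for q in points if d(p, q) <= eps] for p in points}
    core = {p for p in points if len(neighbors[p]) >= min_pts}
    labels = {}
    for p in points:
        if p in core:
            labels[p] = "core"
        elif any(q in core for q in neighbors[p]):
            labels[p] = "border"   # not dense, but near a core point
        else:
            labels[p] = "noise"    # neither core nor border
    return labels

classify_points([(0, 0), (0, 1), (1, 0), (2, 0), (5, 5)], eps=1.5, min_pts=3)
# (0,0), (0,1), (1,0) are core; (2,0) is border; (5,5) is noise
```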
Local outlier factor (LOF) is a density-based distance approach
Mahalanobis Distance is a clustering-based distance approach

Other Clustering Methods
Fuzzy clustering (also referred to as soft clustering) is a partitional clustering method in which each data point can belong to more than one cluster.
Graph-based methods: Jarvis-Patrick, Shared-Near Neighbor (SNN, Density), Chameleon
Model-based methods: Expectation-Maximization
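For the fuzzy clustering entry, the standard fuzzy c-means membership update can serve as a concrete sketch: the weight of point x in cluster i is 1 / sum_k (d_i / d_k)^(2/(m-1)), where d_i is the distance to centroid i and m > 1 is the fuzzifier (the centroids and m here are illustrative inputs, not part of the sheet).

```python
def fuzzy_memberships(x, centroids, m=2.0):
    """Fuzzy c-means membership of point x in each cluster:
    u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1)).
    Assumes x does not coincide with any centroid (all d_i > 0)."""
    d = [sum((a - b) ** 2 for a, b in zip(x, c)) ** 0.5 for c in centroids]
    return [1.0 / sum((d[i] / d[k]) ** (2 / (m - 1)) for k in range(len(d)))
            for i in range(len(d))]

u = fuzzy_memberships((1.0, 0.0), [(0.0, 0.0), (3.0, 0.0)], m=2.0)
# u == [0.8, 0.2]: memberships sum to 1, the closer centroid gets more weight
```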
Regression Analysis
* Linear Regression
| Least squares
* Subset selection
* Stepwise selection
* Regularized regression
| Ridge
| Lasso
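A minimal sketch of least squares and ridge shrinkage, reduced to a single feature with no intercept so the closed form fits in one line (real regularized regression solves the same problem over the full design matrix):

```python
def ridge_slope(xs, ys, lam=0.0):
    """Slope w of y ~ w*x minimizing sum (y - w*x)^2 + lam * w^2.
    lam=0 gives ordinary least squares; larger lam shrinks w toward 0."""
    sxx = sum(x * x for x in xs)            # sum of x^2
    sxy = sum(x * y for x, y in zip(xs, ys))  # sum of x*y
    return sxy / (sxx + lam)

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
ridge_slope(xs, ys)           # 2.0 (exact fit)
ridge_slope(xs, ys, lam=14)   # 1.0 (shrunk toward zero by the penalty)
```

Lasso replaces the squared penalty lam * w^2 with lam * |w|, which can drive coefficients exactly to zero and therefore performs subset selection implicitly.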