DWBI4

4.
AIM:
Classification Algorithms are Supervised Machine Learning Algorithms that use
labeled data (aka training datasets) to train classifier models. These models then
predict outcomes with the best possible accuracy when new data (aka testing
datasets) is fed to them.
The outcome predicted by a classification algorithm is categorical in nature. These
algorithms assign each input to one of a specific set of classes, such as an SMS filter
on a phone that classifies a text message as a transaction or a promotion.
Classification techniques predict discrete class label output(s) to which the data
elements belong. For example, weather prediction is a type of classification problem
– ‘hot’ and ‘cold’ being the class labels. This is called binary classification since there
are only two classes.
A few more examples of classification problems:
• Speech recognition
• Face detection
• Spam texts/e-mails classification
• Stock market prediction
• Breast cancer detection
• Employee Attrition prediction
How do Classification Algorithms work?

A classifier uses known (training) data to learn how the given input
(independent) variables relate to the target (dependent) variable.
In the weather example above, the outside temperatures of previous days, each labeled
‘hot’ or ‘cold’, serve as the training data. This data is fed into the classifier; if it
is trained accurately, it can predict future weather conditions.
We use Binary Classifiers in case there are only two classes and Multi-class
Classifiers for more than two class divisions.
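The weather example can be sketched in code as follows. This is a minimal illustration, assuming scikit-learn is installed; the temperature values, the labels, and the choice of `KNeighborsClassifier` are made up for illustration and are not part of the original text.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data: outside temperature (in °C) of previous days
# and the class label assigned to each day.
train_temperatures = [[5], [8], [10], [12], [28], [30], [33], [35]]
train_labels = ["cold", "cold", "cold", "cold", "hot", "hot", "hot", "hot"]

# Train a binary classifier on the labeled (training) data.
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(train_temperatures, train_labels)

# Predict the class of new (testing) data the model has not seen.
new_temperatures = [[7], [31]]
predictions = classifier.predict(new_temperatures)
print(list(predictions))
```

The same `fit` / `predict` pattern applies to a multi-class problem; only the set of labels changes.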
Types of Classification Algorithms
Which algorithm to use depends on the application and the nature of the
data. The most common classification algorithms include:
• Logistic Regression
• K Nearest Neighbors (KNN)
• Support Vector Machine (SVM)
• Decision Tree
• Random Forest
• Naïve Bayes
Logistic Regression
Note that, although the name is Logistic “Regression”, it is actually a linear
classification algorithm. It is typically used when the outcome is binary, like true (1)
or false (0), win (1) or lose (0), and it works best when the classes are roughly
linearly separable.
Logistic regression uses a sigmoid function to return the probability of a label. The
curve of this function is called a sigmoid curve or an S-curve. The object is then
assigned to a label by comparing the probability with a pre-defined threshold.
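The sigmoid-plus-threshold step can be sketched in plain Python. The weight, bias, and threshold values below are hypothetical and chosen only for illustration; in practice they would be learned from training data.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(x, weight, bias, threshold=0.5):
    """Return (label, probability) for a single numeric feature x."""
    probability = sigmoid(weight * x + bias)
    label = 1 if probability >= threshold else 0
    return label, probability


# Hypothetical parameters: the decision boundary sits at x = 20.
WEIGHT, BIAS = 0.4, -8.0

for x in (10, 20, 30):
    label, probability = predict_label(x, WEIGHT, BIAS)
    print(x, label, round(probability, 3))
```

Raising the threshold makes the model more reluctant to predict class 1; lowering it has the opposite effect.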
K-Nearest Neighbors (KNN)

If your dataset has n features, KNN represents each data point in an n-dimensional
space. To classify an unlabeled data point, it calculates the distance from that point to
the labeled data points and assigns the label that is most common among its k nearest
neighbors. KNN is commonly used for recommender systems, credit scoring, etc.
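A minimal from-scratch sketch of this idea, assuming Euclidean distance and a simple majority vote (the helper name `knn_predict` and the sample points are hypothetical):

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Pair every training point's distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical 2-dimensional data with two classes.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (2.0, 2.0)))
print(knn_predict(points, labels, (6.0, 7.0)))
```

Because the method relies on distances, features measured on very different scales are usually standardized first.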
Support Vector Machine (SVM)

A support vector classifier finds a hyperplane, called the decision boundary, that
separates the data points into specific classes. The data points closest to the
decision boundary are called support vectors. An optimum decision boundary is the
one that lies as far as possible from the support vectors of each class. The margin
is the shortest perpendicular distance between the support vectors and the decision
boundary.
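As a rough illustration, the sketch below fits a linear support vector classifier with scikit-learn's `SVC` and prints the support vectors it selected. The six data points are hypothetical, and a linear kernel is assumed so that the decision boundary is a straight line.

```python
from sklearn.svm import SVC

# Hypothetical, linearly separable 2-D data.
X = [[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]]
y = [0, 0, 0, 1, 1, 1]

# A linear kernel keeps the decision boundary a straight line (hyperplane).
model = SVC(kernel="linear")
model.fit(X, y)

# The training points that lie closest to the decision boundary.
print(model.support_vectors_)

# Classify two new points, one on each side of the boundary.
new_predictions = model.predict([[1.5, 1.5], [4.5, 4.5]])
print(list(new_predictions))
```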
Decision Tree
As the name suggests, this algorithm builds “branches” in a hierarchical manner
where each branch can be considered as an if-else statement. The branches divide
the dataset into subsets based on the most important features. The “leaves” of the
decision tree are where the final classifications happen.
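The if-else structure becomes visible when a fitted tree is printed as text. The sketch below assumes scikit-learn and its bundled iris dataset, neither of which is named in the original; the depth limit of 2 is chosen only to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Limit the depth so the printed rules stay small and readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each indented line is a branch (a condition); "class:" lines are leaves.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```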
Random Forest
Just as a forest is made up of trees, a random forest is a collection of decision
trees. This classifier aggregates the results from multiple predictors. It additionally
uses the bagging technique, which trains each tree on a random sample of the
original dataset, and takes the majority vote of the trees. A random forest classifier
usually generalizes better but is less interpretable than a decision tree classifier,
because its prediction combines many trees instead of one readable set of rules.
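A short sketch of this, assuming scikit-learn and its bundled iris dataset (the dataset, the number of trees, and the split ratio are illustrative choices, not taken from the original):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# bootstrap=True means each tree is trained on a random sample (with
# replacement) of the training data, i.e. bagging.
forest = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=0)
forest.fit(X_train, y_train)

# The forest combines the predictions of its individual trees.
accuracy = forest.score(X_test, y_test)
print(len(forest.estimators_), round(accuracy, 3))
```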
Naïve Bayes
This classifier assigns data to different classes according to Bayes’ Theorem. It
assumes, however, that all input features are independent of one another within a
class. Hence, the model is called naïve. This algorithm works relatively well even
when the size of the training dataset is small. Naïve Bayes is commonly used for text
classification, sentiment analysis, etc.
How to Run a Classification Task with Naive Bayes
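One possible way to run such a task is sketched below, using the SMS filter example from the start of this section. It assumes scikit-learn's `CountVectorizer` and `MultinomialNB`; the messages and labels are invented for illustration, and a real task would need far more training data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labeled SMS messages (training data).
messages = [
    "win a free prize now",
    "limited offer buy now",
    "free discount offer today",
    "your account was debited 500",
    "payment of 200 received",
    "your account was credited 1000",
]
labels = [
    "promotion", "promotion", "promotion",
    "transaction", "transaction", "transaction",
]

# Turn each message into word counts, then fit a Naive Bayes classifier.
sms_filter = make_pipeline(CountVectorizer(), MultinomialNB())
sms_filter.fit(messages, labels)

# Classify new, unseen messages.
predictions = sms_filter.predict(["free prize offer", "account debited 300"])
print(list(predictions))
```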
Conclusion:
5.
AIM:
Clustering Algorithms
Clustering or cluster analysis is an unsupervised learning problem.
It is often used as a data analysis technique for discovering interesting
patterns in data, such as groups of customers based on their
behaviour.
There are many clustering algorithms to choose from and no single
best clustering algorithm for all cases. Instead, it is a good idea to
explore a range of clustering algorithms and different configurations
for each algorithm.
In this tutorial, you will discover how to fit and use top clustering
algorithms in Python.
After completing this tutorial, you will know:
• Clustering is an unsupervised problem of finding natural groups in the feature space of
input data.
• There are many different clustering algorithms and no single best method for all
datasets.
• How to implement, fit, and use top clustering algorithms in Python with the scikit-learn
machine learning library.
This tutorial is divided into three parts; they are:
1. Clustering
2. Clustering Algorithms
3. Examples of Clustering Algorithms
   1. Library Installation
   2. Clustering Dataset
   3. Affinity Propagation
   4. Agglomerative Clustering
   5. BIRCH
   6. DBSCAN
   7. K-Means
   8. Mini-Batch K-Means
   9. Mean Shift
   10. OPTICS
   11. Spectral Clustering
   12. Gaussian Mixture Model
Clustering
Cluster analysis, or clustering, is an unsupervised machine learning
task.
It involves automatically discovering natural grouping in data. Unlike
supervised learning (like predictive modeling), clustering algorithms
only interpret the input data and find natural groups or clusters in
feature space.
Clustering techniques apply when there is no class to be predicted but
rather when the instances are to be divided into natural groups.
A cluster is often an area of density in the feature space where
examples from the domain (observations or rows of data) are closer to
that cluster than to other clusters. The cluster may have a center (the
centroid) that is a sample or a point in feature space and may have a
boundary or extent.
These clusters presumably reflect some mechanism at work in the
domain from which instances are drawn, a mechanism that causes
some instances to bear a stronger resemblance to each other than
they do to the remaining instances.
Clustering can be helpful as a data analysis activity in order to learn
more about the problem domain, so-called pattern discovery or
knowledge discovery.
For example:
• The phylogenetic tree could be considered the result of a manual clustering analysis.
• Separating normal data from outliers or anomalies may be considered a clustering
problem.
• Separating clusters based on their natural behavior is a clustering problem, referred to
as market segmentation.
Clustering can also be useful as a type of feature engineering, where
existing and new examples can be mapped and labeled as belonging to
one of the identified clusters in the data.
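The feature engineering use can be sketched as follows, assuming scikit-learn's `KMeans` (the data points and the choice of two clusters are hypothetical): the cluster label is appended to the existing examples as an extra column, and new examples are mapped to one of the identified clusters with `predict`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical existing examples forming two well-separated groups.
X = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
    [8.0, 8.0], [8.2, 7.9], [7.8, 8.1],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Append the cluster label of each existing example as a new feature column.
X_with_cluster = np.column_stack([X, kmeans.labels_])

# Map new examples to one of the identified clusters.
new_examples = np.array([[1.1, 0.9], [7.9, 8.0]])
new_labels = kmeans.predict(new_examples)
print(X_with_cluster.shape, list(new_labels))
```

The numeric cluster IDs are arbitrary; only membership in the same cluster is meaningful.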
Evaluation of identified clusters is subjective and may require a domain
expert, although many clustering-specific quantitative measures do
exist. Typically, clustering algorithms are compared academically on
synthetic datasets with pre-defined clusters, which an algorithm is
expected to discover.
Clustering is an unsupervised learning technique, so it is hard to
evaluate the quality of the output of any given method.
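As an illustration of such a comparison, the sketch below generates a synthetic dataset with three pre-defined clusters and scores a K-Means result with two quantitative measures from scikit-learn. The dataset parameters and the choice of the adjusted Rand index and silhouette score are assumptions made for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Synthetic data with three known, well-separated clusters.
X, true_labels = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.6,
    random_state=42,
)

found_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Adjusted Rand index: agreement with the known clusters (needs true labels).
ari = adjusted_rand_score(true_labels, found_labels)

# Silhouette score: cohesion vs. separation (needs no true labels).
silhouette = silhouette_score(X, found_labels)

print(round(ari, 3), round(silhouette, 3))
```

On real data the true labels are usually unavailable, so only measures like the silhouette score can be computed.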
Clustering Algorithms
There are many types of clustering algorithms.
Many algorithms use similarity or distance measures between
examples in the feature space in an effort to discover dense regions of
observations. As such, it is often good practice to scale data prior to
using clustering algorithms.
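Scaling can be sketched with scikit-learn's `StandardScaler`. The two features below (age and income) are hypothetical; they were chosen because their very different ranges would otherwise let income dominate any distance calculation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: column 0 is age in years, column 1 is income.
X = np.array([
    [25, 30000.0],
    [32, 45000.0],
    [47, 52000.0],
    [51, 110000.0],
    [62, 98000.0],
])

# Rescale every feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```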
Central to all of the goals of cluster analysis is the notion of the degree
of similarity (or dissimilarity) between the individual objects being
clustered. A clustering method attempts to group the objects based on
the definition of similarity supplied to it. Some clustering algorithms
require you to specify or guess at the number of clusters to discover in
the data, whereas others require the specification of some minimum
distance between observations in which examples may be considered
“close” or “connected.”
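The two kinds of configuration can be contrasted in a short sketch, assuming scikit-learn: `KMeans` takes the number of clusters (`n_clusters`), while `DBSCAN` takes a neighborhood distance (`eps`). The data points and parameter values are hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

# Hypothetical data: two tight groups of four points plus one distant point.
X = np.array([
    [0.0, 0.0], [0.0, 0.3], [0.3, 0.0], [0.3, 0.3],
    [5.0, 5.0], [5.0, 5.3], [5.3, 5.0], [5.3, 5.3],
    [10.0, 0.0],
])

# K-Means: the number of clusters must be specified up front.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN: points within `eps` of each other are treated as connected;
# points that belong to no dense region get the label -1 (noise).
dbscan_labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)

print(kmeans_labels.tolist())
print(dbscan_labels.tolist())
```

K-Means assigns every point, including the distant one, to one of its two clusters, whereas DBSCAN discovers the two groups on its own and marks the distant point as noise.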
As such, cluster analysis is an iterative process where subjective
evaluation of the identified clusters is fed back into changes to
algorithm configuration until a desired or appropriate result is
achieved.
The scikit-learn library provides a suite of different clustering
algorithms to choose from.
A list of 10 of the more popular algorithms is as follows:
• Affinity Propagation
• Agglomerative Clustering
• BIRCH
• DBSCAN
• K-Means
• Mini-Batch K-Means
• Mean Shift
• OPTICS
• Spectral Clustering
• Gaussian Mixture Model
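Several of these algorithms share the same `fit_predict` interface in scikit-learn, which makes it easy to explore a range of them on one dataset. The sketch below tries five of the listed algorithms on a synthetic dataset; the dataset and all parameter values are illustrative assumptions, not recommended settings.

```python
from sklearn.cluster import (
    AgglomerativeClustering,
    Birch,
    DBSCAN,
    KMeans,
    MiniBatchKMeans,
)
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups.
X, _ = make_blobs(
    n_samples=150,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.6,
    random_state=1,
)

models = {
    "Agglomerative Clustering": AgglomerativeClustering(n_clusters=3),
    "BIRCH": Birch(n_clusters=3),
    "DBSCAN": DBSCAN(eps=1.0, min_samples=5),
    "K-Means": KMeans(n_clusters=3, n_init=10, random_state=1),
    "Mini-Batch K-Means": MiniBatchKMeans(n_clusters=3, n_init=3, random_state=1),
}

results = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    # Count the clusters found, ignoring DBSCAN's noise label (-1).
    results[name] = len(set(labels.tolist()) - {-1})
    print(name, results[name])
```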
Conclusion:
