DWBI4


4.

AIM:
Classification algorithms are supervised machine learning algorithms that use
labeled data (the training dataset) to train classifier models. The trained models
then predict outcomes, as accurately as possible, when new data (the testing
dataset) is fed to them.
The outcome predicted by a classification algorithm is categorical. These
algorithms assign each input to one of a fixed set of classes, such as an SMS filter
on a phone labeling a text message as a transaction or a promotion.
Classification techniques predict the discrete class label(s) to which the data
elements belong. For example, predicting whether a day will be ‘hot’ or ‘cold’ is a
classification problem, with ‘hot’ and ‘cold’ as the class labels. This is called binary
classification since there are only two classes.

A few more examples of classification problems:

• Speech recognition
• Face detection
• Spam text/e-mail classification
• Stock market prediction
• Breast cancer detection
• Employee attrition prediction

How do Classification Algorithms work?


A classifier uses known (training) data to learn how the given input
(independent) variables relate to the target (dependent) variable.
In the weather example above, we would take the outside temperatures of previous
days, each labeled ‘hot’ or ‘cold’, and use them as the training data. This data is fed
into the classifier; if it is trained accurately, it will be able to predict future weather
conditions.
We use binary classifiers when there are only two classes and multi-class
classifiers when there are more than two.
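
The train-then-predict flow described above can be sketched with a toy example. This is only an illustration, not one of the algorithms listed below: the temperatures, the labels, and the midpoint-threshold rule are all assumptions made up for the sketch.

```python
from statistics import mean

def train_threshold(temps, labels):
    """Learn a cut-off halfway between the mean 'cold' and mean 'hot' temperature."""
    cold = [t for t, y in zip(temps, labels) if y == "cold"]
    hot = [t for t, y in zip(temps, labels) if y == "hot"]
    return (mean(cold) + mean(hot)) / 2

def predict(threshold, temp):
    """Binary classification: exactly two possible class labels."""
    return "hot" if temp >= threshold else "cold"

# Hypothetical labeled training data (temperatures in degrees Celsius).
train_temps = [12, 15, 18, 30, 33, 36]
train_labels = ["cold", "cold", "cold", "hot", "hot", "hot"]

threshold = train_threshold(train_temps, train_labels)

# New, unlabeled data is fed to the trained model.
test_temps = [10, 28]
predictions = [predict(threshold, t) for t in test_temps]
print(threshold, predictions)
```

A real classifier replaces the hand-written threshold rule with a learning algorithm, but the two phases (fit on labeled data, then predict on new data) stay the same.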
Types of Classification Algorithms
Which algorithm to use depends on the application and the nature of the
data. The most common classification algorithms include:
• Logistic Regression
• K-Nearest Neighbors (KNN)
• Support Vector Machine (SVM)
• Decision Tree
• Random Forest
• Naïve Bayes
Logistic Regression
Note that, although the name is Logistic “Regression”, it is actually a linear
classification algorithm. It is most commonly used for binary problems such as
true (1) or false (0), win (1) or lose (0), and it works best when the classes are
approximately linearly separable.
Logistic regression applies a sigmoid function to a weighted sum of the input
features to return the probability of a label. The resulting curve is called a sigmoid
curve or an S-curve. The object is then assigned to a label by comparing this
probability with a pre-defined threshold (commonly 0.5).
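
The sigmoid-and-threshold step can be sketched as follows. The weights and bias here are hypothetical values chosen for illustration; in practice they would be learned from the training data.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1 (the S-curve)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    """Return (label, probability) for one example."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(z)
    label = 1 if probability >= threshold else 0
    return label, probability

# Hypothetical, already-trained parameters (assumptions for this sketch).
weights = [0.8, -0.4]
bias = -1.0

label_a, prob_a = predict_label([3.0, 1.0], weights, bias)
label_b, prob_b = predict_label([0.5, 2.0], weights, bias)
print(label_a, round(prob_a, 3), label_b, round(prob_b, 3))
```

Raising or lowering `threshold` trades false positives against false negatives without retraining the model.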

K-Nearest Neighbors (KNN)


If your dataset has n features, KNN represents each data point in an n-dimensional
space. It then calculates the distance between an unlabeled data point and the
labeled data points. The unlabeled point is assigned the most common label among
its k nearest labeled neighbors. KNN is commonly used for recommender systems,
credit scoring, etc.
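
A minimal sketch of the idea, assuming Euclidean distance and a simple majority vote; the sample points and labels are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` with the most common label among its k nearest neighbors."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical two-feature dataset with two classes.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(points, labels, (2.0, 2.0)))
print(knn_predict(points, labels, (6.0, 6.5)))
```

Because the method relies on distances, features measured on very different scales are usually standardized first.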

Support Vector Machine (SVM)


A support vector classifier finds a hyperplane, called the decision boundary,
that separates the data points into classes. The data points closest to the
decision boundary are called support vectors. The optimal decision boundary is
the one that maximizes its distance from the support vectors. The margin is the
shortest perpendicular distance between the support vectors and the decision
boundary.
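
The geometry can be illustrated with a small sketch. It does not train an SVM: the hyperplane (`w`, `b`) is assumed to be given, and the code only measures the margin and picks out the support vectors for a made-up set of points.

```python
import math

def distance_to_hyperplane(point, w, b):
    """Perpendicular distance from `point` to the hyperplane w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, point))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(dot + b) / norm

# Assumed decision boundary x + y = 5 (not learned here).
w, b = (1.0, 1.0), -5.0
points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]

distances = [distance_to_hyperplane(p, w, b) for p in points]
margin = min(distances)  # shortest distance to the boundary
support_vectors = [p for p, d in zip(points, distances) if math.isclose(d, margin)]
print(margin, support_vectors)
```

Training an SVM amounts to searching for the `w` and `b` that make this margin as large as possible.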
Decision Tree
As the name suggests, this algorithm builds “branches” in a hierarchical manner
where each branch can be considered as an if-else statement. The branches divide
the dataset into subsets based on the most important features. The “leaves” of the
decision tree are where the final classifications happen.
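
One way to see how a branch is chosen is to score candidate splits with Gini impurity, a common (but not the only) splitting criterion. The sketch below handles a single numeric feature and made-up data; it is an illustration under those assumptions, not a full tree builder.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 means every label in the group is the same."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best 'value <= threshold' split."""
    best = (None, float("inf"))
    for threshold in sorted(set(values))[:-1]:  # skip max: right side would be empty
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

values = [2, 3, 10, 12]
labels = ["no", "no", "yes", "yes"]
threshold, impurity = best_split(values, labels)

def classify(x):
    """The learned branch is just an if-else statement."""
    return "no" if x <= threshold else "yes"

print(threshold, impurity, classify(11))
```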

Random Forest
Just as a forest is made up of trees, a random forest is a collection of decision
trees. This classifier aggregates the results from multiple predictors. It uses the
bagging technique, in which each tree is trained on a random sample of the
original dataset, and takes the majority vote of the trees. A random forest classifier
generalizes better than a single decision tree but is less interpretable, because its
prediction combines the votes of many trees rather than following a single path of
rules.
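
Bagging and majority voting can be sketched in plain Python. This is a simplified illustration: each "tree" is a one-split stump on a single feature, there are only two classes, and the random feature selection used by real random forests is omitted. The data and function names are made up for the sketch.

```python
import random
from collections import Counter
from statistics import mean

def train_stump(sample):
    """One-split 'tree': cut halfway between the two class means (two classes assumed)."""
    by_class = {}
    for x, y in sample:
        by_class.setdefault(y, []).append(x)
    if len(by_class) == 1:  # the bootstrap sample happened to contain one class only
        only_label = next(iter(by_class))
        return lambda x: only_label
    (lo_label, lo_mean), (hi_label, hi_mean) = sorted(
        ((label, mean(xs)) for label, xs in by_class.items()),
        key=lambda item: item[1],
    )
    cut = (lo_mean + hi_mean) / 2
    return lambda x: hi_label if x > cut else lo_label

def train_forest(data, n_trees=25, seed=0):
    """Bagging: every stump sees a random sample drawn with replacement."""
    rng = random.Random(seed)
    return [train_stump(rng.choices(data, k=len(data))) for _ in range(n_trees)]

def forest_predict(forest, x):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]

data = [(1, "low"), (2, "low"), (3, "low"), (4, "low"),
        (7, "high"), (8, "high"), (9, "high"), (10, "high")]
forest = train_forest(data)
print(forest_predict(forest, 1.5), forest_predict(forest, 9.5))
```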

Naïve Bayes
This classifier assigns data to different classes according to Bayes’ Theorem. It
assumes, however, that all input features are independent of one another given the
class. Hence, the model is called naïve. This algorithm works relatively well even
when the size of the training dataset is small. Naïve Bayes is commonly used for text
classification, sentiment analysis, etc.
How to Run a Classification Task with Naive Bayes
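One possible way to run such a task is sketched below with scikit-learn. The source does not say which dataset or Naive Bayes variant was used, so the Iris dataset, `GaussianNB`, the 70/30 split, and the random seed are all assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Load a small labeled dataset (assumed here; any labeled dataset would do).
X, y = load_iris(return_X_y=True)

# Hold out 30% of the rows as the testing dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Train the classifier on the labeled training data.
model = GaussianNB()
model.fit(X_train, y_train)

# Predict class labels for the unseen test rows and measure accuracy.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.2f}")
```

`GaussianNB` suits continuous features; for word counts in text classification, `MultinomialNB` is the more usual choice.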
Conclusion:

5.

AIM:

Clustering Algorithms
There are many types of clustering algorithms.

Many algorithms use similarity or distance measures between
examples in the feature space in an effort to discover dense regions of
observations. As such, it is often good practice to scale data prior to
using clustering algorithms.
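
To see why scaling matters, the sketch below standardizes two made-up features (income and age) so that each has mean 0 and standard deviation 1. The feature names and values are assumptions for illustration; scikit-learn's `StandardScaler` performs the same transformation on whole arrays.

```python
from statistics import mean, pstdev

def standardize(column):
    """Rescale a feature to mean 0 and standard deviation 1 (assumes non-constant values)."""
    mu, sigma = mean(column), pstdev(column)
    return [(value - mu) / sigma for value in column]

# Hypothetical features on very different scales.
incomes = [20000, 40000, 60000]
ages = [20, 40, 60]

scaled_incomes = standardize(incomes)
scaled_ages = standardize(ages)
print(scaled_incomes)
print(scaled_ages)
```

Before scaling, distances between rows are dominated by income; after scaling, both features contribute equally.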

Clustering or cluster analysis is an unsupervised learning problem.
It is often used as a data analysis technique for discovering interesting
patterns in data, such as groups of customers based on their
behaviour.

There are many clustering algorithms to choose from and no single
best clustering algorithm for all cases. Instead, it is a good idea to
explore a range of clustering algorithms and different configurations
for each algorithm.
In this tutorial, you will discover how to fit and use top clustering
algorithms in Python.

After completing this tutorial, you will know:

• Clustering is an unsupervised problem of finding natural groups in the feature space of
input data.
• There are many different clustering algorithms and no single best method for all
datasets.
• How to implement, fit, and use top clustering algorithms in Python with the scikit-learn
machine learning library.
This tutorial is divided into three parts; they are:

1. Clustering
2. Clustering Algorithms
3. Examples of Clustering Algorithms
   1. Library Installation
   2. Clustering Dataset
   3. Affinity Propagation
   4. Agglomerative Clustering
   5. BIRCH
   6. DBSCAN
   7. K-Means
   8. Mini-Batch K-Means
   9. Mean Shift
   10. OPTICS
   11. Spectral Clustering
   12. Gaussian Mixture Model
Clustering
Cluster analysis, or clustering, is an unsupervised machine learning
task.

It involves automatically discovering natural grouping in data. Unlike
supervised learning (like predictive modeling), clustering algorithms
only interpret the input data and find natural groups or clusters in
feature space.

Clustering techniques apply when there is no class to be predicted but
rather when the instances are to be divided into natural groups.

A cluster is often an area of density in the feature space where
examples from the domain (observations or rows of data) are closer to
that cluster than to other clusters. The cluster may have a center (the
centroid) that is a sample or a point in feature space and may have a
boundary or extent.

These clusters presumably reflect some mechanism at work in the
domain from which instances are drawn, a mechanism that causes
some instances to bear a stronger resemblance to each other than
they do to the remaining instances.

Clustering can be helpful as a data analysis activity in order to learn
more about the problem domain, so-called pattern discovery or
knowledge discovery.

For example:

• The phylogenetic tree could be considered the result of a manual clustering analysis.
• Separating normal data from outliers or anomalies may be considered a clustering
problem.
• Separating customers into groups based on their natural behavior is a clustering
problem, referred to as market segmentation.
Clustering can also be useful as a type of feature engineering, where
existing and new examples can be mapped and labeled as belonging to
one of the identified clusters in the data.

Evaluation of identified clusters is subjective and may require a domain
expert, although many clustering-specific quantitative measures do
exist. Typically, clustering algorithms are compared academically on
synthetic datasets with pre-defined clusters, which an algorithm is
expected to discover.
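
As a hedged illustration of such a comparison, the sketch below generates a synthetic dataset with three pre-defined clusters and scores a K-Means result with two common measures from scikit-learn. The cluster centers, sample size, and choice of measures are assumptions made for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Synthetic data with three known, well-separated clusters.
X, true_labels = make_blobs(
    n_samples=300,
    centers=[(-5, -5), (0, 0), (5, 5)],
    cluster_std=0.6,
    random_state=0,
)

model = KMeans(n_clusters=3, n_init=10, random_state=0)
found_labels = model.fit_predict(X)

# Adjusted Rand index: agreement with the known clusters (needs ground truth).
ari = adjusted_rand_score(true_labels, found_labels)
# Silhouette score: cohesion vs. separation (needs no ground truth).
sil = silhouette_score(X, found_labels)
print(f"ARI: {ari:.2f}, silhouette: {sil:.2f}")
```

On real data there is usually no ground truth, so only measures like the silhouette score are available.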

Clustering is an unsupervised learning technique, so it is hard to
evaluate the quality of the output of any given method.

Clustering Algorithms
Central to all of the goals of cluster analysis is the notion of the degree
of similarity (or dissimilarity) between the individual objects being
clustered. A clustering method attempts to group the objects based on
the definition of similarity supplied to it. Some clustering algorithms
require you to specify or guess at the number of clusters to discover in
the data, whereas others require the specification of some minimum
distance between observations in which examples may be considered
“close” or “connected.”
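
The two kinds of configuration can be contrasted with scikit-learn: K-Means is told how many clusters to find, while DBSCAN is given a neighborhood distance (`eps`) and infers the number of clusters. The dataset and all parameter values below are assumptions chosen for illustration.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups.
X, _ = make_blobs(
    n_samples=200,
    centers=[(-5, -5), (0, 0), (5, 5)],
    cluster_std=0.5,
    random_state=1,
)

# K-Means: the number of clusters must be specified (or guessed) up front.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

# DBSCAN: points within `eps` of each other are treated as connected;
# the number of clusters is a result, not an input. Label -1 marks noise.
dbscan_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
n_dbscan_clusters = len(set(dbscan_labels) - {-1})

print(len(set(kmeans_labels)), n_dbscan_clusters)
```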

As such, cluster analysis is an iterative process where subjective
evaluation of the identified clusters is fed back into changes to
algorithm configuration until a desired or appropriate result is
achieved.

The scikit-learn library provides a suite of different clustering
algorithms to choose from.

A list of 10 of the more popular algorithms is as follows:

• Affinity Propagation
• Agglomerative Clustering
• BIRCH
• DBSCAN
• K-Means
• Mini-Batch K-Means
• Mean Shift
• OPTICS
• Spectral Clustering
• Gaussian Mixture Model
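
Because these algorithms share the same scikit-learn interface, several of them can be tried on one dataset in a loop. The sketch below does this for three of the listed algorithms; the synthetic dataset and the parameter values are assumptions for illustration.

```python
from sklearn.cluster import AgglomerativeClustering, Birch, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups.
X, _ = make_blobs(
    n_samples=150,
    centers=[(-5, -5), (0, 0), (5, 5)],
    cluster_std=0.5,
    random_state=2,
)

models = {
    "Agglomerative Clustering": AgglomerativeClustering(n_clusters=3),
    "BIRCH": Birch(threshold=0.5, n_clusters=3),
    "Mini-Batch K-Means": MiniBatchKMeans(n_clusters=3, n_init=10, random_state=2),
}

cluster_counts = {}
for name, model in models.items():
    labels = model.fit_predict(X)  # fit the model and get one label per row
    cluster_counts[name] = len(set(labels))
    print(f"{name}: {cluster_counts[name]} clusters")
```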
Conclusion:
