Unsupervised Machine Learning

Unsupervised learning is a machine learning technique in which models are not supervised using a labeled training dataset. Instead, the models themselves find hidden patterns and insights in the given data.

“Unsupervised learning is a type of machine learning in which models are trained using
unlabeled dataset and are allowed to act on that data without any supervision.”

The goal of unsupervised learning is to find the underlying structure of the dataset, group the data according to similarities, and represent the dataset in a compressed format.

Why use Unsupervised Learning?

o Unsupervised learning is helpful for finding useful insights from data.
o Unsupervised learning closely resembles how a human learns to think from their own
  experiences, which makes it closer to real AI.
o In the real world, we do not always have input data with corresponding outputs, so to
  solve such cases we need unsupervised learning.

Types of Unsupervised Learning algorithms:


Clustering:

• Clustering is the process of dividing a dataset into groups consisting of similar data points.

• It means grouping objects based on information found in the data that describes the objects or their relationships.

Association:

• Used for finding relationships between variables in large databases.

• Association rules make marketing strategies more effective; for example, people who buy item X (say, bread) also tend to purchase item Y (butter or jam). A typical example of association rules is Market Basket Analysis.

Advantages of Unsupervised Learning


• Unsupervised learning can be used for more complex tasks than supervised learning, because it does not require labeled input data.

• Unsupervised learning is often preferable because unlabeled data is much easier to obtain than labeled data.

Disadvantages of Unsupervised Learning


• Unsupervised learning is more difficult than supervised learning, as there is no corresponding output to learn from.

• The result of an unsupervised learning algorithm may be less accurate, since the input data is not labeled and the algorithm does not know the exact output in advance.
Clustering in Machine Learning

• Clustering is the task of dividing data points into a number of groups, such that data points in the same group are more similar to each other and dissimilar to data points in other groups.
• It is basically a collection of objects grouped on the basis of similarity and dissimilarity between them.

Why clustering is important?

• Through the use of clusters, it is very easy to sort data and analyze
specific groups.
• Clustering enables businesses to approach customer segments
differently based on their attributes and similarities, which helps in
maximizing profits.
• It can help in dimensionality reduction if the dataset comprises
too many variables: irrelevant clusters can be identified more easily
and removed from the dataset.
Where is it used?
• City Planning: It is used to group houses and study their values based on
their geographical locations and other factors.

• Earthquake studies: By identifying earthquake-affected areas, we can
determine the dangerous zones.

• Image Processing: Clustering can be used to group similar images
together, classify images based on content, and identify patterns in
image data.

• Manufacturing: Clustering is used to group similar products together,
optimize production processes, and identify defects in manufacturing
processes.

• Medical diagnosis: Clustering is used to group patients with similar
symptoms or diseases, which helps in making accurate diagnoses and
identifying effective treatments.

• Fraud detection: Clustering is used to identify suspicious patterns or
anomalies in financial transactions, which can help in detecting fraud or
other financial crimes.

Types of Clustering:-

❖ Exclusive Clustering
• k-Means Clustering
❖ Overlapping Clustering
• Fuzzy c-Means Clustering
❖ Hierarchical Clustering

Exclusive Clustering:-
• Exclusive clustering, also known as hard clustering, is a type of clustering
in unsupervised machine learning where each data point is assigned to
exactly one cluster.
• In other words, there is a clear and exclusive assignment of each data
point to a single cluster, and no overlapping memberships are allowed.

• The most well-known exclusive clustering algorithm is k-means.

k-Means Clustering:-
• K-means clustering is a popular unsupervised machine learning
algorithm used for partitioning a dataset into a set of k groups.

• There is no overlapping of subgroups or clusters.

• The algorithm's objective is to group data points into k clusters, where
each data point belongs to the cluster with the nearest centroid.

• Here, K defines the number of pre-defined clusters that need to be
created in the process: if K=2, there will be two clusters; for K=3,
there will be three clusters; and so on.

• It is a centroid-based algorithm, where each cluster is associated with a
centroid. The main aim of this algorithm is to minimize the sum of
distances between the data points and their corresponding cluster centroids.

How does the K-Means Algorithm Work?

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points as the initial centroids. (They need not be
points from the input dataset.)

Step-3: Calculate the distance between each data point and each centroid, and
assign each data point to its closest centroid; this forms the predefined K
clusters.

Step-4: Recalculate each cluster center by taking the average of that cluster's
data points.

Step-5: Repeat steps 3 and 4 until the recalculated cluster centers are the
same as before, or no data points are reassigned.
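As a minimal sketch, the five steps above can be implemented with NumPy as follows (the function name and the simple random initialization are illustrative, not a standard library API):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch following the steps above."""
    rng = np.random.default_rng(seed)
    # Step-2: pick K distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step-3: assign every point to its closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step-4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step-5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

On two well-separated groups of points, this sketch converges in a few iterations and recovers the groups regardless of which points are picked as initial centroids.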

How to decide the number of clusters?


Elbow Method :
• The Elbow method is one of the most popular ways to find the optimal
number of clusters. This method uses the WCSS value. WCSS stands
for Within-Cluster Sum of Squares, which measures the total variation
within the clusters.
• The formula to calculate the value of WCSS (for 3 clusters) is given below:

WCSS = ∑Pi in Cluster1 distance(Pi, C1)² + ∑Pi in Cluster2 distance(Pi, C2)² + ∑Pi in Cluster3 distance(Pi, C3)²

• In the above formula of WCSS,

∑Pi in Cluster1 distance(Pi, C1)² is the sum of the squared distances
between each data point Pi in Cluster1 and its centroid C1; the same
applies to the other two terms.

• To measure the distance between data points and centroids, we can use
any distance metric, such as Euclidean distance.

• To find the optimal number of clusters, the elbow method follows the
steps below:
(1) It executes K-means clustering on a given dataset for different K
values (e.g., ranging from 1 to 10).

(2) For each value of K, it calculates the WCSS value.

(3) It plots a curve between the calculated WCSS values and the number of
clusters K.

(4) The sharp point of bend, where the plot looks like an arm, is
considered the best value of K.

• Since the graph shows a sharp bend that looks like an elbow, this
technique is known as the elbow method.
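The four steps above can be sketched in Python, assuming scikit-learn is available (the synthetic three-blob dataset is illustrative); scikit-learn's KMeans exposes the WCSS directly as its `inertia_` attribute:

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative synthetic data: three well-separated blobs of 50 points each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])

wcss = []
for k in range(1, 11):                                   # step (1): K = 1..10
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)                             # step (2): inertia_ is the WCSS

# Steps (3)-(4): plot wcss against k (e.g. with matplotlib) and pick the K at
# the "elbow"; for this data the curve flattens sharply after K = 3.
```

Here the WCSS drops steeply up to K = 3 (the true number of blobs) and only slightly afterwards, which is exactly the elbow shape the method looks for.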

Overlapping Clustering:-
• Overlapping clustering, also known as soft clustering, is a type of
clustering in which a data point can belong to more than one cluster.

• In traditional (non-overlapping) clustering, each data point is assigned to
exactly one cluster. In overlapping clustering, however, a data point may
have membership in more than one cluster, indicating that it exhibits
characteristics of multiple clusters.

Fuzzy C-Means Clustering:-


• Fuzzy C-Means (FCM) clustering is a type of unsupervised machine
learning algorithm used for clustering, and it's an extension of the classic
K-Means algorithm.
• The key difference between K-Means and FCM lies in the assignment of
data points to clusters. In K-Means, each data point is assigned to a
single cluster, while in FCM a data point can belong to more than one
cluster with different degrees of membership.

• Fuzzy clustering assigns each data point a membership degree between 0
and 1 for each cluster.
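A minimal NumPy sketch of FCM, assuming the standard alternating updates of centroids and memberships (function and parameter names are illustrative; m > 1 is the fuzziness exponent):

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, max_iter=300, tol=1e-6, seed=0):
    """Minimal FCM sketch: alternate centroid and membership updates."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)          # memberships of each point sum to 1
    for _ in range(max_iter):
        Um = U ** m
        # Centroids are membership-weighted means of all points
        centroids = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Distances from every point to every centroid (epsilon avoids 0-division)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + 1e-12
        # Standard membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
        U_new = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1))).sum(axis=2)
        if np.abs(U_new - U).max() < tol:      # stop when memberships stabilize
            U = U_new
            break
        U = U_new
    return U, centroids
```

Each row of `U` gives one data point's degrees of membership across all clusters; taking the row-wise argmax recovers a hard (k-means-style) assignment when one is needed.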

Advantages of Fuzzy Clustering:-

• Flexibility: Fuzzy clustering allows for overlapping clusters, which can
be useful when the data has a complex structure.

• Interpretability: Fuzzy clustering provides a more detailed
representation of the relationships between data points and clusters.

Disadvantages of Fuzzy Clustering:-


• Complexity: Fuzzy clustering algorithms can be computationally more
expensive than traditional clustering algorithms.

Hierarchical Clustering:-
• Hierarchical clustering is a type of clustering algorithm that organizes
data points into a tree-like structure, known as a dendrogram.

• The basic idea behind hierarchical clustering is to build a hierarchy of
clusters, where clusters at one level of the hierarchy are formed by
merging or splitting clusters at the preceding level.

Advantages:-
• Hierarchy Representation: This hierarchical structure can be useful for
understanding the organization of the data.

• No Need to Specify the Number of Clusters in Advance: The number of
clusters can be chosen after the hierarchy is built, by cutting the
dendrogram at a suitable level.
Disadvantages:
• Computational Complexity
• Sensitive to Noise
• Difficulty in Handling Large Datasets
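As a sketch, agglomerative hierarchical clustering can be run with SciPy (assuming SciPy is available; the toy data is illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two well-separated groups of three points each
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)

# Build the merge tree bottom-up; "ward" merges, at each step, the pair of
# clusters whose union least increases the within-cluster variance
Z = linkage(X, method="ward")

# Cut the dendrogram into flat clusters -- the number of clusters is chosen
# *after* the hierarchy is built, unlike in k-means
labels = fcluster(Z, t=2, criterion="maxclust")
```

`scipy.cluster.hierarchy.dendrogram(Z)` would draw the tree itself, which is how the cut level is usually chosen by inspection.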

Association in Machine Learning


• Association in unsupervised machine learning generally refers to
the discovery of interesting relationships, patterns, or associations
within a dataset without predefined labels or outcomes.
Applications of Association:
• Market Basket Analysis:
Discovering relationships between products that are frequently
purchased together.
• Healthcare Data Analysis:
Identifying associations between symptoms and diseases.

• Fraud Detection:
Discovering unusual patterns of transactions that may indicate
fraudulent activity.
Pros and Cons:
Pros:
• Discover Hidden Patterns:
Association rules can reveal hidden patterns or relationships within
the data.
• Applicability:
Used in various domains, including retail, healthcare, finance, etc.
Cons:
• Data Quality:
Sensitive to noise and irrelevant information in the dataset.
• Scalability:
Computationally expensive for large datasets.
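As a minimal sketch of the Market Basket Analysis idea above, plain Python can count item and pair frequencies and derive the support and confidence of a rule such as bread → butter (the transactions are hypothetical):

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions
transactions = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "jam"},
    {"bread", "butter", "milk"},
]

item_counts = Counter()
pair_counts = Counter()
for t in transactions:
    item_counts.update(t)
    pair_counts.update(combinations(sorted(t), 2))  # each pair counted once per basket

n = len(transactions)
# Support of {bread, butter}: fraction of baskets containing both items
support = pair_counts[("bread", "butter")] / n
# Confidence of the rule bread -> butter: P(butter | basket contains bread)
confidence = pair_counts[("bread", "butter")] / item_counts["bread"]
# Here support = 3/5 and confidence = 3/4
```

Real association-rule miners (e.g., Apriori) prune the search using these same support and confidence thresholds, rather than enumerating all pairs as this toy version does.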
