24CSR1R01 DSF Assignment 2
Assignment 2
We can see from the dataset that it has 150 rows, i.e. 150 iris flowers, each of which is assigned
to a class (species).
The iris dataset has three classes: ‘Iris-setosa’, ‘Iris-versicolor’, and ‘Iris-virginica’, each containing 50
instances.
It is a balanced dataset.
The dataset includes four numeric attributes (sepal length, sepal width, petal length, petal width) and
one nominal class attribute.
It is a supervised dataset.
Observation:
Here I used the J48 decision tree, and from the confusion matrix we can say that:
All 50 Iris-setosa instances are classified correctly, as the ‘b’ and ‘c’ entries in that row are 0.
For Iris-versicolor there was one error: one instance was assigned to ‘c’ (Iris-virginica).
For Iris-virginica, 48 instances were classified correctly and 2 were misclassified.
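The counts in the observation above can be turned into per-class and overall accuracy figures. The following is a minimal sketch, not Weka output: the per-class correct counts are taken from the observation, the class size of 50 comes from the dataset description, and the function names are hypothetical helpers introduced only for illustration.

```python
# Correctly classified instances per class, as stated in the observation.
correct = {"Iris-setosa": 50, "Iris-versicolor": 49, "Iris-virginica": 48}

# Each iris class contains 50 instances (from the dataset description).
totals = {name: 50 for name in correct}


def per_class_recall(correct, totals):
    """Fraction of each class that was classified correctly."""
    return {name: correct[name] / totals[name] for name in correct}


def overall_accuracy(correct, totals):
    """Fraction of all instances that were classified correctly."""
    return sum(correct.values()) / sum(totals.values())


recall = per_class_recall(correct, totals)
accuracy = overall_accuracy(correct, totals)

for name, value in recall.items():
    print(f"{name}: {value:.2%}")
print(f"Overall accuracy: {accuracy:.2%}")
```

With these counts, 147 of the 150 instances are classified correctly, so the three errors together cost 2 percentage points of accuracy.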
This is the tree which is used by this classifier to distinguish between different classes.
3. Multilayer Perceptron
Clustering
Observation:
Four clusters were identified, despite the dataset originally having three classes. This suggests that one
or more species of iris exhibit internal variation that the EM algorithm picked up on.
Cluster 1 corresponds to Iris-setosa, which is easily separable due to its distinct features.
Cluster 0 primarily corresponds to Iris-versicolor, while Clusters 2 and 3 mostly represent Iris-
virginica.
The presence of four clusters instead of three suggests some variability or overlap between Iris-
versicolor and Iris-virginica, which might be better understood with further analysis.
The EM algorithm's clustering model captures the natural variation within the Iris dataset, providing
insights into the subtle differences between the species.
Observation:
According to the Ranker, the attribute “sepalwidth” has the least impact on the classification
result.
The screenshots show that sepalwidth has the least effect, so removing it before classifying
should not change the result much.
Dataset 2- Diabetes dataset
We can see from the dataset that it has 768 instances, each of which is assigned to one of two
classes.
There are 8 numerical input features (attributes) and 1 binary output variable (class - tested_negative /
tested_positive).
It is an unbalanced dataset.
It is a supervised dataset.
This is the tree which is used by this classifier to distinguish between different classes.
3. Multilayer Perceptron
Clustering
Observation:
Three clusters were identified, each representing different subgroups in the dataset.
Cluster 0 represents individuals with higher glucose and insulin levels, possibly indicating higher
diabetes risk.
Cluster 1 represents healthier, younger individuals with lower measurements across most features,
corresponding mainly to non-diabetic cases.
Cluster 2 includes older individuals with more pregnancies, with moderate glucose levels and possibly
missing insulin data, indicating another at-risk group.
The EM algorithm's clustering provides insights into the underlying structure of the data, potentially
guiding further analysis or targeted interventions.
Difference between the observations of using different classifiers on different datasets:
The Naive Bayes classifier worked quite well for the dataset with fewer instances, i.e. the Iris
dataset. But with the larger number of instances in the second dataset, the same classifier did not
perform well.
For the iris dataset, the Multilayer Perceptron performed the best at classification. But in the
case of the Diabetes dataset, the decision tree (J48) classifier performed the best.
For the iris dataset, the three classes (‘Iris-setosa’, ‘Iris-versicolor’, and ‘Iris-virginica’) contain 50
instances each.
For the Diabetes dataset, the two classes ‘tested_negative’ and ‘tested_positive’ have 500 and 268
instances respectively.
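The class counts above show why one dataset is called balanced and the other unbalanced. The following is a minimal sketch using only the counts stated in this report. The helper names are hypothetical, and the "majority baseline" is an added point of comparison (the accuracy obtained by always predicting the largest class), not a result from the experiments.

```python
# Class counts as stated in the report.
datasets = {
    "iris": {"Iris-setosa": 50, "Iris-versicolor": 50, "Iris-virginica": 50},
    "diabetes": {"tested_negative": 500, "tested_positive": 268},
}


def class_shares(counts):
    """Share of the dataset that belongs to each class."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def majority_baseline(counts):
    """Accuracy of always predicting the most frequent class."""
    return max(counts.values()) / sum(counts.values())


def imbalance_ratio(counts):
    """Size of the largest class divided by the size of the smallest."""
    return max(counts.values()) / min(counts.values())


for name, counts in datasets.items():
    shares = ", ".join(f"{k}: {v:.1%}" for k, v in class_shares(counts).items())
    print(f"{name}: {shares}")
    print(f"  majority baseline: {majority_baseline(counts):.1%}")
    print(f"  imbalance ratio:   {imbalance_ratio(counts):.2f}")
```

For the iris counts the imbalance ratio is exactly 1, while for the Diabetes counts roughly 65% of the instances are tested_negative, so any classifier on that dataset should be judged against that baseline rather than against 50%.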