24CSR1R01 DSF Assignment 2

Uploaded by

Stutee Pradhan
National Institute of Technology Warangal

Department of Computer Science and Engineering


CS 16035: DATA SCIENCE FUNDAMENTALS
Class Assignment
Name: Smaraki Bhaktisudha
PhD 1st Yr., 1st Sem.
Roll No: 24CSR1R01

Assignment 2

 Download the Weka tool.

 Try different tasks with different datasets.
 Consider popular evaluation metrics for evaluating your experiments.
 Write your observations for each task across the different datasets, and for each dataset across the different tasks.

Dataset 1- Iris dataset

 The dataset has 150 rows, i.e. 150 iris flowers, and each flower is assigned to a class according to its species.
 The iris dataset has three classes: ‘Iris-setosa’, ‘Iris-versicolor’, and ‘Iris-virginica’, each containing 50
instances.
 It is a balanced dataset.
 The dataset includes four numeric attributes (sepal length, sepal width, petal length, petal width) and
one nominal class attribute.
 It is a labelled dataset, so it is suited to supervised learning.

Visualising the dataset


 From the numeric attributes (especially ‘sepallength’, ‘sepalwidth’ and ‘petalwidth’) it is quite difficult to distinguish between the flowers.
 From ‘petallength’, however, the following observations can be made:
 If the petal length of a flower is between 1 and 2.18, the flower is Iris-setosa.
 If the petal length of a flower is between 2.18 and 3.36, the flower is Iris-versicolor (very few instances, though).
 If the petal length of a flower is between 5.72 and 6.9, the flower is Iris-virginica (comparatively few instances).
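The petal-length ranges above can be written as a simple threshold rule. The following is a minimal Python sketch, not Weka output. The function name `guess_species` is hypothetical, and returning `None` for petal lengths outside the listed ranges is an assumption, since the observations do not assign those values to a single species.

```python
from typing import Optional


def guess_species(petal_length: float) -> Optional[str]:
    """Apply the petal-length ranges read off the histogram."""
    if 1.0 <= petal_length <= 2.18:
        return "Iris-setosa"
    if 2.18 < petal_length <= 3.36:
        return "Iris-versicolor"
    if 5.72 <= petal_length <= 6.9:
        return "Iris-virginica"
    # ASSUMPTION: no species is stated for any other petal length.
    return None


print(guess_species(1.4))
print(guess_species(4.5))
```

Most flowers with a petal length between 3.36 and 5.72 are left undecided by this rule, which is why a trained classifier is still needed.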

Classifying the dataset

1. Naive Bayes classifier


First I used the Naive Bayes classifier. It is usually considered a baseline and is usually very fast, although other classifiers can do better in terms of accuracy. From the confusion matrix we can say the following.
Observation:-
 All 50 Iris-setosa instances are classified correctly, as the ‘b’ and ‘c’ entries in that row are 0.
 For Iris-versicolor there were two errors: two instances were classified as ‘c’.
 For Iris-virginica 46 instances were classified correctly and 4 were misclassified.
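The counts in these observations are enough to compute the usual evaluation metrics by hand. The sketch below rebuilds the Naive Bayes confusion matrix in Python and derives accuracy, precision and recall from it. It assumes that the 4 misclassified Iris-virginica instances were predicted as Iris-versicolor, which the observations do not state; the accuracy does not depend on that assumption, but the per-class precision does. The function names are illustrative only.

```python
LABELS = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Rows = actual class, columns = predicted class (a, b, c).
# ASSUMPTION: the 4 misclassified Iris-virginica instances went to
# Iris-versicolor; the observations give only the error count.
NB_MATRIX = [
    [50, 0, 0],
    [0, 48, 2],
    [0, 4, 46],
]


def accuracy(matrix):
    """Fraction of instances on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def precision_recall(matrix, k):
    """Precision and recall for the class at index k."""
    true_positive = matrix[k][k]
    predicted = sum(row[k] for row in matrix)  # column total
    actual = sum(matrix[k])                    # row total
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / actual if actual else 0.0
    return precision, recall


print(f"accuracy = {accuracy(NB_MATRIX):.3f}")
for k, label in enumerate(LABELS):
    p, r = precision_recall(NB_MATRIX, k)
    print(f"{label}: precision = {p:.3f}, recall = {r:.3f}")
```

With 144 of 150 instances on the diagonal, the accuracy is 96%.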

2. Decision tree (J48) classifier

Observation:-
Here I used the J48 tree. From the confusion matrix we can say the following.
 All 50 Iris-setosa instances are classified correctly, as the ‘b’ and ‘c’ entries in that row are 0.
 For Iris-versicolor there was one error: one instance was classified as ‘c’.
 For Iris-virginica 48 instances were classified correctly and 2 were misclassified.
This is the tree used by the classifier to distinguish between the classes.
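Besides accuracy, Weka's summary output reports a Kappa statistic, which corrects the observed agreement for agreement expected by chance. The following Python sketch computes Cohen's kappa from the J48 counts given in the observations. It assumes the 2 misclassified Iris-virginica instances were predicted as Iris-versicolor, which is not stated; because every class has the same number of actual instances, the result is the same under either placement. The function name is illustrative.

```python
# Rows = actual class, columns = predicted class (a, b, c).
# ASSUMPTION: the 2 misclassified Iris-virginica instances went to
# Iris-versicolor; the observations give only the error count.
J48_MATRIX = [
    [50, 0, 0],
    [0, 49, 1],
    [0, 2, 48],
]


def cohen_kappa(matrix):
    """Cohen's kappa: (observed - expected) / (1 - expected)."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(size)) / total
    # Chance agreement: sum over classes of row total * column total.
    expected = sum(
        sum(matrix[k]) * sum(row[k] for row in matrix) for k in range(size)
    ) / (total * total)
    return (observed - expected) / (1 - expected)


print(f"kappa = {cohen_kappa(J48_MATRIX):.3f}")
```

For these counts the observed agreement is 147/150 and the chance agreement is 1/3, giving a kappa of 0.97.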

3. Multilayer Perceptron

Clustering
Observation:
 Four clusters were identified, despite the dataset originally having three classes. This suggests that one
or more species of iris exhibit internal variation that the EM algorithm picked up on.
 Cluster 1 corresponds to Iris-setosa, which is easily separable due to its distinct features.
 Cluster 0 primarily corresponds to Iris-versicolor, while Clusters 2 and 3 mostly represent Iris-
virginica.
 The presence of four clusters instead of three suggests some variability or overlap between Iris-
versicolor and Iris-virginica, which might be better understood with further analysis.
 The EM algorithm's clustering model captures the natural variation within the Iris dataset, providing
insights into the subtle differences between the species.
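To make the idea behind EM concrete, the sketch below fits a two-component Gaussian mixture to a handful of one-dimensional values in pure Python. This is not Weka's implementation: the data are made-up toy values rather than Iris measurements, the number of components is fixed in advance, and the starting parameters are chosen by hand. All names are illustrative.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D normal distribution at x."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_1d(data, means, variances, weights, iterations=50):
    """Fit a 1-D Gaussian mixture with a fixed number of components."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(iterations):
        # E-step: how responsible is each component for each point?
        resp = []
        for x in data:
            dens = [weights[j] * gaussian_pdf(x, means[j], variances[j])
                    for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate each component from its weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2
                      for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a zero variance
            weights[j] = nj / len(data)
    return means, variances, weights


# Toy values only (ASSUMPTION: not taken from the Iris dataset).
TOY = [1.3, 1.4, 1.5, 1.4, 1.6, 4.5, 4.7, 5.0, 5.1, 4.9]
means, variances, weights = em_1d(TOY, [1.0, 5.0], [1.0, 1.0], [0.5, 0.5])
print([round(m, 2) for m in means])
```

Each point receives a soft membership in every component, so overlapping groups such as Iris-versicolor and Iris-virginica can end up shared between clusters or split into more clusters than there are classes.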

Ranking the Attributes

Observation:
 According to the Ranker, the attribute “sepalwidth” has the least impact on the classification result.
 The screenshots below show that sepalwidth has the least effect, so removing it before classifying should not change the result much.
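A ranking like this comes from scoring every attribute against the class and sorting the scores. The report does not say which attribute evaluator was paired with the Ranker, so the Python sketch below assumes information gain purely for illustration. It uses a tiny made-up table with nominal values (numeric attributes such as sepalwidth would first have to be discretised), and all names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in class entropy after splitting on a nominal attribute."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def rank_attributes(columns, labels):
    """Return (attribute, score) pairs sorted from most to least useful."""
    scores = {name: information_gain(vals, labels)
              for name, vals in columns.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy table (ASSUMPTION: invented values, not the Iris data).
labels = ["A", "A", "B", "B"]
columns = {
    "uninformative": ["x", "y", "x", "y"],
    "informative": ["low", "low", "high", "high"],
}
ranking = rank_attributes(columns, labels)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

An attribute at the bottom of such a ranking tells the classifier little about the class, which is the sense in which sepalwidth is described as least impactful.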
Dataset 2- Diabetes dataset

 The dataset has 768 instances, i.e. 768 patient records, and each record is assigned to a class according to its diabetes test result.
 There are 8 numerical input features (attributes) and 1 binary output variable (class - tested_negative /
tested_positive).
 It is an imbalanced dataset.
 It is a labelled dataset, so it is suited to supervised learning.

Visualising the dataset


Classifying the dataset

1. Naive Bayes classifier

2. Decision tree (J48) classifier

This is the tree used by the classifier to distinguish between the classes.

3. Multilayer Perceptron

Clustering

Observation:
 Three clusters were identified, each representing different subgroups in the dataset.
 Cluster 0 represents individuals with higher glucose and insulin levels, possibly indicating higher diabetes risk.
 Cluster 1 represents healthier, younger individuals with lower measurements across most features,
corresponding mainly to non-diabetic cases.
 Cluster 2 includes older individuals with more pregnancies, with moderate glucose levels and possibly
missing insulin data, indicating another at-risk group.
 The EM algorithm's clustering provides insights into the underlying structure of the data, potentially
guiding further analysis or targeted interventions.
Differences between the observations when using different classifiers on different datasets:
 The Naive Bayes classifier worked quite well for the dataset with fewer instances, i.e. the Iris dataset. As the number of instances increased in the second dataset, it did not perform as well.
 For the Iris dataset the Multilayer Perceptron performed best at classification, whereas for the Diabetes dataset the Decision tree (J48) classifier performed best.
 For the Iris dataset the three classes (‘Iris-setosa’, ‘Iris-versicolor’, and ‘Iris-virginica’) contain 50 instances each.
 For the Diabetes dataset the two classes ‘tested_negative’ and ‘tested_positive’ have 500 and 268 instances respectively.
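The class counts above also give a reference point for reading accuracy figures on the two datasets: a classifier that always predicts the most frequent class (what Weka's ZeroR does) already reaches a certain accuracy without learning anything. The Python sketch below computes that majority-class baseline; the function name is illustrative.

```python
def majority_baseline(class_counts):
    """Return the most frequent class and the accuracy of always predicting it."""
    total = sum(class_counts.values())
    label = max(class_counts, key=class_counts.get)
    return label, class_counts[label] / total


iris = {"Iris-setosa": 50, "Iris-versicolor": 50, "Iris-virginica": 50}
diabetes = {"tested_negative": 500, "tested_positive": 268}

for name, counts in [("Iris", iris), ("Diabetes", diabetes)]:
    label, acc = majority_baseline(counts)
    print(f"{name}: always predicting {label} gives accuracy {acc:.3f}")
```

For the Iris dataset this baseline is one third, while for the Diabetes dataset it is 500/768, about 65%. Accuracies on the imbalanced Diabetes dataset should therefore be judged against that higher floor.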
