
QUESTION BANK FOR UNIT 3: CLASSIFICATION & REGRESSION



Topic: Decision trees

Theory questions

1. Why use decision trees?

There are various algorithms in machine learning, so choosing the best algorithm for the given dataset and problem is the main point to remember while creating a machine learning model. Below are the main reasons for using a decision tree:
 Decision Tree is a supervised learning technique that can be used for both classification and regression problems, but mostly it is preferred for solving classification problems.
 It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules and each leaf node represents the outcome.
 In a decision tree, there are two kinds of nodes: the decision node and the leaf node. Decision nodes are used to make decisions and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches.
 The decisions or tests are performed on the basis of the features of the given dataset.
 It is a graphical representation for getting all the possible solutions to a problem/decision based on given conditions.
 It is called a decision tree because, similar to a tree, it starts with the root node, which expands on further branches and constructs a tree-like structure.
 A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into sub-trees. The diagram below explains the general structure of a decision tree.

 Decision trees mimic human thinking while making a decision, so they are easy to understand.
 The logic behind a decision tree can be easily understood because it shows a tree-like structure.

2. Explain decision tree terminology.

A decision tree comprises a root node, leaf nodes, branches, parent/child nodes, etc. The following is an explanation of this terminology.
 Root Node: the root node is where the decision tree starts. It represents the entire dataset, which further gets divided into two or more homogeneous sets.
 Leaf Node: leaf nodes are the final output nodes; the tree cannot be segregated further after reaching a leaf node.
 Splitting: the process of dividing the decision node/root node into sub-nodes according to the given conditions.
 Branch/Sub-Tree: a subtree formed by splitting the tree.
 Pruning: the process of removing unwanted branches from the tree.
 Parent/Child Node: the root node of the tree is called the parent node, and the other nodes are called child nodes.

3. How does the decision tree algorithm work for classification?

In a decision tree, to predict the class of a given record, the algorithm starts from the root node of the tree. It compares the value of the root attribute with the corresponding attribute of the record (from the real dataset) and, based on the comparison, follows the branch and jumps to the next node. For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues this process until it reaches a leaf node of the tree.
 Step 1: Begin the tree with the root node, say S, which contains the complete dataset.
 Step 2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM), i.e. information gain or the Gini index.
 Step 3: Divide S into subsets that contain the possible values of the best attribute.
 Step 4: Generate the decision tree node that contains the best attribute.
 Step 5: Recursively make new decision trees using the subsets of the dataset created in Step 3. Continue this process until a stage is reached where the nodes cannot be classified further; these final nodes are called leaf nodes. A minimal code sketch follows.
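The steps above map directly onto a library call. The following is a minimal sketch using scikit-learn and its bundled Iris dataset (both assumed choices for illustration; the steps themselves are library-agnostic):

```python
# Minimal decision tree classification sketch (scikit-learn, Iris data).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# criterion="entropy" selects splits by information gain; "gini" is the default ASM.
tree = DecisionTreeClassifier(criterion="entropy", random_state=42)
tree.fit(X_train, y_train)      # Steps 1-5 happen inside fit(), recursively
print("Test accuracy:", tree.score(X_test, y_test))
```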


Mathematics based questions

5. Explain entropy reduction, information gain and Gini index in decision tree.

While implementing a decision tree, the main issue that arises is how to select the best attribute for the root node and for the sub-nodes. To solve such problems there is a technique called the Attribute Selection Measure (ASM). With this measure, we can easily select the best attribute for the nodes of the tree. There are two popular ASM techniques:
Information Gain:
 Information gain is the measurement of the change in entropy after segmenting a dataset based on an attribute.
 It calculates how much information a feature provides about a class.
 According to the value of information gain, we split the node and build the decision tree. A decision tree algorithm always tries to maximize the value of information gain, and the node/attribute having the highest information gain is split first. It can be calculated using the formula below:
Information Gain = Entropy(S) − [(weighted average) × Entropy(each subset)]
Entropy:
Entropy is a metric that measures the impurity of a given attribute. It specifies the randomness in the data. Entropy can be calculated as:
Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)
where S is the set of samples, P(yes) is the probability of class "yes" and P(no) is the probability of class "no".
Gini Index:
 The Gini index is a measure of impurity or purity used while creating a decision tree in the CART (Classification and Regression Tree) algorithm.
 An attribute with a low Gini index should be preferred over one with a high Gini index. The Gini index can be calculated using the formula: Gini Index = 1 − Σj Pj²
 It creates only binary splits; the CART algorithm uses the Gini index to create binary splits. A sketch of these measures in code follows.
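As a rough sketch, the three measures above can be written out in plain Python (the toy labels below are invented for illustration):

```python
# Plain-Python helpers for the ASM formulas above.
import math
from collections import Counter

def entropy(labels):
    # Entropy(S) = -sum over classes of P(class) * log2(P(class))
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    # Gini = 1 - sum over classes of P(class)^2
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, subsets):
    # IG = Entropy(S) - weighted average entropy of the subsets after the split
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in subsets)

# A perfectly separating split of a 50/50 node yields the maximum gain of 1 bit:
print(information_gain(["y", "y", "n", "n"], [["y", "y"], ["n", "n"]]))  # 1.0
```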

6. What are the advantages and limitations of decision trees?

Advantages of the Decision Tree


 It is simple to understand, as it follows the same process that a human follows while making a decision in real life.
 It can be very useful for solving decision-related problems.
 It helps in thinking about all the possible outcomes of a problem.
 It requires less data cleaning compared to other algorithms.

Disadvantages of the Decision Tree


 A decision tree can contain many layers, which makes it complex.
 It may have an overfitting issue, which can be addressed using the Random Forest algorithm.
 With more class labels, the computational complexity of the decision tree may increase.

7. A decision tree often tends to overfit during training. What is the reason behind this, and how can it be avoided?

A decision tree tends to overfit because at each node it makes a decision among a subset of all the features (columns), so by the time it reaches a final decision the result is a complicated and long decision chain. Only if a data point satisfies all the rules along this chain can the final decision be made. Such specific rules make the tree very particular to the training set and, on the other hand, unable to generalize well to new data points that it has never seen. Especially when the dataset has many features (high dimensionality), the tree tends to overfit more.

In the J48 decision tree, overfitting happens when the algorithm gets information with exceptional attributes. This causes many fragmentations in the process distribution; statistically unimportant nodes with very few examples are known as fragmentations. Usually the J48 algorithm builds trees and grows branches 'just deep enough to perfectly classify the training examples'. This approach performs well with noise-free data, but most of the time this strategy overfits the training examples when the data are noisy.

At present, two strategies are widely used to bypass overfitting in decision tree learning: 1) stop the tree from growing before it reaches the point of perfectly classifying the training data; 2) let the tree overfit the training data and then post-prune it. By default, a decision tree model is allowed to grow to its full depth. Pruning refers to a technique that removes parts of the decision tree to prevent it from growing to its full depth. By tuning the hyperparameters of the decision tree model, one can prune the trees and prevent them from overfitting. There are two types of pruning: pre-pruning and post-pruning. The following is an in-depth look at each of these pruning techniques, with hands-on sketches.

Pre-Pruning:
The pre-pruning technique refers to the early stopping of the growth of the decision tree. It involves tuning the hyperparameters of the decision tree model prior to the training pipeline. Hyperparameters such as max_depth, min_samples_leaf and min_samples_split can be tuned to stop the growth of the tree early and prevent the model from overfitting.
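A minimal pre-pruning sketch in scikit-learn (an assumed library; the hyperparameter values are arbitrary illustrations, not recommendations):

```python
# Early-stopping ("pre-pruning") via decision tree hyperparameters.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
pre_pruned = DecisionTreeClassifier(
    max_depth=4,           # stop growing beyond 4 levels
    min_samples_split=10,  # a node needs at least 10 samples to be split
    min_samples_leaf=5,    # every leaf must keep at least 5 samples
).fit(X, y)
print("Depth after early stopping:", pre_pruned.get_depth())
```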


Post-Pruning:
The post-pruning technique allows the decision tree model to grow to its full depth and then removes branches to prevent the model from overfitting. Cost complexity pruning (CCP) is one type of post-pruning technique; its ccp_alpha parameter can be tuned to obtain the best-fit model.
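A sketch of cost complexity pruning with scikit-learn's ccp_alpha (Iris data assumed for illustration); larger alphas prune away more of the fully grown tree:

```python
# Post-pruning: sweep the effective alphas returned by the pruning path.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
    print(f"ccp_alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}")
```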

Problems/Numerical

8. Problems on calculating entropy and information gain

Problem 1:
If we decided to arbitrarily label all 4 gumballs as red, how often would one of the gumballs be incorrectly labelled?
4 red and 0 blue:

The impurity measurement is 0 because we would never incorrectly label any of the 4 red gumballs here. If we instead took the probabilities with respect to the class 'blue', the index would still be 0: the Gini score is the same no matter which class you take the probabilities of, because the class proportions always add to 1 in the formula above.
A Gini score of 0 is the purest score possible.
2 red and 2 blue:

The impurity measurement is 0.5 because we would label gumballs wrong about half the time. Because this index is used with binary target variables (0, 1), a Gini index of 0.5 is the least pure score possible: half is one type and half is the other. Dividing Gini scores by 0.5 can help you intuitively understand what the score represents: 0.5/0.5 = 1, meaning the grouping is as impure as possible (in a group with just 2 outcomes).
3 red and 1 blue:

The impurity measurement here is 1 − (0.75² + 0.25²) = 0.375. If we divide this by 0.5 for a more intuitive reading we get 0.75, i.e. the grouping is 75% as impure as the worst case.
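These three scores can be checked with a few lines of Python (a hedged sketch, mirroring the Gini formula from Question 5):

```python
# Gini impurity for the three gumball scenarios.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["r"] * 4))             # 4 red, 0 blue -> 0.0   (pure)
print(gini(["r", "r", "b", "b"]))  # 2 red, 2 blue -> 0.5   (least pure)
print(gini(["r", "r", "r", "b"]))  # 3 red, 1 blue -> 0.375
```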
Problem 2:
How does entropy work with the same gumball scenarios stated in problem 1?
4 red and 0 blue:
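The worked figures for this problem are missing here, but the values follow directly from the entropy formula in Question 5; a small sketch reproducing all three scenarios:

```python
# Entropy (in bits) for the same three gumball scenarios.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["r"] * 4))             # 4 red, 0 blue -> 0.0    (pure)
print(entropy(["r", "r", "b", "b"]))  # 2 red, 2 blue -> 1.0    (maximal)
print(entropy(["r", "r", "r", "b"]))  # 3 red, 1 blue -> ~0.811
```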


majority voting to decide on the predicted class, and in the case of regression, we will
take the mean value of the predictions of all the estimators.

Random forest inference for a simple classification example with Ntree = 3


This use of many estimators is the reason why the random forest algorithm is called an
ensemble method. Each individual estimator is a weak learner, but when many weak
estimators are combined together they can produce a much stronger learner. Ensemble
methods take a 'strength in numbers' approach, where the output of many small models is
combined to produce a much more accurate and powerful prediction.
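A minimal ensemble sketch in scikit-learn (assumed library; n_estimators = 3 mirrors the Ntree = 3 illustration above and is far smaller than a typical forest):

```python
# Random forest: many trees vote, the majority class wins.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=3, random_state=0).fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))
# Inspect the vote of each individual weak learner on one test sample.
print("Tree votes:", [int(t.predict(X_test[:1])[0]) for t in forest.estimators_])
```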

12. What are the advantages and limitations of random forest?

Advantages of Random Forest


 Random forest is capable of performing both classification and regression tasks.
 It is capable of handling large datasets with high dimensionality.
 It enhances the accuracy of the model and prevents the overfitting issue.
 It is fast and can deal with missing data as well.
 Using a random forest, you can compute the relative feature importance.
 It can give good accuracy even when a large proportion of the data is missing.
Limitations of Random Forest
 Although random forest can be used for both classification and regression tasks, it is less suitable for regression tasks.
 Random forest is a complex algorithm that is not easy to interpret.
 Its complexity is large.
 Predictions from a random forest take more time compared to other algorithms.
 Higher computational resources are required to use a random forest algorithm.


13. What is the difference between a simple decision tree and a random forest?

SN | Random Forest | Decision Tree
1. | While building a random forest, subsets of the rows are selected at random for each tree. | A single tree is built from the full dataset.
2. | It combines two or more decision trees together. | A decision tree is a single structure of decisions over the dataset's variables or attributes.
3. | It gives more accurate results. | It gives less accurate results.
4. | By using multiple trees, it reduces the chances of overfitting. | A decision tree has a higher possibility of overfitting, an error that occurs due to variance or due to bias.
5. | A random forest is more complicated to interpret. | A decision tree is simple, so it is easy to read and understand.
6. | In a random forest we need to generate, process and analyze many trees, so the process is slow; it may take hours or even days. | A decision tree is less accurate but processes fast, which means it is fast to implement.
7. | It requires more computation because it has n decision trees: the more trees, the more computation. | It requires less computation.
8. | Its visualization is complex, but it plays an important role in showing hidden patterns behind the data. | It is simple to visualize, because we just need to fit one decision tree model.
9. | Classification and regression problems can be solved using a random forest. | A decision tree is likewise used to solve classification and regression problems.
10. | It uses the random subspace method and bagging during tree construction, and has built-in feature importance. | A decision is made at each internal node based on a selected feature; decision tree learning is the process of finding the optimal split value for each internal node.
Decision trees are simple but suffer from some serious problems: overfitting, and error due to variance or bias. A random forest is a collection of decision trees with a single, aggregated result. Using multiple trees in the random forest reduces the chances of
bias. Random forests, on the other hand, are a powerful modelling tool that is far more
resilient than a single decision tree. They combine numerous decision trees to reduce
overfitting and bias-related inaccuracy, and hence produce usable results.



Topic: Naive Bayes

Theory questions

17. Why use Naive Bayes algorithm?

 The Naïve Bayes algorithm is a supervised learning algorithm, based on Bayes' theorem and used for solving classification problems.
 It is mainly used in text classification with high-dimensional training datasets.
 The Naïve Bayes classifier is one of the simplest and most effective classification algorithms; it helps in building fast machine learning models that can make quick predictions.
 It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
 Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment analysis, and classifying articles. A spam-filtering sketch follows below.
The name Naïve Bayes combines the words Naïve and Bayes, which can be described as:
 Naïve: it is called naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of colour, shape and taste, then a red, spherical and sweet fruit is recognized as an apple: each feature individually contributes to identifying it as an apple, without depending on the others.
 Bayes: it is called Bayes because it depends on the principle of Bayes' theorem.
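A hedged sketch of the spam-filtering use case with scikit-learn's multinomial Naive Bayes (the toy messages are invented for illustration):

```python
# Text classification with Naive Bayes: word counts as features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

messages = ["win a free prize now", "lowest price guaranteed",
            "meeting at noon tomorrow", "lunch with the team today"]
labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB()).fit(messages, labels)
print(model.predict(["free prize meeting"]))        # predicted class
print(model.predict_proba(["free prize meeting"]))  # columns: P(ham), P(spam)
```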

18. What are the Pros and Cons of using Naive Bayes?

Advantages of the Naïve Bayes Classifier:

 Naïve Bayes is a fast and easy ML algorithm for predicting the class of a dataset.
 It can be used for binary as well as multi-class classification.
 It performs well in multi-class predictions compared to the other algorithms.
 It is the most popular choice for text classification problems.
Disadvantages of the Naïve Bayes Classifier:
 Naïve Bayes assumes that all features are independent or unrelated, so it cannot learn relationships between features.


 It requires the predictors to be independent, but in most real-life cases the predictors are dependent, which hinders the performance of the classifier.

19. How does the Bayes algorithm differ from decision trees?

 A decision tree is a discriminative model, whereas Naive Bayes is a generative model.
 Decision trees are more flexible and easy. However, decision tree pruning may discard some key values in the training data, which can hurt accuracy.
 A major advantage of Naive Bayes classifiers is that they are not prone to overfitting, thanks to the fact that they "ignore" irrelevant features. They are, however, prone to poisoning: a misclassification that occurs when, at prediction time, the data contain features that were uncommon in the training set.
 Naive Bayes classifiers are easily implemented and highly scalable, with linear computational complexity with respect to the number of data entries.
 Naive Bayes is strongly associated with text-based classification; example applications include, but are not limited to, spam filtering and text categorization. This is because the presence of certain words is strongly linked to their respective categories, so the mutual independence assumption holds more strongly.
 Unfortunately, several datasets require that some features be hand-picked before the classifier can work as intended.
 Decision trees are very flexible, easy to understand, and easy to debug. They work with both classification and regression problems, so whether you are trying to predict a categorical value like (red, green, up, down) or a continuous value like 2.9 or 3.4, decision trees handle both. One of the most convenient things about decision trees is that they only need a table of data and will build a classifier directly from that data without any up-front design work. To some degree, properties that don't matter won't be chosen as splits and will eventually get pruned, so the method is very tolerant of nonsense; to start, it is set-it-and-forget-it.
 However, there is a downside. Simple decision trees tend to overfit the training data more than other techniques, which means you generally have to do tree pruning and tune the pruning procedures. You don't pay an up-front design cost, but you pay it back in tuning the tree's performance. Also, simple decision trees divide the data into axis-aligned boxes, so building clusters around things means the tree has to split a lot to encompass the clusters of data. Splitting a lot leads to complex trees and raises the probability of overfitting. Tall trees get pruned back, so while you can build a cluster around some feature in the data, it might not survive the pruning process. There are other techniques, such as surrogate splits, which let you split along several variables at once, creating splits in the space that are neither horizontal nor vertical (0 < slope < infinity). Cool, but your tree starts to



Topic: Support Vector Machine

Theory questions

23. What is a Support Vector Machine?

Support Vector Machine (SVM) is one of the most popular supervised learning algorithms and is used for classification as well as regression problems. Primarily, however, it is used for classification problems in machine learning. The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes, so that we can easily put new data points in the correct category in the future. This best decision boundary is called a hyperplane. SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine. Consider the diagram below, in which two different categories are classified using a decision boundary or hyperplane:

24. How does the SVM work?

SVM works by mapping data to a high-dimensional feature space so that data points can be
categorized, even when the data are not otherwise linearly separable.
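A minimal sketch of this idea with scikit-learn's SVC (assumed library): concentric circles are not linearly separable in two dimensions, but an RBF kernel implicitly maps them to a space where a separating hyperplane exists:

```python
# Kernel SVM on data that no straight line can separate.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
clf = SVC(kernel="rbf").fit(X, y)
print("Training accuracy:", clf.score(X, y))
print("Support vectors per class:", clf.n_support_)  # the extreme points
```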
Figure: original dataset → data with separator added → transformed data

A separator between the categories is found, and then the data are transformed in such a
way that the separator could be drawn as a hyperplane. Following this, characteristics of new




Topic: Logistic Regression

Theory questions

34. Why use Logistic Regression?

 Logistic regression is one of the most popular machine learning algorithms and comes under the supervised learning technique. It is used for predicting a categorical dependent variable from a given set of independent variables.
 Logistic regression predicts the output of a categorical dependent variable, so the outcome must be a categorical or discrete value: Yes or No, 0 or 1, true or false, etc. However, instead of giving exact values of 0 and 1, it gives probabilistic values that lie between 0 and 1.
 Logistic regression is very similar to linear regression, except in how they are used: linear regression is used for solving regression problems, whereas logistic regression is used for solving classification problems.
 In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, whose output is bounded by the two limiting values 0 and 1. The curve from the logistic function indicates the likelihood of something, such as whether cells are cancerous or not, or whether a mouse is obese or not based on its weight.
 Logistic regression is a significant machine learning algorithm because it can provide probabilities and classify new data using continuous and discrete datasets. It can classify observations using different types of data and can easily determine the most effective variables for the classification. The image below shows the logistic function:
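A small sketch of the "S"-shaped logistic (sigmoid) function and a one-feature logistic regression fit (NumPy and scikit-learn assumed; the toy data are invented):

```python
# The sigmoid squashes any real number into (0, 1).
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # ~[0.007, 0.5, 0.993]

# One feature: class 1 becomes likelier as x grows.
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[2.2]]))  # probabilistic output between 0 and 1
```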


hθ(x) approaches 1. Conversely, the cost to pay grows to infinity as hθ(x) approaches 0. This is a desirable property: we want a bigger penalty when the algorithm predicts something far away from the actual value. If the label is y = 1 but the algorithm predicts hθ(x) = 0, the outcome is completely wrong. Conversely, the same intuition applies when y = 0, as depicted in plot 2 below, right side: bigger penalties when the label is y = 0 but the algorithm predicts hθ(x) = 1. The above two functions can be compressed into a single function, i.e. Cost(hθ(x), y) = −y log(hθ(x)) − (1 − y) log(1 − hθ(x)).



Topic: K-Means & K-Nearest Neighbor (KNN)

Theory questions

40. What is meant by the K-Nearest Neighbor algorithm?

 K-Nearest Neighbour (K-NN) is one of the simplest machine learning algorithms, based on the supervised learning technique.
 The K-NN algorithm assumes similarity between the new case/data and the available cases, and puts the new case into the category most similar to the available categories.
 The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category using the K-NN algorithm.
 The K-NN algorithm can be used for regression as well as classification, but mostly it is used for classification problems.
 K-NN is a non-parametric algorithm, which means it makes no assumptions about the underlying data.
 It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset, and at the time of classification it performs an action on the dataset.
 At the training phase, the KNN algorithm just stores the dataset; when it gets new data, it classifies that data into the category most similar to the new data.
 Example: suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will compare the features of the new image with those of the cat and dog images and, based on the most similar features, will put it in either the cat or the dog category. A minimal sketch follows.
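A minimal K-NN sketch in scikit-learn (assumed library): "training" is just storing the data, and a query is classified by majority vote among its k nearest stored samples:

```python
# Lazy learning: fit() stores the data; the work happens at prediction time.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print("Predicted:", knn.predict(X[:1]), "actual:", y[0])
# Distances to, and indices of, the 5 stored samples nearest the query.
print(knn.kneighbors(X[:1]))
```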


Advantages
 The algorithm is simple and easy to implement.
 There's no need to build a model, tune several parameters, or make additional assumptions.
 The algorithm is versatile: it can be used for classification, regression, and search.
Disadvantages
 The algorithm gets significantly slower as the number of examples and/or predictors/independent variables increases.

Mathematics based questions

48. How to choose the right value of K in KNN?

To select the K that's right for your data, we run the KNN algorithm several times with different values of K and choose the K that reduces the number of errors we encounter while maintaining the algorithm's ability to accurately make predictions when it's given data it hasn't seen before. Here are some things to keep in mind:
 As we decrease the value of K to 1, our predictions become less stable. Just think for a minute: imagine K = 1 and a query point surrounded by several reds and one green, where the green point happens to be the single nearest neighbour. Reasonably, we would think the query point is most likely red, but because K = 1, KNN incorrectly predicts that the query point is green.
 Inversely, as we increase the value of K, our predictions become more stable due to majority voting/averaging, and thus more likely to be accurate (up to a certain point). Eventually, we begin to witness an increasing number of errors; it is at this point that we know we have pushed the value of K too far.
 In cases where we are taking a majority vote (e.g. picking the mode in a classification problem) among labels, we usually make K an odd number to have a tiebreaker. A minimal selection loop is sketched below.
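A sketch of this selection procedure (scikit-learn assumed; the range of candidate K values is an arbitrary illustration):

```python
# Try several odd K values; keep the one with the best cross-validated accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = {}
for k in range(1, 22, 2):  # odd values of K act as a tiebreaker
    scores[k] = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print("Best K:", best_k, "CV accuracy:", round(scores[best_k], 3))
```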

49. How to choose the value of "K number of clusters" in K-means Clustering?

The performance of the K-means clustering algorithm depends on the efficiency of the clusters it forms, but choosing the optimal number of clusters is a big task. There are several ways to find the optimal number of clusters; here we discuss the most appropriate method, given below:
Elbow Method
The elbow method is one of the most popular ways to find the optimal number of clusters. It uses the concept of the WCSS value. WCSS stands for Within-Cluster Sum of Squares, which measures the total variation within the clusters. The formula for WCSS (for 3 clusters) is given below:

WCSS = Σ(Pi in Cluster1) distance(Pi, C1)² + Σ(Pi in Cluster2) distance(Pi, C2)² + Σ(Pi in Cluster3) distance(Pi, C3)²

In the above formula, Σ(Pi in Cluster1) distance(Pi, C1)² is the sum of the squared distances between each data point Pi in Cluster 1 and its centroid C1, and likewise for the other two terms.
To measure the distance between data points and a centroid, we can use any metric such as Euclidean distance or Manhattan distance.
To find the optimal number of clusters, the elbow method follows the steps below:
 Execute K-means clustering on the given dataset for different values of K (e.g. ranging from 1 to 10).
 For each value of K, calculate the WCSS value.
 Plot a curve of the calculated WCSS values against the number of clusters K.
 The sharp point of bend, where the plot looks like an arm's elbow, is taken as the best value of K.
Since the graph shows a sharp bend that looks like an elbow, the method is known as the elbow method. The graph for the elbow method looks like the image below:

Note: We can choose the number of clusters equal to the number of data points. If we do so, the value of WCSS becomes zero, and that will be the endpoint of the plot.
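A minimal elbow-method sketch (scikit-learn assumed): KMeans exposes the WCSS defined above as its inertia_ attribute, so the loop below prints the curve whose bend identifies K:

```python
# WCSS versus K on synthetic data with 3 true clusters.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
for k in range(1, 11):
    wcss = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(f"K={k:2d}  WCSS={wcss:10.1f}")  # the drop flattens sharply after K=3
```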

Feel free to contact me on +91-8329347107 (call) / +91-9922369797 (WhatsApp);
email: adp.mech@coep.ac.in and abhipatange93@gmail.com

*********************



QUESTION BANK FOR UNIT 4: DEVELOPMENT OF ML MODEL

 Unsupervised Learning = training a model to find patterns in an unlabeled dataset (e.g. clustering).
 Validation Set = a set of observations used during model training to provide feedback on how well the current parameters generalize beyond the training set. If training error decreases but validation error increases, your model is likely overfitting and you should pause training.

7. Enlist the steps involved in the development of a classification model.

The following steps are to be considered in the development of a classification model.

1 - Data Collection
 The quantity and quality of the data dictate how accurate the model will be.
 The outcome of this step is generally a representation of the data that will be used for training.
 Using experimental data, data generated by simulations, or pre-collected data, by way of datasets from Kaggle, UCI, etc., still fits into this step.
2 - Data Preparation
 Wrangle the data and prepare it for training.
 Clean whatever may require it (remove duplicates, correct errors, deal with missing values, normalization, data type conversions, etc.).
 Randomize the data, which erases the effects of the particular order in which the data were collected and/or otherwise prepared.
 Visualize the data to help detect relevant relationships between variables or class imbalances (bias alert!), or perform other exploratory analysis.
 Split the data into training and evaluation sets.
3 - Choose a Model
 Different algorithms suit different tasks; choose the right one.
4 - Train the Model
 The goal of training is to answer a question or make a prediction correctly as often as possible.
 Linear regression example: the algorithm needs to learn values for m (or W) and b (x is the input, y is the output).
 Each iteration of the process is a training step.
5 - Evaluate the Model
 Use some metric or combination of metrics to measure the objective performance of the model.
 Test the model against previously unseen data.


 This unseen data is meant to be somewhat representative of model performance in the real world, but it still helps tune the model (as opposed to test data, which does not).
 A good train/evaluation split is 80/20, 70/30, or similar, depending on the domain, data availability, dataset particulars, etc.
6 - Hyperparameter Tuning
 This step refers to hyperparameter tuning, which is an "art form" as opposed to a science.
 Tune model parameters for improved performance.
 Simple model hyperparameters may include the number of training steps, learning rate, initialization values and distribution, etc.
7 - Make Predictions
 Further (test set) data, which have until this point been withheld from the model (and for which class labels are known), are used to test the model, giving a better approximation of how the model will perform in the real world. The whole pipeline is sketched below.
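A compressed sketch of steps 1 through 7 (scikit-learn assumed; the 80/20 split and the model choice are illustrative, not prescriptive):

```python
# End-to-end classification pipeline in miniature.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)                    # 1: data collection
X_tr, X_te, y_tr, y_te = train_test_split(           # 2: prepare and split
    X, y, test_size=0.2, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=200)             # 3: choose a model
model.fit(X_tr, y_tr)                                # 4: train
print(accuracy_score(y_te, model.predict(X_te)))     # 5 & 7: evaluate on unseen data
```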

8. Enlist the steps involved in the development of a regression model.

 Regression analysis is an analytical process whose end goal is to understand the interrelationships in the data and find as much useful information as possible.
 According to the book, there are a number of steps, which are loosely detailed below.
1 - Problem definition
 The very first step is, of course, to define the problem we are trying to solve: perhaps a business question that needs to be answered, or simply a prediction we want to make based on some set of data. At this stage we must know the target variable and the attributes we presume affect the target variable; this presumption is later analysed to judge its credibility. For the sake of our discussion, let's take the Titanic dataset as an example.
 In this dataset we have data on about 900 passengers. The problem we must solve is predicting which passengers likely survived the tragedy, given their data.

A look at the Titanic dataset

 So now we know that 'Survival' is the response variable, but of the 10 attributes given for each passenger, how do we determine which of these predictor variables affect the result? That's where data analysis comes in.


the training, while the testing set is used to test the model after the training is
completed.
 The validation dataset gives the model its first taste of unseen data. However, not all data scientists perform an initial check using validation data; they might skip this part and go directly to the testing data.

23. Explain with neat sketch K-fold cross-validation mode.

 The classifier model can be designed/trained and its performance evaluated in K-fold cross-validation mode, training mode or test mode.
 The main idea behind K-fold cross-validation is that each sample in the dataset has the opportunity of being tested. It is a special case of cross-validation where we iterate over the dataset k times. In each round, we split the dataset into k parts:
one part is used for validation, and the remaining k−1 parts are merged into a training subset for model evaluation.
 Computation time is reduced, as we repeat the process only k times (for example, 10 times when k = 10), and the estimate has reduced bias.
 Every data point gets tested exactly once and is used for training k−1 times.
 The variance of the resulting estimate is reduced as k increases. A minimal sketch follows.
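A minimal K-fold sketch (scikit-learn assumed; k = 10 and the decision tree model are illustrative choices):

```python
# Each of the k folds serves as the validation part exactly once.
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=kf)
print("Per-fold accuracy:", scores.round(2))
print("Mean accuracy:", round(scores.mean(), 3))  # every sample tested exactly once
```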

24. Explain with neat sketch 5-fold cross-validation mode.


25. What is hyperparameter tuning?

• Machine learning algorithms have hyperparameters that allow you to tailor the behavior of the algorithm to your specific dataset.
• Hyperparameters are different from parameters, which are the internal coefficients or weights of a model found by the learning algorithm. Unlike parameters, hyperparameters are specified by the practitioner when configuring the model.
• Typically, it is challenging to know what values to use for the hyperparameters of a given algorithm on a given dataset; therefore it is common to use random or grid search strategies over different hyperparameter values.
• The more hyperparameters an algorithm has that need tuning, the slower the tuning process. It is therefore desirable to select a minimum subset of model hyperparameters to search or tune. A grid-search sketch follows.
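A grid-search sketch (scikit-learn assumed; the deliberately small grid is an arbitrary illustration, and RandomizedSearchCV is the analogous random-search strategy):

```python
# Exhaustive search over a small decision tree hyperparameter grid.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
grid = {"max_depth": [2, 4, 8, None], "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), grid, cv=5)
search.fit(X, y)
print("Best hyperparameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))
```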

26. Explain hyperparameter tuning for a simple decision tree.

max_depth: the maximum depth of the tree. If this is not specified, nodes are expanded until all leaf nodes are pure or until all leaf nodes contain fewer than min_samples_split samples.
 Default = None
 Input options → integer
min_samples_split: the minimum number of samples required to split an internal node. If the number of samples in an internal node is less than min_samples_split, that node becomes a leaf node.
 Default = 2
 Input options → integer or float (if float, min_samples_split is a fraction)
min_samples_leaf: the minimum number of samples required to be at a leaf node. A split can only happen if it leaves at least min_samples_leaf samples in both of the resulting nodes.
 Default = 1
 Input options → integer or float (if float, min_samples_leaf is a fraction)
max_features: the number of features to consider when looking for the best split. For example, if there are 35 features in a dataframe and max_features is 9, only 9 of the 35 features will be considered at each split.
 Default = None
 Input options → integer, float (if float, max_features is a fraction) or {"auto", "sqrt", "log2"}
 "auto": max_features = sqrt(n_features)
 "sqrt": max_features = sqrt(n_features)


QUESTION BANK FOR UNIT 5: REINFORCED AND DEEP LEARNING



Topic: Characteristics of reinforced learning

Theory questions

1. What is reinforcement learning? State one practical example.

 Reinforcement learning is a feedback-based machine learning technique in which an agent learns to behave in an environment by performing actions and seeing the results of those actions. For each good action the agent gets positive feedback, and for each bad action the agent gets negative feedback or a penalty.
 In reinforcement learning, the agent learns automatically from feedback, without any labeled data, unlike in supervised learning.
 Since there is no labeled data, the agent is bound to learn from its experience only.
 RL solves a specific type of problem where decision making is sequential and the goal is long-term, such as game playing, robotics, etc.
 The agent interacts with the environment and explores it by itself. The primary goal of an agent in reinforcement learning is to improve its performance by accumulating the maximum positive reward.
 The agent learns by trial and error and, based on experience, learns to perform the task in a better way. Hence, we can say that "reinforcement learning is a type of machine learning method where an intelligent agent (computer program) interacts with the environment and learns to act within it." How a robotic dog learns the movement of its limbs is an example of reinforcement learning.
 It is a core part of artificial intelligence, and many AI agents work on the concept of reinforcement learning. Here we do not need to pre-program the agent, as it learns from its own experience without any human intervention.
 Example:
Suppose there is an AI agent present within a maze environment, and its goal is to find the diamond. The agent interacts with the environment by performing actions; based on those actions, the state of the agent changes, and it also receives a reward or penalty as feedback.
The agent keeps doing these three things (take an action, change state or remain in the same state, get feedback), and by doing them it learns and explores the environment.
The agent learns which actions lead to positive feedback or rewards and which actions lead to negative feedback or penalties. As a positive reward the agent gets a positive point, and as a penalty it gets a negative point. A toy version of this loop is sketched below.
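The action / state-change / feedback loop can be sketched in a few lines of plain Python. The corridor "maze", the rewards, and the tabular Q-learning update below are all invented for illustration, not part of the original question:

```python
# Toy RL loop: an agent on a corridor 0..4 learns to reach the "diamond" at 4.
import random

random.seed(0)
n_states, actions = 5, [-1, +1]          # move left / move right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.1    # learning rate, discount, exploration

for episode in range(200):
    s = 0
    while s != n_states - 1:             # episode ends at the diamond
        a = random.choice(actions) if random.random() < epsilon \
            else max(actions, key=lambda act: Q[(s, act)])
        s2 = min(max(s + a, 0), n_states - 1)         # state change
        r = 1.0 if s2 == n_states - 1 else -0.01      # reward or small penalty
        # Feedback updates the agent's estimate of the action's value.
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])
        s = s2

print([max(actions, key=lambda act: Q[(s, act)]) for s in range(n_states - 1)])
# -> [1, 1, 1, 1]: the learned policy is to always move right, toward the reward.
```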


2. State key constituents of reinforcement learning. (Explain key terms in

reinforcement learning.)

 Agent: an entity that can perceive/explore the environment and act upon it.
 Environment: the situation in which the agent is present or by which it is surrounded. In RL, we assume a stochastic environment, which means it is random in nature.
 Action: the moves taken by the agent within the environment.
 State: the situation returned by the environment after each action taken by the agent.
 Reward: feedback returned to the agent from the environment to evaluate the agent's action.
 Policy: the strategy applied by the agent to choose the next action based on the current state.
 Value: the expected long-term return with discounting, as opposed to the short-term reward.
 Q-value: mostly the same as the value, but it takes one additional parameter, the current action a.

3. State key features of reinforcement learning.

 In RL, the agent is not instructed about the environment or which actions to take.
 It is based on trial and error.
 The agent takes the next action and changes state according to the feedback from the previous action.
 The agent may get a delayed reward.


environments, where the agent can observe the environment and act on the new state. The complete process is known as the Markov Decision Process, which is explained below:

10. Explain Markov Decision Process


The Markov Decision Process (MDP) is used to formalize reinforcement learning problems. If the environment is completely observable, then its dynamics can be modeled as a Markov process. In an MDP, the agent constantly interacts with the environment and performs actions; at each action, the environment responds and generates a new state.

MDP is used to describe the environment for RL, and almost all RL problems can be formalized using an MDP.
An MDP contains a tuple of four elements (S, A, Pa, Ra):
 A set of finite states S
 A set of finite actions A
 A transition probability Pa(s, s'), the probability that action a taken in state s leads to state s'
 A reward Ra(s, s') received after transitioning from state s to state s' due to action a
MDP relies on the Markov property; to better understand MDP, we need to learn about it.
Markov Property: it says that "if the agent is present in the current state s1, performs an action a1 and moves to the state s2, then the state transition from s1 to s2 depends only on the current state and action; it does not depend on past actions, rewards, or states." In other words, as per the Markov property, the current state transition does not depend on any past action or state. Hence, an MDP is an RL problem that satisfies the Markov property. For example, in a chess game, the players focus only on the current position and do not need to remember past moves.
Finite MDP: a finite MDP has finite states, finite rewards, and finite actions. In RL, we consider only finite MDPs.
Markov Process:
A Markov process is a memoryless process with a sequence of random states S1, S2, ..., St that satisfies the Markov property. A Markov process is also known as a Markov chain, which is a tuple (S,


Now, how is the price prediction made by the hidden layers? How is computation done inside the hidden layers? This is explained with the help of activation functions, loss functions, and optimizers.

20. What are activation functions?

Each neuron has an activation function that performs a computation. Different layers can have different activation functions, but neurons belonging to one layer have the same activation function. In a DNN, a weighted sum of the inputs is calculated from the weights and inputs provided. Then the activation function comes into the picture: it operates on the weighted sum and converts it into the output.

21. Why are activation functions required?

Activation functions help the model learn the complex relationships that exist within the dataset. If we did not use an activation function in the neurons and gave the weighted sum directly as output, computation would be difficult, as there is no specific range for the weighted sum. So the activation function helps to keep the output in a particular range. Secondly, a non-linear activation function is always preferred, as it adds non-linearity; without it the network would reduce to a simple linear regression model incapable of exploiting its hidden layers. The ReLU function or one of its variants is mostly used for hidden layers, and the sigmoid/softmax function is mostly used for the final layer in binary/multi-class classification problems. A small sketch follows.
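A small NumPy sketch of the activation functions just mentioned, applied to an illustrative weighted sum:

```python
# Common activations: ReLU for hidden layers, sigmoid/softmax for outputs.
import numpy as np

def relu(z):               # hidden layers: max(0, z)
    return np.maximum(0.0, z)

def sigmoid(z):            # binary-classification output: values in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):            # multi-class output: probabilities summing to 1
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([-2.0, 0.5, 3.0])   # an illustrative weighted sum
print(relu(z))
print(sigmoid(z))
print(softmax(z))
```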


22. What is a loss/cost function?

To train the model, we give input (departure location, arrival location and departure date, in the case of train price prediction) to the network and let it predict the output using the activation functions. Then we compare the predicted output with the actual output and compute the error between the two values. This error between the two values is computed using a loss/cost function. The same process is repeated for the entire training dataset, and we get the average loss/error. Now the objective is to minimize this loss to make the model accurate. There is a weight on each connection between two neurons. Initially, weights are randomly initialized, and the motive is to update these weights with every iteration so as to reach the minimum of the loss/cost function. We could change the weights randomly, but that is not an efficient method. Here comes the role of optimizers, which update the weights automatically.

23. What are the different loss functions and their use cases?

The loss function is chosen based on the problem. The three cases below are illustrated in code afterwards.

a. Regression problems
Mean squared error (MSE) is used where a real-valued quantity is to be predicted, e.g. in train price prediction, as the predicted price is a real-valued quantity.
b. Binary/multi-class classification problems
Cross-entropy is used.
c. Maximum-margin classification
Hinge loss is used.
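The three losses on toy numbers (all values invented for illustration):

```python
# One loss per problem type, computed by hand with NumPy.
import numpy as np

# a. Regression: mean squared error (e.g. train price prediction)
y_true, y_pred = np.array([250.0, 310.0]), np.array([240.0, 330.0])
print("MSE:", np.mean((y_true - y_pred) ** 2))                # 250.0

# b. Binary classification: cross-entropy
t, p = np.array([1.0, 0.0]), np.array([0.9, 0.2])
print("Cross-entropy:", -np.mean(t * np.log(p) + (1 - t) * np.log(1 - p)))

# c. Maximum-margin classification: hinge loss (labels in {-1, +1})
yy, score = np.array([1.0, -1.0]), np.array([0.8, 0.3])
print("Hinge:", np.mean(np.maximum(0.0, 1.0 - yy * score)))   # 0.75
```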

24. Explain optimizers. Why are optimizers required?

Once the loss for one iteration is computed, an optimizer is used to update the weights. Instead of changing weights manually, optimizers update the weights automatically in small increments and help to find the minimum of the loss/cost function. The magic of DL! Finding the minimum of the cost function requires iterating through the dataset many times and thus requires large computational power. The common technique used to update these weights is gradient descent, sketched below.
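A minimal gradient-descent sketch (all numbers invented): fit y = w·x to a single data pattern by repeatedly stepping the weight against the gradient of the squared loss:

```python
# One weight, one data point: watch w converge to the ideal value 4.
x, y_true = 2.0, 8.0      # so the ideal weight is y_true / x = 4
w, lr = 0.0, 0.05         # initial weight, learning rate

for step in range(50):
    y_pred = w * x
    grad = 2 * (y_pred - y_true) * x   # d/dw of (w*x - y_true)^2
    w -= lr * grad                     # the optimizer's small weight update

print(round(w, 4))                     # -> 4.0 (approximately)
```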

25. What is gradient descent (GD) and what are its variants?

Gradient descent is used to find the minimum of the loss function by updating the weights. There are three variants:
a) Batch/vanilla gradient descent
 The gradient over the entire dataset is computed to perform one weight update.
 It gives good results but can be slow and requires large memory.
b) Stochastic gradient descent (SGD)
 Weights are updated for each training data point.


QUESTION BANK FOR UNIT 6: APPLICATIONS

Topic: Human Machine Interaction

1. What is human-machine interaction?

 HMI is all about how people and automated systems interact and communicate with each other. It has long ceased to be confined to traditional machines in industry and now also relates to computers, digital systems, and devices for the Internet of Things (IoT).
 More and more devices are connected and automatically carry out tasks. Operating all of
these machines, systems and devices needs to be intuitive and must not place excessive
demands on users.
 Human-machine interaction is all about how people and automated systems interact
with each other.
 HMI now plays a major role in industry and everyday life: More and more devices are
connected and automatically carry out tasks.
 A user interface that is as intuitive as possible is therefore needed to enable smooth
operation of these machines. That can take very different forms.

2. How does human-machine interaction work?

 Smooth communication between people and machines requires interfaces: The place
where or action by which a user engages with the machine.
 Simple examples are light switches or the pedals and steering wheel in a car: An action is
triggered when you flick a switch, turn the steering wheel or step on a pedal.
 However, a system can also be controlled by text being keyed in, a mouse, touch screens,
voice or gestures.
 The devices are either controlled directly (users touch the smartphone's screen or issue a verbal command), or the systems automatically identify what people want: traffic lights change color on their own when a vehicle drives over the inductive loop in the road's surface.
 Other technologies are not so much there to control devices, but rather to complement our sensory organs. One example is virtual reality glasses.
 There are also digital assistants: Chatbots, for instance, reply automatically to requests
from customers and keep on learning.
 User interfaces in HMI are the places where or actions by which the user engages with
the machine.
 A system can be operated by means of buttons, a mouse, touch screens, voice or
gesture, for instance.

Dr. Abhishek D. Patange, Mechanical Engineering, College of Engineering Pune (COEP)
