AIML Removed
Theory questions
There are various algorithms in Machine learning, so choosing the best algorithm for the
given dataset and problem is the main point to remember while creating a machine learning
model. Below are the two reasons for using the Decision tree:
Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems.
It is a tree-structured classifier, where internal nodes represent the features of a
dataset, branches represent the decision rules and each leaf node represents the
outcome.
In a decision tree, there are two types of nodes: the Decision Node and the Leaf Node.
Decision nodes are used to make a decision and have multiple branches, whereas Leaf
nodes are the outputs of those decisions and do not contain any further branches.
The decisions, or tests, are performed on the basis of the features of the given dataset.
It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
A decision tree simply asks a question and, based on the answer (Yes/No), further splits
the tree into sub-trees. The diagram below explains the general structure of a decision tree.
The decision tree comprises a root node, leaf nodes, branches, parent/child nodes, etc.
The terminology is explained below.
Root Node: The root node is where the decision tree starts. It represents the entire
dataset, which is further divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output nodes; the tree cannot be split any
further once a leaf node is reached.
Splitting: Splitting is the process of dividing the decision node/root node into
sub-nodes according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: A node that is divided into sub-nodes is called the parent node of
those sub-nodes, and the sub-nodes are called its child nodes. The root node is a parent
node that has no parent of its own.
In a decision tree, to predict the class of a given record, the algorithm starts from the
root node of the tree. It compares the value of the root attribute with the record's
(real dataset) attribute and, based on the comparison, follows the branch and jumps to the
next node. At the next node, the algorithm again compares the attribute value with the
sub-nodes and moves on. It continues this process until it reaches a leaf node of
the tree.
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM),
e.g. information gain or the Gini index.
Step-3: Divide S into subsets, one for each possible value of the best attribute.
Step-4: Generate the decision tree node that contains the best attribute.
Step-5: Recursively build new decision trees using the subsets of the dataset created in
Step-3. Continue this process until a stage is reached where the nodes cannot be classified
any further; each such final node is called a leaf node.
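The five steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not a reference implementation: it assumes categorical attributes, uses information gain as the attribute selection measure, and the names build_tree and predict as well as the toy weather data are invented for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy of the parent minus the weighted entropy of the subsets."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    weighted = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted


def build_tree(rows, labels, attributes):
    # Stop when the node is pure or no attributes remain: return a leaf.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Step-2: pick the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    # Step-3 to Step-5: split on each value of that attribute and recurse.
    node = {"attribute": best, "children": {}}
    remaining = [a for a in attributes if a != best]
    for value in sorted({row[best] for row in rows}):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["children"][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx], remaining
        )
    return node


def predict(tree, row):
    # Walk from the root to a leaf, following the branch that matches the row.
    while isinstance(tree, dict):
        tree = tree["children"][row[tree["attribute"]]]
    return tree


# Toy data, invented for illustration.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "play", "stay"]
tree = build_tree(rows, labels, ["outlook", "windy"])
print(tree)
print(predict(tree, {"outlook": "rainy", "windy": "yes"}))
```

The predict function mirrors the prediction procedure described earlier: start at the root, compare the record's attribute value, follow the matching branch, and stop at a leaf.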
5. Explain entropy reduction, information gain and Gini index in decision tree.
While implementing a decision tree, the main issue is how to select the best
attribute for the root node and for the sub-nodes. To solve this problem there is a
technique called Attribute Selection Measure, or ASM. With this measure, we
can easily select the best attribute for the nodes of the tree. There are two popular
techniques for ASM, which are:
Information Gain:
Information gain is the measurement of changes in entropy after the segmentation of a
dataset based on an attribute.
It calculates how much information a feature provides us about a class.
According to the value of information gain, we split the node and build the decision
tree. A decision tree algorithm always tries to maximize the value of information gain,
and the node/attribute with the highest information gain is split first. It can be
calculated using the formula below:
$$\text{Information Gain} = \text{Entropy}(S) - \sum_{v} \frac{|S_v|}{|S|}\,\text{Entropy}(S_v)$$
where S_v is the subset of S for which the chosen attribute takes the value v, so the sum
is the weighted average of the entropy after the split.
Entropy:
Entropy is a metric that measures the impurity in a given attribute. It specifies the
randomness in the data. Entropy can be calculated as:
$$\text{Entropy}(S) = -P(\text{yes}) \log_2 P(\text{yes}) - P(\text{no}) \log_2 P(\text{no})$$
where S is the total set of samples, P(yes) is the probability of yes and P(no) is the
probability of no.
Gini Index:
Gini index is a measure of impurity or purity used while creating a decision tree in the
CART (Classification and Regression Tree) algorithm.
An attribute with a low Gini index should be preferred over one with a high Gini
index. The Gini index can be calculated using the formula: $\text{Gini Index} = 1 - \sum_j P_j^2$
The CART algorithm uses the Gini index to choose its splits, and it creates only binary
splits.
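To make the three measures concrete, here is a small sketch that computes entropy, the Gini index and information gain from per-class sample counts. The function names and the example counts are assumptions chosen for illustration; the formulas are the ones given above.

```python
import math


def entropy(counts):
    """Entropy (base 2) from a list of per-class sample counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)


def gini(counts):
    """Gini index: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical node with 5 "yes" and 5 "no" samples, split into two children.
parent = [5, 5]
children = [[4, 1], [1, 4]]
print(round(entropy(parent), 3))                     # 1.0
print(round(gini(parent), 3))                        # 0.5
print(round(information_gain(parent, children), 3))  # 0.278
```

A split that separates the classes perfectly would bring the children's entropy to 0 and give an information gain equal to the parent's entropy.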
7. Many times, a decision tree tends to overfit during training. What is the reason?
A decision tree tends to overfit because at each node it makes a decision on a subset of
all the features (columns), so by the time it reaches a final decision it has built a long
and complicated decision chain. The final decision applies only to a data point that
satisfies every rule along this chain. Such rules are very specific to the training set and
do not generalize well to new data points the tree has never seen. The tendency to overfit
is stronger when the dataset has many features (high dimension).
In the J48 decision tree, overfitting happens when the algorithm is given data with
exceptional attribute values. This causes many fragmentations in the process distribution;
fragmentations are statistically unimportant nodes that contain very few examples. Usually
the J48 algorithm builds trees and grows their branches 'just deep enough to perfectly
classify the training examples'. This approach performs well on noise-free data, but most
of the time it overfits the training examples when the data are noisy.
At present, two strategies are widely used to avoid overfitting in decision tree learning:
1) Stop the tree from growing before it reaches the point where it perfectly classifies the
training data.
2) Let the tree overfit the training data, then post-prune it.
By default, the decision tree model is allowed to grow to its full depth. Pruning refers to
techniques that remove parts of the decision tree or prevent it from growing to its full
depth. By tuning the hyperparameters of the decision tree model, one can prune the tree and
prevent it from overfitting. There are two types of pruning: pre-pruning and post-pruning.
Each of these techniques is described below.
Pre-Pruning:
The pre-pruning technique refers to stopping the growth of the decision tree early.
It involves tuning the hyperparameters of the decision tree model
prior to the training pipeline. Hyperparameters of the decision tree such as
max_depth, min_samples_leaf and min_samples_split can be tuned to stop the growth
of the tree early and prevent the model from overfitting.
Post-Pruning:
The post-pruning technique allows the decision tree model to grow to its full depth, then
removes tree branches to prevent the model from overfitting. Cost complexity pruning
(ccp) is one type of post-pruning technique. In cost complexity pruning, the
ccp_alpha hyperparameter can be tuned to get the best-fit model.
Problems/Numerical
Problem 1:
If we decided to arbitrarily label all 4 gumballs as red, how often would one of the gumballs
be incorrectly labelled?
4 red and 0 blue:
The impurity measurement is 0 because we would never incorrectly label any of the 4 red
gumballs here. If we arbitrarily chose to label all the balls 'blue', then our index would still be
0, because we would always incorrectly label the gumballs.
The Gini score is the same no matter which class you arbitrarily pick, because the formula
above sums the squared probabilities of all the classes.
A gini score of 0 is the most pure score possible.
2 red and 2 blue:
The impurity measurement is 0.5 because we would label gumballs incorrectly about
half the time. Because this index is used with binary target variables (0,1), a Gini index of 0.5 is
the least pure score possible. Half is one type and half is the other. Dividing Gini scores by
0.5 can help to understand intuitively what the score represents. 0.5/0.5 = 1, meaning the
grouping is as impure as possible (in a group with just 2 outcomes).
3 red and 1 blue:
The impurity measurement here is 0.375. If we divide this by 0.5 for more intuitive
understanding we will get 0.75, which is the probability of incorrectly/correctly labeling.
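The three gumball cases can be checked directly against the Gini formula. The short sketch below is only an arithmetic check; the function name gini and its two-argument form are chosen for this example.

```python
def gini(red, blue):
    """Gini impurity for a group of red and blue gumballs."""
    total = red + blue
    p_red, p_blue = red / total, blue / total
    return 1 - (p_red ** 2 + p_blue ** 2)


for red, blue in [(4, 0), (2, 2), (3, 1)]:
    print(f"{red} red, {blue} blue -> Gini = {gini(red, blue)}")
# 4 red, 0 blue -> Gini = 0.0
# 2 red, 2 blue -> Gini = 0.5
# 3 red, 1 blue -> Gini = 0.375
```

Swapping the two counts, for example gini(0, 4) instead of gini(4, 0), gives the same score, since the formula treats both classes symmetrically.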
Problem 2:
How does entropy work with the same gumball scenarios stated in problem 1?
4 red and 0 blue:
majority voting to decide on the predicted class, and in the case of regression, we will
take the mean value of the predictions of all the estimators.
12. What are advantages and limitations of the random forest tree?
13. What is the difference between simple decision tree and random forest tree?
bias. Random forests, on the other hand, are a powerful modelling tool that is far more
resilient than a single decision tree. They combine numerous decision trees to reduce
overfitting and bias-related inaccuracy, and hence produce usable results.
Theory questions
18. What are the Pros and Cons of using Naive Bayes?
The requirement of predictors to be independent. In most real-life cases the
predictors are dependent, which hinders the performance of the classifier.
19. How does the Bayes algorithm differ from decision trees?
Theory questions
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is used
for Classification problems in Machine Learning. The goal of the SVM algorithm is to create
the best line or decision boundary that can segregate n-dimensional space into classes so
that we can easily put the new data point in the correct category in the future. This best
decision boundary is called a hyperplane. SVM chooses the extreme points/vectors that help
in creating the hyperplane. These extreme cases are called support vectors, and hence the
algorithm is termed Support Vector Machine. Consider the below diagram in which there
are two different categories that are classified using a decision boundary or hyperplane:
SVM works by mapping data to a high-dimensional feature space so that data points can be
categorized, even when the data are not otherwise linearly separable.
[Figure panels: Original dataset; Data with separator added; Transformed data]
A separator between the categories is found, and then the data are transformed in such a
way that the separator could be drawn as a hyperplane. Following this, characteristics of new
Theory questions
Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the categorical
dependent variable using a given set of independent variables.
Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value. It can be Yes or No, 0 or 1, True or
False, etc., but instead of giving the exact value as 0 or 1, it gives probabilistic
values that lie between 0 and 1.
Logistic Regression is very similar to Linear Regression except in how it is
used. Linear Regression is used for solving regression problems, whereas Logistic
Regression is used for solving classification problems.
In Logistic Regression, instead of fitting a regression line, we fit an "S"-shaped logistic
function whose output is bounded by two limiting values (0 and 1). The curve from the logistic
function indicates the likelihood of something, such as whether cells are cancerous or
not, or whether a mouse is obese or not based on its weight.
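The "S"-shaped logistic function described above can be sketched as follows. The weight and bias values are hypothetical and were picked only so that the example output changes class around 25 grams; they do not come from any fitted model.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weight, bias, x, threshold=0.5):
    """Return the estimated probability and the 0/1 class for one input x."""
    probability = sigmoid(weight * x + bias)
    return probability, int(probability >= threshold)


# Hypothetical model: probability that a mouse is obese given its weight in grams.
for grams in (15, 25, 35):
    print(grams, predict(weight=0.4, bias=-10.0, x=grams))
```

The model outputs a probability between 0 and 1; the class label is obtained only afterwards, by comparing that probability with a threshold.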
Logistic Regression is a significant machine learning algorithm because it has the ability
to provide probabilities and classify new data using continuous and discrete datasets.
Logistic Regression can be used to classify the observations using different types of data
and can easily determine the most effective variables used for the classification. The
below image is showing the logistic function:
hθ(x) approaches 1. Conversely, the cost grows to infinity as hθ(x) approaches 0.
This is a desirable property: we want a bigger penalty when the algorithm predicts something
far away from the actual value. If the label is y=1 but the algorithm predicts hθ(x)=0, the
outcome is completely wrong. The same intuition applies when y=0, depicted in
plot 2 below, on the right side: the penalty grows when the label is y=0 but the algorithm
predicts hθ(x)=1. The above two functions can be compressed into a single function i.e.
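As a rough sketch of the behaviour described above, the code below implements the two-case cost and one single-expression form. The text is cut off before stating its combined formula, so cross_entropy here is the standard binary cross-entropy and is an assumption about what was intended; the function names are also chosen for this example.

```python
import math


def cost(h, y):
    """Per-example cost: -log(h) when y == 1, -log(1 - h) when y == 0."""
    return -math.log(h) if y == 1 else -math.log(1 - h)


def cross_entropy(h, y):
    """Both cases in one expression; only one term is non-zero for y in {0, 1}."""
    return -(y * math.log(h) + (1 - y) * math.log(1 - h))


# The penalty grows as the prediction h moves away from the label.
for h in (0.99, 0.5, 0.01):
    print(h, round(cost(h, 1), 3), round(cost(h, 0), 3))
```

Both functions require 0 < h < 1, since the logarithm is undefined at 0; this matches the cost growing without bound as the prediction approaches the wrong extreme.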
Theory questions
Advantages
The algorithm is simple and easy to implement.
There's no need to build a model, tune several parameters, or make additional
assumptions.
The algorithm is versatile. It can be used for classification, regression, and search (as we
will see in the next section).
Disadvantages
The algorithm gets significantly slower as the number of examples and/or
predictors/independent variables increase.
To select the K that's right for your data, we run the KNN algorithm several times with
different values of K and choose the K that reduces the number of errors we encounter while
maintaining the algorithm's ability to accurately make predictions when it's given data it
hasn't seen before. Here are some things to keep in mind:
As we decrease the value of K to 1, our predictions become less stable. Just think for a
minute, imagine K=1 and we have a query point surrounded by several reds and one
green (I'm thinking about the top left corner of the colored plot above), but the green is
the single nearest neighbor. Reasonably, we would think the query point is most likely
red, but because K=1, KNN incorrectly predicts that the query point is green.
Inversely, as we increase the value of K, our predictions become more stable due to
majority voting / averaging, and thus, more likely to make more accurate predictions (up
to a certain point). Eventually, we begin to witness an increasing number of errors. It is at
this point we know we have pushed the value of K too far.
In cases where we are taking a majority vote (e.g. picking the mode in a classification
problem) among labels, we usually make K an odd number to have a tiebreaker.
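The effect of K on stability can be seen in a small sketch. The data points below are invented to reproduce the situation described above (a single green point sitting among several reds), and knn_predict is a name chosen for this example; Euclidean distance is assumed.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Toy data (invented): one green point sits inside a cluster of reds.
train = [
    ((1.0, 1.0), "red"), ((1.2, 0.8), "red"), ((0.8, 1.2), "red"),
    ((1.1, 1.1), "green"),
    ((3.0, 3.0), "green"), ((3.2, 2.9), "green"), ((2.9, 3.1), "green"),
]
query = (1.12, 1.12)

for k in (1, 3, 5, 7):
    print(k, knn_predict(train, query, k))
# 1 green
# 3 red
# 5 red
# 7 green
```

With K=1 the lone green neighbor decides the outcome; K=3 and K=5 recover the red majority around the query; with K=7 every training point votes and the distant green cluster outweighs the local reds, which illustrates pushing K too far.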
49. How to choose the value of "K number of clusters" in K-means Clustering?
The performance of the K-means clustering algorithm depends on the quality of the clusters
that it forms, but choosing the optimal number of clusters is a big task. There are several
ways to find the optimal number of clusters; here we discuss one commonly used
method for finding the number of clusters, or value of K. The method is given below:
Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of clusters.
This method uses the concept of WCSS value. WCSS stands for Within Cluster Sum of
Squares, which defines the total variations within a cluster. The formula to calculate the
value of WCSS (for 3 clusters) is given below:
$$\text{WCSS} = \sum_{P_i \in \text{Cluster}_1} \text{distance}(P_i, C_1)^2 + \sum_{P_i \in \text{Cluster}_2} \text{distance}(P_i, C_2)^2 + \sum_{P_i \in \text{Cluster}_3} \text{distance}(P_i, C_3)^2$$
In the above formula of WCSS,
$\sum_{P_i \in \text{Cluster}_1} \text{distance}(P_i, C_1)^2$ is the sum of the squares of the distances between each data
point in Cluster 1 and its centroid C1, and the same holds for the other two terms.
To measure the distance between data points and centroid, we can use any method such as
Euclidean distance or Manhattan distance.
To find the optimal number of clusters, the elbow method follows the steps below:
It executes K-means clustering on a given dataset for different K values (ranging from
1 to 10).
For each value of K, it calculates the WCSS value.
It plots a curve between the calculated WCSS values and the number of clusters K.
The sharp point of bend in the plot, where the curve looks like an arm, is
considered the best value of K.
Since the graph shows a sharp bend that looks like an elbow, it is known as the
elbow method. The graph for the elbow method looks like the below image:
Note: We can choose the number of clusters equal to the given data points. If we choose the
number of clusters equal to the data points, then the value of WCSS becomes zero, and that
will be the endpoint of the plot.
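The elbow procedure can be sketched with a very small K-means written from scratch. This is a simplified illustration under several assumptions: Euclidean distance, centroids seeded with the first k points instead of a random initialization, a fixed number of iterations, and an invented toy dataset. The names kmeans and wcss are chosen for this example.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points seed the centroids (a simplification)."""
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels


def wcss(points, centroids, labels):
    """Sum of squared Euclidean distances from each point to its centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


# Toy data (invented): three loose groups of two points each.
points = [(1, 1), (5, 5), (9, 1), (1, 2), (5, 6), (9, 2)]
for k in range(1, len(points) + 1):
    centroids, labels = kmeans(points, k)
    print(k, round(wcss(points, centroids, labels), 2))
```

Plotting the printed WCSS values against K would give the elbow curve. When K equals the number of data points, every point is its own centroid and the WCSS is zero, as stated in the note above.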
*********************
Regression Analysis is an analytical process whose end goal is to understand the inter-
relationships in the data and find as much useful information as possible.
According to the book, there are a number of steps which are loosely detailed below.
1 - Problem definition
The very first step is, of course, to define the problem we are trying to solve. It may be a
business question that needs to be answered or simply a prediction we want to make
based on some set of data. In this stage we must know the target variable and the
attributes we presume affect the target variable. These would later be analysed to judge
their credibility. For the sake of our discussion let's take the Titanic Dataset as an example.
In this dataset we have data on about 900 passengers. The question or the problem we
must solve is predicting which passengers likely survived the tragedy, given their data.
the training, while the testing set is used to test the model after the training is
completed.
The validation dataset gives the model the first taste of unseen data. However, not all
data scientists perform an initial check using validation data. They might skip this part
and go directly to testing data.
The classifier model can be designed/trained and performance can be evaluated based
on K-fold cross-validation mode, training mode and test mode.
The main idea behind K-fold cross-validation is that each sample in our dataset has
the opportunity of being tested. It is a special case of cross-validation where we
iterate over a dataset k times. In each round, we split the dataset into k parts:
one part is used for validation, and the remaining k-1 parts are merged into a
training subset for model evaluation.
Computation time is reduced, as the process is repeated only 10 times when the value
of k is 10. It also has reduced bias.
Every data point gets to be tested exactly once and is used in training k-1 times.
The variance of the resulting estimate is reduced as k increases.
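The splitting scheme described above can be sketched as a generator of index lists. This is a minimal sketch without shuffling; the name k_fold_indices is chosen for this example and is not a library function.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k rounds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


for round_number, (train, validation) in enumerate(k_fold_indices(10, 5), start=1):
    print(round_number, "validate on", validation, "train on", train)
```

Each index lands in the validation part exactly once and in the training part in the other k-1 rounds, which is the property stated above.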
• Machine learning algorithms have hyperparameters that allow you to tailor the
behavior of the algorithm to your specific dataset.
• Hyperparameters are different from parameters, which are the internal coefficients or
weights for a model found by the learning algorithm. Unlike parameters,
hyperparameters are specified by the practitioner when configuring the model.
• Typically, it is challenging to know what values to use for the hyperparameters of a given
algorithm on a given dataset, therefore it is common to use random or grid search
strategies for different hyperparameter values.
• The more hyperparameters of an algorithm that you need to tune, the slower the
tuning process. Therefore, it is desirable to select a minimum subset of model
hyperparameters to search or tune.
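The grid and random search strategies mentioned above can be sketched as follows. The validation_error function is a made-up stand-in for "train a model with these hyperparameters and measure its error", and the candidate values are arbitrary; only the search logic is the point of the example.

```python
import itertools
import random


def validation_error(max_depth, min_samples_leaf):
    """Stand-in for 'train a model and score it'; lower is better (made up)."""
    return (max_depth - 4) ** 2 + (min_samples_leaf - 2) ** 2


grid = {"max_depth": [2, 4, 6, 8], "min_samples_leaf": [1, 2, 5]}

# Grid search: evaluate every combination of the listed values.
combos = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]
best_grid = min(combos, key=lambda params: validation_error(**params))

# Random search: evaluate a fixed budget of randomly drawn combinations.
rng = random.Random(0)
samples = [{name: rng.choice(options) for name, options in grid.items()} for _ in range(5)]
best_random = min(samples, key=lambda params: validation_error(**params))

print(len(combos), best_grid)
print(len(samples), best_random)
```

The grid grows multiplicatively with each added hyperparameter (here 4 x 3 = 12 combinations), which is why tuning fewer hyperparameters keeps the search fast.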
Max_Depth: The maximum depth of the tree. If this is not specified in the Decision Tree, the
nodes will be expanded until all leaf nodes are pure or until all leaf nodes contain fewer than
min_samples_split samples.
Default = None
Input options → integer
Min_Samples_Split: The minimum samples required to split an internal node. If the number
of samples in an internal node is less than min_samples_split, then that node will become
a leaf node.
Default = 2
Input options → integer or float (if float, then min_samples_split is fraction)
Min_Samples_Leaf: The minimum samples required to be at a leaf node. Therefore, a split
can only happen if it leaves at least the min_samples_leaf in both of the resulting nodes.
Default = 1
Input options → integer or float (if float, then min_samples_leaf is fraction)
Max_Features: The number of features to consider when looking for the best split. For
example, if there are 35 features in a dataframe and max_features is 9, only 9 of the 35
features will be considered at each split.
Default = None
Input options → integer, float (if float, then max_features is fraction) or {“auto”, “sqrt”,
“log2”}
“auto”: max_features=sqrt(n_features)
“sqrt”: max_features = sqrt(n_features)
Theory questions
reinforcement learning.)
Agent(): An entity that can perceive/explore the environment and act upon it.
Environment(): The situation in which an agent is present or by which it is surrounded. In
RL, we assume a stochastic environment, which means it is random in nature.
Action(): Actions are the moves taken by an agent within the environment.
State(): State is a situation returned by the environment after each action taken by the
agent.
Reward(): A feedback returned to the agent from the environment to evaluate the action
of the agent.
Policy(): Policy is a strategy applied by the agent for the next action based on the
current state.
Value(): It is the expected long-term return with the discount factor, as opposed to the
short-term reward.
Q-value(): It is mostly similar to the value, but it takes one additional parameter, the
current action (a).
In RL, the agent is not instructed about the environment and what actions need to be
taken.
It is based on a trial-and-error process.
The agent takes the next action and changes states according to the feedback of the
previous action.
The agent may get a delayed reward.
environments, where the agent can observe the environment and act for the new state. The
complete process is known as Markov Decision process, which is explained below:
MDP is used to describe the environment for RL, and almost all RL problems can be
formalized using an MDP.
An MDP is a tuple of four elements (S, A, Pa, Ra):
S: a finite set of states.
A: a finite set of actions.
Ra: the reward received after transitioning from state S to state S' due to action a.
Pa: the probability of transitioning from state S to state S' due to action a.
MDP uses Markov property, and to better understand the MDP, we need to learn about it.
Markov Property: It says that "if the agent is in the current state s1, performs an
action a1 and moves to the state s2, then the state transition from s1 to s2 depends only on
the current state and action; future actions and states do not depend on past actions,
rewards, or states." In other words, as per the Markov Property, the current state transition
does not depend on any past action or state. Hence, an MDP is an RL problem that satisfies
the Markov property. For example, in a game of chess, the players only focus on the current
state and do not need to remember past actions or states.
Finite MDP: A finite MDP is when there are finite states, finite rewards, and finite actions. In
RL, we consider only the finite MDP.
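A finite MDP can be written down as plain data. The sketch below uses an invented two-state, two-action example; the state names, probabilities and rewards are all made up for illustration. It shows how the next state and reward are sampled from the current state and action alone, as the Markov property requires.

```python
import random

# A tiny, made-up finite MDP (S, A, Pa, Ra) with two states and two actions.
states = ["low", "high"]
actions = ["wait", "charge"]

# transitions[(s, a)] lists (next_state, probability, reward) triples.
transitions = {
    ("low", "wait"): [("low", 1.0, 0.0)],
    ("low", "charge"): [("high", 0.9, 1.0), ("low", 0.1, 0.0)],
    ("high", "wait"): [("high", 0.7, 2.0), ("low", 0.3, 2.0)],
    ("high", "charge"): [("high", 1.0, 0.0)],
}


def step(state, action, rng):
    """Sample the next state and reward; only (state, action) is consulted."""
    outcomes = transitions[(state, action)]
    weights = [probability for _, probability, _ in outcomes]
    next_state, _, reward = rng.choices(outcomes, weights=weights)[0]
    return next_state, reward


rng = random.Random(0)
state = "low"
for _ in range(5):
    action = rng.choice(actions)
    next_state, reward = step(state, action, rng)
    print(state, action, "->", next_state, reward)
    state = next_state
```

Note that step never looks at the history of earlier states or actions, so the process is memoryless by construction.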
Markov Process:
Markov Process is a memoryless process with a sequence of random states S1, S2, ....., St that
uses the Markov Property. Markov process is also known as Markov chain, which is a tuple (S,
Now, how do the hidden layers make the price prediction? How is computation done inside
the hidden layers? This will be explained with the help of the activation function, loss
function, and optimizers.
Each neuron has an activation function that performs computation. Different layers can
have different activation functions but neurons belonging to one layer have the same
activation function. In DNN, a weighted sum of input is calculated based on weights and
inputs provided. Then, the activation function comes into the picture that works on weighted
sum and converts it into output.
Activation functions help the model learn the complex relationships that exist within the
dataset. If we do not use an activation function in neurons and give the weighted sum as
output, computations will be difficult, as there is no specific range for the weighted sum.
So, the activation function helps to keep the output in a particular range. Secondly, a
non-linear activation function is preferred because it adds non-linearity to the model, which
otherwise would reduce to a simple linear regression model incapable of taking advantage of
hidden layers. The ReLU function or one of its variants is mostly used for hidden layers, and
the sigmoid/softmax function is mostly used for the final layer in binary/multi-class
classification problems.
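The computation inside a single neuron, a weighted sum followed by an activation function, can be sketched as follows. The inputs, weights and bias are made-up numbers, and the helper name neuron is chosen for this example.

```python
import math


def relu(z):
    """ReLU: passes positive values through and clips negatives to zero."""
    return max(0.0, z)


def sigmoid(z):
    """Sigmoid: squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Softmax: turns a list of scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing; the result is unchanged.
    shifted = [math.exp(s - max(scores)) for s in scores]
    return [value / sum(shifted) for value in shifted]


def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs followed by the activation function."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(weighted_sum)


# Made-up inputs and weights for a single neuron.
inputs, weights, bias = [0.5, -1.0, 2.0], [0.8, 0.2, -0.5], 0.1
print(neuron(inputs, weights, bias, relu))
print(neuron(inputs, weights, bias, sigmoid))
print(softmax([2.0, 1.0, 0.1]))
```

Here the weighted sum is negative, so ReLU outputs zero while sigmoid returns a value below 0.5; softmax is shown separately because it acts on the scores of a whole output layer instead of a single neuron.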
To train the model, we give the input (departure location, arrival location and departure date
in the case of train price prediction) to the network and let it predict the output using the
activation functions. Then we compare the predicted output with the actual output and compute
the error between the two values. This error is computed using the loss/cost
function. The same process is repeated for the entire training dataset, and we get the average
loss/error. Now, the objective is to minimize this loss to make the model accurate. There
is a weight on each connection between two neurons. Initially, the weights are randomly
initialized, and the aim is to update these weights with every iteration to reach the minimum
value of the loss/cost function. We could change the weights randomly, but that is not an
efficient method. This is where optimizers come in: they update the weights automatically.
23. What are different loss functions and their use case?
Once the loss for one iteration is computed, an optimizer is used to update the weights.
Instead of changing weights manually, optimizers update the weights automatically in small
increments and help to find the minimum value of the loss/cost function. Magic of DL!!
Finding the minimum value of the cost function requires iterating through the dataset many
times and thus requires large computational power. A common technique used to update these
weights is gradient descent.
It is used to find the minimum value of the loss function by updating weights. There are 3 variants:
a) Batch/Vanilla Gradient Descent
In this, the gradient for the entire dataset is computed to perform one weight update.
It gives good results but can be slow and requires large memory.
b) Stochastic Gradient Descent (SGD)
Weights are updated for each training data point.
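The difference between the batch and stochastic variants can be sketched on a one-weight model y = w * x with a squared-error loss. The data, learning rate and number of epochs are made-up values chosen so that both variants converge; this is an illustration of the update rules, not a tuned optimizer.

```python
import random

# Made-up data that follows y = 2x exactly, so the ideal weight is 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]


def gradient(w, x, y):
    """Derivative of the squared error (w*x - y)**2 with respect to w."""
    return 2 * (w * x - y) * x


def batch_gradient_descent(w, learning_rate=0.01, epochs=200):
    for _ in range(epochs):
        # One update per pass, using the gradient averaged over the whole dataset.
        avg = sum(gradient(w, x, y) for x, y in data) / len(data)
        w -= learning_rate * avg
    return w


def stochastic_gradient_descent(w, learning_rate=0.01, epochs=200, seed=0):
    rng = random.Random(seed)
    for _ in range(epochs):
        # One update per training point, visiting the points in random order.
        for x, y in rng.sample(data, len(data)):
            w -= learning_rate * gradient(w, x, y)
    return w


print(batch_gradient_descent(0.0))
print(stochastic_gradient_descent(0.0))
```

Both runs end very close to w = 2. The batch version performs one weight update per pass over the data, while the stochastic version performs one update per data point, so it makes four times as many updates in the same number of epochs here.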
HMI is all about how people and automated systems interact and communicate with
each other. That has long ceased to be confined to just traditional machines in industry
and now also relates to computers, digital systems or devices for the Internet of Things
(IoT).
More and more devices are connected and automatically carry out tasks. Operating all of
these machines, systems and devices needs to be intuitive and must not place excessive
demands on users.
Human-machine interaction is all about how people and automated systems interact
with each other.
HMI now plays a major role in industry and everyday life: More and more devices are
connected and automatically carry out tasks.
A user interface that is as intuitive as possible is therefore needed to enable smooth
operation of these machines. That can take very different forms.
Smooth communication between people and machines requires interfaces: The place
where or action by which a user engages with the machine.
Simple examples are light switches or the pedals and steering wheel in a car: An action is
triggered when you flick a switch, turn the steering wheel or step on a pedal.
However, a system can also be controlled by text being keyed in, a mouse, touch screens,
voice or gestures.
The devices are either controlled directly: Users touch the smartphone’s screen or issue a
verbal command. Or the systems automatically identify what people want: Traffic lights
change color on their own when a vehicle drives over the inductive loop in the road’s
surface.
Other technologies are not so much there to control devices, but rather to complement
our sensory organs. One example of that is virtual reality glasses.
There are also digital assistants: Chatbots, for instance, reply automatically to requests
from customers and keep on learning.
User interfaces in HMI are the places where or actions by which the user engages with
the machine.
A system can be operated by means of buttons, a mouse, touch screens, voice or
gesture, for instance.