ML Model Paper 2 Solution
Hypothesis testing is a form of statistical inference that uses data from a sample to draw conclusions
about a population parameter or a population probability distribution. First, a tentative assumption is
made about the parameter or distribution. This assumption is called the null hypothesis and is denoted
by H0. An alternative hypothesis (denoted Ha), which is the opposite of what is stated in the null
hypothesis, is then defined. The hypothesis-testing procedure involves using sample data to determine
whether or not H0 can be rejected. If H0 is rejected, the statistical conclusion is that the alternative
hypothesis Ha is true.
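For illustration, here is a minimal sketch of this procedure as a one-sample t-test on made-up data, using scipy; the sample values and the significance level alpha = 0.05 are assumptions.

# Minimal sketch: one-sample t-test on hypothetical data.
# H0: the population mean is 50; Ha: it is not.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=52, scale=5, size=30)      # hypothetical sample

t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
alpha = 0.05                                       # assumed significance level
if p_value < alpha:
    print(f"p = {p_value:.3f} < {alpha}: reject H0 in favour of Ha")
else:
    print(f"p = {p_value:.3f} >= {alpha}: fail to reject H0")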
2. Define possibility of conversion for Regression into Classification and vice versa.
Some regression models are already classification models, e.g. logistic regression. One could
set the cut point at any particular level to get a classification. Usually one would choose a
cut point of 0.5, but there may be reasons not to (the costs of the two types of classification
error might differ).
Regression trees turn into classification trees if the dependent variable changes. In general, it
is not a good idea to turn a continuous dependent variable (as for regression trees) into a
categorical one - it loses information. But there might be times when it is necessary (e.g. to
make certain kinds of decisions).
Similarly, if you categorize the dependent variable, a linear regression is inappropriate and a
logistic regression model is better.
Regression models can be very sensitive to outliers. Also, a practical challenge is that a
predicted value might be far off from the real value in extreme ranges, yet still fall on the
correct side of the data distribution with respect to the mean. For example, if you have a
device that measures heart rate from fingertip images (a made-up example), it will be much
easier from a data-science standpoint to first try to predict whether the pulse rate is normal,
high, or below normal.
The definition of what is "normal" is very clear from a medical standpoint. However, you may
wish to redefine your response-variable segments, first using a clustering technique.
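A minimal sketch of this idea, assuming made-up heart-rate values and illustrative bin thresholds (60 and 100 bpm), which turns the continuous target into classes:

# Turning a continuous target (heart rate in bpm) into classes.
# The values and bin edges are hypothetical; real clinical cut-offs may differ.
import numpy as np

heart_rate = np.array([55, 72, 88, 101, 64, 120, 95])   # hypothetical values

bins = [0, 60, 100, np.inf]
labels = np.array(["below normal", "normal", "high"])
classes = labels[np.digitize(heart_rate, bins) - 1]
print(classes)   # e.g. ['below normal' 'normal' 'normal' 'high' ...]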
With linear regression you fit a polynomial through the data; say we fit a straight line through
a {tumor size, tumor type} sample set. Malignant tumors are labelled 1 and non-malignant ones 0,
and the fitted line is our hypothesis h(x). To make predictions we may say that for any given
tumor size x, if h(x) is greater than 0.5 we predict a malignant tumor, otherwise we predict
benign.
It looks like this way we could correctly predict every single training-set sample, but now let's
change the task a bit.
Intuitively it is clear that all tumors larger than a certain threshold are malignant. So let's add
another sample with a huge tumor size and run linear regression again: the fitted line changes so
that the 0.5 crossing point shifts, and some of the original malignant samples are now
misclassified as benign.
We cannot change the hypothesis each time a new sample arrives. Instead, we should learn
it from the training-set data and then, using the hypothesis we have learned, make correct
predictions for data we haven't seen before.
Hopefully this explains why linear regression is not the best fit for classification problems. You
might also want to watch the "VI. Logistic Regression: Classification" video on ml-class.org,
which explains the idea in more detail.
As for what a good classifier would do: in this particular example you would probably use
logistic regression, which would learn an S-shaped (sigmoid) hypothesis rather than a straight
line.
Note that both linear regression and logistic regression give you a straight line (or a higher-
order polynomial), but those lines have different meanings:
h(x) for linear regression interpolates, or extrapolates, the output and predicts the
value for an x we haven't seen. It is simply like plugging in a new x and getting a raw
number, and is more suitable for tasks like predicting, say, car price based on {car size,
car age}, etc.
h(x) for logistic regression tells you the probability that x belongs to the
"positive" class. This is why it is called a regression algorithm: it estimates a continuous
quantity, the probability. However, if you set a threshold on the probability, such
as h(x) > 0.5, you obtain a classifier, and in many cases this is what is done
with the output of a logistic regression model. This is equivalent to putting a line on
the plot: all points sitting above the classifier line belong to one class, while the points
below belong to the other class.
So, the bottom line is that in a classification scenario we use completely different reasoning
and a completely different algorithm than in a regression scenario.
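The following sketch illustrates the contrast on a made-up tumor-size dataset (the values and the 0.5 threshold are assumptions, not the course's data), using scikit-learn:

# Contrasting a thresholded linear regression with logistic regression
# on hypothetical {tumor size, tumor type} data, including one large outlier.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6], [20]])   # tumor sizes (hypothetical)
y = np.array([0, 0, 0, 1, 1, 1, 1])                  # 0 = benign, 1 = malignant

# Linear regression: threshold the raw output at 0.5.
lin = LinearRegression().fit(X, y)
lin_pred = (lin.predict(X) >= 0.5).astype(int)

# Logistic regression: predicted class labels (probability thresholded at 0.5).
log = LogisticRegression().fit(X, y)
log_pred = log.predict(X)

print("linear  :", lin_pred)   # the outlier at x=20 drags the line and can flip labels near the boundary
print("logistic:", log_pred)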
A dendrogram is a diagram that shows the hierarchical relationship between objects. It is most
commonly created as an output from hierarchical clustering. The main use of a dendrogram is to
work out the best way to allocate objects to clusters.
The dendrogram is a tree-like structure that records each merge step the HC (hierarchical
clustering) algorithm performs. In the dendrogram plot, the y-axis shows the Euclidean distances
at which clusters of data points are merged, and the x-axis shows all the data points of the
given dataset.
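A minimal sketch of producing a dendrogram from hierarchical clustering with scipy (the five 2-D points are made up):

# Building a dendrogram from hypothetical 2-D points with scipy.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1, 2], [2, 2], [8, 8], [8, 9], [25, 30]])  # hypothetical points

Z = linkage(X, method="ward")   # records each merge performed by the algorithm
dendrogram(Z)
plt.xlabel("data points")
plt.ylabel("Euclidean distance")
plt.show()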
Set of actions: A
Set of states: S
Reward: R
Policy: π
Value: V
The mathematical framework for mapping a solution in reinforcement learning is known as a
Markov Decision Process (MDP).
Q-Learning
Q-learning is a value-based method that learns which action an agent should take in each state.
In the state-transition diagram of an MDP, each state is drawn as a node, while the arrows show
the actions.
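A minimal tabular Q-learning sketch; the environment object env, its reset()/step() interface, the state/action counts, and the learning parameters are all assumptions for illustration:

# Tabular Q-learning sketch on a hypothetical environment `env`
# (assumed interface: reset() -> state, step(action) -> (next_state, reward, done)).
import numpy as np

n_states, n_actions = 16, 4              # assumed sizes
alpha, gamma, epsilon = 0.1, 0.99, 0.1   # learning rate, discount, exploration rate
Q = np.zeros((n_states, n_actions))      # Q-value table

def train(env, episodes=500):
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)       # explore
            else:
                a = int(np.argmax(Q[s]))               # exploit
            s_next, r, done = env.step(a)
            # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next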
6. Differentiate Root Mean Squared Error (RMSE) with Mean Squared Error (MSE) for Linear Regression?
The Mean Squared Error (MSE) is a measure of how close a fitted line is to data points. For
every data point, you take the distance vertically from the point to the corresponding y value on
the curve fit (the error), and square the value. Then you add up all those values for all data
points, and, in the case of a fit with two parameters such as a linear fit, divide by the number of
points minus two. The squaring is done so negative values do not cancel positive values. The
smaller the Mean Squared Error, the closer the fit is to the data. The MSE has the units squared
of whatever is plotted on the vertical axis.
RMSE stands for root mean squared error and MSE stands for mean squared error; RMSE is simply the
square root of the MSE. RMSE is probably the more easily interpreted statistic, since it has the
same units as the quantity plotted on the vertical axis.
They are the most common measures of accuracy for a linear regression model. The formulas are
below:
MSE = (1/n) · Σ (y_i − ŷ_i)²   (or divide by n − p to adjust for the p fitted parameters)
RMSE = √MSE
RMSE is the square root of MSE. MSE is measured in units that are the square of the
target variable, while RMSE is measured in the same units as the target variable. Due
to its formulation, MSE, just like the squared loss function that it derives from, effectively
penalizes larger errors more severely.
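A minimal sketch of the two metrics on made-up values, using numpy and scikit-learn:

# Computing MSE and RMSE for hypothetical observed values and predictions.
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 7.5, 10.0])   # hypothetical observed values
y_pred = np.array([2.5, 5.0, 8.0, 11.0])   # hypothetical model predictions

mse = mean_squared_error(y_true, y_pred)   # units of y squared
rmse = np.sqrt(mse)                        # same units as y
print(mse, rmse)                           # 0.375 and about 0.612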
7. Describe overfitting in comparison with underfitting. Give any one method to avoid overfitting.
Overfitting occurs when a model learns the training data too closely, including its noise, so it
performs very well on the training set but poorly on unseen data (low bias, high variance).
Underfitting is the opposite: the model is too simple to capture the underlying pattern, so it
performs poorly on both the training and the test data (high bias). One common way to avoid
overfitting is to limit model complexity, e.g. through regularization, pruning, or early stopping,
and to check the model with cross-validation.
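A minimal sketch (synthetic data, assumed depth limit of 3) showing overfitting and one way to reduce it by limiting model complexity:

# Overfitting illustrated with an unconstrained vs. depth-limited decision tree.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)            # unconstrained
pruned = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

for name, m in [("deep", deep), ("pruned", pruned)]:
    print(name, "train:", m.score(X_tr, y_tr), "test:", m.score(X_te, y_te))
# The deep tree typically scores near 1.0 on training data but lower on test data
# (overfitting); the depth-limited tree narrows that gap.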
The Apriori algorithm uses frequent itemsets to generate association rules, and it is designed to
work on databases that contain transactions. With the help of these association rules, it
determines how strongly or how weakly two objects are connected. The algorithm uses
a breadth-first search and a hash tree to calculate the itemset associations efficiently. It is an
iterative process for finding the frequent itemsets in a large dataset.
Consider a transaction dataset over the items I1–I5; we will find the frequent itemsets and
generate association rules for them.
Step-2: K=2
Generate candidate set C2 using L1, the frequent 1-itemsets (this is called the join step). The
condition for joining Lk-1 and Lk-1 is that they should have (K−2) elements in common.
Check whether all subsets of each itemset are frequent or not, and if not, remove that itemset.
(For example, the subsets of {I1, I2} are {I1} and {I2}, which are frequent. Check this for each
itemset.)
Now find the support count of these itemsets by searching the dataset.
Step-4:
Generate candidate set C4 using L3, the frequent 3-itemsets (join step). The condition for
joining Lk-1 and Lk-1 (K=4) is that they should have (K−2) elements in common, so here, for L3,
the first two items should match.
Check whether all subsets of these itemsets are frequent or not (here the itemset formed by
joining L3 is {I1, I2, I3, I5}, and its subset {I1, I3, I5} is not frequent). So there is no
itemset in C4.
We stop here because no further frequent itemsets are found.
Thus, we have discovered all the frequent itemsets. Now the generation of strong association
rules comes into the picture. For that we need to calculate the confidence of each rule.
Confidence –
A confidence of 60% means that 60% of the customers who purchased milk and bread also bought
butter.
Confidence(A->B) = Support_count(A∪B) / Support_count(A)
So here, taking one frequent itemset as an example, we will show the rule generation.
Itemset {I1, I2, I3} // from L3
So the rules can be:
[I1^I2]=>[I3] //confidence = sup(I1^I2^I3)/sup(I1^I2) = 2/4*100=50%
[I1^I3]=>[I2] //confidence = sup(I1^I2^I3)/sup(I1^I3) = 2/4*100=50%
[I2^I3]=>[I1] //confidence = sup(I1^I2^I3)/sup(I2^I3) = 2/4*100=50%
[I1]=>[I2^I3] //confidence = sup(I1^I2^I3)/sup(I1) = 2/6*100=33%
[I2]=>[I1^I3] //confidence = sup(I1^I2^I3)/sup(I2) = 2/7*100=28%
[I3]=>[I1^I2] //confidence = sup(I1^I2^I3)/sup(I3) = 2/6*100=33%
So if the minimum confidence is 50%, then the first 3 rules can be considered strong association
rules.
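A minimal sketch of the support/confidence arithmetic used above; the transaction list is hypothetical, chosen so that the support counts match the worked numbers (e.g. sup(I1, I2) = 4 and sup(I1, I2, I3) = 2):

# Support-count and confidence helpers on a hypothetical transaction database
# whose counts match the worked example above (items I1..I5).
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]

def support_count(itemset):
    # number of transactions containing every item of `itemset`
    return sum(itemset <= t for t in transactions)

def confidence(antecedent, consequent):
    return support_count(antecedent | consequent) / support_count(antecedent)

print(confidence({"I1", "I2"}, {"I3"}))   # 2/4 = 0.5, i.e. 50%
print(confidence({"I2"}, {"I1", "I3"}))   # 2/7 ≈ 0.28, i.e. 28%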
9. Compare Naive Bayes with Logistic Regression to solve classification problems.
Naïve Bayes is a classification method based on Bayes’ theorem that derives the
probability of a given feature vector being associated with a label. Naïve Bayes makes the
naive assumption of conditional independence for every feature, which means the
algorithm expects the features to be independent, which is not always the case.
Logistic regression is a linear classification method that learns the probability of a sample
belonging to a certain class. Logistic regression tries to find the optimal decision
boundary that best separates the classes.
You might wonder what posterior probability is. Posterior probability can be defined as the
probability of event A happening given that event B has occurred; in plainer terms, this means
that a previous belief can be updated when we have new information. For example, say we think
the stock market will go up by 50% next year; this prediction can be updated when we get new
information such as updated GDP numbers, interest rates, etc.
3. Model assumptions
Naïve Bayes assumes all the features to be conditionally independent. So, if some of the
features are in fact dependent on each other (in case of a large feature space), the
prediction might be poor.
Logistic regression splits feature space linearly, and typically works reasonably well even
when some of the variables are correlated.
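A minimal sketch comparing the two classifiers on synthetic data with scikit-learn (the dataset and its size are assumptions):

# Gaussian Naive Bayes vs. logistic regression on the same synthetic task.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_tr, y_tr)                        # assumes conditionally independent features
lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)   # learns a linear decision boundary

print("Naive Bayes        :", nb.score(X_te, y_te))
print("Logistic Regression:", lr.score(X_te, y_te))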
10. Differentiate Random Forest with Decision Tree and Explain how is it possible to perform
Unsupervised Learning with Random Forest?
A decision tree is a single model fitted to the entire training set; it is easy to interpret but
tends to overfit (high variance). A random forest is an ensemble of many decision trees, each
trained on a bootstrap sample with a random subset of features considered at each split, and
averaging their predictions reduces variance and usually improves accuracy.
Many unsupervised learning methods require an input dissimilarity measure among the
observations. Hence, if a dissimilarity matrix can be produced using a Random Forest, we can
successfully implement unsupervised learning. The patterns found in the process are then used
to form clusters.
An artificial class label is created that distinguishes the ‘observed’ data from suitably
generated ‘synthetic’ data. The observed data is the original unlabeled data, while the
synthetic data is drawn from a reference distribution. Supervised learning methods, which
distinguish observed data from synthetic data, yield a dissimilarity measure that can be
used as input in subsequent unsupervised learning methods.
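A hedged sketch of this procedure, assuming scikit-learn (1.2 or newer for the metric argument) and made-up data; the synthetic sample is drawn by independently permuting each column, and a proximity matrix is derived from shared leaf membership:

# Unsupervised use of a Random Forest: observed-vs-synthetic classification,
# then a dissimilarity matrix from shared leaf membership, then clustering.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                        # hypothetical unlabeled data

# Synthetic reference data: permute each column independently (breaks dependence).
X_synth = np.column_stack([rng.permutation(col) for col in X.T])
X_all = np.vstack([X, X_synth])
y_all = np.r_[np.ones(len(X)), np.zeros(len(X_synth))]

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_all, y_all)

# Proximity: fraction of trees in which two observed points share a leaf.
leaves = rf.apply(X)                                 # shape (n_samples, n_trees)
proximity = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
dissimilarity = 1.0 - proximity

clusters = AgglomerativeClustering(
    n_clusters=3, metric="precomputed", linkage="average"   # `affinity=` on older sklearn
).fit_predict(dissimilarity)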
11. Can PCA be used for regression-based problem statements? If yes, then explain the scenario
where we can use it.
Yes, we can use Principal Components for regression problem statements.
PCA would perform well in cases when the first few Principal Components are sufficient
to capture most of the variation in the independent variables as well as the relationship
with the dependent variable.
The only problem with this approach is that the new, reduced set of features is constructed by
ignoring the dependent variable Y when applying PCA; while these features may do a good overall
job of explaining the variation in X, the model will perform poorly if they do not explain the
variation in Y.
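A minimal sketch of principal-components regression on synthetic data with scikit-learn (the number of components, 10, is an assumption that would normally be tuned, e.g. by cross-validation):

# Principal-components regression (PCR): PCA on the predictors, then linear regression.
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=50, n_informative=10, random_state=0)

pcr = make_pipeline(StandardScaler(), PCA(n_components=10), LinearRegression())
print(cross_val_score(pcr, X, y, cv=5).mean())   # mean R^2; PCA chooses components without looking at y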
12. Compare Feature Extraction and Feature Selection techniques. Explain how dimensionality can
be reduced using subset selection procedure.
Feature selection is the process of choosing specific features from a pool of features.
This helps with simplification, regularization, and shorter training times. It can be done with
various techniques, e.g. using the coefficients of a linear regression or the feature
importances of decision trees.
Feature extraction is the process of converting the raw data into some other representation
that the algorithm can work with. Feature extraction creates a new, smaller set of features
that captures most of the useful information in the data.
The main difference between them is that feature selection keeps a subset of the original
features, while feature extraction creates new ones: feature extraction transforms arbitrary
data, such as text or images, into numerical features that machine learning algorithms can
understand, whereas feature selection is a technique applied on top of such (numerical)
features to retain only the most useful ones.
Dimensionality can be reduced with a subset-selection procedure by searching over candidate
subsets of the original features (for example, forward selection adds the most useful feature
one at a time, while backward elimination starts with all features and removes the least useful
one at a time) and keeping the subset that performs best on a validation criterion; the
discarded features are simply dropped, which directly reduces the dimensionality of the data.
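A minimal sketch contrasting the two on synthetic data with scikit-learn (keeping or building 5 features is an assumption):

# Feature selection keeps a subset of original columns; feature extraction builds new ones.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)

# Feature selection: keep the 5 original features with the highest F-scores.
selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)
X_selected = selector.transform(X)          # still columns of the original X
print(selector.get_support(indices=True))   # indices of the retained columns

# Feature extraction: build 5 new features as linear combinations of all 20.
X_extracted = PCA(n_components=5).fit_transform(X)
print(X_selected.shape, X_extracted.shape)  # both (200, 5)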