Model Definition11


Logistic regression is part of a larger class of algorithms known as generalized linear models (GLMs).

For parameter estimation, it uses maximum likelihood estimation (MLE) rather than ordinary least squares.

The dependent variable need not be normally distributed.

To start with logistic regression, I’ll first write the simple linear regression equation
with the dependent variable enclosed in a link function:

g(y) = β0 + β(Age) ---- (a)

Note: For ease of understanding, I’ve considered ‘Age’ as the independent variable.

In logistic regression, we are only concerned with the probability of the outcome of the
dependent variable (success or failure).

Logistic Regression

Logistic regression is used to find the probability of event=Success and
event=Failure. We should use logistic regression when the dependent variable is
binary (0/1, True/False, Yes/No) in nature. Here the predicted value of Y is a
probability p that ranges from 0 to 1, and it can be represented by the following equations.

odds = p / (1 - p) = probability of the event occurring / probability of the event not occurring

ln(odds) = ln(p / (1 - p))

logit(p) = ln(p / (1 - p)) = b0 + b1X1 + b2X2 + b3X3 + ... + bkXk

Above, p is the probability of presence of the characteristic of interest. A question
that you should ask here is “why have we used log in the equation?”.

Since the dependent variable here follows a binomial distribution, we need
to choose a link function that is best suited to this distribution, and that is the logit
function. In the equation above, the parameters are chosen to maximize the
likelihood of observing the sample values rather than to minimize the sum of squared
errors (as in ordinary regression).
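
To make the link function concrete, here is a minimal Python sketch with a single predictor. The coefficient values (b0, b1), the ages, and the outcomes are made up for illustration and are not estimated from any data. It shows that the logit and the logistic (sigmoid) function are inverses of each other, and it computes the log-likelihood that MLE would try to maximize.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    """Inverse of the logit: maps any real score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(b0, b1, xs, ys):
    """Log-likelihood of binary outcomes ys under logit(p) = b0 + b1 * x."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

# Hypothetical coefficients and data, chosen only for illustration.
b0, b1 = -4.0, 0.1
ages = [20, 30, 45, 60]
outcomes = [0, 0, 1, 1]

p_age_40 = sigmoid(b0 + b1 * 40)                   # linear score is 0 here
ll_good = log_likelihood(b0, b1, ages, outcomes)
ll_bad = log_likelihood(b0, -b1, ages, outcomes)   # slope with the wrong sign
```

With these numbers, the slope whose sign matches the data gives the higher log-likelihood, which is the sense in which MLE prefers one set of parameters over another.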

Linear Regression

It is used to estimate real values (cost of houses, number of calls, total sales, etc.)
based on continuous variable(s). Here, we establish a relationship between the
independent and dependent variables by fitting a best-fit line. This best-fit line is known
as the regression line and is represented by the linear equation Y = a * X + b.

• Y – Dependent variable
• a – Slope
• X – Independent variable
• b – Intercept
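
As a rough sketch of how a and b can be obtained, the snippet below uses the standard least-squares formulas for a single independent variable. The function name fit_line and the data points are assumptions made up for this example.

```python
def fit_line(xs, ys):
    """Least-squares estimates of slope a and intercept b for Y = a * X + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of X and Y divided by the variance of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    # The fitted line always passes through the point of means.
    b = mean_y - a * mean_x
    return a, b

# Made-up data lying exactly on Y = 2 * X + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
a, b = fit_line(xs, ys)
```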

Decision Tree Simplified.

Decision Tree

We split the population into two or more homogeneous sets. This is done based on the most significant
attributes/independent variables, so as to make the groups as distinct as possible.

In the image above, you can see that the population is classified into four different groups based on
multiple attributes to identify ‘if they will play or not’. To split the population into groups that
differ from one another as much as possible, it uses various techniques such as Gini, Information Gain, Chi-square, and entropy.
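
Two of these measures are easy to compute by hand. The sketch below, with made-up ‘play’/‘no’ labels and hypothetical function names, computes Gini impurity, entropy, and the information gain of a candidate split. It is only meant to show why a split into purer groups scores better.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure group, larger for a more mixed group."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: 0 for a pure group."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Made-up population: 5 who play and 5 who do not.
parent = ["play"] * 5 + ["no"] * 5
pure_split = [["play"] * 5, ["no"] * 5]
mixed_split = [["play", "play", "no", "no", "play"],
               ["no", "no", "play", "play", "no"]]

gain_pure = information_gain(parent, pure_split)
gain_mixed = information_gain(parent, mixed_split)
```

A tree-building procedure would evaluate candidate splits like these and keep the one with the highest gain (or the lowest weighted impurity).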

Support Vectors

Given two variables, we’d first plot them in two-dimensional space, where each point has two coordinates.
The points that end up lying closest to the dividing line are known as support vectors.

Now, we will find a line that splits the data between the two differently classified
groups. This will be the line for which the distance to the closest point in
each of the two groups is as large as possible.
In the example shown above, the line which splits the data into two differently
classified groups is the black line, since the two closest points are the farthest
from the line. This line is our classifier. Then, depending on which side of the line
the testing data lands, we assign the new data to the corresponding class.
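
The idea of preferring the line with the widest gap can be sketched as follows. This does not train an SVM. It only compares two hand-picked candidate lines on made-up points, and all names and numbers are assumptions for illustration.

```python
import math

def signed_distance(w, b, point):
    """Signed distance from a 2-D point to the line w[0]*x + w[1]*y + b = 0."""
    return (w[0] * point[0] + w[1] * point[1] + b) / math.hypot(w[0], w[1])

def margin(w, b, group_a, group_b):
    """Distance from the line to its closest point, with group_a expected on the
    negative side and group_b on the positive side; -1.0 if the line fails to
    separate the groups that way."""
    da = [signed_distance(w, b, p) for p in group_a]
    db = [signed_distance(w, b, p) for p in group_b]
    if not (max(da) < 0 < min(db)):
        return -1.0
    return min(-max(da), min(db))

def classify(w, b, point):
    """Assign a new point to a class by the side of the line it lands on."""
    return "B" if signed_distance(w, b, point) > 0 else "A"

group_a = [(1, 1), (2, 1), (1, 2)]
group_b = [(4, 4), (5, 4), (4, 5)]

# Two hypothetical candidate lines, both of which separate the groups.
m_centered = margin((1.0, 1.0), -5.5, group_a, group_b)  # x + y = 5.5
m_off = margin((1.0, 1.0), -3.5, group_a, group_b)       # x + y = 3.5, close to group A
```

Both lines classify the training points correctly, but the centered one leaves a larger gap to its closest points, so it is the one we would prefer.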

Naive Bayes

For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in
diameter. Even if these features depend on each other or upon the existence of the other features, a
naive Bayes classifier would consider all of these properties to independently contribute to the
probability that this fruit is an apple.

A Naive Bayesian model is easy to build and particularly useful for very large data sets.
Despite its simplicity, Naive Bayes can sometimes match or even outperform more sophisticated
classification methods.

Bayes’ theorem provides a way of calculating the posterior probability P(c|x) from P(c),
P(x) and P(x|c). Look at the equation below:

P(c|x) = P(x|c) * P(c) / P(x)

Here,
• P(c|x) is the posterior probability of class (target) given predictor (attribute).
• P(c) is the prior probability of class.
• P(x|c) is the likelihood, which is the probability of predictor given class.
• P(x) is the prior probability of predictor.
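
The terms above can be estimated from simple counts. The following sketch does this for a tiny made-up fruit table with two categorical predictors. The function names and the data are hypothetical, and no smoothing is applied, so it assumes that at least one class has a nonzero score for the query.

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Count class frequencies and per-class frequencies of each predictor value."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)   # (class, predictor index) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(label, i)][value] += 1
    return class_counts, feature_counts

def posterior(row, class_counts, feature_counts):
    """P(c|x) for every class c, treating the predictors as independent given c."""
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total                                 # prior P(c)
        for i, value in enumerate(row):
            score *= feature_counts[(c, i)][value] / n_c    # likelihood P(x_i|c)
        scores[c] = score
    evidence = sum(scores.values())                         # P(x), the normalizer
    return {c: s / evidence for c, s in scores.items()}

# Made-up training data: (color, shape) -> fruit.
rows = [("red", "round"), ("red", "round"), ("green", "round"),
        ("red", "long"), ("yellow", "long"), ("yellow", "round")]
labels = ["apple", "apple", "apple", "other", "other", "other"]

class_counts, feature_counts = train(rows, labels)
post = posterior(("red", "round"), class_counts, feature_counts)
```

For a red, round fruit, the sketch multiplies the prior of each class by the per-predictor likelihoods and then divides by their sum, which plays the role of P(x).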

KNN (K- Nearest Neighbors)

Things to consider before selecting KNN:

KNN is computationally expensive

Variables should be normalized; otherwise, variables with a larger range can bias it

Put more work into the pre-processing stage (such as outlier removal) before going for KNN
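
The normalization point is easy to see in a small sketch. The data below (age, income) are made up so that the class depends on age, and the function names are hypothetical. Without scaling, the income column dominates the distance.

```python
import math
from collections import Counter

def scale_row(row, bounds):
    """Min-max scale one row using per-column (min, max) bounds."""
    return [(v - lo) / (hi - lo) if hi > lo else 0.0
            for v, (lo, hi) in zip(row, bounds)]

def min_max_scale(rows):
    """Rescale every column to [0, 1]; also return the bounds for later queries."""
    bounds = [(min(col), max(col)) for col in zip(*rows)]
    return [scale_row(r, bounds) for r in rows], bounds

def knn_predict(train_rows, train_labels, query, k=3):
    """Majority label among the k training rows closest to the query."""
    # Every prediction scans the whole training set, which is why KNN is
    # computationally expensive at query time.
    order = sorted(range(len(train_rows)),
                   key=lambda i: math.dist(train_rows[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Made-up data: class "A" is young, class "B" is older; income is unrelated.
rows = [(25, 30000), (27, 90000), (30, 31000),
        (55, 32000), (60, 88000), (58, 29000)]
labels = ["A", "A", "A", "B", "B", "B"]
query = (28, 89000)

raw_prediction = knn_predict(rows, labels, query, k=3)
scaled_rows, bounds = min_max_scale(rows)
scaled_prediction = knn_predict(scaled_rows, labels, scale_row(query, bounds), k=3)
```

On the raw values, the 28-year-old is matched mostly on income and lands in class "B". After min-max scaling, age counts as much as income and the prediction becomes "A".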

K-Means

It is a type of unsupervised algorithm which solves the clustering problem.
Its procedure follows a simple and easy way to classify a given data set through
a certain number of clusters (assume k clusters). Data points inside a cluster are
homogeneous, and heterogeneous with respect to other clusters.

Remember figuring out shapes from ink blots? K-means is somewhat similar to this
activity. You look at the shape and spread to decipher how many different clusters /
populations are present!

How K-means forms clusters:

1. K-means picks k points, one for each cluster, known as centroids.
2. Each data point forms a cluster with the closest centroid, giving k clusters.
3. It finds the centroid of each cluster based on the existing cluster members. Here we
have new centroids.
4. As we have new centroids, repeat steps 2 and 3. Find the closest distance for each
data point from the new centroids and associate it with the new k clusters. Repeat this
process until convergence occurs, i.e. the centroids do not change.
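
The four steps above can be sketched in plain Python. This is a minimal illustration on made-up 2-D points. The initial centroids are simply two of the data points, which is an assumption of this example rather than a recommended way to choose them.

```python
import math

def assign(points, centroids):
    """Step 2: index of the closest centroid for every point."""
    return [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points]

def update(points, labels, centroids):
    """Step 3: move each centroid to the mean of its current members."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:                 # an empty cluster keeps its old centroid
            new_centroids.append(old)
            continue
        new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centroids

def k_means(points, centroids, max_iter=100):
    """Step 4: repeat steps 2 and 3 until the centroids stop changing."""
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
initial = [points[0], points[3]]        # step 1: k = 2 starting centroids
centroids, labels = k_means(points, initial)
```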

How to determine value of K:

In K-means, we have clusters and each cluster has its own centroid. The sum of squared
differences between the centroid and the data points within a cluster constitutes the
within-cluster sum of squares for that cluster. When the sum-of-squares values for all the
clusters are added, the result is the total within-cluster sum of squares for the cluster
solution.

We know that as the number of clusters increases, this value keeps decreasing, but
if you plot the result you may see that the sum of squared distances decreases
sharply up to some value of k, and then much more slowly after that. At this bend, we can
find the optimum number of clusters.
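
Here is a small sketch of that calculation on made-up points that form three visible groups. The compact k_means helper uses the first k points as starting centroids, which is an arbitrary choice for this example, and all names are hypothetical.

```python
import math

def k_means(points, k, max_iter=100):
    """Plain k-means with the first k points as initial centroids."""
    centroids = list(points[:k])
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

def total_within_ss(points, centroids, labels):
    """Sum of squared distances from each point to its own cluster centroid."""
    return sum(math.dist(p, centroids[label]) ** 2
               for p, label in zip(points, labels))

# Made-up data: three groups, interleaved so the first k points are spread out.
points = [(1.0, 1.0), (5.0, 5.0), (9.0, 1.0),
          (1.5, 1.0), (5.0, 5.5), (9.0, 1.5),
          (1.0, 1.5), (5.5, 5.0), (9.5, 1.0)]

wss = {}
for k in range(1, 6):
    centroids, labels = k_means(points, k)
    wss[k] = total_within_ss(points, centroids, labels)
```

For this data the total drops steeply up to k = 3 and only slightly afterwards, so the bend sits at k = 3.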

Random Forest
Random Forest is a trademarked term for an ensemble of decision trees. In Random
Forest, we have a collection of decision trees (hence the name “Forest”). To classify a new
object based on its attributes, each tree gives a classification, and we say the tree
“votes” for that class. The forest chooses the classification having the most votes
(over all the trees in the forest).

Each tree is planted & grown as follows:

1. If the number of cases in the training set is N, then a sample of N cases is taken at
random, with replacement. This sample will be the training set for growing the
tree.
2. If there are M input variables, a number m<<M is specified such that at each node,
m variables are selected at random out of the M and the best split on these m is
used to split the node. The value of m is held constant while the forest is grown.
3. Each tree is grown to the largest extent possible. There is no pruning.
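
The three steps and the voting can be sketched in plain Python as below. This is a toy illustration under several assumptions: the data and all function names are made up, splits are scored with Gini impurity, a node becomes a leaf when none of its m sampled variables improves the impurity, and with only M = 3 variables the choice m = 2 is not really m<<M.

```python
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, features):
    """Best (feature, threshold) among the given features, or None if none helps."""
    best, best_score = None, gini(labels)
    for f in features:
        for t in sorted({r[f] for r in rows})[:-1]:   # skip the max: both sides non-empty
            left = [l for r, l in zip(rows, labels) if r[f] <= t]
            right = [l for r, l in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def grow_tree(rows, labels, m, rng):
    """Steps 2 and 3: m random variables per node, grown without pruning."""
    if len(set(labels)) == 1:
        return labels[0]
    split = best_split(rows, labels, rng.sample(range(len(rows[0])), m))
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(r, l) for r, l in zip(rows, labels) if r[f] <= t]
    right = [(r, l) for r, l in zip(rows, labels) if r[f] > t]
    return (f, t,
            grow_tree([r for r, _ in left], [l for _, l in left], m, rng),
            grow_tree([r for r, _ in right], [l for _, l in right], m, rng))

def predict_tree(tree, row):
    while isinstance(tree, tuple):        # internal nodes are tuples, leaves are labels
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def grow_forest(rows, labels, n_trees, m, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]    # step 1: N cases with replacement
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], m, rng))
    return forest

def predict_forest(forest, row):
    """Each tree votes; the class with the most votes wins."""
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]

# Made-up data: the first two variables separate the classes, the third does not.
rows = [(1.0, 1.2, 5.0), (1.5, 0.8, 3.0), (0.9, 1.1, 4.0), (1.2, 1.4, 6.0),
        (4.0, 4.2, 5.0), (4.5, 3.8, 3.0), (3.9, 4.1, 4.0), (4.2, 4.4, 6.0)]
labels = ["A"] * 4 + ["B"] * 4

forest = grow_forest(rows, labels, n_trees=25, m=2)
vote_low = predict_forest(forest, (0.8, 0.7, 4.0))
vote_high = predict_forest(forest, (4.6, 4.5, 4.0))
```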
