Important Terms
A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z
Word Description
Accuracy
Accuracy is a metric for examining how good a machine learning model
is. Let us look at the confusion matrix to understand it better:
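As a minimal sketch, accuracy can be computed from the four cells of a confusion matrix as the share of correct predictions among all predictions. The function name and the counts below are hypothetical and chosen only for illustration.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that were correct."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


# Hypothetical confusion matrix counts (not taken from the glossary).
example_accuracy = accuracy(tp=40, tn=45, fp=5, fn=10)
print(example_accuracy)
```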
Autoregression
Since the regression model uses data from the same input variable
at previous time steps, it is referred to as an autoregression.
Word Description
Backpropagation
In neural networks, if the estimated output is far away from the
actual output (high error), we update the biases and weights based
on the error. This weight and bias updating process is known as
backpropagation. Backpropagation (BP) algorithms work by
determining the loss (or error) at the output and then propagating it
back into the network. The weights are updated to minimize the
error attributed to each neuron. The first step in minimizing the
error is to determine the gradient (derivative) of the error at the
final output with respect to each node.
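The idea can be sketched on a deliberately tiny network: one hidden value h = w1 * x feeding one output y_hat = w2 * h, trained with a squared-error loss. The network shape, the learning rate and the data point are assumptions made for illustration only.

```python
def train_step(w1, w2, x, y, lr):
    """One forward and backward pass for y_hat = w2 * (w1 * x)."""
    # Forward pass: compute the prediction and the loss.
    h = w1 * x
    y_hat = w2 * h
    loss = 0.5 * (y_hat - y) ** 2

    # Backward pass: propagate the error from the output towards the input.
    d_yhat = y_hat - y   # dLoss/dy_hat
    d_w2 = d_yhat * h    # dLoss/dw2
    d_h = d_yhat * w2    # error passed back to the hidden value
    d_w1 = d_h * x       # dLoss/dw1

    # Gradient descent update of both weights.
    return w1 - lr * d_w1, w2 - lr * d_w2, loss


w1, w2 = 0.5, 0.5
losses = []
for _ in range(200):
    w1, w2, loss = train_step(w1, w2, x=1.0, y=2.0, lr=0.1)
    losses.append(loss)

print(losses[0], losses[-1])
```

The error shrinks as the updates are repeated, because each weight is moved against its own gradient.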
Bagging
Bagging, or bootstrap aggregating, is a technique where multiple
models are created on subsets of the data, and the final predictions
are determined by combining the predictions of all the models.
Some of the algorithms that use the bagging technique are:
Bagging meta-estimator
Random Forest
Bar Chart
Bar charts are a type of graph used to display and compare the
number, frequency or other measure (e.g. mean) for different
discrete categories of data. They are used for categorical variables.
Simple example of a bar chart:
For example, let’s say a clinic wants to cure cancer in the patients
visiting the clinic.
A represents the event “Person has cancer”.
To understand Bayes’ Theorem in detail, refer here.
Bias-Variance Trade-off
The error emerging from any model can be broken down into
components mathematically.
Big Data
Big data is a term that describes the large volume of data – both
structured and unstructured. But it’s not the amount of data that’s
important. It’s how organizations use this large amount of data to
generate insights. Companies use various tools, techniques and
resources to make sense of this data to derive effective business
strategies.
Binary Variable
Binary variables are those variables which can have only two
unique values. For example, a variable “Smoking Habit” can
contain only two values like “Yes” and “No”.
where:
n : number of trials
r : number of successes
p : probability of success
1 – p : probability of failure
nCr : binomial coefficient, given by n!/(r!(n-r)!)
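These symbols are the ones used in the standard binomial probability, P(X = r) = nCr * p^r * (1 – p)^(n – r). A minimal sketch of that formula follows; the function name and the example values are assumptions for illustration.

```python
from math import comb


def binomial_pmf(n, r, p):
    """Probability of exactly r successes in n independent trials."""
    if not 0 <= r <= n:
        raise ValueError("r must lie between 0 and n")
    return comb(n, r) * p**r * (1 - p) ** (n - r)


# Hypothetical example: exactly 3 heads in 10 tosses of a fair coin.
prob = binomial_pmf(n=10, r=3, p=0.5)
print(prob)
```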
Boosting
AdaBoost
GBM
XGBM
LightGBM
CatBoost
Box Plot
It displays the full range of variation (from min to max), the likely
range of variation (the interquartile range), and a typical value (the
median). Below is a visualization of a box plot:
Word Description
Categorical Variable
Categorical variables (or nominal variables) are those variables which
have discrete qualitative values. For example, names of cities such as
Delhi, Mumbai and Kolkata are categorical. Read in detail here.
Concordant-Discordant Ratio
Concordant and discordant pairs are used to describe the relationship
between pairs of observations. To calculate the concordant and
discordant pairs, the data are treated as ordinal. The numbers of
concordant and discordant pairs are used in calculations for Kendall’s
tau, which measures the association between two ordinal variables.
Let’s say you had two movie reviewers rank a set of 5 movies:
Movie  Reviewer 1  Reviewer 2
A      1           1
B      2           2
C      3           4
D      4           3
E      5           6
The ranks given by reviewer 1 are listed in ascending order, so that
the rankings given by the two reviewers can be compared.
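As a rough sketch, the concordant and discordant pairs for the five movies above can be counted by comparing every pair of movies: a pair is concordant when both reviewers order the two movies the same way, and discordant when they disagree. The helper name is hypothetical, and the tau computed here is the simple no-ties form, (concordant - discordant) / (concordant + discordant).

```python
from itertools import combinations

# (reviewer 1 rank, reviewer 2 rank) for each movie, as listed above.
ranks = {"A": (1, 1), "B": (2, 2), "C": (3, 4), "D": (4, 3), "E": (5, 6)}


def count_pairs(rank_table):
    """Count concordant and discordant pairs of observations."""
    concordant = discordant = 0
    for (a1, a2), (b1, b2) in combinations(rank_table.values(), 2):
        product = (a1 - b1) * (a2 - b2)
        if product > 0:
            concordant += 1   # both reviewers agree on the order
        elif product < 0:
            discordant += 1   # the reviewers disagree on the order
    return concordant, discordant


concordant, discordant = count_pairs(ranks)
tau = (concordant - discordant) / (concordant + discordant)
print(concordant, discordant, tau)
```

Only the pair (C, D) is ordered differently by the two reviewers, so it is the single discordant pair.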
Continuous Variable
Continuous variables are those variables which can have an infinite
number of values but only in a specific range. For example, height is
a continuous variable. Read more here.
Convex Function
A real-valued function is called convex if the line segment between any
two points on the graph of the function lies above or on the graph.
Convex functions play an important role in many areas of mathematics.
They are especially important in the study of optimization problems,
where they are distinguished by a number of convenient properties.
Cost Function
Here,
h(x) is the prediction
y is the actual value
m is the number of rows in the training set
So let’s say, you increase the size of a particular shop, where you
predicted that the sales would be higher. But despite increasing the size,
the sales in that shop did not increase that much. So the cost applied in
increasing the size of the shop, gave you negative results. So, we need to
minimize these costs. Therefore we make use of cost function to
minimize the loss.
Where,
Word Description
Data mining is a study of extracting useful information from
structured/unstructured data taken from various sources. This is done
usually for
Data Transformation
Data transformation is the process of converting data from one form to
another. This is usually done at a preprocessing step.
For instance, replacing a variable x by the square root of x.
distance
the minimum number of points required to form a dense region
Decision Boundary
Decision Tree
Decile
Decile divides a series into 10 equal parts. For any series, there are 10
deciles, denoted by D1, D2, D3 … D10. These are known as First Decile,
Second Decile and so on.
For example, the diagram below shows the health scores of patients in the
range 0 to 60. Nine deciles split the patients into 10 groups.
Degree of Freedom
It is the number of variables that have the choice of having more than one
arbitrary value.
For example, in a sample of size 10 with mean 10, 9 values can be arbitrary
but the 10th value is forced by the sample mean. So, we can choose any
number for 9 values but the 10th value must be such that the mean is 10.
So, the degree of freedom in this case will be 9.
install.packages("dplyr")
Dummy Variable
Dummy variable is another name for a Boolean variable. A dummy variable
takes the value 0 or 1. For example, 0 means the value is true (i.e. age
< 25) and 1 means the value is false (i.e. age >= 25).
E
Word Description
Early Stopping
Early stopping is a technique for avoiding overfitting when training a
machine learning model with an iterative method. We set up early
stopping in such a way that when the performance has stopped
improving on the held-out validation set, the model training stops.
For example, in XGBoost, as you train more and more trees, you will
overfit your training dataset. Early stopping enables you to specify a
validation dataset and the number of iterations after which the
algorithm should stop if the score on your validation dataset has not
improved.
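The stopping rule itself can be sketched independently of any library. The function below is a hypothetical helper, not the XGBoost API: it scans a list of validation scores (higher is better) and reports the round at which training would stop once `patience` rounds have passed without improvement. The scores are made up for illustration.

```python
def early_stopping_round(val_scores, patience):
    """Return (round where training stops, round with the best score)."""
    best_score = float("-inf")
    best_round = -1
    for round_idx, score in enumerate(val_scores):
        if score > best_score:
            # The validation score improved: remember this round.
            best_score, best_round = score, round_idx
        elif round_idx - best_round >= patience:
            # No improvement for `patience` rounds: stop training.
            return round_idx, best_round
    return len(val_scores) - 1, best_round


# Hypothetical validation scores after each boosting round.
scores = [0.70, 0.74, 0.78, 0.79, 0.785, 0.788, 0.78, 0.77]
stopped_at, best_round = early_stopping_round(scores, patience=3)
print(stopped_at, best_round)
```

In practice the model from the best round, rather than the last round, is usually the one kept.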
EDA
2. Univariate analysis
3. Multivariate analysis
ETL
ETL is the acronym for Extract, Transform and Load. An ETL system
has the following properties:
Word Description
Factor Analysis
Factor analysis is a technique that is used to reduce a large number of
variables into a smaller number of factors. Factor analysis aims to find
independent latent variables. Factor analysis also makes several
assumptions:
False Negative
Points which are actually true but are incorrectly predicted as false. For
example, if the problem is to predict the loan status (Y: loan approved,
N: loan not approved), false negatives will be the samples for which the
loan was approved but the model predicted the status as not approved.
False Positive
Points which are actually false but are incorrectly predicted as true. For
example, if the problem is to predict the loan status (Y: loan approved,
N: loan not approved), false positives will be the samples for which the
loan was not approved but the model predicted the status as approved.
Feature Hashing
Term
John
Likes
To
Watch
Movies
Mary
Too
Also
Football
The array form for the same will be:
Feature Reduction
Feature reduction is the process of reducing the number of features in a
computation-intensive task without losing a lot of information.
PCA is one of the most popular feature reduction techniques, where we
combine correlated variables to reduce the features.
Feature Selection
Feature selection is the process of choosing those features which are
required to explain the predictive power of a statistical model and
dropping irrelevant features.
This can be done by either filtering out less useful features or by
combining features to make a new one.
Refer here.
Few-shot Learning
Few-shot learning refers to the training of machine learning algorithms using
a very small set of training data instead of a very large set. This is most
suitable in the field of computer vision, where it is desirable to have an object
categorization model work well without thousands of training examples.
Flume is a service designed for streaming logs into the Hadoop environment.
It can collect and aggregate huge amounts of log data from a variety of
sources. In order to collect high volume of data, multiple flume agents can be
configured.
Here are the major features of Apache Flume:
Statistics
Here, the sampling distributions of fixed size are taken. Then, the experiment
is theoretically repeated an infinite number of times but practically done with a
stopping intention. For example, I perform an experiment with the stopping
intention that I will stop the experiment when it has been repeated 1000
times or I have seen at least 300 heads in a coin toss. Read more here.
Word Description
Gated Recurrent Unit (GRU)
The GRU is a variant of the LSTM (Long Short Term Memory) and was
introduced by K. Cho. It retains the LSTM’s resistance to the vanishing
gradient problem, but because of its simpler internal structure it is faster
to train.
Instead of the input, forget, and output gates in the LSTM cell, the GRU
cell has only two gates, an update gate z, and a reset gate r. The update
gate defines how much previous memory to keep, and the reset gate
defines how to combine the new input with the previous memory.
Ggplot2
ggplot2 is a data visualization package for the R programming
language. It is a highly versatile and user-friendly tool for creating
attractive plots. To know more about ggplot2, visit here.
It can be easily installed using the following code from the R console:
install.packages("ggplot2")
Go
Memory safety
Garbage collection
Structural typing
The compiler and other tools originally developed by Google are all free
and open source. To read further on the Go language, refer here.
Goodness of Fit
The goodness of fit of a model describes how well it fits a set of
observations. Measures of goodness of fit typically summarize the
discrepancy between observed values and the values expected under the
model.
With regard to a machine learning algorithm, a good fit is when the error
for the model on the training data as well as the test data is minimum.
Over time, as the algorithm learns, the error for the model on the training
data goes down and so does the error on the test dataset. If we train for
too long, the error on the training dataset may continue to
decrease because the model is overfitting and learning the irrelevant
detail and noise in the training dataset. At the same time the error for the
test set starts to rise again as the model’s ability to generalize decreases.
So the point just before the error on the test dataset starts to increase,
where the model has good skill on both the training dataset and the
unseen test dataset, is known as the good fit of the model.
Word Description
Hadoop
Hadoop is an open source distributed processing framework used
when we have to deal with enormous data. It allows us to use
parallel processing capability to handle big data. Here are some
significant benefits of Hadoop:
Holdout Sample
While working on the dataset, a small part of the dataset is not
used for training the model; instead, it is used to check the
performance of the model. This part of the dataset is called the
holdout sample.
For instance, if I divide my data in two parts – 7:3 and use the
70% to train the model, and other 30% to check the performance
of my model, the 30% data is called the holdout sample.
Word Description
Imputation
Imputation is a technique used for handling missing values in the data.
This is done either by statistical metrics like mean/mode imputation or
by machine learning techniques like kNN imputation.
For example,
Name    Age
Akshay  23
Akshat  NA
Viraj   40
After mean imputation of the missing age:
Name    Age
Akshay  23
Akshat  31.5
Viraj   40
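The imputed value of 31.5 is the mean of the two observed ages, (23 + 40) / 2. A minimal sketch of mean imputation follows; representing the table as a dictionary and the helper name are assumptions for illustration.

```python
from statistics import mean

# The example table, with None standing in for the missing age (NA).
ages = {"Akshay": 23, "Akshat": None, "Viraj": 40}


def impute_mean(values):
    """Replace missing entries with the mean of the observed ones."""
    observed = [v for v in values.values() if v is not None]
    fill_value = mean(observed)
    return {k: (fill_value if v is None else v) for k, v in values.items()}


imputed = impute_mean(ages)
print(imputed)
```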
Word Description
Julia is a high-level, high-performance dynamic programming language for
numerical computing. Some important features of Julia are:
Word Description
K-Means
It is a type of unsupervised algorithm which solves the clustering problem.
It follows a simple procedure to partition a given data set into a certain
number of clusters (assume k clusters). Data points inside a cluster are
homogeneous, and heterogeneous to those in other clusters.
kNN
K nearest neighbors is a simple algorithm that stores all available cases
and classifies new cases by a majority vote of their k neighbors. A new
case is assigned to the class most common amongst its k nearest
neighbors, as measured by a distance function.
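A minimal sketch of this majority vote follows, assuming Euclidean distance as the distance function. The training points, labels and function name are hypothetical.

```python
from collections import Counter
from math import dist


def knn_classify(train, query, k):
    """Classify `query` by majority vote of its k nearest training cases."""
    # Sort the stored cases by Euclidean distance to the query point.
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical stored cases: (features, class label).
train = [
    ((1.0, 1.0), "red"),
    ((1.5, 2.0), "red"),
    ((2.0, 1.0), "red"),
    ((6.0, 6.0), "blue"),
    ((7.0, 7.5), "blue"),
]
label = knn_classify(train, (1.8, 1.6), k=3)
print(label)
```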
L
Word Description
Labeled Data
A labeled dataset has a meaningful “label”, “class” or “tag” associated
with each of its records or rows. For example, labels for a dataset of a
set of images might be whether an image contains a cat or a dog.
Labeled data are usually more expensive to obtain than the raw
unlabeled data because preparing the labeled data involves manually
labeling every piece of unlabeled data.
Line Chart
Line charts are used to display information as a series of points connected
by straight line segments. These charts are used to communicate
information visually, such as to show an increase or decrease in the
trend in data over intervals of time.
In the plot below, for each time instance, the speed trend is shown and
the points are connected to display the trend over time.
This plot is for a single case. Line charts can also be used to compare
changes over the same period of time for multiple cases, like plotting
the speed of a cycle, car, train over time in the same plot.
Linear Regression
Y = aX + b
where:
Y – Dependent variable
a – Slope
X – Independent variable
b – Intercept
These coefficients a and b are derived by minimizing the sum of the
squared distances between the data points and the regression line.
Look at the example below. Here we have identified the best fit line
having the linear equation y = 0.2811x + 13.9. Now, using this equation, we
can find the weight, knowing the height of a person.
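As a sketch of how a and b can be obtained, the function below uses the closed-form least-squares solution for a single independent variable. The data points are hypothetical; the last lines apply the glossary's own fitted equation to a height of 170, an assumed input value.

```python
def fit_line(xs, ys):
    """Least-squares slope a and intercept b for y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Hypothetical points lying exactly on y = 2x + 1.
a, b = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(a, b)

# Using the fitted line from the text to estimate a weight from a height.
predicted_weight = 0.2811 * 170 + 13.9
print(predicted_weight)
```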
Log Loss
Log loss, or logistic loss, is one of the evaluation metrics used to find
how good the model is. The lower the log loss, the better the model. Log
loss is the negative logarithm of the product of the probabilities the
model assigns to the true labels, usually averaged over the observations.
Mathematically, log loss for two classes is defined as:
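A minimal sketch of the usual two-class form, -(1/N) * sum(y*log(p) + (1 - y)*log(1 - p)), where y is the true label (0 or 1) and p is the predicted probability of class 1. The clipping constant `eps` and the example values are assumptions added to keep the logarithm finite and to illustrate the call.

```python
from math import log


def log_loss(y_true, p_pred, eps=1e-15):
    """Average negative log-likelihood for binary labels."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)   # avoid log(0)
        total += y * log(p) + (1 - y) * log(1 - p)
    return -total / len(y_true)


# Hypothetical predictions: the confident, correct model scores lower.
confident = log_loss([1, 0], [0.9, 0.1])
hesitant = log_loss([1, 0], [0.6, 0.4])
print(confident, hesitant)
```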
Long Short Term Memory (LSTM)
Long short-term memory (LSTM) units (or blocks) are a building unit
for layers of a recurrent neural network (RNN). A common LSTM unit
is composed of a cell, an input gate, an output gate and a forget gate.
The cell is responsible for “remembering” values over arbitrary time
intervals, hence the word “memory” in LSTM. Each of the three gates
can be thought of as a “conventional” artificial neuron, as in a
multi-layer neural network, that is, they compute an activation (using
an activation function) of a weighted sum. Applications of LSTM include:
Time series predictions
Speech recognition
Rhythm learning
Handwriting recognition
Word Description
Machine Learning
Machine learning refers to the techniques involved in dealing with vast
data in the most intelligent fashion (by developing algorithms) to derive
actionable insights. In these techniques, we expect the algorithms to learn
by themselves without being explicitly programmed.
Mahout
Mahout is an open source project from Apache that is used for creating
scalable machine learning algorithms. It implements popular machine
learning techniques such as recommendation, classification and clustering.
Features of Mahout:
Mahout offers a framework for doing data mining tasks on large
volumes of data
MapReduce
1. Map: each worker node applies the map function to the local data,
and writes the output to a temporary storage. A master node
ensures that only one copy of redundant input data is processed.
2. Shuffle: worker nodes redistribute data based on the output keys
(produced by the map function), such that all data belonging to one
key is located on the same worker node.
3. Reduce: worker nodes now process each group of output data, per
key, in parallel.
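The three steps can be sketched in a single process with a word count, the customary illustration. This is not Hadoop code: the function names and input chunks are hypothetical, and each chunk stands in for the local data of one worker node.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of data."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_phase(mapped_outputs):
    """Shuffle: group all values that belong to the same key."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}


chunks = ["John likes movies", "Mary likes football", "John likes football too"]
counts = reduce_phase(shuffle_phase(map_phase(chunk) for chunk in chunks))
print(counts)
```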
Maximum Likelihood Estimation
It is a method for finding the values of parameters which make the
likelihood maximum. The resulting values are called maximum likelihood
estimates (MLE).
Mean
For a dataset, the mean is the average value of all the numbers. It
can sometimes be used as a representation of the whole data.
For instance, if you have the marks of students from a class and you are
asked how well the class is performing, it would be impractical to list the
marks of every single student. Instead, you can find the mean of the class,
which will be representative of class performance.
To find the mean, sum all the numbers and then divide by the number of
items in the set.
For example, if the numbers are 1,2,3,4,5,6,7,8,8 then the mean would be
44/9 = 4.89.
Median
The median of a set of numbers is the middle value of the sorted set. When
the count of numbers in the set is even, the median is the average of the two
middle values. Median is used to measure the central tendency.
To calculate the median for a set of numbers, follow the steps below:
1. Arrange the numbers in ascending or descending order
2. Find the middle value: for an odd count n it is at position (n+1)/2, and
for an even count it is the average of the values at positions n/2 and n/2 + 1
Mode
4,5,2,8,4,7,6,4,6,3
So now we will calculate the number of times each value has appeared.
Value Count
2 1
3 1
4 3
5 1
6 2
7 1
8 1
So we see that the value 4 is repeating the most, i.e., 3 times. So, the mode
of this dataset will be 4.
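The same count can be sketched with a frequency table; the variable names are chosen only for illustration.

```python
from collections import Counter

data = [4, 5, 2, 8, 4, 7, 6, 4, 6, 3]

# Count how many times each value appears, then take the most frequent one.
counts = Counter(data)
mode_value, mode_count = counts.most_common(1)[0]
print(mode_value, mode_count)
```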
Model Selection
Model selection is the task of selecting a statistical model from a set of
known models. Various methods that can be used for choosing the model
are:
Exploratory Data Analysis
Scientific Methods
Multi-Class Classification
Problems which have more than two classes in the target variable are called
multi-class classification problems.
For example, suppose the target is to predict the quality of a product, which
can be excellent, good, average, fair or bad. In this case, the variable has 5
classes, hence it is a 5-class classification problem.
Word Description
Naive Bayes
It is a classification technique based on Bayes’ theorem with an
assumption of independence between predictors. In simple terms, a
Naive Bayes classifier assumes that the presence of a particular
feature in a class is unrelated to the presence of any other feature.
For example, a fruit may be considered to be an apple if it is red,
round and about 3 inches in diameter. Even if these features depend
on each other or upon the existence of the other features, a naive
Bayes classifier would consider all of these properties to
independently contribute to the probability that this fruit is an apple.
Normal Distribution
The normal distribution is the most important and most widely used
distribution in statistics. It is sometimes called the bell curve,
because of its characteristic bell shape. A binomial distribution
often looks similar to a normal distribution. The difference
between the two is that the normal distribution is continuous,
whereas the binomial distribution is discrete.
Normalization
Normalization is the process of rescaling your data so that they have
the same scale. Normalization is used when the attributes in our
data have varying scales.
For example, if you have one variable ranging from 0 to 1 and another
from 0 to 1000, you can normalize the variables so that both are in
the range 0 to 1.
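One common way to do this is min-max scaling, (value - min) / (max - min), which is assumed here since the entry does not name a specific method. The function name and values are hypothetical.

```python
def min_max_normalize(values):
    """Rescale values linearly so that they span the range 0 to 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        raise ValueError("all values are identical; cannot rescale")
    return [(v - lowest) / (highest - lowest) for v in values]


# Hypothetical variable ranging from 0 to 1000.
scaled = min_max_normalize([0, 250, 500, 1000])
print(scaled)
```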
Word Description
One Hot Encoding
One hot encoding is usually done in the preprocessing step. It is a
technique which converts categorical variables to numerical ones in an
interpretable format. In this, we create a Boolean column for each
category of the variable.
For example, if the data is
1 Vivek
2 Akshat
3 Arshad
This is converted as
1 1 0 0
2 0 1 0
3 0 0 1
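The conversion above can be sketched in a few lines: one column is created per distinct category, in order of first appearance, and each row gets a 1 in the column of its own category. The function name is hypothetical.

```python
def one_hot(values):
    """Return the category order and one 0/1 row per input value."""
    categories = list(dict.fromkeys(values))   # distinct, in first-seen order
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


names = ["Vivek", "Akshat", "Arshad"]
categories, encoded = one_hot(names)
print(categories)
print(encoded)
```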
Apache Oozie is the tool in which all sort of programs can be pipelined
in a desired order to work in Hadoop’s distributed environment. Oozie
also provides a mechanism to run the job at a given schedule.
It consists of two parts:
Features of Oozie:
Oozie has client API and command line interface which can be
used to launch, control and monitor job from Java application.
Using its Web Service APIs one can control jobs from
anywhere.
Oozie has provision to execute jobs which are scheduled to run
periodically.
Ordinal Variable
Ordinal variables are those variables which have discrete values with
some order involved. Refer here.
Outlier
An outlier is an observation that appears far away and diverges from the
overall pattern in a sample.
Overfitting
A model is said to overfit when it performs well on the train dataset but
fails on the test set. This happens when the model is too sensitive and
captures random patterns which are present only in the training dataset.
There are two methods to overcome overfitting:
Reduce the model complexity
Regularization
Word Description
Pandas is an open source, high-performance, easy-to-use data structure
and data analysis library for the Python programming language. Some
of the highlights of Pandas are:
Parameters
Parameters are a set of measurable factors that define a system. For
machine learning models, model parameters are internal variables
whose values can be determined from the data.
For instance, the weights in linear and logistic regression fall under the
category of parameters.
Pie Chart
This represents a pie graph showing the results of an exam. Each grade
is denoted by a “slice”. The total of the percentages is equal to 100.
The total of the arc measures is equal to 360 degrees. So 12% students
got A grade, 29% got B, and so on.
Pig
Pig is a high level scripting language that is used with Apache
Hadoop. Pig enables data workers to write complex data
transformations without knowing Java. Pig is complete, so one can do
all required data manipulations in Apache Hadoop with Pig. Through
the User Defined Functions(UDF) facility in Pig, Pig can invoke code
in many languages like JRuby, Jython and Java.
Key features of Pig:
Polynomial Regression
In this technique, a curve is fitted to the data points. In a polynomial
regression equation, the power of the independent variable is greater
than 1. Although higher degree polynomials give lower training error, they
might also result in over-fitting.
Pre-trained Model
A pre-trained model is a model created by someone else to solve a
similar problem. Instead of building a model from scratch to solve a
similar problem, you use the model trained on the other problem as a
starting point.
For example, suppose you want to build a self-learning car. You can spend
years building a decent image recognition algorithm from scratch, or
you can take the Inception model (a pre-trained model) from Google,
which was built on ImageNet data, to identify images in those pictures.
Word Description
Quartile
Quartile divides a series into 4 equal parts. For any series, there are 4
quartiles, denoted by Q1, Q2, Q3 and Q4. These are known as First
Quartile, Second Quartile and so on.
For example, the diagram below shows the health scores of patients in the
range 0 to 60. Quartiles divide the population into 4 groups.
R
Word Description
R is an open-source programming language and a software
environment for statistical computing, machine learning, and
data visualization.
Features of R:
Range
Range is the difference between the highest and the lowest value
of the population. It is used to measure the spread of the data. Let
us understand it with an example:
Suppose we have a dataset having 10 data points, listed below:
4,5,2,8,4,7,6,4,6,3
Now the range of this set is the difference between the highest(8)
and the lowest(2) value.
Range = 8-2 = 6
1. α = 0:
o The objective becomes same as simple linear
regression.
o We’ll get the same coefficients as simple linear
regression.
2. α = ∞:
o The coefficients will be zero. This is because of
infinite weightage on the square of coefficients:
anything other than zero will make the objective
infinite.
3. 0 < α < ∞:
o The magnitude of α will decide the weightage
given to different parts of objective.
o The coefficients will be somewhere between 0
and the ones for simple linear regression.
As you can see, the sensitivity at this threshold is 99.6% and the
(1-specificity) is ~60%. This coordinate becomes one point in our
ROC curve. To bring this curve down to a single number, we
find the area under this curve (AUC).
Note that the area of the entire square is 1*1 = 1. Hence, AUC itself
is the ratio of the area under the curve to the total area.
Root Mean
Squared Error
(RMSE)
Here,
Invariance
Word Description
Scala is a general purpose language that combines concepts of
object-oriented and functional programming languages. Here are
some key features of Scala
Semi-Supervised Learning
Problems where you have a large amount of input data (X) and only
some of the data is labeled (Y) are called semi-supervised learning
problems.
These problems sit in between supervised and unsupervised
learning.
A good example is a photo archive where only some of the images
are labeled (e.g. dog, cat, person) and the majority are unlabeled.
For example, if we only have two features like height and hair
length of an individual, we’d first plot these two variables in
two-dimensional space where each point has two coordinates (these
coordinates are known as Support Vectors). Now, we will find
some line that splits the data between the two differently classified
groups of data. This will be the line such that its distance from the
closest point in each of the two groups is as large as possible.
Word Description
TensorFlow
TensorFlow, developed by the Google Brain team, is an open source
software library. It is used for building machine learning models for a
range of tasks in data science, mainly for machine learning
applications such as building neural networks. TensorFlow can also
be used for non-machine learning tasks that require numerical
computation.
True Positive
These are the points which are actually true and which we have predicted
as true. For example, consider a case where we have to
predict whether a loan will be approved or not. Y represents that the
loan will be approved, whereas N represents that the loan will not be
approved. Here, the true positives will be the cases
which are actually Y and which we have predicted as Y as well.
Type I error
Type II error
Word Description
Underfitting
Underfitting occurs when a statistical model or machine learning
algorithm cannot capture the underlying trend of the data. It refers to
a model that can neither model the training data nor generalize to
new data. An underfit model is not a suitable model as it will have
poor performance on the training data.
Word Description
Variance
Variance is used to measure the spread of a given set of numbers and is
calculated as the average of squared distances from the mean.
Let’s take an example. Suppose the set of numbers we have is (600, 470,
170, 430, 300).
To calculate:
1) Find the mean of the set of numbers, which is (600 + 470 + 170 + 430 +
300) / 5 = 394
2) Subtract the mean from each value, which gives (206, 76, -224, 36, -94)
3) Square each deviation from the mean, which gives (42436, 5776, 50176,
1296, 8836)
4) Find the sum of squares, which is 108520
5) Divide by the total number of items (5), which gives 21704
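The five steps can be checked with a short sketch. It divides by the number of items, as the steps do, so it is the population variance; the function name is chosen for illustration.

```python
def population_variance(values):
    """Average of squared deviations from the mean (divides by n)."""
    mean_value = sum(values) / len(values)                # step 1
    deviations = [v - mean_value for v in values]         # step 2
    squared = [d ** 2 for d in deviations]                # step 3
    return sum(squared) / len(values), deviations         # steps 4 and 5


data = [600, 470, 170, 430, 300]
variance, deviations = population_variance(data)
print(deviations)
print(variance)
```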
Word Description
Z-test determines to what extent a data point is away from the mean of
the data set, in units of standard deviation. For example:
The principal at a certain school claims that the students in his school are
above average intelligence. A random sample of thirty students has a
mean IQ score of 112. The mean population IQ is 100 with a standard
deviation of 15. Is there sufficient evidence to support the principal’s
claim?
So we can make use of z-test to test the claims made by the principal.
Steps to perform z-test:
Here,
If the test statistic is greater than the z-score of rejection area, reject the
null hypothesis. If it’s less than that z-score, you cannot reject the null
hypothesis.
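For the principal's claim, the test statistic can be sketched with the standard one-sample formula z = (sample mean - population mean) / (standard deviation / sqrt(n)). The one-sided 5% significance level used for the rejection area is an assumption, as the example does not state one.

```python
from math import sqrt
from statistics import NormalDist


def z_statistic(sample_mean, pop_mean, pop_sd, n):
    """One-sample z statistic for a sample mean."""
    return (sample_mean - pop_mean) / (pop_sd / sqrt(n))


z = z_statistic(sample_mean=112, pop_mean=100, pop_sd=15, n=30)

# Assumed significance level: 5%, one-sided (claim is "above average").
critical_z = NormalDist().inv_cdf(0.95)
reject_null = z > critical_z
print(z, critical_z, reject_null)
```

Under that assumed level the statistic falls in the rejection area, so the null hypothesis would be rejected.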