
Unit - I

Introduction to machine learning, scope and limitations, regression, probability, statistics and linear
algebra for machine learning, convex optimization, data visualization, hypothesis function and testing,
data distributions, data pre-processing, data augmentation, normalizing data sets, machine learning
models, supervised and unsupervised learning.

Machine Learning
➢ Machine learning (ML) is a branch of artificial intelligence (AI) and computer science that
focuses on using data and algorithms to enable AI to imitate the way that humans learn,
gradually improving its accuracy.
➢ Machine learning is a type of AI that teaches machines how to learn, interpret and
predict results based on a set of data.

The three basic ingredients of machine learning

There are three basic functional ingredients of ML.

1. Data: The dataset you want to use must be well-structured and accurate. The data you use can
be labeled or unlabeled. Unlabeled data are raw sample items — e.g. photos, videos, news
articles — with no additional explanation attached, while labeled data are augmented: a tag or
explanation is added to each item.

2. Algorithm: there are different types of algorithms that can be used (e.g. linear regression,
logistic regression). Choosing the right algorithm is a combination of business need,
specification, experimentation and the time available.

3. Model: A “model” is the output of a machine learning algorithm run on data. It represents the
rules, numbers, and any other algorithm-specific data structures required to make predictions.

Use of machine learning

➢ The function of a machine learning system can be descriptive, meaning that the system uses
the data to explain what happened; predictive, meaning the system uses the data to predict
what will happen; or prescriptive, meaning the system will use the data to make suggestions
about what action to take.
➢ What is training data?
Training data is the initial data used to train machine learning models. Training datasets are fed
to machine learning algorithms so that they can learn to make predictions or perform a desired
task. This type of data is key, because it helps machines achieve results and work in the right
way.



Applications of Machine Learning

Future Scope of Machine Learning

1. Robotics
2. Computer vision
3. Quantum computing
4. Automotive Industry
5. Digital protection

Limitations of Machine Learning

1. Dependence on Data Quality


2. Lack of Transparency
3. Limited Applicability
4. High Computational Costs
5. Data Privacy Concerns
6. Ethical Considerations

Regression
➢ Regression is a method for understanding the relationship between independent variables or
features and a dependent variable or outcome. Outcomes can then be predicted once the
relationship between independent and dependent variables has been estimated.



Types of regression

1. Linear Regression
➢ Linear regression is a statistical regression method which is used for predictive analysis.
➢ It is one of the simplest algorithms; it works on regression problems and shows the
relationship between continuous variables.
➢ It is used for solving regression problems in machine learning.
➢ Linear regression shows the linear relationship between the independent variable (X-axis) and
the dependent variable (Y-axis), hence the name linear regression.
➢ If there is only one input variable (x), then such linear regression is called simple linear
regression. And if there is more than one input variable, then such linear regression is
called multiple linear regression.
➢ For example, a linear regression model might predict the salary of an employee on the basis
of years of experience.

➢ Below is the mathematical equation for Linear regression:


➢ Y= aX+b



Here, Y = dependent variables (target variables),
X= Independent variables (predictor variables),
a and b are the linear coefficients
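As a brief illustration, here is a minimal sketch of simple linear regression in Python (assuming scikit-learn is installed; the salary and experience values are made-up):

# Simple linear regression: predict salary from years of experience
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])             # years of experience (X)
y = np.array([30000, 35000, 41000, 46000, 52000])   # salary (Y), illustrative values

model = LinearRegression().fit(X, y)
print("a =", model.coef_[0], "b =", model.intercept_)   # fitted coefficients of Y = aX + b
print(model.predict([[6]]))                             # predicted salary for 6 years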

➢ Some popular applications of linear regression are:

1. Analyzing trends and sales estimates


2. Salary forecasting
3. Real estate prediction
4. Arriving at ETAs in traffic.

2. Polynomial Regression
➢ Polynomial Regression is a type of regression which models the non-linear dataset using a
linear model.
➢ It is similar to multiple linear regression, but it fits a non-linear curve between the value of x and
corresponding conditional values of y.
➢ Suppose there is a dataset whose datapoints follow a non-linear pattern; in such a case, linear
regression will not fit those datapoints well. To cover such datapoints, we need polynomial
regression.
➢ In polynomial regression, the original features are transformed into polynomial features of a given
degree and then modeled using a linear model, which means the datapoints are best fitted
using a polynomial curve.

➢ The equation for polynomial regression is also derived from the linear regression equation: the
linear regression equation Y = b0 + b1x is transformed into the polynomial regression equation
Y = b0 + b1x + b2x² + b3x³ + ... + bnxⁿ.
➢ Here Y is the predicted/target output, and b0, b1, ..., bn are the regression coefficients; x is
our independent/input variable.
➢ The model is still linear because it is linear in the coefficients, even though the features
(x², x³, ...) are non-linear in x.



Note: This is different from Multiple Linear regression in such a way that in Polynomial regression, a
single element has different degrees instead of multiple variables with the same degree.
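A minimal sketch of polynomial regression, fitting a linear model on polynomial features (assuming scikit-learn is installed; the quadratic data is made-up):

# Polynomial regression: transform features to a given degree, then fit a linear model
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

X = np.linspace(0, 5, 20).reshape(-1, 1)
y = 2 + 3 * X.ravel() + 1.5 * X.ravel() ** 2    # y = b0 + b1*x + b2*x^2

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.predict([[2.5]]))   # close to 2 + 3*2.5 + 1.5*2.5^2 = 18.875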

3. Support Vector Regression


➢ Support Vector Machine (SVM) is a supervised learning algorithm which can be used for regression
as well as classification problems. When we use it for regression problems, it is termed
Support Vector Regression (SVR).
➢ Support Vector Regression is a regression algorithm which works for continuous variables.
➢ Below are some keywords which are used in Support Vector Regression:
1. Kernel: a function used to map lower-dimensional data into a higher-dimensional space.

2. Hyperplane: in general SVM, it is a separation line between two classes, but in SVR, it is the line
which helps to predict the continuous variable and covers most of the datapoints.

3. Boundary lines: the two lines on either side of the hyperplane, which create a margin
for the datapoints.

4. Support vectors: the datapoints which are nearest to the hyperplane and the boundary lines.

➢ In SVR, we always try to determine a hyperplane with a maximum margin, so that the maximum
number of datapoints is covered within that margin. The main goal of SVR is to include the
maximum number of datapoints within the boundary lines, and the hyperplane (best-fit line) must
contain the maximum number of datapoints.

(In the usual illustration, the central line is the hyperplane and the two lines on either side of it are the boundary lines.)
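A minimal SVR sketch (assuming scikit-learn is installed; the sine-shaped data is made-up). Here epsilon controls the width of the margin around the hyperplane:

# Support Vector Regression with an RBF kernel
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(40, 1)), axis=0)
y = np.sin(X).ravel()

svr = SVR(kernel="rbf", C=100, epsilon=0.1)   # the kernel maps data to a higher dimension
svr.fit(X, y)
print(svr.predict([[2.0]]))   # approximately sin(2.0) ≈ 0.91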

4. Decision Tree Regression


➢ Decision tree is a supervised learning algorithm which can be used for solving both
classification and regression problems.
➢ It can solve problems for both categorical and numerical data.
➢ Decision tree regression builds a tree-like structure in which each internal node represents a
"test" on an attribute, each branch represents the result of the test, and each leaf node
represents the final decision or result.



➢ A decision tree is constructed starting from the root node/parent node (the dataset), which splits
into left and right child nodes (subsets of the dataset). These child nodes are further divided,
themselves becoming the parent nodes of their children.
➢ For example, a decision tree regression model might predict a person's choice between a
sports car and a luxury car.
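A minimal decision tree regression sketch (assuming scikit-learn is installed; the data is made-up):

# Decision tree regression: the tree splits the data at internal "test" nodes
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2, 5.8])

tree = DecisionTreeRegressor(max_depth=2, random_state=0)   # max_depth limits the splits
tree.fit(X, y)
print(tree.predict([[3.5]]))   # prediction = mean of the training targets in the matching leaf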

5. Random Forest Regression


➢ Random forest is one of the most powerful supervised learning algorithms which is capable of
performing regression as well as classification tasks.
➢ Random forest regression is an ensemble learning method which combines multiple
decision trees and predicts the final output as the average of the individual tree outputs. The
combined decision trees are called base models, and for n trees the ensemble can be represented
more formally as: g(x) = (f0(x) + f1(x) + f2(x) + ... + fn-1(x)) / n.
➢ Random forest uses the Bagging (Bootstrap Aggregation) technique of ensemble learning, in
which the aggregated decision trees run in parallel and do not interact with each other.
➢ With the help of random forest regression, we can prevent overfitting in the model by creating
random subsets of the dataset.
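A minimal random forest regression sketch (assuming scikit-learn is installed; the dataset is synthetic):

# Random forest: many decision trees trained on bootstrap samples, predictions averaged
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=4, noise=10, random_state=0)

forest = RandomForestRegressor(n_estimators=100, random_state=0)  # 100 base trees
forest.fit(X, y)
print(forest.predict(X[:3]))   # averaged output of all trees for the first 3 samples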



6. Ridge Regression
➢ Ridge regression is one of the most robust versions of linear regression, in which a small amount
of bias is introduced so that we can get better long-term predictions.
➢ The amount of bias added to the model is known as the ridge regression penalty. The penalty
term is computed by multiplying lambda (λ) by the squared weight of each individual feature.
➢ The equation (cost function) for ridge regression will be: Cost = Σ(yi − ŷi)² + λ Σ wj²

➢ A general linear or polynomial regression will fail if there is high collinearity between the
independent variables; to solve such problems, ridge regression can be used.
➢ Ridge regression is a regularization technique, which is used to reduce the complexity of the
model. It is also called L2 regularization.
➢ It helps to solve problems where we have more parameters than samples.

7. Lasso Regression
➢ Lasso regression is another regularization technique to reduce the complexity of the model.
➢ It is similar to ridge regression except that the penalty term contains the absolute values of the
weights instead of their squares.
➢ Since it takes absolute values, it can shrink a slope all the way to 0, whereas ridge regression
can only shrink it close to 0.
➢ It is also called L1 regularization. The equation (cost function) for Lasso regression will be:
Cost = Σ(yi − ŷi)² + λ Σ |wj|
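A minimal sketch comparing the two penalties (assuming scikit-learn is installed; its alpha parameter plays the role of λ):

# Ridge (L2) vs Lasso (L1) on a synthetic dataset
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=10, noise=5, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

print(ridge.coef_)   # weights shrunk towards 0, but not exactly 0
print(lasso.coef_)   # some weights shrunk exactly to 0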

Probability and statistics

➢ Probability and statistics both are the most important concepts for Machine Learning.
Probability is about predicting the likelihood of future events, while statistics involves the
analysis of the frequency of past events.

Probability in Machine Learning

➢ Probability is the bedrock of ML; it tells how likely an event is to occur. The value of a
probability always lies between 0 and 1. It is a core concept as well as a primary prerequisite
for understanding ML models and their applications.
➢ Probability can be calculated as the number of times the event occurs divided by the total
number of possible outcomes. Suppose we toss a coin; then the probability of getting a
head as the outcome can be calculated with the formula below:



➢ P (H) = Number of ways a head can occur / total number of possible outcomes
P (H) = ½
P (H) = 0.5
Where; P (H) = Probability of getting a head as the outcome when tossing a coin.

Types of Probability

➢ For a better understanding of probability, it can be categorized into the following types:
Empirical Probability: the number of times the event occurs divided by the total number of
incidents observed.
Theoretical Probability: the number of ways the particular event can occur divided by the total
number of possible outcomes.
Joint Probability: It tells the probability of two random events occurring simultaneously. For two
independent events A and B:
P(A ∩ B) = P(A) · P(B)
Where; P(A ∩ B) = Probability of events A and B both occurring.
P (A) = Probability of event A
P (B) = Probability of event B
Conditional Probability: It is given by the Probability of event A given that event B occurred.

➢ The Probability of an event A conditioned on an event B is denoted and defined as;


P(A|B) = P(A ∩ B) / P(B)
Similarly,
P(B|A) = P(A ∩ B) / P(A). In general, we can write the joint probability of A and B as
P(A ∩ B) = P(A) · P(B|A), which means: "the chance of both things happening is the chance that
the first one happens, times the chance that the second one happens given that the first has happened."
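A small simulation sketch of the conditional probability formula (the dice events A and B here are our own illustrative choices):

# Estimate P(A|B) = P(A ∩ B) / P(B) by simulating two dice
# A = "the sum is 8", B = "the first die is even"
import random

random.seed(0)
n, count_b, count_ab = 100_000, 0, 0
for _ in range(n):
    d1, d2 = random.randint(1, 6), random.randint(1, 6)
    if d1 % 2 == 0:
        count_b += 1
        if d1 + d2 == 8:
            count_ab += 1

print(count_ab / count_b)   # close to the exact value 3/18 ≈ 0.167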

Statistics in Machine Learning

➢ Statistics is also considered a foundation of machine learning; it deals with finding
answers to the questions that we have about data. In general, we can define statistics as:
➢ Statistics is the part of applied mathematics that deals with studying and developing methods for
gathering, analyzing, interpreting and drawing conclusions from empirical data. It can be used
to make better-informed business decisions.
➢ Statistics can be categorized into 2 major parts. These are as follows:

1. Descriptive Statistics

2. Inferential Statistics



Use of Statistics in ML

➢ Statistical methods are used to understand the training data as well as to interpret the results of
testing different machine learning models. Further, statistics can be used to make better-
informed business and investing decisions.

Linear Algebra

➢ Linear Algebra is an essential field of mathematics, which defines the study of vectors, matrices,
planes, mappings, and lines required for linear transformations.
➢ Linear algebra plays a vital role and is a key foundation in machine learning; it enables ML
algorithms to run on huge datasets.
➢ The concepts of linear algebra are widely used in developing machine learning algorithms.
Although it is used in almost every area of machine learning, it specifically supports the
following tasks:

1. Optimization of data.

2. Loss functions, regularization, covariance matrices, Singular Value Decomposition (SVD),
matrix operations, and support vector machine classification.

3. Implementation of linear regression in machine learning.

➢ Besides the above uses, linear algebra is also used in neural networks and the data science
field.
➢ Basic mathematics principles and concepts like Linear algebra are the foundation of Machine
Learning and Deep Learning systems. To learn and understand Machine Learning or Data
Science, one needs to be familiar with linear algebra and optimization theory.

Why learn Linear Algebra before learning Machine Learning?

➢ Linear algebra is to machine learning what flour is to a bakery: just as a cake is based on flour,
every machine learning model is based on linear algebra. Further, a cake also needs more
ingredients, like eggs, sugar, cream and soda; similarly, machine learning also requires further
concepts, such as vector calculus, probability, and optimization theory. So we can say that
machine learning creates a useful model with the help of the above-mentioned mathematical
concepts.
➢ Below are some benefits of learning Linear Algebra before Machine learning:
1. Better Graphic experience
2. Improved Statistics
3. Creating better Machine Learning algorithms
4. Estimating the forecast of Machine Learning
5. Easy to Learn
Better Graphics Experience:



➢ Linear algebra helps to provide better graphical processing in machine learning, such as image,
audio, video, and edge detection. These are the various graphical representations supported
by machine learning projects that you can work on. Further, parts of a given dataset are
trained based on their categories by classifiers provided by machine learning algorithms; these
classifiers also remove errors from the trained data.
➢ Moreover, linear algebra helps to solve and compute large and complex datasets through
matrix decomposition techniques. The two most popular matrix decomposition techniques are:

1. QR decomposition (Q-R)

2. LU decomposition (L-U)
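A minimal sketch of both decompositions (assuming NumPy and SciPy are installed):

# Q-R and L-U matrix decompositions
import numpy as np
from scipy.linalg import lu

A = np.array([[4.0, 3.0], [6.0, 3.0]])

Q, R = np.linalg.qr(A)   # Q-R: orthogonal Q times upper-triangular R
P, L, U = lu(A)          # L-U: permutation P, lower-triangular L, upper-triangular U

print(np.allclose(Q @ R, A))       # True: the factors reconstruct A
print(np.allclose(P @ L @ U, A))   # True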

Improved Statistics:
➢ Statistics is an important concept for organizing and integrating data in machine learning.
Linear algebra helps in understanding the concepts of statistics in a better manner, and advanced
statistical topics can be integrated using the methods, operations, and notation of linear algebra.

Creating better Machine Learning algorithms:


➢ Linear Algebra also helps to create better supervised as well as unsupervised Machine
Learning algorithms.
➢ A few supervised learning algorithms that can be created using linear algebra are as follows:

1. Logistic Regression

2. Linear Regression

3. Decision Trees
4. Support Vector Machines (SVM)
➢ Further, below are some unsupervised learning algorithms that can also be created with
the help of linear algebra:

1. Singular Value Decomposition (SVD)

2. Clustering

3. Principal Component Analysis

➢ With the help of linear algebra concepts, you can also customize the various parameters of a
live project yourself, and gain the in-depth knowledge needed to deliver results with more
accuracy and precision.
Estimating the forecast of Machine Learning:
➢ If you are working on a machine learning project, you must be broad-minded and able to
consider multiple perspectives. In this regard, you should increase your awareness of and
affinity for machine learning concepts. You can begin by setting up different graphs and
visualizations, using various parameters for diverse machine learning algorithms, or taking up
things that others around you might find difficult to understand.

Easy to Learn:

➢ Linear algebra is a branch of mathematics that is easy to understand. It is taken into
consideration whenever there is a requirement for advanced mathematics and its
applications.

Minimum Linear Algebra for Machine Learning


➢ Notation: Notation in linear algebra enables you to read algorithm descriptions in papers, books,
and websites to understand the algorithm's working. Even if you use for-loops rather than matrix
operations, you will be able to piece things together.
➢ Operations: Working with an advanced level of abstractions in vectors and matrices can make
concepts clearer, and it can also help in the description, coding, and even thinking capability.
In linear algebra, it is required to learn the basic operations such as addition, multiplication,
inversion, transposing of matrices, vectors, etc.
➢ Matrix Factorization: One of the most recommended areas of linear algebra is matrix
factorization, specifically matrix decomposition methods such as SVD and QR.
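A minimal sketch of the basic operations mentioned above, using NumPy (assuming NumPy is installed):

# Addition, multiplication, transposition and inversion of matrices
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])

print(A + B)             # element-wise addition
print(A @ B)             # matrix multiplication
print(A.T)               # transpose
print(np.linalg.inv(A))  # inverse (A must be non-singular)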

Examples of Linear Algebra in Machine Learning


Below are some popular examples of linear algebra in Machine learning:
1. Datasets and Data Files
2. Linear Regression
3. Recommender Systems
4. One-hot encoding
5. Regularization
6. Principal Component Analysis
7. Images and Photographs
8. Singular-Value Decomposition
9. Deep Learning
10. Latent Semantic Analysis

Convex Optimization

➢ Convex optimisation is the foundation of gradient descent, a well-liked optimisation technique


in machine learning. The parameters are updated using gradient descent in the direction of the
objective function's negative gradient. The size of each iteration's step depends on the learning
rate.
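A minimal gradient descent sketch on a simple convex objective, f(w) = (w − 3)²:

# Gradient descent: step in the direction of the negative gradient
def grad(w):
    return 2 * (w - 3)   # derivative of f(w) = (w - 3)^2

w, learning_rate = 0.0, 0.1
for _ in range(100):
    w -= learning_rate * grad(w)   # the learning rate sets the step size

print(w)   # converges to the global minimum at w = 3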



Data Visualization in Machine Learning

➢ Data Visualization is a crucial aspect of machine learning that enables analysts to understand
and make sense of data patterns, relationships, and trends. Through data visualization,
insights and patterns in data can be easily interpreted and communicated to a wider audience,
making it a critical component of machine learning.

Significance of Data Visualization in Machine Learning


➢ Data visualization helps machine learning analysts to better understand and analyze complex
data sets by presenting them in an easily understandable format. Data visualization is an
essential step in data preparation and analysis as it helps to identify outliers, trends, and
patterns in the data that may be missed by other forms of analysis.
➢ With the increasing availability of big data, it has become more important than ever to use data
visualization techniques to explore and understand the data. Machine learning algorithms work
best when they have high-quality and clean data, and data visualization can help to identify
and remove any inconsistencies or anomalies in the data.

Types of Data Visualization Approaches

➢ Machine learning may make use of a wide variety of data visualization approaches. These
include:

1. Line Charts: In a line chart, each data point is represented by a point on the graph, and
these points are connected by a line. We may find patterns and trends in the data across
time by using line charts. Time-series data is frequently displayed using line charts.



2. Scatter Plots: A quick and efficient method of displaying the relationship between two
variables is to use scatter plots. With one variable plotted on the x-axis and the other
variable drawn on the y-axis, each datapoint in a scatter plot is represented by a point on
the graph. We may use scatter plots to visualize data to find patterns, clusters, and outliers.

3. Bar Charts: Bar charts are a common way of displaying categorical data. In a bar chart,
each category is represented by a bar, with the height of the bar indicating the frequency
or proportion of that category in the data. Bar graphs are useful for comparing several
categories and seeing patterns over time.

4. Heat Maps: Heat maps are a type of graphical representation that displays data in a matrix
format. The value of the data point that each matrix cell represents determines its hue.
Heatmaps are often used to visualize the correlation between variables or to identify
patterns in time-series data.



5. Tree Maps: Tree maps are used to display hierarchical data in a compact format and are
useful in showing the relationship between different levels of a hierarchy.

6. Box Plots: Box plots are a graphical representation of the distribution of a set of data. In a
box plot, the median is shown by a line inside the box, while the box itself depicts the
interquartile range of the data. The whiskers extend from the box to the highest and lowest
values in the data, excluding outliers. Box plots can help us to identify the spread and
skewness of the data.
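A minimal sketch of three of these plot types (assuming Matplotlib is installed; the data is random):

# Line chart, scatter plot and box plot with Matplotlib
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].plot(np.arange(10), np.arange(10) ** 2)   # line chart: trend over a sequence
axes[1].scatter(rng.random(50), rng.random(50))   # scatter plot: two variables
axes[2].boxplot(rng.normal(size=100))             # box plot: spread and outliers
plt.show()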



Uses of Data Visualization in Machine Learning
Data visualization has several uses in machine learning. It can be used to:
1. Identify trends and patterns in data
2. Communicate insights to stakeholders
3. Monitor machine learning models

4. Improve data quality

Challenges in Data Visualization

1. Choosing the Right Visualization

2. Data Quality

3. Data Overload

4. Over-Emphasis on Aesthetics

5. Audience Understanding

6. Technical Expertise

Hypothesis function and testing

What is a Hypothesis?
➢ A hypothesis is defined as a supposition or proposed explanation based on insufficient
evidence or assumptions. It is just a guess based on some known facts but has not yet been
proven. A good hypothesis is testable, resulting in either true or false.
➢ Example: let's understand the hypothesis with a common example. A scientist claims that if
ultraviolet (UV) light can damage the eyes, then it may also cause blindness.



Hypothesis in Machine Learning (ML)

➢ The hypothesis is one of the commonly used concepts of statistics in Machine Learning. It is
specifically used in Supervised Machine learning, where an ML model learns a function that
best maps the input to corresponding outputs with the help of an available dataset.

➢ In supervised learning techniques, the main aim is to determine the possible hypothesis out of
hypothesis space that best maps input to the corresponding or correct outputs.
➢ There are some common methods given to find out the possible hypothesis from the Hypothesis
space, where hypothesis space is represented by uppercase-h (H) and hypothesis
by lowercase-h (h). These are defined as follows:
Hypothesis space (H):
➢ Hypothesis space is defined as a set of all possible legal hypotheses; hence it is also known
as a hypothesis set. It is used by supervised machine learning algorithms to determine the best
possible hypothesis to describe the target function or best maps input to output.
➢ It is often constrained by choice of the framing of the problem, the choice of model, and the
choice of model configuration.
Hypothesis (h):
➢ It is defined as the approximate function that best describes the target in supervised machine
learning algorithms. It is primarily based on data as well as bias and restrictions applied to data.
➢ Hence hypothesis (h) can be concluded as a single hypothesis that maps input to proper output
and can be evaluated as well as used to make predictions.
➢ Hypothesis testing is basically about an assumption that we make about a population parameter.
➢ Example: you claim that the class average is 40, or that boys are taller than girls.
➢ All such assumptions need a statistical method to prove them; we need a mathematical
conclusion that whatever we are assuming is true.

Data Distribution

➢ Data distribution refers to the way data values are spread or distributed in a dataset.
Understanding it provides valuable insights, informs decision-making, and ensures that
appropriate methods are used for statistical analysis and modeling.



➢ Data distribution in statistics is any population with data scattering or a spread of a range of
values. Statistics typically use various representations, such as charts, tables, histograms, and
box plots. With proper distribution, the raw data becomes more accessible to read and interpret.

Types of Data Distribution

➢ Data distribution can be divided into two types:

A) Continuous Data
➢ In simple terms, such data operates from one extreme to another, gauged on a scale such as
weight and temperature. Such type of data helps in gaining relevant information into trends,
patterns, and relationships typically not observed with other datasets. The continuous data is
categorized into several distributions, such as –
1. Normal distribution – This is the most common type of distribution, with a bell curve
centered on the mean and symmetric data points on both sides.
2. Log-normal distribution – In this distribution, the logarithm of the data points is normally
distributed. It is used with financial data, e.g. to predict future stock prices based on past
data.
3. F distribution: Helps in gauging data points over a broader range than the normal distribution,
with high variability.
4. Chi-square distribution: It analyzes the gap between observed data and expected results
and helps in identifying differences between two datasets.
5. Exponential distribution: Gauges data points with an exponential curve; it is often used to
model the time between events.
6. Non-normal distributions: These include the logistic and gamma distributions. They are usually
used when data is highly non-linear and does not fit the standard data distribution
categories.



B) Discrete Data
➢ It is the opposite of continuous data, which means that it varies in limit and a set range of values.
Examples of it are classroom strength, books in a shop, etc. Such information is generally
visualized through bar graphs. It has four types of distribution.
1. Binomial distribution: Applied to describe the quantified success or failure probability in a given
number of trials. For example, yes or no, heads or tails, right or wrong, etc.
2. Poisson data distribution: Used to define the event probability during a specific period with a
known rate but unknown occurrence.
3. Hypergeometric distribution: Similar to binomial distribution but with multiple items and without
replacement.
4. Geometric distribution: Models the number of failures before the first success in a series of
independent trials, each with the same known success probability.
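A minimal sketch drawing samples from some of these distributions (assuming NumPy is installed):

# Sampling from continuous and discrete distributions
import numpy as np

rng = np.random.default_rng(0)
normal   = rng.normal(loc=0, scale=1, size=1000)   # continuous: normal (bell curve)
binomial = rng.binomial(n=10, p=0.5, size=1000)    # discrete: binomial (successes in 10 trials)
poisson  = rng.poisson(lam=3, size=1000)           # discrete: Poisson (events per period)

print(normal.mean(), binomial.mean(), poisson.mean())   # ≈ 0, 5 and 3 respectively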

Data Preprocessing in Machine learning

➢ Data preprocessing is the process of preparing raw data and making it suitable for a machine
learning model. It is the first and most crucial step when creating a machine learning model. When
creating a machine learning project, it is not always the case that we come across clean,
formatted data. And before doing any operation with data, it is mandatory to clean it and put it in
a formatted way.

Why do we need Data Preprocessing?


➢ Real-world data generally contains noise and missing values, and may be in an unusable format
which cannot be directly used by machine learning models. Data preprocessing is the required
task of cleaning the data and making it suitable for a machine learning model, which also
increases the accuracy and efficiency of the model.
➢ It involves the steps below (a code sketch follows the list):
1. Getting the dataset
2. Importing libraries
3. Importing datasets
4. Finding Missing Data
5. Encoding Categorical Data
6. Splitting dataset into training and test set
7. Feature scaling
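A minimal sketch of these steps (assuming pandas and scikit-learn are installed; the file name 'data.csv' and the column names 'age', 'country' and 'target' are hypothetical):

# Data preprocessing pipeline sketch
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("data.csv")                        # steps 1-3: get and import the dataset
df["age"] = df["age"].fillna(df["age"].mean())      # step 4: fill in missing data
df = pd.get_dummies(df, columns=["country"])        # step 5: encode categorical data

X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)  # step 6: split

scaler = StandardScaler()                           # step 7: feature scaling
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)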
Data Augmentation

➢ Data augmentation is a technique of artificially increasing the training set by creating modified
copies of a dataset using existing data. It includes making minor changes to the dataset or
using deep learning to generate new data points.



Augmented vs. Synthetic data
➢ Augmented data is derived from the original data with some minor changes. In the case of image
augmentation, we make geometric and color space transformations (flipping, resizing,
cropping, brightness, contrast) to increase the size and diversity of the training set.
➢ Synthetic data is generated artificially without using the original dataset. It often uses DNNs
(Deep Neural Networks) and GANs (Generative Adversarial Networks) to generate synthetic
data.
➢ Note: the augmentation techniques are not limited to images. You can augment audio, video,
text, and other types of data too.
➢ When Should You Use Data Augmentation?
1. To prevent models from overfitting.
2. The initial training set is too small.
3. To improve the model accuracy.
4. To reduce the operational cost of labeling and cleaning the raw dataset.

Limitations of Data Augmentation


➢ The biases in the original dataset persist in the augmented data.
➢ Quality assurance for data augmentation is expensive.
➢ Research and development are required to build a system with advanced applications. For
example, generating high-resolution images using GANs can be challenging.
➢ Finding an effective data augmentation approach can be challenging.

Data Augmentation Techniques

A) Audio Data Augmentation


1. Noise injection: add Gaussian or random noise to the audio dataset to improve model
performance.
2. Shifting: shift the audio left (fast forward) or right by a random number of seconds.
3. Changing the speed: stretches the time series by a fixed rate.
4. Changing the pitch: randomly change the pitch of the audio.

B) Text Data Augmentation


1. Word or sentence shuffling: randomly change the position of a word or sentence.
2. Word replacement: replace words with synonyms.
3. Syntax-tree manipulation: paraphrase the sentence using the same words.
4. Random word insertion: insert words at random.
5. Random word deletion: delete words at random.

C) Image Augmentation



1. Geometric transformations: randomly flip, crop, rotate, stretch, and zoom images. You need
to be careful about applying multiple transformations on the same images, as this can
reduce model performance.
2. Color space transformations: randomly change RGB color channels, contrast, and
brightness.
3. Kernel filters: randomly change the sharpness or blurring of the image.
4. Random erasing: delete some part of the initial image.
5. Mixing images: blending and mixing multiple images.
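A minimal image augmentation sketch (assuming PyTorch/torchvision and Pillow are installed; the file 'photo.jpg' is hypothetical):

# Random geometric and color space transformations with torchvision
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),               # geometric: flip
    transforms.RandomRotation(degrees=15),                # geometric: rotate
    transforms.RandomResizedCrop(size=224),               # geometric: crop and zoom
    transforms.ColorJitter(brightness=0.2, contrast=0.2), # color space: brightness/contrast
])

image = Image.open("photo.jpg")
augmented = augment(image)   # a new, randomly modified copy of the image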

Advanced Techniques

1. Generative adversarial networks (GANs): used to generate new data points or images. Once
trained, a GAN can produce synthetic samples that are not copies of the existing data.
2. Neural style transfer: a series of convolutional layers trained to deconstruct images and
separate content and style.

Normalization in Machine Learning

➢ Normalization is one of the most frequently used data preparation techniques; it helps us
change the values of numeric columns in the dataset to a common scale.
➢ Normalization is not mandatory for every dataset in machine learning; it is used
whenever the attributes of the dataset have different ranges. It helps to enhance the
performance and reliability of a machine learning model.
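For example, min-max normalization rescales each attribute to a common [0, 1] range with X' = (X − Xmin) / (Xmax − Xmin). A minimal sketch (assuming scikit-learn is installed):

# Min-max normalization of columns with very different ranges
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

print(MinMaxScaler().fit_transform(X))   # both columns now lie in [0, 1]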

Machine Learning Models

➢ A machine learning model is defined as a mathematical representation of the output of the
training process.

Supervised learning
➢ Supervised learning is a category of machine learning that uses labeled datasets to train
algorithms to predict outcomes and recognize patterns.



Unsupervised learning
➢ Unsupervised machine learning models are given unlabeled data and allowed to discover
patterns and insights without any explicit guidance or instruction.

Semi-supervised learning
➢ Semi-supervised learning is a type of machine learning that falls in between supervised and
unsupervised learning. It is a method that uses a small amount of labeled data and a large
amount of unlabeled data to train a model.

Reinforcement Learning
➢ Reinforcement learning (RL) is a machine learning technique that trains software to make
decisions in order to achieve optimal results. It mimics the trial-and-error learning process that
humans use to achieve their goals.

