Unit1 6thsemCS
Introduction to machine learning, scope and limitations, regression, probability, statistics and linear
algebra for machine learning, convex optimization, data visualization, hypothesis function and testing,
data distributions, data pre-processing, data augmentation, normalizing data sets, machine learning
models, supervised and unsupervised learning.
Machine Learning
➢ Machine learning (ML) is a branch of artificial intelligence (AI) and computer science that
focuses on using data and algorithms to enable AI to imitate the way that humans learn,
gradually improving its accuracy.
➢ Machine learning is a type of AI that teaches machines how to learn, interpret and
predict results based on a set of data.
1. Data: The dataset you want to use must be well-structured and accurate. The data you use
can be labeled or unlabeled. Unlabeled data are raw sample items (e.g. photos, videos, news
articles) with no additional explanation attached, while labeled data are augmented: each
item carries a tag or annotation that explains it.
2. Algorithm: There are different types of algorithms that can be used (e.g. linear regression,
logistic regression). Choosing the right algorithm is a combination of business need,
specification, experimentation, and available time.
3. Model: A “model” is the output of a machine learning algorithm run on data. It represents the
rules, numbers, and any other algorithm-specific data structures required to make predictions.
➢ The function of a machine learning system can be descriptive, meaning that the system uses
the data to explain what happened; predictive, meaning the system uses the data to predict
what will happen; or prescriptive, meaning the system will use the data to make suggestions
about what action to take.
➢ What are training data?
Training data is the initial data used to train machine learning models. Training datasets are fed
to machine learning algorithms so that they can learn to make predictions, or perform a desired
task. This type of data is key because it helps machines achieve results and work in the right
way. Areas where machine learning is commonly applied include:
1. Robotics
2. Computer vision
3. Quantum Processing
4. Automotive Industry
5. Digital protection
Regression
➢ Regression is a method for understanding the relationship between independent variables or
features and a dependent variable or outcome. Outcomes can then be predicted once the
relationship between independent and dependent variables has been estimated.
1. Linear Regression
➢ Linear regression is a statistical regression method which is used for predictive analysis.
➢ It is one of the simplest regression algorithms; it models the relationship between
continuous variables.
➢ It is used for solving the regression problem in machine learning.
➢ Linear regression shows the linear relationship between the independent variable (X-axis) and
the dependent variable (Y-axis), hence called linear regression.
➢ If there is only one input variable (x), then such linear regression is called simple linear
regression. And if there is more than one input variable, then such linear regression is
called multiple linear regression.
➢ The relationship between variables in a linear regression model can be illustrated with a
simple example: predicting the salary of an employee on the basis of years of
experience.
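As a sketch of how such a fit works, simple linear regression has a closed-form solution: the slope is the covariance of x and y divided by the variance of x. A minimal pure-Python example, using made-up salary figures (in thousands) against years of experience:

```python
# Simple linear regression via the closed-form least-squares solution.
def fit_simple_linear(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope b1 = covariance(x, y) / variance(x); intercept b0 = mean_y - b1 * mean_x
    b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
         sum((x - mean_x) ** 2 for x in xs)
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical data: years of experience vs. salary (in thousands)
years = [1, 2, 3, 4, 5]
salary = [30, 35, 40, 45, 50]
b0, b1 = fit_simple_linear(years, salary)
print(b0, b1)            # intercept 25.0, slope 5.0 for this perfectly linear data
predicted = b0 + b1 * 6  # predict salary for 6 years of experience
print(predicted)         # 55.0
```

The line Y = b0 + b1·x found here is the "best-fit" straight line through the datapoints.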
2. Polynomial Regression
➢ Polynomial Regression is a type of regression which models the non-linear dataset using a
linear model.
➢ It is similar to multiple linear regression, but it fits a non-linear curve between the value of x and
corresponding conditional values of y.
➢ Suppose a dataset consists of datapoints arranged in a non-linear fashion; in such a case,
linear regression will not fit those datapoints well. To cover such datapoints, we need
Polynomial Regression.
➢ In Polynomial regression, the original features are transformed into polynomial features of given
degree and then modeled using a linear model. Which means the datapoints are best fitted
using a polynomial line.
➢ The equation for polynomial regression is derived from the linear regression equation: the
linear equation Y = b0 + b1x is transformed into the polynomial equation
Y = b0 + b1x + b2x² + b3x³ + ... + bnxⁿ.
➢ Here Y is the predicted/target output, b0, b1, ..., bn are the regression coefficients, and x is
our independent/input variable.
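The feature transformation described above can be sketched in a few lines; the coefficient values below are illustrative, not fitted:

```python
# Transform a single feature x into polynomial features [1, x, x^2, ..., x^n];
# a linear model fitted on these columns yields polynomial regression.
def polynomial_features(x, degree):
    return [x ** d for d in range(degree + 1)]

# With coefficients b = [b0, b1, ..., bn], the prediction is the dot product:
# Y = b0 + b1*x + b2*x^2 + ... + bn*x^n
def predict(b, x):
    feats = polynomial_features(x, len(b) - 1)
    return sum(bi * f for bi, f in zip(b, feats))

b = [1.0, 2.0, 3.0]               # illustrative coefficients for Y = 1 + 2x + 3x^2
print(polynomial_features(2, 3))  # [1, 2, 4, 8]
print(predict(b, 2))              # 1 + 4 + 12 = 17.0
```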
➢ The model is still linear because the coefficients b0, ..., bn enter linearly; only the features
are raised to polynomial powers.
3. Support Vector Regression
➢ Support Vector Regression (SVR) applies the ideas of Support Vector Machines (SVM) to
regression problems. Its key terms are:
1. Kernel: A function that maps lower-dimensional data into a higher-dimensional space so
that a linear separator can be found.
2. Hyperplane: In general SVM, it is a separation line between two classes, but in SVR, it is a line
which helps to predict the continuous variables and cover most of the datapoints.
3. Boundary line: Boundary lines are the two lines apart from hyperplane, which creates a margin
for datapoints.
4. Support vectors: Support vectors are the datapoints closest to the hyperplane, lying on or
near the boundary lines; they determine the position of the hyperplane.
➢ In SVR, we always try to determine a hyperplane with a maximum margin, so that maximum
number of datapoints are covered in that margin. The main goal of SVR is to consider the
maximum datapoints within the boundary lines and the hyperplane (best-fit line) must contain
a maximum number of datapoints. Consider the below image:
Here, the hyperplane is the best-fit line, and the two lines running parallel to it on either side
are the boundary lines.
4. Decision Tree Regression
➢ Decision Tree regression predicts an outcome by repeatedly splitting the data on feature
values; for example, such a model can predict a person's choice between a sports car and a
luxury car.
6. Ridge Regression
➢ A general linear or polynomial regression will fail if there is high collinearity between the
independent variables, so to solve such problems, Ridge regression can be used.
➢ Ridge regression is a regularization technique used to reduce the complexity of the model.
It is also called L2 regularization, because it adds a penalty proportional to the square of the
coefficient magnitudes to the loss.
➢ It helps to solve the problems if we have more parameters than samples.
7. Lasso Regression
➢ Lasso regression is another regularization technique to reduce the complexity of the model.
➢ It is similar to Ridge Regression, except that the penalty term contains the absolute values
of the weights instead of their squares.
➢ Since it takes absolute values, it can shrink a slope exactly to 0, whereas Ridge Regression
can only shrink it close to 0.
➢ It is also called L1 regularization. The Lasso cost adds the penalty λ Σ|bj| to the sum of
squared errors: Cost = Σ(yi − ŷi)² + λ Σ|bj|.
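The different behaviour of the two penalties can be seen in how each shrinks a single coefficient, reduced here to a simplified one-dimensional sketch (lam is the regularization strength; this is not a full solver):

```python
def ridge_shrink(b, lam):
    # The L2 penalty scales the coefficient toward zero but never reaches it
    return b / (1 + lam)

def lasso_shrink(b, lam):
    # The L1 penalty (soft-thresholding) can set the coefficient exactly to zero
    sign = 1 if b >= 0 else -1
    return sign * max(abs(b) - lam, 0.0)

print(ridge_shrink(0.5, 1.0))  # 0.25 -- shrunk, but still non-zero
print(lasso_shrink(0.5, 1.0))  # 0.0  -- shrunk all the way to zero
```

This is why Lasso performs feature selection (small coefficients vanish entirely) while Ridge only dampens them.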
Probability and Statistics
➢ Probability and statistics are both essential concepts for Machine Learning.
Probability is about predicting the likelihood of future events, while statistics involves the
analysis of the frequency of past events.
➢ Probability is the bedrock of ML; it tells how likely an event is to occur. The value of
probability always lies between 0 and 1. It is a core concept as well as a primary prerequisite
for understanding ML models and their applications.
➢ Probability can be calculated as the number of times the event occurs divided by the total
number of possible outcomes. For example, when tossing a fair coin, the probability of
getting heads is:
P(head) = (favourable outcomes) / (total possible outcomes) = 1/2
Types of Probability
➢ For a better understanding, Probability can be categorized into the following types:
Empirical Probability: Empirical Probability can be calculated as the number of times the event
occurs divided by the total number of incidents observed.
Theoretical Probability: Theoretical Probability can be calculated as the number of ways the
particular event can occur divided by the total number of possible outcomes.
Joint Probability: It tells the probability of two random events occurring simultaneously. For
independent events:
P(A ∩ B) = P(A) · P(B)
Where: P(A ∩ B) = probability of events A and B both occurring
P(A) = probability of event A
P(B) = probability of event B
Conditional Probability: It is the probability of event A given that event B has occurred:
P(A | B) = P(A ∩ B) / P(B)
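The empirical and theoretical probabilities described above can be compared with a small coin-toss simulation (the seed is fixed only to make the run reproducible):

```python
import random

random.seed(42)  # fixed seed so the simulation is reproducible

# Theoretical probability of heads for a fair coin
theoretical = 1 / 2

# Empirical probability: number of times the event occurs / total trials
trials = 10_000
heads = sum(random.choice(["H", "T"]) == "H" for _ in range(trials))
empirical = heads / trials

print(theoretical)
print(empirical)  # close to 0.5, and it converges as the number of trials grows
```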
➢ Statistics is also considered as the base foundation of machine learning which deals with finding
answers to the questions that we have about data. In general, we can define statistics as:
➢ Statistics is the part of applied mathematics that deals with studying and developing methods
for gathering, analyzing, interpreting, and drawing conclusions from empirical data. It can be
used to make better-informed business decisions.
➢ Statistics can be categorized into 2 major parts. These are as follows:
1. Descriptive Statistics
2. Inferential Statistics
➢ Statistics methods are used to understand the training data as well as interpret the results of
testing different machine learning models. Further, Statistics can be used to make better-
informed business and investing decisions.
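The descriptive side of statistics can be illustrated with Python's standard statistics module on a small made-up dataset:

```python
import statistics

# Descriptive statistics summarize a dataset (illustrative values)
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(statistics.mean(data))    # 5    -- central tendency
print(statistics.median(data))  # 4.5
print(statistics.mode(data))    # 4    -- most frequent value
print(statistics.pstdev(data))  # 2.0  -- population standard deviation (spread)
```

Inferential statistics, by contrast, uses such summaries of a sample to draw conclusions about the wider population.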
➢ Linear Algebra is an essential field of mathematics, which defines the study of vectors, matrices,
planes, mapping, and lines required for linear transformation.
➢ Linear algebra plays a vital role and is a key foundation in machine learning; it enables ML
algorithms to run on huge datasets.
➢ The concepts of linear algebra are widely used in developing algorithms in machine learning.
Although it is used in almost every concept of Machine Learning, it specifically supports
tasks such as:
1. Optimization of data.
➢ Besides the above uses, linear algebra is also used in neural networks and the data science
field.
➢ Basic mathematics principles and concepts like Linear algebra are the foundation of Machine
Learning and Deep Learning systems. To learn and understand Machine Learning or Data
Science, one needs to be familiar with linear algebra and optimization theory.
Linear Algebra
➢ Linear Algebra is to Machine Learning what flour is to a bakery. Just as a cake is based
on flour, every Machine Learning model is based on Linear Algebra. A cake also needs more
ingredients, such as eggs, sugar, cream, and soda; similarly, Machine Learning
requires further concepts such as vector calculus, probability, and optimization theory. So we
can say that Machine Learning creates a useful model with the help of these
mathematical concepts.
➢ Below are some benefits of learning Linear Algebra before Machine learning:
1. Better Graphic experience
2. Improved Statistics
3. Creating better Machine Learning algorithms
4. Estimating the forecast of Machine Learning
5. Easy to Learn
Better Graphics Experience:
➢ Linear algebra underlies the graphical processing of data in machine learning, such as
plotting and projecting datasets; common matrix decompositions used here include:
1. Q-R (QR) decomposition
2. L-U (LU) decomposition
Improved Statistics:
➢ Statistics is an important concept to organize and integrate data in Machine Learning. Also,
linear Algebra helps to understand the concept of statistics in a better manner. Advanced
statistical topics can be integrated using methods, operations, and notations of linear algebra.
Creating better Machine Learning algorithms:
➢ Linear algebra helps in creating better supervised learning algorithms, such as:
1. Logistic Regression
2. Linear Regression
3. Decision Trees
4. Support Vector Machines (SVM)
➢ Further, below are some unsupervised learning algorithms listed that can also be created with
the help of linear algebra as follows:
1. Clustering
2. Principal Component Analysis
➢ With the help of Linear Algebra concepts, you can also customize the various parameters
of a live project yourself and gain the in-depth understanding needed to deliver results with
greater accuracy and precision.
Estimating the forecast of Machine Learning:
➢ If you are working on a Machine Learning project, you must be broad-minded and able to
bring in more perspectives; for this, you should build awareness of, and affinity for, Machine
Learning concepts.
Easy to Learn:
➢ Linear algebra is comparatively easy to pick up, and a working knowledge of it makes the
rest of Machine Learning much easier to follow.
Convex Optimization
➢ Convex optimization studies the minimization of convex functions over convex sets. Many
machine learning training objectives, such as the least-squares loss in linear regression and
the logistic-regression loss, are convex, which guarantees that any local minimum is also a
global minimum.
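A defining property of a convex objective is that any local minimum is the global minimum, so plain gradient descent converges reliably. A minimal sketch minimizing the convex function f(x) = (x - 3)², with an illustrative starting point and learning rate:

```python
# Gradient descent on the convex function f(x) = (x - 3)^2.
# Because f is convex, its single stationary point x = 3 is the global minimum.
def grad(x):
    return 2 * (x - 3)  # derivative f'(x)

x = 0.0    # starting point (for a convex f, any start converges)
lr = 0.1   # learning rate, chosen for illustration
for _ in range(100):
    x -= lr * grad(x)

print(round(x, 4))  # 3.0 -- converged to the global minimum
```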
Data Visualization
➢ Data Visualization is a crucial aspect of machine learning that enables analysts to understand
and make sense of data patterns, relationships, and trends. Through data visualization,
insights and patterns in data can be easily interpreted and communicated to a wider audience,
making it a critical component of machine learning.
➢ Machine learning may make use of a wide variety of data visualization approaches,
including:
1. Line Charts: In a line chart, each data point is represented by a point on the graph, and
these points are connected by a line. We may find patterns and trends in the data across
time by using line charts. Time-series data is frequently displayed using line charts.
2. Bar Charts: Bar charts are a common way of displaying categorical data. In a bar chart,
each category is represented by a bar, with the height of the bar indicating the frequency
or proportion of that category in the data. Bar charts are useful for comparing several
categories and spotting patterns over time.
3. Heat Maps: Heat maps are a type of graphical representation that displays data in a matrix
format. The value of the data point that each matrix cell represents determines its hue.
Heatmaps are often used to visualize the correlation between variables or to identify
patterns in time-series data.
4. Box Plots: Box plots are a graphical representation of the distribution of a set of data. In a
box plot, the median is shown by a line inside the box, while the box itself depicts the
interquartile range of the data. The whiskers extend from the box to the highest and lowest
values in the data, excluding outliers. Box plots help us to identify the spread and skewness
of the data.
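The idea behind a bar chart (bar length proportional to category frequency) can even be sketched in plain text, using hypothetical category counts:

```python
# A text-mode bar chart: each category's bar length shows its frequency
# (hypothetical category counts, for illustration only)
counts = {"cats": 7, "dogs": 12, "birds": 4}

lines = []
for category, count in counts.items():
    lines.append(f"{category:>6} | {'#' * count} {count}")
chart = "\n".join(lines)
print(chart)
```

In practice a plotting library such as matplotlib would render the same information graphically, but the mapping from frequency to bar length is identical.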
➢ Challenges of data visualization in machine learning include:
1. Data Quality
2. Data Overload
3. Over-Emphasis on Aesthetics
4. Audience Understanding
5. Technical Expertise
What is Hypothesis?
➢ The hypothesis is defined as the supposition or proposed explanation based on insufficient
evidence or assumptions. It is just a guess based on some known facts but has not yet been
proven. A good hypothesis is testable, which results in either true or false.
➢ Example: Let's understand the hypothesis with a common example. A scientist claims that
since ultraviolet (UV) light can damage the eyes, it may also cause blindness.
➢ The hypothesis is one of the commonly used concepts of statistics in Machine Learning. It is
specifically used in Supervised Machine learning, where an ML model learns a function that
best maps the input to corresponding outputs with the help of an available dataset.
➢ In supervised learning techniques, the main aim is to determine the possible hypothesis out of
hypothesis space that best maps input to the corresponding or correct outputs.
➢ There are some common methods given to find out the possible hypothesis from the Hypothesis
space, where hypothesis space is represented by uppercase-h (H) and hypothesis
by lowercase-h (h). These are defined as follows:
Hypothesis space (H):
➢ Hypothesis space is defined as a set of all possible legal hypotheses; hence it is also known
as a hypothesis set. It is used by supervised machine learning algorithms to determine the best
possible hypothesis to describe the target function or best maps input to output.
➢ It is often constrained by choice of the framing of the problem, the choice of model, and the
choice of model configuration.
Hypothesis (h):
➢ It is defined as the approximate function that best describes the target in supervised machine
learning algorithms. It is primarily based on data as well as bias and restrictions applied to data.
➢ Hence hypothesis (h) can be concluded as a single hypothesis that maps input to proper output
and can be evaluated as well as used to make predictions.
➢ Hypothesis Testing is basically an assumption that we make about a population parameter.
➢ Example: claiming that the class average is 40, or that boys are taller than girls.
➢ All such assumptions need a statistical way to be proved; we need a mathematical
conclusion that whatever we are assuming is true.
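As a sketch of how such an assumption is checked statistically, the one-sample t-statistic compares the sample mean against the hypothesized population mean (the sample values below are made up for illustration; computing the full p-value is omitted):

```python
import math

# One-sample test of H0: population mean = 40, using hypothetical sample data
sample = [38, 42, 41, 39, 40, 43, 37, 41, 42, 40]
mu0 = 40  # hypothesized population mean under the null hypothesis H0

n = len(sample)
mean = sum(sample) / n
var = sum((x - mean) ** 2 for x in sample) / (n - 1)  # sample variance
t_stat = (mean - mu0) / math.sqrt(var / n)

print(round(mean, 2))    # 40.3
print(round(t_stat, 3))  # about 0.502
# If |t| exceeded the critical value for n-1 degrees of freedom
# (about 2.262 at the 5% level here), H0 would be rejected;
# this small t value gives no evidence against H0.
```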
Data Distribution
➢ Data distribution refers to the way data values are spread across a dataset. Understanding
the distribution provides valuable insights, informs decision-making, and ensures that
appropriate methods are used for statistical analysis and modeling.
A) Continuous Data
➢ In simple terms, such data operates from one extreme to another, gauged on a scale such as
weight and temperature. Such type of data helps in gaining relevant information into trends,
patterns, and relationships typically not observed with other datasets. The continuous data is
categorized into several distributions, such as –
1. Normal data distribution – This is the most common type of distribution: a symmetric bell
curve centred on the mean, with data points spread equally on both sides.
2. Log-normal distribution – In this distribution, the logarithm of the data values follows a
normal distribution. It is often used for financial data, e.g. to model future stock prices
based on past data, since prices are positive and right-skewed.
3. F distribution: Helps in gauging data points in a broader range than normal distribution with
high variability.
4. Chi-square distribution: It analyzes the gap between observed data and expected results
and helps in identifying differences between two datasets.
5. Exponential distribution: Models the time between independent events; its density is
highest at zero and decays exponentially as values increase.
6. Non-normal distribution: It includes logistic and gamma distribution. Moreover, it is usually
used when data is highly non-linear and does not fit in the standard data distribution
categories.
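A distribution can also be explored by sampling: drawing many values from a normal distribution with a chosen mean and standard deviation and checking that the sample statistics match (seed fixed for reproducibility):

```python
import random
import statistics

random.seed(0)  # reproducible sampling

# Draw from a normal (Gaussian) distribution with mean 50 and std 5
samples = [random.gauss(50, 5) for _ in range(10_000)]

print(round(statistics.mean(samples), 1))    # close to 50
print(round(statistics.pstdev(samples), 1))  # close to 5
```

For normally distributed data, roughly 68% of the samples fall within one standard deviation of the mean.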
Data Pre-processing
➢ Data preprocessing is the process of preparing raw data and making it suitable for a machine
learning model. It is the first and most crucial step in creating a machine learning model. When
creating a machine learning project, we do not always come across clean and
formatted data, and before doing any operation with data it is necessary to clean it and put it
in a formatted way.
Data Augmentation
➢ Data augmentation is a technique for artificially increasing the training set by creating
modified copies of existing data. It includes making minor changes to the dataset or
using deep learning to generate new data points.
C) Image Augmentation
Advanced Techniques
1. Generative adversarial networks (GANs): trained on existing data to generate new
synthetic data points or images that resemble the training distribution.
2. Neural Style Transfer: a series of convolutional layers trained to deconstruct images and
separate context and style.
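A minimal sketch of one of the simplest image augmentations, a horizontal flip, treating an image as a 2D list of pixel values:

```python
# Horizontal flip -- one of the simplest image augmentations.
# An image is represented here as a 2D list of pixel values.
def horizontal_flip(image):
    return [list(reversed(row)) for row in image]

image = [
    [1, 2, 3],
    [4, 5, 6],
]
print(horizontal_flip(image))  # [[3, 2, 1], [6, 5, 4]]
```

Flips, rotations, crops, and small colour shifts each give the model a "new" training example without collecting any new data.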
Normalizing Data Sets
➢ Normalization is one of the most frequently used data preparation techniques; it rescales
the values of numeric columns in the dataset to a common scale.
➢ Although Normalization is no mandate for all datasets available in machine learning, it is used
whenever the attributes of the dataset have different ranges. It helps to enhance the
performance and reliability of a machine learning model.
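A common normalization method is min-max scaling, x' = (x − min) / (max − min), which maps every value into the range [0, 1]. A minimal sketch with illustrative age and salary features on very different scales:

```python
# Min-max normalization rescales values to a common [0, 1] range:
# x' = (x - min) / (max - min)
def min_max_normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [20, 30, 40, 60]                       # feature on one scale
salaries = [30_000, 90_000, 60_000, 150_000]  # feature on a much larger scale

print(min_max_normalize(ages))      # [0.0, 0.25, 0.5, 1.0]
print(min_max_normalize(salaries))  # [0.0, 0.5, 0.25, 1.0]
```

After scaling, both attributes contribute comparably to distance-based models, which is the point of normalization.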
Supervised learning
➢ Supervised learning is a category of machine learning that uses labeled datasets to train
algorithms to predict outcomes and recognize patterns.
Unsupervised learning
➢ Unsupervised learning uses unlabeled datasets; the algorithm discovers hidden patterns,
groupings, or structure in the data on its own (e.g. clustering).
Semi-supervised learning
➢ Semi-supervised learning is a type of machine learning that falls in between supervised and
unsupervised learning. It is a method that uses a small amount of labeled data and a large
amount of unlabeled data to train a model.
Reinforcement Learning
➢ Reinforcement learning (RL) is a machine learning (ML) technique that trains software to make
decisions to achieve the most optimal results. It mimics the trial-and-error learning process that
humans use to achieve their goals.