Unit1 6thsemCS
Introduction to machine learning, scope and limitations, regression, probability, statistics and linear
algebra for machine learning, convex optimization, data visualization, hypothesis function and testing,
data distributions, data pre-processing, data augmentation, normalizing data sets, machine learning
models, supervised and unsupervised learning.
Machine Learning
➢ Machine learning (ML) is a branch of artificial intelligence (AI) and computer science that
focuses on using data and algorithms to enable AI to imitate the way that humans learn,
gradually improving its accuracy.
➢ Machine learning is a type of AI that teaches machines how to learn, interpret and
predict results based on a set of data.
1. Data: The dataset you want to use must be well-structured and accurate. The data you use
can be labeled or unlabeled. Unlabeled data are raw sample items (e.g. photos, videos, news
articles) with no additional explanation attached, while labeled data are augmented: each
item carries a tag or annotation that explains it.
2. Algorithm: There are different types of algorithms that can be used (e.g. linear regression,
logistic regression). Choosing the right algorithm is a combination of business need,
specification, experimentation, and available time.
3. Model: A “model” is the output of a machine learning algorithm run on data. It represents the
rules, numbers, and any other algorithm-specific data structures required to make predictions.
➢ The function of a machine learning system can be descriptive, meaning that the system uses
the data to explain what happened; predictive, meaning the system uses the data to predict
what will happen; or prescriptive, meaning the system will use the data to make suggestions
about what action to take.
➢ What are training data?
Training data is the initial data used to train machine learning models. Training datasets are fed
to machine learning algorithms so that they can learn to make predictions, or perform a desired
task. This type of data is key because it helps machines achieve results and work in the right
way. Areas where machine learning is commonly applied include:
1. Robotics
2. Computer vision
3. Quantum Processing
4. Automotive Industry
5. Digital protection
Regression
➢ Regression is a method for understanding the relationship between independent variables or
features and a dependent variable or outcome. Outcomes can then be predicted once the
relationship between independent and dependent variables has been estimated.
1. Linear Regression
➢ Linear regression is a statistical regression method which is used for predictive analysis.
➢ It is one of the simplest regression algorithms; it models the relationship between
continuous variables.
➢ It is used for solving the regression problem in machine learning.
➢ Linear regression shows the linear relationship between the independent variable (X-axis) and
the dependent variable (Y-axis), hence called linear regression.
➢ If there is only one input variable (x), then such linear regression is called simple linear
regression. And if there is more than one input variable, then such linear regression is
called multiple linear regression.
➢ The relationship between variables in a linear regression model can be illustrated with a
simple example: predicting the salary of an employee on the basis of years of
experience.
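As a sketch of how such a fit works, simple linear regression has a closed-form solution: the slope is the covariance of x and y divided by the variance of x. A minimal pure-Python example, using made-up salary figures (in thousands) against years of experience:

```python
# Simple linear regression via the closed-form least-squares solution.
def fit_simple_linear(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope b1 = covariance(x, y) / variance(x); intercept b0 = mean_y - b1 * mean_x
    b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
         sum((x - mean_x) ** 2 for x in xs)
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Hypothetical data: years of experience vs. salary (in thousands)
years = [1, 2, 3, 4, 5]
salary = [30, 35, 40, 45, 50]
b0, b1 = fit_simple_linear(years, salary)
print(b0, b1)            # intercept 25.0, slope 5.0 for this perfectly linear data
predicted = b0 + b1 * 6  # predict salary for 6 years of experience
print(predicted)         # 55.0
```

The line Y = b0 + b1·x found here is the "best-fit" straight line through the datapoints.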
2. Polynomial Regression
➢ Polynomial Regression is a type of regression which models the non-linear dataset using a
linear model.
➢ It is similar to multiple linear regression, but it fits a non-linear curve between the value of x and
corresponding conditional values of y.
➢ Suppose a dataset consists of datapoints arranged in a non-linear fashion; in such a case,
linear regression will not fit those datapoints well. To cover such datapoints, we need
Polynomial Regression.
➢ In Polynomial regression, the original features are transformed into polynomial features of given
degree and then modeled using a linear model. Which means the datapoints are best fitted
using a polynomial line.
➢ The equation for polynomial regression is derived from the linear regression equation: the
linear equation Y = b0 + b1x is transformed into the polynomial equation
Y = b0 + b1x + b2x² + b3x³ + ... + bnxⁿ.
➢ Here Y is the predicted/target output, b0, b1, ..., bn are the regression coefficients, and x is
our independent/input variable.
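The feature transformation described above can be sketched in a few lines; the coefficient values below are illustrative, not fitted:

```python
# Transform a single feature x into polynomial features [1, x, x^2, ..., x^n];
# a linear model fitted on these columns yields polynomial regression.
def polynomial_features(x, degree):
    return [x ** d for d in range(degree + 1)]

# With coefficients b = [b0, b1, ..., bn], the prediction is the dot product:
# Y = b0 + b1*x + b2*x^2 + ... + bn*x^n
def predict(b, x):
    feats = polynomial_features(x, len(b) - 1)
    return sum(bi * f for bi, f in zip(b, feats))

b = [1.0, 2.0, 3.0]               # illustrative coefficients for Y = 1 + 2x + 3x^2
print(polynomial_features(2, 3))  # [1, 2, 4, 8]
print(predict(b, 2))              # 1 + 4 + 12 = 17.0
```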
➢ The model is still linear because the coefficients b0, ..., bn enter linearly; only the features
are raised to polynomial powers.
3. Support Vector Regression
➢ Support Vector Regression (SVR) applies the ideas of Support Vector Machines (SVM) to
regression problems. Its key terms are:
1. Kernel: A function that maps lower-dimensional data into a higher-dimensional space so
that a linear separator can be found.
2. Hyperplane: In general SVM, it is a separation line between two classes, but in SVR, it is a line
which helps to predict the continuous variables and cover most of the datapoints.
3. Boundary line: Boundary lines are the two lines apart from hyperplane, which creates a margin
for datapoints.
4. Support vectors: Support vectors are the datapoints closest to the hyperplane, lying on or
near the boundary lines; they determine the position of the hyperplane.
➢ In SVR, we always try to determine a hyperplane with a maximum margin, so that maximum
number of datapoints are covered in that margin. The main goal of SVR is to consider the
maximum datapoints within the boundary lines and the hyperplane (best-fit line) must contain
a maximum number of datapoints. Consider the below image:
Here, the hyperplane is the best-fit line, and the two lines running parallel to it on either side
are the boundary lines.
4. Decision Tree Regression
➢ Decision Tree regression predicts an outcome by repeatedly splitting the data on feature
values; for example, such a model can predict a person's choice between a sports car and a
luxury car.
6. Ridge Regression
➢ A general linear or polynomial regression will fail if there is high collinearity between the
independent variables, so to solve such problems, Ridge regression can be used.
➢ Ridge regression is a regularization technique used to reduce the complexity of the model.
It is also called L2 regularization, because it adds a penalty proportional to the square of the
coefficient magnitudes to the loss.
➢ It helps to solve the problems if we have more parameters than samples.
7. Lasso Regression
➢ Lasso regression is another regularization technique to reduce the complexity of the model.
➢ It is similar to Ridge Regression, except that the penalty term contains the absolute values
of the weights instead of their squares.
➢ Since it takes absolute values, it can shrink a slope exactly to 0, whereas Ridge Regression
can only shrink it close to 0.
➢ It is also called L1 regularization. The Lasso cost adds the penalty λ Σ|bj| to the sum of
squared errors: Cost = Σ(yi − ŷi)² + λ Σ|bj|.
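The different behaviour of the two penalties can be seen in how each shrinks a single coefficient, reduced here to a simplified one-dimensional sketch (lam is the regularization strength; this is not a full solver):

```python
def ridge_shrink(b, lam):
    # The L2 penalty scales the coefficient toward zero but never reaches it
    return b / (1 + lam)

def lasso_shrink(b, lam):
    # The L1 penalty (soft-thresholding) can set the coefficient exactly to zero
    sign = 1 if b >= 0 else -1
    return sign * max(abs(b) - lam, 0.0)

print(ridge_shrink(0.5, 1.0))  # 0.25 -- shrunk, but still non-zero
print(lasso_shrink(0.5, 1.0))  # 0.0  -- shrunk all the way to zero
```

This is why Lasso performs feature selection (small coefficients vanish entirely) while Ridge only dampens them.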
Probability and Statistics
➢ Probability and statistics are both essential concepts for Machine Learning.
Probability is about predicting the likelihood of future events, while statistics involves the
analysis of the frequency of past events.
➢ Probability is the bedrock of ML; it tells how likely an event is to occur. The value of
probability always lies between 0 and 1. It is a core concept as well as a primary prerequisite
for understanding ML models and their applications.
➢ Probability can be calculated as the number of times the event occurs divided by the total
number of possible outcomes. For example, when tossing a fair coin, the probability of
getting heads is:
P(head) = (favourable outcomes) / (total possible outcomes) = 1/2
Types of Probability
➢ For a better understanding, Probability can be categorized into the following types:
Empirical Probability: Empirical Probability can be calculated as the number of times the event
occurs divided by the total number of incidents observed.
Theoretical Probability: Theoretical Probability can be calculated as the number of ways the
particular event can occur divided by the total number of possible outcomes.
Joint Probability: It tells the probability of two random events occurring simultaneously. For
independent events:
P(A ∩ B) = P(A) · P(B)
Where: P(A ∩ B) = probability of events A and B both occurring
P(A) = probability of event A
P(B) = probability of event B
Conditional Probability: It is the probability of event A given that event B has occurred:
P(A | B) = P(A ∩ B) / P(B)
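The empirical and theoretical probabilities described above can be compared with a small coin-toss simulation (the seed is fixed only to make the run reproducible):

```python
import random

random.seed(42)  # fixed seed so the simulation is reproducible

# Theoretical probability of heads for a fair coin
theoretical = 1 / 2

# Empirical probability: number of times the event occurs / total trials
trials = 10_000
heads = sum(random.choice(["H", "T"]) == "H" for _ in range(trials))
empirical = heads / trials

print(theoretical)
print(empirical)  # close to 0.5, and it converges as the number of trials grows
```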
➢ Statistics is also considered as the base foundation of machine learning which deals with finding
answers to the questions that we have about data. In general, we can define statistics as:
➢ Statistics is the part of applied mathematics that deals with studying and developing methods
for gathering, analyzing, interpreting, and drawing conclusions from empirical data. It can be
used to make better-informed business decisions.
➢ Statistics can be categorized into 2 major parts. These are as follows:
1. Descriptive Statistics
2. Inferential Statistics
➢ Statistics methods are used to understand the training data as well as interpret the results of
testing different machine learning models. Further, Statistics can be used to make better-
informed business and investing decisions.
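The descriptive side of statistics can be illustrated with Python's standard statistics module on a small made-up dataset:

```python
import statistics

# Descriptive statistics summarize a dataset (illustrative values)
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(statistics.mean(data))    # 5    -- central tendency
print(statistics.median(data))  # 4.5
print(statistics.mode(data))    # 4    -- most frequent value
print(statistics.pstdev(data))  # 2.0  -- population standard deviation (spread)
```

Inferential statistics, by contrast, uses such summaries of a sample to draw conclusions about the wider population.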
➢ Linear Algebra is an essential field of mathematics, which defines the study of vectors, matrices,
planes, mapping, and lines required for linear transformation.
➢ Linear algebra plays a vital role and is a key foundation in machine learning; it enables ML
algorithms to run on huge datasets.
➢ The concepts of linear algebra are widely used in developing algorithms in machine learning.
Although it is used in almost every concept of Machine Learning, it specifically supports
tasks such as:
1. Optimization of data.
➢ Besides the above uses, linear algebra is also used in neural networks and the data science
field.
➢ Basic mathematics principles and concepts like Linear algebra are the foundation of Machine
Learning and Deep Learning systems. To learn and understand Machine Learning or Data
Science, one needs to be familiar with linear algebra and optimization theory.
Linear Algebra
➢ Linear Algebra is to Machine Learning what flour is to a bakery. Just as a cake is based
on flour, every Machine Learning model is based on Linear Algebra. A cake also needs more
ingredients, such as eggs, sugar, cream, and soda; similarly, Machine Learning
requires further concepts such as vector calculus, probability, and optimization theory. So we
can say that Machine Learning creates a useful model with the help of these
mathematical concepts.
➢ Below are some benefits of learning Linear Algebra before Machine learning:
1. Better Graphic experience
2. Improved Statistics
3. Creating better Machine Learning algorithms
4. Estimating the forecast of Machine Learning
5. Easy to Learn
Better Graphics Experience:
➢ Linear algebra underlies the graphical processing of data in machine learning, such as
plotting and projecting datasets; common matrix decompositions used here include:
1. Q-R (QR) decomposition
2. L-U (LU) decomposition
Improved Statistics:
➢ Statistics is an important concept to organize and integrate data in Machine Learning. Also,
linear Algebra helps to understand the concept of statistics in a better manner. Advanced
statistical topics can be integrated using methods, operations, and notations of linear algebra.
Creating better Machine Learning algorithms:
➢ Linear algebra helps in creating better supervised learning algorithms, such as:
1. Logistic Regression
2. Linear Regression
3. Decision Trees
4. Support Vector Machines (SVM)
➢ Further, below are some unsupervised learning algorithms listed that can also be created with
the help of linear algebra as follows:
1. Clustering
2. Principal Component Analysis
➢ With the help of Linear Algebra concepts, you can also customize the various parameters
of a live project yourself and gain the in-depth understanding needed to deliver results with
greater accuracy and precision.
Estimating the forecast of Machine Learning:
➢ If you are working on a Machine Learning project, you must be broad-minded and able to
bring in more perspectives; for this, you should build awareness of, and affinity for, Machine
Learning concepts.
Easy to Learn:
➢ Linear algebra is comparatively easy to pick up, and a working knowledge of it makes the
rest of Machine Learning much easier to follow.
Convex Optimization
➢ Convex optimization studies the minimization of convex functions over convex sets. Many
machine learning training objectives, such as the least-squares loss in linear regression and
the logistic-regression loss, are convex, which guarantees that any local minimum is also a
global minimum.
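A defining property of a convex objective is that any local minimum is the global minimum, so plain gradient descent converges reliably. A minimal sketch minimizing the convex function f(x) = (x - 3)², with an illustrative starting point and learning rate:

```python
# Gradient descent on the convex function f(x) = (x - 3)^2.
# Because f is convex, its single stationary point x = 3 is the global minimum.
def grad(x):
    return 2 * (x - 3)  # derivative f'(x)

x = 0.0    # starting point (for a convex f, any start converges)
lr = 0.1   # learning rate, chosen for illustration
for _ in range(100):
    x -= lr * grad(x)

print(round(x, 4))  # 3.0 -- converged to the global minimum
```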
Data Visualization
➢ Data Visualization is a crucial aspect of machine learning that enables analysts to understand
and make sense of data patterns, relationships, and trends. Through data visualization,
insights and patterns in data can be easily interpreted and communicated to a wider audience,
making it a critical component of machine learning.
➢ Machine learning may make use of a wide variety of data visualization approaches,
including:
1. Line Charts: In a line chart, each data point is represented by a point on the graph, and
these points are connected by a line. We may find patterns and trends in the data across
time by using line charts. Time-series data is frequently displayed using line charts.
2. Bar Charts: Bar charts are a common way of displaying categorical data. In a bar chart,
each category is represented by a bar, with the height of the bar indicating the frequency
or proportion of that category in the data. Bar charts are useful for comparing several
categories and spotting patterns over time.
3. Heat Maps: Heat maps are a type of graphical representation that displays data in a matrix
format. The value of the data point that each matrix cell represents determines its hue.
Heatmaps are often used to visualize the correlation between variables or to identify
patterns in time-series data.
4. Box Plots: Box plots are a graphical representation of the distribution of a set of data. In a
box plot, the median is shown by a line inside the box, while the box itself depicts the
interquartile range of the data. The whiskers extend from the box to the highest and lowest
values in the data, excluding outliers. Box plots help us to identify the spread and skewness
of the data.
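The idea behind a bar chart (bar length proportional to category frequency) can even be sketched in plain text, using hypothetical category counts:

```python
# A text-mode bar chart: each category's bar length shows its frequency
# (hypothetical category counts, for illustration only)
counts = {"cats": 7, "dogs": 12, "birds": 4}

lines = []
for category, count in counts.items():
    lines.append(f"{category:>6} | {'#' * count} {count}")
chart = "\n".join(lines)
print(chart)
```

In practice a plotting library such as matplotlib would render the same information graphically, but the mapping from frequency to bar length is identical.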
➢ Challenges of data visualization in machine learning include:
1. Data Quality
2. Data Overload
3. Over-Emphasis on Aesthetics
4. Audience Understanding
5. Technical Expertise
What is Hypothesis?
➢ The hypothesis is defined as the supposition or proposed explanation based on insufficient
evidence or assumptions. It is just a guess based on some known facts but has not yet been
proven. A good hypothesis is testable, which results in either true or false.
➢ Example: Let's understand the hypothesis with a common example. A scientist claims that
since ultraviolet (UV) light can damage the eyes, it may also cause blindness.
➢ The hypothesis is one of the commonly used concepts of statistics in Machine Learning. It is
specifically used in Supervised Machine learning, where an ML model learns a function that
best maps the input to corresponding outputs with the help of an available dataset.
➢ In supervised learning techniques, the main aim is to determine the possible hypothesis out of
hypothesis space that best maps input to the corresponding or correct outputs.
➢ There are some common methods given to find out the possible hypothesis from the Hypothesis
space, where hypothesis space is represented by uppercase-h (H) and hypothesis
by lowercase-h (h). These are defined as follows:
Hypothesis space (H):
➢ Hypothesis space is defined as a set of all possible legal hypotheses; hence it is also known
as a hypothesis set. It is used by supervised machine learning algorithms to determine the best
possible hypothesis to describe the target function or best maps input to output.
➢ It is often constrained by choice of the framing of the problem, the choice of model, and the
choice of model configuration.
Hypothesis (h):
➢ It is defined as the approximate function that best describes the target in supervised machine
learning algorithms. It is primarily based on data as well as bias and restrictions applied to data.
➢ Hence hypothesis (h) can be concluded as a single hypothesis that maps input to proper output
and can be evaluated as well as used to make predictions.
➢ Hypothesis Testing is basically an assumption that we make about a population parameter.
➢ Example: claiming that the class average is 40, or that boys are taller than girls.
➢ All such assumptions need a statistical way to be proved; we need a mathematical
conclusion that whatever we are assuming is true.
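As a sketch of how such an assumption is checked statistically, the one-sample t-statistic compares the sample mean against the hypothesized population mean (the sample values below are made up for illustration; computing the full p-value is omitted):

```python
import math

# One-sample test of H0: population mean = 40, using hypothetical sample data
sample = [38, 42, 41, 39, 40, 43, 37, 41, 42, 40]
mu0 = 40  # hypothesized population mean under the null hypothesis H0

n = len(sample)
mean = sum(sample) / n
var = sum((x - mean) ** 2 for x in sample) / (n - 1)  # sample variance
t_stat = (mean - mu0) / math.sqrt(var / n)

print(round(mean, 2))    # 40.3
print(round(t_stat, 3))  # about 0.502
# If |t| exceeded the critical value for n-1 degrees of freedom
# (about 2.262 at the 5% level here), H0 would be rejected;
# this small t value gives no evidence against H0.
```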
Data Distribution
➢ Data distribution refers to the way data values are spread across a dataset. Understanding
the distribution provides valuable insights, informs decision-making, and ensures that
appropriate methods are used for statistical analysis and modeling.
A) Continuous Data
➢ In simple terms, such data operates from one extreme to another, gauged on a scale such as
weight and temperature. Such type of data helps in gaining relevant information into trends,
patterns, and relationships typically not observed with other datasets. The continuous data is
categorized into several distributions, such as –
1. Normal data distribution – This is the most common type of distribution: a symmetric bell
curve centred on the mean, with data points spread equally on both sides.
2. Log-normal distribution – In this distribution, the logarithm of the data values follows a
normal distribution. It is often used for financial data, e.g. to model future stock prices
based on past data, since prices are positive and right-skewed.
3. F distribution: Helps in gauging data points in a broader range than normal distribution with
high variability.
4. Chi-square distribution: It analyzes the gap between observed data and expected results
and helps in identifying differences between two datasets.
5. Exponential distribution: Models the time between independent events; its density is
highest at zero and decays exponentially as values increase.
6. Non-normal distribution: It includes logistic and gamma distribution. Moreover, it is usually
used when data is highly non-linear and does not fit in the standard data distribution
categories.
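A distribution can also be explored by sampling: drawing many values from a normal distribution with a chosen mean and standard deviation and checking that the sample statistics match (seed fixed for reproducibility):

```python
import random
import statistics

random.seed(0)  # reproducible sampling

# Draw from a normal (Gaussian) distribution with mean 50 and std 5
samples = [random.gauss(50, 5) for _ in range(10_000)]

print(round(statistics.mean(samples), 1))    # close to 50
print(round(statistics.pstdev(samples), 1))  # close to 5
```

For normally distributed data, roughly 68% of the samples fall within one standard deviation of the mean.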
Data Pre-processing
➢ Data preprocessing is the process of preparing raw data and making it suitable for a machine
learning model. It is the first and most crucial step in creating a machine learning model. When
creating a machine learning project, we do not always come across clean and
formatted data, and before doing any operation with data it is necessary to clean it and put it
in a formatted way.
Data Augmentation
➢ Data augmentation is a technique for artificially increasing the training set by creating
modified copies of existing data. It includes making minor changes to the dataset or
using deep learning to generate new data points.
C) Image Augmentation
Advanced Techniques
1. Generative adversarial networks (GANs): trained on existing data to generate new
synthetic data points or images that resemble the training distribution.
2. Neural Style Transfer: a series of convolutional layers trained to deconstruct images and
separate context and style.
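A minimal sketch of one of the simplest image augmentations, a horizontal flip, treating an image as a 2D list of pixel values:

```python
# Horizontal flip -- one of the simplest image augmentations.
# An image is represented here as a 2D list of pixel values.
def horizontal_flip(image):
    return [list(reversed(row)) for row in image]

image = [
    [1, 2, 3],
    [4, 5, 6],
]
print(horizontal_flip(image))  # [[3, 2, 1], [6, 5, 4]]
```

Flips, rotations, crops, and small colour shifts each give the model a "new" training example without collecting any new data.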
Normalizing Data Sets
➢ Normalization is one of the most frequently used data preparation techniques; it rescales
the values of numeric columns in the dataset to a common scale.
➢ Although Normalization is no mandate for all datasets available in machine learning, it is used
whenever the attributes of the dataset have different ranges. It helps to enhance the
performance and reliability of a machine learning model.
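A common normalization method is min-max scaling, x' = (x − min) / (max − min), which maps every value into the range [0, 1]. A minimal sketch with illustrative age and salary features on very different scales:

```python
# Min-max normalization rescales values to a common [0, 1] range:
# x' = (x - min) / (max - min)
def min_max_normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [20, 30, 40, 60]                       # feature on one scale
salaries = [30_000, 90_000, 60_000, 150_000]  # feature on a much larger scale

print(min_max_normalize(ages))      # [0.0, 0.25, 0.5, 1.0]
print(min_max_normalize(salaries))  # [0.0, 0.5, 0.25, 1.0]
```

After scaling, both attributes contribute comparably to distance-based models, which is the point of normalization.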
Supervised learning
➢ Supervised learning is a category of machine learning that uses labeled datasets to train
algorithms to predict outcomes and recognize patterns.
Unsupervised learning
➢ Unsupervised learning uses unlabeled datasets; the algorithm discovers hidden patterns,
groupings, or structure in the data on its own (e.g. clustering).
Semi-supervised learning
➢ Semi-supervised learning is a type of machine learning that falls in between supervised and
unsupervised learning. It is a method that uses a small amount of labeled data and a large
amount of unlabeled data to train a model.
Reinforcement Learning
➢ Reinforcement learning (RL) is a machine learning (ML) technique that trains software to make
decisions to achieve the most optimal results. It mimics the trial-and-error learning process that
humans use to achieve their goals.