
Note Set 5: Predictive Modeling: Regression

Padhraic Smyth,
Department of Computer Science
University of California, Irvine
January 2024

1 Introduction

Prediction is the problem of predicting the value y of a variable Y, given values for an input vector x. If y is real-valued the prediction problem is known as regression, and if y is categorical (taking K values, K ≥ 2) then the problem is known as classification. In this set of notes we will primarily focus on regression and discuss some of the basic concepts involved in regression modeling, including: functional forms of regression models, minimizing empirical risk and expected loss for fitting models to data, connections to probability (and likelihoods and priors), and classic results on the bias-variance tradeoff.

2 Notation

In the discussion below x will be a d-dimensional column vector with components x_1, ..., x_d. We will generally assume that one of the components is a constant equal to 1, which allows us to have an intercept term in our models. It is also typical to assume that each of the x_j, j = 1, ..., d, is integer- or real-valued. If we have a categorical variable in our input x, taking K values, we can for example recode it as K binary variables.

For data we will use D = {(x_i, y_i)}, i = 1, ..., N, where the subscripts refer to data points. We can think of D as an N × (d + 1) data matrix if we wish, where each row corresponds to (x_i^T, y_i), i.e., the d component values of the transpose of x_i plus the corresponding target value y_i. In some contexts below we will refer to x_{ij}: this is the value of the jth component (the jth feature) for the ith data vector x_i. The notation is not perfect but should be clear from the context.

We will refer to predictive models in the form of f (x; θ): this is a function that takes a vector x as input
and produces (unless stated otherwise) a scalar value that predicts y. This function is parametrized by a
parameter vector θ which we will treat as a column vector θ1 , . . . , θp with p parameters (sometimes also
referred to as coefficients or weights). The goal of data-driven predictive modeling is to learn the parameters
θ from data D.


3 Examples of Predictive Models

Linear Models Let f(x; θ) be a model for predicting y, with functional form f and parameters θ. A linear model is one which is linear in the parameters, e.g.,

$$f(x; \theta) = \sum_{j=1}^{d} \theta_j x_j = \theta^T x$$

with parameter vector θ = (θ_1, ..., θ_d). Note that we can define a linear model that is linear in the parameters θ and non-linear in the input variables x. For example, if x = (x_1, x_2, 1) we could define a model such as

$$f(x; \theta) = \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2 + \theta_6$$

One way to think about this is that we have in effect replaced our original input x with an augmented input $x' = (x_1, x_2, x_1^2, x_2^2, x_1 x_2, 1)$, and can then write our model as being a linear function of the parameters and the augmented input, i.e., as $\theta^T x'$.

Estimating the parameters θ for a linear model is often quite straightforward. For example, if our loss
function is squared-error and our model is linear then the empirical risk (see next section) will be convex,
i.e., in general it has a single global minimum. To find this minimum, we take the derivative of the risk function with respect to the parameters θ_j and set it to 0, which yields a set of simultaneous linear equations (referred to as the Normal equations in least-squares fitting in statistics) that can be solved directly by numerical (non-gradient) methods. This numerical approach is in effect equivalent to inverting a matrix of dimension p × p, with complexity O(p^3). As a result, this direct approach is fine unless we have a large number of parameters p (or our system of equations is ill-conditioned), in which case gradient methods can be much more computationally effective than trying to directly solve the p simultaneous linear equations.[1]
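To make this concrete, here is a minimal numpy sketch (not from the original notes; the data, dimensions, and noise level are invented) of the direct Normal-equations solve:

```python
import numpy as np

# Minimal sketch: fit a linear model by solving the Normal equations
# (X^T X) theta = X^T y directly. Data and dimensions are invented.
rng = np.random.default_rng(0)
N, d = 200, 3
X = rng.normal(size=(N, d))
X[:, -1] = 1.0                                  # constant component -> intercept term
theta_true = np.array([2.0, -1.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=N)   # y = theta^T x + Gaussian noise

# Direct O(p^3) solve of the Normal equations (no explicit matrix inverse).
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat)                                # close to theta_true
```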

Generalized Linear Models (GLMs): GLMs are well-known and widely used in statistics and take the form f(x; θ) = m(θ^T x), where m() is a scalar-valued function[2] (often non-linear) operating on the inner product θ^T x. This is a relatively simple extension of a linear model, with one parameter per input dimension (i.e., p = d). The most well-known example of such a model in machine learning is the logistic regression model, where m(z) = 1/(1 + exp(−z)), i.e., m is the logistic function and produces a value bounded between 0 and 1. The use of the logistic function here can be interpreted as mapping the real-valued inner product θ^T x (which could take values anywhere on the real line) to a value f ∈ [0, 1], which can be very useful, for example, in classification problems (with binary-valued y) where we want f to represent the conditional probability of a class variable given an input x (we will discuss this in more detail in a later set of notes on classification). Other forms of GLMs can be used to represent other constraints
on the values we are predicting, such as the Poisson GLM, where $f(x; \theta) = m(\theta^T x) = \exp(\theta^T x)$, which constrains f to be non-negative. In general, fitting GLMs requires iterative algorithms: second-order methods using the Hessian (the p × p matrix of second derivatives of the loss with respect to the parameters) are often used in statistics, but for higher-dimensional problems in machine learning we typically use first-order gradient methods, since each Hessian-based update requires inverting a p × p matrix, which takes O(p^3) time.

[1] We will not discuss optimization methods, such as gradient techniques, in any detail in these notes: students are encouraged to review basic concepts of continuous optimization for machine learning, e.g., Chapter 7 on Continuous Optimization in the Mathematics for Machine Learning text.

[2] For readers familiar with GLMs, in our notation we are using m() to represent the inverse of the link function g(), i.e., m = g^{-1}().
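As a hedged illustration of first-order fitting for such a Poisson GLM (a sketch with invented data, step size, and iteration count, not a prescription):

```python
import numpy as np

# Sketch: fit a Poisson GLM, f(x; theta) = exp(theta^T x), by gradient
# descent on the negative log-likelihood. All settings are illustrative.
rng = np.random.default_rng(1)
N, d = 500, 4
X = rng.normal(size=(N, d))
X[:, -1] = 1.0
theta_true = np.array([0.5, -0.3, 0.2, 1.0])
y = rng.poisson(np.exp(X @ theta_true))

theta = np.zeros(d)
lr = 0.01
for _ in range(2000):
    f = np.exp(X @ theta)        # predictions m(theta^T x) = exp(theta^T x)
    grad = X.T @ (f - y) / N     # gradient of the average Poisson negative log-likelihood
    theta -= lr * grad
print(theta)                     # approaches theta_true
```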

Neural Networks: These models are of course widely used in machine learning and there is a huge literature on the many different types of neural networks. For regression modeling (i.e., predicting a real-valued y given an input x) we can loosely think of a neural network as operating in two stages. In the first stage, we learn some vector representation $z_\phi(x)$ for the input x; this is often referred to as representation learning. For example, if the inputs x are pixels then we can think of $z_\phi(x)$ as a new low-dimensional feature representation of the pixels, a representation parametrized by ϕ that (in principle) eliminates information in x that is not relevant to predicting y (e.g., $z_\phi(x)$ might be a translation-invariant representation of pixels x). There are a wide variety of “architectures” for creating effective representations z (convolutional models, recurrent models, transformer models, etc.) and they often use multiple hierarchical (or “deep”) layers of intermediate representations to generate z representations that are good for predicting y; we won't concern ourselves with the details of how the representation-learning aspects of neural networks operate (there are many other sources for learning about these methods).

In the second (output) stage of a neural network we use (in effect) a generalized linear model to transform $z_\phi = z_\phi(x)$ into the type of output we want, i.e.,

$$f(x; \theta) = m\big(\beta^T z_\phi\big)$$

where β is a vector of parameters of the same dimension as $z_\phi(x)$ and the full set of parameters for the model is θ = {ϕ, β}. If our output y is unconstrained then m() can just be the identity function, but if (for example) y ∈ [0, 1] then m() can be a logistic function (or “softmax” as it is referred to in the neural network literature).

A key point in neural networks is that both β and ϕ are learned simultaneously; in particular, the parameters ϕ of the non-linear transformation $z_\phi(x)$ are learned directly to optimize the prediction of y, e.g., using gradient methods. This tends to be much more effective in practice than “two-stage learning,” where feature representations for x are pre-defined based on human intuition, e.g., using different types of manually-defined spatial filters (which was typically how image classification models were built up until the early 2000s).

A challenge with neural network models, compared to the much simpler linear and GLM models, is that fitting them is much more complex, since $z_\phi(x)$ is usually a complicated non-linear function and the number of parameters in ϕ can be very large. Gradient methods, particularly stochastic gradient methods, have been found empirically to be the most effective approach in practice for training neural networks, with a broad range of variants and heuristics in use. The stochastic gradient method, which uses approximate gradients computed on small batches of randomly selected training examples, is particularly useful for fast training of large neural network models.
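The two-stage view can be sketched in a few lines of numpy (everything here, including the one-hidden-layer architecture, sizes, and step sizes, is an invented illustration); note that the representation parameters and the output-stage parameters are updated jointly by stochastic gradients:

```python
import numpy as np

# Sketch of the "two-stage" view: z_phi(x) is a learned representation
# (here one tanh layer with weights W playing the role of phi) and the
# output stage is the linear model beta^T z (m is the identity).
rng = np.random.default_rng(2)
N, d, h = 1000, 2, 16
X = rng.normal(size=(N, d))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.05 * rng.normal(size=N)

W = 0.5 * rng.normal(size=(h, d))   # representation parameters (phi)
beta = np.zeros(h)                  # output-stage parameters
lr, batch = 0.05, 32
for step in range(5000):
    idx = rng.integers(0, N, size=batch)           # a random mini-batch
    Xb, yb = X[idx], y[idx]
    Z = np.tanh(Xb @ W.T)                          # z_phi(x)
    err = Z @ beta - yb                            # prediction error on the batch
    grad_beta = Z.T @ err / batch                  # gradient w.r.t. beta
    grad_W = ((err[:, None] * beta) * (1 - Z**2)).T @ Xb / batch  # w.r.t. W
    beta -= lr * grad_beta                         # joint update of both stages
    W -= lr * grad_W
print(np.mean((np.tanh(X @ W.T) @ beta - y) ** 2)) # training MSE
```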

The progression of models above, from linear to GLMs to neural networks, is intended to give you a sense of how these different models are related to each other, acknowledging that we are skipping over many details of how these models can be specified, particularly for neural networks. There are of course many other classes of functional forms for f that can be very useful in different contexts, including non-parametric regression models (and their Bayesian counterparts, Gaussian processes), tree-structured regression models, generalized additive models (GAMs), and many more. In the discussion of regression below we will largely treat f(x; θ) as some general functional form, without getting into the details of its specification, and instead focus on what is happening when we try to fit these models to data.

4 An Optimization View of Prediction

A useful and common way to view the learning of a predictive model is as an optimization problem, as follows. Given data D = {(x_i, y_i)} we define the empirical risk as:

$$R(\theta) = \frac{1}{N} \sum_{i=1}^{N} \Delta\big(y_i, f(x_i; \theta)\big) \qquad (1)$$

There are three key aspects to this definition:

1. Model: The functional form of f represents our choice of model (linear, etc) and fi = f (xi ; θ) is the
prediction of our model, with input xi and some fixed setting of the parameter vector θ.

2. Loss: The loss ∆(y_i, f_i) ≥ 0 defines the scalar-valued loss (or error) incurred when we predict a true value y with the predicted value f_i = f(x_i; θ). For real-valued y we could use, for example, ∆(y, f) = (y − f)^2, the squared-error loss, but there are a variety of other loss functions we can use too.

3. Optimization: Once we define the functional form f and the loss ∆, and given training data D = {(x_i, y_i)}, then “all” that remains is to solve an optimization problem, i.e., minimize R(θ) as a function of θ. This optimization is referred to as empirical risk minimization (a minimal sketch follows this list). For complex models f in machine learning this optimization is typically carried out using gradient-based methods (assuming that the loss ∆ and function f are such that we can define gradients with respect to the parameters θ).
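To make the three components concrete, here is a minimal sketch (the data and helper names such as `empirical_risk` are invented for illustration); the gradient is approximated numerically so that the model f and the loss ∆ remain pluggable:

```python
import numpy as np

# Empirical risk R(theta) for an arbitrary model f and loss Delta.
def empirical_risk(theta, X, y, f, loss):
    return np.mean([loss(yi, f(xi, theta)) for xi, yi in zip(X, y)])

# Central finite-difference gradient, so no analytic derivatives are needed.
def numerical_gradient(R, theta, eps=1e-6):
    g = np.zeros_like(theta)
    for j in range(len(theta)):
        e = np.zeros_like(theta); e[j] = eps
        g[j] = (R(theta + e) - R(theta - e)) / (2 * eps)
    return g

# Example choices: linear model, squared-error loss, invented data.
f = lambda x, theta: theta @ x
squared_error = lambda y, fhat: (y - fhat) ** 2

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3)); X[:, -1] = 1.0
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

theta = np.zeros(3)
for _ in range(500):                      # gradient descent on R(theta)
    R = lambda t: empirical_risk(t, X, y, f, squared_error)
    theta -= 0.1 * numerical_gradient(R, theta)
print(theta)
```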

There are of course many other aspects to predictive modeling. For example, it is common to add a regularization term on the parameters to R(θ), and, as we discussed earlier, there are many different ways we can select the functional form f (linear models, generalized linear models, neural networks, tree-structured models, and so on). Nonetheless, the three components above capture the key elements of empirical predictive modeling and will typically be present in some form in any predictive modeling methodology.

A natural question to ask is why no probabilities show up in the statement of the predictive modeling
problem above. From the way we have stated it, via empirical risk minimization, the problem looks entirely
like an optimization problem, and there appears to be no role for probabilistic models in this setup. However,
as we will see below, probability is lurking beneath the surface in an implicit manner and plays a critical
role in predictive modeling.

5 A Probabilistic View of Prediction

Implicit in the formulation of the optimization problem above (as empirical risk minimization) is the assumption that our data D consists of samples (x_i, y_i) drawn in an IID manner from some underlying unknown
distribution p(x, y). This implicit assumption is present by virtue of the fact that we are training a model on
training data with the expectation that this data is similar distributionally to future data that the model will
see, i.e., that the model will generalize well to (x, y) pairs where y is unknown and we need to predict y
given x.

Given this, from a probabilistic perspective we will be interested in understanding how p(x, y) affects
our optimization problem. We can factor p(x, y) as p(y|x)p(x). Much of our focus naturally will be on the
conditional p(y|x) since this characterizes, in a stochastic manner, the predictive relationship from x to y
that we are interested in. In addition, as we will see later, p(x) also plays an important implicit role.

Figure 1: An illustration of regression: variation of data points y, distributed according to p(y|x), centered
around E[y|x] at each x; and the distribution of x data points according to p(x).

p(y|x) reflects the stochastic dependence of y given a specific vector x. You might ask “why don’t we
think of the relationship between y and x as being deterministic?”. The reason is that in almost all real-world

prediction problems the x variables that we have access to (that we can measure and use as inputs to our
model) are often not sufficient to deterministically determine the value of y. In other words, there will be
factors that can affect y that will not be in our prediction model. Thus, as these unmeasured factors change, y's values will change (for a fixed x) in a manner that makes y's dependence on x appear stochastic rather than deterministic.

A simple example of this is when y is in the future, e.g., y is the value of the stock market by the end
of the week. No matter how many x variables we put into our model, there will be some uncertainty about
y: there are so many possible factors that could influence y (e.g., social, political, natural events) that it
is impossible to put them all in our prediction model. Another example is medical diagnosis where y is
whether a particular patient has a specific disease or not: our x variables in our prediction model might be
indirect measures such as test results and the only truly accurate way to determine y would be to do surgery.
Yet another example is where x represents high-resolution pixel data and y is (for example) a binary variable
indicating if there is a human face present or not. The relationship between x and y will be stochastic if the
human face is too far away from the camera, or largely obscured, and so on. For these reasons it is natural
to capture uncertainty in the relationship between y and x via p(y|x).

5.1 Empirical Risk as Expected Loss

For simplicity of notation below we will drop the explicit vector notation for x and θ; but keep in mind that we are still considering x and θ to be vectors, e.g., p(x) refers to a density function over a vector x of dimension d, etc.

Let us consider what happens to the empirical risk as the size of our data set D = {(x_i, y_i)}, i = 1, ..., N, becomes infinitely large, i.e., as N → ∞:

$$\lim_{N \to \infty} R(\theta) = \lim_{N \to \infty} \frac{1}{N}\sum_{i=1}^{N} \Delta\big(y_i, f(x_i;\theta)\big) = \int_x \int_y \Delta\big(y, f(x;\theta)\big)\, p(x,y)\, dy\, dx \qquad (2)$$

by the law of large numbers and by the fact that the empirical samples (x_i, y_i) are IID draws from p(x, y). Thus,

$$\lim_{N \to \infty} R(\theta) = \int_x \Big[\int_y \Delta\big(y, f(x;\theta)\big)\, p(y|x)\, dy\Big]\, p(x)\, dx = \int_x E_{p(y|x)}\big[\Delta\big(y, f(x;\theta)\big)\big]\, p(x)\, dx \qquad (3)$$

Thus the true risk (the expected loss on new data), as the amount of data increases, approaches an integral in which the innermost term $E_{p(y|x)}[\Delta(y, f(x;\theta))]$ reflects the expected loss at a particular value of x, averaging over y, and the outer integral averages this expected pointwise loss over x, weighted by p(x). In other words, in the limit, a model that wants to minimize overall risk R(θ) should try to minimize the pointwise risk at specific parts of the input space x, weighted by how likely those inputs are as determined by p(x). For finite N, the empirical risk is an unbiased estimator of the true risk, as long as we randomly sample pairs (x_i, y_i) from the underlying joint density. But for small N the empirical risk will have high variance; thus, empirically minimizing R(θ) as a function of θ may amount to minimizing a poor, noisy estimate of the true risk, and the resulting model (as represented by the parameters that minimize this noisy empirical risk) might generalize poorly to new data.
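A small simulation (an invented example, not from the notes) illustrates both properties: unbiasedness for every N, and shrinking variance as N grows:

```python
import numpy as np

# Sketch: how the empirical risk behaves as an estimator of the true risk.
# Here p(x) is standard normal, y = x^2 + noise, and f(x) = x is a
# deliberately misspecified fixed predictor (its true risk is 4.25).
rng = np.random.default_rng(4)

def empirical_risk(N):
    x = rng.normal(size=N)
    y = x**2 + 0.5 * rng.normal(size=N)
    return np.mean((y - x) ** 2)          # squared-error loss, f(x) = x

for N in [10, 100, 10000]:
    estimates = [empirical_risk(N) for _ in range(1000)]
    print(N, np.mean(estimates), np.std(estimates))
# The mean stays near the true risk for every N (unbiasedness), while the
# standard deviation across repetitions shrinks as N grows.
```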

5.2 Example: Squared Error Loss



Consider what happens if we define our loss function to be squared error, i.e., $\Delta\big(y, f(x;\theta)\big) = (y - f(x;\theta))^2$. Recall that our implicit assumption is that y varies even when x is fixed, i.e., there is a stochastic relationship between x and y, while our predictor f(x; θ) can only produce a single deterministic prediction for y. So the question is: what value minimizes $(y - f(x;\theta))^2$ when y is stochastic? Treating $f_x = f(x;\theta)$ as some unknown constant whose value we wish to optimally select, we seek to minimize the mean-squared error at x,

$$\text{MSE}_x = \int_y (y - f_x)^2\, p(y|x)\, dy = \int_y (y^2 - 2 f_x y + f_x^2)\, p(y|x)\, dy = \int_y y^2\, p(y|x)\, dy - 2 f_x \int_y y\, p(y|x)\, dy + f_x^2 \int_y p(y|x)\, dy \qquad (4)$$

Noting that $\int_y p(y|x)\, dy = 1$, and denoting $\int_y y\, p(y|x)\, dy$ as E[y|x], taking the derivative with respect to $f_x$ and setting it to zero we arrive at the well-known result that the minimizing value is $f_x = E[y|x]$, i.e., to minimize the squared error when predicting a random variable (which here is y given x), predict the mean of the random variable.
Another way to look at this is to rewrite the mean-squared error at x as:

$$\text{MSE}_x = \int_y (y - f_x)^2\, p(y|x)\, dy = \int_y \big(y - E[y|x] + E[y|x] - f_x\big)^2\, p(y|x)\, dy = \int_y (y - E[y|x])^2\, p(y|x)\, dy + \int_y (E[y|x] - f_x)^2\, p(y|x)\, dy \qquad (5)$$

where the cross terms cancel out (to verify this, note that both $f_x$ and E[y|x] can be treated as constants that can be taken outside the integral). Thus, defining $\sigma^2_{y|x} = \int_y (y - E[y|x])^2\, p(y|x)\, dy$ as the conditional variance of y at x, we have

$$\int_y (y - f_x)^2\, p(y|x)\, dy = \sigma^2_{y|x} + \int_y (E[y|x] - f_x)^2\, p(y|x)\, dy = \sigma^2_{y|x} + (E[y|x] - f_x)^2 \qquad (6)$$

since, again, E[y|x] and $f_x$ can be taken outside the integral over y. From the last line of the equation above we see that the mean-squared error at x is composed of two terms:

1. $\sigma^2_{y|x}$: this is the intrinsic variance of y at x. We have no direct control over this; it effectively represents the variation in y that cannot be explained by x, and it is a lower bound on the optimal achievable mean-squared error at x.

2. $(E[y|x] - f_x)^2$: this term can in principle be minimized to zero by setting our prediction $f_x$, at x, to the expected value of y at x, i.e., set $f_x = E[y|x]$.

Of course in practice we don’t know what E[y|x] is; instead we use samples from p(x, y) to minimize the
empirical risk (or empirical mean-squared error in this case).
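A quick numerical check of the pointwise result (with an invented, deliberately skewed distribution for y):

```python
import numpy as np

# Among constant predictions c for a random y, the average of (y - c)^2
# is smallest at c = E[y] -- here a skewed Gamma with E[y] = 2.
rng = np.random.default_rng(5)
y = rng.gamma(shape=2.0, scale=1.0, size=100_000)

cs = np.linspace(0.0, 4.0, 401)
mse = [np.mean((y - c) ** 2) for c in cs]
print(cs[int(np.argmin(mse))], y.mean())   # both close to 2.0
```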

Furthermore, we need to minimize the mean-squared error not just at some particular x: we need to minimize the total expected squared error with respect to p(x), i.e., minimize

$$\text{MSE} = E_{p(x)}\Big[\int_y (y - f_x)^2\, p(y|x)\, dy\Big] = \int_x \int_y (y - f_x)^2\, p(y|x)\, dy\; p(x)\, dx = \int_x \Big[\int_y (y - E[y|x])^2\, p(y|x)\, dy + (E[y|x] - f_x)^2\Big]\, p(x)\, dx = \sigma^2_{y_x} + E_{p(x)}\big[(E[y|x] - f_x)^2\big] \qquad (7)$$

where each term is now averaged with respect to p(x). The first term is defined as $\sigma^2_{y_x} = E_{p(x)}[\sigma^2_{y|x}]$, which is the expected irreducible variance in y relative to E[y|x]. The second term is the average error between the optimal prediction E[y|x] at each x and the actual prediction $f_x$. As with the pointwise case, we can in theory minimize this by selecting $f_x = E[y|x]$ for all values of x, which will set the second term to 0. However, in practice our function $f_x$ might not be representationally powerful enough to approximate E[y|x] at all values of x. For example, imagine that the true (unknown) E[y|x] is a 2nd-order (or higher) polynomial in x and we have selected a linear function of x as our prediction model $f_x$. This type of error is known as approximation error or bias. What will tend to happen in this situation is that, to minimize $E_{p(x)}[(E[y|x] - f_x)^2]$, the optimization will try to match our predictor $f_x$ to the unknown mean E[y|x] as best it can, emphasizing regions of high input density in the x space.

To try to avoid approximation error, we could make our function $f_x$ much richer representationally (i.e., a more complex function) to try to match the potentially complex (but unknown) E[y|x] function we are trying to learn. For example, we could use high-order polynomial functions of x (or polynomial functions of the components of x if it is a vector), or we could use a flexible model such as a neural network. The potential problem we will run into by doing this is that, for a finite amount of data, as we make our predictor $f_x$ more complex we will in general increase the estimation error or variance. For example, let's say that $f_x$ is complex enough to model E[y|x] exactly at all values of x. Even if this is the case, we still have to estimate E[y|x] as a function of x from noisy observations sampled from p(x, y), and in general this process will result in non-zero estimation error; this estimation error can be quite large if we have relatively complex functions and relatively small amounts of data to fit them with.

In the next section we will see that there is a fundamental tradeoff between approximation error (bias)
and estimation error (variance).

6 The Bias-Variance Tradeoff in Regression

Let D be a data set of size N, i.e., D = {(x_i, y_i)}, i = 1, ..., N, where each (x_i, y_i) pair is an IID sample from some underlying unknown distribution p(x, y). Let p(D) be the distribution over all possible data sets of size N. For example, if x and y were both discrete then p(D) would be a distribution over a finite set of possible data sets; otherwise p(D) will be a density function.

We will now analyze the average performance of fitting a model with respect to p(D). This analysis
is inherently theoretical in nature since in practice we are usually conditioning on a particular data set
D when fitting a model—but we gain insight by analyzing what would happen on average if (say) many
different individuals fit models using different data sets D with samples of size N drawn IID from the same
underlying process p(x, y).

Note that f(x; θ̂) is a random quantity w.r.t. p(D), since θ̂ is an estimate of θ computed from D (e.g., θ̂ will be the result of some optimization procedure given D). Since D is random, θ̂, which depends on D, is also random w.r.t. p(D). For different data sets D we get different θ̂'s, which give different f's.
Recall that $\text{MSE}_x = \sigma^2_{y|x} + (E[y_x] - f(x; \hat\theta))^2$. Since the first part, $\sigma^2_{y|x}$, doesn't depend on the data D, we will analyze the expectation of the second part with respect to D: $E_{p(D)}\big[(E[y_x] - f(x;\hat\theta))^2\big]$. This is the component of our error at x that we can affect (by our choice of f and θ), here averaged over all possible data sets D of size N.

It will be convenient to define $\bar f_x = E_{p(D)}[f(x;\hat\theta)]$. This can be thought of as the prediction we would make at x if we averaged over many different data sets D in estimating f(x; θ̂). We can also define the pointwise bias as $\text{Bias}_x = E[y_x] - \bar f_x$, which is the bias between the true E[y|x] (our optimal prediction for squared error) and the average of our predictions.

Given these definitions we have:

$$E_{p(D)}\big[(E[y_x] - f(x;\hat\theta))^2\big] = E_{p(D)}\big[(E[y_x] - \bar f_x + \bar f_x - f(x;\hat\theta))^2\big] = E_{p(D)}\big[(E[y_x] - \bar f_x)^2\big] + E_{p(D)}\big[(\bar f_x - f(x;\hat\theta))^2\big] = (E[y_x] - \bar f_x)^2 + E_{p(D)}\big[(\bar f_x - f(x;\hat\theta))^2\big] \qquad (8)$$

where in the second-to-last expression the cross terms cancel out again, and in the last expression we can drop the $E_{p(D)}$ in the first term since $E[y_x]$ and $\bar f_x$ are constants with respect to the expectation over p(D).

Consider both terms on the right-hand side above:



• Bias: $\text{Bias}_x^2 = (E[y_x] - \bar f_x)^2$. Note that bias is inherently a theoretical concept: in practice we won't be able to measure bias (except in simulated data) since we don't in practice know the true $E[y_x]$.

• Variance: $\text{Variance}_x = E_{p(D)}\big[(\bar f_x - f(x;\hat\theta))^2\big]$. This term represents the variability in fitting $f_x$ (i.e., the estimation error) due to the variability across different data sets of size N. This could in principle be estimated in practice (e.g., via bootstrap methods).

We can average over p(x) (in addition to averaging over p(D)) to define the total bias and variance as $\text{Bias}^2 = \int_x \text{Bias}_x^2\, p(x)\, dx$ and $\text{Variance} = \int_x \text{Variance}_x\, p(x)\, dx$. And finally, we can add back in the inherent variance in y (from earlier) to get the average MSE, averaged over data sets D of size N, for some functional form f as a predictive model:

$$\text{MSE} = \sigma^2_{y_x} + \text{Bias}^2 + \text{Variance}$$

This is the fundamental bias-variance tradeoff for fitting predictive models. It is referred to as a tradeoff since bias and variance tend to act in opposite directions in practice:

• Simple models with few parameters can have high bias, but will tend to have low variance.

• Complex models with many parameters will have lower bias, but can have high variance when fit to
relatively small amounts of data.

The bias-variance result we derived above is based on the squared-error loss. Somewhat analogous results
can be obtained for other types of loss functions.
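The squared-error decomposition above can also be estimated by simulation; the following sketch (polynomial models, data-generating function, and all settings invented for illustration) fits models of increasing complexity to many data sets of size N and reports squared bias and variance:

```python
import numpy as np

# Simulation sketch of the bias-variance decomposition: fit polynomials of
# increasing degree to many data sets drawn from the same p(x, y), then
# measure squared bias and variance of the fitted predictions.
rng = np.random.default_rng(6)
true_mean = lambda x: np.sin(2 * np.pi * x)        # E[y|x]
N, sigma, n_datasets = 30, 0.3, 500
x_grid = np.linspace(0, 1, 100)

for degree in [1, 3, 9]:
    preds = np.empty((n_datasets, x_grid.size))
    for k in range(n_datasets):
        x = rng.uniform(0, 1, size=N)
        y = true_mean(x) + sigma * rng.normal(size=N)
        coeffs = np.polyfit(x, y, degree)          # least-squares polynomial fit
        preds[k] = np.polyval(coeffs, x_grid)
    f_bar = preds.mean(axis=0)                     # average fit over data sets
    bias2 = np.mean((true_mean(x_grid) - f_bar) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(degree, round(bias2, 4), round(variance, 4))
# Typically bias^2 falls and variance rises as the degree increases.
```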
The classical bias-variance theory tends to suggest that as the complexity of a model increases (e.g., as the number of hidden units in a neural network increases), bias decreases monotonically while variance increases monotonically. However, there has been some interesting recent work in deep learning showing that (in certain cases) variance at first increases with the complexity of a model but can then decrease. The precise reasons for this decrease are still not theoretically clear, but it is conjectured that variance can in fact decrease once the complexity of a model reaches the over-parametrized regime (e.g., when a model is complex enough to completely interpolate a data set), possibly due to various regularization effects in training deep models. This is currently a hot research topic in machine learning; students interested in additional perspectives can refer to the chapter on Generalization in Hardt and Recht's text https://mlstory.org/generalization.html, as well as papers such as “Rethinking bias-variance trade-off for generalization of neural networks,” ICML 2020, http://proceedings.mlr.press/v119/yang20j/yang20j.pdf, and “Understanding double descent requires a fine-grained bias-variance decomposition,” NeurIPS 2020, https://proceedings.neurips.cc/paper/2020/hash/7d420e2b2939762031eed0447a html.

7 Conditional Probabilistic Models for Regression

It is natural to think of how we might connect likelihood-based and Bayesian estimation ideas to regression.
We begin by defining the conditional likelihood as L(θ) = P(D_y | D_x, θ), where we have split the data pairs in D into D_y = {y_i} and D_x = {x_i}, i = 1, ..., N. We use a conditional likelihood since we are interested in modeling p(y|x) rather than p(x). Under an IID assumption, we have
$$L(\theta) = P(D_y | D_x, \theta) = \prod_{i=1}^{N} p(y_i | x_i, \theta) \qquad (9)$$

To model p(y|x, θ) a natural approach is to model the mean of y as some function of x (see Figure 1) with some additive noise. For example, assuming Gaussian noise we can specify a conditional Gaussian model as

$$p(y_i | x_i, \theta) = N\big(f(x_i; \theta), \sigma_y^2\big) \propto \exp\Big(-\frac{1}{2\sigma_y^2}\big(y_i - f(x_i;\theta)\big)^2\Big) \qquad (10)$$

where for convenience we will assume (i) that the noise is Gaussian, and (ii) that $\sigma_y^2$ is known and constant (e.g., does not vary with x). Note that, unlike in density estimation, the mean here is not a constant but instead varies as an unknown function of the inputs x.
Writing out the log-likelihood and ignoring terms that don't depend on θ, we get

$$\log L(\theta) = -\sum_{i=1}^{N} \big(y_i - f(x_i; \theta)\big)^2 \qquad (11)$$

We note that this is (up to multiplicative constants) the same as the negative empirical risk with squared error as the loss. Thus, minimizing squared error in regression is equivalent to maximizing the conditional log-likelihood under an additive IID Gaussian noise assumption. This is not surprising given the functional form of the Gaussian model.
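A numerical check of this equivalence (invented data; $\sigma_y^2$ treated as known, as in the text):

```python
import numpy as np

# The theta maximizing the conditional Gaussian log-likelihood coincides
# with the least-squares / empirical-risk-minimizing solution.
rng = np.random.default_rng(7)
N = 100
X = rng.normal(size=(N, 2))
X[:, -1] = 1.0
y = X @ np.array([1.5, -0.5]) + 0.2 * rng.normal(size=N)
sigma2 = 0.04                                   # assumed-known noise variance

theta_ls = np.linalg.solve(X.T @ X, X.T @ y)    # least-squares solution

# Gradient descent on the average negative log-likelihood reaches the same theta.
theta = np.zeros(2)
for _ in range(2000):
    grad = X.T @ (X @ theta - y) / (sigma2 * N) # d/dtheta of the average NLL
    theta -= 0.01 * grad
print(theta_ls, theta)                          # the two estimates agree
```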
This connection between empirical risk and log-likelihood opens up a range of modeling possibilities. In particular, it allows us to design empirical risk and loss functions in a principled way, by defining an appropriate log-likelihood. For example, if we wanted our noise variance $\sigma_y^2$ to depend on x, we could build this into the likelihood, e.g., by specifying $\sigma_y^2(x) = h(x; \gamma)$ where h is some functional form (e.g., linear) with unknown parameters γ, with a resulting likelihood L(θ, γ). Or we could specify a model with non-Gaussian (e.g., asymmetric) noise if that were more appropriate for our problem than symmetric Gaussian noise. We also note, for example, that the additive form of the standard empirical risk (with a sum over the N data points) corresponds directly to an IID assumption in defining the likelihood. If we wanted to generalize to allow for dependence across the (x_i, y_i) pairs (e.g., if i were an index for time) then we can in principle write down a likelihood model that includes dependence across the data points (such as a Markov or autoregressive model) and take the log to obtain a corresponding empirical risk function that we can minimize.

8 Bayesian Regression

The likelihood-based approach above allows us to make another very useful connection between regression
and topics we discussed earlier in the course, i.e., we can formulate Bayesian approaches to regression. As
an example, consider the conditional Gaussian likelihood discussed in the last section but where we now
have a prior p(θ) on the parameters θ_1, ..., θ_p. For convenience we will assume that this is an independent prior on each individual parameter, i.e., $p(\theta) = \prod_{j=1}^{p} p(\theta_j)$. A common approach, for example, is to assume a relatively non-informative zero-mean Gaussian prior, $p(\theta_j) = N(0, \sigma_\theta^2)$. The value of $\sigma_\theta^2$ essentially governs how strongly we believe that any of the parameters (e.g., weights or coefficients in a linear model) are different from 0. In a high-dimensional linear problem (where p = d and d is large) we might believe that many of the inputs in the input vector x are not in fact relevant to predicting y and that their coefficients are likely to be 0. Of course we don't know a priori which ones should be zero and which should be non-zero, so we can let the data (via the likelihood) nudge us away from the prior, which has its mode at 0.

Proceeding with a Bayesian analysis, and using the conditional Gaussian likelihood from the previous section, we can define the posterior on the unknown parameters θ as

$$p(\theta | D_y, D_x) \propto p(D_y | D_x, \theta)\, p(\theta) = \prod_{i=1}^{N} N\big(f(x_i; \theta), \sigma_y^2\big) \prod_{j=1}^{p} N(0, \sigma_\theta^2) \qquad (12)$$

where p(θ | D_x) = p(θ) since the prior on θ doesn't depend on the data D_x. Taking logs to convert to the log-posterior, we get

$$\log p(\theta | D_y, D_x) = -\frac{1}{2\sigma_y^2}\sum_{i=1}^{N}\big(y_i - f(x_i;\theta)\big)^2 - \frac{1}{2\sigma_\theta^2}\sum_{j=1}^{p} \theta_j^2 + \text{const} \qquad (13)$$

which, after negating and multiplying through by constants (relative to θ), we can rewrite as

$$R'(\theta) = \frac{1}{N}\sum_{i=1}^{N}\big(y_i - f(x_i;\theta)\big)^2 + \lambda \sum_{j=1}^{p}\theta_j^2 \qquad (14)$$

with $\lambda = \frac{\sigma_y^2}{N\sigma_\theta^2}$. We recognize R'(θ) as a standard empirical risk function with squared-error loss (the first term), but now with an additional “L2-regularization” term that penalizes deviations of the θ_j parameters away from 0, where the penalty is itself in the form of squared error. In other words, what we have shown above is that maximum a posteriori (MAP) estimation for Bayesian regression, i.e., maximizing log p(θ | D_y, D_x), is equivalent to minimizing an empirical risk function with a regularization term, where the regularization corresponds to the log-prior on the parameters θ.
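For a linear model, minimizing R'(θ) has a closed form, which makes the MAP/ridge equivalence easy to check numerically (a sketch with invented data; the closed form below follows from setting the gradient of R'(θ) to zero):

```python
import numpy as np

# MAP / ridge sketch for a linear model: minimizing R'(theta) with an L2
# penalty gives theta = (X^T X + N*lambda*I)^{-1} X^T y.
rng = np.random.default_rng(8)
N, p = 50, 10
X = rng.normal(size=(N, p))
theta_true = np.concatenate([np.array([2.0, -1.0]), np.zeros(p - 2)])
sigma_y, sigma_theta = 0.5, 1.0
y = X @ theta_true + sigma_y * rng.normal(size=N)

lam = sigma_y**2 / (N * sigma_theta**2)            # the Bayesian choice of lambda
theta_map = np.linalg.solve(X.T @ X + N * lam * np.eye(p), X.T @ y)
print(theta_map)                                   # shrunk toward the prior mode at 0
```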

This is an interesting and useful connection, allowing us to bridge the more empirically-oriented regularization world of predictive modeling with a Bayesian approach to regression. In some cases the two approaches are exactly equivalent (particularly for point estimates, as we will discuss below), just two different formalizations of the same problem. However, the ability to view regularization from a Bayesian perspective opens up potentially new ideas and approaches. As an example, if we wanted to impose additional structure in our regularization, we could do so by using joint priors with covariance structure on our parameters θ, e.g., where whole groups of parameters are either all zero or all non-zero. We could also look at non-Gaussian priors: for example, the well-known Lasso technique for regularization, where the penalty term is of the form $\sum_{j=1}^{p} |\theta_j|$, corresponds to a Laplacian prior (a two-sided exponential around 0), a prior that tends to decay faster than the Gaussian and that can lead to sparser solutions, in the sense that the fitted models (after minimization) will tend to have more parameters that are exactly 0 than if we had used a Gaussian prior.

We also note above that the Bayesian framework tells us what the value of λ should be, namely the ratio of the data variance $\sigma_y^2$ divided by N to the prior variance $\sigma_\theta^2$. In practice λ is often determined empirically (e.g., by searching for the best value using a validation set), but for very small data sets where we have a priori knowledge about both the noise variance $\sigma_y^2$ in y and the prior uncertainty about our parameters θ, the ability to select $\lambda = \frac{\sigma_y^2}{N\sigma_\theta^2}$ could be useful.

The discussion above focuses just on MAP point estimation of the parameters θ. To be fully Bayesian we would like to marginalize over parameter uncertainty and make predictions for a new y in this manner. Specifically, our distribution for y given some future x should be

$$p(y | x, D_x, D_y) = \int_\theta p(y, \theta | x, D_x, D_y)\, d\theta = \int_\theta p(y | x; \theta)\, p(\theta | D_x, D_y)\, d\theta \propto \int_\theta p(y | x; \theta)\, p(D_y | D_x, \theta)\, p(\theta)\, d\theta \qquad (15)$$

From the second line above we see that the Bayesian recipe for prediction in regression is to average over the predictions made with specific parameter settings, p(y|x; θ), weighted by the posterior on θ. If we don't want to compute a full predictive density for y, but just want to generate a point estimate for y, we could use the mean or mode of the predictive density p(y|x, D_x, D_y).

For relatively simple cases, such as linear models with Gaussian likelihoods and Gaussian priors, we can get closed-form solutions for both the posterior p(θ | D_x, D_y) and the predictive density p(y | x, D_x, D_y) for regression. But in general (even for relatively simple models like GLMs) neither the posterior nor the predictive density is available in closed form, which necessitates the use of approximate Bayesian inference methods, typically either stochastic approaches (such as Markov chain Monte Carlo (MCMC)) or optimization-based approaches that approximate the posterior with simpler forms (such as variational inference). Developing efficient and robust Bayesian methods for large-scale deep neural networks, which can contain potentially millions of parameters, is still considered an open research problem, e.g., see Wilson and Izmailov (2020), “Bayesian deep learning and a probabilistic perspective on generalization,” https://arxiv.org/pdf/2002.08791.pdf.
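For the closed-form Gaussian case mentioned above, a short sketch (invented data; these are the standard Gaussian-prior/Gaussian-likelihood posterior and predictive formulas for a linear model):

```python
import numpy as np

# Bayesian linear regression with a N(0, sigma_theta^2 I) prior and Gaussian
# likelihood: the posterior over theta is N(mu_N, S_N) and the predictive
# density for y at a new x is Gaussian.
rng = np.random.default_rng(9)
N, p = 40, 3
X = rng.normal(size=(N, p)); X[:, -1] = 1.0
sigma_y, sigma_theta = 0.3, 2.0
y = X @ np.array([1.0, -0.7, 0.2]) + sigma_y * rng.normal(size=N)

# Posterior: S_N = (X^T X / sigma_y^2 + I / sigma_theta^2)^{-1},
#            mu_N = S_N X^T y / sigma_y^2
S_N = np.linalg.inv(X.T @ X / sigma_y**2 + np.eye(p) / sigma_theta**2)
mu_N = S_N @ X.T @ y / sigma_y**2

x_new = np.array([0.5, -1.0, 1.0])
pred_mean = x_new @ mu_N
pred_var = sigma_y**2 + x_new @ S_N @ x_new   # noise + parameter uncertainty
print(pred_mean, np.sqrt(pred_var))
```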

Additional Reading on Regression

See the class Web page for links to the texts below, all are available online.

• Empirical Risk Minimization: Sections 8.1 and 8.2 in Mathematics for Machine Learning (MML)
(2021).

• Linear Regression: Chapter 9 in MML (2021) and Chapter 11 in Murphy (2022).

• GLMs: Chapter 8 in Efron and Hastie, and Chapter 12 in Murphy (2022).

• (Deep) Neural Networks: Chapter 18 in Efron and Hastie for an informative brief review of neural
network concepts from a statistical perspective, and Chapter 13 in Murphy (2022) for a more detailed
treatment.

• Bias-Variance: a very nice blog post that provides great intuition: http://scott.fortmann-roe.com/docs/BiasVariance.html.

• Bayesian Regression: See in particular Section 9.3 in MML and Section 11.6 in Murphy for detailed treatments of Bayesian linear regression.
