Math For Machine Learning: Linear Algebra, Probability, Statistics, Optimization
Recommended Textbooks:
I. Linear Algebra
In [1]: from IPython.display import Image
Image('Images/SVMT.png', width=600)
1. Matrices Fundamentals
A matrix is a two-dimensional table of numbers. Here is an example of a 3 × 3 matrix:

$$\begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix}$$

Vector norms measure the length of a vector. One of the defining properties of a norm is: if ∥x∥ = 0 then x = 0.
Standard norms

The most well-known and widely used norm is the Euclidean norm:

$$\|x\|_2 = \sqrt{\sum_{i=1}^{n} |x_i|^2},$$

which corresponds to the usual notion of distance (the vectors might have complex elements, hence the modulus).
p-norm

The Euclidean norm, or 2-norm, is a special case of an important class of p-norms:

$$\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}.$$

Two other important cases:

The infinity norm, or Chebyshev norm, is defined as the maximal element: $\|x\|_\infty = \max_i |x_i|$
The L1 norm (or Manhattan distance) is defined as the sum of the moduli of the elements of x: $\|x\|_1 = \sum_i |x_i|$
Computing Norms
The numpy package has all we need for computing norms (np.linalg.norm function)
import numpy as np

n = 100
a = np.ones(n)
print(a)
print(np.linalg.norm(a, 1))      # L1 norm
print(np.linalg.norm(a, 2))      # L2 norm
print(np.linalg.norm(a, np.inf)) # infinity norm
b = a + 1e-3 * np.random.randn(n)
print(b)
print()
print('Relative error:',
      np.linalg.norm(a - b, np.inf) / np.linalg.norm(b, np.inf))
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1.]
100.0
10.0
1.0
[1.00037435 0.99953124 0.99992333 1.0014888 1.00032329 1.00050046
1.00078602 0.99924418 0.99969855 1.00210667 1.00094794 1.00080644
0.9995451 0.9973272 0.99983952 1.00145786 1.00006065 1.00228847
1.00024384 0.99927576 1.00074415 1.00052594 1.00042917 1.00072768
0.99929429 1.00141766 1.00173564 0.99915889 1.00009986 0.99906971
0.99996623 0.99950146 0.99925539 1.00130413 1.00083764 0.99949217
0.99928815 0.99995427 1.00000749 0.9989179 0.99995726 0.99777181
1.00036878 1.00070422 0.99823925 1.00020147 0.99977514 0.99915426
0.99980286 1.00108473 0.99983239 1.00131403 1.00066266 1.00175787
0.99936652 1.00029688 1.0008256 1.00069293 0.99890104 1.00069938
1.00032954 0.9997104 1.00069634 1.0006385 0.9987517 0.99944188
0.99896297 1.00068079 0.99903867 1.00045222 1.00141346 1.00191724
0.99962728 0.99963954 0.99829915 1.00199422 1.00041936 1.0002962
0.9986163 1.00145062 1.00210623 0.99923621 1.0001467 1.00030778
1.00013833 0.9994478 0.99868239 1.00042074 0.99946207 0.99804034
0.99985559 1.00228056 1.00085015 1.00133834 0.99957209 1.00028839
1.0024329 1.00053665 1.00051172 1.00138696]
Matrix Norms

How to measure distances between matrices? The most common choice is the Frobenius norm:

$$\|A\|_F = \sqrt{\sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^2}$$

In [4]: n = 100
a = np.random.randn(n, n)          # Random n x n matrix
norm_a = np.linalg.norm(a, 'fro')  # Frobenius norm
print('Frobenius:', norm_a)

Frobenius: 98.98238763060469
2. Operations on Matrices

The inner product is defined as

$$\langle x, y \rangle = \bar{x}^T y = \sum_{i=1}^{n} \bar{x}_i y_i,$$

where $\bar{x}$ denotes the complex conjugate of x.

Orthonormal vectors: vectors that satisfy the following conditions are orthonormal: $x_i^T x_j = 0$ when $i \neq j$ and $x_i^T x_i = 1$
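The input cell for the output below is not shown; a minimal sketch (not necessarily the original code) that reproduces these quantities with NumPy, taking x and y from the printed values, is:

x = np.array([1., 4., 0.])
y = np.array([2., 2., 1.])
print('x:', x)
print('y:', y)
print('Dot product of x and y:', np.dot(x, y))      # sum of elementwise products
print('Inner product of x and y:', np.inner(x, y))  # equals the dot product for 1-D arrays
print('Outer product of x and y:')
print(np.outer(x, y))                                # matrix with entries x_i * y_j
print('Cross product of x and y:', np.cross(x, y))   # 3-D cross product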
Out[5]:
x: [1. 4. 0.]
y: [2. 2. 1.]
Dot product of x and y: 10.0
Inner product of x and y: 10.0
Outer product of x and y: [[2. 2. 1.]
[8. 8. 4.]
[0. 0. 0.]]
Cross product of x and y: [ 4. -1. -6.]
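The cell that produced the matrix and determinant below is not shown; a minimal sketch of such a computation (the random matrix, and hence the determinant, will differ on every run) is:

A = np.random.randint(100, size=(6, 6))   # random 6 x 6 integer matrix
print(A)
print('Determinant:', np.linalg.det(A))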
[[49 88 77 40 78 76]
[35 80 40 85 13 29]
[ 7 74 69 35 61 8]
[96 80 42 16 87 65]
[24 47 1 57 93 26]
[36 14 49 35 85 42]]
Determinant: 107531492377.00043
In [8]: # Inverse
A = np.random.randint(100,size=(5,5))
print(A)
print()
Ainv = np.linalg.inv(A)
print(Ainv)
[[61 6 0 64 58]
[78 20 81 95 35]
[57 41 32 29 49]
[ 3 92 23 69 63]
[14 25 70 47 70]]
A = np.array([[1,2,0], [3,5,9]])
print(A)
print()
print(A.T)
[[1 2 0]
[3 5 9]]
[[1 3]
[2 5]
[0 9]]
B = np.linalg.pinv(A)   # Moore-Penrose pseudo-inverse (assumption: the line defining B was not preserved)
print(A)
print()
print(B)

# Verify the pseudo-inverse property A B A = A
np.allclose(A, np.dot(A, np.dot(B, A)))

Out[11]: True
Matrix Multiplication

Consider two matrix-vector products applied in sequence:

1. y = Bx
2. z = Ay

Then z = A(Bx) = (AB)x, where the product C = AB is defined elementwise as

$$c_{ij} = \sum_{s=1}^{n} a_{is} b_{sj}.$$
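The %timeit comparison below uses a hand-written matmul function whose definition is not shown; a straightforward triple-loop sketch consistent with the formula above (and slow enough to explain the timing gap) is:

def matmul(a, b):
    # naive O(n^3) matrix multiplication: c_ij = sum_s a_is * b_sj
    n, m, k = a.shape[0], b.shape[1], a.shape[1]
    c = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for s in range(k):
                c[i, j] += a[i, s] * b[s, j]
    return c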
In [14]: n = 100
a = np.random.randn(n, n)
b = np.random.randn(n, n)
%timeit c = matmul(a, b)
%timeit c = np.dot(a, b)
724 ms ± 27.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
77.3 µs ± 6.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
3. Special Matrices
In [15]: from IPython.display import Image
Image('Images/DiagSymm.png', width=400)
Diagonal Matrix
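The illustrative cell for this subsection is not shown; a minimal example with np.diag (the values are arbitrary) is:

np.diag([4, 8, 15])   # 3 x 3 matrix with these values on the diagonal and zeros elsewhere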
Identity Matrix
In [17]: np.identity(3)
In [19]: A = np.array([[4,2,1],[4,8,3],[1,1,0]])
I = np.identity(3, dtype=int)
np.dot(A,I)
A square matrix U is called unitary if $U^H U = I$, where $U^H$ denotes the conjugate transpose. When $U^H = U^\top$ (i.e., the matrix is real), a unitary matrix is called orthogonal.
In [21]: a = 0.7
b = (1-a**2)**0.5
U = np.array([[a,b], [-b,a]])
print(U)
print()
print(U.dot(U.conj().T))
[[ 0.7 0.71414284]
[-0.71414284 0.7 ]]
[[1.00000000e+00 1.59237766e-18]
[1.59237766e-18 1.00000000e+00]]
4. Matrix Rank
The maximum number of linearly independent rows in a matrix A is called the row rank of A, and the maximum number of linearly independent columns is called the column rank of A. These two numbers always coincide, and their common value is the rank of A.
n = 50
a1 = np.ones((n, n))                               # every row is the same, so the rank is 1
a2 = np.array([[1, 0, -1], [0, 1, 0], [1, 0, 1]])  # full rank
print(np.linalg.matrix_rank(a1))
print(np.linalg.matrix_rank(a2))
Example:

$$\begin{pmatrix} 8 & 6 & 4 & 1 \\ 1 & 4 & 5 & 1 \\ 8 & 4 & 1 & 1 \\ 1 & 4 & 3 & 6 \end{pmatrix} x = \begin{pmatrix} 19 \\ 11 \\ 14 \\ 14 \end{pmatrix}$$
from numpy import linalg as LA

A = np.array([[8,6,4,1],[1,4,5,1],[8,4,1,1],[1,4,3,6]])
b = np.array([19,11,14,14])
LA.solve(A, b)
# Solving again with a slightly perturbed b (the exact perturbation used originally is not shown):
LA.solve(A, b + 1e-3 * np.random.randn(4))
Note that tiny perturbations in the right-hand-side vector b cause large differences in the solution! When this happens, we say that the matrix A is ill-conditioned.
It can also be defined as the ratio of the largest to the smallest singular value of A:

$$\mathrm{cond}(A) = \frac{\sigma_1}{\sigma_n}$$
In [25]: U, s, Vt = np.linalg.svd(A)
print('Condition number of A: ', max(s)/min(s))
The singular value decomposition (SVD) factorizes a matrix as $A = U \Sigma V^\top$, where U and V have orthonormal columns and Σ is diagonal with the singular values; it has many useful applications in computer vision, signal processing and deep learning.
print(A)
print(b)
print(x)
print(xPinv)
[[ 0.82978723]
[ 1.39893617]
[-0.5 ]
[ 2.13297872]]
[[ 0.82978723]
[ 1.39893617]
[-0.5 ]
[ 2.13297872]]
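The cell that computed x and xPinv above is not shown; a sketch of the usual comparison, assuming A and b are the (overdetermined) system printed above, is:

x = np.linalg.lstsq(A, b, rcond=None)[0]   # least-squares solution
xPinv = np.linalg.pinv(A).dot(b)           # same solution via the Moore-Penrose pseudo-inverse
print(x)
print(xPinv)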
n = 30
324 µs ± 3.12 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
187 µs ± 5.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Exercise
Write a function in Python to solve a system Ax = b using SVD decomposition.
Your function should take A and b as input and return x .
Your function should include the following:
First, check that A is invertible - return an error message if it is not (Hint: the product of the singular values should be non-zero for invertibility)
Invert A using SVD and solve (Remember: $A^{-1} = V \Sigma^{-1} U^T$)
Return x
A = np.array([[1,1],[1,2]])
b = np.array([3,1])
print(np.linalg.solve(A,b))
print(svdsolver(A,b))
[ 5. -2.]
[ 5. -2.]
6. Eigen-things
What is an eigenvector
A vector x ≠ 0 is called an eigenvector of a square matrix A if there exists a number λ such that
Ax = λx.
A = np.diag([1, 2, 3])   # diagonal matrix; its eigenvalues are the diagonal entries (assumed from the output below)
print(A)
w, v = LA.eig(A)
print('Eigen Values:', w)
print('Eigen Vectors:')
print(v)
[[1 0 0]
[0 2 0]
[0 0 3]]
Eigen Values: [1. 2. 3.]
Eigen Vectors:
[[1. 0. 0.]
[0. 1. 0.]
[0. 0. 1.]]
A = np.cov(x)   # x is assumed to be a 2 x N array of data samples (the cell creating it is not shown)
print(A)
[[0.63301579 0.22564912]
[0.22564912 0.21385278]]
In [3]: w, v = LA.eig(A)
print(w)
print(v)
[0.73139846 0.11547011]
[[ 0.91666204 -0.39966324]
[ 0.39966324 0.91666204]]
Fit the slope and intercept so that the linear regression fit minimizes the sum of the squared residuals (the vertical offsets between the data points and the line).
plt.scatter(x, y)
plt.xlabel('x')
plt.ylabel('y')
plt.show()
$$L = \frac{1}{2} \sum_{i=1}^{N} \left( y_i - x_i^\top w \right)^2 = \frac{1}{2} \lVert y - Xw \rVert^2 = \frac{1}{2} (y - Xw)^\top (y - Xw)$$

$$\frac{\partial L}{\partial w} = -y^\top X + w^\top X^\top X = 0$$

$$w = (X^\top X)^{-1} X^\top y$$
# Closed-form solution (normal equations); Xb is assumed to be X with a leading column of ones for the intercept
z = np.linalg.inv(np.dot(Xb.T, Xb))
w = np.dot(z, np.dot(Xb.T, y))
b, w1 = w[0], w[1]   # intercept and slope
slope: 0.84
y-intercept: 915.59
Evaluate
Out[44]: 0.21920128791623675
Out[45]: 0.46818937185313886
II. Probability
In [46]: import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from scipy import stats
Bernoulli Trial
Bernoulli trial (or binomial trial): random experiment with 2 possible outcomes
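The two cells whose outputs appear below counted heads and tails in a batch of simulated coin flips; their code is not shown. A minimal sketch (the sample size and seed are assumptions, so the exact counts will differ) is:

rng = np.random.RandomState(123)            # reproducible random generator
coin_flips = rng.randint(0, 2, size=1000)   # 1000 Bernoulli trials: 0 = tails, 1 = heads
np.sum(coin_flips == 1)                     # number of heads
np.sum(coin_flips == 0)                     # number of tails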
Out[47]: 520
Out[48]: 480
for i in range(7):
    num = 10**i
    coin_flips = rng.randint(0, 2, size=num)
    heads_proba = np.mean(coin_flips)
    print('Heads chance: %.2f' % (heads_proba*100))
n_experiments = 1000        # repeat the experiment 1000 times
n_bernoulli_trials = 100    # 100 coin flips per experiment
rng = np.random.RandomState(123)
outcomes = np.empty(n_experiments, dtype=float)
for i in range(n_experiments):
    coin_flips = rng.randint(0, 2, size=n_bernoulli_trials)
    head_counts = np.sum(coin_flips)
    outcomes[i] = head_counts
plt.hist(outcomes)
plt.xlabel('Number of heads in 100 coin flips')
plt.ylabel('Probability of outcome')
plt.show()
In [51]: p = 0.7   # biased coin: probability of heads
n_experiments = 1000
n_bernoulli_trials = 100
rng = np.random.RandomState(123)
outcomes = np.empty(n_experiments, dtype=float)
for i in range(n_experiments):
    coin_flips = rng.rand(n_bernoulli_trials)
    head_counts = np.sum(coin_flips < p)   # a uniform draw below p counts as heads
    outcomes[i] = head_counts
plt.hist(outcomes)
plt.xlabel('Number of heads in 100 coin flips')
plt.ylabel('Probability of outcome')
plt.show()
Binomial Distribution
Bernoulli trial (or binomial trial): random experiment with 2 possible outcomes
a binomial distribution describes a binomial variable B(n, p): the number of successes in n statistically independent Bernoulli trials; p is the probability of success (and q = 1 - p is the probability of failure)
Probability of k successes:

$$P(k) = \binom{n}{k} p^k q^{n-k}$$

where $\binom{n}{k}$ ("n choose k") is the binomial coefficient

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$
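The cells that produced the two (matching) probabilities below are not shown; a sketch of the two usual ways to evaluate the formula, assuming n = 100 fair-coin flips (p = 0.5) and k = 50 successes, which is consistent with the values shown, is:

from math import factorial
from scipy import stats

n, k, p = 100, 50, 0.5
print(stats.binom.pmf(k, n, p))   # scipy's binomial probability mass function
print(factorial(n) / (factorial(k) * factorial(n - k)) * p**k * (1 - p)**(n - k))   # explicit formula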
Out[53]: 0.07958923738717877
Out[54]: 0.07958923738717888
here, the area under the curve gives the probability (in contrast to probability mass functions, where we have a probability for every single value)
two parameters: mean (center of the peak) and standard deviation (spread); μ, σ
we can estimate parameters of N (μ, σ 2 ) by sample mean (x̄ ) and sample variance (s2 )
$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$$
standard normal distribution with zero mean and unit variance, N (0, 1)
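The plotting cell below calls univariate_gaussian_pdf, whose definition is not shown; a sketch implementing the density above (note that the third argument is the variance, matching the stddev**2 passed below) is:

def univariate_gaussian_pdf(x, mu, sigma_sq):
    # Gaussian probability density with mean mu and variance sigma_sq
    return 1.0 / np.sqrt(2 * np.pi * sigma_sq) * np.exp(-(x - mu)**2 / (2 * sigma_sq))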
mean = 0
stddev = 1
x = np.arange(-5, 5, 0.01)
y = univariate_gaussian_pdf(x, mean, stddev**2)
plt.plot(x, y)
plt.xlabel('data')
plt.ylabel('Probability Density Function (PDF)')
plt.show()
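The read_dataset helper used below is not shown; a minimal sketch, assuming the CSV contains two numeric columns (latency, throughput) and no header, is:

def read_dataset(filepath):
    # load a plain comma-separated file into a NumPy array
    return np.loadtxt(filepath, delimiter=',')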
tr_data = read_dataset('Data/anomaly_detect_data.csv')
from scipy.stats import multivariate_normal

def multivariateGaussian(dataset, mu, sigma):
    # evaluate the multivariate normal density with mean mu and covariance sigma at each row of dataset
    p = multivariate_normal(mean=mu, cov=sigma)
    return p.pdf(dataset)
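The cell that estimated the Gaussian parameters and flagged the outliers plotted below is not shown; a sketch, where the threshold epsilon is a hypothetical value, is:

mu = np.mean(tr_data, axis=0)            # per-feature mean
sigma = np.cov(tr_data.T)                # 2 x 2 covariance matrix
p = multivariateGaussian(tr_data, mu, sigma)
epsilon = 9e-5                           # hypothetical density threshold
outliers = np.nonzero(p < epsilon)[0]    # indices of low-density (anomalous) points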
In [63]: plt.figure()
plt.xlabel('Latency (ms)')
plt.ylabel('Throughput (mb/s)')
plt.plot(tr_data[:,0],tr_data[:,1],'bx')
plt.plot(tr_data[outliers,0],tr_data[outliers,1],'ro')
plt.show()
III. Statistics
import pandas as pd

# read dataset
df = pd.read_csv('Data/iris.csv')

def histo():
    # histogram of sepal length with 0.5 cm-wide bins
    bin_edges = np.arange(0, df['sepal_length'].max() + 1, 0.5)
    plt.hist(df['sepal_length'], bins=bin_edges)

histo()
plt.show()
Sample Mean:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
In [65]: x = df['sepal_length'].values
sum(i for i in x) / len(x)
Out[65]: 5.843333333333335
Out[66]: 5.843333333333334
In [67]: x_mean = np.mean(x)   # sample mean (assumed definition; matches the value printed above)
histo()
plt.axvline(x_mean, color='darkorange')   # mark the mean on the histogram
plt.show()
Sample Variance:

$$\mathrm{Var}(x) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
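The cells producing the two values below (and defining the var used by the plot further down) are not shown; a sketch of the manual and NumPy computations, including the sample standard deviation reported afterwards, is:

var = sum((xi - x_mean)**2 for xi in x) / (len(x) - 1)   # manual sample variance (divide by n - 1)
print(var)
print(np.var(x, ddof=1))   # NumPy sample variance
print(np.std(x, ddof=1))   # NumPy sample standard deviation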
Out[68]: 0.6856935123042504
Out[69]: 0.6856935123042507
In [70]: histo()
plt.axvline(x_mean + var, color='darkorange')
plt.axvline(x_mean - var, color='darkorange')
plt.show()
Out[71]: 0.828066127977863
Out[72]: 0.828066127977863
Min/Max:
In [73]: print(np.min(x))
print(np.max(x))
4.3
7.9
In [75]: np.median(x)
Out[75]: 5.8
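The cell whose output (150, 4) appears below is not shown; presumably it collected the numeric iris columns into a matrix, e.g.:

X = df.iloc[:, :4].values   # assumption: the four numeric columns as a 150 x 4 array
X.shape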
Out[76]: (150, 4)
Sample Covariance

Measures how two variables vary together around their means.
Positive covariance: the two variables tend to be both above or both below their respective means
the variables are positively "correlated" -- they go up or down together
Negative covariance: values of one variable tend to be above its mean while the other is below its mean
a negative covariance means that if one variable goes up, the other tends to go down
$$\sigma_{x,y} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
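The cell producing the covariance value below is not shown; a sketch using a second iris column as an example (the pair of variables used originally is an assumption) is:

y2 = df['petal_length'].values   # hypothetical second variable
print(np.cov(x, y2)[0, 1])       # off-diagonal entry of the 2 x 2 covariance matrix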
Out[77]: 1.2956093959731545
The covariance matrix collects all pairwise covariances, with the variances on the diagonal:

$$\Sigma = \begin{bmatrix}
\sigma_1^2 & \sigma_{1,2} & \sigma_{1,3} & \sigma_{1,4} \\
\sigma_{2,1} & \sigma_2^2 & \sigma_{2,3} & \sigma_{2,4} \\
\sigma_{3,1} & \sigma_{3,2} & \sigma_3^2 & \sigma_{3,4} \\
\sigma_{4,1} & \sigma_{4,2} & \sigma_{4,3} & \sigma_4^2
\end{bmatrix}$$

where each diagonal entry is a variance:

$$\sigma_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
In [78]: np.cov(X.T)
$$\rho_{x,y} = \frac{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2}} = \frac{\sigma_{x,y}}{\sigma_x \sigma_y}$$
Measures degree of a linear relationship between variables, assuming the variables follow a normal distribution
ρ = 1: perfect positive correlation
ρ = 0: no correlation
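The cell producing the correlation value below is not shown; a sketch of the two standard ways to compute it (scipy's pearsonr also returns the p-value discussed next; the column pair is an assumption) is:

from scipy import stats

print(np.corrcoef(x, y2)[0, 1])   # off-diagonal entry of the correlation matrix
print(stats.pearsonr(x, y2))      # (correlation coefficient, two-sided p-value)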
Out[80]: 0.9628654314027963
The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson
correlation at least as extreme as the one computed from these datasets. The p-values are not entirely reliable but are
probably reasonable for datasets larger than 500 or so.
(https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.stats.pearsonr.html)
Scaled Variables
Standardization
$$Z = \frac{X - \mu}{\sigma}$$
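The cell that produced X and X_scaled below is not shown; given the later note about re-applying the same transformer to test data, a sketch with scikit-learn's StandardScaler that reproduces the printed values is:

from sklearn.preprocessing import StandardScaler

X = np.array([[1., -1., 2.],
              [2.,  0., 0.],
              [0.,  1., -1.]])
scaler = StandardScaler().fit(X)    # learns the per-column mean and standard deviation
X_scaled = scaler.transform(X)
print(X)
print(X_scaled)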
[[ 1. -1. 2.]
[ 2. 0. 0.]
[ 0. 1. -1.]]
[[ 0. -1.22474487 1.33630621]
[ 1.22474487 0. -0.26726124]
[-1.22474487 1.22474487 -1.06904497]]
In [84]: X_scaled.mean(axis=0)
In [85]: X_scaled.std(axis=0)
Min-Max Scaling

$$Z = \frac{X - \min(X)}{\max(X) - \min(X)}$$
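The cell producing the min-max scaled array below is not shown either; a sketch, assuming scikit-learn's MinMaxScaler applied to the same X, is:

from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler().fit(X)   # learns the per-column min and max
min_max_scaler.transform(X)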
Out[86]: array([[0.5 , 0. , 1. ],
[1. , 0.5 , 0.33333333],
[0. , 1. , 0. ]])
The same instance of the transformer can then be applied to new test data unseen during the fit call: the same scaling and shifting operations are applied, consistent with the transformation performed on the training data.
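A sketch of that step (the original cell is not shown; X_test is a hypothetical new sample):

X_test = np.array([[-1., 1., 0.]])        # hypothetical unseen data point
print(min_max_scaler.transform(X_test))   # reuses the min/max learned from the training data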
IV. Optimization
Gradient Descent
In [89]: from IPython.display import Image, display
display(Image(filename='Images/GradDescent.jpg', width=700))
If the loss over the dataset decomposes as a sum over the m training examples,

$$J(\theta) = \sum_{i=1}^{m} J_i(\theta),$$

the batch gradient descent algorithm starts with some initial feasible θ (which we can either fix or assign randomly) and then repeatedly performs the update:

$$\theta := \theta - \alpha \sum_{i=1}^{m} \nabla_\theta J_i(\theta)$$
Note that in order to make a single update, we need to calculate the gradient using the entire dataset. This can be very
inefficient for large datasets.
In code, batch gradient descent looks like this:
for i in range(n_epochs):
    params_grad = evaluate_gradient(loss_function, data, params)
    params = params - learning_rate * params_grad
For a given number of epochs n_epochs, we first evaluate the gradient vector of the loss function using ALL the examples in the data set, and then we update the parameters with a given learning rate.
Batch gradient descent is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for
non-convex surfaces.
plt.scatter(points[:,0],points[:,1])
Let’s suppose we want to model the above set of points with a line.
To do this we’ll use the standard y = mx + b line equation where m is the line’s slope and b is the line’s y-intercept.
To find the best line for our data, we need to find the best set of slope m and y-intercept b values.
$$E = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - (m x_i + b) \right)^2$$

$$\frac{\partial E}{\partial m} = \frac{2}{N} \sum_{i=1}^{N} -x_i \left( y_i - (m x_i + b) \right)$$

$$\frac{\partial E}{\partial b} = \frac{2}{N} \sum_{i=1}^{N} -\left( y_i - (m x_i + b) \right)$$
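The notebook's implementation of this gradient step is not shown; a minimal sketch that follows the two partial derivatives above (the points array, the learning rate, and the number of iterations are assumptions) is:

def step_gradient(m, b, points, learning_rate):
    # one gradient-descent update of slope m and intercept b
    N = len(points)
    x, y = points[:, 0], points[:, 1]
    m_grad = (2.0 / N) * np.sum(-x * (y - (m * x + b)))
    b_grad = (2.0 / N) * np.sum(-(y - (m * x + b)))
    return m - learning_rate * m_grad, b - learning_rate * b_grad

m, b = 0.0, 0.0
for _ in range(1000):   # number of iterations chosen arbitrarily
    m, b = step_gradient(m, b, points, learning_rate=0.0001)
print('slope:', m, 'y-intercept:', b)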
An alternative approach, the stochastic gradient descent method, is to update θ sequentially with every observation. The updates
then take the form:
θ := θ − α∇θ Ji (θ)
This allows us to start making progress on the minimization problem right away. It is computationally cheaper, but it results in a
larger variance of the loss function in comparison with GD.
for i in range(nb_epochs):
    np.random.shuffle(data)
    for example in data:
        params_grad = evaluate_gradient(loss_function, example, params)
        params = params - learning_rate * params_grad
For a given epoch, we first reshuffle the data (to avoid bias from a particular order), and then for a single example, we evaluate
the gradient of the loss function and then update the params with the chosen learning rate.
Mini-batch SGD
What if, instead of a single example from the dataset, we use a batch of examples of a given size every time we calculate the gradient:
$$\theta = \theta - \eta \, \nabla_\theta J\!\left(\theta;\, x^{(i:i+n)};\, y^{(i:i+n)}\right)$$
Using mini-batches has the advantage that the variance in the loss function is reduced, while the computational burden is
still reasonable, since we do not use the full dataset.
The size of the mini-batches becomes another hyper-parameter of the problem. In standard implementations it ranges from 50 to
256.
for i in range(nb_epochs):
    np.random.shuffle(data)
    for batch in get_batches(data, batch_size=50):
        params_grad = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * params_grad
The difference from SGD is that for each update we use a batch of a few examples (e.g., 50 or 100) to estimate the gradient.