Gradient and stochastic gradient descent

Iterative methods — overview

In these algorithms, we hope to find a value of \(w\) such that a loss function \(f\) is minimized. The algorithms are of the form \(w_{k+1} = w_k + t_k d_k\) where \(d_k\) is some vector and \(t_k\) is a positive number referred to as the "step size."

Recall that the gradient \(\nabla f(w_k)\) points in the direction of steepest increase of a surface at the point \(w_k\). If we are minimizing a loss function, we would like the vector \(d_k\) to be in a decreasing direction; namely, the angle \(\theta\) between \(d_k\) and \(\nabla f(w_k)\) should be strictly between 90 and 270 degrees. That holds iff \(\cos(\theta) < 0\), i.e.

$$ \frac{\inner{d_k}{\nabla f(w_k)}}{\norm{d_k}\norm{\nabla f(w_k)}} \in [-1, 0) $$

Thus \(d_k\) is a descent direction iff \(\inner{d_k}{\nabla f(w_k)} \in [-\norm{d_k}\norm{\nabla f(w_k)}, 0)\).

Note that the direction of steepest decrease would be \(-\nabla f(w_k)\). This choice gives us "the gradient method" (Fawzi, 2023a, Lecture 4).

Using a constant step size may cause the method to converge slowly or not at all (see Murphy, 2022, pp. 284-285 for examples). The step size can instead be chosen adaptively at each step \(k\). This will be covered below.

Some ways of evaluating convergence

These are from Royer, 2024. This is not an exhaustive list; even the results below appear in forms that differ somewhat from the three criteria listed here.

(1) \(\norm{w_k - w^\ast} \to 0\) as \(k \to \infty\)

Comment: even if we don't know the value of \(w^\ast\) in advance, under certain assumptions we can guarantee this convergence.

(2) \(f(w_k) \to f^\ast\) as \(k \to \infty\)

Same comment as for (1) applies.

(3) \(\norm{\nabla f(w_k)} \to 0\) as \(k \to \infty\)

Note that \(\nabla f(w_k) \to 0\) as \(k \to \infty\) iff \(\norm{\nabla f(w_k)} \to 0\) as \(k \to \infty\), so this is saying that the first order necessary condition is getting closer and closer to being satisfied. Unfortunately since the condition is not a sufficient condition, (3) does not give us a guarantee of convergence on its own.

The gradient method

For a function \(f\), we calculate \(w_{k+1} = w_k - t_k \nabla f(w_k)\).
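As a concrete sketch, the update rule can be implemented in a few lines; the least-squares objective and all constants below are illustrative choices, not part of the cited material.

```python
import numpy as np

# Minimal sketch of the gradient method w_{k+1} = w_k - t_k * grad f(w_k),
# applied to a least-squares loss f(w) = 0.5 * ||Xw - y||^2 (an illustrative
# choice of f; any differentiable loss works the same way).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)

def f(w):
    return 0.5 * np.sum((X @ w - y) ** 2)

def grad_f(w):
    return X.T @ (X @ w - y)

L = np.linalg.eigvalsh(X.T @ X).max()  # smoothness constant of this f
t = 1.0 / L                            # constant step size t_k = 1/L
w = np.zeros(3)
for k in range(500):
    w = w - t * grad_f(w)

w_star = np.linalg.lstsq(X, y, rcond=None)[0]  # closed-form minimizer
```

For this quadratic loss, \(L\) is the largest eigenvalue of \(X^T X\), which is the Lipschitz constant of \(\nabla f\).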

Assume \(f\) is \(L\)-smooth (see L-smoothness and strong convexity).

We will show results for three types of functions: non-convex functions, convex functions, and strongly convex functions.

Non-convex functions

Theorem: Let \(f\) be \(L\)-smooth, and let \(k\) be an arbitrary step with \(\nabla f(w_k) \neq 0\). If \(t_k \in (0, 2/L)\), then \(f(w_{k+1}) < f(w_k)\).

(Royer, 2024, p. 23, Proposition 2.11)

Proof:

This proof is directly from Royer, 2024, p. 23.

Since \(f\) is \(L\)-smooth,

$$ \begin{align} & f(w_k - t_k \nabla f(w_k)) \notag\\ & \leq f(w_k) + \inner{\nabla f(w_k)}{-t_k \nabla f(w_k)} + \frac{L}{2} \norm{-t_k \nabla f(w_k)}^2 \notag\\ & = f(w_k) - t_k \inner{\nabla f(w_k)}{\nabla f(w_k)} + \frac{L}{2} t_k^2 \norm{\nabla f(w_k)}^2 \notag\\ & = f(w_k) + \left(- t_k + \frac{L}{2} t_k^2\right) \norm{\nabla f(w_k)}^2 \notag\\ \end{align} $$

If \(-t_k + (L/2) t_k^2 < 0\), which for \(t_k > 0\) holds iff \(t_k < 2/L\), then (since \(\nabla f(w_k) \neq 0\)) \(f(w_k - t_k \nabla f(w_k)) < f(w_k)\).

\(\square\)
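A quick numerical check of the proposition, using a two-dimensional quadratic as an illustrative test function (the matrix and step sizes below are my choices):

```python
import numpy as np

# f(w) = 0.5 * w^T A w is L-smooth with L = lambda_max(A) = 4 here.
A = np.diag([1.0, 4.0])
f = lambda w: 0.5 * w @ A @ w
grad = lambda w: A @ w

w0 = np.array([1.0, 1.0])
# Steps in (0, 2/L) = (0, 0.5) strictly decrease f (the gradient is nonzero).
decreased = all(f(w0 - t * grad(w0)) < f(w0) for t in [0.1, 0.25, 0.49])
# A step beyond 2/L can increase f.
increased = f(w0 - 0.6 * grad(w0)) > f(w0)
```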

We now show another result, which relates to method (3) for measuring convergence.

Theorem: Let \(f\) be \(L\)-smooth and bounded below by \(f_{\text{low}}\), and suppose we run the gradient method for \(K\) steps with \(t_k = 1/L\). Then

$$ \min_{0 \leq k \leq K - 1} \norm{\nabla f(w_k)} \leq \mathcal{O}\left(\frac{1}{\sqrt{K}}\right) $$

(Royer, 2024, p. 23)

Proof:

This is from Royer, 2024, with my additions.

Setting \(t_k = 1/L\) and using the inequality from above gives

$$ \begin{align} f(w_{k+1}) & \leq f(w_k) - \frac{1}{2L} \norm{\nabla f(w_k)}^2 \notag\\ & \leq f(w_k) - \frac{1}{2L} \left(\min_{0 \leq k \leq K-1}\norm{\nabla f(w_k)}^2\right) \notag \end{align} $$

and therefore

$$ \sum_{k=0}^{K-1} f(w_{k+1}) \leq \sum_{k=0}^{K-1} f(w_k) - \frac{K}{2L} \left(\min_{0 \leq k \leq K-1}\norm{\nabla f(w_k)}^2\right) $$

Cancelling the terms \(f(w_1), \ldots, f(w_{K-1})\), which appear on both sides, gives

$$ f(w_K) \leq f(w_0) - \frac{K}{2L} \left(\min_{0 \leq k \leq K-1}\norm{\nabla f(w_k)}^2\right) $$

Using the \(f_{\text{low}}\) assumption, we have

$$ \begin{align} \quad & f_{\text{low}} \leq f(w_0) - \frac{K}{2L} \left(\min_{0 \leq k \leq K-1}\norm{\nabla f(w_k)}^2\right) \notag\\ \therefore \quad & \min_{0 \leq k \leq K-1}\norm{\nabla f(w_k)}^2 \leq 2L \left(\frac{f(w_0) - f_{\text{low}}}{K}\right)\notag\\ \end{align} $$

[Claim: for nonnegative \(a_1, \ldots, a_K\), we show that \(\min\{a_1^2, \ldots, a_K^2\} = (\min\{a_1, \ldots, a_K\})^2\).

Define \(g: [K] \to \{a_1, \ldots, a_K\}\) by \(g(k) = a_k\) for all \(k \in [K]\). Recall that for \(a \geq 0, b \geq 0\), \(a < b \iff a^2 < b^2\) (Bartle & Sherbert, 2011). Thus since \(g \geq 0\), \(g(i) < g(j) \iff g(i)^2 < g(j)^2\).

This implies that \(\argmin_k \{g(k)\} = \argmin_k \{g(k)^2\}\).

We now show that \(\min_k g(k)^2 = (\min_k g(k))^2\):

$$ \begin{align} \left(\min_k g(k)\right)^2 & = \left[g\left(\argmin_k g(k)\right)\right]^2 \notag\\ & = \left[g\left(\argmin_k g(k)^2\right)\right]^2 \notag\\ & = \min_k g(k)^2 \notag \end{align} $$

so the claim is proved.]

By the claim, we have

$$ \min_{0 \leq k \leq K-1}\norm{\nabla f(w_k)} \leq \sqrt{2L \left(\frac{f(w_0) - f_{\text{low}}}{K}\right)} $$

\(\square\)

Convex functions

Theorem: If \(f\) is convex and \(L\)-smooth w.r.t. the Euclidean norm, with a constant step-size \(t_k = t \in (0, 1/L]\), then

$$ f(w_k) - f^\ast \leq \frac{1}{2tk} \norm{w_0 - w^\ast}^2 $$

or, achieving accuracy \(\varepsilon\) takes at most

$$ \frac{\norm{w_0 - w^\ast}^2}{2t} \cdot \frac{1}{\varepsilon} $$

steps.

We can also say convergence is on the order \(O(1/k)\).

(Fawzi, 2023a)

Proof: Fawzi, 2023a.

Strongly convex functions

Theorem: If \(f\) is \(m\)-strongly convex and has \(L\)-Lipschitz continuous gradient w.r.t. the Euclidean norm, and \(t = 2 / (m + L)\), then

$$ f(w_k) - f^\ast \leq \frac{L}{2} \left(\frac{\kappa - 1}{\kappa + 1}\right)^{2k} \norm{w_0 - w^\ast}^2 $$

where \(\kappa = L / m \geq 1\). Achieving accuracy \(\varepsilon\) takes no more than a constant times \(\log(1/\varepsilon)\) steps. We can also say convergence is of the order \(O(\gamma^k)\) where \(\gamma \in (0, 1)\).

(Fawzi, 2023a)

Proof: Fawzi, 2023a.

Note on convergence speed

We observe that for \(\gamma \in (0, 1)\), \(\gamma^k\) converges faster than \(1/k\). In fact, \(\gamma^k = o(k^{-a})\) for any \(a \in \mb{Z}_{+}\). Treating \(k\) as a continuous variable and rewriting the ratio in \(\infty/\infty\) form, we apply l'Hôpital's rule repeatedly:

$$ \begin{align*} \lim_{k \to \infty} \frac{\gamma^k}{k^{-a}} & = \lim_{k \to \infty} \frac{k^a}{\gamma^{-k}} = \lim_{k \to \infty} \frac{k^a}{\exp(-k \log(\gamma))} \\ & = \lim_{k \to \infty} \frac{a k^{a-1}}{-\log(\gamma) \cdot \exp(-k \log(\gamma))} \\ & = \cdots \\ & = \lim_{k \to \infty} \frac{\prod_{i=0}^{a-1} (a-i)}{[-\log(\gamma)]^a \gamma^{-k}} \\ & = \frac{\prod_{i=0}^{a-1} (a-i)}{[-\log(\gamma)]^a} \lim_{k \to \infty} \gamma^{k} = 0 \end{align*} $$
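A small numerical illustration of this limit (the values of \(\gamma\) and \(a\) are arbitrary choices):

```python
# gamma^k / k^(-a) = gamma^k * k^a should tend to 0 even though k^a grows.
gamma, a = 0.9, 3
ratios = [gamma**k * k**a for k in (10, 100, 500, 1000)]
```

Despite the polynomial factor \(k^a\), the exponential decay of \(\gamma^k\) dominates for large \(k\).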

Step size

Generally, we don't know the "best" choice of step size in advance (and the constant \(L\) may be unknown). The following is one method for finding a large step size that still guarantees sufficient decrease.

Backtracking line search (ref. Fawzi, 2023a; Murphy, 2022): starting from a large enough value \(t_k \leftarrow \widehat{t_k}\), we decrease \(t_k \leftarrow \beta t_k\) (for some \(\beta \in (0, 1)\)) until the sufficient-decrease (Armijo) condition

$$ f(w_k - t_k \nabla f(w_k)) \leq f(w_k) - \alpha t_k \norm{\nabla f(w_k)}^2 $$

is satisfied, for some fixed \(\alpha \in (0, 1)\). (The smoothness inequality from the proof of the Theorem above cannot serve as the stopping condition, since \(L\)-smoothness makes it hold for every \(t_k\); with \(\alpha = 1/2\), the condition above instead demands the decrease that the Theorem guarantees at \(t_k = 1/L\).)
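A minimal sketch of backtracking line search using the Armijo sufficient-decrease condition \(f(w - t\nabla f(w)) \leq f(w) - \alpha t \norm{\nabla f(w)}^2\); the constants \(\alpha\), \(\beta\), \(\widehat{t}\) and the quadratic test function are illustrative choices.

```python
import numpy as np

def backtracking_step(f, grad_f, w, t_hat=1.0, alpha=0.5, beta=0.5):
    """Shrink t until the Armijo sufficient-decrease condition holds."""
    g = grad_f(w)
    gg = np.dot(g, g)
    t = t_hat
    while f(w - t * g) > f(w) - alpha * t * gg:
        t *= beta
    return t

# Example: f(w) = 0.5 w^T A w with A = diag(1, 10), so L = 10.
A = np.diag([1.0, 10.0])
f = lambda w: 0.5 * w @ A @ w
grad_f = lambda w: A @ w
w = np.array([1.0, 1.0])
t = backtracking_step(f, grad_f, w)
```

Each rejection multiplies \(t\) by \(\beta\), so the search terminates after a handful of halvings here.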

Second-order methods

First-order methods do not use the curvature of the parameter space. One method which does is Newton's method (Fawzi, 2023b; Fawzi, 2023c):

$$ w_{k+1} = w_k - {\nabla^2 f(w_k)}^{-1} \nabla f(w_k) $$

Recall the Taylor formula up to second order:

$$ f(w + h) = f(w) + \inner{\nabla f(w)}{h} + \frac{1}{2} h^T (\nabla^2 f(w)) h + o(\norm{h}^2) $$

If we ignore the \(o(\norm{h}^2)\) term, the right-hand side is an approximation of the value of \(f(w + h)\) based on the value of \(f\) at \(w\). We wish to find an \(h\) that minimizes \(f(w_k + h)\); we use this approximation to do so, and the resulting \(w_k + h\) will be \(w_{k+1}\).

If we minimize the right-hand side with respect to \(h\), we get \(h = -(\nabla^2 f(w))^{-1} \nabla f(w)\):

[Letting \(g(x) = x^T Ax\),

$$ \begin{align*} dg & = g(x + dx) - g(x) \\ & = (x + dx)^T A (x + dx) - x^T Ax \\ & = (dx)^T Ax + x^T A dx + \underbrace{[(dx)^T A dx]}_{\text{higher order}} \\ & = x^T A^T dx + x^T A dx + o(\norm{dx}) \\ & = x^T (A^T + A) dx + o(\norm{dx}) \end{align*} $$

so \((\partial g(x) / \partial x) = x^T(A + A^T)\) (Edelman & Johnson, 2023).

Now letting \(g(x) = x^T v\) where \(v\) is a vector,

$$ \begin{align*} dg(x) & = g(x + dx) - g(x) \\ & = (x + dx)^T v - x^T v \\ & = (dx)^T v = v^T dx \end{align*} $$

so \((\partial g(x) / \partial x) = v^T\).

Using the above,

$$ \begin{align*} & \frac{d}{dh}(RHS) = (\nabla f(w))^T + h^T \nabla^2 f(w) = 0 \\ \iff & -(\nabla f(w))^T = h^T \nabla^2 f(w) \\ \iff & \nabla^2 f(w) h = -\nabla f(w) \end{align*} $$

so \(h = -(\nabla^2 f(w))^{-1} \nabla f(w)\).]

Thus we set \(w_{k+1} = w_k -(\nabla^2 f(w_k))^{-1} \nabla f(w_k)\).
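A sketch of this iteration; the separable strongly convex test function \(f(w) = \sum_i (e^{w_i} - w_i)\), whose minimizer is \(w = 0\), is an illustrative choice, and in practice one solves the Newton linear system rather than forming the inverse Hessian.

```python
import numpy as np

def newton(grad_f, hess_f, w0, steps=8):
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        # Solve hess_f(w) h = -grad_f(w) instead of inverting the Hessian.
        w = w - np.linalg.solve(hess_f(w), grad_f(w))
    return w

# f(w) = sum_i (exp(w_i) - w_i): gradient exp(w) - 1, Hessian diag(exp(w)).
grad_f = lambda w: np.exp(w) - 1.0
hess_f = lambda w: np.diag(np.exp(w))
w = newton(grad_f, hess_f, np.array([1.0, -0.5]))
```

The error roughly squares at each step, so a handful of iterations reaches machine precision.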

Theorem: For \(f \in \mc{C}_{M}^{2,2}(\mb{R}^d)\) (i.e., twice continuously differentiable with \(M\)-Lipschitz Hessian) where \(f\) is \(m\)-strongly convex, Newton's method has quadratic convergence:

$$ \frac{M}{2m^2} \norm{\nabla f(w_{k+1})} \leq \left(\frac{M}{2m^2} \norm{\nabla f(w_k)}\right)^2 $$

(Fawzi, 2023b, Lecture 15) (see same reference for Proof)

Defining \(r_k = (M / 2m^2) \norm{\nabla f(w_k)}\), we have \(r_{k+1} \leq r_k^2\), so \(r_k \leq r_0^{2^k}\). In particular, if \(r_0 < 1\), we will have \(r_k \to 0\) as \(k \to \infty\).

This leads to (Fawzi, 2023b, Lecture 15):

$$ f(w_k) - f^\ast \leq \frac{1}{2m} \norm{\nabla f(w_k)}^2 $$

thus \(f(w_k) - f^\ast \leq \frac{2m^3}{M^2} r_k^2 \leq \frac{2m^3}{M^2} \left(r_0^2\right)^{2^k}\). We conclude that if \(f\) is \(m\)-strongly convex and if a \(w_0\) can be found such that \(r_0 < 1\), we will have convergence on the order of \(O(\gamma^{2^k})\) with \(\gamma = r_0^2 \in (0, 1)\).

Note that a modified version of Newton's method (called just "Newton's method" by Murphy, 2022) adds a step size: \(w_{k+1} = w_k - t_k {\nabla^2 f(w_k)}^{-1} \nabla f(w_k)\).

Momentum

Polyak's momentum

"Gradient descent can move very slowly along flat regions of the loss landscape" (Murphy, 2022, p. 287).

This can be addressed by adding an additional term to the gradient update; the result is called Polyak's momentum method. The word "momentum" here refers to the fact that our base value is no longer \(w_k\) but \(w_k\) plus the "momentum" we had from going from \(w_{k-1}\) to \(w_k\).

$$ \begin{array}{ll} y_k = w_k + \beta (w_k - w_{k-1})\\ w_{k+1} = y_k - t_k \nabla f(w_k) \end{array} $$

(Chen, 2023)

(Note that if \(\beta = 0\) we just have the gradient method.)

Nesterov momentum

"One problem with the standard momentum method is that it may not slow down enough at the bottom of a valley, causing oscillation" (Murphy, 2022, p. 288). Nesterov momentum can address this.

$$ \begin{array}{ll} y_k = w_k + \beta (w_k - w_{k-1})\\ w_{k+1} = y_k - t_k \nabla f(y_k) \end{array} $$
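The two update rules differ only in where the gradient is evaluated. A sketch on a quadratic test function (the function, \(\beta\), and \(t\) are illustrative choices):

```python
import numpy as np

A = np.diag([1.0, 10.0])          # f(w) = 0.5 w^T A w (illustrative)
grad_f = lambda w: A @ w

def momentum(nesterov, beta=0.8, t=0.05, steps=200):
    w_prev = w = np.array([1.0, 1.0])
    for _ in range(steps):
        y = w + beta * (w - w_prev)
        # Polyak evaluates the gradient at w; Nesterov at the look-ahead y.
        g = grad_f(y) if nesterov else grad_f(w)
        w_prev, w = w, y - t * g
    return w

w_polyak = momentum(nesterov=False)
w_nesterov = momentum(nesterov=True)
```

Both runs converge to the minimizer at the origin for these parameter values.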

[Figure: comparison of Polyak momentum and Nesterov momentum. Image and caption from Mitliagkas, 2018, p. 3.]

Stochastic gradient descent (SGD)

For the analyses in this section to work, it must be the case that we are minimizing a finite average of functions, namely \(f(w) = \frac{1}{n}\sum_{i=1}^n f_i(w)\). This fits with minimizing empirical risk (the estimator of theoretical risk; we assume our data is i.i.d. in this framework):

$$ f(w) := \frac{1}{n} \sum_{i=1}^n f_i(w), \quad\text{where}\quad f_i(w) := \ell(h(x_i, w); y_i) $$

for a loss function \(\ell\).

For stochastic gradient descent, we can only state our performance guarantees in terms of expectation, where expectation is taken over the random variable used to sample the subset of gradients chosen.

Under SGD, our update is: \(w_{k+1} = w_k - t_k \nabla_w f_{I_k}(w_k)\), where \(I_k \sim \text{Multinoulli}(n, 1/n, \ldots, 1/n)\) (i.e. a discrete random variable taking values in \(\{1, \ldots, n\}\), with each value having probability \(1/n\)). The gradient used at step \(k\) is thus a random variable \(\nabla f_{I_k}(w_k)\) with realized value \(\nabla f_{i_k}(w_k)\). Also, \(w_{k+1}\), and thus \(f(w_{k+1})\), is a function of the random variable \(I_k\).
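A sketch of this update on a least-squares problem, where \(f_i\) is the squared loss on example \(i\) (the data-generating setup is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                    # noiseless targets (illustrative)

def grad_fi(w, i):
    # gradient of f_i(w) = 0.5 * (x_i . w - y_i)^2
    return (X[i] @ w - y[i]) * X[i]

w = np.zeros(3)
t = 0.01                          # constant step size
for k in range(20000):
    i = rng.integers(n)           # realization of I_k, uniform on {0,...,n-1}
    w = w - t * grad_fi(w, i)
```

With noiseless targets the gradient variance vanishes at the optimum, so even a constant step size converges here; with noisy targets the iterates would instead hover in a neighborhood of \(w^\ast\), as the theory in this section describes.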

Defining the trace variance

Let \(X\) be a random vector and let \(f(X) = (f_1(X), \ldots, f_n(X))\). I define the "trace variance" as \(\trVar_X[f(X)] = \E_X\left[\norm{f(X) - \E_X[f(X)]}^2\right]\). Observe that

$$ \begin{align} \trVar_X[f(X)] & = \E_X\left[\norm{f(X) - \E[f(X)]}^2\right] \notag\\ & = \E_X\left[\inner{f(X) - \E[f(X)]}{f(X) - \E[f(X)]}\right] \notag\\ & = \sum_{i=1}^n \E_X[(f_i(X) - \E_X[f_i(X)])^2]\notag\\ & = \sum_{i=1}^n \Var_{X}[f_i(X)] \notag\\ & = \sum_{i=1}^n (\Cov_{X}[f(X)])_{ii} \notag\\ & = \tr(\Cov_X[f(X)]) \notag \end{align} $$

Also note that

$$ \begin{align} \trVar_X[f(X)] & = \E_X\left[\norm{f(X) - \E_X[f(X)]}^2\right] \notag\\ & = \E_X\left[\inner{f(X) - \E_X[f(X)]}{f(X) - \E_X[f(X)]}\right] \notag\\ & = \E_X[\norm{f(X)}^2] + \E_X[\norm{E_X[f(X)]}^2] - 2 E_X[\inner{f(X)}{E_X[f(X)]}] \notag\\ & = \E_X[\norm{f(X)}^2] + \norm{E_X[f(X)]}^2 - 2 E_X[\inner{f(X)}{E_X[f(X)]}] \notag\\ & = \E_X[\norm{f(X)}^2] - \norm{E_X[f(X)]}^2 \notag \end{align} $$

where the last step uses linearity of expectation (the inner product is a finite sum), which gives \(\E_X[\inner{f(X)}{\E_X[f(X)]}] = \norm{\E_X[f(X)]}^2\).

This property is similar to the behavior of the variance operator for scalar-valued functions of random variables.
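A numerical sanity check of the two identities, with sample averages standing in for expectations (the distribution of the random vector is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(1)
# Draws of a random vector with unequal component variances (illustrative).
samples = rng.normal(size=(100000, 4)) * np.array([1.0, 2.0, 3.0, 4.0])

mean = samples.mean(axis=0)
# Definition: trVar = E ||f(X) - E f(X)||^2
tr_var = np.mean(np.sum((samples - mean) ** 2, axis=1))
# Identity 1: trVar = tr(Cov)
via_cov = np.trace(np.cov(samples, rowvar=False, ddof=0))
# Identity 2: trVar = E ||f(X)||^2 - ||E f(X)||^2
via_moments = np.mean(np.sum(samples ** 2, axis=1)) - np.sum(mean ** 2)
```

All three quantities agree (up to floating-point error), and here equal roughly \(1 + 4 + 9 + 16 = 30\), the sum of the component variances.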

We will first apply the trace variance to the quantity \(g(I_k) = \nabla f_{I_k}(w_k)\), where \(g\) is a function of the random variable \(I_k\) and \(I_k\) is Multinoulli as above.

Finding a potential decrease in \(E_{I_k}[f(w_{k+1})]\)

Lemma: Let \(f\) be \(L\)-smooth. Assume there exists \(\sigma^2 > 0\) such that \(\trVar_{I_k}[\nabla f_{I_k}(w_k)] \leq \sigma^2\). Then

$$ E_{I_k}[f(w_{k+1})] \leq f(w_k) - \left(t_k - \frac{1}{2} L t_k^2\right) \norm{\nabla f(w_k)}^2 + \frac{1}{2} L t_k^2 \sigma^2 $$

(Royer, 2023, p. 14, Proposition 2.3.2)

Proof:

We will roughly follow the argument of Royer, 2023, p. 13 - 14, but with clarifying statements and a new definition added, which results in Assumption 2.3.2 (2) not being needed in our proof.

Assume \(f\) is \(L\)-smooth. We bound \(f(w_{k+1})\) using a second-order Taylor expansion at \(w_k\), with the Hessian evaluated at some point \(z_k\) on the segment between \(w_k\) and \(w_{k+1}\) (the gradients are all taken with respect to \(w\)):

$$ \begin{align} f(w_{k+1}) & = f(w_k - t_k \nabla f_{I_k}(w_k)) \notag\\ & = f(w_k) + \inner{\nabla f(w_k)}{-t_k \nabla f_{I_k}(w_k)} \notag\\ & \qquad\qquad + \frac{1}{2} (-t_k \nabla f_{I_k}(w_k))^T \nabla^2 f(z_k) (-t_k \nabla f_{I_k}(w_k)) \notag\\ & \leq f(w_k) - t_k \inner{\nabla f(w_k)}{\nabla f_{I_k}(w_k)} + \frac{1}{2} t_k^2 L \norm{\nabla f_{I_k}(w_k)}^2 \notag \end{align} $$

where the last step is taken by the Lemma from L-smoothness and strong convexity that says \(\inner{\nabla^2 f(x) u}{v} \leq L \norm{u}\norm{v}\).

Take the expectation with respect to \(I_k\):

$$ \begin{align*} E_{I_k}[f(w_{k+1})] & \leq f(w_k) - t_k \inner{\nabla f(w_k)}{E_{I_k}\left[\nabla f_{I_k}(w_k)\right]} \\ & \quad + \frac{1}{2} L t_k^2 E_{I_k}\left[\norm{\nabla f_{I_k}(w_k)}^2\right] \end{align*} $$

Observe that

$$ \E_{I_k}\left[\nabla f_{I_k}(w_k)\right] = \sum_{j=1}^n p(I_k = j) \nabla f_j(w_k) = \sum_{j=1}^n \frac{1}{n} \nabla f_j(w_k) = \nabla f(w_k) $$

so we have

$$ E_{I_k}[f(w_{k+1})] \leq f(w_k) - t_k \norm{\nabla f(w_k)}^2 + \frac{1}{2} L t_k^2 E_{I_k}\left[\norm{\nabla f_{I_k}(w_k)}^2\right] $$

We now assume there exists a positive \(\sigma^2\) such that \(\trVar_{I_k}[\nabla f_{I_k}(w_k)] \leq \sigma^2\). Note that since \(\E_{I_k}\left[\nabla f_{I_k}(w_k)\right] = \nabla f(w_k)\),

$$ \begin{align} \trVar_{I_k}[\nabla f_{I_k}(w_k)] & = \E_{I_k}[\norm{\nabla f_{I_k}(w_k)}^2] - \norm{E_{I_k}[\nabla f_{I_k}(w_k)]}^2 \notag\\ & = \E_{I_k}[\norm{\nabla f_{I_k}(w_k)}^2] - \norm{\nabla f(w_k)}^2 \notag\\ & \leq \sigma^2\notag \end{align} $$

which we could write as

$$ E_{I_k}\left[\norm{\nabla f_{I_k}(w_k)}^2\right] \leq \sigma^2 + \norm{\nabla f(w_k)}^2 $$

Using this with the above gives

$$ \begin{align*} E_{I_k}[f(w_{k+1})] & \leq f(w_k) - t_k \norm{\nabla f(w_k)}^2 + \frac{1}{2} L t_k^2 \left(\sigma^2 + \norm{\nabla f(w_k)}^2\right) \\ & \leq f(w_k) - \left(t_k - \frac{1}{2} L t_k^2\right) \norm{\nabla f(w_k)}^2 + \frac{1}{2} L t_k^2 \sigma^2 \end{align*} $$

\(\square\)

Depending on the quantities on the right hand side, we may have a decrease in the expected value for an arbitrary step.

Bounding the expectation of \(f(w_k) - f^\ast\) for an \(m\)-strongly convex function

If \(f\) is \(m\)-strongly convex, we will be able to guarantee a type of convergence in expectation, via a gradient-norm lower bound implied by strong convexity (Royer, 2023, p. 14, Lemma 2.3.1), stated at the start of the proof below.

Theorem: Let \(f\) be \(L\)-smooth and \(m\)-strongly convex with \(m \leq L\). Let \(t_k = t \in (0, 1/L]\). Assume there exists \(\sigma^2 > 0\) such that \(\trVar_{I_k}[\nabla f_{I_k}(w_k)] \leq \sigma^2\). Then we have

$$ E_{1 \cdots (k-1)}\left[f(w_k) - f^\ast \right] \leq (1 - mt)^k \left[(f(w_0) - f^\ast) - \frac{Lt}{2m} \sigma^2\right] + \frac{Lt}{2m} \sigma^2 $$

and thus since \(1 - mt \in [0, 1)\),

$$ \limsup_{k \to \infty} E_{1 \cdots (k-1)}\left[f(w_k) - f^\ast \right] \leq \frac{Lt}{2m} \sigma^2 $$

(note that the decrease in expectation occurs at rate \(O(\gamma^k)\) where \(\gamma \in (0, 1)\)).

(This is from Royer, 2023, Theorem 2.3.1, but he does not include the last statement involving the \(\limsup\).)

Proof:

First, since \(f\) is \(L\)-smooth and \(m\)-strongly convex we have

$$ \norm{\nabla f(w_k)}^2 \geq 2m(f(w_k) - f^\ast) $$

(Royer, 2023, p. 14, Lemma 2.3.1)

Using this with our inequality gives

$$ \begin{align*} E_{I_k}[f(w_{k+1})] - f(w_k) & \leq - \left(t_k - \frac{1}{2} L t_k^2\right) 2m (f(w_k) - f^\ast) + \frac{1}{2} L t_k^2 \sigma^2 \\ & = -2m t_k \left(1 - \frac{1}{2} L t_k\right) (f(w_k) - f^\ast) + \frac{1}{2} L t_k^2 \sigma^2 \end{align*} $$

Choosing a fixed step-size \(t_k = t \in (0, 1/L]\) gives

$$ \begin{align*} & t \leq \frac{1}{L} \iff -\frac{Lt}{2} \geq -\frac{1}{2} \iff 1 - \frac{Lt}{2} \geq \frac{1}{2} \\ & \implies -2mt \left(1 - \frac{Lt}{2}\right) \leq -2mt \left(\frac{1}{2}\right) = -mt \end{align*} $$

so we have

$$ \begin{align} E_{I_k}[f(w_{k+1})] - f(w_k) & \leq -m t (f(w_k) - f^\ast) + \frac{1}{2} L t^2 \sigma^2 \notag \end{align} $$

Adding and subtracting \(f^\ast\) to the LHS gives

$$ \begin{align} E_{I_k}[f(w_{k+1})] - f^\ast + f^\ast - f(w_k) & \leq -m t (f(w_k) - f^\ast) + \frac{1}{2} L t^2 \sigma^2 \notag \end{align} $$

so that

$$ \begin{align} E_{I_k}[f(w_{k+1})] - f^\ast & \leq (1 - mt) (f(w_k) - f^\ast) + \frac{1}{2} L t^2 \sigma^2 \notag \end{align} $$

We note that \(1 - mt \in [0, 1)\): since \(m \leq L\), we have \(0 < t \leq 1/L \leq 1/m\), so \(0 < mt \leq 1\) and hence \(1 - mt \in [0, 1)\).

Subtracting \(\frac{Lt}{2m}\sigma^2\) from both sides gives

$$ \begin{align} E_{I_k}\left[f(w_{k+1}) - f^\ast - \frac{Lt}{2m}\sigma^2\right] & \leq (1 - mt) (f(w_k) - f^\ast) + \frac{1}{2} L t^2 \sigma^2 - \frac{Lt}{2m} \sigma^2 \notag\\ & = (1 - mt) \left[(f(w_k) - f^\ast) - \frac{Lt}{2m} \sigma^2\right] \notag\\ \end{align} $$

We take the expectation of both sides with respect to \(I_{k-1}\):

$$ \begin{align*} & E_{I_{k-1}}\left[E_{I_k}\left[f(w_{k+1}) - f^\ast - \frac{Lt}{2m} \sigma^2\right]\right] \\ & \leq (1 - mt) E_{I_{k-1}}\left[(f(w_k) - f^\ast) - \frac{Lt}{2m} \sigma^2\right] \\ & \leq (1 - mt)^2 \left[(f(w_{k-1}) - f^\ast) - \frac{Lt}{2m} \sigma^2\right] \end{align*} $$

Continuing until \(\text{E}_{I_1}\), and writing \(E_{1 \cdots k}[\bullet] := E_{I_1}[E_{I_2}[\cdots E_{I_k}[\bullet]\cdots]]\), we have

$$ E_{1 \cdots k}\left[f(w_{k+1}) - f^\ast - \frac{Lt}{2m} \sigma^2\right] \leq (1 - mt)^k \left[(f(w_1) - f^\ast) - \frac{Lt}{2m} \sigma^2\right] $$

Shifting indices down by one and adding \(\frac{Lt}{2m}\sigma^2\) to both sides gives

$$ E_{1 \cdots (k-1)}\left[f(w_k) - f^\ast \right] \leq (1 - mt)^k \left[(f(w_0) - f^\ast) - \frac{Lt}{2m} \sigma^2\right] + \frac{Lt}{2m} \sigma^2 $$

and thus since \(1 - mt \in [0, 1)\),

$$ \limsup_{k \to \infty} E_{1 \cdots (k-1)}\left[f(w_k) - f^\ast \right] \leq \frac{Lt}{2m} \sigma^2 $$

\(\square\)

It can be shown (Royer, 2023, p. 16-17, Theorem 2.3.2) that for all the same assumptions but using a particular diminishing step size as opposed to a constant step size, we have

$$ E_{1 \cdots (k-1)} \left[f(w_{k}) - f^\ast\right] \leq \frac{\nu}{\gamma + k} $$

for some relevant constants \(\nu, \gamma\) not involving \(k\), i.e. we have convergence with rate \(O(1/k)\).

This is a slower convergence rate than when we use a constant step-size, but we actually converge to the optimal value in expectation rather than to a neighborhood of the optimal value (Royer, 2023, p. 18).

Non-convex case

For stochastic gradient descent with constant step size, we have results that can be used in conjunction with the third of our "convergence" criteria, namely the one which uses the squared norm of the gradient at each step. First, we have

$$ \frac{1}{K} \sum_{k=1}^{K} \text{E}_{1 \cdots (K-1)}\left[\norm{\nabla f(w_k)}^2\right] \leq t L \sigma^2 + \frac{2 (f(w_0) - f^\ast)}{tK} $$

(Royer, 2023, p. 18, Theorem 2.3.3)

where the expectation is again taken over all indices. I.e. the average expected squared norm of the gradients is bounded above by a variance term plus a term that is \(O(1/K)\).

For a sequence of decreasing step sizes satisfying particular assumptions,

$$ \frac{1}{\sum_{i=0}^{k-1} t_i} \sum_{i=0}^{k-1} t_i \text{E}_{1 \cdots (k-1)}\left[\norm{\nabla f(w_i)}^2\right] \longrightarrow 0 \quad \text{as } k \to \infty $$

(Royer 2023, p. 19, Theorem 2.3.4)

i.e. regardless of \(\sigma^2\), a weighted average of the expected squared norms of the gradients converges to 0 as the number of steps goes to infinity.

Mini-batch SGD: a variance reduction technique

Intuition: SGD uses a sample of size one from the full set of gradients \(\{\nabla f_i\}\) as an estimate of \(\nabla f\). Using the average of a larger sample should reduce the variance of the estimator (in particular, we will show that it reduces the trace variance of the estimator by a factor of \(1/M\)). In machine learning, at least in this context, such a sample is referred to as a "mini-batch." The mini-batch can be drawn with or without replacement; here we assume it is drawn with replacement.

Applying the trace variance to a mini-batch

Let \(S_k = (I_{k,1}, \ldots, I_{k,M})\) where \(M\) is the size of the mini-batch at step \(k\) (the subscript \(k\) is left off of \(M\) for readability), where for each \(m \in [M]\), \(I_{k,m}\) is \(\text{Multinoulli}(n, 1/n, \ldots, 1/n)\) and where for all \(i, j \in [M], i \neq j\) we have that \(I_{k,i}, I_{k,j}\) are pairwise independent.

Let \(V^2\) denote the trace variance of \(f(I_{k,j})\) for arbitrary \(j\). Note that

$$ \begin{align} \trVar_{S_k}\left[\frac{1}{M} \sum_{m=1}^M f(I_{k,m})\right] & = \tr\left(\Cov_{S_k}\left[\frac{1}{M} \sum_{m=1}^M f(I_{k,m})\right]\right) \notag\\ & = \frac{1}{M^2} \tr\left(\Cov_{S_k}\left[\sum_{m=1}^M f(I_{k,m})\right]\right) \notag\\ & = \frac{1}{M^2} \tr\left(\sum_{m=1}^M \Cov_{S_k}\left[f(I_{k,m})\right]\right) \quad (\text{by indep.})\notag\\ & = \frac{1}{M^2} \tr\left(M \cdot \Cov_{I_{k,1}}\left[f(I_{k,1})\right]\right) \notag\\ & = \frac{1}{M} \tr\left(\Cov_{I_{k,1}}\left[f(I_{k,1})\right]\right) \notag\\ & = \frac{1}{M} \trVar_{I_{k,1}}\left[f(I_{k,1})\right] \notag\\ & = \frac{1}{M} V^2 \notag\\ \end{align} $$

Thus the trace variance of the mini-batch average behaves like the variance of the sample mean of an i.i.d. sample.
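A Monte Carlo check that averaging \(M\) independent draws divides the trace variance by \(M\) (the sample sizes and distribution are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)

def tr_var(v):
    # sample trace variance: mean squared distance to the sample mean
    return np.mean(np.sum((v - v.mean(axis=0)) ** 2, axis=1))

M = 16
single = rng.normal(size=(100000, 3))                      # draws of f(I)
batch_means = rng.normal(size=(50000, M, 3)).mean(axis=1)  # mini-batch averages

ratio = tr_var(single) / tr_var(batch_means)               # approximately M
```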

Let \(m_k = (1/M) \sum_{m=1}^M \nabla f_{I_{k,m}}(w_k)\). For an \(L\)-smooth function, the same expansion as in the SGD case gives

$$ \begin{align} & f(w_{k+1}) = f(w_k - t_k m_k) \notag\\ & = f(w_k) + \inner{\nabla f(w_k)}{- t_k m_k} + \frac{1}{2} (-t_k m_k)^T \nabla^2 f(w_k) (-t_k m_k) \notag\\ & \leq f(w_k) - t_k \inner{\nabla f(w_k)}{m_k} + \frac{1}{2} t_k^2 L \norm{m_k}^2 \notag \end{align} $$

Therefore

$$ \begin{align} \E_{S_k}[f(w_{k+1})] & \leq f(w_k) - t_k \inner{\nabla f(w_k)}{\E_{S_k}[m_k]} + \left(\frac{1}{2} t_k^2 L\right) \E_{S_k}[\norm{m_k}^2] \notag\\ \end{align} $$

We assume that for all \(m \in [M]\), \(\trVar_{I_{k,m}}[\nabla f_{I_{k,m}}(w_k)] \leq \sigma^2\). By the above property, we have

$$ \trVar_{S_k}[m_k] = \frac{1}{M} \trVar_{I_{k,1}}[\nabla f_{I_{k,1}}(w_k)] \leq \frac{1}{M} \sigma^2 $$

so \(\E_{S_k}[\norm{m_k}^2] \leq \norm{\E_{S_k}[m_k]}^2 + (1/M) \sigma^2\).

Recall that

$$ \begin{align} \E_{S_k} \left[\frac{1}{M} \sum_{m=1}^M \nabla f_{I_{k,m}}(w_k)\right] & = \frac{1}{M} \sum_{m=1}^M \E_{S_k} \left[\nabla f_{I_{k,m}}(w_k)\right] \notag\\ & = \frac{1}{M} \sum_{m=1}^M \E_{I_{k,m}} \left[\nabla f_{I_{k,m}}(w_k)\right]\notag\\ & = \frac{1}{M} M \nabla f(w_k) \notag\\ & = \nabla f(w_k) \notag \end{align} $$

so \(m_k\) is an unbiased estimator of \(\nabla f(w_k)\) (this computation relies only on linearity of expectation and on each single-sample gradient being unbiased). Thus the above inequality becomes \(\E_{S_k}[\norm{m_k}^2] \leq \norm{\nabla f(w_k)}^2 + (1/M) \sigma^2\).

Thus we have

$$ \begin{align} \E_{S_k}[f(w_{k+1})] & \leq f(w_k) - t_k \inner{\nabla f(w_k)}{\nabla f(w_k)} + \left(\frac{1}{2} t_k^2 L\right) \E_{S_k}[\norm{m_k}^2] \notag\\ & \leq f(w_k) - t_k \norm{\nabla f(w_k)}^2 + \frac{1}{2} t_k^2 L \left[\norm{\nabla f(w_k)}^2 + (1/M) \sigma^2\right] \notag\\ \end{align} $$

Bounding the expectation of \(f(w_k) - f^\ast\) for an \(m\)-strongly convex function

If we assume that \(f\) is \(m\)-strongly convex, with a constant mini-batch size \(n_b\) (the size denoted \(M\) above) and fixed step size \(t\), we can follow the line of argument that was used for stochastic gradient descent to arrive at a similar inequality:

$$ \E_{1 \cdots (k-1)}\left[f(w_k) - f^\ast\right] \leq (1 - mt)^k \left[f(w_0) - f^\ast - \frac{L t}{2 m n_b} \sigma^2\right] + \frac{L t}{2 m n_b} \sigma^2 $$

(Royer, 2023, p. 24)

So the rate of convergence is the same, but with a narrower upper bound on the range of convergence due to the change from \(\sigma^2\) to \(\sigma^2 / n_b\).

The following discussion is paraphrased from Royer, 2023, p. 24. For a moment, refer to SGD as SGD1 (where the 1 refers to the fact that a sample of gradients of size one is used at each step) and refer to SGD with mini-batch as SGDMB. For comparison purposes, recall the result for SGD1 for an \(m\)-strongly convex function with constant step-size, but use as our step-size \(t / n_b\):

$$ E_{1 \cdots (k-1)}\left[f(w_k) - f^\ast \right] \leq \left(1 - \frac{mt}{n_b}\right)^k \left[f(w_0) - f^\ast - \frac{Lt}{2m n_b} \sigma^2\right] + \frac{Lt}{2m n_b} \sigma^2 $$

Thus for a particular upper bound on the range of convergence for SGDMB, SGD1 requires more steps to converge in expectation to within that same range. However, the per-step computational cost of SGD1 is less.
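A sketch of SGDMB: the SGD loop with the single sampled gradient replaced by an average over a mini-batch drawn with replacement (the problem and constants are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, M = 100, 10
X = rng.normal(size=(n, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)      # noisy targets this time

w = np.zeros(3)
t = 0.02
for k in range(5000):
    S = rng.integers(n, size=M)                # mini-batch, with replacement
    residual = X[S] @ w - y[S]
    w = w - t * (X[S].T @ residual) / M        # average of M sampled gradients

w_star = np.linalg.lstsq(X, y, rcond=None)[0]  # empirical risk minimizer
```

With noisy targets the iterates settle into a neighborhood of \(w^\ast\); per the bound above, the variance term governing that neighborhood is divided by the batch size relative to single-sample SGD.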

References

Bartle, Robert G. and Donald R. Sherbert. (2011). Introduction to real analysis (Fourth ed.). John Wiley & Sons, Inc.

Chen, Yudong. (2023). Lecture 9-10: Accelerated gradient descent [Lecture notes]

Edelman, Alan and Steven G. Johnson. (2023). Matrix Calculus (for Machine Learning and Beyond) [Lecture notes].

Fawzi, Hamza. (2023a). Topics in convex optimization, Lecture 4: Gradient method [Lecture notes].

Fawzi, Hamza. (2023b). Topics in convex optimization, Lecture 15: Newton's method [Lecture notes].

Fawzi, Hamza. (2023c). Topics in convex optimization, Lecture 16: Newton's method (continued) [Lecture notes].

Mitliagkas, Ioannis. (2018). IFT 6085 - Lecture 6: Nesterov’s Accelerated Gradient, Stochastic Gradient Descent [Lecture notes].

Murphy, Kevin P. (2022). Probabilistic Machine Learning: An Introduction. The MIT Press.

Royer, Clément W. (2023). Lecture notes on stochastic gradient methods [Lecture notes].

Royer, Clément W. (2024). Optimization for Machine Learning [Lecture notes].

How to cite this article

Wayman, Eric Alan. (2025). Gradient and stochastic gradient descent. Eric Alan Wayman's technical notes. https://ericwayman.net/notes/gradient-descent-and-sgd/

@misc{wayman2025gradient-descent-and-sgd,
  title={Gradient and stochastic gradient descent},
  author={Wayman, Eric Alan},
  journal={Eric Alan Wayman's technical notes},
  url={https://ericwayman.net/notes/gradient-descent-and-sgd/},
  year={2025}
}

© 2025-2026 Eric Alan Wayman. All rights reserved.