Gradient descent for neural networks with automatic differentiation

Neural network models

A neural network (Murphy, 2022, Chapter 13) consists of a function \(f\), a family of parameter spaces \(\{\Theta_i\}_{i=1}^I\), a family of data spaces \(\{\mc{X}_i\}_{i=1}^I\) (with \(\mc{X}_1 = \mc{X}\), the space of inputs), and an output space \(\mc{U}\), where \(f\) is formed by a composition of \(I\) functions, namely \(f = f_I \circ f_{I-1} \circ \cdots \circ f_2 \circ f_1\), where for all \(i \in \{1, \ldots, I - 1\}\), \(f_i: \mc{X}_i \times \Theta_i \to \mc{X}_{i+1}\), and where \(f_I: \mc{X}_I \times \Theta_I \to \mc{U}\). For notational convenience we define

$$ \begin{align} x_2 & = f_1(x_1, \theta_1) \notag\\ x_3 & = f_2(x_2, \theta_2) \notag\\ \vdots & \notag\\ x_{I-1} & = f_{I-2}(x_{I-2}, \theta_{I-2}) \notag\\ x_{I} & = f_{I-1}(x_{I-1}, \theta_{I-1}) \notag\\ \widehat{y} & = f_I(x_I, \theta_I) \end{align} $$

With the above definitions, we write the value of \(f\) as \(f(\theta_I, \theta_{I-1}, \ldots, \theta_1, x_1) = \widehat{y}\).
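As a concrete illustration, here is a minimal sketch of such a composition in code (the two layer functions, the nonlinearity, and the parameter shapes are hypothetical choices for illustration, not taken from the definition above):

```python
import jax.numpy as jnp

# Hypothetical two-layer network (I = 2): f = f_2 ∘ f_1.
def f1(x1, theta1):
    # f_1 : X_1 × Θ_1 → X_2, here an affine map followed by tanh.
    W1, b1 = theta1
    return jnp.tanh(W1 @ x1 + b1)

def f2(x2, theta2):
    # f_2 : X_2 × Θ_2 → U, here an affine map producing ŷ.
    W2, b2 = theta2
    return W2 @ x2 + b2

def f(theta2, theta1, x1):
    x2 = f1(x1, theta1)       # x_2 = f_1(x_1, θ_1)
    y_hat = f2(x2, theta2)    # ŷ   = f_2(x_2, θ_2)
    return y_hat
```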

Fitting neural network models

Fitting a neural network model can be accomplished by minimizing a loss function \(\mc{L}(f(\theta, x), x)\) with respect to the parameters \(\theta := (\theta_I, \ldots, \theta_1)\), where \(\mc{L}: \mc{U} \times \mc{X} \to \mb{R}_{+} \cup \{0\}\). Below we abbreviate the composite map \(\theta \mapsto \mc{L}(f(\theta, x), x)\) as \(\mc{L}(\theta, x)\).
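For instance (a made-up example, not prescribed by the definition above), if each data point carries a target \(y\), the loss could be the squared error between \(\widehat{y} = f(\theta, x)\) and \(y\):

```python
import jax.numpy as jnp

# A made-up example of a loss L : U × X → R_{>=0}: squared error against a
# target y carried by the data point.
def squared_error_loss(y_hat, y):
    return jnp.sum((y_hat - y) ** 2)
```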

One method for performing this minimization is the gradient method (see the "The gradient method" section of Gradient and stochastic gradient descent), where we iteratively define \(\theta^{(k+1)} := \theta^{(k)} - t_k \nabla \mc{L}(\theta^{(k)})\), and where hopefully \(\mc{L}(\theta^{(k+1)}) < \mc{L}(\theta^{(k)})\) (see that article for details about when this decrease can be guaranteed).
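A minimal sketch of this iteration, assuming a constant step size \(t_k = t\) and a function grad_L that returns \(\nabla \mc{L}(\theta^{(k)})\) (obtaining that gradient efficiently is the subject of the rest of this note):

```python
def gradient_descent(theta0, grad_L, t=1e-2, num_steps=1000):
    # Iterate θ^(k+1) := θ^(k) - t_k ∇L(θ^(k)) with constant step size t_k = t.
    theta = theta0
    for _ in range(num_steps):
        theta = theta - t * grad_L(theta)
    return theta
```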

We take the derivative of the loss function with respect to the parameters:

$$ \begin{align*} \frac{\partial \mc{L}(\theta, x)}{\partial \theta_I} & = \frac{\partial \mc{L}(\theta, x)}{\partial \widehat{y}} \frac{\partial f_I(x_I, \theta_I)}{\partial \theta_I} \\ \frac{\partial \mc{L}(\theta, x)}{\partial \theta_{I-1}} & = \frac{\partial \mc{L}(\theta, x)}{\partial \widehat{y}} \frac{\partial f_I(x_I, \theta_I)}{\partial x_I} \frac{\partial f_{I-1}(x_{I-1}, \theta_{I-1})}{\partial \theta_{I-1}} \\ \vdots\quad\quad & \quad\quad\vdots \\ \frac{\partial \mc{L}(\theta, x)}{\partial \theta_1} & = \frac{\partial \mc{L}(\theta, x)}{\partial \widehat{y}} \frac{\partial f_I(x_I, \theta_I)}{\partial x_I} \cdots \frac{\partial f_2(x_2, \theta_2)}{\partial x_2} \frac{\partial f_{1}(x_1, \theta_1)}{\partial \theta_1} \end{align*} $$

Some of the derivatives are reused in multiple calculations, namely

$$ \begin{align*} \frac{\partial \mc{L}(\theta, x)}{\partial \widehat{y}}, \quad \left\{\frac{\partial f_i(x_i, \theta_i)}{\partial x_i}\right\}_{i \in [I] \setminus \{1\}}. \end{align*} $$

If we store these as we calculate them, they can be re-used as we move through the above calculations.
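To make the reuse concrete, here is a small sketch for the hypothetical special case of linear layers \(f_i(x_i, \theta_i) = W_i x_i\) (chosen only so that \(\partial f_i / \partial x_i\) is simply \(W_i\)): the shared left factors are accumulated once, right to left, and each one is then available for the corresponding \(\partial \mc{L} / \partial \theta_i\).

```python
import jax.numpy as jnp

def shared_left_factors(dL_dyhat, Ws):
    # Ws = [W_1, ..., W_I] for linear layers f_i(x_i, θ_i) = W_i x_i, so ∂f_i/∂x_i = W_i.
    # Returns [∂L/∂ŷ, ∂L/∂ŷ·W_I, ∂L/∂ŷ·W_I·W_{I-1}, ..., ∂L/∂ŷ·W_I···W_2]:
    # the k-th entry is the common prefix multiplying ∂f_{I-k}/∂θ_{I-k}.
    prefixes = [dL_dyhat]
    for W in reversed(Ws[1:]):           # W_I, W_{I-1}, ..., W_2
        prefixes.append(prefixes[-1] @ W)
    return prefixes
```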

Using automatic differentiation to reduce computation time

Our goal is to evaluate derivatives (often Jacobians) at a particular point, i.e. to calculate \(Df(x)\), where \(f: \mb{R}^n \to \mb{R}^m\). We do this in a particular order so that calculations from earlier steps may be reused in later steps.

Note that using standard basis vectors, for \(f: \mathbb{R}^n \to \mathbb{R}^m\), we can calculate \(Df(x)\) by calculating either \(Df(x) e_i\) for each \(i \in [n]\), or by calculating \(e_i^T Df(x)\) for each \(i \in [m]\). It turns out that doing so using automatic differentiation (or "autodiff") is generally faster than computing each element of \(Df(x)\). Autodiff works by breaking a computation down into component parts and evaluating them numerically, in an efficient manner.

Let \(u\) be a vector. In the framework of automatic differentiation (see e.g. Margossian, 2019), "forward mode" autodiff calculates \(Df(x) u\) and "reverse mode" autodiff calculates \(u^T Df(x)\).
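In JAX, for example, these two modes are exposed directly as jax.jvp (forward mode, computing \(Df(x) u\)) and jax.vjp (reverse mode, computing \(u^T Df(x)\)). A minimal sketch; the function g and the evaluation point are made up purely for illustration:

```python
import jax
import jax.numpy as jnp

# A made-up function g : R^3 → R^2, used only to illustrate the two modes.
def g(x):
    return jnp.array([jnp.sin(x[0]) * x[1], x[1] * x[2] ** 2])

x = jnp.array([1.0, 2.0, 3.0])
u_in = jnp.array([1.0, 0.0, 0.0])    # a direction u in the input space R^3
u_out = jnp.array([0.0, 1.0])        # a direction u in the output space R^2

# Forward mode: Dg(x) u, without ever forming the matrix Dg(x).
_, dg_u = jax.jvp(g, (x,), (u_in,))

# Reverse mode: uᵀ Dg(x), again without forming Dg(x).
_, g_vjp = jax.vjp(g, x)
(u_dg,) = g_vjp(u_out)
```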

We denote a calculation performed by autodiff by \([\text{calc}]_{AD}\). According to Margossian (2019, p. 6-7), for forward mode, "using fused-multiply adds, denoted \(\text{OPS}\), as a metric for computational complexity," we have

$$ \text{OPS}([Df(x) u]_{AD}) \leq 2.5 \times \text{OPS}(f(x)) $$

that is, the cost of calculating \(Df(x) u\) via autodiff is within a small constant factor of the cost of simply evaluating \(f(x)\).

Similarly, for reverse mode the cost is again within a constant factor of evaluating \(f(x)\), though with a somewhat larger constant (Griewank and Walther, 2008; Margossian, 2019):

$$ \text{OPS}([u^T Df(x)]_{AD}) \leq 4 \times \text{OPS}(f(x)) $$

We use \(\mc{O}\) in place of \(\text{OPS}\) in what follows.

Generally speaking, \(\mc{O}(f(x)) < \mc{O}(D f(x))\), i.e. evaluating \(f\) is cheaper than forming its full Jacobian (if this weren't the case, we would gain nothing from using autodiff).

For a Jacobian with \(n\) columns, building the full matrix with forward-mode autodiff costs on the order of \(n \, \mc{O}(f(x))\): one forward pass per column.

For a Jacobian with \(m\) rows, building it with reverse-mode autodiff costs on the order of \(m \, \mc{O}(f(x))\): one reverse pass per row.

Generally we will use whichever mode gives the lower of these two costs.

Note that in neural network applications we usually have \(m \ll n\) (for a scalar-valued loss, \(m = 1\)), so reverse mode is the natural choice.
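In JAX this choice is the difference between jax.jacfwd (one forward pass per column) and jax.jacrev (one reverse pass per row). A sketch, reusing the same made-up \(g: \mb{R}^3 \to \mb{R}^2\) from the earlier snippet:

```python
import jax
import jax.numpy as jnp

def g(x):  # the same made-up g : R^3 → R^2 as in the earlier sketch
    return jnp.array([jnp.sin(x[0]) * x[1], x[1] * x[2] ** 2])

x = jnp.array([1.0, 2.0, 3.0])

J_fwd = jax.jacfwd(g)(x)   # built column by column: roughly n = 3 forward passes
J_rev = jax.jacrev(g)(x)   # built row by row: roughly m = 2 reverse passes
```

Here \(m = 2 < n = 3\), so reverse mode needs fewer passes; for a scalar-valued loss (\(m = 1\)) the advantage is greatest.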

Applying autodiff to the above calculations

First, we use reverse mode to calculate \(v^T := e_j^T (\partial \mc{L} / \partial \widehat{y})\), which has cost \(\mc{O}(\mc{L})\).

Next, we move on to the main calculations. We will calculate a single row, \(j\), of each of the Jacobians \(\partial \mc{L}(\theta, x) / \partial \theta_i\). To do this, we first perform the calculations in the following table. The table is read as follows: for each row, the quantity in the "Result" column is defined as equal to the result of the "Calculation" column. Rows in the table make use of quantities defined in previous rows.

$$ \begin{array}{ccc} \text{Result} & \text{Calculation} & \text{Cost} \\ \hline\\ u_I^T & v^T (\partial f_I(x_I, \theta_I) / \partial x_I) & \mc{O}(f_I) \\\\ u_{I-1}^T & u_I^T (\partial f_{I-1}(x_{I-1}, \theta_{I-1}) / \partial x_{I-1}) & \mc{O}(f_{I-1}) \\\\ \vdots & \vdots & \vdots \\\\ u_2^T & u_3^T (\partial f_2(x_2, \theta_2) / \partial x_2) & \mc{O}(f_2) \\ \end{array} $$

(Note that the cost bounds in the above table are somewhat loose, since each calculation differentiates \(f_i\) only with respect to a subset of its arguments rather than forming the full Jacobian.)

We then use these results to calculate the desired derivatives. These are again a series of pre-multiplications by row vectors, so we once again use autodiff.

Let \((\partial \mc{L} / \partial \theta_i)|_{\text{row}(j)}\) denote the \(j\)th row of the Jacobian \((\partial \mc{L} / \partial \theta_i)\).

$$ \begin{array}{ccc} \text{Result} & \text{Calculation} & \text{Cost} \\ \hline \\ (\partial \mc{L}(\theta, x) / \partial \theta_I) |_{\text{row}(j)} & v^T (\partial f_I(x_I, \theta_I) / \partial \theta_I) & \mc{O}(f_I) \\\\ (\partial \mc{L}(\theta, x) / \partial \theta_{I-1}) |_{\text{row}(j)} & u_I^T (\partial f_{I-1}(x_{I-1}, \theta_{I-1}) / \partial \theta_{I-1}) & \mc{O}(f_{I-1}) \\\\ \vdots & \vdots & \vdots \\\\ (\partial \mc{L}(\theta, x) / \partial \theta_1) |_{\text{row}(j)} & u_2^T (\partial f_1(x_1, \theta_1) / \partial \theta_1) & \mc{O}(f_1) \end{array} $$

(A similar remark about the looseness of the cost bounds applies to this table as well.)

Summing up the costs in the two tables above, the cost of calculating row \(j\) of all the Jacobians we need for one step of gradient descent is \(\mc{O}(\mc{L}) + \mc{O}(f_1) + 2 \sum_{i=2}^I \mc{O}(f_i)\). Repeating this for each of the \(m\) rows, computing all the Jacobians for one step of gradient descent has cost

$$ m\left(\mc{O}(\mc{L}) + \mc{O}(f_1) + 2 \sum_{i=2}^I \mc{O}(f_i)\right). $$
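The two tables can be read off almost directly as code. Below is a minimal sketch using one jax.vjp call per layer, for the common case of a scalar-valued loss (\(m = 1\)), so that the single row is the whole gradient. The layer functions, loss, and argument shapes are whatever the caller supplies; this illustrates the order of the calculations rather than a production implementation.

```python
import jax
import jax.numpy as jnp

def backward_pass(layers, thetas, loss, x1, y):
    # layers[i] is f_{i+1}(x, θ), thetas[i] is θ_{i+1}; loss(ŷ, y) is scalar (m = 1).
    # Forward pass: record x_1, ..., x_I so each f_i is linearized at the right point.
    xs = [x1]
    for f_i, theta_i in zip(layers, thetas):
        xs.append(f_i(xs[-1], theta_i))
    y_hat = xs.pop()                       # ŷ = f_I(x_I, θ_I); xs now holds x_1, ..., x_I

    # v = ∂L/∂ŷ (its only row, since the loss is scalar), cost O(L).
    loss_val, loss_vjp = jax.vjp(lambda yh: loss(yh, y), y_hat)
    (u,) = loss_vjp(jnp.ones_like(loss_val))

    # Reverse sweep (the two tables interleaved): one vjp call per layer yields both
    # the desired row u·∂f_i/∂θ_i and the carried vector u·∂f_i/∂x_i (the u_i of the
    # first table), each at cost on the order of O(f_i).
    grads = [None] * len(layers)
    for i in reversed(range(len(layers))):
        _, f_vjp = jax.vjp(layers[i], xs[i], thetas[i])
        u_next, grads[i] = f_vjp(u)
        u = u_next
    return grads                           # grads[i] = ∂L/∂θ for thetas[i]
```

For a scalar loss this computes the same quantities as applying reverse-mode autodiff (e.g. jax.grad) to the composite map; spelling it out layer by layer is what makes the cost accounting above visible.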

References

Griewank, Andreas and Andrea Walther. (2008). Evaluating derivatives: Principles and techniques of algorithmic differentiation. Society for Industrial and Applied Mathematics (SIAM). https://doi.org/10.1137/1.9780898717761

Margossian, Charles C. (2019). "A Review of Automatic Differentiation and its Efficient Implementation." arXiv.

Murphy, Kevin P. (2022). Probabilistic Machine Learning: An Introduction. The MIT Press.

How to cite this article

Wayman, Eric Alan. (2025). Gradient descent for neural networks with automatic differentiation. Eric Alan Wayman's technical notes. https://ericwayman.net/notes/neural-nets-autodiff/

@misc{wayman2025neural-nets-autodiff,
  title={Gradient descent for neural networks with automatic differentiation},
  author={Wayman, Eric Alan},
  journal={Eric Alan Wayman's technical notes},
  url={https://ericwayman.net/notes/neural-nets-autodiff/},
  year={2025}
}

© 2025-2026 Eric Alan Wayman. All rights reserved.