Automatic differentiation methods in general have been the topic of past questions, but this is specifically about automatic differentiation of neural networks.
The motivation is an old but useful trick for solving ODEs and PDEs. For example, consider an ODE of the form $u'(x) = f(x)$, where $f(x)$ is a known function and the goal is to find $u(x)$. Let's assume $f(x) = 2x$ and $u(0) = 1$. We can then implement neural network layers for these functions as follows:
Clear[u, f, g]
g = NetChain[ (* just some ansatz for the auxiliary function *)
{LinearLayer[32], LogisticSigmoid, LinearLayer[32], LogisticSigmoid, LinearLayer[1]},
"Input" -> "Real"]
(* create u from the auxiliary function g *)
u = With[{u0 = 1}, FunctionLayer[#*g[#] + u0 &]]
(* function we want to match (right-hand side) *)
f = FunctionLayer[{2*#} &, "Input" -> "Real"]
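As a side note, the ansatz $u(x) = x\,g(x) + u_0$ satisfies the initial condition $u(0) = u_0 = 1$ by construction, whatever the weights in g happen to be. A quick optional sanity check (the name uCheck below is just illustrative, not part of the original code):

uCheck = NetInitialize[u];
uCheck[0.] (* should return (approximately) 1; the output is a length-1 vector because g ends in LinearLayer[1] *)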
and then fit $u$ by training a neural network function of the form:
(* finite difference approximation of du/dx *)
deriv[netU_] := With[{dx = 0.0001},
NetChain[
  {FunctionLayer[{#, # + dx} &],           (* given input x, compute x and x+dx *)
   NetMapOperator[netU],                   (* apply the u function to both inputs *)
   FunctionLayer[(#[[2]] - #[[1]])/dx &]}, (* take the difference and divide by dx *)
"Input" -> "Real"]
]
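As an illustration (not part of the question itself), deriv can be sanity-checked on a parameter-free net with a known derivative, e.g. $x^2$, whose derivative at $x = 1$ is 2; a forward difference should return roughly $2 + dx$:

(* check the finite-difference helper on x^2; the name dSquare is illustrative only *)
dSquare = deriv[FunctionLayer[#^2 &, "Input" -> "Real"]];
dSquare[1.] (* ≈ 2.0001, i.e. 2 + dx, reflecting the forward-difference error *)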
(* define a network where we minimize the difference between the learned du/dx and known f(x) *)
nn = NetGraph[
{deriv[u], f, MeanSquaredLossLayer[]},
{NetPort["Input"] -> {1, 2} -> 3}]
(* train on randomly generated points *)
trained = NetTrain[nn, RandomReal[{-1, 1}, 64], BatchSize -> 64]
(* extract the trained u function *)
trainedU = NetExtract[trained, {1, 2, "Net"}]
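For completeness, the exact solution of this toy problem is $u(x) = x^2 + 1$, so one way to judge the fit (my own check, not from the original code; it assumes the extraction path above returns the trained copy of u, whose output is a length-1 vector, hence First):

uFit[x_?NumericQ] := First[trainedU[x]]
Plot[{uFit[x], x^2 + 1}, {x, -1, 1}] (* trained u vs. exact solution *)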
where deriv computes the derivative of the neural network function implementing $u(x)$ by finite differences. This is fine in a simple 1D example, but it becomes inefficient as the dimensionality increases and raises the question of which finite-difference step size to use. A better approach would be to compute the derivative of the network defining $u(x)$ with respect to its input $x$ using so-called "autograd" methods.
MXNet (the underlying engine used by Mathematica) supports calculating the gradient of a network with respect to its inputs, as described in the MXNet tutorial on mxnet.autograd, so in principle this is doable. (A related question on how to calculate the Hessian of a neural network in Mathematica was posted two years ago, but at the time MXNet didn't support Hessians.) What's the best way to do this in Mathematica?
My understanding is that this is not quite the functionality of NetPortGradient, which computes the derivative at particular, specified data. That is, while it allows the following:
example = NetInitialize[u]
example[5, NetPortGradient["Input"]] (* computes the derivative at x=5 *)
it cannot be used to define new networks that can be used elsewhere:
FunctionLayer[example[#, NetPortGradient["Input"]] &] (*throws errors*)