FindMinimum doesn't increase step size when necessary

Question

Behavior remains unchanged through Mathematica version 13.0.0

I spent a long time looking for a minimal example that demonstrates this problem with FindMinimum. It normally shows up when fitting very large and complicated functions that cannot be posted here, but the following simplified case reproduces the behavior.

This behavior is somehow connected to the scale of the residuals. The essence of the problem seems to be that FindMinimum doesn't increase the step size when it should. The outcome is a ridiculously slow minimization in which FindMinimum takes such small steps toward the minimum that it virtually never reaches it. With a simple workaround, however, FindMinimum can be made to reach the minimum in a few steps. I hope this post will help many people working with FindMinimum and also motivate the developers to improve this function.

The model

Let us define a model function kM[f] which depends on parameters ζ[1], ν0[1], Γ[1], ζ[2], ν0[2], Γ[2], ζ[3]:

ν2f = 29979245800*^-9;
f2ν = 1/ν2f;
oscRe[ν_, j_] := ((ζ[j] - ζ[j + 1]) (ν0[j]^2 - ν^2) ν0[j]^2)/((ν0[j]^2 - ν^2)^2 + (Γ[j] ν)^2);
oscIm[ν_, j_] := ((ζ[j] - ζ[j + 1]) Γ[j] ν0[j]^2 ν)/((ν0[j]^2 - ν^2)^2 + (Γ[j] ν)^2);
ϵRe[ζRe_, ζIm_] := ((1 - ζRe) (1 + 2 ζRe) - 2 ζIm^2)/((1 - ζRe)^2 + ζIm^2);
ϵIm[ζRe_, ζIm_] := (3 ζIm)/((1 - ζRe)^2 + ζIm^2);
nIm[εRe_, εIm_] := Sqrt[(Sqrt[εRe^2 + εIm^2] - εRe)/2];
ζReM[f_] = Total[Table[oscRe[f2ν f, j], {j, 2}]];
ζImM[f_] = Total[Table[oscIm[f2ν f, j], {j, 2}]];
εReM[f_] = ϵRe[ζReM[f], ζImM[f]];
εImM[f_] = ϵIm[ζReM[f], ζImM[f]];
kM[f_] = nIm[εReM[f], εImM[f]];

Data, residual vector and initial guess

data1 = Import[
   "https://raw.githubusercontent.com/AlexeyPopkov/FindMinimum/master/data1.m"];
(* Uncomment the next line if you want higher speed: the overall picture won't change! *)
(* data1 = data1[[;; ;; 25]]; *)
ζ[1] = 0.2346231895551235`30 (* fixed parameter *)
residualVect = Table[data1[[i, 2]] - kM[ν2f data1[[i, 1]]], {i, Length[data1]}];
init = {{ν0[1], 665.26948287453179141807156981442293622784`30.}, {Γ[1], 263.01144355432978110582137343710588672054`30.}, 
        {ζ[2], 0.2107041849286285525374148169622725933320629807066196877343`30.}, {ν0[2], 750}, {Γ[2], 100}, {ζ[3], 0.2107041849286285525374148169622725933320629807066196877343`30.}};

Note that all the parameters have precision 30 or higher as well as data1:

Precision[data1]

An auxiliary function

This function passes initial guess and options to FindMinimum, monitors the process of minimization and returns statistics and achieved minimum:

findMinimum[init_, opts__] := 
 Module[{startTime}, steps = 0; PrintTemporary[Dynamic[steps]];
  startTime = AbsoluteTime[];
  min = FindMinimum[Null, init, opts]; (* the objective is supplied through the "Residual" method option *)
  {DateString[AbsoluteTime[] - startTime, {"Hour", ":", "Minute", ":", "Second"}], steps, 
   Block[{ζ, ν0, Γ}, Set @@@ min[[2]]; Total[residualVect^2]], 
   NumberForm[min[[2]], 5]}]

Note that the variable min is not scoped by Module on purpose.

Testing

All the timings here are for Mathematica 8.0.4. With versions from 9 to 10.4.1 these timings are more than 15 times slower on the same machine as compared to version 8.0.4 (the bug is fixed in version 11.0.0). You can uncomment the commented code line in the "Data, residual vector and initial guess" section in order to get higher speed: the overall picture won't change, but in this case you should also pay attention to the achieved minimum in every case - it will be telling!

The default setup seems to be working well in this simplified example:

findMinimum[init, MaxIterations -> 500, WorkingPrecision -> 20, PrecisionGoal -> 3, 
 StepMonitor :> ++steps, Method -> {"LevenbergMarquardt", "Residual" -> residualVect}]

{"00:02:39", 220, 0.00405321003823167,
{ν0[1]->406.18, Γ[1]->346.16, ζ[2]->0.22879, ν0[2]->666.41, Γ[2]->239.54, ζ[3]->0.20278}}

But what if we multiply residualVect by 10:

findMinimum[init, MaxIterations -> 500, WorkingPrecision -> 20, PrecisionGoal -> 3, 
 StepMonitor :> ++steps, Method -> {"LevenbergMarquardt", "Residual" -> 10 residualVect}]

{"00:00:43", 51, 0.00405321003823167, 
{ν0[1]->406.18, Γ[1]->346.16, ζ[2]->0.22879, ν0[2]->666.41, Γ[2]->239.54, ζ[3]->0.20278}}

We have got the same minimum in 51 steps instead of 220 steps!

And what if we divide residualVect by 2:

findMinimum[init, MaxIterations -> 2000, WorkingPrecision -> 20, PrecisionGoal -> 3, 
 StepMonitor :> ++steps, Method -> {"LevenbergMarquardt", "Residual" -> residualVect/2}]

{"00:12:37", 1005, 0.00405321003823167, 
{ν0[1]->406.18, Γ[1]->346.16, ζ[2]->0.22879, ν0[2]->666.41, Γ[2]->239.54, ζ[3]->0.20278}}

Now it takes 1005 steps to reach the same minimum! If we divide residualVect by 20, even 3000 steps are not enough to reach this minimum (I have tried)!

The workaround

If we restrict FindMinimum to 100 iterations and then feed the results (saved in the global variable min) to the next FindMinimum we get the minimum in a reasonable time even if we divide residualVect by 20:

Quiet@findMinimum[init, MaxIterations -> 100, WorkingPrecision -> 20, PrecisionGoal -> 3, 
   StepMonitor :> ++steps, 
   Method -> {"LevenbergMarquardt", "Residual" -> residualVect/20}];
findMinimum[List @@@ min[[2]], MaxIterations -> 100, WorkingPrecision -> 20, 
 PrecisionGoal -> 3, StepMonitor :> ++steps, 
 Method -> {"LevenbergMarquardt", "Residual" -> residualVect/20}]

{"00:00:26", 20, 0.00405321003823471, 
{ν0[1]->406.18, Γ[1]->346.16, ζ[2]->0.22879, ν0[2]->666.41, Γ[2]->239.54, ζ[3]->0.20278}}

Now we have got the minimum in 100 + 20 = 120 steps!
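
The two-stage restart above can be wrapped in a loop. The following is only a sketch under stated assumptions: restartedFindMinimum is a hypothetical helper that is not part of the original code, it is demonstrated on the Rosenbrock residual in the placeholder variables u and v rather than on residualVect, and it assumes that restarting from the last iterate keeps helping when more than two stages are used.

```mathematica
(* Hypothetical helper: run FindMinimum in short bursts, restarting each
   burst from the point where the previous one stopped. *)
restartedFindMinimum[residual_, vars_List, start_List, burst_Integer,
  restarts_Integer, opts___] :=
 Module[{guess = start, result},
  Do[
   (* With injects the evaluated {variable, value} pairs, because
      FindMinimum holds its arguments *)
   result = With[{spec = Transpose[{vars, guess}]},
     Quiet[
      FindMinimum[Null, spec, MaxIterations -> burst, opts,
       Method -> {"LevenbergMarquardt", "Residual" -> residual}],
      {FindMinimum::cvmit}]];
   (* the end point of this burst becomes the next starting point *)
   guess = vars /. result[[2]],
   {restarts}];
  result]

(* Toy usage: Rosenbrock residual, three bursts of 20 iterations each *)
toy = restartedFindMinimum[{10 (u - v^2), 1 - v}, {u, v}, {1., -1.2}, 20, 3]
```

For the problem in this post the call would presumably look like restartedFindMinimum[residualVect/20, init[[All, 1]], init[[All, 2]], 100, 2, WorkingPrecision -> 20, PrecisionGoal -> 3]. Only the "iteration limit reached" message is silenced, so other warnings remain visible.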

The question

Why does this happen? Is my hypothesis correct that FindMinimum for some reason can't increase the step size? Is it a bug? And most importantly: is there a way to overcome this problem and always get the result in as few steps as possible?

Reported to the support as [CASE:3967239]. — Alexey Popkov, Nov 01 '17 at 10:20

Henrik Schumacher · Answer 1 · 2017-10-24T10:52:46.967

I do not really know what the issue is with the choice of step sizes internally to FindMinimum. What I can say is that the Hessian of the objective (sum of squares of residualVect) is pretty ill-conditioned (at least two eigenvalues at one minimum (one that I compute below) are smaller than 10^-12). But that is independent of the scaling. However, Levenberg-Marquardt tries to regularize (an approximation to) the Hessian by adding an $\varepsilon$ (or several small and positive parameters) to the diagonal to make it better conditioned. It is not so easy to guess a suitable $\varepsilon$ and maybe, the heuristic in use is scale-dependent...
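
One way to see how such a scale dependence could arise, as a sketch only: the textbook Levenberg-Marquardt step (which is not necessarily what FindMinimum implements internally) for a residual vector $r$ with Jacobian $J$ and damping parameter $\varepsilon$ solves

$$(J^\top J + \varepsilon I)\,\delta = -J^\top r.$$

Replacing $r$ by $c\,r$ replaces $J$ by $c\,J$, so the same equation becomes

$$(c^2 J^\top J + \varepsilon I)\,\delta = -c^2 J^\top r \quad\Longleftrightarrow\quad \Bigl(J^\top J + \frac{\varepsilon}{c^2} I\Bigr)\,\delta = -J^\top r.$$

For a fixed $\varepsilon$, multiplying the residual by 10 therefore acts like a 100 times smaller damping (longer steps), while dividing it by 2 acts like a 4 times larger damping (shorter steps). This matches the direction of the trend reported in the question, but only under the assumption that the rule updating $\varepsilon$ does not compensate for the scale.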

I suggest another approach: since the number of variables is small, Newton's method with the Moore-Penrose pseudoinverse also works pretty well; the only issue is dealing with the huge amount of data in data1. I do not know why you aim at such high precision; using machine precision (enforced by Compile), this can look like this:

Quiet[Block[{xx, x, t},
   xx = Table[x[[i]], {i, 1, Length[init[[All, 1]]]}];
   cf = With[{code = 
       N[t[[2]] - kM[t[[1]] \[Nu]2f] /. Thread[init[[All, 1]] -> xx]]},
     Compile[{{x, _Real, 1}, {t, _Real, 1}},
      code,
      RuntimeAttributes -> {Listable},
      Parallelization -> True
      ]
     ];
   cDf = With[{code = N[D[t[[2]] - kM[t[[1]] \[Nu]2f] /. Thread[init[[All, 1]] -> xx], {xx, 1}]]},
     Compile[{{x, _Real, 1}, {t, _Real, 1}},
      code,
      RuntimeAttributes -> {Listable},
      Parallelization -> True
      ]
     ];
   cDDf = 
    With[{code =  N[D[t[[2]] - kM[t[[1]] \[Nu]2f] /. Thread[init[[All, 1]] -> xx], {xx, 2}]]},
     Compile[{{x, _Real, 1}, {t, _Real, 1}},
      code,
      RuntimeAttributes -> {Listable},
      Parallelization -> True
      ]
     ]
   ]];
F = x \[Function] With[{y = cf[x, data1]}, 1/2 y.y];
DF = x \[Function] With[{y = cf[x, data1], Dy = cDf[x, data1]}, y.Dy];
DDF = x \[Function] With[{y = cf[x, data1], Dy = cDf[x, data1], DDy = cDDf[x, data1]}, Dy\[Transpose].Dy + y.DDy];
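
In formulas (a restatement of the code above, writing $y(x)$ for the vector of residuals returned by cf and $J(x)$ for its Jacobian returned by cDf):

$$F(x) = \tfrac12\, y(x)^\top y(x), \qquad DF(x) = J(x)^\top y(x), \qquad DDF(x) = J(x)^\top J(x) + \sum_i y_i(x)\, \nabla^2 y_i(x).$$

The second-derivative term $\sum_i y_i \nabla^2 y_i$ is the part that Gauss-Newton type methods such as Levenberg-Marquardt leave out.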

Here, F is the objective (the fitting error), while DF and DDF are its derivatives. Newton's method with pseudoinverse is now quickly written down:

x = init[[All, 2]];
tol = 1. 10^-8;
step = 0;
residual = Norm[DF[x]];
time = AbsoluteTiming[While[residual > tol,
     ++step;
     x = x - PseudoInverse[DDF[x]].DF[x];
     residual = Norm[DF[x]];
     ]][[1]];
sol = {"MinValue" -> F[x], "Residual" -> residual, "Steps" -> step, "Timing" -> time, "Solution" -> Thread[init[[All, 1]] -> x]}

(* {"MinValue" -> 0.00533939, "Residual" -> 4.97386*10^-14, 
    "Steps" -> 6, "Timing" -> 0.080098,
    "Solution" -> {\[Nu]0[1] -> 662.896, \[CapitalGamma][1] ->  262.585, 
    \[Zeta][2] -> 0.205823, \[Nu]0[2] -> 759.038, 
    \[CapitalGamma][2] -> 164.926, \[Zeta][3] -> 0.205823}} *)
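
To illustrate why the pseudoinverse is used instead of a plain inverse, here is a minimal sketch on a made-up toy problem (the symbols a, b, res, obj, grad, hess and newtonStep are illustrative and not part of the code above). The residual depends only on a + b, so the Hessian of the objective is singular everywhere; this loosely mimics a fit in which some parameter combinations are not determined by the data.

```mathematica
(* Toy problem: the residual depends only on a + b, so the Hessian of the
   objective is singular everywhere and Inverse would fail *)
vars = {a, b};
res = {a + b - 1, (a + b)^2 - 1};
obj = res.res/2;
grad = D[obj, {vars, 1}];  (* gradient of the objective *)
hess = D[obj, {vars, 2}];  (* Hessian of the objective *)

(* One Newton step with the Moore-Penrose pseudoinverse *)
newtonStep[p_] := With[{rule = Thread[vars -> p]},
   p - PseudoInverse[hess /. rule].(grad /. rule)];

(* Iterate until successive points agree to 10^-10, at most 50 steps *)
sol = FixedPoint[newtonStep, {2., 0.5}, 50,
   SameTest -> (Norm[#1 - #2] < 10^-10 &)]
```

Each step moves only along the direction in which the objective actually changes, here (1, 1), so the iteration should settle at a point with a + b = 1 while leaving the undetermined combination a - b at its starting value.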

PS.: I know that this is a toy example, but this can still work when the number of variables is in the hundreds or thousands if an SVD is used. For very large numbers of parameters, the pseudoinverse is probably too slow.

The "MinValue" -> 0.00533939 is much larger than the value I obtained with Levenberg-Marquardt at high precision: 0.0040532/2 = 0.0020266. In fact, in a more complicated setup, Levenberg-Marquardt could not reach such a minimum without high precision, which is why I was forced to increase the precision... — Alexey Popkov, Oct 24 '17 at 11:43
The problem (in a more complicated setup) is loss of precision when calculating derivatives. As complexity increases, I was forced to set larger and larger values of WorkingPrecision in order to get the actual minimum. — Alexey Popkov, Oct 24 '17 at 14:42
Note that FindMinimum[residualVect.residualVect, init, Method -> "Newton"] gives the correct minimum for this simplified problem even with MachinePrecision (but takes a lot of time and memory seemingly for obtaining symbolic derivatives). — Alexey Popkov, Oct 24 '17 at 14:43