
I want to fit data in the hope of obtaining an estimate for other data points (i.e. a good guess for the last 4 variables as a function of the first).

Example data:

data={{3.38, 1.028877662, 2.009398505, 2.067322478, 4.214191194}, {3.4, 
  1.030082372, 1.995543604, 2.105894366, 4.234656059}, {3.5, 
  1.035994874, 1.992385102, 2.200815333, 4.282937808}, {3.57, 
  1.036731784, 1.986961442, 2.224357922, 4.307824219}, {3.6, 
  1.036978228, 1.985081926, 2.231988058, 4.315914728}, {3.62, 
  1.037229736, 1.983076125, 2.239730469, 4.323988127}, {3.78, 
  1.038461995, 1.969909372, 2.283628754, 4.374960036}, {3.8, 
  1.038741973, 1.96716995, 2.291334094, 4.384253554}}

This data was obtained using a computationally expensive method. I want to obtain good guesses for more data points located in between (interpolation) and slightly outside this region (extrapolation). This guess will then be used in the same computationally expensive method, where a better initial guess results in a cheaper computation of the exact answer for these additional points.

Thus, I am seeking a fit that will hopefully be predictive within the current data range and slightly outside it.

There is no general expectation for the shape of the functions, except that each function is expected to be continuous, monotonic (increasing or decreasing), and probably once differentiable.

(The data is exact, so ideally the fit passes exactly through the provided data points. Also, I do not care about accuracy for points farther than, say, 0.2 away from the current data points. The purpose is not to guess the true underlying function, which will no doubt be too complicated and may not even be well defined for all real values.)


I tried using NetChain, which did not produce great results (probably due to my ignorance of how to use it). I had expected that as I added layers it would eventually at least overfit and pass through the data points, but in fact I could not produce any fit that actually did. (Of course, this approach does not use the monotonicity constraint, and something else might be better.)


Here are example fits that I would consider decent (although the first function is not monotonic and would be a slightly better fit if it were):

[Plot: the four data columns with example order-7 polynomial fits]

These were obtained by fitting polynomials of order 7 and adding points at the end to get better asymptotic behavior. I would be very happy if I could reproduce a similar fit without having to fine-tune the polynomial order. (The asymptotic behavior is not actually important; I only care about points less than 0.2 away from the current data points, so I guess I did that mostly for aesthetic reasons.)
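
For concreteness, a minimal sketch of the kind of order-7 polynomial fit I mean (my extra end points are omitted; with 8 points and 8 basis functions, Fit reproduces the data essentially exactly):

xdata = data[[All, 1]];

(* one order-7 polynomial per column; with 8 points and 8 basis
   functions this is an (essentially) exact interpolant *)
polyFits = Fit[Transpose[{xdata, #}], Table[x^k, {k, 0, 7}], x] & /@
   Transpose[data[[All, 2 ;;]]];

Plot[Evaluate[polyFits], {x, 3.2, 4}]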


P.S. The data as posted here is accurate to at most 9 digits, but that is just the truncation I used to keep the post readable. I don't think the error introduced by this rounding will pose any problem for finding a fit, but if it does for some method, I can provide the high-precision data.


Also, I used the word "fit" in the title, but perhaps "interpolation" (and slight extrapolation) is more accurate. The resulting function should pass through the data points (or very close to them).
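
As a baseline along these lines, the built-in Interpolation already passes exactly through the points and will extrapolate slightly outside the range (with a warning), though it does not enforce monotonicity:

(* default cubic interpolation: exact at the data points,
   but not guaranteed monotone between them *)
ifuns = Interpolation[Transpose[{data[[All, 1]], #}]] & /@
   Transpose[data[[All, 2 ;;]]];

(* evaluate inside (3.7) and slightly outside (3.85) the data range;
   extrapolation emits InterpolatingFunction::dmval, silenced here *)
Quiet[Through[ifuns[#]] & /@ {3.7, 3.85}, InterpolatingFunction::dmval]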


Context:

The first parameter defines a family of non-convex optimization problems. The other parameters are parameters within that non-convex optimization problem over which an objective is optimized. The point in parameter space (i.e. the last 4 variables) gives the point that minimizes the last parameter under the constraint that some objective function is positive.

Monotonicity in the last variable (as a function of the first) is mathematically guaranteed. Monotonicity in the other variables is simply observed.

Kvothe
  • I guess I don't understand how you can guess FOUR variables as a function of ONE if you don't have a functional form in mind. Wouldn't you have an infinite number of possibilities? – MarcoB Jul 20 '23 at 19:17
  • I don't see how the four variables as a function of one is important; you can see it as 4 independent fits if you want. Of course, I agree that there is an infinite space of possible functions going through these points; I'm hoping to somehow get one that is predictive. This might be mathematically difficult, but the human mind is quite good at it (just drawing a nice-looking curve between these points). I believe that, given enough points, neural networks are also often quite good at it, although perhaps only if there is more data available. – Kvothe Jul 20 '23 at 19:24
  • Of course, effectively I have a functional form in mind in some complicated, intuitive sense. It should be reasonably smooth. The derivative should not change needlessly unless there is data indicating it should.

    But the space is definitely not as simple as a low-order polynomial or a log or something. Some combination of multiple logs etc. could perhaps work, but I would like to avoid spending a lot of time on a case-by-case basis looking for a function that resembles my data. I'm hoping to find an automated process for this.

    – Kvothe Jul 20 '23 at 19:28
  • I guess the simplest thing I could do is a linear fit based on the closest two points. But I believe the human mind can do slightly better. It can see patterns involving more nearby points. I'm just hoping to reproduce this. – Kvothe Jul 20 '23 at 19:30
  • Including a plot of the 4 pairs of data would be helpful. – JimB Jul 20 '23 at 20:01
  • The issue is your unrealistic expectations about what kind of fit is possible given just the data. As such, you should seek statistical help rather than Mathematica help. You've got just 8 data points per regression, and even if one could assume that the predictor variable is without error, all 4 response variables are highly correlated with each other. In short, a more complex error structure is indicated. But, again, with only 8 data points for each regression, one is severely restricted as to what is possible and reasonable. – JimB Jul 31 '23 at 21:02
  • It might be important to note that I am not hoping, and definitely not expecting, to understand the nature of the curve for all of R. I am trying to make an educated guess in the vicinity of the current data points, say always within a distance of 0.2 (on the x axis, in the units of the given data). Also, I should say again, since you mention an "error structure": the data is exact (up to, say, 30 digits). – Kvothe Aug 01 '23 at 13:15
  • @JimB, I added plots of the data including examples of fits that would be considered "good". – Kvothe Aug 01 '23 at 13:16
  • A note to myself since I don't have time to test it now: ResourceFunction["MonotonicInterpolation"] seems like it could be helpful (a sketch based on this appears after the comments). – Kvothe Aug 01 '23 at 13:18
  • I'm not trying to be facetious (although I suspect it will still look that way), but since you now use the phrase "perhaps interpolation (and slight extrapolation)", and therefore cannot and should not ascribe any justification to a statistical or automated procedure but need a data manipulation exercise, just draw by hand the curves that meet what you know about the data generation process. Clearly you believe (and I'm not saying you're wrong) that you have subject matter knowledge that tells you when the "fit" is adequate. – JimB Aug 01 '23 at 17:01
  • Well, drawing by hand (or playing around with polynomial fits) obviously works, but it is also infeasible; the point is to find some automated process that replicates it. Basically, the intuition I have for this curve, which is the same as the one I have for any curve that has not been constructed by hand to have weird behavior, is that it will continue on similarly to its previous behavior unless something weird happens (if it does, that is obviously bad for our prediction, but potentially interesting). – Kvothe Aug 01 '23 at 17:11
  • In practice it is clear that there is a better guess than just a random one. There are strategies that will work for generic situations. Perhaps the best one is just a linear interpolation between the nearest points. – Kvothe Aug 01 '23 at 17:14
  • @Kvothe, there is no way of telling Mathematica (or any other program) about your intuition. You either have to give precise and definite specifications and requirements, or do things by hand with your intuition. – Domen Aug 01 '23 at 17:14
  • @Domen, right, so I am trying to find a mathematical procedure that matches my intuition (which, by the way, is not specific to this problem, but concerns the behavior of generic, naturally occurring curves). I don't believe this is an impossible task; there are clearly better and worse procedures, and whether a procedure is good enough is testable by computing additional points. There is probably a formulation in terms of minimizing some Shannon entropy. In the end, the intuition is basically nothing more than Occam's razor. – Kvothe Aug 01 '23 at 17:18
  • I suspect the problem would be a pretty standard machine learning problem, except that I have far too few points to, for example, split my data into a training set and a verification set. – Kvothe Aug 01 '23 at 17:20
  • @Domen, it might be useful to think of the Moré-Thuente line search. This also attempts interpolation/extrapolation on a function that is unknown in principle. On a "generic" function it behaves better than, for example, a linear interpolation. (Of course, fine-tuned counterexamples always exist; in this latter example, when the function really is linear.) – Kvothe Aug 01 '23 at 17:28
  • I'd suggest the cubic monotonic interpolation resource function, of which a variant may be found here: https://mathematica.stackexchange.com/a/14705 – Michael E2 Aug 01 '23 at 18:13
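
A minimal sketch of the monotone-interpolation suggestion from the comments above (this assumes ResourceFunction["MonotonicInterpolation"] accepts {x, y} pairs like Interpolation, which has not been verified here):

(* one monotone interpolant per column, assuming an
   Interpolation-like calling convention *)
mono = ResourceFunction["MonotonicInterpolation"][
     Transpose[{data[[All, 1]], #}]] & /@ Transpose[data[[All, 2 ;;]]];

Through[mono[3.7]]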

1 Answer

$Version

(* "13.3.0 for Mac OS X ARM (64-bit) (June 3, 2023)" *)

Clear["Global`*"]

data = {{3.38, 1.028877662, 2.009398505, 2.067322478, 
    4.214191194}, {3.4, 1.030082372, 1.995543604, 2.105894366, 
    4.234656059}, {3.5, 1.035994874, 1.992385102, 2.200815333, 
    4.282937808}, {3.57, 1.036731784, 1.986961442, 2.224357922, 
    4.307824219}, {3.6, 1.036978228, 1.985081926, 2.231988058, 
    4.315914728}, {3.62, 1.037229736, 1.983076125, 2.239730469, 
    4.323988127}, {3.78, 1.038461995, 1.969909372, 2.283628754, 
    4.374960036}, {3.8, 1.038741973, 1.96716995, 2.291334094, 
    4.384253554}};

(* x values: the first column *)
xdata = data[[All, 1]];

(* pair each of the remaining four columns with the x values *)
plotData = Transpose[{xdata, #}] & /@ Transpose[data[[All, 2 ;;]]];

(* search for a simple symbolic form for each column *)
funcs = FindFormula[#, x, SpecificityGoal -> 1] & /@ plotData

(* {1.03539, 1.98619, 0.616559 x, 2.93046 + 0.383771 x} *)

Show[
 Plot[Evaluate[Tooltip /@ funcs], {x, 3.2, 4},
  PlotLegends -> Placed[
    LineLegend[StringForm["col. ``", #] & /@ Range[2, 5],
     LegendLayout -> "Row"],
    Below]],
 ListPlot[plotData],
 Frame -> True]

[Plot: FindFormula fits (SpecificityGoal -> 1) over the four data columns]
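
The fitted formulas can then be evaluated at new values of x to produce the initial guesses, for example interpolating at 3.7 and extrapolating slightly to 3.9 (illustrative points, not from the question):

(* guesses for the last four variables at new values of the first *)
Table[funcs /. x -> x0, {x0, {3.7, 3.9}}]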

EDIT: If you want a "better" fit, increase the SpecificityGoal; however, you may be overfitting the data, particularly since there are so few data points.

Manipulate[
 funcs = FindFormula[#, x, SpecificityGoal -> spec] & /@ plotData;
 Show[
  Plot[
   Evaluate[Tooltip /@ funcs], {x, 3.2, 4},
   PlotLegends -> Placed[
     LineLegend[StringForm["col. ``", #] & /@ Range[2, 5],
      LegendLayout -> "Row"], Below]],
  ListPlot[plotData], Frame -> True],
 {{spec, 1, "SpecificityGoal"}, {1, 2, 5, 10}},
 SynchronousUpdating -> False,
 TrackedSymbols :> {spec}]

[Manipulate output: FindFormula fits at the selected SpecificityGoal]

Bob Hanlon
  • Thanks for the suggestion. Unfortunately, these fits are terrible: they are basically constant or linear, and most do not even come close to more than a few of the points. This can be seen easily if you shift the different components to lie close together (add the line data = (# - {0, 0, 1, 1.2, 3.25}) & /@ data, for example, to see this). – Kvothe Jul 31 '23 at 16:06
  • Thanks, this now works much better with the higher SpecificityGoal. I would still hope to get an answer that uses the monotonicity of the data, but this gives a quick, decent fit. – Kvothe Jul 31 '23 at 17:46
  • The fits certainly reproduce the data (especially with spec = 5), but that requires the estimation of 4, 2, 5, and 4 coefficients for the regressions, respectively. Despite the OP's claim that the observations are exact, 4 or 5 coefficients for just 8 data points is extreme at best. (Using spec = 10 for one of the regressions ends up with 9 coefficients!) As mentioned in my comment above, the OP's expectations are not realistic. But your answer with spec = 1 is good and the best that one can do. – JimB Jul 31 '23 at 21:10