How should I iteratively refine a MixtureDistribution?

Question

I am trying to find a stopping time for a randomized simulation such that, say, 95% of the trials that will eventually succeed have already finished. I would like to do this by creating and then refining some representative distribution dist and then calling InverseCDF[dist, 0.95].

Some background

My simulation tests a variety of configurations for the same problem. Each configuration is essentially a set of parameters, and for a given instance of the problem each configuration is either mathematically plausible or not. Because of the complexity of the problem, I test each configuration with random simulation. When the simulation reaches a solution it terminates, but a simulation of a configuration that cannot be solved would run forever, so I kill it once it reaches a cutoff. Currently this kill point is a constant that I have chosen arbitrarily. I would like it to move dynamically so that it does not kill many simulations that would eventually finish, but also does not waste too much time on configurations that have already taken much longer than most successful configurations.

I'd like a distribution that represents the expected stopping time of a randomly selected configuration. Unfortunately, the distribution of stopping times for any given configuration is not normal, different configurations will have different distributions, and the same configuration in different instances of the problem will have a different distribution. This seems to mean that the distribution will need to be constructed on the fly during testing.

Normal approximations do not come close to workable answers. A KernelMixtureDistribution does produce good results; however, it does not seem easy to refine with new data.

My plan

What I intend to do:

Do a few trials on a configuration; if it is successful, create dist1 = KernelMixtureDistribution[len1]
Create a similar approximation dist2 for the next successful configuration
Merge these two distributions as MixtureDistribution[{1,1}, {dist1,dist2}]
For each new successful configuration, create an approximation and then mix the new distribution with the old one (weighting appropriately)

So:

Module[{stopT, cfg, failPoint, n = 0, dist, distNew, c = 0.95},
 Reap[Do[
    cfg = configs[[i]];
    stopT = Reap[Do[
        simulation[cfg, failPoint], {10}]][[2, 1]];
    If[Mean[stopT] < failPoint,  (* sim was successful *)
     n++;
     distNew = KernelMixtureDistribution[stopT];
     (* the first success has nothing to mix with yet *)
     dist = If[n == 1, distNew, 
       MixtureDistribution[{n - 1, 1}, {dist, distNew}]];
     failPoint = InverseCDF[dist, c]];
    Sow[{stopT, cfg}],
    {i, 1, Length[configs]}
    ]][[2, 1]]]

Mathematica does not simplify these distributions, so the result is a nested mess that takes forever to do anything with after just a few successful trials have been mixed in.

I considered creating an InterpolatingFunction of the CDF of the mixture. This worked well enough, but if I then use the function to define a ProbabilityDistribution, Mathematica cannot evaluate its CDF or inverse CDF. I cannot just use the InterpolatingFunction on its own because I will later need to mix this distribution with the distribution of the next successful configuration.

I think that I need to use these empirical distributions to get good results, so is there:

A way to simplify mixture distributions?
A way to approximate distributions (the way a SmoothKernelDistribution does) that can still be evaluated and mixed again?
Some functionality in Mathematica that I am overlooking?

I think you will need to give some data driven example to give us a better idea how to help you. — Andy Ross, Jul 24 '15 at 01:52
What Andy says. Additionally, you have the problem that you have a truncated distribution, i.e., you throw away all stopping values larger than your fixed waiting time. This will skew your distribution (check e.g. the second example on the TruncatedDistribution page). As an alternative, why don't you go for an EmpiricalDistribution of all the stopping times of all configurations? InverseCDF is defined for that — Sjoerd C. de Vries, Jul 24 '15 at 11:29
@AndyRoss What kind of information should I add to be more helpful? Actual data points that I generate? Or explicit code that shows what I am trying to do? — aschankler, Jul 24 '15 at 15:58
Part of the trouble I have is understanding what you mean by "configurations". If you could give an illustrative example of what you are attempting that would probably help. Data never hurts and short, clean, code always helps. — Andy Ross, Jul 24 '15 at 16:01
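
The EmpiricalDistribution suggestion from the comments above could be sketched roughly as follows. This is only an illustration: the stopping times are made-up numbers, and addTimes is a hypothetical helper, not something from the question.

```mathematica
(* Pool every successful stopping time into one flat list *)
times = {};

(* hypothetical helper: append a batch of new stopping times *)
addTimes[new_List] := (times = Join[times, new]);

addTimes[{3.2, 4.1, 5.0, 2.7}];  (* made-up times, first configuration *)
addTimes[{6.3, 4.8, 5.5}];       (* made-up times, second configuration *)

(* The kill point is the 95% quantile of everything seen so far *)
failPoint = Quantile[EmpiricalDistribution[times], 0.95]
```

Because the data stay in a single flat list, nothing nests as more configurations are added. The truncation caveat from the same comment still applies to whatever times are fed in.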

Andy Ross · Accepted Answer · 2015-08-02T00:44:07.613

Though this doesn't completely answer your question, it may be helpful in solving your problem. The issue is that you want to extend a KernelMixtureDistribution, which isn't a particularly efficient thing to do in the built-in framework. To solve this I've put together a sort of "online" KernelMixtureDistribution that lets you extend the data and add new kernel functions and bandwidths as you go.

Online KernelMixtureDistribution:

(* Helper functions to  convert numbers to lists of length 1 and 
   pack values for faster evaluation*)
to1d[x_?NumericQ] := Developer`ToPackedArray[{x}, Real]
to1d[x_List] := Developer`ToPackedArray[x, Real]


(* Construct an online kernel mixture distribution *)
kmd[data_List, bw_?NumericQ, ker_?DistributionParameterQ] := 
  With[{d = to1d[data]}, kmd[{Length[d]}, d, {ker}, {bw}]]


(* Extend the kernel mixture by giving new data *)
kmd[n_, d_, ker_, bw_][new_] := 
 With[{newd = to1d[new]}, 
  kmd[ReplacePart[n, -1 -> n[[-1]] + Length[newd]], Join[d, newd], 
   ker, bw]]


(* Extend with both data and a different bandwidth and kernel *)
kmd[n_, d_, ker_, bw_][new_, nbw_, nker_] := 
 With[{newd = to1d[new]}, 
  kmd[Append[n, Length[newd]], Join[d, newd], Append[ker, nker], 
   Append[bw, nbw]]]


(* PDFs *)
kmd /: PDF[k : kmd[{n_}, d_, {ker_}, {bw_}], x_?NumericQ] := 
  Mean[PDF[ker, (x - d)/bw]]/bw
kmd /: PDF[k : kmd[n_, d_, ker_, bw_], x_?NumericQ] :=
  Mean[Join @@ 
    MapThread[(1/#3 PDF[#1, #2/#3]) &, {ker, 
      Internal`PartitionRagged[x - d, n], bw}]]
kmd /: PDF[k_kmd, x_List] := PDF[k, #] & /@ x


(* CDFs *)
kmd /: CDF[k : kmd[{n_}, d_, {ker_}, {bw_}], x_?NumericQ] := 
  Mean[CDF[ker, (x - d)/bw]]
kmd /: CDF[k : kmd[n_, d_, ker_, bw_], x_?NumericQ] := 
  Mean[Join @@ (MapThread[#1[#2] &, {Thread[CDF[ker]], 
       Internal`PartitionRagged[x - d, n]/bw}])]
kmd /: CDF[k_kmd, x_List] := CDF[k, #] & /@ x

(* Quantiles *)
kmd /: Quantile[k : kmd[n_, d_, ker_, bw_], q_ /; 0 <= q <= 1] := 
  FindArgMin[(CDF[k, \[FormalX]] - q)^2, {\[FormalX], Quantile[d, q]},
     Method -> "PrincipalAxis", Evaluated -> False][[1]]
kmd /: Quantile[k_kmd, q_List] := Quantile[k, #] & /@ q


(* Formatting so the data doesn't try to display and unpack *)
Format[HoldPattern[kmd[n_, d_, ker_, bw_]], StandardForm] := kmd[n]

Examples:

Define a distribution with 6 data points, bandwidth .5 and a Gaussian kernel.

k = kmd[{-4, -3, 0, 1, 3, 6}, .5, NormalDistribution[]]
(* kmd[{6}] *)

Plot[PDF[k, x], {x, -6, 12}]

Add a single data point to the distribution at 4.5.

k = k[4.5];
Plot[PDF[k, x], {x, -6, 12}]

Add multiple points at one time.

k = k[{0, 6}];

Add four more points and switch to a different kernel and bandwidth for those points.

k = k[{9, 9, 9, 9}, 1, CauchyDistribution[0, 1]]
(* kmd[{9, 4}] *)

The PDF, CDF, and Quantile can be computed. It is fairly trivial to extend to other functions that work with distributions.

CDF[k, 5]

(* 0.633271 *)

Quantile[k, .95]

(* 9.42704 *)
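
As a rough sketch of what such an extension might look like, two more functions can be attached to kmd in the same upvalue style. This assumes the kmd definitions above have already been evaluated; the choice of InverseCDF and SurvivalFunction is just for illustration.

```mathematica
(* Assumes the kmd definitions above have already been evaluated *)

(* InverseCDF simply delegates to the Quantile definition *)
kmd /: InverseCDF[k_kmd, q_] := Quantile[k, q]

(* Probability of exceeding x, i.e. 1 - CDF *)
kmd /: SurvivalFunction[k_kmd, x_] := 1 - CDF[k, x]
```

With these in place, InverseCDF[k, .95] and SurvivalFunction[k, 5] should evaluate for the k built in the examples.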

This doesn't pass arguments by reference. In your case the number of points doesn't seem like a big issue, so you can afford to be a little wasteful making copies of the data. If it does become an issue, you could probably make clever use of Bag in the Internal context so that multiple copies of the data aren't made.

Also note that you use InverseCDF in your code, which is equivalent to Quantile for distributions.
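
One possible way to plug this into a loop like the one in the question is sketched below. It assumes the kmd definitions above have been evaluated. runTrials is a hypothetical stand-in for the real simulation, and the initial kill point, the fixed bandwidth, and the gamma-distributed times are all arbitrary assumptions made only so the sketch runs.

```mathematica
(* Hypothetical stand-in for the simulation: 10 stopping times for a
   configuration, each capped at the current kill point *)
runTrials[cfg_, failPoint_] := 
  Min[#, failPoint] & /@ RandomVariate[GammaDistribution[2, cfg], 10]

failPoint = 50.;   (* arbitrary initial kill point (assumption) *)
bandwidth = 0.5;   (* fixed kernel bandwidth (assumption) *)
dist = None;       (* no successful configuration seen yet *)

Do[
 stopT = runTrials[cfg, failPoint];
 If[Mean[stopT] < failPoint,   (* same success test as in the question *)
  (* first success creates the kmd, later successes extend it *)
  dist = If[dist === None, 
    kmd[stopT, bandwidth, NormalDistribution[]], 
    dist[stopT]];
  failPoint = Quantile[dist, 0.95]],
 {cfg, {1., 2., 3.}}]   (* made-up configurations *)
```

Each successful configuration only appends its data to the existing kmd, so there is no nested mixture to slow down the quantile computation.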

I've been experimenting with this for a bit and it does exactly what I need. It lets me slowly build up a distribution without adding huge overhead at each step or slowing down excessively with many data points — aschankler, Aug 06 '15 at 15:30
