Suppose you have a function
fun[x_] := (Pause[.05*x]; x^2);
whose evaluation time you know increases with its argument, in this case linearly. Consider the following piece of code:
ClearAll["Global`*"];
CloseKernels[];
list = Range[1, 12];
listparallel1 = Partition[list, 3];
listparallel2 = {{1, 5, 12}, {2, 6, 11}, {3, 7, 10}, {4, 8, 9}};
f11 := (Table[fun[i], {i, listparallel1[[1]]}]);
f21 := (Table[fun[i], {i, listparallel1[[2]]}]);
f31 := (Table[fun[i], {i, listparallel1[[3]]}]);
f41 := (Table[fun[i], {i, listparallel1[[4]]}]);
f12 := (Table[fun[i], {i, listparallel2[[1]]}]);
f22 := (Table[fun[i], {i, listparallel2[[2]]}]);
f32 := (Table[fun[i], {i, listparallel2[[3]]}]);
f42 := (Table[fun[i], {i, listparallel2[[4]]}]);
Now compare timings:
LaunchKernels[4];
DistributeDefinitions[f11, f21, f31, f41, f12, f22, f32, f42];
res1 = Table[fun[i], {i, list}]; // AbsoluteTiming
(* 3.905648 *)
res2 = ParallelTable[fun[i], {i, list}]; // AbsoluteTiming
(* 1.670125 *)
AbsoluteTiming[
res3tmp = {ParallelSubmit[f11], ParallelSubmit[f21],
ParallelSubmit[f31], ParallelSubmit[f41]};
res3 = Flatten@WaitAll[res3tmp];]
(* 1.674126 *)
AbsoluteTiming[
res4tmp = {ParallelSubmit[f12], ParallelSubmit[f22],
ParallelSubmit[f32], ParallelSubmit[f42]};
res4 = Flatten@WaitAll[res4tmp];]
(* 1.068721 *)
We see that ParallelTable already does a reasonable job, but since it lacks our insight into the function's cost profile, the schedule can be improved with ParallelSubmit. Because the runtime grows with the value of each element, it is natural to partition the list as in listparallel2, and indeed the timing for res4 is noticeably better than for the ParallelTable version res2.
My question is: what is an elegant way to partition list into the form of listparallel2? Given a number of sublists n (here, n = 4), the first sublist should be filled with the first and last elements of list, the second sublist with the second and second-to-last elements, and so on.
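One way to build such a partition, sketched here under the assumption that Length[list] is divisible by n (the helper name zigzagPartition is my own): split the list into consecutive blocks of length n, interleave the blocks taken from the front with reversed blocks taken from the back, and transpose.

    zigzagPartition[list_, n_] :=
     Module[{rows = Partition[list, n], k, front, back},
      k = Length[rows];
      (* blocks taken from the front, in order *)
      front = rows[[;; Ceiling[k/2]]];
      (* blocks taken from the back, each reversed, in back-to-front order *)
      back = Reverse /@ Reverse[rows[[Ceiling[k/2] + 1 ;;]]];
      (* interleave front and back blocks, then transpose into n sublists *)
      Sort /@ Transpose[Riffle[front, back]]]

For the example above, zigzagPartition[Range[12], 4] reproduces listparallel2: the blocks {1,2,3,4}, {12,11,10,9}, {5,6,7,8} are transposed into {{1,5,12},{2,6,11},{3,7,10},{4,8,9}}, so each sublist pairs small and large arguments and the per-kernel workloads are balanced for a linearly growing cost.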
Alternatively, how can one determine the optimal distribution of jobs to parallel kernels?
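For the more general question, a standard heuristic is longest-processing-time-first (LPT) scheduling: estimate a cost for each job, sort the jobs by decreasing cost, and greedily assign each one to the currently least-loaded kernel. A minimal sketch (the name lptAssign and the use of the argument value as the cost estimate are my own assumptions, motivated by Pause[.05*x] above):

    lptAssign[costs_, n_] :=
     Module[{loads = ConstantArray[0., n], bins = ConstantArray[{}, n], k},
      Do[
       (* index of the kernel with the smallest accumulated load *)
       k = First@Ordering[loads, 1];
       AppendTo[bins[[k]], i];
       loads[[k]] += costs[[i]],
       {i, Reverse@Ordering[costs]}];  (* jobs in order of decreasing cost *)
      bins]

Here lptAssign[Range[12], 4] returns four lists of job indices whose total estimated costs are nearly equal; each list can then be handed to one ParallelSubmit call. LPT is not guaranteed optimal (the exact problem is NP-hard multiprocessor scheduling), but it is within a small constant factor of the optimum and handles cost profiles that a symmetric pairing scheme cannot.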
Update
Unfortunately, the first method in the answer by @unlikely leads to a kernel crash on my machine for sufficiently large problems (see this question). Since I am using Mathematica 10.0.1, I cannot use the second method in unlikely's answer, because RepeatedTiming and EchoFunction are not available there. So, even if it is not the optimal approach, I am again interested in a customized Partition-like function that brings list into the form of listparallel2.
RepeatedTiming and EchoFunction are just for benchmarking. You can remove all these calls and the following //Last. You can also use the sub-optimal partitioning as of my second answer. – unlikely Feb 25 '16 at 16:24