Mma 10: Half the parallel power (Macs)?

Question

Here is a comparison of the parallel kernels launched by Mathematica under v9 and v10, on the same current (2014) "R2-D2" Mac Pro ...

[ Update: Valerio has commented that the same issue arises on the Macbook Air.]

Under v9.01

$ProcessorCount

12

Issuing:

 LaunchKernels[]

... launches 12 kernels, and actually uses them ... notice that the ParallelTable is 12 times the speed of Table[] for this construct:

In[5]:= Table[Pause[1]; f[i], {i, 12}] // AbsoluteTiming

Out[5]= {12.003106, {f[1], f[2], f[3], f[4], f[5], f[6], f[7], f[8], f[9], f[10], f[11], f[12]}}

In[6]:= ParallelTable[Pause[1]; f[i], {i, 12}] // AbsoluteTiming

Out[6]= {1.010648, {f[1], f[2], f[3], f[4], f[5], f[6], f[7], f[8], f[9], f[10], f[11], f[12]}}

So, to perform the same operation, the parallel result under v9 is 12 times the speed of the single kernel result.

Under v10 -- half my potential processing power has gone

$ProcessorCount

6

... down from 12 - even though I am running on the identical machine. Now, I know that my Mac Pro actually has 6 physical cores, and each runs 2 threads ... and under v9, that yielded 12 kernels ... but under v10, it is only yielding 6 kernels ... ON THE SAME MACHINE. And this has real effects ... it effectively reduces by 50% the maximum potential power of my Mac:

 LaunchKernels[]

... launches 6 kernels (not 12 kernels as under v9).

Compare the performance:

 In[3]:= ParallelTable[Pause[1]; f[i], {i, 12}] // AbsoluteTiming

Out[3]= {2.009933, {f[1], f[2], f[3], f[4], f[5], f[6], f[7], f[8], f[9], f[10], f[11], f[12]}}

So, under the new v10, I am getting half the parallel performance here and half the kernels that I got under v9. Even more perplexing is that this worked fine in an earlier pre-release version of v10.
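The 1-second and 2-second timings match simple scheduling arithmetic: 12 equal one-second tasks spread over k kernels need about Ceiling[12/k] seconds. Here is a minimal sketch of that arithmetic; expectedSeconds is a hypothetical helper name introduced for illustration, and it ignores all parallel overhead:

```mathematica
(* Hypothetical helper: idealized wall time for nTasks equal-length tasks
   spread evenly over nKernels kernels, ignoring all parallel overhead *)
expectedSeconds[nTasks_Integer, nKernels_Integer, secondsPerTask_: 1] :=
  Ceiling[nTasks/nKernels]*secondsPerTask

expectedSeconds[12, 12]  (* 1, compare 1.010648 measured under v9 *)
expectedSeconds[12, 6]   (* 2, compare 2.009933 measured under v10 *)
```

Because Pause only waits and does no computation, this test shows how many kernels run concurrently rather than how much CPU throughput is available.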

I am very confused. Anyone have any ideas how I can get my missing kernels back? Or why a decision may have been made to hobble the performance of the Mac Pro under v10?

Addendum

Just noticed that if I go to:

Evaluation Menu -> Parallel Kernel Configuration

... the automatic setting for:

Number of kernels to use: is set to: Automatic (which Mma sets to 6)

If I change this to:

Manual setting

and set it to 12 ... then it seems to use 12.

But I am still confused as to why, if Mathematica 10 can actually support 12 kernels on the machine, ... why would Wolfram set it to use only half of them by default, when v9 supported all of them by default?

Reply to Szabolcs: real-world test

Szabolcs suggests below that Mathematica may not practically use more kernels than physical cores, even if your processor supports virtual cores ... so there is no real difference. In reply, here is a quick timing test of a real-world application (kernel density estimation) from the mathStatica benchmarking test suite. The task is to plot 12 kernel density estimates, corresponding to 12 different bandwidths.

bandwidths = {.2, .35, .45, .55, .65, 1, 1.5, 2, 2.2, 2.5, 3, 3.2};

Here are the results running under:

v9 (default: 12 kernels): 3.38 seconds
v10 (default: 6 kernels): 9.53 seconds
v10 (manual override to 12 kernels): 7.46 seconds

I don't know what has changed to cause such a performance hit under v10 ... but even so, that is not the point. The point is that the v10 default kernel setting fails to take advantage of the power of the Mac Pro ... and results in worse performance in a typical parallel-processing application.
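The three timings above can be reduced to ratios, which separates the two effects: the kernel count within v10, and the v9 to v10 slowdown at the same kernel count. This is just arithmetic on the reported numbers; the variable names are introduced here for illustration:

```mathematica
(* Timings in seconds, copied from the results listed above *)
v9Default  = 3.38;  (* v9, default 12 kernels *)
v10Default = 9.53;  (* v10, default 6 kernels *)
v10Manual  = 7.46;  (* v10, manual override to 12 kernels *)

v10Default/v9Default   (* about 2.82: v10 default versus v9 default *)
v10Manual/v9Default    (* about 2.21: v10 versus v9, both with 12 kernels *)
v10Default/v10Manual   (* about 1.28: gain from 12 rather than 6 kernels in v10 *)
```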

More extensive real-world test:

Update: 1 August 2014

I have now had the opportunity to run the full mathStatica (primarily symbolic) benchmark suite under both:

the default v10 parallel setting (6 kernels)
the manual override v10 setting (12 kernels)

Here are the results:

The results fall into 2 categories:

For problems that have more than 6 separate components to them: ... For such problems, using 12 kernels is ALWAYS unambiguously faster, and significantly so.
For problems that have 6 or fewer separate components: ... For instance, Examples 7 and 9 can only be broken down into 2 symbolic components, so the benefits of parallelism max out with 2 kernels. In these cases, the 6 automatic kernels case is sometimes marginally faster than the 12 kernel case (presumably due to running overheads etc) ... but the difference is tiny, and essentially unnoticeable.

In summary: for problems that CAN benefit from more than 6 kernels, the default Mma 10 (automatic) setting of 6 kernels on a Mac Pro appears to be sub-optimal, and fails to take advantage of the full capability of the machine. This problem is new to v10, and does not occur under v9.

+1. This is a very well-researched post and even the example is pretty. That said, I would suggest to entertain the possibility that the cause of the performance hit is not directly due to the number of kernels launched or even the number of recognized cores in v10 but that it could be due to some other reason. It may also be worthwhile to track what MMA is doing in Activity Monitor. — heropup, Jul 11 '14 at 14:53
Something else that is also worth asking is whether we see a similar performance hit between v9 and v10 on Windows. — heropup, Jul 11 '14 at 16:36
@rcollyer mathStatica's NPKDEPlot function uses ParallelTable[Plot[ Funky ] ] to produce each separate curve on a separate kernel, where Funky is a Compiled application of (Map of Total of some Ifs and Buts). — wolfies, Jul 11 '14 at 18:28
Have you compared the speed of Plot[Funky] between v9 and v10 on a single core? That could be a major factor in the difference between the two platforms. — rcollyer, Jul 11 '14 at 18:31
I have the same issue on my MacBook Air: by default Mathematica 10 uses 2 instead of the 4 cores that Mathematica 9 would use. — Valerio, Jul 31 '14 at 14:17
@Valerio This post also provides the solution: if you prefer to use 4 kernels (even though you only have 2 physical cores), then set it up in the preferences. — Szabolcs, Jul 31 '14 at 14:29
@Szabolcs As I wrote below (comment to the proposed answer) even if you set manually the number of kernels to 4 the performance is not as good as it was with Mathematica 9. I think this is very funny and, as suggested, I contacted Wolfram support. — Valerio, Jul 31 '14 at 15:04
@wolfies In its current form, this Q/A is not very useful for people who have the same problem. Would you help me to clean it up, if you agree about the changes I propose? I'd like to: 1. remove my answer 2. make the question concise and focused on why only half the number of kernels are launched 3. write a community wiki answer explaining how to launch the desired number of kernels. 4. you can have a second longer section of your question demonstrating that more kernels help with performance even on CPUs using hyperthreading ... — Szabolcs, Aug 06 '14 at 18:19
... 5. in this question I'd like to focus on how the number of kernels affects performance. There was a separate issue about why v10 is slower with the same number of kernels as v9, but it was not clear that this is also related to parallelization. I think this deserves its own separate QA. Do you agree to these changes? Please keep in mind that this is not a Wolfram site, so complaining here is less likely to trigger a response from Wolfram (writing to support is better for that). But it would be good to have a concise Q/A written up that is actually helpful for other users. — Szabolcs, Aug 06 '14 at 18:20
(not to say that the complaint is not legitimate, just that it is probably not useful for anyone to complain on this site.) — Szabolcs, Aug 06 '14 at 18:23
@Szabolcs I am quite happy with the question as it stands: it is fairly concise and it covers the main issues. If you would like to delete your answer, or modify your answer to improve it, ... that is of course up to you. The issue of v10 performance versus v9 performance is entirely another question -- not this one. Nor is it dealt with here. If you would like to focus on how the number of kernels affects performance, that could be very nicely done within your answer (I look forwards to seeing alternative calculations), or even as a separate self-contained answer. — wolfies, Aug 06 '14 at 18:35
@Szabolcs Mostly, I think your existing answer "this is an intentional and beneficial change in v10" would benefit from some form of substantiation in support of the claim. — wolfies, Aug 06 '14 at 18:37
@wolfies Note that I proposed to remove my answer completely and turn this into a set of posts that's actually helpful for readers other than you. Right now it reads as a complaint. Considering the number of comments I can't just delete the answer without a full revision of both the question and the answer. If you wish to keep the question as is, of course we'll do that but then I won't put in any effort to rewrite the answer either. Also: I am not certain any more that this change was intentional. — Szabolcs, Aug 06 '14 at 18:45
To clarify my comment: I'm short on time so I will put in the effort only if I am convinced that the end result is worth it. I consider making the Q/A helpful for a wider audience worthwhile. — Szabolcs, Aug 06 '14 at 18:52
It's an observation, a workaround, and an empirical check. I would hope it is helpful to others. I would prefer to keep the question as is. But please do not let that discourage you from modifying your answer, because I am sure many would be interested in alternative parallel kernel speed comparisons. — wolfies, Aug 06 '14 at 18:54

score 3 · Answer 1 · answered Sep 16 '19 at 09:35

Let's answer this to get it off the list of questions without answers.

Prelim

By default Mathematica uses a number of parallel kernels equal to the number of physical processor cores present in the machine. You can override this default in the settings under Edit > Preferences... > Parallel > Local Kernels (or Evaluation > Parallel Kernel Configuration ...)

You can also override the default in a single session/notebook by manually launching the desired number of kernels with LaunchKernels[n], where n is the desired number of kernels.
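As a minimal sketch of the per-session override, the following assumes a machine with 12 logical cores (the OP's Mac Pro) and enough subkernel licenses; substitute your own count for 12:

```mathematica
CloseKernels[];      (* shut down any subkernels that are already running *)
LaunchKernels[12];   (* 12 is an assumption taken from the OP's machine *)
Length[Kernels[]]    (* number of subkernels now running *)

(* the OP's test, rerun on the kernels launched above *)
ParallelTable[Pause[1]; f[i], {i, 12}] // AbsoluteTiming
```

If fewer kernels start than requested, the license limit or available memory is a likely cause.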

In principle, there is no real limit to the number of parallel kernels that you can run other than the number of licenses available. (At some point you are bound to run out of memory though.)

Why does Mathematica default to the number of physical processors?

The question asked by the OP amounts to why Mathematica defaults to using the number of physical processors rather than the number of logical cores. A key part of the answer lies in the diminishing returns of using more kernels than there are physical cores. This is because (depending on the hardware implementation) multiple logical cores share some or all resources of a physical core. As a result, straight-up numerical tasks, which can often use a processor very efficiently, benefit little from using more kernels than there are physical cores.

However, as shown by the benchmark in the OP's post, there will generally be some performance benefit, and rarely any performance disadvantage, to using all available logical cores. So why not just always use the number of logical cores if that is the case? Well, there are other considerations besides pure performance.

Licenses: The first (and IMHO probably the most important) consideration is license availability. Each running parallel kernel needs its own subkernel license. When using shared licenses at an institution (one of the more common setups), using double the number of licenses for little to no performance gain may not be the most efficient use of them. Since users may not be aware of this, it is not a bad idea to set the default to a conservative number.
Overhead: Running more parallel kernels adds overhead. Each parallel kernel is a fully functional Mathematica kernel that runs completely independently of the other kernels. In particular, kernels do not share memory, so each one duplicates a lot of memory usage. This can quickly get out of control when using many kernels. Since many users will not be aware of this, it is again smart to keep the default number conservative.

In cases where the extra performance is really desired and the user knows what she is doing, it is easy enough to increase the number of kernels used.
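To gauge the memory cost mentioned under Overhead on a particular machine, one can ask every running subkernel how much memory it currently holds. This is a rough sketch; perKernel is a name introduced here, and the figures will vary with what has been loaded into each kernel:

```mathematica
(* Memory currently used by each running subkernel, in bytes.
   ParallelEvaluate normally launches the default kernels if none are running. *)
perKernel = ParallelEvaluate[MemoryInUse[]];

(* {number of subkernels, total bytes used across all of them} *)
{Length[perKernel], Total[perKernel]}
```

Running this once with the default kernel count and once after launching more kernels gives a feel for how the total grows.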

The answer begins: "The question asked by the OP amounts to why Mathematica defaults to using the number of physical processors rather than the number of logical cores" That characterisation seems plainly incorrect. The question is (a) why the default Mma setting changed from using the number of logical cores, to the number of physical processors, (b) in circumstances where this has a power hit on performance. By the logic expressed in the above answer, perhaps they should also restrict the number of kernels to be a subset of the number of physical processors (say half of them, or quarter)? — wolfies, Sep 18 '19 at 08:42
@wolfies, why? Because it was a bug. v9 on Windows reported the correct value (the available number of physical cores). So in v10 we fixed it. — ihojnicki, Oct 16 '19 at 11:42
@ihojnicki while this explanation makes some sense, it is only in regards to the reported processor #. Why, then, would the system suddenly cease using whatever optimizations it had access to prior to the bug fix? — CA Trevillian, Oct 16 '19 at 13:10

score 3 · Answer 2 · answered Oct 16 '19 at 13:02

I would like to contest the claim that

For problems that have more than 6 separate components to them: ... For such problems, using 12 kernels is ALWAYS unambiguously faster, and significantly so.

I cannot repeat the benchmarks shown in the original post. Therefore I created a small toy benchmark that everyone can try, and that can in principle benefit from more than 4 cores. My CPU has 4 physical and 8 logical cores. This specific benchmark shows only a marginal improvement (2.18 s with 4 subkernels versus 1.95 s with 8) from using more than 4 subkernels on my computer.

While the OP's example shows that there are problems for which there is at least some benefit from using all logical cores, there are also disadvantages to doing so by default:

the parallelization overhead will be higher, so trying to use all logical cores may end up being slower in the end (depending on the specific problem)
memory use will be higher (as for many problems it scales with the number of subkernels)
using up all resources affects the responsiveness of other programs

Overall, it seems to me that using only as many subkernels as the number of physical cores is the best default choice. Advanced users who know that for a specific problem launching more kernels can have benefits can of course manually launch more kernels.

In[1]:= $Version    
Out[1]= "12.0.0 for Mac OS X x86 (64-bit) (April 7, 2019)"

In[2]:= (* custom implementation of Newton's method *)

findRoot[expr_, x_, x0_, steps_] :=
 Block[{x},
  Module[{iter, r},
   iter = x - expr/D[expr, x];
   x = N[x0];
   Do[
    x = iter,
    {steps}
    ];
   x
   ]
  ]

In[3]:= (* compute a Newton's fractal; this will be the basis of our benchmark *)
matrix = Table[
    Quiet@Arg@findRoot[x^3 - 1, x, a + b I, 10]/(2 Pi),
    {a, -2, 2, 0.01}, {b, -2, 2, 0.01}
    ]; // RepeatedTiming

Out[3]= {6.23, Null}

In[4]:= MatrixPlot[matrix]

In[5]:= (* this will let the CPU heat up and the clock rate be throttled
before we start the main benchmark *)
CloseKernels[]
ParallelTable[
 With[{m = RandomReal[1, 3 {1000, 1000}]}, 
  Total@Eigenvalues[m + Transpose[m]]],
 {50}
]

In[7]:= (* run the benchmark *)
CloseKernels[];
timings = Table[
  LaunchKernels[1];
  {
   Length@Kernels[],
   First@RepeatedTiming@ParallelTable[
      Quiet@Arg@findRoot[x^3 - 1, x, a + b I, 10]/(2 Pi),
      {a, -2, 2, 0.01}, {b, -2, 2, 0.01}
      ]
   },
  {8}
  ]

Out[8]= {{1, 7.9}, {2, 3.954}, {3, 2.739}, {4, 2.18}, {5, 2.13}, {6, 2.06}, {7, 2.0}, {8, 1.95}}

ListPlot[
 timings,
 FrameLabel -> {"number of cores", "timing"},
 Frame -> True,
 PlotStyle -> PointSize[0.02]
]
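
One way to read the Out[8] timings is as speedup and parallel efficiency relative to a single subkernel. A minimal sketch using the numbers reported above (the names reported, t1 and summary are introduced here for illustration):

```mathematica
(* {number of subkernels, seconds}, copied from Out[8] above *)
reported = {{1, 7.9}, {2, 3.954}, {3, 2.739}, {4, 2.18},
   {5, 2.13}, {6, 2.06}, {7, 2.0}, {8, 1.95}};

t1 = reported[[1, 2]];  (* single-subkernel time used as the baseline *)

(* each row: {kernels, speedup = t1/t, efficiency = speedup/kernels} *)
summary = {#1, t1/#2, t1/(#1 #2)} & @@@ reported;

TableForm[summary,
 TableHeadings -> {None, {"kernels", "speedup", "efficiency"}}]
```

With these numbers the speedup is about 3.6 at 4 subkernels and about 4.05 at 8, so the efficiency falls from roughly 0.91 to roughly 0.51 once the kernel count exceeds the 4 physical cores.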

I’m excited to try this and compare my 6 core intel laptop to my 16 core AMD desktop!! With physical considerations this is quite useful beyond this example here. I think, though, that the ultimate question OP has is where did the optimization for 12 cores go after v9.01? They state they can get 12 to run, but it is measurably slower than it was in v9.01, by an order of 2! More than double the length of time needed with the same set up from v9.01 to v10. Seemingly this is a result of several changes, but is still glaring and concerning at present. Wonderful answer! +1 — CA Trevillian, Oct 16 '19 at 13:17
@CATrevillian That measurement was for version 10.0, which was a release with many problems. Do we have an example that runs considerably slower in 12.0 than it did in 9.0? I don't have links ready, but I do recall performance problems in 10.0 which were solved soon after. I think the best thing to do is to boil any problem case--if there is still one--down to a minimal example and report it. It would have been useful if @ wolfies did this (perhaps he did but it is not mentioned here and not included in the question). — Szabolcs, Oct 16 '19 at 13:21
Ah yes I agree with this explanation. So we may need @ wolfies to update with v12 runs. Or someone to have access to v9 that would run multiple cores. May be a niche example, but one of a larger set of issues and potential benefit. Many thanks again, this is fantastic I cannot state enough that the physical considerations are an intelligently implemented part of this kind of testing! — CA Trevillian, Oct 16 '19 at 13:26
This is great! One suggestion for the graph: use ListPlot[{timings, Table[{n, timings[[1, 2]]/n}, {n, 1, 8}]}, FrameLabel -> {"number of cores", "timing"}, Frame -> True, PlotStyle -> {PointSize[0.02], PointSize[.01]}, PlotLegends -> {"measured gain", "theoretical gain"}] This shows the gain from the starting point. — Mark R, Jan 11 '20 at 00:33