Here is a comparison of the parallel kernels launched by Mathematica v9 and v10 on the same machine, a current 2014 "R2-D2" Mac Pro ...

[Update: Valerio has commented that the same issue arises on the MacBook Air.]

Under v9.0.1

$ProcessorCount 

12

Issuing:

 LaunchKernels[]

... launches 12 kernels, and actually uses them ... notice that the ParallelTable is 12 times the speed of Table[] for this construct:

In[5]:= Table[Pause[1]; f[i], {i, 12}] // AbsoluteTiming

Out[5]= {12.003106, {f[1], f[2], f[3], f[4], f[5], f[6], f[7], f[8], f[9], f[10], f[11], f[12]}}

In[6]:= ParallelTable[Pause[1]; f[i], {i, 12}] // AbsoluteTiming

Out[6]= {1.010648, {f[1], f[2], f[3], f[4], f[5], f[6], f[7], f[8], f[9], f[10], f[11], f[12]}}

So, to perform the same operation, the parallel result under v9 is 12 times the speed of the single kernel result.

Under v10 -- half my potential processing power has gone

$ProcessorCount

6

... down from 12, even though I am running on the identical machine. Now, I know that my Mac Pro actually has 6 physical cores, each running 2 threads, and under v9 that yielded 12 kernels ... but under v10 it yields only 6 kernels ... ON THE SAME MACHINE. And this has real effects: it effectively halves the maximum potential parallel power of my Mac:

 LaunchKernels[]

... launches 6 kernels (not 12 kernels as under v9).

Compare the performance:

 In[3]:= ParallelTable[Pause[1]; f[i], {i, 12}] // AbsoluteTiming

Out[3]= {2.009933, {f[1], f[2], f[3], f[4], f[5], f[6], f[7], f[8], f[9], f[10], f[11], f[12]}}

So, under the new v10, I am getting half the parallel performance here and half the kernels that I got under v9. Even more perplexing is that this worked fine in an earlier pre-release version of v10.
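The two parallel timings are consistent with a very simple model, offered here only as a rough reading and not as anything Mathematica documents: each of the N = 12 tasks costs about 1 second of Pause, the k kernels work through them in rounds, and scheduling overhead is ignored (an assumption).

```latex
T(k) \approx \left\lceil \frac{N}{k} \right\rceil \times 1\,\text{s},
\qquad
T(12) \approx \left\lceil \tfrac{12}{12} \right\rceil \times 1\,\text{s} = 1\,\text{s},
\qquad
T(6) \approx \left\lceil \tfrac{12}{6} \right\rceil \times 1\,\text{s} = 2\,\text{s}
```

These match the measured 1.010648 and 2.009933 seconds. Because Pause[1] does no computation, this particular test reflects how many kernels were launched rather than how well the hardware threads perform on real work.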

I am very confused. Anyone have any ideas how I can get my missing kernels back? Or why a decision may have been made to hobble the performance of the Mac Pro under v10?

Addendum

Just noticed that if I go to:

  • Evaluation Menu -> Parallel Kernel Configuration

... the automatic setting for:

  • Number of kernels to use: is set to: Automatic (which Mma sets to 6)

If I change this to:

  • Manual setting

and set it to 12 ... then it seems to use 12.
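The same override can probably also be made from code rather than through the menu. The following is a minimal sketch, assuming the documented behaviour that LaunchKernels[n] starts n local kernels; the value 12 is simply the number of hardware threads on this particular machine.

```mathematica
CloseKernels[];       (* shut down any parallel kernels already running *)
LaunchKernels[12];    (* explicitly launch 12 local kernels *)
$KernelCount          (* reports how many parallel kernels are now running *)
```

As far as I can tell, this only affects the current session, whereas the Parallel Kernel Configuration setting is kept between sessions.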

But I am still confused: if Mathematica 10 can actually support 12 kernels on this machine, why would Wolfram set it to use only half of them by default, when v9 used all of them by default?

Reply to Szabolcs: real-world test

Szabolcs suggests below that Mathematica may not practically use more kernels than physical cores, even if your processor supports virtual cores ... so there is no real difference. In reply, here is a quick timing test of a real-world application (kernel density estimation) from the mathStatica benchmarking test suite. The task is to plot 12 kernel density estimates, corresponding to 12 different bandwidths.

bandwidths = {.2, .35, .45, .55, .65, 1, 1.5, 2, 2.2, 2.5, 3, 3.2};

[Figure: plot of the 12 kernel density estimates]

Here are the results running under:

  • v9 (default: 12 kernels): 3.38 seconds
  • v10 (default: 6 kernels): 9.53 seconds
  • v10 (manual override to 12 kernels): 7.46 seconds

I don't know what has changed to cause such a performance hit under v10 ... but even so, that is not the point. The point is that the v10 default kernel setting fails to take advantage of the power of the Mac Pro ... and results in worse performance in a typical parallel-processing application.

More extensive real-world test:

Update: 1 August 2014

I have now had the opportunity to run the full mathStatica (primarily symbolic) benchmark suite under both:

  • the default v10 parallel setting (6 kernels)
  • the manual override v10 setting (12 kernels)

Here are the results:

[Figure: benchmark results for the 6-kernel and 12-kernel settings]

The results fall into 2 categories:

  • For problems that have more than 6 separate components: using 12 kernels is ALWAYS unambiguously faster, and significantly so.

  • For problems that have 6 or fewer separate components: for instance, Examples 7 and 9 can only be broken down into 2 symbolic components, so the benefits of parallelism max out with 2 kernels. In these cases, the automatic 6-kernel setting is sometimes marginally faster than the 12-kernel setting (presumably due to running overheads etc.) ... but the difference is tiny, and essentially unnoticeable.

In summary: for problems that CAN benefit from more than 6 kernels, the default Mma 10 (automatic) setting of 6 kernels on a Mac Pro appears to be sub-optimal, and fails to take advantage of the full capability of the machine. This problem is new to v10, and does not occur under v9.
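For anyone wanting to repeat this kind of comparison on their own machine, here is a rough sketch. The helper name timeWithKernels and the prime-counting workload are hypothetical stand-ins chosen for illustration; they are not part of the mathStatica suite, and the first parallel call after launching kernels may include some start-up overhead.

```mathematica
(* Hypothetical helper: time one fixed 12-part workload using n local kernels *)
timeWithKernels[n_Integer] := Module[{t},
  CloseKernels[];                  (* start from a clean slate *)
  LaunchKernels[n];                (* launch exactly n local kernels *)
  t = First @ AbsoluteTiming[
     ParallelTable[
      (* stand-in CPU-bound task: count primes in a shifted range *)
      Length @ Select[Range[300000], PrimeQ[# + i] &],
      {i, 12}]];
  CloseKernels[];
  t]

(* Compare the v10 default (6) with the manual override (12) *)
{#, timeWithKernels[#]} & /@ {6, 12}
```

The workload has 12 independent parts on purpose, so that it falls into the first category above (more components than the default number of kernels).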

wolfies
  • +1. This is a very well-researched post and even the example is pretty. That said, I would suggest to entertain the possibility that the cause of the performance hit is not directly due to the number of kernels launched or even the number of recognized cores in v10 but that it could be due to some other reason. It may also be worthwhile to track what MMA is doing in Activity Monitor. – heropup Jul 11 '14 at 14:53
  • Something else that is also worth asking is whether we see a similar performance hit between v9 and v10 on Windows. – heropup Jul 11 '14 at 16:36
  • What functions are run in the benchmark? – rcollyer Jul 11 '14 at 18:05
  • @rcollyer mathStatica's NPKDEPlot function uses ParallelTable[Plot[ Funky ] ] to produce each separate curve on a separate kernel, where Funky is a Compiled application of (Map of Total of some Ifs and Buts). – wolfies Jul 11 '14 at 18:28
  • Have you compared the speed of Plot[Funky] between v9 and v10 on a single core? That could be a major factor in the difference between the two platforms. – rcollyer Jul 11 '14 at 18:31
  • v9 SINGLE-core: NPKDEPlot[data, bandwidths, {15, 25}] // AbsoluteTiming -----------> 23 seconds
    v10 SINGLE-core: NPKDEPlot[data, bandwidths, {15, 25}] // AbsoluteTiming -----------> 42 seconds
    As a guess, the v9 vs v10 timing difference might be due to increased plot points used under v10 (just a guess??) ... but the v9 vs v10 timing is not the issue here: it is the v10 timings on the same computer that are the subject of interest. – wolfies Jul 11 '14 at 18:38
  • I have the same issue on my MacBook Air: by default Mathematica 10 uses 2 instead of the 4 cores that Mathematica 9 would use. – Valerio Jul 31 '14 at 14:17
  • @Valerio This post also provides the solution: if you prefer to use 4 kernels (even though you only have 2 physical cores), then set it up in the preferences. – Szabolcs Jul 31 '14 at 14:29
  • @Szabolcs As I wrote below (comment to the proposed answer) even if you set manually the number of kernels to 4 the performance is not as good as it was with Mathematica 9. I think this is very funny and, as suggested, I contacted Wolfram support. – Valerio Jul 31 '14 at 15:04
  • @wolfies In its current form, this Q/A is not very useful for people who have the same problem. Would you help me to clean it up, if you agree about the changes I propose? I'd like to: 1. remove my answer 2. make the question concise and focused on why only half the number of kernels are launched 3. write a community wiki answer explaining how to launch the desired number of kernels. 4. you can have a second longer section of your question demonstrating that more kernels help with performance even on CPUs using hyperthreading ... – Szabolcs Aug 06 '14 at 18:19
  • ... 5. in this question I'd like to focus on how the number of kernels affects performance. There was a separate issue about why v10 is slower with the same number of kernels as v9, but it was not clear that this is also related to parallelization. I think this deserves its own separate QA. Do you agree to these changes? Please keep in mind that this is not a Wolfram site, so complaining here is less likely to trigger a response from Wolfram (writing to support is better for that). But it would be good to have a concise Q/A written up that is actually helpful for other users. – Szabolcs Aug 06 '14 at 18:20
  • (not to say that the complaint is not legitimate, just that it is probably not useful for anyone to complain on this site.) – Szabolcs Aug 06 '14 at 18:23
  • @Szabolcs I am quite happy with the question as it stands: it is fairly concise and it covers the main issues. If you would like to delete your answer, or modify your answer to improve it, ... that is of course up to you. The issue of v10 performance versus v9 performance is entirely another question -- not this one. Nor is it dealt with here. If you would like to focus on how the number of kernels affects performance, that could be very nicely done within your answer (I look forwards to seeing alternative calculations), or even as a separate self-contained answer. – wolfies Aug 06 '14 at 18:35
  • @Szabolcs Mostly, I think your existing answer "this is an intentional and beneficial change in v10" would benefit from some form of substantiation in support of the claim. – wolfies Aug 06 '14 at 18:37
  • @wolfies Note that I proposed to remove my answer completely and turn this into a set of posts that's actually helpful for readers other than you. Right now it reads as a complaint. Considering the number of comments I can't just delete the answer without a full revision of both the question and the answer. If you wish to keep the question as is, of course we'll do that but then I won't put in any effort to rewrite the answer either. Also: I am not certain any more that this change was intentional. – Szabolcs Aug 06 '14 at 18:45
  • To clarify my comment: I'm short on time so I will put in the effort only if I am convinced that the end result is worth it. I consider making the Q/A helpful for a wider audience worthwhile. – Szabolcs Aug 06 '14 at 18:52
  • It's an observation, a workaround, and an empirical check. I would hope it is helpful to others. I would prefer to keep the question as is. But please do not let that discourage you from modifying your answer, because I am sure many would be interested in alternative parallel kernel speed comparisons. – wolfies Aug 06 '14 at 18:54